
Topic-02: Understanding Data

WHAT IS DATA?
• All facts are data. In computer systems, bits encode facts present in numbers, text, images,
audio, and video.
• Data can be directly human-interpretable (such as numbers or text) or diffuse, such as images or video, which can be interpreted only by a computer.
• Data is available from different data sources like flat files, databases, or data warehouses. It can be either operational or non-operational data.
• Operational data is data encountered in normal business procedures and processes. For example, daily sales data is operational data; non-operational data, on the other hand, is data that is used for decision making.
• Data by itself is meaningless. It has to be processed to generate information. A string of bytes is meaningless; only when a label is attached, such as 'height of students of a class', does the data become meaningful.
• Processed data is called information, and it includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information such as which product sold the most in the last quarter of the year.
Elements of Big Data
Data whose volume is small enough to be stored and processed by a small-scale computer is called 'small data'. Such data are collected from several sources, then integrated and processed by a small-scale computer.
Big data, on the other hand, has a volume much larger than that of 'small data' and is characterized as follows:
1. Volume – Since the cost of storage devices has fallen, there has been tremendous growth of data. Small traditional data is measured in terms of gigabytes (GB) and terabytes (TB), but Big Data is measured in terms of petabytes (PB) and exabytes (EB). One exabyte is 1 million terabytes.
2. Velocity – Velocity refers to the speed at which data arrives and the rate at which its volume grows. The availability of IoT devices and Internet connectivity ensures that data arrives at an ever faster rate. Velocity helps to understand the relative growth of big data and its accessibility by users, systems and applications.
3. Variety – The variety of Big Data includes:
• Form – There are many forms of data. Data types range from text, graph, audio, and video to maps. There can be composite data too, where one medium embeds other sources of data; for example, a video can contain an audio track.
• Function – These are data from various sources like human conversations, transaction
records, and old archive data.
• Source of data – This is the third aspect of variety. There are many sources of data.
Broadly, the data source can be classified as open/public data, social media data and
multimodal data.
Some of the other Vs that are often quoted in the literature as characteristics of big data are:
4. Veracity – Veracity deals with aspects like conformity to facts, truthfulness, believability, and confidence in data. There may be many sources of error, such as technical errors, typographical errors, and human errors. So, veracity is one of the most important aspects of data.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals required by the given problem.
6. Value – Value is the characteristic of big data that indicates the worth of the information extracted from the data and its influence on the decisions taken based on it.
Thus, these 6 Vs help to characterize big data. The data quality of numeric attributes is determined by factors like precision, bias, and accuracy:
• Precision is defined as the closeness of repeated measurements to one another. Often, the standard deviation is used to measure precision.
• Bias is a systematic error that results from erroneous assumptions of the algorithms or procedures.
• Accuracy is the closeness of measurements to the true value of the quantity. Normally, the number of significant digits used to store and manipulate a value indicates the accuracy of the measurement.
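As a rough illustration (not from the text), the following Python sketch computes these three measures for a hypothetical set of repeated measurements of a quantity whose true value is assumed to be known:

```python
import numpy as np

# Hypothetical repeated measurements of a quantity whose true value is 10.0
true_value = 10.0
measurements = np.array([10.2, 10.1, 10.3, 10.2, 10.1])

precision = measurements.std()                           # spread of repeated measurements
bias = measurements.mean() - true_value                  # systematic deviation from the true value
accuracy_error = abs(measurements.mean() - true_value)   # closeness to the true value

print(f"precision (std dev):  {precision:.3f}")
print(f"bias:                 {bias:.3f}")
print(f"accuracy (abs error): {accuracy_error:.3f}")
```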

Types of Data

In Big Data, there are three kinds of data: structured data, unstructured data, and semi-structured data.
Structured Data
In structured data, data is stored in an organized manner such as a database where it is available
in the form of a table. The data can also be retrieved in an organized manner using tools like
SQL. The structured data frequently encountered in machine learning are listed below:
Record Data
A dataset is a collection of measurements taken from a process. A dataset contains a collection of objects, and each object has a set of measurements. The measurements can be arranged in the form of a matrix. Each row of the matrix represents an object and can be called an entity, case, or record. The columns of the dataset are called attributes, features, or fields. The table is filled with observed data. It is also useful to note the general jargon associated with datasets: 'label' is the term used to describe an individual observation.
Data Matrix
It is a variation of the record type because it consists of numeric attributes only. Standard matrix operations can be applied to this data. The data can be thought of as points or vectors in a multidimensional space, where every attribute is a dimension describing the object.
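A minimal sketch of this idea in Python, using a made-up data matrix with three objects and three hypothetical numeric attributes:

```python
import numpy as np

# Hypothetical data matrix: each row is an object (e.g. a student) and each
# column a numeric attribute (height in cm, weight in kg, age in years)
X = np.array([
    [170.0, 65.0, 21.0],
    [160.0, 55.0, 22.0],
    [175.0, 72.0, 20.0],
])

print(X.mean(axis=0))                 # a standard matrix operation: column means

# Each row is a point in 3-dimensional space; distance between the first two objects
print(np.linalg.norm(X[0] - X[1]))
```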
Graph Data
It involves the relationships among objects. For example, a web page can refer to another web page; this can be modeled as a graph, where the nodes are web pages and the hyperlinks are the edges that connect the nodes.
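A small illustrative sketch (the page names are hypothetical) that represents such a web graph as an adjacency structure in Python:

```python
# Hypothetical web graph: each page (node) maps to the pages it links to (edges)
web_graph = {
    "index.html": ["about.html", "products.html"],
    "about.html": ["index.html"],
    "products.html": ["index.html", "about.html"],
}

# Count incoming hyperlinks (in-degree) for every page
in_degree = {page: 0 for page in web_graph}
for links in web_graph.values():
    for target in links:
        in_degree[target] = in_degree.get(target, 0) + 1

print(in_degree)
```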
Ordered Data
Ordered data objects involve attributes that have an implicit order among them.
Examples of ordered data are:
• Temporal data – This is data whose attributes are associated with time. For example, customer purchasing patterns during festival time are sequential data. Time series data is a special type of sequence data where the data is a series of measurements over time.
• Sequence data – It is like sequential data but does not have time stamps. This data
involves the sequence of words or letters. For example, DNA data is a sequence of four
characters – A T G C.
• Spatial data – It has attributes such as positions or areas. For example, maps are spatial
data where the points are related by location.
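The following short Python sketch illustrates temporal and sequence data with made-up values (pandas is assumed to be available):

```python
import pandas as pd

# Temporal data: a hypothetical series of daily sales measurements over time
sales = pd.Series(
    [120, 135, 150, 210, 480],
    index=pd.date_range("2024-10-28", periods=5, freq="D"),
)
print(sales.resample("W").sum())   # the time stamps allow time-based aggregation

# Sequence data: order matters but there are no time stamps, e.g. a DNA string
dna = "ATGCGTACGTTAGC"
print(dna.count("G"))              # number of occurrences of the character 'G'
```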

Unstructured Data

Unstructured data includes video, image, and audio data. It also includes textual documents, programs, and blog data. It is estimated that about 80% of all data is unstructured.

Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.
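A brief illustration of semi-structured data in Python, using a hypothetical JSON record whose fields and nesting may vary from record to record:

```python
import json

# Hypothetical semi-structured record: the keys give partial structure,
# but fields and nesting are not fixed by a rigid schema
record = '{"id": 101, "name": "Alice", "purchases": [{"item": "pen", "qty": 2}]}'

data = json.loads(record)
print(data["name"])
for purchase in data.get("purchases", []):   # the field may be absent in other records
    print(purchase["item"], purchase["qty"])
```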

Data Storage and Representation


Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis.
The goal of data storage management is to make data available for analysis. There are different approaches to organizing and managing data in storage files and systems, from flat files to data warehouses. Some of them are listed below:
Flat Files These are the simplest and most commonly available data sources. They are also the cheapest way of organizing data. Flat files are files in which data is stored in plain ASCII or EBCDIC format. Minor changes to the data in flat files affect the results of data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable when the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated values files, where the values are separated by commas. These are used by spreadsheet and database applications. The first row may contain the attribute names and the remaining rows contain the data.
• TSV files – TSV stands for tab-separated values files, where the values are separated by tabs.
Both CSV and TSV files are generic in nature and can be shared. There are many tools, like Google Sheets and Microsoft Excel, to process these files.
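A minimal sketch of reading such files in Python with pandas; the file names sales.csv and sales.tsv are hypothetical and are assumed to exist with attribute names in the first row:

```python
import pandas as pd

# Hypothetical files; the first row is taken as the attribute names
df_csv = pd.read_csv("sales.csv")             # comma-separated values
df_tsv = pd.read_csv("sales.tsv", sep="\t")   # tab-separated values

print(df_csv.head())                # first few data rows
print(df_tsv.columns.tolist())      # attribute (column) names
```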

BIG DATA ANALYTICS AND TYPES OF ANALYTICS


• The primary aim of data analysis is to assist business organizations in taking decisions. For example, a business organization may want to know which is its fastest-selling product in order to plan its marketing activities.
• Data analysis is an activity that takes the data and generates useful information and
insights for assisting the organizations.
• Data analysis and data analytics are terms that are used interchangeably to refer to the
same concept.
• However, there is a subtle difference. Data analytics is a general term and data analysis is
a part of it. Data analytics refers to the process of data collection, preprocessing and
analysis. It deals with the complete cycle of data management. Data analysis is just the analysis step and is a part of data analytics; it takes historical data and analyses it.
Data analytics, in contrast, concentrates more on the future and helps in prediction.
There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

BIG DATA ANALYSIS FRAMEWORK


Many frameworks have been proposed for performing data analytics, and all of them share some common factors. A big data framework is a layered architecture; such an architecture has many advantages, such as genericity.
A 4-layer architecture has the following layers:
1. Data connection layer
2. Data management layer
3. Data analytics layer
4. Presentation layer
• Data Connection Layer It has data ingestion mechanisms and data connectors. Data ingestion means taking raw data and importing it into appropriate data structures. This layer performs the tasks of the ETL process, that is, extract, transform and load operations.
• Data Management Layer It performs preprocessing of data. The purpose of this layer is to allow parallel execution of queries, and read, write and data management tasks. Many schemes can be implemented by this layer, such as data-in-place, where the data is not moved at all, or constructing data repositories such as data warehouses with pull-data-on-demand mechanisms.
• Data Analytic Layer It has many functionalities, such as statistical tests, machine learning algorithms for understanding data, and the construction of machine learning models. This layer also implements many model validation mechanisms.
• The processing is done as shown in Box 2.1.
• Presentation Layer It has mechanisms such as dashboards and applications that display the results of analytical engines and machine learning algorithms.
• Thus, the Big Data processing cycle involves data management that consists of the
following steps.
1. Data collection
2. Data pre-processing
3. Application of machine learning algorithms
4. Interpretation of results and visualization of machine learning algorithms
• This is an iterative process and is carried out on a permanent basis to ensure that data is
suitable for data mining.
• Application and interpretation of machine learning algorithms constitute the basis for the
rest of the book. So, primarily, data collection and data preprocessing are covered as part
of this chapter.
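A minimal Python sketch of this cycle under simplifying assumptions; all function names and data here are hypothetical placeholders, and a simple aggregation stands in for the machine learning step:

```python
from typing import Dict, List

# Sketch of the processing cycle above: collection -> preprocessing ->
# machine learning -> interpretation of results.

def collect() -> List[dict]:
    """Gather raw records from some data source (flat file, database, API, ...)."""
    return [{"product": "pen", "qty": 2},
            {"product": "pen", "qty": None},   # a 'dirty' record with a missing value
            {"product": "book", "qty": 1}]

def preprocess(records: List[dict]) -> List[dict]:
    """Clean the raw data, e.g. drop records with missing values."""
    return [r for r in records if r["qty"] is not None]

def analyse(records: List[dict]) -> Dict[str, int]:
    """Stand-in for the machine learning step: aggregate quantity sold per product."""
    totals: Dict[str, int] = {}
    for r in records:
        totals[r["product"]] = totals.get(r["product"], 0) + r["qty"]
    return totals

def interpret(result: Dict[str, int]) -> None:
    """Present the result so it can support decision making."""
    print("Quantity sold per product:", result)

# One pass of the (iterative) cycle
interpret(analyse(preprocess(collect())))
```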

Data Collection
The first task in assembling a dataset is the collection of data. It is often estimated that most of the time is spent on collecting good-quality data, and good-quality data yields better results. It is often difficult to characterize 'good data' precisely, but good data has the following properties:
1. Timeliness – The data should be current, not stale or obsolete.
2. Relevancy – The data should be relevant and ready for the machine learning or data mining algorithms. All the necessary information should be available and there should be no bias in the data.
3. Knowledge about the data – The data should be understandable and interpretable, and should be self-sufficient for the required application as desired by the domain knowledge engineer.
Broadly, the data source can be classified as open/public data, social media data and
multimodal data.
1. Open or public data source – This is a data source that does not have any stringent copyright rules or restrictions, and its data can be used freely for many purposes. Government census data are a good example of open data. Other examples include:
• Digital libraries that have huge amounts of text data as well as document images
• Scientific domains with huge collections of experimental data, like genomic data and biological data
• Healthcare systems that use extensive databases like patient databases, health insurance data, doctors' information, and bioinformatics information
2. Social media – It is the data that is generated by various social media platforms like Twitter,
Facebook, YouTube, and Instagram. An enormous amount of data is generated by these
platforms.
Data Preprocessing
In the real world, the available data is 'dirty'. Here, 'dirty' means:
• Incomplete data
• Inaccurate data
• Outlier data
• Data with missing values
• Data with inconsistent values
• Duplicate data
• Data preprocessing improves the quality of the data used by data mining techniques. The raw data must be preprocessed to give accurate results. The process of detection and removal of errors in data is called data cleaning.
• Data wrangling means making the data processable for machine learning algorithms. Some of the data errors include human errors, such as typographical errors or incorrect measurements, and structural errors, like improper data formats. Data errors can also arise from omission and duplication of attributes.
• Noise is a random component and involves distortion of a value or the introduction of spurious objects. Often, the term 'noise' is used when the data has a spatial or temporal component. Certain deterministic distortions, such as a streak, are known as artifacts.
• Consider, for example, the patient data in Table 2.1, in which 'bad' or 'dirty' data can be observed.
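As an illustration of basic cleaning in Python with pandas (the records below are made up and are not the book's Table 2.1):

```python
import numpy as np
import pandas as pd

# Hypothetical 'dirty' patient records: a missing value and a duplicate record
patients = pd.DataFrame({
    "name":  ["John", "Andre", "Raju", "John"],
    "age":   [21, np.nan, 5, 21],
    "fever": ["Yes", "No", "Yes", "Yes"],
})

patients = patients.drop_duplicates()                                # remove duplicate records
patients["age"] = patients["age"].fillna(patients["age"].median())   # fill missing ages

print(patients)
```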
