Chapter 2
2. Introduction to Data Science
What is data?
Data: described as unprocessed or raw facts and figures; 'groups of non-random symbols' in the form of text, images, and voice representing quantities, actions, and objects.
Information: described as processed data; processed data in the form of text, images, and voice representing quantities, actions, and objects.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
Data processing consists of the following basic steps - input, processing, and output. These
three steps constitute the data processing cycle.
Input step − the input data is prepared in some convenient form for processing.
The form depends on the processing machine.
For example, when electronic computers are used, input media include magnetic disks, tapes, and so on.
Processing step − the input data is changed to produce data in a more useful form.
For example, paychecks can be calculated from the time cards, or a summary of sales for the month can be calculated from the sales orders.
Output step − the result of the preceding processing step is collected.
The particular form of the output data depends on the use of the data.
For example, output data may be paychecks for employees.
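To make the cycle concrete, here is a minimal Python sketch of the payroll example above. The time-card records and the hourly rate are invented purely for illustration, not taken from any real system.

# Input step: raw facts, here hours worked per employee (invented data).
time_cards = [
    {"employee": "A. Alemu", "hours": 8},
    {"employee": "A. Alemu", "hours": 7},
    {"employee": "B. Kebede", "hours": 9},
]
hourly_rate = 12.50  # assumed flat rate, for illustration only

# Processing step: restructure the raw data into a more useful form.
totals = {}
for card in time_cards:
    totals[card["employee"]] = totals.get(card["employee"], 0) + card["hours"]

# Output step: collect the result of processing, here a paycheck per employee.
for employee, hours in totals.items():
    print(f"{employee}: {hours} hours -> pay {hours * hourly_rate:.2f}")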
2.1.2 Data types and their representation – based on programming language
A data type (or simply a type) is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Almost all programming languages explicitly include the notion of data type.
Common data types include:
Integers
Booleans
Characters
Floating-point numbers
Alphanumeric strings
A data type constrains the values that an expression, such as a variable or a function,
might take.
The data type defines the operations that can be performed on the data, the meaning of the data, and the way values of that type can be stored.
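As a quick illustration, the following Python snippet shows values of the common types listed above and how the type constrains which operations are valid; the variable names are invented for the example.

age = 27                  # integer
is_enrolled = True        # Boolean
grade = "A"               # character (in Python, a string of length one)
gpa = 3.75                # floating-point number
student_id = "DS-2021"    # alphanumeric string

print(age + 1)              # arithmetic is defined for integers
print(student_id.upper())   # string operations are defined for strings
# print(student_id + 1)     # not allowed: the operation is undefined for these types (TypeError)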
On the other hand, for the analysis of data, there are three common data types or structures: structured data, unstructured data, and semi-structured data.
Data types/structures – based on the analysis of data:
Structured data, unstructured data, semi-structured data, and metadata
Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
Structured data conforms to a tabular format with relationships between the different rows and columns.
Common examples are Excel files or SQL databases.
Each of these has structured rows and columns that can be sorted.
Structured data depends on the existence of a data model – a model of how data can be stored, processed
and accessed.
Structured data is considered the most ‘traditional’ form of data storage, since the earliest versions of
database management systems (DBMS) were able to store, process and access structured data.
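As a small sketch of structured data, the following Python snippet uses the built-in sqlite3 module; the table and column names are invented for this illustration.

import sqlite3

# The data model (schema) is defined up front: fixed columns with fixed types.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
conn.execute("INSERT INTO sales (product, amount) VALUES (?, ?)", ("laptop", 950.0))
conn.execute("INSERT INTO sales (product, amount) VALUES (?, ?)", ("monitor", 180.0))

# Because every row follows the same model, rows can be sorted and queried directly.
for row in conn.execute("SELECT product, amount FROM sales ORDER BY amount DESC"):
    print(row)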
Unstructured Data
Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner; it lacks proper formatting and alignment.
Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as
well.
This results in irregularities and ambiguities that make it difficult to understand using traditional programs
as compared to data stored in structured databases.
Common examples include audio files, video files, and NoSQL databases.
The ability to store and process unstructured data has greatly grown in recent years, with many new
technologies and tools coming to the market that are able to store specialized types of unstructured data.
For example:
MongoDB is optimized to store documents.
Apache Giraph is optimized for storing relationships between nodes.
The ability to analyze unstructured data is especially relevant in the context of Big Data, since a large part of
data in organizations is unstructured. Think about pictures, videos or PDF documents.
The ability to extract value from unstructured data is one of the main drivers behind the rapid growth of Big Data.
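As a sketch of how a document store handles data without a fixed schema, the following snippet uses the pymongo driver; it assumes a MongoDB server running locally, and the database and collection names ("demo", "media_notes") are invented for the example.

from pymongo import MongoClient

# Assumes a local MongoDB server; the names below are illustrative only.
client = MongoClient("mongodb://localhost:27017")
collection = client["demo"]["media_notes"]

# The two documents do not share a predefined structure.
collection.insert_one({"type": "photo", "caption": "Team meeting", "tags": ["office"]})
collection.insert_one({"type": "audio", "duration_sec": 95, "transcript": "..."})

print(collection.count_documents({}))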
Semi-structured Data
Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
Therefore, it is also known as a self-describing structure.
For example, JSON and XML are forms of semi-structured data.
The reason that this third category exists (between structured and
unstructured data) is because semi-structured data is considerably easier to
analyze than unstructured data.
Many Big Data solutions and tools have the ability to 'read' and process either JSON or XML. This reduces the complexity of analyzing semi-structured data, compared to unstructured data.
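A small sketch of why JSON is called self-describing: the field names (tags) travel with the data and preserve the hierarchy, so a standard parser such as Python's json module can read them. The record content is invented.

import json

# Invented record: the field names act as tags describing the values.
record = """
{
  "order_id": 1042,
  "customer": {"name": "Sara", "city": "Addis Ababa"},
  "items": [
    {"sku": "A-10", "qty": 2},
    {"sku": "B-77", "qty": 1}
  ]
}
"""

parsed = json.loads(record)
print(parsed["customer"]["city"])   # the hierarchy of fields is preserved
print(len(parsed["items"]))         # nested lists of records are preserved too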
Metadata – Data about Data
The last category of data is metadata. From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and Big Data solutions.
Metadata is data about data.
It provides additional information about a specific set of data.
In a set of photographs, for example, metadata could describe when
and where the photos were taken. The metadata then provides fields
for dates and locations which, by themselves, can be considered
structured data.
Because of this reason, metadata is frequently used by Big Data
solutions for initial analysis.
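A minimal sketch of the photo example: the image content itself is unstructured, but the metadata fields (date, location) are structured and can be filtered directly. The file names, dates, and locations are invented.

from datetime import date

# Invented metadata records for a set of photos.
photos = [
    {"file": "img_001.jpg", "taken_on": date(2021, 5, 3), "location": "Bahir Dar"},
    {"file": "img_002.jpg", "taken_on": date(2021, 6, 12), "location": "Addis Ababa"},
]

# Because the metadata is structured, it can be queried like a small table.
recent = [p["file"] for p in photos if p["taken_on"] >= date(2021, 6, 1)]
print(recent)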
2.2 Data Value Chain
The data value chain describes the series of steps needed to generate value and useful insights from data.
Data Analysis
It is concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage.
Data analysis involves exploring, transforming, and modelling data with the goal of
highlighting relevant data, synthesizing and extracting useful hidden information
with high potential from a business point of view.
Related areas include data mining, business intelligence, and machine learning
(covered in Chapter 4).
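A minimal sketch of exploring and transforming data, using the pandas library (a common but by no means the only choice); the sales records are invented for the example.

import pandas as pd

# Invented monthly sales records.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [1200, 1500, 900, 1100],
})

# Transform the raw records into a summary that supports decision-making.
summary = sales.groupby("region")["revenue"].sum().sort_values(ascending=False)
print(summary)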
Data Curation
It is the active management of data over its life cycle to ensure it meets
the necessary data quality requirements for its effective usage.
Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
A key trend for the curation of big data is the use of community and crowdsourcing approaches.
Data Storage
It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
Relational Database Management Systems (RDBMS) have been the
main, and almost unique, solution to the storage paradigm for nearly
40 years.
However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions come at the cost of flexibility with regard to schema changes, and of performance and fault tolerance when data volumes and complexity grow, making RDBMSs unsuitable for many big data scenarios.
NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
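To illustrate the schema-flexibility point, here is a small sketch contrasting a relational table (using Python's sqlite3) with the document model, here represented by plain Python dictionaries rather than any particular NoSQL product; the table, column, and field names are invented.

import sqlite3

# Relational side: the schema is fixed up front, and adding a new attribute
# later requires an explicit schema change.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")  # schema change needed

# Document side (sketched with plain dicts): each record carries its own
# fields, so new attributes can appear without any schema change.
documents = [
    {"name": "Abebe"},
    {"name": "Sara", "email": "sara@example.com", "preferences": {"lang": "am"}},
]
print(len(documents))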
Data Usage
It covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the
business activity.
Data usage in business decision-making can enhance competitiveness
through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
2.3 Basic concepts of big data
Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, and process large datasets and to gain insights from them.
While the problem of working with data that exceeds the computing power or storage of a
single computer is not new, the pervasiveness, scale, and value of this type of computing
has greatly expanded in recent years.
We will also take a high-level look at some of the processes and technologies currently
being used in this space.
What Is Big Data?
An exact definition of “big data” is difficult to nail down because projects, vendors,
practitioners, and business professionals use it quite differently. With that in mind,
generally speaking, big data is:
1. large datasets
2. the category of computing strategies and technologies that are used to handle large datasets
In this context, “large dataset” means a dataset too large to reasonably process or store
with traditional tooling or on a single computer.
This means that the common scale of big datasets is constantly shifting and may vary
significantly from organization to organization.
Why Are Big Data Systems Different?
The basic requirements for working with big data are the same as the
requirements for working with datasets of any size.
However, the massive scale, the speed of ingesting and processing, and
the characteristics of the data that must be dealt with at each stage of
the process present significant new challenges when designing
solutions.
The goal of most big data systems is to surface insights and
connections from large volumes of heterogeneous data that would not
be possible using conventional methods.
In 2001, Gartner’s Doug Laney first presented what became known as
the “three Vs of big data” to describe some of the characteristics that
make big data different from other data processing:
Characteristics of Big Data – 3V’s
Volume
Large amounts of data (zettabytes / massive datasets).
These datasets can be orders of magnitude larger than traditional datasets, which demands more
thought at each stage of the processing and storage life cycle.
Cluster management and algorithms capable of breaking tasks into smaller pieces become
increasingly important.
Velocity
Another way in which big data differs significantly from other data systems is the speed at which information moves through the system.
Data is frequently flowing into the system from multiple sources and is often expected to be
processed in real time to gain insights and update the current understanding of the system.
Data is constantly being added, massaged, processed, and analyzed in order to keep up with the
influx of new information and to surface valuable information early when it is most relevant.
These ideas require robust systems with highly available components to guard against failures along
the data pipeline.
Variety
Data comes in many different forms from diverse sources.
The formats and types of media can vary significantly as well. Rich
media like images, video files, and audio recordings are ingested
alongside text files, structured logs, etc.
Clustered Computing and Hadoop Ecosystem
Because of the quantities of big data, individual computers are often inadequate for handling the data at most stages.
Therefore, to address the high storage and computational needs of big data, computer clusters are a better fit.
Big data clustering software combines the resources of many smaller machines, to provide a number of benefits:
Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling is also
extremely important.
High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or
software failures from affecting access to data and processing. This becomes increasingly important as we continue to
emphasize the importance of real-time analytics.
Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group. This means the
system can react to changes in resource requirements without expanding the physical resources on a machine.
Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes. Solutions for cluster membership and resource allocation include software like Hadoop's YARN (which stands for Yet Another Resource Negotiator) or Apache Mesos.
The assembled computing cluster often acts as a foundation that other software interfaces with to process the data. The machines involved in the computing cluster are also typically involved with the management of a distributed storage system (discussed under data persistence).
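As a single-machine analogy for breaking a task into smaller pieces and pooling compute, the following Python sketch splits invented text chunks across worker processes with multiprocessing; on a real cluster, a resource manager such as YARN plays the coordinating role across many machines.

from multiprocessing import Pool

def count_words(chunk):
    # Each worker processes one piece of the overall task.
    return len(chunk.split())

if __name__ == "__main__":
    # Invented chunks standing in for blocks of a much larger file.
    chunks = ["big data needs many machines", "clusters pool cpu memory and disk"]
    with Pool(processes=2) as pool:
        counts = pool.map(count_words, chunks)
    print(sum(counts))  # partial results from the workers are combined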
Hadoop and its Ecosystem
Hadoop is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
It is inspired by a technical document published by Google.
The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical as ordinary computers can be used for data
processing.
• Reliable: It is reliable as it stores copies of the data on different machines and is resistant to
hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
• Flexible: It is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.
Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
It comprises the following components, among many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming-model-based data processing (see the sketch after this list)
Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management
Oozie: job scheduling
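To illustrate the MapReduce entry above, here is a minimal single-process Python sketch of the programming model: a map step emits (word, 1) pairs and a reduce step sums the counts per word. A real Hadoop job distributes these same two steps across the cluster; the input lines here are invented.

from collections import defaultdict

lines = ["big data big clusters", "data processing at scale"]  # invented input

# Map step: emit a (key, value) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and reduce step: group the pairs by key and sum the values per word.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # e.g. {'big': 2, 'data': 2, 'clusters': 1, ...}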
End of Chapter 2