EmgTech Chapter 02
CHAPTER TWO
An Overview of Data Science
Data science is also called data-driven science.
It is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and
insights from structured, semi-structured and unstructured data.
o It is a blend of various tools, algorithms, and machine learning
principles.
It offers a range of roles and requires a range of skills.
It is primarily used to make decisions and predictions.
It is a process of using raw data to explore insights and deliver a
data product.
What are data and information?
• Data can be defined as a representation of facts,
concepts, or instructions in a formalized manner
• Data is unprocessed facts and figures
• Data is a symbol or any raw material (it can
be text, a number, an image, or a diagram).
o Can be represented with:
alphabets (A-Z, a-z)
digits (0-9) or
special characters (+, -, /, *, <,>, =, etc.).
Data Processing Cycle
Data processing is the re-structuring or
re-ordering of data by people or machines
o In order to increase its usefulness and add value
for a particular purpose.
Raw data is fed into computer systems to generate the final
output, which is information.
Information can be presented in the form of diagrams,
charts, graphs, etc.
Cont’d…
• Data processing consists of the following basic steps:
• Input,
• Processing, and
• Output.
• These three steps constitute the data processing cycle.
Figure: The data processing cycle
Input:
• In this step, the input data is prepared in some convenient
form for processing.
• The form will depend on the processing machine.
• For example, when electronic computers are used, the input data
can be recorded on any one of the several types of storage
medium, such as hard disk, CD, flash disk and so on.
Processing:
• The input data is changed to produce data in a more
useful form.
Output:
• The result of the preceding processing step is collected.
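As a minimal sketch of the cycle (assuming Python and made-up exam scores, purely for illustration):

# Minimal sketch of the data processing cycle: Input -> Processing -> Output.
# The scores below are hypothetical example data.

# Input: raw, unprocessed facts and figures
raw_scores = [72, 85, 90, 66, 78]

# Processing: restructure the raw data into something more useful
average = sum(raw_scores) / len(raw_scores)
highest = max(raw_scores)

# Output: present the result as information
print(f"Average score: {average:.1f}")
print(f"Highest score: {highest}")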
DATA SCIENCE APPLICATIONS AND
EXAMPLES
• Identifying and predicting disease
• Personalized healthcare recommendations
• Optimizing shipping routes in real-time
• Getting the most value out of soccer rosters
• Automating digital ad placement
• Predicting incarceration rates
Data types and their representation
Data types can be described from different perspectives.
1. In computer science and computer programming, for
instance,
A data type is simply an attribute of data that tells the
compiler or interpreter how the programmer intends to use the
data.
A data type constrains the values that an expression, such as a variable or
a function, might take.
This data type defines the operations that can be done on the data,
the meaning of the data, and the way values of that type can be
stored.
Data types from Computer programming
perspective
Integers (int) - used to store whole
numbers, mathematically known as integers
Booleans (bool) - used to represent values restricted to one of
two values: true or false
Characters (char) - used to store a single character
Floating-point numbers (float) - used to
store real numbers
Alphanumeric strings (string) - used to store a
combination of characters and numbers
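A minimal sketch of these types, illustrated here in Python (note that Python has no separate char type; a one-character string stands in for it, and the values are made up):

# Basic data types from a programming perspective (Python syntax)
age = 21                   # int: a whole number
is_registered = True       # bool: restricted to True or False
grade = 'A'                # char: a single character (a 1-character string in Python)
cgpa = 3.75                # float: a real (floating-point) number
student_id = "ETS0123/12"  # string: a combination of characters and numbers

print(type(age), type(is_registered), type(grade), type(cgpa), type(student_id))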
Data types from Data Analytics
perspective
From a data analytics point of view,
it is important to understand that there are three common
data types or structures:
1. Structured,
2. Semi-structured, and
3. Unstructured data types.
1. Structured Data
Structured data are those that can be easily organized,
stored and transferred in a defined data model.
Easily searchable by basic algorithms, as in spreadsheets.
Easily processed by computers.
Structured data conforms to a tabular format with a
relationship between the different rows and columns.
Example:
o Excel files or SQL databases
Example: a database table
ID | Name | Age | Department | CGPA
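A minimal sketch of querying such a table, assuming Python's built-in sqlite3 module; the student names and values below are hypothetical:

import sqlite3

# Structured data: a table with a fixed schema (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT, age INTEGER, department TEXT, cgpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?)",
    [(1, "Abebe", 21, "Software Eng.", 3.6), (2, "Sara", 22, "Electrical Eng.", 3.9)],
)

# Because the structure is defined, the data is easily searchable
for row in conn.execute("SELECT name, cgpa FROM student WHERE cgpa > 3.5"):
    print(row)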
2. Semi-structured Data
Their structures are irregular, implicit, flexible and often
nested hierarchically.
Is a form of structured data that does not conform to
the formal structure of data models associated with
relational databases.
It has some organizational properties, like tags and
other markers to separate semantic elements, that
make it easier to analyze.
It is also known as a self-describing structure.
o Examples include JSON and XML
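A minimal sketch of semi-structured data, assuming Python's json module and a made-up student record; the tags (keys) separate semantic elements, and records need not share one rigid schema:

import json

# Semi-structured data: keys/tags mark the semantic elements,
# and the structure is flexible and nested (hypothetical record).
record = """
{
  "id": 1,
  "name": "Abebe",
  "department": "Software Eng.",
  "courses": [
    {"code": "EmTe1012", "grade": "A"},
    {"code": "Math1011"}
  ]
}
"""
student = json.loads(record)
print(student["name"], len(student["courses"]))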
3. Unstructured Data
Is information that either does not have a predefined data
model or is not organized in a pre-defined manner.
They are not easily combined or computationally analyzed
Unstructured information is typically text-heavy but may
contain data such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it
difficult to understand using traditional programs as compared
to data stored in structured databases.
o Examples include text documents, audio or video files, and
PDFs
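A minimal sketch (assuming Python's re module and a made-up sentence) of why such data is harder to query: facts like dates are buried in free text and have to be extracted by pattern matching rather than a simple column lookup:

import re

# Unstructured data: free text with no predefined model (hypothetical note)
note = "Patient visited on 12/03/2021, complained of headache; follow-up on 26/03/2021."

# Facts such as dates have to be pulled out with pattern matching
dates = re.findall(r"\d{2}/\d{2}/\d{4}", note)
print(dates)  # ['12/03/2021', '26/03/2021']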
Metadata
Metadata – Data about Data
From a technical point of view, this is not a separate data
structure, but it is one of the most important elements
for Big Data analysis and big data solutions.
Metadata is data about data;
it describes the meaning of data.
It provides additional information about a specific set of
data.
Metadata is considered processed data and is used by
big data solutions for initial analysis.
o Example: In a set of photographs, metadata could describe
when and where the photos were taken, the camera settings,
and the file size.
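A minimal sketch of such photo metadata as a Python dictionary (all field names and values are hypothetical):

# Metadata: data about data. The photo itself is unstructured,
# but its metadata provides fields that can be analyzed directly.
photo_metadata = {
    "filename": "IMG_0042.jpg",
    "date_taken": "2021-03-12",
    "location": "Addis Ababa",
    "camera": "Canon EOS 1500D",
    "size_kb": 2048,
}
print(photo_metadata["date_taken"], photo_metadata["location"])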
Data Value Chain
Describes the process of data creation and use, from first
identifying a need for data to its final use and possible
reuse.
The Data Value Chain is introduced to describe the information
flow within a big data system as a series of steps needed to
generate value and useful insights from data.
Data chain: any combination of two or more data elements/
data items.
Data value: the worth or useful insight derived from data.
The Big Data Value Chain identifies the following key high-level
activities:
1. Data Acquisition
It is the process of gathering, filtering, and cleaning data
before it is put in a data warehouse or any other storage
on which data analysis can be carried out.
The acquired data is later used for data analysis.
Data acquisition is one of the major big data
challenges in terms of infrastructure requirements.
Data acquisition answers the following:
• How do we get the data?
• What kind of data do we need?
• Who owns the data?
2. Data Analysis
Making the raw data acquired amenable to use in decision-
making as well as domain-specific usage.
Data analysis involves exploring, transforming, and
modeling data with the goal of highlighting relevant data,
synthesizing and extracting useful hidden information with
high potential from a business point of view.
Related areas include data mining, business intelligence,
and machine learning.
3. Data Curation
It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for
its effective usage.
Is a process of extracting important information from scientific
tasks
o e.g. research
Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation.
4. Data Storage
It is the persistence and management of data in
a scalable way that satisfies the needs of
applications that require fast access to the data.
o E.g., Relational Database Management Systems
(RDBMS)
RDBMS: the main, and almost the only, solution to
the storage paradigm for nearly 40 years.
RDBMSs, however, are generally not used for big data.
5. Data Usage
It covers the data-driven business activities that need access
to data, its analysis, and the tools needed to integrate the
data analysis within the business activity.
Data usage in business decision-making can enhance
competitiveness through the reduction of costs, increased
added value, or any other parameter that can be measured
against existing performance criteria.
It involves interpreting the output data.
What Is Big Data?
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process them using on-hand
database management tools or traditional data processing applications.
Such data cannot be handled with a single computer.
“large dataset” means a dataset too large to reasonably process or
store with traditional tooling or on a single computer.
Big data is characterized by 3Vs (or 5Vs) and more:
Big data is characterized by 5Vs and more:
Volume:
• Refers to the vast amount of data generated every second.
• Data is generated from emails, social networking sites, photos,
videos, sensor data, etc.
• Now, with big data technology, we can store and use the data with
the help of distributed systems.
Variety
Refers to the different types of data we can now use.
Value: the most important V.
Having access to big data is no good unless we can turn
it into value.
Variety: data comes in many different forms from diverse
sources.
Velocity: data is live streaming or in motion.
Value: a mechanism to bring the correct meaning out of the
data.
Veracity: can we trust the data? How accurate is it?
Clustered Computing and Hadoop Ecosystem
Clustered Computing
Because of the qualities of big data, individual
computers are often inadequate for handling the
data at most stages.
To better address the high storage and
computational needs of big data, computer clusters
are a better fit.
Giving different tasks to different
computers
Cont’d…
Big data clustering software combines the resources
of many smaller machines,
seeking to provide a number of benefits:
o Resource Pooling/Sharing
o High Availability
o Easy Scalability
Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction
with big data easier.
It is a framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming models.
Hadoop is software that manages different computers that are found
in different locations but are connected to each other over a computer
network.
It is inspired by a technical document published by Google.
The four key characteristics of
Hadoop are:
Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
Reliable: It is reliable as it stores copies of the
data on different machines and is resistant to hardware
failure.
Scalable: It is easily scalable, both horizontally and vertically. A few
extra nodes help in scaling up the framework.
Flexible: It is flexible and you can store as much structured and
unstructured data as you need to and decide to use them later.
Cont’d…
Hadoop has an ecosystem that has evolved
from its four core components:
o Data management,
o Access,
o Processing, and
o Storage.
It is continuously growing to meet the needs
of Big Data.
It comprises the following components and
many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
ZooKeeper: Managing the cluster
Oozie: Job Scheduling
Figure: Hadoop Ecosystem
HDFS
HDFS is specially designed for storing huge
datasets on commodity hardware
Data is stored in a distributed manner
Enables fast data transfer among the nodes
It is all about storing and managing huge dataset
in a cluster
It is highly fault tolerant and efficient enough
to process huge amounts of data
• HDFS has two core components:
1. Name node (also called master) and
2. Data node (also called slave)
• Name node:
• Is the brain of the system
• There is only one name node
• Maintains and manages the data nodes, and also stores the
metadata
• If the name node crashes, the entire system goes down
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The first stage of Big Data processing is Ingest.
The data is ingested or transferred to Hadoop from
various sources such as relational databases,
systems, or local files.
Sqoop transfers data from RDBMS to HDFS,
whereas Flume transfers event data.
Big Data Life Cycle with Hadoop
2. Processing the data in storage
The second stage is Processing.
In this stage, the data is stored and processed.
The data is stored in the distributed file system,
HDFS, and in the NoSQL distributed database, HBase.
Spark and MapReduce perform the data processing.
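To give a feel for the map and reduce model, a minimal Python sketch of a word count follows (an illustration only, with made-up documents, not actual Hadoop or Spark code):

from collections import defaultdict

# Illustrative word count in the MapReduce style (hypothetical documents).
documents = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (key, value) pairs, here (word, 1)
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}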
Big Data Life Cycle with Hadoop
3. Computing and analyzing data
The third stage is to Analyze. Here, the data is
analyzed by processing frameworks such as Pig,
Hive, and Impala.
Pig converts the data using map and reduce operations and
then analyzes it.
Hive is also based on map and reduce
programming and is most suitable for structured
data.
Big Data Life Cycle with Hadoop
4. Visualizing the results
The fourth stage is Access, which is performed by
tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed
by users.