Chapter 2 - Data Science
Unit objectives
Data science is much more than simply analyzing data; it plays a wide range of roles, including the following:
Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinctive business advantage.
It can help you detect fraud using advanced machine learning algorithms.
It can also help you prevent significant monetary losses.
It allows you to build intelligent capabilities into machines.
You can perform sentiment analysis to gauge customer brand loyalty.
It enables you to make better and faster decisions.
It helps you recommend the right product to the right customer to enhance your business.
Data vs. Information
Data can be described as unprocessed facts and figures, represented with the help of characters such as alphabets, digits, or special symbols.
Information is processed, organized data presented in a given context, on which decisions and actions are based.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data to increase its usefulness. It consists of the following basic steps:
Input - the input data is prepared in some convenient form for processing.
Processing - in this step, the input data is changed to produce data in a more useful form. For example, a summary of sales for the month can be calculated from the sales orders.
Output - at this stage, the result of the preceding processing step is collected.
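As a minimal sketch of these three steps (the sales figures and field names below are invented for illustration), the following Python snippet summarizes monthly sales from individual orders:

```python
# Input: raw sales orders, prepared in a convenient form (a list of records).
orders = [
    {"order_id": 1, "month": "2024-01", "amount": 120.0},
    {"order_id": 2, "month": "2024-01", "amount": 80.0},
    {"order_id": 3, "month": "2024-02", "amount": 50.0},
]

# Processing: transform the input into a more useful form (totals per month).
summary = {}
for order in orders:
    summary[order["month"]] = summary.get(order["month"], 0.0) + order["amount"]

# Output: collect and present the result of the processing step.
for month, total in sorted(summary.items()):
    print(f"{month}: {total}")
```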
Data types and their representations
A data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
Data types from a computer programming perspective
Almost all programming languages explicitly include the notion of data type; common data types include integers, booleans, characters, floating-point numbers, and alphanumeric strings.
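As a small, hedged illustration (the specific values shown are invented examples, not taken from the text), the snippet below shows a few common primitive types and how a value's type determines which operations are valid and how it is stored:

```python
# Common primitive data types from a programming perspective.
age = 25            # integer: supports arithmetic, stored as a whole number
price = 19.99       # floating point: supports arithmetic with fractions
is_active = True    # boolean: supports logical operations (and, or, not)
name = "Alemu"      # string: supports concatenation, slicing, searching

# The type defines which operations are meaningful:
print(age + 5)        # arithmetic on an integer
print(name.upper())   # a string operation; age.upper() would raise an error
print(type(price))    # <class 'float'>
```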
From a data analytics perspective, there are three common types of data (or data structures): structured, semi-structured, and unstructured data.
Structured data:-
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze, e.g. tables in Excel files or SQL databases.
Unstructured data:-
Unstructured data is information that either does not have a predefined
data model or is not organized in a pre-defined manner.
It is typically text-heavy, but may also contain data such as dates, numbers, and facts.
E.g. audio files, video files, or NoSQL databases.
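To make the three categories concrete, here is a small sketch (the sample records are invented for illustration) showing the same customer feedback as a structured table row, a semi-structured JSON document, and unstructured free text:

```python
import json

# Structured: a fixed schema, e.g. a row in a relational table.
structured_row = ("C001", "2024-03-15", 4)   # (customer_id, date, rating)

# Semi-structured: self-describing keys/tags, but no rigid schema (e.g. JSON).
semi_structured = json.loads('{"customer": "C001", "rating": 4, "tags": ["fast", "friendly"]}')

# Unstructured: no predefined data model; typically text-heavy,
# though it may contain dates, numbers, and facts.
unstructured = "The delivery on 2024-03-15 was fast and the staff were friendly."

print(structured_row[2], semi_structured["rating"], len(unstructured.split()))
```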
Data types from a data analytics perspective cont'd
Metadata:-
Metadata is data about data. It provides additional information about a specific set of data, e.g. the author, date created, and file size of a document.
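A tiny sketch of metadata in practice, assuming a local file path (sales_orders.csv is a hypothetical example) exists on your machine; the file's size and modification time describe the data without being the data itself:

```python
import os
import time

# Metadata describes the data (the file's contents) rather than being the data itself.
path = "sales_orders.csv"          # assumed example path
info = os.stat(path)
print("size in bytes:", info.st_size)
print("last modified:", time.ctime(info.st_mtime))
```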
Data Value Chain
The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
It identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Data Acquisition
The infrastructure required for big data acquisition must deliver low, predictable latency both in capturing data and in executing queries.
It must handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
It also deals with making the acquired raw data amenable to use in the decision-making process.
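As a loose, single-machine illustration only (no particular ingestion tool is implied), the sketch below buffers incoming records and flushes them in batches, one simple way to keep capture latency low and predictable under a high volume of events:

```python
from collections import deque

BATCH_SIZE = 1000
buffer = deque()

def acquire(event, sink):
    """Capture one raw event; flush to the sink in batches to sustain high volumes."""
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        sink.extend(buffer)   # in a real system this would be a distributed store
        buffer.clear()

# Usage: simulate a stream of raw events.
store = []
for i in range(2500):
    acquire({"event_id": i, "payload": "raw data"}, store)
print(len(store))   # 2000 - the remaining 500 events are still buffered
```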
Data Curation
Data curation is the active management of data over its life cycle to ensure it meets the data quality requirements for its effective usage.

What is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Clustered Computing
Because of the quantities of big data, individual computers are often
inadequate for handling the data at most stages.
To better address the high storage and computational needs of big data,
computer clusters are a better fit.
Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits such as:
Resource Pooling: combining the available storage space and CPU of many machines to process large datasets.
High Availability: Clusters can provide varying levels of fault
tolerance and availability
Easy Scalability: Clusters make it easy to scale horizontally by adding
additional machines to the group.
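The snippet below is only a single-machine analogy of resource pooling: it uses Python's multiprocessing to spread work over several CPU cores, whereas real big data clustering software applies the same idea across many machines:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Work done by one 'worker' - here a process, on a cluster a separate machine."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
    with Pool(processes=4) as pool:     # pooled compute resources
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```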
Clustered Computing cont’d
What is Hadoop?
Hadoop is basically an open-source framework, based on the Java programming language, that allows for the distributed processing and storage of large data sets across clusters of computers.
It hides underlying system details and complexities from the user.
Developed in Java
Flexible, enterprise-class support for processing large volumes of
data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Hadoop and its Ecosystem
What is Hadoop?
Hadoop enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner (a toy sketch of this map-and-reduce style follows the list below).
CPU + disks = “node”
Nodes can be combined into clusters
New nodes can be added as needed without changing:
Data formats
How data is loaded
How jobs are written
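To give a feel for this highly parallel programming model, here is a toy, single-process simulation of the map, shuffle, and reduce phases of a word count; it illustrates the idea only and does not use Hadoop's actual API:

```python
from collections import defaultdict

lines = ["hadoop stores data", "spark processes data", "data is the new oil"]

# Map phase: each line is turned into (word, 1) pairs; on a cluster,
# many nodes would run this step in parallel on different blocks of input.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key (word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'hadoop': 1, 'stores': 1, 'data': 3, ...}
```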
Hadoop and its Ecosystem Cont’d
Hadoop has an ecosystem that has evolved from its four core components: data management,
access, processing, and storage.
The following components collectively form the Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
ZooKeeper: Managing the cluster
Oozie: Job Scheduling
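As one hedged example of using an ecosystem component, the PySpark sketch below (assuming the pyspark package is installed and run in local mode) computes the earlier monthly sales summary as an in-memory, distributed computation:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster the master would point to YARN.
spark = SparkSession.builder.appName("SalesSummary").master("local[*]").getOrCreate()

orders = spark.createDataFrame(
    [("2024-01", 120.0), ("2024-01", 80.0), ("2024-02", 50.0)],
    ["month", "amount"],
)

# Spark keeps intermediate data in memory and distributes the work across executors.
orders.groupBy("month").sum("amount").show()

spark.stop()
```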