Chapter 2
Chapter Objectives
At the end of this chapter, students should be
able to:
Describe what data science is and the role of data
scientists.
Differentiate data and information.
Describe the data processing life cycle.
Understand different data types from diverse
perspectives.
Describe the data value chain in the emerging era of big data.
Understand the basics of Big Data.
Describe the purpose of the Hadoop ecosystem
components.
Chapter Outline
• Overview of data science
• Data Processing Cycle
• Data types and their representation
• Data value Chain
• Basic concepts of big data
Overview of data science
What is data science?
What is data?
• Data can be defined as a representation of facts,
concepts, or instructions in a formalized
manner,
• It should be suitable for communication,
interpretation, or processing, by human or
electronic machines.
• It consists of unprocessed facts and figures.
• It is represented with the help of characters such as
letters (A-Z, a-z), digits (0-9), or special
characters (+, -, /, *, =, etc.), as well as pictures, sound,
and video.
5 02/15/2025
What is information?
Information is the processed data on
which decisions and actions are based.
It is data that has been processed into a form
that is meaningful to the recipient and of real
value in the recipient's decisions.
Information is interpreted data; created
from organized, structured, and processed
data in a particular context.
Data → Processing → Information
6
Data Processing Cycle
Data processing is the restructuring of data by
people or machines to increase its usefulness
and add value for a particular purpose.
The data processing cycle is a sequence of steps or
operations that turns data into a usable
format.
Basic data processing steps are:- input,
processing, and output.
Cont’d…
Input:- data is prepared in some
convenient form for processing
Processing:- input data is changed to
produce data in a more useful form
Output
• Result of the preceding processing
step is collected
• Particular form of the output data
depends on the use of the data
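The three steps can be sketched in a few lines of Python. This is only an illustration: the function names, the sample scores, and the pass mark of 50 are assumptions made for the example, not part of the course material.

```python
# Minimal sketch of the input -> processing -> output cycle.
# Function names, sample data, and the pass mark are hypothetical.

def collect_input(raw_lines):
    """Input: prepare raw text in a convenient form (name, numeric score)."""
    records = []
    for line in raw_lines:
        name, score = line.split(",")
        records.append((name.strip(), int(score)))
    return records

def process(records, pass_mark=50):
    """Processing: change the input into a more useful form."""
    return [(name, score, "Pass" if score >= pass_mark else "Fail")
            for name, score in records]

def produce_output(results):
    """Output: collect the result in a form suited to its use (a text report)."""
    return "\n".join(f"{name}: {score} ({status})"
                     for name, score, status in results)

raw = ["Abebe, 90", "Almaz, 93"]
report = produce_output(process(collect_input(raw)))
print(report)
```

The same processed records could be written to a file or a chart instead of a text report. That choice is what "the form of the output depends on the use of the data" means in practice.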
Data types and their representation
Data can be available in different
formats and can be described from
different perspectives.
1. Data types from computer programming
perspective
2. Data types from data analytics
perspective
3. Metadata
Data types from Computer
programming perspective
Almost all programming languages explicitly
include the notion of a data type.
Common data types include:
• Integers (int): used to store whole numbers,
mathematically known as integers
• Booleans (bool): used to represent values restricted to
one of two options: true or false
• Characters (char): used to store a single
character
• Floating-point numbers (float): used to store
real numbers
• Alphanumeric strings (string): used to store a
combination of characters and numbers
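As a small illustration, the types listed above look like this in Python. The variable names and values are hypothetical. Python has no separate char type, so a single character is stored as a string of length 1.

```python
# Illustrative values for the common data types (all values are made up).
age = 24                 # int: a whole number
passed = True            # bool: one of two values, True or False
grade = "A"              # single character: a str of length 1 in Python
average = 91.5           # float: a real number
student_id = "ETS0123"   # str: a combination of characters and digits

# Ask Python which type it assigned to each value.
type_names = [type(v).__name__ for v in (age, passed, grade, average, student_id)]
print(type_names)
```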
Data types from Data Analytics
perspective
From a data analytics point of view,
Three common types of data
1. Structured,
2. Semi-structured, and
3. Unstructured data type
Cont’d…
Structured data
Adheres to a pre-defined data model and is therefore
straightforward to analyze
Conforms to a tabular format with a relationship
between the different rows and columns
Examples: Excel files or SQL databases
Name   Sex  Age  Result  Status
Abebe  M    24   90      Pass
Almaz  F    22   93      Pass
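Because every row follows the same columns, structured data can be analyzed with very little code. The sketch below assumes the table is stored as comma-separated text, which is a choice made for this example; it reads the rows with Python's standard csv module and averages the Result column.

```python
import csv
import io

# The table from the slide, assumed here to be stored as CSV text.
table = """Name,Sex,Age,Result,Status
Abebe,M,24,90,Pass
Almaz,F,22,93,Pass
"""

# Each row becomes a dictionary keyed by the column names in the header.
rows = list(csv.DictReader(io.StringIO(table)))

# The fixed schema lets us address a column by name.
average_result = sum(int(row["Result"]) for row in rows) / len(rows)
print(average_result)
```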
Cont’d…
Semi-structured data
A form of structured data that does not conform to the
formal structure of data models associated with relational
databases or other forms of data tables.
Contains tags or other markers to separate semantic
elements and enforce hierarchies of records and fields within
the data
Example:- JSON and XML are forms of semi-structured data
<student>
<name>Abebe</name>
<sex>Male</sex>
<age>24</age>
<Result>90</Result>
<Status>Pass</Status>
</student>
<student>
<name>Almaz</name>
<sex>Female</sex>
<age>22</age>
<Result>93</Result>
<Status>Pass</Status>
</student>
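The tags make the records machine-readable even without a table schema. The following sketch parses the two student records with Python's standard library and converts them to JSON. The enclosing <students> element is an assumption added for the example, because an XML document needs a single root element.

```python
import json
import xml.etree.ElementTree as ET

# The two records from the slide, wrapped in a hypothetical <students> root.
xml_text = """<students>
  <student><name>Abebe</name><sex>Male</sex><age>24</age>
    <Result>90</Result><Status>Pass</Status></student>
  <student><name>Almaz</name><sex>Female</sex><age>22</age>
    <Result>93</Result><Status>Pass</Status></student>
</students>"""

root = ET.fromstring(xml_text)

# Turn each <student> element into a dictionary of tag -> text.
students = [{field.tag: field.text for field in student}
            for student in root.findall("student")]

# The same records expressed as JSON, another semi-structured format.
json_text = json.dumps(students, indent=2)
print(json_text)
```

Note that every value is read as text (for example "24"), since neither format carries a fixed schema that says which fields are numbers.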
Cont’d…
Unstructured data
Information that either does not have a
predefined data model or is not organized in
a pre-defined manner.
Typically text-heavy but may contain data
such as dates, numbers, and facts as well.
Examples of unstructured data include audio and
video files or NoSQL databases
Cont’d…
Metadata
• Metadata is data about data
• Provides additional information about
a specific set of data
• It is frequently used by Big Data
solutions for initial analysis
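A familiar example of metadata is the information a file system keeps about a file. The sketch below writes a small temporary file and then reads its size, extension, and modification time without opening the contents. The file contents and the dictionary keys are hypothetical.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
content = b"Abebe,M,24,90,Pass\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as handle:
    handle.write(content)
    path = handle.name

# os.stat returns data about the file, not the data inside it.
info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,
    "extension": os.path.splitext(path)[1],
    "modified_timestamp": info.st_mtime,
}

os.remove(path)  # clean up the temporary file
print(metadata)
```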
Data value Chain
Describes the information flow within a big
data system as a series of steps needed
to generate value and useful insights
from data
It identifies the following key high-level
activities:
Data acquisition
Data analysis
Data curation
Data storage
Data usage
Cont’d…
Data Acquisition
Process of gathering, filtering, and cleaning data
before it is put in a data warehouse or any other
storage solution on which data analysis can be
carried out.
Data Analysis
Making the raw data acquired amenable to use in
decision-making as well as domain-specific usage
Involves exploring, transforming, and modeling
data with the goal of highlighting relevant data,
synthesizing and extracting useful hidden
information
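A toy version of these two activities, assuming a handful of hypothetical sensor readings that arrive as text: acquisition filters and cleans the raw values, and analysis then summarizes them into figures a decision could be based on.

```python
from statistics import mean

# Hypothetical raw readings as they might arrive from a source.
raw_readings = [" 21.5", "22.0", "", "n/a", "23.5 "]

def acquire(raw):
    """Acquisition: gather, filter, and clean values before storage."""
    cleaned = []
    for item in raw:
        text = item.strip()
        try:
            cleaned.append(float(text))
        except ValueError:
            continue  # filter out empty or malformed entries
    return cleaned

def analyze(values):
    """Analysis: transform the cleaned data into a short summary."""
    return {"count": len(values), "mean": mean(values), "max": max(values)}

readings = acquire(raw_readings)
summary = analyze(readings)
print(summary)
```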
Cont’d…
Data Curation
Active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its
effective usage
The process includes different activities such as content
creation, selection, classification, transformation,
validation, and preservation.
Data Storage
Persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access
to the data
Relational database management systems (RDBMS) have been
the main, and almost the only, solution to the storage
paradigm for nearly 40 years
NoSQL technologies have been designed with the
scalability goal in mind and present a wide range of
solutions based on alternative data models
Cont’d…
Data Usage
It covers the data-driven business activities that
need access to data, its analysis, and the tools
needed to integrate the data analysis within the
business activity
Data usage can enhance competitiveness through reduced
costs and increased added value
Basic concepts of big data
• Due to the advent of new technologies, devices, and
means of communication such as social networking sites and
the IoT, the amount of data produced by mankind is
growing rapidly every year.
What Is Big Data?
Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using
on-hand database management tools or traditional data
processing applications
Big data is characterized by 3V and more:
1. Volume: large amounts of data (zettabytes, massive
datasets)
2. Velocity: Data is live streaming or in motion
3. Variety: data comes in many different forms from diverse
sources
4. Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing and Hadoop
Ecosystem
Clustered Computing
• Because of the qualities of big data, individual computers are
often inadequate for handling the data at most stages.
• To better address the high storage and computational needs
of big data, computer clusters are a better fit.
• Clustered computing is a form of computing in which a group of
computers (often called nodes) are connected through a LAN
(local area network) so that they behave like a single machine.
• The set of computers is called a cluster.
• The resources from these computers are pooled to appear as one
more powerful computer than the individual computers.
Cont’d…
Big data clustering software combines the
resources of many smaller machines,
seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage
space is one benefit; CPU and memory pooling are also
extremely important. Processing large datasets requires
large amounts of all three of these resources.
• High Availability: Clusters can provide varying levels
of fault tolerance and availability guarantees to
prevent hardware or software failures from affecting
access to data and processing.
• Easy Scalability: Clusters make it easy to scale
horizontally by adding additional machines to the
group.
Cont’d…
Using clusters requires a solution for
managing cluster membership,
coordinating resource sharing, and
scheduling actual work on individual
nodes.
Cluster membership and resource
allocation can be handled by software like
Hadoop’s YARN (which stands for Yet
Another Resource Negotiator).
The machines involved in the computing
cluster are also typically involved with the
management of a distributed storage
system
Hadoop and its Ecosystem
Hadoop is open-source software from the
Apache Software Foundation for storing and
processing large non-relational data sets using a
large, scalable distributed model
Open-source framework intended to
make interaction with big data easier.
Allows for the distributed processing of
large datasets across clusters of computers
using simple programming models.
Cont’d…
Characteristics of Hadoop Ecosystems
1. Economical: Its systems are highly economical
as ordinary computers can be used for data
processing.
2. Reliable: It is reliable as it stores copies of the
data on different machines and is resistant to
hardware failure.
3. Scalable: It is easily scalable, both horizontally
and vertically. A few extra nodes help in scaling
up the framework.
4. Flexible: It is flexible and you can store as
much structured and unstructured data as you
need to and decide to use them later.
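The reliability point can be illustrated with a simplified replica placement sketch. It assumes a replication factor of 3, which is the HDFS default, and a plain round-robin assignment. Real HDFS placement is rack-aware and more involved, and the node and block names here are hypothetical.

```python
# Simplified sketch: store copies of each block on different machines.
# Round-robin placement is an assumption for illustration only.

def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for index, block in enumerate(block_ids):
        placement[block] = [nodes[(index + k) % len(nodes)]
                            for k in range(replication)]
    return placement

def readable_blocks(placement, failed_node):
    """Blocks that still have at least one copy after a node fails."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_a", "blk_b"], nodes)
survivors = readable_blocks(placement, failed_node="node2")
print(placement)
print(survivors)
```

Losing a single machine leaves every block readable, because each block has copies on two other nodes.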
Cont’d…
Hadoop has an ecosystem that has evolved from its four core
components:
A. Data management,
B. Data Access
C. Data Processing
D. Data Storage
Cont’d…
• It is continuously growing to meet the needs
of Big Data.
• It comprises the following components and
many others:
• HDFS: Hadoop Distributed File System
• HBase: NoSQL Database
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data services
• Mahout, Spark MLlib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• ZooKeeper: Managing cluster
• Oozie: Job Scheduling
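To give a feel for the MapReduce programming model named in the list, here is a word count written in plain Python. It only imitates the map, shuffle, and reduce phases on one machine. It does not use Hadoop's actual API, and the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big clusters", "clusters pool data"]

# In a real cluster, each document could be mapped on a different node.
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across the nodes of a cluster.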
Big Data Life Cycle with Hadoop