Chapter 2

Chapter 2 provides an overview of data science, including its definition, the data processing cycle, and the distinction between data and information. It discusses various data types, the data value chain in the context of big data, and introduces the Hadoop ecosystem as a solution for managing large datasets. Key concepts such as big data characteristics, clustered computing, and the life cycle of big data processing with Hadoop are also covered.

Uploaded by

Ali Hussen

Chapter 2: Data Science

Chapter Objectives
At the end of this chapter, students should be able to:
• Describe what data science is and the role of data scientists.
• Differentiate between data and information.
• Describe the data processing life cycle.
• Understand different data types from diverse perspectives.
• Describe the data value chain in the emerging era of big data.
• Understand the basics of big data.
• Describe the purpose of the Hadoop ecosystem components.

Chapter Outline
• Overview of data science
• Data processing cycle
• Data types and their representation
• Data value chain
• Basic concepts of big data

Overview of Data Science
What is data science?

• It is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.

What is data?
• Data can be defined as a representation of facts, concepts, or instructions in a formalized manner.
• It should be suitable for communication, interpretation, or processing by humans or electronic machines.
• It consists of unprocessed facts and figures.
• It is represented with characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, =, etc.), as well as pictures, sound, and video.

5 02/15/2025
What is information?
• Information is processed data on which decisions and actions are based.
• It is data that has been processed into a form that is meaningful to the recipient and has real value in the recipient's decisions.
• Information is interpreted data, created from organized, structured, and processed data in a particular context.
Data → Processing → Information

Data Processing Cycle
• Data processing is the restructuring of data by people or machines to increase its usefulness and add value for a particular purpose.
• The data processing cycle is a sequence of steps or operations that turns data into a usable format.
• The basic data processing steps are input, processing, and output.

Cont’d…
• Input: data is prepared in some convenient form for processing.
• Processing: input data is changed to produce data in a more useful form.
• Output: the result of the preceding processing step is collected. The particular form of the output data depends on the use of the data.
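
The three steps above can be sketched in a few lines of Python. This is only an illustration under assumed names: the functions read_input, process, and write_output and the sample scores are hypothetical and do not come from the slides.

```python
# Hypothetical illustration of the input -> processing -> output cycle.

def read_input(raw_lines):
    """Input: prepare raw text lines in a convenient form (a list of integers)."""
    return [int(line.strip()) for line in raw_lines if line.strip()]

def process(scores):
    """Processing: change the prepared data into a more useful form (a summary)."""
    return {"count": len(scores), "average": sum(scores) / len(scores)}

def write_output(summary):
    """Output: collect the result in a form suited to its use (a report line)."""
    return f"{summary['count']} scores, average {summary['average']:.1f}"

raw = ["90\n", "93\n", "\n"]          # assumed sample input, one score per line
report = write_output(process(read_input(raw)))
print(report)  # 2 scores, average 91.5
```

Each function maps to one step of the cycle, so the output format can change (for example, to a file or a chart) without touching the input or processing steps.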

Data types and their representation
Data can be available in different formats and can be described from different perspectives:
1. Data types from a computer programming perspective
2. Data types from a data analytics perspective
3. Metadata

Data types from a computer programming perspective
Almost all programming languages explicitly include the notion of a data type. Common data types include:
• Integers (int): used to store whole numbers, mathematically known as integers.
• Booleans (bool): used to represent values restricted to one of two values, true or false.
• Characters (char): used to store a single character.
• Floating-point numbers (float): used to store real numbers.
• Alphanumeric strings (string): used to store a combination of characters and numbers.
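
As a rough illustration, the common types listed above look like this in Python. The variable names and values are assumptions chosen for the example. Note that Python has no separate char type, so a one-character string stands in for it here.

```python
# Hypothetical values showing the common data types in Python.
age = 24             # int: a whole number
passed = True        # bool: restricted to True or False
grade = "A"          # no char type in Python; a one-character str is used instead
result = 90.5        # float: a real number
name = "Abebe24"     # str: a combination of characters and digits

# Print the type name next to each value.
for value in (age, passed, grade, result, name):
    print(type(value).__name__, value)
```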
Data types from a data analytics perspective
From a data analytics point of view, there are three common types of data:
1. Structured data
2. Semi-structured data
3. Unstructured data

Cont’d…
Structured data
• Adheres to a pre-defined data model and is therefore straightforward to analyze.
• Conforms to a tabular format with a relationship between the different rows and columns.
Example: Excel files or SQL databases

Name   Sex  Age  Result  Status
Abebe  M    24   90      Pass
Almaz  F    22   93      Pass
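
Because structured data has a fixed set of columns, it can be read and analyzed with very little code. The sketch below assumes the table above is stored as comma-separated text. It uses Python's standard csv module, and the variable names are illustrative only.

```python
import csv
import io

# The example table, written as CSV text (an assumed storage format).
table = io.StringIO(
    "Name,Sex,Age,Result,Status\n"
    "Abebe,M,24,90,Pass\n"
    "Almaz,F,22,93,Pass\n"
)

# Each row becomes a dict keyed by the column names in the header.
rows = list(csv.DictReader(table))

# The pre-defined columns make analysis straightforward.
average_result = sum(int(row["Result"]) for row in rows) / len(rows)
print(average_result)  # 91.5
```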

Cont’d…
Semi-structured data
• A form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
• Contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
Example: JSON and XML are forms of semi-structured data.
<student>
  <name>Abebe</name>
  <sex>Male</sex>
  <age>24</age>
  <Result>90</Result>
  <Status>Pass</Status>
</student>
<student>
  <name>Almaz</name>
  <sex>Female</sex>
  <age>22</age>
  <Result>93</Result>
  <Status>Pass</Status>
</student>
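
The tags in the example are what make the records machine-readable. As a minimal sketch, the two student records can be parsed with Python's standard xml.etree.ElementTree module. The enclosing students element is an assumption added here, because an XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# The two records from the example, wrapped in an assumed <students> root.
doc = """
<students>
  <student><name>Abebe</name><sex>Male</sex><age>24</age>
    <Result>90</Result><Status>Pass</Status></student>
  <student><name>Almaz</name><sex>Female</sex><age>22</age>
    <Result>93</Result><Status>Pass</Status></student>
</students>
""".strip()

root = ET.fromstring(doc)

# The tags separate the fields, so each one can be looked up by name.
names = [student.findtext("name") for student in root.findall("student")]
results = [int(student.findtext("Result")) for student in root.findall("student")]
print(names, results)
```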

Cont’d…
Unstructured data
• Information that either does not have a predefined data model or is not organized in a pre-defined manner.
• Typically text-heavy, but may also contain data such as dates, numbers, and facts.
Examples: audio files, video files, or NoSQL databases

Cont’d…
Metadata
• Metadata is data about data
• Provides additional information about
a specific set of data
• It is frequently used by Big Data
solutions for initial analysis
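
File system attributes are one everyday example of metadata. The sketch below is an illustration chosen for this note, not something from the slides. It writes a small temporary file and then reads information about the file (its name, size, and modification time) rather than its contents.

```python
import os
import tempfile

# Create a small file so there is some data to describe (assumed sample content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Abebe,24,90")
    path = f.name

# os.stat returns data about the file, not the file's contents.
info = os.stat(path)
meta = {
    "file_name": os.path.basename(path),
    "size_bytes": info.st_size,
    "modified_timestamp": info.st_mtime,
}
os.remove(path)  # clean up the temporary file
print(meta)
```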

Data Value Chain
• Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• It identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Cont’d…
Data Acquisition
• The process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
Data Analysis
• Makes the raw data acquired amenable to use in decision-making as well as domain-specific usage.
• Involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information.

Cont’d…
Data Curation
• The active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
• The process includes activities such as content creation, selection, classification, transformation, validation, and preservation.
Data Storage
• The persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
• Relational database management systems (RDBMS) have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
• NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
Cont’d…
Data Usage
• It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
• It can enhance competitiveness through reduced costs and increased added value.

Basic concepts of big data
• Due to the advent of new technologies, devices, and communication means such as social networking sites and the IoT, the amount of data produced by mankind is growing rapidly every year.

• About 328.77 million terabytes of data are produced each day.

• If this data were stored on disks and the disks were piled up, they might fill an entire football field.
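
To put the daily figure in context, it can be converted to zettabytes, the unit used for big data volume. The calculation below is a rough sketch that assumes decimal units (1 terabyte = 10^12 bytes, 1 zettabyte = 10^21 bytes) and a 365-day year.

```python
# Assumed decimal (SI) unit sizes in bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21

daily_terabytes = 328.77e6                    # figure quoted above
daily_bytes = daily_terabytes * TERABYTE

daily_zettabytes = daily_bytes / ZETTABYTE    # roughly 0.33 ZB per day
yearly_zettabytes = daily_zettabytes * 365    # roughly 120 ZB per year

print(f"{daily_zettabytes:.3f} ZB per day, {yearly_zettabytes:.0f} ZB per year")
```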

What Is Big Data?
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Big data is characterized by the 3Vs and more:
1. Volume: large amounts of data (zettabytes, massive datasets)
2. Velocity: data is live-streaming or in motion
3. Variety: data comes in many different forms from diverse sources
4. Veracity: can we trust the data? How accurate is it?

Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages.
• Computer clusters are a better fit for the high storage and computational needs of big data.
• Clustered computing is a form of computing in which a group of computers (often called nodes) are connected through a LAN (local area network) so that they behave like a single machine.
• The set of computers is called a cluster.
• The resources from these computers are pooled so that they appear as one computer more powerful than any of the individual computers.
Cont’d…
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
• Resource pooling: the storage space, CPU, and memory of the machines are combined. Processing large datasets requires large amounts of all three of these resources.
• High availability: clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
• Easy scalability: clusters make it easy to scale horizontally by adding machines to the group.
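
Resource pooling and horizontal scaling can be pictured with a small model. This is a simplified sketch, not how real cluster software tracks resources. The Node class, its fields, and the sample machine sizes are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One machine in the cluster (hypothetical model)."""
    name: str
    storage_tb: float
    cpu_cores: int
    memory_gb: int

def pooled_resources(nodes):
    """Resource pooling: sum storage, CPU, and memory across all nodes."""
    return {
        "storage_tb": sum(n.storage_tb for n in nodes),
        "cpu_cores": sum(n.cpu_cores for n in nodes),
        "memory_gb": sum(n.memory_gb for n in nodes),
    }

cluster = [Node("node1", 4, 8, 32), Node("node2", 4, 8, 32)]
total = pooled_resources(cluster)

# Easy scalability: scaling horizontally means adding another machine.
cluster.append(Node("node3", 8, 16, 64))
scaled = pooled_resources(cluster)
print(total, scaled)
```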
Cont’d…
• Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
• Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
• The machines involved in the computing cluster are also typically involved with the management of a distributed storage system.

Hadoop and its Ecosystem
• Hadoop is open-source software from the Apache Software Foundation for storing and processing large non-relational data sets via a large, scalable distributed model.
• It is an open-source framework intended to make interaction with big data easier.
• It allows for the distributed processing of large datasets across clusters of computers using simple programming models.

Cont’d…
Characteristics of the Hadoop Ecosystem
1. Economical: its systems are highly economical, as ordinary computers can be used for data processing.
2. Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
3. Scalable: it is easily scalable both horizontally and vertically. A few extra nodes help in scaling up the framework.
4. Flexible: you can store as much structured and unstructured data as you need and decide how to use it later.
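
The reliability point can be illustrated with a toy model of replication. The sketch below is a deliberate simplification: it places copies of each data block on distinct nodes in round-robin order, which is not HDFS's actual placement policy. The function names and node names are hypothetical. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive positions modulo len(nodes) are distinct when replication <= len(nodes).
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one copy after one node fails."""
    return [
        block for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]

nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)
print(placement)
print(surviving_blocks(placement, "n1"))  # every block survives a single failure
```

Because every block lives on three different machines, losing any one machine leaves at least two copies of each block available.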
Cont’d…
• Hadoop has an ecosystem that has evolved from its four core components:
A. Data management
B. Data access
C. Data processing
D. Data storage

Cont’d…
The ecosystem is continuously growing to meet the needs of big data. It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• HBase: NoSQL database
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: managing the cluster
• Oozie: job scheduling
Big Data Life Cycle with Hadoop

1. Ingesting data into the system
• The first stage of big data processing is Ingest.
• Data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage
• The second stage is Processing.
• In this stage, the data is stored and processed.
• Data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
• Spark and MapReduce perform the data processing.
Cont’d…
3. Computing and analyzing data
• The third stage is Analyze.
• Data is analyzed by processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes it.
• Hive is also based on map and reduce programming and is most suitable for structured data.
4. Visualizing the results
• The fourth stage is Access.
• Data access is performed by tools such as Hue and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
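
The map and reduce programming model mentioned in stages 2 and 3 can be sketched on a single machine. The example below is plain Python, not the Hadoop API. It counts words using a map step, a grouping (shuffle) step, and a reduce step. The function names and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data needs clusters", "clusters pool data"]   # assumed sample input
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the framework performs the shuffle between machines. The sketch keeps only the logical structure.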
End of Chapter 2
