
Chapter 2: Data Science

Chapter Objectives
At the end of this chapter, students should be able to:
• Describe what data science is and the role of data scientists.
• Differentiate between data and information.
• Describe the data processing life cycle.
• Understand different data types from diverse perspectives.
• Describe the data value chain in the emerging era of big data.
• Understand the basics of big data.
• Describe the purpose of the Hadoop ecosystem components.
Chapter Outline
• Overview of data science
• Data Processing Cycle
• Data types and their representation
• Data Value Chain
• Basic concepts of big data

Overview of Data Science
What is data science?
• It is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.

What is data?
• Data can be defined as a representation of facts, concepts, or instructions in a formalized manner.
• It should be suitable for communication, interpretation, or processing by humans or electronic machines.
• It is unprocessed facts and figures.
• It is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, =, etc.), as well as pictures, sound, and video.

What is information?
 Information is processed data on which decisions and actions are based.
 It is data that has been processed into a form that is meaningful to the recipient and is of real value in the recipient's decisions.
 Information is interpreted data: created from organized, structured, and processed data in a particular context.

Data → PROCESSING → Information

Data Processing Cycle
 Data processing is the re-structuring of data by people or machines to increase its usefulness and add value for a particular purpose.
 The data processing cycle is the sequence of steps or operations for processing data into a usable format.
 The basic data processing steps are input, processing, and output.

Cont’d…
 Input: data is prepared in some convenient form for processing.
 Processing: the input data is changed to produce data in a more useful form.
 Output: the result of the preceding processing step is collected; the particular form of the output data depends on the intended use of the data.
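As a rough, non-slide illustration, the three steps of the cycle can be sketched in Python; the scores list and the average computation here are invented for the example:

# Input: data is prepared in a convenient form for processing
scores = [90, 93, 78, 85]

# Processing: the input data is changed into a more useful form
average = sum(scores) / len(scores)

# Output: the result of the preceding step is collected
print(f"Average score: {average}")  # Average score: 86.5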

Data types and their representation
 Data can be available in different formats and can be described from different perspectives:
1. Data types from a computer programming perspective
2. Data types from a data analytics perspective
3. Metadata

Data types from a computer programming perspective
 Almost all programming languages explicitly include the notion of data type.
 Common data types include:
• Integers (int): used to store whole numbers, mathematically known as integers
• Booleans (bool): used to represent values restricted to one of two states: true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of characters and numbers
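As a non-slide sketch, the same five types can be written in Python, where types are inferred at runtime; the variable names and values are invented for the example:

age = 24                 # integer (int): whole number
passed = True            # boolean (bool): true or false
grade = "A"              # character: Python has no char type, so a one-character string is used
gpa = 3.75               # floating-point number (float): real number
student_id = "ETS0912"   # alphanumeric string (str)

print(type(age), type(passed), type(grade), type(gpa), type(student_id))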
Data types from a data analytics perspective
 From a data analytics point of view, there are three common types of data:
1. Structured
2. Semi-structured
3. Unstructured

Cont’d…
Structured data
 Has a pre-defined data model and is therefore straightforward to analyze
 Conforms to a tabular format with a relationship between the different rows and columns
Example: Excel files or SQL databases

Name    Sex   Age   Result   Status
Abebe   M     24    90       Pass
Almaz   F     22    93       Pass
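As a non-slide illustration, this kind of tabular, schema-bound data is exactly what dataframe libraries handle; a minimal sketch in Python, assuming the pandas package is installed:

import pandas as pd

# The example table as a DataFrame: fixed columns, one row per record
df = pd.DataFrame({
    "Name":   ["Abebe", "Almaz"],
    "Sex":    ["M", "F"],
    "Age":    [24, 22],
    "Result": [90, 93],
    "Status": ["Pass", "Pass"],
})

# Structured data is straightforward to analyze, e.g. filter on a column
print(df[df["Result"] > 90])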

Cont’d…
Semi-structured data
 A form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables
 Contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
Example: JSON and XML are forms of semi-structured data

<student>
  <name>Abebe</name>
  <sex>Male</sex>
  <age>24</age>
  <Result>90</Result>
  <Status>Pass</Status>
</student>

<student>
  <name>Almaz</name>
  <sex>Female</sex>
  <age>22</age>
  <Result>93</Result>
  <Status>Pass</Status>
</student>
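For comparison (this snippet is not from the slides), the first record could equally be written in JSON, the other semi-structured format named above:

{
  "student": {
    "name": "Abebe",
    "sex": "Male",
    "age": 24,
    "Result": 90,
    "Status": "Pass"
  }
}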

Cont’d…
Unstructured data
 Information that either does not have a predefined data model or is not organized in a pre-defined manner
 Typically text-heavy, but may also contain data such as dates, numbers, and facts
Examples of unstructured data include audio files, video files, or NoSQL databases

Cont’d…
Metadata
• Metadata is data about data
• Provides additional information about a specific set of data; for example, a digital photo may carry metadata such as the date it was taken and the camera model
• It is frequently used by Big Data solutions for initial analysis
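As a small, non-slide sketch, a file's metadata (data about the file rather than its contents) can be read in Python; the file name here is hypothetical:

import os
import time

path = "report.txt"            # hypothetical file
info = os.stat(path)           # reads the file's metadata, not its contents
print("size in bytes:", info.st_size)
print("last modified:", time.ctime(info.st_mtime))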

Data Value Chain
 Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data
 It identifies the following key high-level activities:
• Data acquisition
• Data analysis
• Data curation
• Data storage
• Data usage
Cont’d…
Data Acquisition
 The process of gathering, filtering, and cleaning data before it is put into a data warehouse or any other storage solution on which data analysis can be carried out
Data Analysis
 Making the acquired raw data amenable to use in decision-making as well as domain-specific usage
 Involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information

Cont’d…
Data Curation
 The active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage
 The process includes different activities such as content creation, selection, classification, transformation, validation, and preservation
Data Storage
 The persistent management of data in a scalable way that satisfies the needs of applications requiring fast access to the data
 RDBMSs have been the main, and almost the only, solution to the storage paradigm for nearly 40 years
 NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models
Cont’d…
Data Usage
 Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity
 Enhances competitiveness through the reduction of costs and increased added value

Basic concepts of big data
• Due to the advent of new technologies, devices, and communication means such as social networking sites and IoT, the amount of data produced by mankind is growing rapidly every year.
• An estimated 328.77 million terabytes of data are created each day.
• If this data were stored on disks and the disks piled up, they could fill an entire football field.
What Is Big Data?
 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Big data is characterized by 3Vs and more:
1. Volume: large amounts of data (zettabytes / massive datasets)
2. Velocity: data is live, streaming, or in motion
3. Variety: data comes in many different forms from diverse sources
4. Veracity: can we trust the data? How accurate is it? etc.

Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages.
• To better address the high storage and computational needs of big data, computer clusters are a better fit.
• Clustered computing is a form of computing in which a group of computers (often called nodes) are connected through a LAN (local area network) so that they behave like a single machine.
• The set of computers is called a cluster.
• The resources of these computers are pooled so that the cluster appears as one computer more powerful than any of the individual machines.
Cont’d…
 Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
• Resource Pooling: combining the available storage space, CPU, and memory of the member machines. Processing large datasets requires large amounts of all three of these resources.
• High Availability: clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
• Easy Scalability: clusters make it easy to scale horizontally by adding additional machines to the group.
Cont’d…
 Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
 Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
 The machines involved in the computing cluster are also typically involved in the management of a distributed storage system.
Hadoop and its Ecosystem
 Hadoop is open-source software from the Apache Software Foundation used to store and process large non-relational data sets via a large, scalable, distributed model.
 It is an open-source framework intended to make interaction with big data easier.
 It allows for the distributed processing of large datasets across clusters of computers using simple programming models.
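To make “simple programming models” concrete, here is a minimal, purely local Python sketch of the MapReduce idea that underlies Hadoop (a word count); it does not use the actual Hadoop API, and the input lines are invented:

from collections import defaultdict

lines = ["big data is big", "data science uses big data"]

# Map: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted pairs by key (the word)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}

In a real cluster, the map and reduce steps run in parallel on different nodes over data stored in HDFS.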

Cont’d…
Characteristics of Hadoop Ecosystems
1. Economical: its systems are highly economical, as ordinary computers can be used for data processing.
2. Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
3. Scalable: it is easily scalable, both horizontally and vertically; a few extra nodes help in scaling up the framework.
4. Flexible: you can store as much structured and unstructured data as you need and decide to use it later.
Cont’d…
 It has an ecosystem that has evolved from its four core components:
A. Data Management
B. Data Access
C. Data Processing
D. Data Storage

Cont’d…
• The ecosystem is continuously growing to meet the needs of Big Data.
• It comprises the following components, among many others:
• HDFS: Hadoop Distributed File System
• HBase: NoSQL database
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
 The first stage of Big Data processing is Ingest.
 Data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
 Sqoop transfers data from RDBMSs to HDFS, whereas Flume transfers event data.
2. Processing the data in storage
 The second stage is Processing.
 The data is stored and processed.
 Data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
 Spark and MapReduce perform the data processing.
Cont’d…
3. Computing and analyzing data
 The third stage is Analyze.
 Data is analyzed by processing frameworks such as Pig, Hive, and Impala.
 Pig converts the data using map and reduce and then analyzes it.
 Hive is also based on map-and-reduce programming and is most suitable for structured data.
4. Visualizing the results
 The fourth stage is Access.
 Data access is performed by tools such as Hue and Cloudera Search.
 In this stage, the analyzed data can be accessed by users.
End of Chapter 2
