Chapter 2 Data Science
Course Name: Introduction to Emerging Technologies Course Module (EMTE1011/1012)
Topics Covered
1. An Overview of Data Science
2. Data and information
3. Data types and representation
4. Data Processing Cycle
5. Data Value Chain (Acquisition, Analysis, Curation, Storage, Usage)
6. Basic concepts of Big data
2.1) An Overview of Data Science
i) Data:
Data can be defined as a representation of facts, concepts,
or instructions in a formalized manner, which should be
suitable for communication, interpretation, or processing
by humans or electronic machines.
It can be described as unprocessed facts and figures.
It is represented with the help of characters such as
Alphabets (A-Z, a-z),
Digits (0-9) or
Special characters (+, -, /, *, <, >, =, etc.).
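As a minimal sketch of this idea, the Python snippet below sorts the characters of a raw data string into the three groups listed above. The function name classify_characters and the sample string are illustrative assumptions, not part of the course material.

```python
def classify_characters(raw: str) -> dict:
    """Group the characters of a raw data string into alphabets, digits, and special characters."""
    groups = {"alphabets": [], "digits": [], "special": []}
    for ch in raw:
        if ch.isascii() and ch.isalpha():
            groups["alphabets"].append(ch)   # A-Z, a-z
        elif ch.isascii() and ch.isdigit():
            groups["digits"].append(ch)      # 0-9
        elif not ch.isspace():
            groups["special"].append(ch)     # everything else except whitespace
    return groups


# Hypothetical piece of raw data
sample = "x = 42 + y"
result = classify_characters(sample)
print(result)
```

On its own, the string carries no meaning for a decision-maker; it is only a sequence of characters until it is processed and interpreted.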
ii) Information:
It is the processed data on which decisions and
actions are based.
It is data that has been processed into a form that is
meaningful to the recipient.
Information has real or perceived value in the current
or prospective actions or decisions of the recipient.
Furthermore, information is interpreted data,
created from organized, structured, and processed data in
a particular context.
[Figure: the data processing cycle: Input → Process → Output]
For example, when electronic computers are used, the input data can be
recorded on any one of the several types of storage medium, such as hard
disk, CD, flash disk and so on.
2) Processing
The input data is changed to produce data in a more useful form.
For example, interest can be calculated on a bank deposit, or a
summary of sales for the month can be calculated from the sales orders.
3) Output
The result of the preceding processing step is collected.
The particular form of the output data depends on the use of the data.
For example, output data may be payroll for employees.
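The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures, the interest rate, and the function names summarize_sales and simple_interest are hypothetical, and simple interest is just one of several possible interest formulas.

```python
# Input: raw sales orders (hypothetical figures), as they might be read from a storage medium.
sales_orders = [
    {"month": "Jan", "amount": 120.0},
    {"month": "Jan", "amount": 80.0},
    {"month": "Feb", "amount": 200.0},
]


def summarize_sales(orders):
    """Processing: turn raw orders into a total per month."""
    totals = {}
    for order in orders:
        month = order["month"]
        totals[month] = totals.get(month, 0.0) + order["amount"]
    return totals


def simple_interest(deposit, annual_rate, years):
    """Processing: simple interest earned on a bank deposit."""
    return deposit * annual_rate * years


# Output: the results are collected in the form the recipient needs.
monthly_totals = summarize_sales(sales_orders)
interest = simple_interest(1000.0, 0.05, 2)
print(monthly_totals)
print(interest)
```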
This results in irregularities and ambiguities that make the data
difficult to understand using traditional programs, compared to
data stored in structured databases.
Examples:
Audio,
Video files or
NoSQL databases.
Metadata
Metadata is data about data.
It provides additional information about a specific set of data.
In a set of photographs, for example, metadata could describe
when and where the photos were taken.
Advantage: Metadata is useful for analyzing Big Data in Big Data solutions.
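The photograph example can be sketched as follows. The file names, dates, places, and the helper photos_taken_in are all hypothetical; the point is only that the metadata can be queried without touching the photos themselves.

```python
# Hypothetical photo records: "file" names the data itself,
# while "metadata" describes when and where each photo was taken.
photos = [
    {"file": "img_001.jpg", "metadata": {"taken_on": "2023-01-10", "place": "Addis Ababa"}},
    {"file": "img_002.jpg", "metadata": {"taken_on": "2023-02-03", "place": "Bahir Dar"}},
    {"file": "img_003.jpg", "metadata": {"taken_on": "2023-02-17", "place": "Addis Ababa"}},
]


def photos_taken_in(records, place):
    """Select photo files using only their metadata, without opening the images."""
    return [r["file"] for r in records if r["metadata"]["place"] == place]


addis_photos = photos_taken_in(photos, "Addis Ababa")
print(addis_photos)
```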
2.3) Data Value Chain
The Data Value Chain is introduced to describe the information flow within
a big data system as a series of steps needed to generate value and useful
insights from data.
The Big Data Value Chain identifies the following key high-level
activities:
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage
Data Acquisition: The process of gathering, filtering, and
cleaning data before it is put in a data warehouse or any other
storage solution on which data analysis can be carried out.
Data acquisition is one of the major big data challenges in terms
of infrastructure requirements.
The infrastructure that supports data acquisition must be able to:
Handle very high transaction volumes, often in a distributed environment
Support flexible and dynamic data structures
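At a very small scale, the gathering, filtering, and cleaning steps might look like the sketch below. The sensor readings and the function name acquire are hypothetical, and a real acquisition pipeline would run across many machines rather than in a single loop.

```python
# Gathering: raw readings as they might arrive from a source (hypothetical values).
raw_readings = [" 21.5", "19.0 ", "", "n/a", "23.25"]


def acquire(readings):
    """Filter out unusable entries and clean the rest before storage."""
    cleaned = []
    for value in readings:
        text = value.strip()             # cleaning: remove stray whitespace
        try:
            cleaned.append(float(text))  # cleaning: convert text to a number
        except ValueError:
            continue                     # filtering: skip entries that are not numbers
    return cleaned


ready_for_storage = acquire(raw_readings)
print(ready_for_storage)
```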
Data Analysis: Involves exploring, transforming, and modeling
data with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential from a
business point of view.
Related areas include:
Data mining
Business intelligence, and
Machine learning.
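A toy illustration of exploring data and highlighting the relevant part of it is given below. The customer names and spending figures are made up, and comparing against the average is only one simple way of picking out records of interest from a business point of view.

```python
from statistics import mean

# Hypothetical monthly spending per customer
monthly_spend = {"alice": 120.0, "bob": 45.0, "carol": 310.0, "dawit": 95.0}

# Exploring: compute a summary figure for the whole data set.
average = mean(monthly_spend.values())

# Highlighting relevant data: customers who spend more than the average.
high_value = sorted(name for name, spend in monthly_spend.items() if spend > average)

print(average)
print(high_value)
```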
Data Curation: It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for its effective usage.
Data curation processes can be categorized into different activities such as
Content creation
Selection
Classification
Transformation
Validation
Preservation/protection
Data curation is performed by expert curators that are
responsible for improving the accessibility and quality of
data.
Data curators (also known as scientific curators or data
annotators) hold the responsibility of ensuring that data are:
Trustworthy
Discoverable
Accessible
Reusable
Fit for their purpose
Data Storage: Is the persistence and management of data in a
scalable way that satisfies the needs of applications that require
fast access to the data.
A good example of data storage is the
Relational Database Management System (RDBMS), which has been
the main, and almost the only, solution to the storage paradigm
for nearly 40 years.
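To make the relational storage idea concrete, here is a minimal sketch using SQLite through Python's standard sqlite3 module. The table name students, its columns, and the rows are hypothetical, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# Open a throwaway in-memory relational database.
conn = sqlite3.connect(":memory:")

# Storage: define a table and persist a few rows (hypothetical data).
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO students (name, score) VALUES (?, ?)",
    [("Abebe", 88.5), ("Sara", 92.0)],
)
conn.commit()

# Access: an application retrieves only the rows it needs.
rows = conn.execute(
    "SELECT name, score FROM students WHERE score > ? ORDER BY name",
    (90,),
).fetchall()
print(rows)

conn.close()
```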
Data Usage: It covers the data-driven business
activities that need access to data, its analysis, and
the tools needed to integrate the data analysis within
the business activity.
Data usage in business decision-making can enhance
competitiveness through the reduction of costs,
increased added value, or any other parameter that
can be measured against existing performance
criteria.
2.4) Basic Concepts of Big Data
What is Big Data?
Big data is a blanket term for the non-traditional
strategies and technologies needed to gather,
organize, process, and gather insights from large
datasets.
Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications.
The four core components of Hadoop are:
✓ Data Management
✓ Data Access
✓ Data Processing
✓ Data Storage
The Hadoop ecosystem includes the following components:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
Spark: In-memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
ZooKeeper: Cluster management
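The programming model behind MapReduce can be illustrated with a word count written in plain Python. This is a single-process sketch of the map, shuffle, and reduce phases, not Hadoop's actual API; the function names and the input lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; in Hadoop the lines would be spread across HDFS blocks.
lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes, which is what allows the same logic to scale to very large data sets.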
NameNode:
YARN:
3. Big Data Life Cycle with Hadoop (stages)
End of Chapter 2