Chapter 2 - Intro. to Data Science
Outline: overview of data science, the data processing cycle, data types, the data value chain, basics of big data, the Hadoop ecosystem, and review questions.
Learning outcomes
After successfully completing this chapter, students will be able to:
• Differentiate between data and information
• Explain the data processing life cycle
• Differentiate data types from diverse perspectives
• Explain the data value chain
• Explain the basics of big data
• Analyze Hadoop ecosystem components and their use in big data
2.1 Overview of Data Science
What is Data Science?
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
Data science is much more than simply analyzing data.
Cont. . . Who are Data Scientists?
Data scientists are analytical experts who use their skills in both technology and social science to find trends and manage data.
They apply industry knowledge, contextual understanding, and skepticism of existing assumptions to uncover solutions to business challenges.
2.3 Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
It is the set of operations used to transform data into useful information, and it consists of three basic steps:
• Input: the data is prepared in some convenient form for processing, e.g., fed into an electronic computer.
• Processing: the input data is changed to produce data in a more useful form, e.g., calculating a CGPA.
• Output: the result of the processing step; the form of the output data depends on the use of the data, and the produced information may need to be stored for future use (a toy sketch follows).
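As a small illustration of these three steps, the sketch below computes a CGPA; all course names, grade points, and credit hours are hypothetical values chosen for the example.

```python
# A toy sketch of the data processing cycle using the CGPA example above.

# Input: data prepared in a convenient form for processing.
courses = [("Math", 4.0, 3), ("Physics", 3.0, 4), ("English", 3.5, 2)]

# Processing: the input is transformed into a more useful form.
total_points = sum(grade * credits for _, grade, credits in courses)
total_credits = sum(credits for _, _, credits in courses)
cgpa = total_points / total_credits

# Output: the result of processing, displayed (or stored for future use).
print(f"CGPA: {cgpa:.2f}")
```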
2.4 Data types and their representation
A data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored (see the sketch below).
Data types can be described from diverse perspectives; from an analysis perspective, data is commonly classified as structured, semi-structured, or unstructured.
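The small sketch below illustrates the programming-language sense of the definition: the type of a value determines which operations apply and what they mean.

```python
# A minimal sketch of how a data type fixes the operations on a value
# and the meaning of those operations.
a, b = 7, 3
print(a + b)             # int: '+' means arithmetic addition -> 10
x, y = "7", "3"
print(x + y)             # str: '+' means concatenation -> '73'
print(type(a), type(x))  # the type also determines how values are stored
```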
Semi-structured Data
It is a form of structured data that does not conform to the formal structure of data models such as relational tables.
It contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields within the data.
For example: JSON and XML (see the sketch below).
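The sketch below shows what "semi-structured" means in practice: the JSON carries its own markers (keys) that separate and nest semantic elements, yet the two records do not share a fixed schema. All names and values are hypothetical.

```python
# Semi-structured data: self-describing tags, but no rigid schema.
import json

raw = """
[
  {"name": "Abebe", "age": 21, "courses": ["Math", "Physics"]},
  {"name": "Sara", "email": "sara@example.com"}
]
"""
for record in json.loads(raw):
    # Fields vary per record, so access must tolerate missing keys.
    print(record.get("name"), "-", record.get("email", "no email on file"))
```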
Unstructured Data
It is information that either does not have a predefined data model or is not organized in a pre-defined manner.
It is typically text-heavy, but it may also contain data such as dates, numbers, and facts.
Examples: audio and video files, free-form text (a small sketch follows).
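As a minimal illustration, the sketch below pulls dates and numbers out of free-form text that follows no predefined model; the text and patterns are illustrative only, not a robust extractor.

```python
# Unstructured text still contains extractable facts (dates, numbers).
import re

note = "Meeting moved to 2024-05-12; the budget is 15000 birr."
print(re.findall(r"\d{4}-\d{2}-\d{2}", note))  # dates   -> ['2024-05-12']
print(re.findall(r"\d+", note))                # numbers -> ['2024', '05', '12', '15000']
```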
Metadata
Metadata is not a separate data structure, but it is one of the most important elements for big data analysis and big data solutions.
It is data about data: it provides additional information about a specific set of data, such as a file's size, author, or creation date (see the sketch below).
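The sketch below shows one everyday form of metadata: the operating system keeps information (size, modification time) that describes a file without being part of the file's contents. The file path is hypothetical.

```python
# Metadata: data that describes other data.
import os
import time

path = "report.csv"  # hypothetical file; replace with a real path to run
info = os.stat(path)
print("Size (bytes):", info.st_size)
print("Last modified:", time.ctime(info.st_mtime))
```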
2.5 Data value chain
The data value chain was introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
The big data value chain identifies the following key high-level activities (a toy sketch follows the list):
• Data Acquisition: gathering, filtering, and cleaning data before it is stored and analyzed.
• Data Analysis: exploring, transforming, and modelling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information with high potential from a business point of view.
• Data Curation: the active management of data over its life cycle, ensuring data trustworthiness, accessibility, and reusability.
• Data Storage: persisting and managing data in a scalable way that ensures the needs of applications for fast access to data.
• Data Usage: the data-driven business activities that need access to data and its analysis.
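As a toy, single-machine illustration of these stages, the sketch below runs a tiny acquire-curate-analyze-store pipeline; the data, function names, and in-memory "database" are all hypothetical stand-ins, not a real big data system.

```python
# A toy sketch of the data value chain stages on a single machine.

def acquire():
    # Data Acquisition: gather raw records (here, hard-coded strings).
    return ["  25 ", "30", "", "28", "abc"]

def curate(raw):
    # Data Curation / cleaning: keep only records that are valid numbers.
    return [int(r) for r in (s.strip() for s in raw) if r.isdigit()]

def analyze(values):
    # Data Analysis: extract a useful summary from the cleaned data.
    return {"count": len(values), "mean": sum(values) / len(values)}

def store(result, db):
    # Data Storage: persist the result (here, an in-memory dict).
    db["age_summary"] = result

db = {}
store(analyze(curate(acquire())), db)
print(db)  # Data Usage: the stored insight supports a business decision.
```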
Cont. . . Use cases of Data Science (figure)
Cont. . . Application domains of Data Science (figure)
2.6 Basic concepts of Big data
Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gain insights from large datasets.
Big data is commonly characterized by the following "V" dimensions:
• Volume: the amount of data from myriad sources; large amounts of data, measured in zettabytes (massive datasets).
• Velocity: the speed at which data are generated; data is live, streaming, or in motion.
• Variety: the types of data; data comes in many different forms from diverse sources.
• Veracity: data trustworthiness, the degree to which big data can be trusted; how accurate is the data?
• Variability: the ways in which the big data can be used and formatted; to whom are the data accessible?
• Value: the business value of the data collected; the uses and purpose of the data.
2.7 Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction with big data easier.
It was inspired by technical papers published by Google (on the Google File System and MapReduce).
It allows for the distributed processing of large datasets across clusters of computers using simple programming models (see the MapReduce sketch below).
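The "simple programming model" Hadoop popularized is MapReduce. The pure-Python sketch below imitates its map and reduce phases with a word count on a single machine; it is illustrative only and does not use Hadoop's actual APIs, where these phases would run distributed across a cluster.

```python
# A pure-Python sketch of the MapReduce programming model: word count.
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each key (word).
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["Big data needs big tools", "Hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 3, 'data': 2, 'needs': 1, ...}
```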
The four key characteristics of Hadoop:
• Economical: its systems are highly economical, as ordinary computers can be used for data processing.
• Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: it is easily scalable, both horizontally and vertically; a few extra nodes help in scaling up the framework.
• Flexible: you can store as much structured and unstructured data as you need and decide to use it later.
Cont. . . The four core components of Hadoop and its Ecosystem
The four core components of the Hadoop ecosystem are:
• Data Management
• Data Access
• Data Processing
• Data Storage
For example, HDFS and HBase provide data storage; MapReduce and Spark handle data processing; Hive and Pig provide data access; and Oozie and ZooKeeper support data management (component-to-layer groupings vary across references).
Quiz (5%)
1. What is the difference between data and information?
2. Mention the types of data from an analysis perspective and give an example of each.
3. List the steps of the data processing cycle.
4. Write the characteristics of big data.
Assignment Questions
1. Discuss the difference between big data and data science.
2. Briefly discuss the big data life cycle.
3. List and explain big data application domains, with examples.
4. What is clustered computing? Explain its advantages.
Thank you!