Chapter 2 - Introduction to Data Science

The document provides an overview of key concepts in data science including data types, the data processing cycle, the data value chain, and big data fundamentals. It discusses how data science uses scientific methods to extract knowledge from structured and unstructured data. The document also describes the Hadoop ecosystem and its core components that allow for distributed processing of large datasets across computer clusters.


CHAPTER TWO

Introduction to Data Science


Topics Covered
 Learning outcomes
 An Overview of Data Science
 Data and information
 Data types and representation
 Data Processing Cycle
 Data Value Chain (Acquisition, Analysis, Curation, Storage, Usage)
 Basic concepts of Big data
 Hadoop ecosystem
 Review questions
Learning outcomes
After successfully completing this chapter, students will be able to:
 Differentiate data and information
 Explain the data processing life cycle
 Differentiate data types from diverse perspectives
 Explain the data value chain
 Explain the basics of big data
 Analyze Hadoop ecosystem components and their use in big data
2.1 Overview of Data Science
 What is Data Science?
 A multi-disciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured, semi-structured,
and unstructured data.
 Data science is much more than simply analyzing data.
Cont. . . Data Science Experts/Scientists
 Data scientists are analytical experts who utilize their skills in both technology and social
science to find trends and manage data.
 They use industry knowledge, contextual understanding, and skepticism of existing
assumptions to uncover solutions to business challenges.
 They need a strong quantitative background in statistics and linear algebra, as well as
programming knowledge.
 They must master the full spectrum of the data science life cycle and possess a level of
flexibility and understanding to maximize returns.
2.2 Data and Information
 Data?
A representation of facts, concepts, or instructions in a formalized manner, suitable for
communication, interpretation, or processing by humans or electronic machines.
 Described as unprocessed facts and figures.
 Represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9) or
special characters (+, -, /, *, <,>, =, etc.).
 Information?
 Organized or classified data, which has some meaningful value for the receiver.
 Processed data on which decisions and actions are based.
 Principles of information - processed data must satisfy the following:
 Timely − Information should be available when required.
 Accuracy − Information should be accurate.
 Completeness − Information should be complete.


Data vs. Information [figure comparing data and information; source: internet]
2.3 Data Processing Cycle
The restructuring or reordering of data by people or machines to increase its usefulness and
add value for a particular purpose.
The set of operations used to transform data into useful information.

Data Processing Cycle
 Input - data is prepared in some convenient form for processing (e.g., for electronic
computers).
 Processing - the input data is changed to produce data in a more useful form
(e.g., calculating a CGPA).
 Output - the result of the processing step; the form of the output data depends on the
use of the data. Produced information may need to be stored for future use.
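To make the cycle concrete, here is a minimal Python sketch of one pass through input, processing, and output using the CGPA example; the course names, credit hours, and grade points are invented purely for illustration.

# A minimal sketch of the data processing cycle (input -> processing -> output).
# The courses, credit hours, and grade points below are hypothetical example values.

# Input: raw data prepared in a convenient form for processing.
grades = [
    {"course": "Math", "credit_hours": 3, "grade_point": 4.0},
    {"course": "Physics", "credit_hours": 4, "grade_point": 3.0},
    {"course": "English", "credit_hours": 2, "grade_point": 3.5},
]

# Processing: transform the raw data into a more useful form (a CGPA).
total_points = sum(g["credit_hours"] * g["grade_point"] for g in grades)
total_hours = sum(g["credit_hours"] for g in grades)
cgpa = total_points / total_hours

# Output: information on which decisions can be based; it may also be stored for later use.
print(f"CGPA: {cgpa:.2f}")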
2.4 Data types and their representation
Data type defines the operations that can be done on the data, the meaning of the data,
and the way values of that type can be stored.
Data types can be described from two perspectives:
(a) Computer programming perspective (see the short example after this list)
 Integers (int) --- whole numbers
 Booleans (bool) --- restricted to one of two values: true or false
 Characters (char) --- store a single character (symbol)
 Floating-point numbers (float) --- store real numbers
 Alphanumeric strings (string) --- a group of characters
(b) Data analytics perspective
 Structured data
 Semi-structured data
 Unstructured data
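As a quick illustration of the programming-perspective types, the Python sketch below declares one value of each kind; note that Python has no separate char type, so a one-character string stands in for it.

# Illustrative values for the programming-perspective data types.
count = 42             # integer (int): a whole number
passed = True          # boolean (bool): one of two values, True or False
initial = "Y"          # "character": Python represents this as a one-character string
average = 3.75         # floating-point number (float): a real number
name = "Data Science"  # string: a group of characters

print(type(count), type(passed), type(initial), type(average), type(name))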
Cont. . .
 Structured Data:
 Conforms to a tabular format, with relationships between the different rows and columns.
 Examples: Excel files and relational database tables, each with structured rows and columns
that can be sorted (see the short sketch below).
[Figure: example of structured, tabular data; source: internet]
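A brief sketch of working with structured (tabular) data in Python, assuming the pandas library is available and a hypothetical students.csv file with name, department, and cgpa columns.

# Structured data: rows and columns that can be filtered and sorted.
# "students.csv" is a hypothetical file with columns: name, department, cgpa.
import pandas as pd

df = pd.read_csv("students.csv")
top_students = df.sort_values("cgpa", ascending=False).head(5)
print(top_students)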
Semi-structured Data
 A form of structured data that does not conform to the formal structure of a data model.
 Contains tags or other markers that separate semantic elements and enforce hierarchies of
records and fields within the data.
 For example: JSON and XML (see the small JSON sketch below).
[Figure: example of semi-structured data; source: internet]
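For instance, the small JSON document below (an invented student record) carries its own tags and nesting; Python's built-in json module can parse it without a fixed table schema.

# Semi-structured data: a JSON document with self-describing tags and nesting.
import json

record = """
{
  "student": {
    "name": "Abebe",
    "courses": [
      {"title": "Intro to Data Science", "grade": "A"},
      {"title": "Linear Algebra", "grade": "B+"}
    ]
  }
}
"""

data = json.loads(record)
for course in data["student"]["courses"]:
    print(course["title"], "->", course["grade"])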
Unstructured Data
 Information that either does not have a predefined data model or is not organized in a
pre-defined manner.
 Typically text-heavy, but it may also contain data such as dates, numbers, and facts.
 Examples: audio and video files.
[Figure: examples of unstructured data; source: internet]
Metadata
 It is not a separate data structure, but it is one of the most important elements for Big Data
analysis and big data solutions
 Data about data that provides additional information about a specific set of data.
2.5 Data value chain
Introduced to describe the information flow within a big data system as a series of steps
needed to generate value and useful insights from data.
The big data value chain identifies the following key high-level activities:
 Data Acquisition
 Data Analysis - exploring, transforming, and modelling data with the goal of highlighting
relevant data; synthesizing and extracting useful hidden information with high potential
from a business point of view.
 Data Curation - content creation, selection, classification, transformation, validation,
and preservation; ensuring data trustworthiness, accessibility, and reusability.
 Data Storage - providing fast access to the data (RDBMS & NoSQL).
 Data Usage - ensuring the needs of ...
Cont. . . Use cases of Data Science [figure]
Cont. . . Application domains of Data Science [figure]
2.6 Basic concepts of Big data
 Big data is a blanket term for the non-traditional strategies and technologies needed
to gather, organize, process, and gain insights from large datasets.
 Big data refers to:
 Datasets so large and complex that it becomes difficult to process them using
on-hand database management tools or traditional data processing applications.
 The category of computing strategies and technologies that are used to handle
large datasets.
 Goal of Big data:
 To surface insights and connections from large volumes of heterogeneous data that would
not be possible using conventional methods.
Cont. . . Characteristics of Big data
 Volume - the amount of data from myriad sources; large amounts of data, up to zettabytes
(massive datasets).
 Velocity - the speed at which data are generated; data is live, streaming, or in motion.
 Variety - the types of data; data comes in many different forms from diverse sources.
 Veracity - data trustworthiness (the degree to which big data can be trusted); data accuracy
(how accurate is it?).
 The way in which the big data can be used and formatted; to whom the data are accessible.
 Value - the business value of the data collected; the uses and purpose of the data.
2.7 Hadoop and its Ecosystem
 Hadoop is an open-source framework intended to make interaction with big data easier.
 It is inspired by a technical document published by Google.
 It allows for the distributed processing of large datasets across clusters of computers
using simple programming models.
 The four key characteristics of Hadoop
 Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
 Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically; a few
extra nodes help in scaling up the framework.
 Flexible: It is flexible; you can store as much structured and
unstructured data as you need and decide how to use it later.
Cont. . . The 4 core components of Hadoop
and its Ecosystem
 The 4 core components of Hadoop include:
 Data Management
 Data Access
 Data Processing
 Data Storage
[Figure: The Hadoop Ecosystem]
Cont. . . The 4 core components of Hadoop
and its Ecosystem
 The Hadoop ecosystem comprises the following components:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query-based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
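To give a feel for the MapReduce component, below is a minimal word-count sketch written as a Hadoop Streaming mapper and reducer in Python; the file names mapper.py and reducer.py and the use of Hadoop Streaming are assumptions for illustration, not something prescribed by these slides.

# mapper.py - emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts per word (Hadoop delivers mapper output sorted by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Scripts like these are typically submitted through the Hadoop Streaming jar with its -mapper, -reducer, -input, and -output options; the exact jar location and paths depend on the cluster.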
Cont. . . The Big data life cycle with Hadoop

 Stage 1 - Ingesting data into the system
 The data is ingested or transferred to Hadoop from various sources such as relational
databases, other systems, or local files.
 Stage 2 - Processing the data in storage (stored and processed)
 The data is stored in the distributed file system HDFS and in the NoSQL distributed
database HBase. Spark and MapReduce perform the data processing.
 Stage 3 - Computing and analyzing data
 Done by processing frameworks such as Pig, Hive, and Impala.
 Stage 4 - Visualizing the results (access)
 Done by tools such as Hue and Cloudera Search.
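As a rough illustration of stages 2 and 3, the PySpark sketch below reads a text file from HDFS and computes word counts with Spark; the HDFS path hdfs:///user/demo/input.txt and the application name are hypothetical, and a running Spark-on-Hadoop cluster is assumed.

# Stages 2-3 sketch: data stored in HDFS, processed and analyzed with Spark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("big-data-lifecycle-demo").getOrCreate()

# Read text data that was ingested into HDFS (hypothetical path).
lines = spark.read.text("hdfs:///user/demo/input.txt")

# Analysis step: count word occurrences across the dataset.
counts = (
    lines.select(explode(split(col("value"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
    .orderBy(col("count").desc())
)

counts.show(10)  # the results could then be explored or visualized, e.g. with Hue
spark.stop()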
End of Chapter 2
(Data Science)

Quiz (5%)
1. What is the difference between data and information?
2. Mention the types of data from the analytics perspective and give an example of each.
3. List the steps of the data processing cycle.
4. Write the characteristics of big data.
Assignment Questions
1. Discuss the difference between big data and data science.
2. Briefly discuss the big data life cycle.
3. List and explain big data application domains with examples.
4. What is clustered computing? Explain its advantages.
Thank you!
