Chapter 2 - Data Science
Unit objectives
Data science is much more than simply analyzing data; it plays a wide range of roles, including the following:
Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinctive business advantage.
It can help you detect fraud using advanced machine learning algorithms.
It can also help you prevent significant monetary losses.
It allows you to build intelligent capabilities into machines.
You can perform sentiment analysis to gauge customer brand loyalty.
It enables you to make better and faster decisions.
It helps you recommend the right product to the right customer to enhance your business.
Data vs. Information
Data can be described as unprocessed facts and figures, represented with the help of characters such as alphabets, digits, or special symbols.
Information is processed, organized data presented in a given context, on which decisions and actions are based.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data to increase its usefulness. It consists of the following basic steps:
Input - the input data is prepared in some convenient form for processing.
Processing - in this step, the input data is changed to produce data in a more useful form. For example, a summary of sales for the month can be calculated from the sales orders.
Output - at this stage, the result of the preceding processing step is collected.
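As a minimal sketch of these three steps (the sales figures and field names below are invented for illustration), the following Python snippet summarizes monthly sales from individual orders:

```python
# Input: raw sales orders, prepared in a convenient form (a list of records).
orders = [
    {"order_id": 1, "month": "2024-01", "amount": 120.0},
    {"order_id": 2, "month": "2024-01", "amount": 80.0},
    {"order_id": 3, "month": "2024-02", "amount": 50.0},
]

# Processing: transform the input into a more useful form (totals per month).
summary = {}
for order in orders:
    summary[order["month"]] = summary.get(order["month"], 0.0) + order["amount"]

# Output: collect and present the result of the processing step.
for month, total in sorted(summary.items()):
    print(f"{month}: {total}")
```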
Data types and their representations
A data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
Data types from a computer programming perspective
Almost all programming languages explicitly include the notion of data type; common data types include integers, booleans, characters, floating-point numbers, and alphanumeric strings.
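As a small, hedged illustration (the specific values shown are invented examples, not taken from the text), the snippet below shows a few common primitive types and how a value's type determines which operations are valid and how it is stored:

```python
# Common primitive data types from a programming perspective.
age = 25            # integer: supports arithmetic, stored as a whole number
price = 19.99       # floating point: supports arithmetic with fractions
is_active = True    # boolean: supports logical operations (and, or, not)
name = "Alemu"      # string: supports concatenation, slicing, searching

# The type defines which operations are meaningful:
print(age + 5)        # arithmetic on an integer
print(name.upper())   # a string operation; age.upper() would raise an error
print(type(price))    # <class 'float'>
```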
From a data analytics perspective, there are three common types of data (or data structures): structured, semi-structured, and unstructured data.
Structured data:-
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze, e.g. tables in Excel files or SQL databases.
Unstructured data:-
Unstructured data is information that either does not have a predefined
data model or is not organized in a pre-defined manner.
It is typically text-heavy, but may also contain data such as dates, numbers, and facts.
E.g. audio files, video files, or NoSQL databases.
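To make the three categories concrete, here is a small sketch (the sample records are invented for illustration) showing the same customer feedback as a structured table row, a semi-structured JSON document, and unstructured free text:

```python
import json

# Structured: a fixed schema, e.g. a row in a relational table.
structured_row = ("C001", "2024-03-15", 4)   # (customer_id, date, rating)

# Semi-structured: self-describing keys/tags, but no rigid schema (e.g. JSON).
semi_structured = json.loads('{"customer": "C001", "rating": 4, "tags": ["fast", "friendly"]}')

# Unstructured: no predefined data model; typically text-heavy,
# though it may contain dates, numbers, and facts.
unstructured = "The delivery on 2024-03-15 was fast and the staff were friendly."

print(structured_row[2], semi_structured["rating"], len(unstructured.split()))
```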
Data types from a data analytics perspective cont'd
Metadata:-
Metadata is data about data. It provides additional information about a specific set of data, e.g. the author, date created, and file size of a document.
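A tiny sketch of metadata in practice, assuming a local file path (sales_orders.csv is a hypothetical example) exists on your machine; the file's size and modification time describe the data without being the data itself:

```python
import os
import time

# Metadata describes the data (the file's contents) rather than being the data itself.
path = "sales_orders.csv"          # assumed example path
info = os.stat(path)
print("size in bytes:", info.st_size)
print("last modified:", time.ctime(info.st_mtime))
```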
Data Value Chain
The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
It identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Data Acquisition
The infrastructure required for big data acquisition must deliver low, predictable latency both in capturing data and in executing queries.
It must handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
It also deals with making the acquired raw data amenable to use in the decision-making process.
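As a loose, single-machine illustration only (no particular ingestion tool is implied), the sketch below buffers incoming records and flushes them in batches, one simple way to keep capture latency low and predictable under a high volume of events:

```python
from collections import deque

BATCH_SIZE = 1000
buffer = deque()

def acquire(event, sink):
    """Capture one raw event; flush to the sink in batches to sustain high volumes."""
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        sink.extend(buffer)   # in a real system this would be a distributed store
        buffer.clear()

# Usage: simulate a stream of raw events.
store = []
for i in range(2500):
    acquire({"event_id": i, "payload": "raw data"}, store)
print(len(store))   # 2000 - the remaining 500 events are still buffered
```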
Data Curation
Data curation is the active management of data over its life cycle to ensure it meets the data quality requirements for its effective usage.

What is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Clustered Computing
Because of the quantities of big data, individual computers are often
inadequate for handling the data at most stages.
To better address the high storage and computational needs of big data,
computer clusters are a better fit.
Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits such as:
Resource Pooling: combining the available storage space and CPU of many machines to process large datasets.
High Availability: Clusters can provide varying levels of fault
tolerance and availability
Easy Scalability: Clusters make it easy to scale horizontally by adding
additional machines to the group.
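The snippet below is only a single-machine analogy of resource pooling: it uses Python's multiprocessing to spread work over several CPU cores, whereas real big data clustering software applies the same idea across many machines:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Work done by one 'worker' - here a process, on a cluster a separate machine."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
    with Pool(processes=4) as pool:     # pooled compute resources
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```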
Clustered Computing cont’d
What is Hadoop?
Hadoop is basically an open-source framework, based on the Java programming language, that allows for the distributed processing and storage of large data sets across clusters of computers.
It hides underlying system details and complexities from the user.
Developed in Java
Flexible, enterprise-class support for processing large volumes of
data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Hadoop and its Ecosystem
What is Hadoop?
Hadoop enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner (a toy sketch of this map-and-reduce style follows the list below).
CPU + disks = “node”
Nodes can be combined into clusters
New nodes can be added as needed without changing:
Data formats
How data is loaded
How jobs are written
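To give a feel for this highly parallel programming model, here is a toy, single-process simulation of the map, shuffle, and reduce phases of a word count; it illustrates the idea only and does not use Hadoop's actual API:

```python
from collections import defaultdict

lines = ["hadoop stores data", "spark processes data", "data is the new oil"]

# Map phase: each line is turned into (word, 1) pairs; on a cluster,
# many nodes would run this step in parallel on different blocks of input.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key (word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'hadoop': 1, 'stores': 1, 'data': 3, ...}
```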
Hadoop and its Ecosystem Cont’d
Hadoop has an ecosystem that has evolved from its four core components: data management,
access, processing, and storage.
The following components collectively form the Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
ZooKeeper: Managing the cluster
Oozie: Job Scheduling
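As one hedged example of using an ecosystem component, the PySpark sketch below (assuming the pyspark package is installed and run in local mode) computes the earlier monthly sales summary as an in-memory, distributed computation:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster the master would point to YARN.
spark = SparkSession.builder.appName("SalesSummary").master("local[*]").getOrCreate()

orders = spark.createDataFrame(
    [("2024-01", 120.0), ("2024-01", 80.0), ("2024-02", 50.0)],
    ["month", "amount"],
)

# Spark keeps intermediate data in memory and distributes the work across executors.
orders.groupBy("month").sum("amount").show()

spark.stop()
```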