
1

Chapter-2

Introduction to Data Science
2
Unit objectives

 Differentiate data and information
 Describe the essence of data science and the role of the data scientist
 Describe the data processing life cycle
 Understand different data types from diverse perspectives
 Describe the data value chain in the emerging era of big data
 Understand the basics of big data
 Describe the purpose of the Hadoop ecosystem components
3
Overview of Data science

 Data science is a multi-disciplinary field that involves extracting insights from vast amounts of data using scientific methods, algorithms, and processes.
 It helps to extract knowledge and insights from structured, semi-structured, and unstructured data.
 More importantly, it enables you to translate a business problem into a research project and then translate it back into a practical solution.
4
Application of Data science

 Data science is much more than simply analyzing data; it plays a wide range of roles:
 Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinctive business advantage.
 It can help you detect fraud using advanced machine learning algorithms.
 It can also help you prevent significant monetary losses.
 It allows you to build intelligent capabilities into machines.
 You can perform sentiment analysis to gauge customer brand loyalty.
 It enables you to make better and faster decisions.
 It helps you recommend the right product to the right customer to enhance your business.
5
Data Vs. Information

 Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
 Those facts are suitable for communication, interpretation, or processing by humans or electronic machines.
6
Data Vs. Information cont’d

 Information is interpreted data; it is created from organized, structured, and processed data in a particular context, on which decisions and actions are based.
7
Data Processing Cycle

 Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
 The following are the basic steps of data processing (a minimal sketch follows the list):
 Input - the input data is prepared in some convenient form for processing.
 Processing - in this step, the input data is changed to produce data in a more useful form.
 For example, a summary of sales for the month can be calculated from the sales orders.
 Output - at this stage, the result of the preceding processing step is collected.
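A minimal sketch of the three steps in Python; the sales figures are hypothetical illustration data:

    # Input: raw sales orders, prepared as (product, amount) pairs
    sales_orders = [("pen", 5.0), ("book", 12.5), ("pen", 3.0), ("bag", 20.0)]

    # Processing: transform the input into a more useful form -
    # a summary of the month's sales per product
    summary = {}
    for product, amount in sales_orders:
        summary[product] = summary.get(product, 0.0) + amount

    # Output: collect and present the result of the processing step
    for product, total in sorted(summary.items()):
        print(f"{product}: {total:.2f}")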
8
Data types and their representations

 In computer science and computer programming, a data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
 A data type constrains the values that an expression, such as a variable or a function, might take.
 The data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
9
Data types from a computer programming perspective

 Almost all programming languages explicitly include the notion of data type, though with different terminology.
 Common data types include the following (a Python sketch follows the list):
 Integers (int) - used to store whole numbers, mathematically known as integers
 Booleans (bool) - used to represent values restricted to one of two options: true or false
 Characters (char) - used to store a single character
 Floating-point numbers (float) - used to store real numbers
 Alphanumeric strings (string) - used to store a combination of characters and numbers
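A minimal sketch of these types in Python (note that Python has no separate char type; a single character is just a string of length one):

    age = 25              # int: a whole number
    is_active = True      # bool: one of two values, True or False
    grade = "A"           # "char": a length-1 string stands in for a character
    price = 19.99         # float: a real (floating-point) number
    user_id = "abc123"    # string: a combination of characters and numbers

    for value in (age, is_active, grade, price, user_id):
        print(type(value).__name__, value)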
10
Data types from data analytics perspective:

 From a data analytics perspective, there are three common data types or structures: structured, semi-structured, and unstructured data.

[Figure: data types from a data analytics perspective]
11
Data types from data analytics perspective

 Structured data:-
 Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
 Structured data conforms to a tabular format with relationships between the different rows and columns.
 E.g. Excel files, SQL databases (see the sketch below)
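A minimal sketch of structured, tabular data using the pandas library (the table contents are hypothetical):

    import pandas as pd  # assumes pandas is installed

    # A small table: every row follows the same pre-defined model
    customers = pd.DataFrame({
        "id": [1, 2, 3],
        "name": ["Abebe", "Sara", "Lensa"],
        "age": [34, 28, 41],
    })

    # The tabular structure makes analysis straightforward
    print(customers[customers["age"] > 30])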
12
Data types from data analytics perspective

 Semi-structured data:-
 Semi-structured data is a form of structured data that does not conform to the formal structure of data models. However, such files contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
 E.g. XML, JSON (see the sketch below)
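A minimal sketch of semi-structured data in Python, using a hypothetical JSON record whose keys act as the markers that separate semantic elements:

    import json

    raw = '{"name": "Sara", "orders": [{"item": "pen", "qty": 2}], "note": "VIP"}'

    record = json.loads(raw)
    print(record["name"])            # access a top-level field
    for order in record["orders"]:   # walk the nested hierarchy of records
        print(order["item"], order["qty"])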
13
Data types from data analytics perspective cont’d

 Unstructured data:-
 Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.
 It is typically text-heavy, but it may also contain data such as dates, numbers, and facts.
 E.g. audio files, video files, or NoSQL databases
14
Data types from data analytics perspective cont’d

 Metadata:-
 Technically, metadata is not a separate data structure, but it is one of the most important elements for big data analysis and big data solutions.
 It provides additional information about a specific set of data; conveniently, it can be described as "data about data".
 E.g. the date and location of a photograph (see the sketch below)
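A minimal sketch of metadata in Python; the photograph and its details are hypothetical:

    # The photo's pixel data is the data itself; this dictionary is
    # metadata - data about that data
    photo_metadata = {
        "filename": "IMG_0042.jpg",
        "date_taken": "2023-05-14",
        "location": (9.0054, 38.7636),  # latitude, longitude
        "camera_model": "Pixel 6",
    }

    print(photo_metadata["date_taken"], photo_metadata["location"])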
15
Data value Chain

 The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
 It identifies the following key high-level activities:

[Figure: the data value chain - data acquisition, data analysis, data curation, data storage, and data usage]
16
Data Acquisition

 Data acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse.
 The infrastructure required for big data acquisition must deliver low, predictable latency both in capturing data and in executing queries.
 The infrastructure must handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
 Data acquisition is a major challenge in big data because of its high-end infrastructure requirements.
17
Data Analysis

 Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information.
 It also deals with making the acquired raw data amenable to use in the decision-making process.
18
Data Curation

 Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements.
 The curation process can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
 Data curation is performed by expert curators or annotators who are responsible for improving the accessibility and quality of data.
19
Data Storage

 Data storage is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
 Relational database systems have been used as the main storage paradigm for over 40 years.
 Given the volume and complexity of recent data, highly scalable NoSQL technologies are now applied as big data storage models.
20
Data Usage

 Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
 It enhances competitiveness in business decision making through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
21
Basic concepts of big data

 What is big data?
 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
 The common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
22
Basic concepts of big data

 Walmart handles more than 1 million customer transactions every hour.
 Facebook handles 40 billion photos from its user base.
 Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
23
Basic concepts of big data

 Large dataset” means a dataset is too large to reasonably


process or store with traditional tooling or on a single
computer.
 Big data is characterized by 5V’s actually beyond this:-
 Volume: refers to the amount of data that is being collected.
 Velocity: refers to the rate at which data is coming in.
 Variety: refers to the different kinds of data
 Value refers to the usefulness of the collected data.
 Veracity: refers to the quality of data that is coming in from
different sources.
24
Big data cont’d

 The following figure depicts the 5 V's of big data:

[Figure: the 5 V's of big data]
25
Clustered Computing and Hadoop Ecosystem

 Clustered Computing
 Because of the quantities of big data, individual computers are often inadequate for handling the data at most stages.
 To better address the high storage and computational needs of big data, computer clusters are a better fit.
 Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits such as:
 Resource Pooling: combining storage space and CPU power to process large datasets.
 High Availability: clusters can provide varying levels of fault tolerance and availability.
 Easy Scalability: clusters make it easy to scale horizontally by adding additional machines to the group.
26
Clustered Computing cont’d

 Employing clustered resources requires managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes or computers.
 Cluster membership and resource allocation are handled by Apache open-source framework software such as Hadoop's YARN (which stands for Yet Another Resource Negotiator).
 The assembled cluster machines act seamlessly and help other software interfaces to process the data.
27
Hadoop and its Ecosystem

 What is Hadoop?
 Hadoop is an open-source framework, based on the Java programming language, that allows for the distributed processing and storage of large data sets across clusters of computers.
 It hides underlying system details and complexities from the user.
 Developed in Java.
 Flexible, enterprise-class support for processing large volumes of data.
 Inspired by Google technologies (MapReduce, GFS, BigTable, …).
28
Hadoop and its Ecosystem

 What is Hadoop?
 Hadoop enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner.
 CPU + disks = "node"
 Nodes can be combined into clusters.
 New nodes can be added as needed without changing:
 Data formats
 How data is loaded
 How jobs are written
29
Hadoop and its Ecosystem Cont’d

 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
 The following components collectively form the Hadoop ecosystem (a word-count sketch of the MapReduce model follows this list):
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: programming-based data processing
 Spark: in-memory data processing
 Pig, Hive: query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: machine learning algorithm libraries
 Solr, Lucene: searching and indexing
 Zookeeper: managing the cluster
 Oozie: job scheduling
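To make the MapReduce model concrete, here is a minimal word-count sketch written as Hadoop Streaming-style scripts in Python (Hadoop Streaming runs such scripts over stdin/stdout; the script names are illustrative):

    # mapper.py - emits "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - Hadoop delivers mapper output sorted by key, so all
    # counts for one word arrive consecutively and can be summed
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")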
30
Hadoop and its Ecosystem Cont’d

 The following figure depicts the Hadoop ecosystem:

[Figure: the Hadoop ecosystem]
32
Life cycle of big data with Hadoop

 Ingesting data into the system
 First, the data is ingested into (transferred to) Hadoop from various sources such as relational databases, systems, or local files.
 Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event data; a hedged example follows.
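For illustration only (the connection string, table name, and target directory below are hypothetical), a typical Sqoop import from a relational database into HDFS looks like:

    sqoop import \
      --connect jdbc:mysql://dbserver/salesdb \
      --table orders \
      --target-dir /user/demeke/orders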
 Processing the data in storage
 The second stage is processing. In this stage, the data is stored and processed.
 The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
33
Life cycle of big data with Hadoop Cont’d

 Computing and analyzing data
 The third stage is analyzing and processing the data using open-source frameworks such as Pig, Hive, and Impala.
 Pig converts the data using map and reduce operations and then analyzes it.
 Hive is also based on map and reduce programming and is most suitable for structured data; a hedged query example follows.
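For illustration only (the table and column names are hypothetical), a Hive query over structured data reads like ordinary SQL, while Hive compiles it into map and reduce jobs behind the scenes:

    SELECT product, SUM(amount) AS total_sales
    FROM sales
    GROUP BY product;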
 Visualizing the results
 The fourth stage is access, which is performed by tools such as Hue and Cloudera Search.
 In this stage, the analyzed data can be accessed by users.
