Chapter 2 - Introduction to Data Science

This chapter introduces data science as a multi-disciplinary field focused on extracting insights from various data types. It covers essential concepts such as the data value chain, data types, and the characteristics of big data, emphasizing the importance of data acquisition, analysis, curation, storage, and usage. Additionally, it discusses the challenges and technologies associated with big data, including the Hadoop ecosystem and the significance of the three Vs: volume, velocity, and variety.


Chapter Two
2. Introduction to Data Science

Topics Covered

• Overview of Data Science
• Definition of data and information
• Data types and representation
• Data Value Chain
• Data Acquisition
• Data Analysis
• Data Curation
• Data Storage
• Data Usage
• Basic concepts of big data
2.1 Overview of Data Science
Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms and systems to extract
knowledge and insights from structured, semi-structured, and unstructured data.
2.1 Overview of Data Science

• Data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.
• Today, successful data professionals understand that they must advance beyond the traditional skills of analyzing large amounts of data, data mining, and programming.
What is expected of a data scientist?

• To uncover useful intelligence for their organizations, data scientists must master the full spectrum of the data science life cycle and possess a level of flexibility and understanding to maximize returns at each phase of the process.
• Data scientists need to be curious and results-oriented, with exceptional industry-specific knowledge and communication skills that allow them to explain highly technical results to their non-technical counterparts.
• Data scientists need a strong quantitative background in statistics and linear algebra, as well as programming knowledge with a focus on data warehousing, mining, and modeling to build and analyze algorithms.
• This chapter covers basic definitions of data and information, data types and representation, the data value chain, and basic concepts of big data.
Data Science Life cycle
2.1.1 Definition of data and information

What is data?

Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
Data is represented with the help of characters such as
alphabets (A-Z, a-z), digits (0-9) or special characters
(+,-,/,*,<,>,= etc.)
What is Information?
Information is organized or classified data that has some meaningful value for the receiver. Information is the processed data on which decisions and actions are based.
Information is data that has been processed into a form that is meaningful to the recipient and has real or perceived value for the recipient's current or prospective actions or decisions.
For the decision to be meaningful, the processed data must satisfy the following characteristics:
Timely − Information should be available when required.
Accuracy − Information should be accurate.
Completeness − Information should be complete.
Summary: Data vs. Information

• Data: described as unprocessed or raw facts and figures.
  Information: described as processed data.
• Data: cannot help in decision making.
  Information: can help in decision making.
• Data: raw material that can be organized, structured, and interpreted to create useful information.
  Information: interpreted data, created from organized, structured, and processed data in a particular context.
• Data: 'groups of non-random' symbols in the form of text, images, and voice representing quantities, actions, and objects.
  Information: processed data in the form of text, images, and voice representing quantities, actions, and objects.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Data processing consists of three basic steps - input, processing, and output - which together constitute the data processing cycle (a small sketch follows this list).
• Input step − the input data is prepared in some convenient form for processing. The form depends on the processing machine. For example, when electronic computers are used, input medium options include magnetic disks, tapes, and so on.
• Processing step − the input data is changed to produce data in a more useful form. For example, pay-checks can be calculated from the time cards, or a summary of sales for the month can be calculated from the sales orders.
• Output step − the result of the preceding processing step is collected. The particular form of the output data depends on the use of the data. For example, output data may be pay-checks for employees.
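As a minimal sketch of this cycle (the time-card records, employee names, and hourly rates below are invented example values), the three steps can be expressed in Python:

```python
# Minimal illustration of the data processing cycle (input -> processing -> output).
# The time-card records and hourly rates below are made-up example values.

# Input step: raw time-card data prepared in a convenient form (a list of dicts).
time_cards = [
    {"employee": "Abebe", "hours_worked": 40, "hourly_rate": 12.50},
    {"employee": "Sara",  "hours_worked": 35, "hourly_rate": 15.00},
]

# Processing step: transform the input into more useful data (gross pay).
pay_checks = [
    {"employee": card["employee"], "gross_pay": card["hours_worked"] * card["hourly_rate"]}
    for card in time_cards
]

# Output step: collect the result in a form suited to its use (a printed pay-check line).
for check in pay_checks:
    print(f"{check['employee']}: {check['gross_pay']:.2f}")
```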
2.1.2 Data types and their representation – based on programming language
• From a programming perspective, a data type (or simply type) is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
• Almost all programming languages explicitly include the notion of data type. Common data types include:
  • Integers
  • Booleans
  • Characters
  • Floating-point numbers
  • Alphanumeric strings
• A data type constrains the values that an expression, such as a variable or a function, might take.
• The data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
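As a quick, illustrative Python sketch of these built-in types (the variable names and values are made up):

```python
# Common data types as they appear in Python (illustrative values).
count = 42                  # integer
is_valid = True             # boolean
grade = "A"                 # character (a one-character string in Python)
temperature = 36.6          # floating-point number
student_name = "Natay"      # alphanumeric string

# The type defines which operations make sense for the value:
print(count + 8)            # arithmetic on integers -> 50
print(student_name.upper()) # string operation -> "NATAY"
print(type(temperature))    # <class 'float'>
```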
• On the other hand, for the analysis of data, there are three common data types or structures: structured data, unstructured data, and semi-structured data.
Data types/structures – based on analysis of data
• Structured data, unstructured data, semi-structured data, and metadata

Structured Data
• Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
• Structured data conforms to a tabular format with relationships between the different rows and columns. Common examples are Excel files or SQL databases.
• Each of these has structured rows and columns that can be sorted.
• Structured data depends on the existence of a data model – a model of how data can be stored, processed, and accessed.
• It is possible to quickly aggregate data from various locations in the database.
• Structured data is considered the most 'traditional' form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process, and access structured data.
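A minimal sketch of structured, tabular data using Python's built-in sqlite3 module; the table name, columns, and rows are illustrative assumptions:

```python
import sqlite3

# Structured data: a pre-defined schema of rows and columns that is easy to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Laptop", 1200.0), ("South", "Phone", 650.0), ("North", "Phone", 700.0)],
)

# Because the structure is known in advance, aggregation is straightforward.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)   # ('North', 1900.0), ('South', 650.0)
```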
Unstructured Data
• Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. It is without proper formatting and alignment.
• Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
• Common examples include audio files, video files, or NoSQL databases.
• The ability to store and process unstructured data has greatly grown in recent years, with many new technologies and tools coming to the market that are able to store specialized types of unstructured data. For example:
  • MongoDB is optimized to store documents.
  • Apache Giraph is optimized for storing relationships between nodes.
• The ability to analyze unstructured data is especially relevant in the context of Big Data, since a large part of the data in organizations is unstructured. Think about pictures, videos, or PDF documents.
• The ability to extract value from unstructured data is one of the main drivers behind the quick evolution of big data.
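As a rough sketch of storing schema-free documents (assuming the pymongo driver is installed and a MongoDB server is running on localhost; the database, collection, and field names are invented):

```python
from pymongo import MongoClient

# Documents need not share a fixed schema: each can carry different fields.
client = MongoClient("mongodb://localhost:27017")
collection = client["demo_db"]["documents"]

collection.insert_one({"type": "photo", "camera": "XYZ", "tags": ["holiday", "beach"]})
collection.insert_one({"type": "note", "text": "Free-form text with no fixed structure."})

# Query by whatever fields happen to exist.
print(collection.find_one({"type": "photo"}))
```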
Semi-structured Data
• Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure.
• For example, JSON and XML are forms of semi-structured data.
• The reason this third category exists (between structured and unstructured data) is that semi-structured data is considerably easier to analyze than unstructured data.
• Many big data solutions and tools have the ability to 'read' and process either JSON or XML. This reduces the complexity of analysis compared to unstructured data.
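A short sketch of semi-structured data using Python's standard json module; the record itself is an invented example whose tags (keys) describe the values they carry, which is why JSON is called self-describing:

```python
import json

# Semi-structured data: no rigid table, but tags (keys) describe each value.
raw = '{"name": "Natay", "skills": ["statistics", "Python"], "contact": {"city": "Addis Ababa"}}'

record = json.loads(raw)              # parse the JSON text into Python objects
print(record["name"])                 # Natay
print(record["contact"]["city"])      # Addis Ababa
print(len(record["skills"]))          # 2
```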
Metadata – Data about Data
• The last category of data type is metadata. From a technical point of view, it is not a separate data structure, but it is one of the most important elements for big data analysis and big data solutions.
• Metadata is data about data. It provides additional information about a specific set of data.
• In a set of photographs, for example, metadata could describe when and where the photos were taken. The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
• For this reason, metadata is frequently used by big data solutions for initial analysis.
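A small sketch of how photo metadata (the file names, dates, and locations are invented values) is itself structured data that can be filtered and analyzed, even though the underlying images are unstructured:

```python
from datetime import date

# Metadata: data about data. Each entry describes a photo file rather than its pixels.
photo_metadata = [
    {"file": "IMG_001.jpg", "taken": date(2023, 7, 14), "location": "Addis Ababa"},
    {"file": "IMG_002.jpg", "taken": date(2023, 8, 2),  "location": "Bahir Dar"},
    {"file": "IMG_003.jpg", "taken": date(2024, 1, 9),  "location": "Addis Ababa"},
]

# Because the metadata fields are structured, an initial analysis is easy.
addis_photos = [m["file"] for m in photo_metadata if m["location"] == "Addis Ababa"]
print(addis_photos)   # ['IMG_001.jpg', 'IMG_003.jpg']
```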
2.2 Data Value Chain
• The Data Value Chain is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
• Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
• The infrastructure required to support the acquisition of big data must deliver low, predictable latency in both capturing data and in executing queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
Data Analysis
It is concerned with making the raw data acquired amenable to use in
decision-making as well as domain-specific usage.
Data analysis involves exploring, transforming, and modelling data with
the goal of highlighting relevant data, synthesizing and extracting useful
hidden information with high potential from a business point of view.
Related areas include data mining, business intelligence, and machine
learning (covered in Chapter 4).
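A minimal analysis sketch using pandas (assuming it is installed; the column names and values are invented) that explores and summarizes raw records to highlight information useful for decisions:

```python
import pandas as pd

# Raw, acquired data (invented example records).
orders = pd.DataFrame({
    "region":  ["North", "South", "North", "East"],
    "product": ["Laptop", "Phone", "Phone", "Laptop"],
    "amount":  [1200.0, 650.0, 700.0, 1150.0],
})

# Transform and model the data: total and average sales per region.
summary = orders.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```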
Data Curation
• It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
• Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
• Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
• A key trend for the curation of big data utilizes community and crowdsourcing approaches.
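A rough sketch of one curation activity, validation: checking that records meet simple data-quality rules before they are preserved. The field names and rules below are assumptions chosen for illustration:

```python
# Validate acquired records against simple data-quality rules.
REQUIRED_FIELDS = {"id", "name", "email"}

def validate(record: dict) -> list[str]:
    """Return a list of data-quality problems found in a single record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "email" in record and "@" not in str(record["email"]):
        problems.append("email looks invalid")
    return problems

records = [
    {"id": 1, "name": "Sara", "email": "sara@example.com"},
    {"id": 2, "name": "Abebe", "email": "not-an-email"},
    {"id": 3, "name": "Hana"},
]

for rec in records:
    issues = validate(rec)
    print(rec.get("id"), "OK" if not issues else issues)
```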
Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
• Relational Database Management Systems (RDBMS) have been the main, and almost unique, solution to the storage paradigm for nearly 40 years.
• However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions lack flexibility with regard to schema changes, and RDBMS performance and fault tolerance suffer when data volumes and complexity grow, making them unsuitable for big data scenarios.
• NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
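As a toy illustration of the schema flexibility that NoSQL-style stores aim for, Python's standard shelve module can persist records with differing fields under simple keys. The file name and records are invented, and this is only a sketch, not a substitute for a real NoSQL system:

```python
import shelve

# A simple key-value store: records need not share a schema,
# unlike rows in a relational table.
with shelve.open("demo_store") as store:
    store["user:1"] = {"name": "Sara", "email": "sara@example.com"}
    store["event:42"] = {"type": "login", "timestamp": "2024-01-09T10:00:00"}

with shelve.open("demo_store") as store:
    print(store["user:1"]["name"])     # Sara
    print(store["event:42"]["type"])   # login
```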
Data Usage
• It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
• Data usage in business decision-making can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
2.3 Basic concepts of big data
• Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gain insights from large datasets.
• While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.
We will also take a high-level look at some of the processes and technologies
currently being used in this space.
What Is Big Data?
An exact definition of “big data” is difficult to nail down because projects,
vendors, practitioners, and business professionals use it quite differently.
With that in mind, generally speaking, big data is:
1. large datasets
2. the category of computing strategies and technologies that are used to handle
large datasets
In this context, “large dataset” means a dataset too large to reasonably
process or store with traditional tooling or on a single computer.
This means that the common scale of big datasets is constantly shifting and
may vary significantly from organization to organization.
Why Are Big Data Systems Different?
The basic requirements for working with big data are the
same as the requirements for working with datasets of any
size.
However, the massive scale, the speed of ingesting and
processing, and the characteristics of the data that must be
dealt with at each stage of the process present significant
new challenges when designing solutions.
The goal of most big data systems is to surface insights and
connections from large volumes of heterogeneous data that
would not be possible using conventional methods.
In 2001, Gartner’s Doug Laney first presented what became
known as the “three Vs of big data” to describe some of the
characteristics that make big data different from other data
processing:
Characteristics of Big Data – 3Vs

Volume
• Big data involves large amounts of data - massive datasets, up to the zettabyte scale.
• These datasets can be orders of magnitude larger than traditional datasets, which demands more thought at each stage of the processing and storage life cycle.
• Cluster management and algorithms capable of breaking tasks into smaller pieces become increasingly important.
Velocity
• Another way in which big data differs significantly from other data systems is the speed that information moves through the system.
• Data is frequently flowing into the system from multiple sources and is often expected to be processed in real time to gain insights and update the current understanding of the system.
• Data is constantly being added, massaged, processed, and analyzed in order to keep up with the influx of new information and to surface valuable information early, when it is most relevant.
• These ideas require robust systems with highly available components to guard against failures along the data pipeline.
Variety
• Data comes in many different forms from diverse sources.
• The formats and types of media can vary significantly as well. Rich media like images, video files, and audio recordings are ingested alongside text files, structured logs, etc.
Clustered Computing and Hadoop Ecosystem
• Because of the quantities of big data, individual computers are often inadequate for handling the data at most stages. To address the high storage and computational needs of big data, computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller machines to provide a number of benefits:
  • Resource Pooling: combining the available storage space to hold data is a clear benefit, but CPU and memory pooling is also extremely important.
  • High Availability: clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as we continue to emphasize the importance of real-time analytics.
  • Easy Scalability: clusters make it easy to scale horizontally by adding additional machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources on a machine.
• Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes. Solutions for cluster membership and resource allocation include software like Hadoop's YARN (which stands for Yet Another Resource Negotiator) or Apache Mesos.
• The assembled computing cluster often acts as a foundation that other software interfaces with to process the data. The machines involved in the computing cluster are also typically involved with the management of a distributed storage system (discussed under data persistence).
Hadoop and its Ecosystem
• Hadoop is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
• It is inspired by a technical document published by Google.
• The four key characteristics of Hadoop are:
  • Economical: its systems are highly economical, as ordinary computers can be used for data processing.
  • Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
  • Scalable: it is easily scalable, both horizontally and vertically; a few extra nodes help in scaling up the framework.
  • Flexible: you can store as much structured and unstructured data as you need and decide to use it later.
Hadoop and its Ecosystem
• Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
• It comprises the following components, among many others (a toy MapReduce example follows this list):
  • HDFS: Hadoop Distributed File System
  • YARN: Yet Another Resource Negotiator
  • MapReduce: programming-based data processing
  • Spark: in-memory data processing
  • Pig, Hive: query-based processing of data services
  • HBase: NoSQL database
  • Mahout, Spark MLlib: machine learning algorithm libraries
  • Solr, Lucene: searching and indexing
  • ZooKeeper: cluster management
  • Oozie: job scheduling
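To make the MapReduce idea concrete, here is a toy, single-machine word count written in plain Python. A real Hadoop job would distribute the map and reduce phases across the cluster; the input lines here are invented:

```python
from collections import defaultdict
from itertools import chain

# Toy MapReduce-style word count on a single machine.
lines = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs for every word in every line.
mapped = chain.from_iterable(((word, 1) for word in line.split()) for line in lines)

# Shuffle phase: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1, 'hadoop': 1, 'processes': 1}
```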
End of Chapter Two - Introduction to Data Science!!!
