
Unit 1

Introduction to Data Science and Big Data


Contents

Basics and need of Data Science and Big Data, Applications of Data
Science, Data explosion, 5 V’s of Big Data, Relationship between
Data Science and Information Science, Business intelligence versus
Data Science, Data Science Life Cycle, Data: Data Types, Data
Collection.

Need of Data wrangling, Methods: Data Cleaning, Data Integration,
Data Reduction, Data Transformation, Data Discretization.
Introduction to Data Science

Data science, in simple words, can be defined as an interdisciplinary field of
study that uses data for research and reporting to derive insights and meaning
from that data.


Data science is an interdisciplinary academic field that uses statistics,
scientific computing, scientific methods, processes, algorithms and
systems to extract knowledge and insights from noisy, structured, and
unstructured data.
Applications of Data Science

1. Healthcare: Healthcare companies use data science to build sophisticated
medical instruments to detect and cure diseases.

2. Gaming: Video and computer games are now created with the help of data
science, which has taken the gaming experience to the next level.

3. Image Recognition: Identifying patterns in images and detecting objects in
an image is one of the most popular data science applications.

4. Recommendation Systems: Netflix and Amazon give movie and product
recommendations based on what you like to watch, purchase, or browse on
their platforms.

5. Logistics: Data science is used by logistics companies to optimize routes to
ensure faster delivery of products and increase operational efficiency.

6. Fraud Detection: Banking and financial institutions use data science and
related algorithms to detect fraudulent transactions.

7. Internet Search: Search engines such as Google, Yahoo and Bing employ data
science algorithms to return the best results for a search query in a matter of
seconds.

8. Speech Recognition: Speech recognition is dominated by data science
techniques. We see the excellent work of these algorithms every day in virtual
assistants such as Google Assistant, Alexa, and Siri.

9. Airline Route Planning: Data science makes it easier for the airline industry
to predict flight delays. It also helps decide whether to fly directly to the
destination or to make a stop along the way, for example on a long route such
as Delhi to the United States.

10. Augmented Reality: There is a fascinating relationship between data
science and augmented reality. An AR headset combines computing expertise,
algorithms, and data to create the best possible viewing experience. The
popular game Pokemon GO is an early step in that direction.
Data Explosion

The rapid or exponential increase in the amount of data that is generated and stored
in computing systems, reaching a level where data management becomes difficult, is
called Data Explosion.

The key drivers of data growth are the following:
➢ Increase in storage capacities.
➢ Cheaper storage.
➢ Increase in data processing capabilities by modern computing devices.
➢ Data generated and made available by different sectors.
5 V’s of Big Data

Big data is a collection of data from many different sources and is often
described by five characteristics: volume, variety, velocity, value and
veracity.

Volume: The size and amounts of big data that companies manage and
analyze. If the volume of data is large enough, it can be considered big data.

Variety: Variety describes the diversity of the data types and its
heterogeneous sources. The data comes in three different types: Structured
Data, Unstructured Data and Semi-Structured Data.

Velocity: It describes how rapidly the data is generated and how quickly it
moves. This data flow comes from sources such as mobile phones, social
media, networks, servers, etc. An organization that uses big data will have a
large and continuous flow of data that is being created and sent to its end
destination. This data needs to be digested and analyzed quickly.

Value: This refers to the value that big data can provide, and it relates
directly to what organizations can do with that collected data. The
value of big data usually comes from insight discovery and pattern
recognition that lead to more effective operations, stronger customer
relationships and other clear and quantifiable business benefits. The
more insights derived from the Big Data, the higher its value.

Veracity: Veracity describes the data’s accuracy and quality. Since the
data is pulled from diverse sources, the information can have
uncertainties, errors, redundancies, gaps, and inconsistencies.
Veracity, overall, refers to the level of trust there is in the collected
data.
Relationship between Data Science and
Information Science

Data Science and Information Science are distinct but complementary disciplines.

Data Science:- Data science is used in business functions such as strategy formation,
decision making and operational processes. It transforms raw, messy data into
actionable knowledge which supports decision-making. It touches on practices such
as artificial intelligence, analytics, predictive analytics and algorithm design. Data
science is an interdisciplinary field about scientific methods, processes, and systems
to extract knowledge or insights from data in various forms, either structured or
unstructured.

Information Science:- Information Science is the use of computers, storage,
networking and other physical devices, infrastructure and processes to create,
process, store, secure and exchange all forms of electronic data. It is the
development, maintenance, and use of computer software, systems, and networks. It
includes their use for the processing and distribution of data.
Business Intelligence Vs Data Science

Perspective: BI systems are designed to look backwards based on real
data from real events. Data Science looks forward, interpreting the
information to predict what might happen in the future.

Focus: BI delivers detailed reports and trends, but it does not tell you what
the data may look like in the future through patterns and experimentation.

Process: Traditional BI systems tend to be static and comparative. They
do not offer room for exploration and experimentation.

Data sources: Because of its static nature, BI data sources tend to be pre-
planned and added slowly. Data science offers a much more flexible
approach as it means data sources can be added on the go as needed.

Transform: How the data delivers a difference to the business is
important. BI helps you answer the questions you know, whereas Data
Science helps you to discover new questions because of the way it
encourages companies to apply insights to new data.


Storage: Like any business asset, data needs to be flexible. BI systems
tend to be warehoused, which makes them difficult to deploy across the
business. Data Science results can be distributed in real time.

Data quality: Any data analysis is only as good as the quality of the data
captured. BI provides a single version of truth while data science offers
precision, confidence level and much wider probabilities with its findings.

IT owned vs. business owned: In the past, BI systems were often owned and
operated by the IT department, sending along intelligence to analysts who
interpreted it. With Data Science, the analysts are in charge. The new Big
Data solutions are designed to be owned by analysts, who spend most of
their time analyzing data and making predictions upon which to base
business decisions.
Data Science Life Cycle
1. Business Understanding:


The complete cycle revolves around the business goal. What will you solve if
you do not have a specific problem?

It is extremely important to understand the business goal clearly, because it
is the ultimate aim of the analysis.

e.g. you need to understand whether the customer wants to minimize losses, or
whether they want to predict the price of a commodity, etc.
2. Data Understanding (Data Mining):


After business understanding, the next step is data understanding.

This step includes describing the data, their structure, their relevance, and
their data types.

Explore the information using graphical plots.
3. Preparation of Data (Data Cleaning)


Next comes the data preparation stage.

This consists of steps such as selecting the relevant data, integrating the data
by merging the data sets, cleaning it, treating missing values by either removing
or imputing them, removing inaccurate data, and checking for outliers with box
plots and handling them.

Construct new data and derive new features from existing ones.

Format the data into the desired structure and remove unwanted columns and
features.

Data preparation is the most time-consuming but arguably the most essential step
in the complete life cycle.

Your model will only be as accurate as your data.
4. Exploratory Data Analysis (Data Exploration):


This step involves getting some idea about the solution and the factors affecting
it, before building the actual model.

The distribution of data within individual variables is explored graphically using
bar graphs, and relations between different features are captured using graphical
representations such as scatter plots and heat maps.

Many data visualization techniques are used to explore each feature individually
and in combination with other features.
5. Data Modelling (Feature Engineering)


Data modelling is the heart of data analysis.

A model takes the prepared data as input and gives the desired output.

This step consists of selecting the suitable kind of model, depending on whether
the problem is a classification, regression, or clustering problem.

After deciding on the model family, we need to carefully choose which algorithms
within that family to implement, and then implement them.

We need to tune the hyperparameters of each model to obtain the desired
performance.
6. Model Evaluation (Predictive Modelling):


Here the model is evaluated to check whether it is ready to be deployed.

The model is tested on unseen data and evaluated on a carefully chosen set of
evaluation metrics.

If we do not obtain a satisfactory result in the evaluation, we have to iterate
over the complete modelling process until the desired level of the metrics is
achieved.

We can build more than one model for a given phenomenon; model evaluation helps
us select and build the best one.
7. Model Deployment (Data Visualization):


After rigorous evaluation, the model is finally deployed in the desired format
and channel.

This is the last step in the data science life cycle.

Each step in the data science life cycle described above must be worked on
carefully. If any step is performed improperly, it affects the subsequent steps
and the complete effort goes to waste.

Right from business understanding to model deployment, every step has to be given
appropriate attention, time, and effort.
Data Preprocessing
Why Data Preprocessing?
● Data in the real world is dirty
➢ Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g. occupation=“ ”
➢ Noisy: containing errors or outliers
e.g. Salary=“-10”
➢ Inconsistent: containing discrepancies in codes or names
e.g. Age=“42” Birthday=“03/07/1997”
e.g. Was rating “1,2,3”, now rating “A, B, C”

● No quality data, no quality mining results!

Major tasks in Data Preprocessing
1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation
5. Discretization

1. Data Cleaning
Data cleaning tasks:

1) Fill in missing values

2) Identify outliers and smooth out noisy data

3) Correct inconsistent data

4) Resolve redundancy caused by data integration

Missing Data
● Data is not always available
e.g. many tuples have no recorded value for several attributes
● Missing data may be due to
➢ equipment malfunction
➢ inconsistent with other recorded data and thus deleted
➢ data not entered due to misunderstanding
➢ certain data may not be considered important at the time of entry

Handling of Missing values
1) Ignore the tuple: usually done when class label is missing
2) Fill in the missing value manually: tedious + infeasible?
3) Fill in it automatically with

- a global constant : e.g., “unknown”

- the attribute mean

- the attribute mean for all samples belonging to the same class

4) The most probable value
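As an illustration, a minimal pandas sketch of these strategies; the small DataFrame, its column names, and the chosen fill values are hypothetical:

import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", None],
    "income": [50000, None, 42000, 61000, 38000],
})

# 1) Ignore the tuple (usually when the class label is missing)
df_dropped = df.dropna(subset=["class"])

# 3a) Fill in automatically with a global constant such as "unknown"
df_const = df.fillna({"class": "unknown"})

# 3b) Fill in with the attribute mean
df_mean = df.fillna({"income": df["income"].mean()})

# 3c) Fill in with the attribute mean of samples belonging to the same class
df_class_mean = df.copy()
df_class_mean["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)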

Noisy Data
● Noise: random error in a measured variable, i.e. an attribute may take an
incorrect value (an outlier). e.g. Salary=“-10”
● Incorrect attribute values may be due to
➢ faulty data collection instruments
➢ data entry problems
➢ data transmission problems

Handling of Noisy Data
● Binning method:
✔ first sort data and partition into equal-frequency (equi-depth) or Equal-width
(distance) bins
✔ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries etc.
✔ used also for discretization (discussed later)
● Clustering
✔ detect and remove outliers
● Regression
✔ smooth by fitting the data into regression functions

Binning method
Binning Methods for Data Smoothing:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins::
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
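A small NumPy sketch that reproduces the example above; np.array_split is used here only to form the three equal-frequency bins from the already sorted prices:

import numpy as np

prices = np.sort([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equi-depth) partitioning: three bins of four values each
bins = np.array_split(prices, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [np.full(len(b), int(round(b.mean()))) for b in bins]   # 9, 23, 29

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print(by_means)    # bins smoothed to their means
print(by_bounds)   # values pulled to the nearest bin boundary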

Cluster Analysis

Regression
● Data smoothing can also be done by regression.
● Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
● Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
● Using regression, we find a mathematical equation that fits the data and helps to
smooth out the noise.
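A minimal sketch of regression-based smoothing with NumPy, using synthetic noisy data around the line y = x + 1 (the noise level and sample size are arbitrary):

import numpy as np

# Synthetic noisy observations around the true relation y = x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x + 1 + rng.normal(scale=1.0, size=x.size)

# Linear regression: fit the "best" straight line y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)

# Smoothed values: each noisy y is replaced by the value the line predicts from x
y_smooth = a * x + b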

Continued...

[Figure: regression line y = x + 1 used to predict attribute Y1 from attribute X1]
2. Data Integration
● Data integration combines data from multiple sources into a coherent store (unified
view of that data)

● Schema integration: e.g., A.cust-id = B.cust-#


Integrate metadata from different sources

● Entity identification problem:


Identify real world entities from multiple data sources
e.g., Bill Clinton = William Clinton

● Detecting and resolving data value conflicts:


➢ For the same real world entity, attribute values from different sources are
different
➢ Possible reasons: different representations, different scales, e.g., metric vs. British
units
Handling Redundancy in Data Integration

Redundant data often occur when integrating multiple databases


➔ Object identification: The same attribute or object may have different names
in different databases
➔ Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
➔ Redundant attributes may be detected by correlation analysis
➔ Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Correlation analysis
➢ Correlation coefficient (also called Pearson’s product moment
coefficient):

r(A,B) = Σ(A − Ā)(B − B̄) / ((n − 1) σA σB) = (Σ(A·B) − n·Ā·B̄) / ((n − 1) σA σB)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, and σA
and σB are the respective standard deviations of A and B.

- If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The
higher the value, the stronger the correlation.
- rA,B < 0 : negatively correlated;

- rA,B = 0 : independent
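A short sketch computing the coefficient directly from this formula and cross-checking it with NumPy; the two attribute arrays are made-up sample values:

import numpy as np

# Two hypothetical numeric attributes measured over the same tuples
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 2.0, 5.0, 6.0])

n = len(A)
sigma_A, sigma_B = A.std(ddof=1), B.std(ddof=1)   # sample standard deviations, matching (n - 1)

# Pearson correlation coefficient, directly from the formula above
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * sigma_A * sigma_B)

# Cross-check against NumPy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(A, B)[0, 1])
print(r)   # positive value, so A and B are positively correlated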

Standard Deviation
● The standard deviation (SD, also represented by the Greek letter sigma σ or the
Latin letter s) is a measure that is used to quantify the amount of variation or
dispersion of a set of data values.

σ = √( (1/N) · Σ (xᵢ − µ)² ), where N is the number of tuples and µ is the mean.


● Square of standard deviation is known as variance.
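A tiny NumPy check of these definitions on made-up values:

import numpy as np

x = np.array([4, 8, 15, 16, 23, 42], dtype=float)   # made-up values

mu = x.mean()
sigma = np.sqrt(((x - mu) ** 2).mean())   # population standard deviation

assert np.isclose(sigma, x.std())         # NumPy's default std() uses the same formula
assert np.isclose(sigma ** 2, x.var())    # variance is the square of the standard deviation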

3. Data Transformation
➔ Smoothing: remove noise from data (use of binning, regression and clustering)
➔ Aggregation: summarization (e.g. daily sales data aggregated to compute monthly
or annual sales)
➔ Generalization: concept hierarchy climbing (e.g. street can be generalized to
higher level concepts like city or country)
➔ Discretization: values of numeric attribute (e.g age) are replaced by interval
labels(0-10,10-20,..) or conceptual labels(e.g. youth, adult, senior)
➔ Normalization: Attributes are scaled to fall within a smaller range such as -1.0 to
1.0 or 0.0 to 1.0
➢ min-max normalization
➢ z-score normalization
➢ normalization by decimal scaling
➔ Attribute/feature construction: New attributes constructed from the given ones
Data Transformation: Normalization
● Min-max normalization: to [new_minA, new_maxA]

v' = ((v − minA) / (maxA − minA)) · (new_maxA − new_minA) + new_minA

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
mapped to

((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716

Normalization Continued...

● Z-score normalization (μ: mean, σ: standard deviation):

v' = (v − μ) / σ

Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to

(73,600 − 54,000) / 16,000 = 1.225

● Normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
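A minimal NumPy sketch of the three normalization methods, reusing the income figures from the examples above (the other array values are made up):

import numpy as np

income = np.array([12_000, 30_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min              # 73,600 -> 0.716

# Z-score normalization, using the mean and standard deviation from the example
mu, sigma = 54_000, 16_000
zscore = (income - mu) / sigma                        # 73,600 -> 1.225

# Decimal scaling: divide by the smallest power of 10 that brings every |v'| below 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))  # +1 guards exact powers of 10
decimal = income / 10 ** j                            # 98,000 -> 0.98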

4. Data Reduction
Why data reduction?
➢ A database/data warehouse may store terabytes of data. Complex data
analysis/mining may take a very long time to run on the complete data set
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results
Data reduction strategies:
➔ Data cube aggregation
➔ Dimensionality reduction
➔ Data Compression
➔ Numerosity reduction
➔ Discretization and concept hierarchy generation

A. Data cube aggregation

B. Dimensionality reduction

● Most data mining algorithms are implemented column-wise, which makes them
slower and slower as the number of data columns (i.e. dimensions) grows.
● So it is necessary to remove unimportant attributes/dimensions from the dataset
in order to improve the performance of the data mining algorithm.
● Attribute selection methods:
1) Missing Values Ratio: Data columns with too many missing values are unlikely
to carry much useful information. Thus data columns with number of missing values
greater than a given threshold can be removed.
2) Low Variance Filter: Similarly to the previous technique, data columns with
little changes in the data carry little information. Thus all data columns with
variance lower than a given threshold are removed.
3) High Correlation Filter: Data columns with very similar trends are also likely to
carry very similar information. In this case, only one of them will suffice to feed the
machine learning model. Pairs of columns with correlation coefficient higher than a
threshold are reduced to only one.
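A rough pandas sketch of these first three filters; the function name and the threshold values are illustrative assumptions, not recommended defaults:

import pandas as pd

def reduce_columns(df, max_missing_ratio=0.4, min_variance=1e-3, max_correlation=0.95):
    """Apply the missing-values-ratio, low-variance and high-correlation filters."""
    # 1) Missing Values Ratio: drop columns with too large a fraction of missing values
    df = df[[c for c in df.columns if df[c].isna().mean() <= max_missing_ratio]]

    # 2) Low Variance Filter: drop numeric columns whose values barely change
    variances = df.select_dtypes("number").var()
    df = df.drop(columns=[c for c in variances.index if variances[c] < min_variance])

    # 3) High Correlation Filter: of each highly correlated pair of columns, keep only one
    corr = df.select_dtypes("number").corr().abs()
    to_drop = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if a not in to_drop and corr.loc[a, b] > max_correlation:
                to_drop.add(b)
    return df.drop(columns=list(to_drop))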
Continued...

4) Principal Component Analysis (PCA): Principal Component Analysis (PCA) is


a statistical procedure that orthogonally transforms the original n coordinates of a
data set into a new set of n coordinates called principal components. As a result of
the transformation, the first principal component has the largest possible variance;
each succeeding component has the highest possible variance under the constraint
that it is orthogonal to (i.e., uncorrelated with) the preceding components. Keeping
only the first m < n components reduces the data dimensionality while retaining
most of the data information, i.e. the variation in the data.
5) Backward Feature Elimination: In this technique, at a given iteration, the
selected classification algorithm is trained on n input features. Then we remove one
input feature at a time and train the same model on n-1 input features n times. The
input feature whose removal has produced the smallest increase in the error rate is
removed, leaving us with n-1 input features. The classification is then repeated
using n-2 features, and so on.

Continued...

6) Forward Feature Construction: This is the inverse process to the Backward


Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a
time, i.e. the feature that produces the highest increase in performance.
● Both algorithms, Backward Feature Elimination and Forward Feature Construction,
are quite time-consuming and computationally expensive. They are practically only applicable
to a data set with an already relatively low number of input columns.
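A brief scikit-learn sketch of PCA and of both sequential procedures on the built-in Iris data; note that SequentialFeatureSelector is a close cross-validated variant of Backward Feature Elimination / Forward Feature Construction, and the number of features to keep is an arbitrary choice here:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 150 samples, 4 numeric features

# PCA: keep the first components that together explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(X.shape, "->", X_pca.shape)        # dimensionality after the transform
print(pca.explained_variance_ratio_)     # variance captured by each kept component

# Backward Feature Elimination: start with all features, drop the least useful one at a time
model = LogisticRegression(max_iter=1000)
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward").fit(X, y)

# Forward Feature Construction: start with one feature, add the most useful one at a time
forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward").fit(X, y)

print(backward.get_support())   # boolean masks of the features kept by each strategy
print(forward.get_support())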

C. Numerosity reduction

● Reduce data volume by choosing alternative, smaller forms of data


representation
1) Parametric methods:
➢ Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers)
➢ Regression, Log-Linear Models.
2) Non-parametric methods
➢ Do not assume models
➢ Major families: histograms, clustering, sampling

Regression

● Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
● Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.

Continued...

[Figure: regression line y = x + 1 used to predict attribute Y1 from attribute X1]
Histogram

● A histogram is a graphical
representation of the distribution of
numerical data.
● Divide data into buckets and store
average (sum) for each bucket
● Partitioning rules:
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth)
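A small NumPy sketch of the two partitioning rules on the price values used earlier:

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width buckets: three buckets covering ranges of equal width
counts, edges = np.histogram(prices, bins=3)
print(edges)     # bucket boundaries (each bucket spans the same range)
print(counts)    # number of values falling into each bucket

# Equal-frequency (equal-depth) buckets: boundaries placed at the quantiles
eq_freq_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(eq_freq_edges)   # each bucket holds roughly the same number of values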

Clustering

Partition the data set into clusters based on similarity, and store only the
cluster representation (e.g., centroid and diameter).

Sampling
● Sampling is the process of selecting units from a population of interest so
that by studying the sample we may fairly generalize our results back to the
population from which they were chosen
● Types of sampling:
1) Simple random sampling without replacement (SRSWOR): The
sampling units are chosen without replacement in the sense that the units once chosen
are not placed back in the population.

2) Simple random sampling with replacement (SRSWR): The sampling


units are chosen with replacement in the sense that the chosen units are placed back
in the population.

3) Cluster sample: The population is divided into N groups, called clusters. We can
randomly select n clusters to include in the sample.

4) Stratified sampling: We divide the population into separate groups, called strata.
Then, a simple random sample is drawn from each group.
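A pandas sketch of the four sampling types; the population, its size, and the sample sizes are hypothetical:

import pandas as pd

# Hypothetical population of 1,000 customers spread over two groups ("city")
population = pd.DataFrame({
    "customer_id": range(1000),
    "city": ["Pune"] * 600 + ["Mumbai"] * 400,
})

# Simple random sampling without replacement (SRSWOR)
srswor = population.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR)
srswr = population.sample(n=100, replace=True, random_state=0)

# Stratified sampling: a simple random sample drawn from each stratum ("city")
stratified = population.groupby("city").sample(frac=0.1, random_state=0)

# Cluster sampling: randomly pick whole clusters and keep every unit in them
chosen = pd.Series(population["city"].unique()).sample(n=1, random_state=0)
cluster_sample = population[population["city"].isin(chosen)]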

5. Discretization
● Discretization techniques can be used to reduce the number of values for a
given continuous attribute, by dividing the attribute into a range of intervals.
● Interval labels can then be used to replace actual data values.
● Many discretization techniques can be applied recursively in order to
provide a hierarchical partitioning of the attribute values known as concept
hierarchy.

Continued...

● Discretization techniques can be categorized based on which direction it


proceeds, as:
➢ Top-down: The process starts by first finding one or a few points to split
the entire attribute range, and then repeats this recursively on the resulting
intervals.
➢ Bottom-up: Starts by considering all of the continuous values as potential
split-points, removes some by merging neighborhood values to form
intervals, and then recursively applies this process to the resulting intervals.

Continued...

● Methods for concept hierarchy generation are:


➢ Binning: Attribute values can be discretized by distributing the values into
bins and replacing each bin by the bin mean or the bin median value. This
technique can be applied recursively to the resulting partitions in order to
generate concept hierarchies.
➢ Histogram analysis: Histograms can also be used for discretization.
Partitioning rules can be applied to define range of values as
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth)
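A short pandas sketch of binning-based discretization; the ages, bin counts, and concept labels are illustrative:

import pandas as pd

ages = pd.Series([3, 7, 15, 22, 29, 35, 41, 48, 56, 63, 71, 84])

# Equal-width discretization: interval labels replace the raw ages
equal_width = pd.cut(ages, bins=3)

# Equal-frequency (equal-depth) discretization
equal_depth = pd.qcut(ages, q=3)

# Conceptual labels form the next level of a simple concept hierarchy
concepts = pd.cut(ages, bins=[0, 18, 60, 120], labels=["youth", "adult", "senior"])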

Continued...

➢ Cluster analysis:
A clustering algorithm can be applied to partition data into clusters or
groups. Each cluster forms a node of a concept hierarchy, where all nodes
are at the same conceptual level. Each cluster may be further decomposed
into sub-clusters, forming a lower level in the hierarchy. Clusters may also
be grouped together to form a higher-level concept hierarchy.
➢ Data segmentation by “natural partitioning”:
Breaking up annual salaries into uniform, natural ranges such as ($50,000 -
$100,000) is often more desirable than ranges arrived at by cluster analysis.
