Unit 1
●
Data science is an interdisciplinary academic field that uses statistics,
scientific computing, scientific methods, processes, algorithms and
systems to extract knowledge and insights from noisy, structured, and
unstructured data.
Applications of Data Science
2. Gaming: Video and computer games are now created with the help of data
science, which has taken the gaming experience to the next level.
6. Fraud Detection: Banking and financial institutions use data science and
related algorithms to detect fraudulent transactions.
7. Internet Search: Search engines such as Google, Yahoo, Bing and others
employ data science algorithms to deliver the best results for a search query
in a matter of seconds.
●
Storage: Like any business asset, data needs to be flexible. BI systems
tend to be warehoused, which makes them difficult to deploy across the
business. Data science solutions can be distributed and run in real time.
●
Data quality: Any data analysis is only as good as the quality of the data
captured. BI provides a single version of the truth, while data science offers
precision, confidence levels and much wider probabilities with its findings.
●
IT-owned vs. business-owned: In the past, BI systems were often owned and
operated by the IT department, which passed intelligence along to analysts who
interpreted it. With data science, the analysts are in charge. The new big
data solutions are designed to be owned by analysts, who spend most of
their time analyzing data and making predictions upon which to base
business decisions.
Data Science Life Cycle
1. Business Understanding:
➢
The complete cycle revolves around the business goal. What will you solve if
you do not have a specific problem?
➢
It is extremely important to understand the business objective clearly, since that
will be the ultimate aim of the analysis.
➢
e.g. You need to understand whether the customer wants to minimize savings loss,
or to predict the price of a commodity, etc.
2. Data Understanding (Data Mining):
➢
After business understanding, the next step is data understanding.
➢
This step includes describing the data: their structure, their relevance, and their data
types.
➢
Explore the information using graphical plots.
3. Preparation of Data (Data Cleaning)
➢
Next comes the data preparation stage.
➢
This consists of steps like selecting the relevant data, integrating the data by
merging the data sets, cleaning it, treating missing values by either removing
or imputing them, treating erroneous data by removing it, and also checking
for outliers using box plots and handling them.
➢
Construct new data and derive new features from existing ones.
➢
Format the data into the desired structure and remove unwanted columns and
features.
➢
Data preparation is the most time-consuming yet arguably the most important step in
the complete life cycle.
➢
Your model will only be as good as your data.
4. Exploratory Data Analysis (Data Exploration):
➢
This step involves getting some idea about the solution and the factors affecting it,
before building the actual model.
➢
The distribution of data within individual variables is explored graphically using
bar graphs; relations between different features are captured through graphical
representations like scatter plots and heat maps.
➢
Many data visualization techniques are used to explore every feature
individually and in combination with other features.
5. Data Modelling (Feature Engineering)
➢
Data modelling is the heart of data analysis.
➢
A model takes the prepared data as input and produces the desired output.
➢
This step involves choosing the appropriate type of model, depending on whether the
problem is a classification problem, a regression problem or a clustering problem.
➢
After selecting the model family, we need to carefully choose and implement the
algorithms within that family.
➢
We need to tune the hyperparameters of each model to achieve the desired
performance.
6. Model Evaluation (Predictive Modelling):
➢
Here the model is evaluated to check whether it is ready to be deployed.
➢
The model is tested on unseen data and evaluated on a carefully chosen set of
evaluation metrics.
➢
If we do not obtain a satisfactory result in the evaluation, we have to iterate over the
entire modelling process until the desired level of the metrics is achieved.
➢
We can build more than one model for a given phenomenon; model evaluation helps
us choose and build the best one.
7. Model Deployment (Data Visualization):
➢
After rigorous evaluation, the model is finally deployed in the desired format and
channel.
➢
This is the last step in the data science life cycle.
➢
Each step in the data science life cycle described above must be worked on
carefully. If any step is performed improperly, it affects the subsequent steps
and the entire effort goes to waste.
➢
Right from business understanding to model deployment, every step must be given
appropriate attention, time, and effort.
Data Preprocessing
Why Data Preprocessing?
● Data in the real world is dirty
➢ Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g. occupation=“ ”
➢ Noisy: containing errors or outliers
e.g. Salary=“-10”
➢ Inconsistent: containing discrepancies in codes or names
e.g. Age=“42” Birthday=“03/07/1997”
e.g. Was rating “1,2,3”, now rating “A, B, C”
Major tasks in Data Preprocessing
1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation
5. Discretization
1. Data Cleaning
Data cleaning tasks include handling missing data and smoothing noisy data, discussed below.
Missing Data
● Data is not always available
e.g. many tuples have no recorded value for several attributes
● Missing data may be due to
➢ equipment malfunction
➢ inconsistent with other recorded data and thus deleted
➢ data not entered due to misunderstanding
➢ certain data may not be considered important at the time of entry
Handling of Missing values
1) Ignore the tuple: usually done when class label is missing
2) Fill in the missing value manually: tedious + infeasible?
3) Fill it in automatically with the attribute mean for all samples belonging to the
same class (a minimal sketch of these options follows)
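A minimal pandas sketch of these options; the toy data and the "cls"/"income" column names are illustrative assumptions, not part of the original notes:

```python
import numpy as np
import pandas as pd

# Toy data with missing income values (column names are illustrative).
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 42_000, np.nan, 48_000],
})

# 1) Ignore the tuple: drop rows that contain missing values.
dropped = df.dropna()

# 3) Fill in automatically with the attribute mean of samples in the same class.
df["income"] = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))
print(df)
```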
Noisy Data
● Noise: It is a random error in a measured variable. That is, there can be
incorrect attribute values (outliers), e.g. Salary=“-10”
● Incorrect attribute values may be due to
➢ faulty data collection instruments
➢ data entry problems
➢ data transmission problems
Handling of Noisy Data
● Binning method:
first sort the data and partition it into equal-frequency (equi-depth) or equal-width
(distance) bins
✔ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries etc.
✔ used also for discretization (discussed later)
● Clustering
✔ detect and remove outliers
● Regression
✔ smooth by fitting the data into regression functions
Binning method
Binning Methods for Data Smoothing:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
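The same worked example as a small NumPy sketch; the 3×4 reshape simply encodes the three equal-frequency bins of four values each:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted

# Partition into 3 equal-frequency (equi-depth) bins of 4 values each.
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean.
by_means = np.repeat(np.rint(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: every value moves to the closest bin boundary.
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(np.abs(bins - lo) <= np.abs(bins - hi), lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```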
Cluster Analysis
● Similar values are organized into groups (clusters); values that fall outside the set of
clusters may be considered outliers and removed.
Regression
● Data smoothing can also be done by regression.
● Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
● Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
● Using regression, we find a mathematical equation that fits the data, which helps to
smooth out the noise.
Continued...
[Figure: data points with the fitted regression line y = x + 1; a noisy point (X1, Y1) is smoothed to the value Y1’ predicted by the line.]
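A minimal NumPy sketch of smoothing by regression; the x/y values are made up for illustration:

```python
import numpy as np

# Noisy observations of a roughly linear relationship (illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.8, 4.3, 4.9, 6.2, 6.8])

# Fit the "best" (least-squares) line y = a*x + b.
a, b = np.polyfit(x, y, deg=1)

# Smooth the data: replace each observed y by the value predicted by the line.
y_smooth = a * x + b
print(round(a, 2), round(b, 2))   # slope and intercept of the fitted line
print(y_smooth)
```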
2. Data Integration
● Data integration combines data from multiple sources into a coherent store (unified
view of that data)
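A minimal pandas sketch of integrating two sources into one unified view; the table schemas, keys and values are illustrative assumptions:

```python
import pandas as pd

# Two sources describing the same customers (schemas are illustrative).
crm   = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bob", "Eve"]})
sales = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [120, 80, 200]})

# Integrate on the shared key into one coherent store / unified view.
unified = crm.merge(sales, on="cust_id", how="left")
print(unified)
```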
Correlation analysis
➢ Correlation coefficient (also called Pearson’s product moment
coefficient):
r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}
where n is the number of tuples, \bar{A} and \bar{B} are the respective mean values of A and B,
and \sigma_A and \sigma_B are the respective standard deviations of A and B.
- If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The
higher the value, the stronger the correlation.
- r_{A,B} < 0: negatively correlated
- r_{A,B} = 0: independent (uncorrelated)
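A minimal Python sketch of this formula; the sample data and the function name are illustrative. It uses the (n−1)-denominator standard deviation, so the result matches NumPy's built-in Pearson correlation:

```python
import numpy as np

def pearson_r(a, b):
    """r_{A,B} = sum((A - mean_A)(B - mean_B)) / ((n-1) * sigma_A * sigma_B)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    # ddof=1 gives the (n-1)-denominator standard deviation used in the formula.
    return ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

a = [2, 4, 6, 8, 10]
b = [1, 3, 5, 7, 11]
print(pearson_r(a, b))          # close to +1: strongly positively correlated
print(np.corrcoef(a, b)[0, 1])  # NumPy's built-in Pearson r agrees
```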
Standard Deviation
● The standard deviation (SD, also represented by the Greek letter sigma σ or the
Latin letter s) is a measure that is used to quantify the amount of variation or
dispersion of a set of data values.
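In symbols, for n data values x_1, …, x_n with mean \bar{x} (the sample form with an (n−1) denominator is also common):

```latex
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}
```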
3. Data Transformation
➔ Smoothing: remove noise from data (use of binning, regression and clustering)
➔ Aggregation: summarization (e.g. daily sales data aggregated to compute monthly
or annual sales)
➔ Generalization: concept hierarchy climbing (e.g. street can be generalized to
higher level concepts like city or country)
➔ Discretization: values of numeric attribute (e.g age) are replaced by interval
labels(0-10,10-20,..) or conceptual labels(e.g. youth, adult, senior)
➔ Normalization: Attributes are scaled to fall within a smaller range such as -1.0 to
1.0 or 0.0 to 1.0
➢ min-max normalization
➢ z-score normalization
➢ normalization by decimal scaling
➔ Attribute/feature construction: New attributes constructed from the given ones
Data Transformation: Normalization
● Min-max normalization: to [new_min_A, new_max_A]

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
mapped to

\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716
Normalization Continued...
● Z-score normalization (using the mean \mu_A and standard deviation \sigma_A of attribute A):

v' = \frac{v - \mu_A}{\sigma_A}

Ex. Let the mean and standard deviation of income be $54,000 and $16,000. Then $73,600 is
mapped to

\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225

● Normalization by decimal scaling:

v' = \frac{v}{10^{\,j}}, where j is the smallest integer such that Max(|v'|) < 1
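A minimal NumPy sketch of these three normalizations. The toy income array is an assumption; it deliberately includes the $12,000, $73,600 and $98,000 values of the worked example, but its mean and standard deviation differ from the $54,000 / $16,000 used in the z-score example above:

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]: 73,600 -> 0.716..., matching the example above.
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10 ** j

print(min_max, z_score, decimal, sep="\n")
```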
4. Data Reduction
Why data reduction?
➢ A database/data warehouse may store terabytes of data. Complex data
analysis/mining may take a very long time to run on the complete data set
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produce the same (or almost the same) analytical results
Data reduction strategies:
➔ Data cube aggregation
➔ Dimensionality reduction
➔ Data Compression
➔ Numerosity reduction
➔ Discretization and concept hierarchy generation
A. Data cube aggregation
● Aggregation operations are applied to the data to build a data cube, e.g. daily sales data
summarized into monthly or annual totals, so that analysis runs on the much smaller,
summarized cube. A sketch of this roll-up is shown below.
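A minimal pandas sketch of the idea; the schema and the values are illustrative assumptions. Daily sales records are rolled up to (branch, month) cells of a simple sales cube:

```python
import pandas as pd

# Daily sales records (illustrative schema and values).
sales = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-17"]),
    "branch": ["X", "X", "X", "Y"],
    "amount": [120, 80, 200, 150],
})

# Roll the daily data up to (branch, month) cells of a simple data cube.
cube = (sales
        .assign(month=sales["date"].dt.to_period("M"))
        .groupby(["branch", "month"])["amount"]
        .sum())
print(cube)
```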
B. Dimensionality reduction
● Most data mining algorithms are implemented column-wise, which makes them
slower and slower as the number of data columns (i.e. dimensions) grows.
● So it is necessary to remove unimportant attributes/dimensions from the dataset
in order to improve the performance of data mining algorithms.
● Attribute selection methods:
1) Missing Values Ratio: Data columns with too many missing values are unlikely
to carry much useful information. Thus data columns whose ratio of missing values is
greater than a given threshold can be removed.
2) Low Variance Filter: Similar to the previous technique, data columns with
little change in the data carry little information. Thus all data columns with
variance lower than a given threshold are removed.
3) High Correlation Filter: Data columns with very similar trends are also likely to
carry very similar information. In this case, only one of them will suffice to feed the
machine learning model. Pairs of columns with correlation coefficient higher than a
threshold are reduced to only one.
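These three filters can be sketched in a few lines of pandas; the function name and the threshold values below are illustrative choices, not part of the original notes:

```python
import pandas as pd

def filter_columns(df, max_missing=0.5, min_var=1e-3, max_corr=0.95):
    """Drop columns using the three filters described above (thresholds are illustrative)."""
    # 1) Missing Values Ratio: drop columns with too many missing values.
    df = df.loc[:, df.isna().mean() <= max_missing]

    # 2) Low Variance Filter: drop numeric columns that barely change.
    variances = df.select_dtypes("number").var()
    df = df.drop(columns=variances[variances < min_var].index)

    # 3) High Correlation Filter: of each highly correlated pair, keep only one column.
    corr = df.select_dtypes("number").corr().abs()
    drop = {b for i, a in enumerate(corr.columns)
            for b in corr.columns[i + 1:] if corr.loc[a, b] > max_corr}
    return df.drop(columns=list(drop))
```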
C. Numerosity reduction
● Numerosity reduction replaces the original data volume with a smaller representation of
the data, using parametric methods such as regression models or non-parametric methods
such as histograms, clustering and sampling, described below.
Regression
● Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
● Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
Histogram
● A histogram is a graphical
representation of the distribution of
numerical data.
● Divide the data into buckets and store the
average (or sum) for each bucket (see the sketch below)
● Partitioning rules:
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth)
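A small NumPy sketch using the same price list as the binning example earlier; the choice of 3 buckets is illustrative:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning into 3 buckets over the range 4..34.
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)                       # [ 4. 14. 24. 34.]  [3 3 6]

# Reduced representation: keep only the average value of each bucket.
bucket = np.digitize(prices, edges[1:-1])  # bucket index for every price
means = [prices[bucket == k].mean() for k in range(len(counts))]
print(means)                               # roughly [7.0, 19.0, 27.7]
```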
Clustering
● The data are partitioned into clusters, and a compact cluster representation (e.g. the
cluster centroids) can replace the actual data.
Sampling
● Sampling is the process of selecting units from a population of interest so
that by studying the sample we may fairly generalize our results back to the
population from which they were chosen
● Types of sampling:
1) Simple random sampling without replacement (SRSWOR): The
sampling units are chosen without replacement in the sense that the units once chosen
are not placed back in the population.
3) Cluster sample: The population is divided into N groups, called clusters. We can
randomly select n clusters to include in the sample.
4) Stratified sampling: We divide the population into separate groups, called strata.
Then, a simple random sample is drawn from each group.
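A minimal pandas sketch of these sampling schemes. The toy population, the "group" column (doing double duty as cluster id and stratum) and the sample sizes are all illustrative; the grouped sampling call needs a reasonably recent pandas:

```python
import pandas as pd

# Toy population; "group" serves as both cluster id and stratum here.
df = pd.DataFrame({"group": list("AAABBBCCCC"), "value": range(10)})

# 1) SRSWOR: units are chosen without replacement (no unit appears twice).
srswor = df.sample(n=4, replace=False, random_state=0)

# 3) Cluster sample: randomly select whole groups, then keep every unit in them.
chosen = df["group"].drop_duplicates().sample(n=2, random_state=0)
cluster_sample = df[df["group"].isin(chosen)]

# 4) Stratified sampling: a simple random sample drawn from each group.
stratified = df.groupby("group").sample(n=2, random_state=0)
print(stratified)
```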
5. Discretization
● Discretization techniques can be used to reduce the number of values for a
given continuous attribute, by dividing the range of the attribute into intervals.
● Interval labels can then be used to replace actual data values.
● Many discretization techniques can be applied recursively in order to
provide a hierarchical partitioning of the attribute values known as concept
hierarchy.
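A short pandas sketch of discretization; the age values, bin edges and concept labels are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([3, 9, 15, 24, 37, 52, 61, 78], name="age")

# Replace numeric values by interval labels (equal-width bins of 10 years).
intervals = pd.cut(ages, bins=range(0, 90, 10))

# Or by conceptual labels, forming one level of a concept hierarchy.
concepts = pd.cut(ages, bins=[0, 18, 60, 100], labels=["youth", "adult", "senior"])
print(pd.concat([ages, intervals.rename("interval"), concepts.rename("concept")], axis=1))
```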
Continued...
➢ Cluster analysis:
A clustering algorithm can be applied to partition the data into clusters or
groups. Each cluster forms a node of a concept hierarchy, where all nodes
are at the same conceptual level. Each cluster may be further decomposed
into sub-clusters, forming a lower level of the hierarchy. Clusters may also
be grouped together to form a higher level of the concept hierarchy.
➢ Data segmentation by “natural partitioning”:
Breaking up annual salaries into natural, easy-to-read ranges like ($50,000-
$100,000) is often more desirable than ranges arrived at by cluster analysis.