UNIT IV
(Data Analytics)
Object Segmentation: Regression vs. Segmentation – Supervised and
Unsupervised Learning, Tree Building – Regression, Classification,
Overfitting, Pruning and Complexity, Multiple Decision Trees, etc. Time
Series Methods: ARIMA, Measures of Forecast Accuracy, STL approach,
Extract features from generated model such as Height, Average Energy, etc.,
and analyze for prediction.
Segmentation:
Segmentation is a methodology that involves dividing a broad
market (items, customers, etc.) into subsets of entities with common
characteristics, i.e., homogeneous groups. Designing and implementing
strategies specific to these segments then makes decision making easier.
Segmentation is used in different areas of Risk Management like credit
risk, operational risk, reserving and investment among others.
Segmentation is often used for modeling Credit risk. Applicants are
segmented based on the estimated credit risk and decisions are made
based on the segment in which the applicant falls.
Supervised Machine Learning:
In Supervised learning, you train the machine using data which is well
"labeled." It means some data is already tagged with the correct answer.
It can be compared to learning which takes place in the presence of a
supervisor or a teacher.
A supervised learning algorithm learns from labeled training data and
helps you predict outcomes for unseen data. Successfully
building, scaling, and deploying an accurate supervised machine learning
model takes time and technical expertise from a team of
highly skilled data scientists. Moreover, data scientists must rebuild
models as the underlying data changes to make sure the insights they
give remain true.
Why Supervised Learning?
Supervised learning allows you to collect data or produce a data
output from previous experience.
It helps you optimize performance criteria using experience.
Decision Trees:
Decision trees are used to solve both classification and regression
problems. The tree is built incrementally by splitting the dataset
into smaller and smaller subsets (on numerical and categorical
attributes), and the results are represented in the leaf nodes.
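As a concrete illustration, here is a minimal sketch of fitting a decision tree classifier; scikit-learn and the iris dataset are choices made for this example, not prescribed by these notes:

# A minimal decision-tree classification sketch (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # labeled training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The tree is grown by repeatedly splitting the data into smaller subsets;
# predictions are read off the leaf nodes.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

Limiting max_depth here is one simple guard against overfitting; pruning and complexity control are discussed later in this unit.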
CHAID:
CHAID (Chi-square Automatic Interaction Detector) analysis is an
algorithm used for discovering relationships between a categorical
response variable and other categorical predictor variables. It is
useful when looking for patterns in datasets with lots of categorical
variables and is a convenient way of summarizing the data as the
relationships can be easily visualized.
In practice, CHAID is often used in direct marketing to understand how
different groups of customers might respond to a campaign based on their
characteristics. So suppose, for example, that we run a marketing
campaign and are interested in understanding what customer
characteristics (e.g., gender, socio-economic status, geographic
location, etc.) are associated with the response rate achieved. We build
a CHAID “tree” showing the effects of different customer characteristics
on the likelihood of response.
Regression Trees:
A regression tree refers to an algorithm in which the target variable is
continuous and the tree is used to predict its value. As an example of a
regression-type problem, you may want to predict the selling price of a
residential house, which is a continuous dependent variable.
This will depend on continuous factors like square footage as well as
categorical factors like the style of the home, the area in which the
property is located, and so on.
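A minimal regression-tree sketch for this house-price example, using made-up data (square footage plus a style category encoded as an integer; all numbers here are illustrative, not real housing data):

# Regression tree on a continuous target (scikit-learn assumed).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3500, size=200)
style = rng.integers(0, 3, size=200)        # hypothetical encoding: 0=ranch, 1=colonial, 2=condo
price = 50_000 + 120 * sqft + 15_000 * style + rng.normal(0, 10_000, size=200)

X = np.column_stack([sqft, style])
reg = DecisionTreeRegressor(max_depth=4, random_state=0)
reg.fit(X, price)

# Predict the selling price of a 2000 sq ft colonial-style house.
print(reg.predict([[2000, 1]]))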
1. Entropy:
Entropy is a measure of the impurity of a node, used for categorical
target variables. If a node contains items of classes i = 1, 2, ..., m and
pi is the fraction of items of class i, then Entropy = -Σ pi log2(pi).
Entropy is zero for a pure node and largest when the classes are evenly
mixed; the split that produces the largest reduction in entropy (the
information gain) is chosen.
2. Reduction in Variance:
So far, we have discussed algorithms for a categorical target variable.
Reduction in variance is an algorithm used for continuous target
variables (regression problems). It uses the standard formula of
variance, Variance = Σ(X - X̄)^2 / n, where X̄ is the mean and n is the
number of values, to choose the best split. The split with the lower
(weighted) variance across its sub-nodes is selected as the criterion to
split the population, as in the sketch below.
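A short sketch of the reduction-in-variance criterion; the helper name and the sample values are illustrative:

# For a candidate binary split, compute the weighted variance of the two
# child nodes; the split with the lowest value is preferred.
import numpy as np

def weighted_child_variance(y_left, y_right):
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)

y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
# Candidate split A separates the low and high values cleanly.
print(weighted_child_variance(y[:3], y[3:]))     # low weighted variance -> good split
# Candidate split B mixes them.
print(weighted_child_variance(y[::2], y[1::2]))  # high weighted variance -> poor split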
3. Gini Index:
Gini says that if we select two items from a population at random, then
they must be of the same class, and the probability of this is 1 if the
population is pure.
1. It works with a categorical target variable such as "Success" or "Failure".
2. It performs only binary splits.
3. The higher the value of Gini, the higher the homogeneity.
4. CART (Classification and Regression Tree) uses the Gini method to
create binary splits.
Steps to Calculate Gini for a split
1. Calculate Gini for the sub-nodes, using the formula: sum of squares
of the probabilities of success and failure (p^2 + q^2).
2. Calculate Gini for the split using the weighted Gini score of each
node of that split.
You might often come across the term 'Gini Impurity', which is
determined by subtracting the Gini value from 1. So mathematically we
can say,
Gini Impurity = 1 - Gini
To compute Gini impurity for a set of items, suppose i ∈ {1, 2, ..., m},
and let fi be the fraction of items labeled with value i in the set. Then
Gini Impurity = 1 - (f1^2 + f2^2 + ... + fm^2).
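A sketch of the two Gini steps above; the node sizes and success counts are illustrative:

# Per-node Gini (p^2 + q^2) and the weighted Gini of a binary split;
# Gini impurity is 1 - Gini.
def gini_node(p_success):
    q = 1.0 - p_success
    return p_success ** 2 + q ** 2

def gini_split(n_left, p_left, n_right, p_right):
    n = n_left + n_right
    return (n_left / n) * gini_node(p_left) + (n_right / n) * gini_node(p_right)

# Example: left node has 10 items with 8 successes, right node 20 items with 5.
g = gini_split(10, 8 / 10, 20, 5 / 20)
print("weighted Gini:", g, "impurity:", 1 - g)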
4. Chi-Square:
It is an algorithm to find the statistical significance of the
differences between sub-nodes and the parent node. We measure it by the sum of
squares of the standardized differences between observed and expected
frequencies of the target variable.
1. It works with a categorical target variable such as "Success" or "Failure".
2. It can perform two or more splits.
3. The higher the value of Chi-square, the higher the statistical
significance of the differences between the sub-node and the parent node.
4. The Chi-square of each node is calculated using the formula:
Chi-square = ((Actual - Expected)^2 / Expected)^(1/2)
5. It generates the tree called CHAID (Chi-square Automatic
Interaction Detector).
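A sketch of the Chi-square value for a candidate split, following the per-node formula above. The counts are illustrative, and taking the expected frequencies from the parent node's success rate is an assumption about the exact bookkeeping:

# Sum sqrt((Actual - Expected)^2 / Expected) over Success and Failure for
# each sub-node, then over all sub-nodes of the split.
import math

def chi_square_split(children, parent_success_rate):
    """children: list of (actual_success, actual_failure) per sub-node."""
    total_chi = 0.0
    for succ, fail in children:
        n = succ + fail
        exp_succ = n * parent_success_rate        # expected counts under the
        exp_fail = n * (1 - parent_success_rate)  # parent's success rate
        total_chi += math.sqrt((succ - exp_succ) ** 2 / exp_succ)
        total_chi += math.sqrt((fail - exp_fail) ** 2 / exp_fail)
    return total_chi

# Example: parent has a 50% success rate; two sub-nodes after a candidate split.
print(chi_square_split([(8, 2), (5, 15)], parent_success_rate=0.5))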
Components of a Time Series:
A time series has four components:
Trend
Seasonal Variations
Cyclic Variations
Random or Irregular movements
Seasonal and cyclic variations are the periodic changes or short-term
fluctuations.
Long term trend – The smooth long-term direction of a time series,
where the data can increase or decrease in some pattern.
Seasonal variation – Patterns of change in a time series within a
year which tend to repeat every year.
Cyclical variation – It is much like seasonal variation, but the
rise and fall of the time series occur over periods longer than one year.
Irregular variation – Any variation that is not explainable by
any of the three components mentioned above. It can be
classified into stationary and non-stationary variation.
Time series models can be simulated, estimated from data, and used to
produce forecasts of future behavior.
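A minimal sketch of estimating a time series model from data and producing a forecast; statsmodels, the simulated series, and the ARIMA(1, 1, 1) order are all assumptions made for this example:

# Fit an ARIMA model and forecast future behavior.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.5, 1.0, size=120))   # simulated upward-trending series

model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.forecast(steps=12))                 # forecast the next 12 periods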
White Noise:
A series is called white noise if it is purely random in nature. Let {εt}
denote such a series; then it has zero mean [E(εt) = 0], a constant
variance [V(εt) = σ²], and is uncorrelated [ρk = 0 for all lags k ≠ 0].
The scatter plot of such a series across time will indicate no pattern,
and hence forecasting the future values of such a series is not possible.
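A small simulation sketch of white noise, checking the three properties above; sigma is an arbitrary choice:

# White noise: zero mean, constant variance sigma^2, no correlation across time.
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
eps = rng.normal(0.0, sigma, size=500)

print("mean ~ 0:", eps.mean())
print("variance ~ sigma^2:", eps.var())
# Lag-1 autocorrelation ~ 0, confirming the series is uncorrelated.
print("lag-1 autocorrelation:", np.corrcoef(eps[:-1], eps[1:])[0, 1])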
Measures of Forecast Accuracy:
1. Mean Forecast Error (MFE)
The mean forecast error is the average of the forecast errors,
MFE = Σ (At - Ft) / n, where At is the actual value and Ft the forecast
for period t.
Ideal value = 0;
MFE > 0: the model tends to under-forecast.
MFE < 0: the model tends to over-forecast.
While MFE is a measure of forecast model bias, MAD indicates the
absolute size of the errors.
Uses of Forecast error:
Forecast model bias
Absolute size of the forecast errors
Compare alternative forecasting models
Identify forecast models that need adjustment
2. Mean Absolute Deviation (MAD)
It is also called MAD for short, and it is the average of the absolute
values of the differences between the actual values and the forecast
values; it is used for the calculation of demand variability. For n time
periods where we have actual demand and forecast values, it is expressed
by the following formula:
MAD = Σ |At - Ft| / n
Where:
n is the number of fitted points,
At is the actual value,
Ft is the forecast value,
Σ is summation notation (the absolute value is summed for every
forecasted point in time).
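A sketch computing MFE and MAD from actuals At and forecasts Ft, following the formulas above; the numbers are illustrative:

# MFE measures bias; MAD measures the absolute size of the errors.
import numpy as np

actual = np.array([100.0, 110.0, 105.0, 120.0])
forecast = np.array([98.0, 112.0, 100.0, 115.0])

errors = actual - forecast
mfe = errors.mean()           # > 0 here, so the model tends to under-forecast
mad = np.abs(errors).mean()   # average absolute error

print("MFE:", mfe, "MAD:", mad)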
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage,
and especially in data warehousing, that:
extracts data from homogeneous or heterogeneous data sources;
transforms the data into the proper format or structure for querying
and analysis; and
loads it into the final target (database or data warehouse).
The three phases can run in parallel: while data is being extracted,
transformation of already-received data can proceed, and
the data loading kicks off without waiting for the completion of the
previous phases.
ETL systems commonly integrate data from multiple applications
(systems), typically developed and supported by different vendors or
hosted on separate computer hardware.
The disparate systems containing the original data are frequently
managed and operated by different employees. For example, a cost
accounting system may combine data from payroll, sales, and
purchasing.
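A toy end-to-end ETL sketch along these lines, combining two hypothetical source files into one SQLite target; the file names, columns, and table schema are illustrative inventions, and the source files are assumed to exist:

# Extract rows from two source systems, transform them into one
# cost-accounting schema, and load them into a warehouse table.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows, source):
    # Normalize the disparate source schemas into one record layout.
    return [(source, r["employee"], float(r["amount"])) for r in rows]

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS costs (source TEXT, employee TEXT, amount REAL)")
for source in ("payroll.csv", "purchasing.csv"):
    conn.executemany("INSERT INTO costs VALUES (?, ?, ?)",
                     transform(extract(source), source))
conn.commit()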