DA-Unit-III
Regression
Notice that the line is as close as possible to all the scattered data points.
This is what an ideal best fit line looks like.
To better understand the whole process, let's see how to calculate this line using Least Squares Regression.
Steps to calculate the Line of Best Fit
To start constructing the line that best depicts the relationship between the variables in the data, we first need to get our basics right. Take a look at the equation below:
y = mx + c
Surely, you've come across this equation before. It is a simple equation that represents a straight line in two dimensions, i.e. on the x-axis and y-axis. To better understand this, let's break down the equation:
y: dependent variable
m: the slope of the line
x: independent variable
c: y-intercept
So the aim is to calculate the values of the slope and the y-intercept, and then substitute the corresponding 'x' values into the equation in order to derive the values of the dependent variable.
Let’s see how this can be done.
As an assumption, let’s consider that there are ‘n’ data points.
Step 1: Calculate the slope 'm' by using the following formula:
m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
Step 2: Compute the y-intercept 'c' (the value of y at the point where the line crosses the y-axis):
c = (Σy − m Σx) / n
Let us use the concept of least squares regression to find the line of best fit
for the above data.
Step 1: Calculate the slope 'm' using the formula above, and Step 2: calculate the y-intercept 'c'. Once you substitute the values, the slope works out to m = 1.518 and the y-intercept to c = 0.305.
Let's construct a graph that represents the resulting line of best fit, y = 1.518x + 0.305:
Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at the retail shop:
y = 1.518 x 8 + 0.305 = 12.45 T-shirts
That comes to roughly 12 to 13 T-shirts! That's how simple it is to make predictions using Linear Regression.
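As a quick sketch in R, the same two steps can be carried out directly; the x and y vectors below are an assumption, chosen only to be consistent with the slope (1.518) and intercept (0.305) quoted above.
x <- c(2, 3, 5, 7, 9)    # T-shirt price in dollars (assumed values)
y <- c(4, 5, 7, 10, 15)  # T-shirts sold (assumed values)
n <- length(x)

m  <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)  # slope
c0 <- (sum(y) - m * sum(x)) / n                                       # y-intercept

m            # approximately 1.518
c0           # approximately 0.305
m * 8 + c0   # predicted sales at a price of $8, approximately 12.45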
Now let’s try to understand based on what factors can we confirm that the
above line is the line of best fit.
The least squares regression method works by making the sum of the squared errors as small as possible, hence the name least squares. Basically, the distance between the line of best fit and each data point (i.e., the error) must be minimized as much as possible. This is the basic idea behind the least squares regression method.
A few things to keep in mind before implementing the least squares regression method are:
The data should be free of outliers, because outliers can lead to a biased and misleading line of best fit.
The line of best fit can be refined iteratively until you get a line with the minimum possible sum of squared errors.
The least squares approach can also be extended to fit non-linear curves to the data.
Technically, the difference between the actual value of 'y' and the predicted value of 'y' is called the residual (it denotes the error).
Suppose the data consists of n points (x1, y1), (x2, y2), ..., (xn, yn), where all x's are independent variables and all y's are dependent ones. Also, suppose that f(x) is the fitting curve and d represents the error or deviation of the curve from each given point.
Now, we can write:
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn – f(xn)
The least squares principle states that the best-fitting curve is the one for which the sum of the squares of all the deviations from the given values is a minimum, i.e.
E = d1² + d2² + ... + dn² = Σ (yi − f(xi))² is minimum.
The general procedure for building a regression model can be summarized in the following steps:
1. Choosing the predictor variables and response variable on which to collect the
data.
2. Collecting data. You may be using data that already exists (retrospective), or you
may be conducting an experiment during which you will collect data (prospective).
Note that this step is important in determining the researcher’s ability to claim
‘association’ or ‘causality’ based on the regression model.
3. Exploring the data.
• check for data errors and missing values.
• study the bivariate relationships to reveal outliers and influential observations, examine the relationships among variables, identify possible multicollinearity, and suggest possible transformations.
4. Dividing the data into a model-building set and a model-validation set:
• The training set is used to estimate the model.
• The validation set is later used for cross-validation of the selected model.
5. Identify several candidate models:
• Use best subsets regression.
• Use stepwise regression.
6. Evaluate the selected models for violation of the model conditions. The checks below may be performed visually via residual plots as well as with formal statistical tests (a short R sketch follows this list).
• Check the linearity condition.
• Check for normality of the residuals.
• Check for constant variance of the residuals.
• After time-ordering your data (if appropriate), assess the independence of
the observations.
• Assess the overall goodness-of-fit of the model. If the above checks turn out to be unsatisfactory, then modifications to the model may be needed (such as a different functional form). Regardless, checking the assumptions of your model as well as the model's overall adequacy is usually accomplished through residual diagnostic procedures.
7. Select the final model:
• Compare the competing models by cross-validating them against the
validation data. Remember, there is not necessarily only one good model for
a given set of data. There may be a few equally satisfactory models.
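As a minimal sketch of steps 4 to 7 in R, the data frame, variable names and the 70/30 split below are illustrative assumptions, not part of any particular example in these notes.
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 + 1.5 * dat$x1 - 0.8 * dat$x2 + rnorm(100)   # simulated data

idx        <- sample(nrow(dat), 0.7 * nrow(dat))
training   <- dat[idx, ]    # model-building (training) set
validation <- dat[-idx, ]   # model-validation set

fit <- lm(y ~ x1 + x2, data = training)   # one candidate model
summary(fit)                              # coefficients, R-squared

par(mfrow = c(2, 2))
plot(fit)   # residual plots: linearity, normality, constant variance, influence

pred <- predict(fit, newdata = validation)
sqrt(mean((validation$y - pred)^2))       # RMSE on the validation set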
Logistic Regression
Regression models traditionally work with continuous numeric value data for
dependent and independent variables. Logistic regression models can, however,
work with dependent variables with binary values, such as whether a loan is
approved (yes or no). Logistic regression measures the relationship between a
categorical dependent variable and one or more independent variables. For
example, Logistic regression might be used to predict whether a patient has a given disease based on observed characteristics of the patient.
Logistic regression assumes that the response variable follows a binomial distribution, which requires that:
1. There must be a fixed number of trials denoted by n, i.e. in the data set, there must be a fixed number of rows.
2. Each trial can have only two outcomes; i.e., the response variable can have
only two unique categories.
3. The outcome of each trial must be independent of each other; i.e., the
unique levels of the response variable must be independent of each other.
4. The probability of success (p) and failure (q) should be the same for each
trial.
Let's understand how Logistic Regression works. For Linear Regression,
where the output is a linear combination of input feature(s), we write the equation
as:
`Y = β₀ + β₁X + ε`
In Logistic Regression, we use the same equation but with some
modifications made to Y. Let's reiterate a fact about Logistic Regression: we
calculate probabilities. And, probabilities always lie between 0 and 1. In other
words, we can say:
1. The response value must be positive.
2. It should be lower than 1.
First, we'll meet the above two criteria. We know that the exponential of any value is always a positive number, and any number divided by that number plus 1 will always be less than 1. Let's implement these two findings. If p denotes the probability that Y = 1, then taking
p = exp(β₀ + β₁X)
makes the response positive, and dividing by exp(β₀ + β₁X) + 1 keeps it below 1:
p = exp(β₀ + β₁X) / (exp(β₀ + β₁X) + 1)
Rearranging this expression gives
log(p / (1 − p)) = β₀ + β₁X
As you might recognize, the right side of the equation immediately above is the linear combination of independent variables. The left side is known as the log-odds or logit function, and it is the link function for Logistic Regression. This link function follows a sigmoid curve (sketched below), which limits the range of probabilities to between 0 and 1.
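As a small illustration, the sigmoid can be written and plotted in R; the coefficient values below are arbitrary assumptions chosen only to show the shape of the curve.
sigmoid <- function(z) exp(z) / (1 + exp(z))   # equivalently 1 / (1 + exp(-z))

b0 <- -1; b1 <- 0.8          # assumed coefficients
x  <- seq(-6, 6, by = 0.1)
p  <- sigmoid(b0 + b1 * x)   # probabilities, always between 0 and 1

plot(x, p, type = "l", ylab = "P(Y = 1)", main = "Sigmoid / logistic curve")
# The inverse transformation recovers the linear predictor (the log-odds):
all.equal(log(p / (1 - p)), b0 + b1 * x)   # TRUE, up to floating-point error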
Null and Residual Deviance
Null deviance is calculated from the model with no features, i.e., only the intercept. The null model predicts the class via a constant probability.
Residual deviance is calculated from the model having all the features. In comparison with Linear Regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS). The larger the difference between the null and residual deviance, the better the model.
You can also use these metrics to compare multiple models: the lower the residual deviance, the better the model explains the data. Practically, AIC is always given preference over deviance to evaluate model fit.
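A brief sketch of where these quantities appear in R; the data here are simulated purely for illustration.
set.seed(42)
df <- data.frame(x = rnorm(200))
df$y <- rbinom(200, 1, plogis(0.5 + 1.2 * df$x))

fit_null <- glm(y ~ 1, family = binomial, data = df)  # intercept-only (null) model
fit_full <- glm(y ~ x, family = binomial, data = df)  # model with the feature

fit_full$null.deviance   # same value as deviance(fit_null)
fit_full$deviance        # residual deviance: lower is better
AIC(fit_full)            # usually preferred for comparing model fit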
3. Confusion Matrix
The confusion matrix is the most crucial metric commonly used to evaluate classification models. It can look confusing at first, but it is worth understanding thoroughly. The skeleton of a confusion matrix looks like this:

                        Predicted: 1              Predicted: 0
Actual: 1               True Positive (TP)        False Negative (FN)
Actual: 0               False Positive (FP)       True Negative (TN)
As you can see, the confusion matrix avoids "confusion" by tabulating the actual and predicted values. In the table above, the positive class = 1 and the negative class = 0. Following are the metrics we can derive from a confusion matrix:
Accuracy - It determines the overall predictive accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives).
True Positive Rate (TPR) - It indicates how many positive values, out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is TP / (TP + FN). Also, TPR = 1 - False Negative Rate. It is also known as Sensitivity or Recall.
False Positive Rate (FPR) - It indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is FP / (FP + TN). Also, FPR = 1 - True Negative Rate.
True Negative Rate (TNR) - It indicates how many negative values, out of all the negative values, have been correctly predicted. The formula to calculate the true negative rate is TN / (TN + FP). It is also known as Specificity.
False Negative Rate (FNR) - It indicates how many positive values, out of all the positive values, have been incorrectly predicted. The formula to calculate the false negative rate is FN / (FN + TP).
Precision - It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as TP / (TP + FP).
F Score - The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as 2 * ((precision * recall) / (precision + recall)). These metrics are illustrated in the short sketch below.
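A minimal sketch of these formulas in R, using made-up counts for the four cells of the confusion matrix.
TP <- 50; FN <- 10; FP <- 5; TN <- 35   # illustrative counts only

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
tpr       <- TP / (TP + FN)             # sensitivity / recall
fpr       <- FP / (FP + TN)
tnr       <- TN / (TN + FP)             # specificity
fnr       <- FN / (FN + TP)
precision <- TP / (TP + FP)
f_score   <- 2 * (precision * tpr) / (precision + tpr)

c(accuracy = accuracy, TPR = tpr, FPR = fpr, precision = precision, F = f_score)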
4. Receiver Operating Characteristic (ROC)
ROC assesses the accuracy of a classification model at a user-defined threshold value, and the model's overall accuracy is summarized using the Area Under the Curve (AUC). The area under the curve, also referred to as the index of accuracy (A) or the concordance index, represents the performance of the ROC curve: the higher the area, the better the model. The ROC curve is plotted with the True Positive Rate on the Y axis and the False Positive Rate on the X axis. Our aim is to push the curve toward the top-left corner and maximize the area under it; the higher the curve, the better the model. The diagonal reference line corresponds to performance at a 0.5 threshold, at which point sensitivity = specificity.
Cons
Logistic regression can suffer from complete separation. If there is a feature
that would perfectly separate the two classes, the logistic regression model
can no longer be trained. This is because the weight for that feature would
not converge, because the optimal weight would be infinite. This is rather unfortunate, because such a feature can be very useful. But you do not need machine learning if you have a simple rule that separates both classes.
The problem of complete separation can be solved by introducing
penalization of the weights or defining a prior probability distribution of
weights.
Logistic regression is less prone to overfitting, but it can still overfit on high-dimensional datasets; in such cases, regularization techniques should be considered.
The data which has been used is Bankloan. The dataset has 850 rows and 9
columns. (age, education, employment, address, income, debtinc, creddebt,
othdebt, default). The dependent variable is default (Defaulted and Not Defaulted).
Let’s first load and check the head of data.
bankloan<-read.csv("bankloan.csv")
head(bankloan)
Now, making the subset of the data with 700 rows.
mod_bankloan <- bankloan[1:700,]
Setting a seed of 500, so that the random sampling below is reproducible:
set.seed(500)
Next, let's draw a random sample of 500 of the 700 rows to use as the training data.
train<-sample(1:700, 500, replace=FALSE)
Creating training as well as testing data.
trainingdata<- mod_bankloan[train,]
testingdata<- mod_bankloan[-train,]
Now, let’s fit the model. Be sure to specify the parameter family=binomial in the
glm() function.
model1<-glm(default~., family=binomial(link="logit"), data=trainingdata)
summary(model1)
The summary will also include the significance level of all the variables. If the p-value is less than 0.05, then the variable is significant. We can also remove the insignificant variables to make our model more accurate.
In our model, only age, employment, address and creddebt seem to be significant. So, let's build another model with only these variables.
model12<-glm(default~age+employ+address+creddebt, family=binomial(link="logit"), data=trainingdata)
pred2<-predict(model12, newdata=testingdata, type="response")
predicted_class2<-ifelse(pred2<0.5, "Defaulted", "Not Defaulted")
table(testingdata$default, predicted_class2)
err_rate<-1-sum(testingdata$default==predicted_class2)/200
err_rate
0.31
Here the error rate is 31%.
Now, we can plot this as a Receiver Operating Characteristic curve (commonly known as a ROC curve). In R, it can be done with the ROCR package.
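A plausible sketch with ROCR is shown below, continuing from the pred2 probabilities and testingdata above; treating "Not Defaulted" as the positive class (coded 1) is an assumption about how glm() orders the factor levels.
library(ROCR)   # install.packages("ROCR") if needed

labels <- ifelse(testingdata$default == "Not Defaulted", 1, 0)  # assumed coding
rocr_pred <- prediction(pred2, labels)
rocr_perf <- performance(rocr_pred, measure = "tpr", x.measure = "fpr")

plot(rocr_perf, colorize = TRUE)          # ROC curve
abline(a = 0, b = 1, lty = 2)             # diagonal reference line
performance(rocr_pred, "auc")@y.values    # area under the curve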
The ROC curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of the actual 1's as positives and fewer of the actual 0's as 1's. The area under the curve, known as the index of accuracy, is a performance metric for the curve: the higher the area under the curve, the better the predictive power of the model.
Regression Modeling
Regression modeling or analysis is a statistical process for estimating the
relationships among variables. It includes many techniques for modeling and
analyzing several variables, when the focus is on the relationship between a
dependent variable and one or more independent variables (or 'predictors').
Understanding the influence of changes in the independent variables on the dependent variable:
More specifically, regression analysis helps one understand how the typical
value of the dependent variable (or 'criterion variable') changes when any one of the
independent variables is varied, while the other independent variables are held
fixed. Most commonly, regression analysis estimates the conditional expectation of
the dependent variable given the independent variables, i.e the average value of the
dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of
the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables, called the regression function. Typical applications of regression include:
• Spam detection
• Weather forecasting
• Predicting housing prices based on the prevailing market price
• Stock price predictions, among others
Decision Trees
Example: Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during the leisure period. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among all three.
This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the best homogeneous sets of students (which are heterogeneous to each other). In this example, the variable Gender identifies the best homogeneous sets compared to the other two variables.
As mentioned above, a decision tree identifies the most significant variable, and the value of that variable, that gives the best homogeneous sets of the population. Now the question that arises is: how does it identify the variable and the split? To do this, decision trees use various algorithms, which are discussed later; a short sketch in R is shown below.
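As an illustration of how such a split can be found in practice, the rpart package can fit a small classification tree; the data frame below is simulated to resemble the student example, not the actual 30-student sample.
library(rpart)

set.seed(7)
students <- data.frame(
  Gender = sample(c("Boy", "Girl"), 30, replace = TRUE),
  Class  = sample(c("IX", "X"),     30, replace = TRUE),
  Height = runif(30, 5, 6)
)
# Make cricket depend mostly on Gender so the tree picks it for the first split
students$Cricket <- factor(ifelse(students$Gender == "Boy",
                                  rbinom(30, 1, 0.8), rbinom(30, 1, 0.2)))

tree <- rpart(Cricket ~ Gender + Class + Height, data = students,
              method = "class", minsplit = 5)
print(tree)   # shows which variable gives the most homogeneous split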
Types of Decision Tree
The type of decision tree is based on the type of target variable we have. It can be of two types:
1. Binary Variable Decision Tree: A decision tree that has a binary target variable is called a binary variable decision tree. Example: in the above student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree that has a continuous target variable is called a continuous variable decision tree.
Example: Let's say we have a problem of predicting whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the income of the customer is a significant variable, but the insurance company does not have income details for all customers. Since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values of a continuous variable.
Terminology related to Decision Trees:
Let’s look at the basic terminology used with Decision trees:
ROOT Node: It represents the entire population or sample, and it gets further divided into two or more homogeneous sets.
SPLITTING: It is the process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, this process is called pruning. It can be seen as the opposite of splitting.
Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As we know that every algorithm
has advantages and disadvantages, below I am discussing some of these for decision
trees.
Advantages:
1. Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret, and its graphical representation is very intuitive, so users can easily relate it to their hypotheses.
2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. It can also be used in the data exploration stage: for example, when we are working on a problem with information available in hundreds of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques, and it is fairly robust to outliers and missing values.
4. Data type is not a constraint: It can handle both numerical and categorical
variables.
5. Non-parametric method: The decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.
Disadvantages:
1. Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem is addressed by pruning the tree (discussed below) or by using random forests.
2. Not fit for continuous variables: While working with continuous numerical variables, the decision tree loses information when it categorizes the variables into different bins.
There are two common approaches to pruning a tree to avoid overfitting:
Pre-pruning, which stops growing the tree earlier, before it perfectly classifies the training set.
Post-pruning, which allows the tree to perfectly classify the training set and then prunes the tree back.
Practically, the second approach, post-pruning overfit trees, is more successful because it is not easy to estimate precisely when to stop growing the tree; a small post-pruning sketch in R follows the example below.
For example, if the error rate at a parent node is 0.46 and the error rate for its children after the split (0.51) increases, we do not want to keep the children. Likewise, if we require that the probability be less than a limit (e.g., 0.05), we may decide not to split the node.
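A minimal post-pruning sketch in R with rpart, continuing from the hypothetical students data frame of the earlier sketch; the overgrown-tree settings and the choice of complexity parameter are illustrative.
library(rpart)

# Grow an intentionally overgrown tree, then prune it back using the
# complexity parameter (cp) chosen from the cross-validation results.
full_tree <- rpart(Cricket ~ ., data = students, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))
printcp(full_tree)   # cross-validated error for each cp value

best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = best_cp)   # keep only worthwhile splits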
Time Series
A time series is a set of statistics, usually collected at regular intervals. Time
series data occur naturally in many application areas.
economics - e.g., monthly data for unemployment, hospital admissions, etc.
finance - e.g., daily exchange rate, a share price, etc.
environmental - e.g., daily rainfall, air quality readings.
medicine - e.g., EEG brain wave activity sampled every 2⁻⁸ secs
The methods of time series analysis pre-date those for general stochastic processes and Markov chains. The aims of time series analysis are to describe and summarise time series data, fit low-dimensional models, and make forecasts.
Components of Time Series
Long term trend – The smooth long-term direction of a time series, where the data can increase or decrease in some pattern.
Seasonal variation – Patterns of change in a time series within a year which tend to repeat every year.
Cyclical variation – It is much like seasonal variation, but the rise and fall of the time series occur over periods longer than one year.
Irregular variation – Any variation that is not explainable by any of the three components mentioned above. It can be classified into stationary and non-stationary variation.
When the data neither increases nor decreases, i.e. it is completely random, it is called stationary variation.
When the data has some explainable portion remaining and can be analyzed further, the case is called non-stationary variation.
Mean Forecast Error (MFE) is the average difference between the actual values and the forecasts. Ideal value = 0; if MFE > 0, the model tends to under-forecast, and if MFE < 0, the model tends to over-forecast.
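A tiny sketch of the calculation, with made-up actual and forecast vectors.
actual <- c(100, 105, 98, 110)   # illustrative observed values
fcst   <- c( 97, 104, 96, 105)   # illustrative forecasts

mfe <- mean(actual - fcst)
mfe   # positive here, so this model tends to under-forecast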
STL Model
A time series can be divided into three components: the trend, the seasonality, and the error or residuals of the model.
The STL model is a deterministic model that allows these components to be calculated separately using different methods. It estimates the behavior of the trend using a LOESS regression and, in turn, calculates the seasonal component by selecting one of several models, usually one of two: a seasonal ARIMA model or an ETS model. The main difference between the STL model and the others is that, by estimating the trend with LOESS, it is extremely flexible with respect to changes in the trend of the series, unlike linear regression, which assumes that the series maintains the same constant trend.
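As a quick sketch of such a decomposition in R, base R's stl() splits a series into trend, seasonal and remainder components; the built-in AirPassengers series is used here only as a stand-in for the wine-sales data discussed later.
# STL decomposition sketch: LOESS trend, seasonal and remainder components.
fit_stl <- stl(log(AirPassengers), s.window = "periodic")
head(fit_stl$time.series)   # columns: seasonal, trend, remainder
plot(fit_stl)               # one panel per component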
Trend:
As mentioned previously, the STL model calculates the trend from a LOESS regression. LOESS combines the simplicity of linear least squares regression with the flexibility of non-linear regression by fitting simple models on local subsets of the data to create a function that describes the deterministic part of the variation in the data, point by point. In fact, one of the main attractions of this method is that it is not necessary to specify a global function to fit a model to the data. In return, greater computing power is needed.
Because it is so computationally intensive, LOESS would have been practically impossible to use at the time when least squares regression was developed. Most of the other modern methods for process modeling are similar to LOESS in this respect: they have been consciously designed to use our current computing capacity to achieve objectives not easily achieved by traditional methods.
The key parameter for the LOESS estimation is the span, which controls the degree of smoothing of the series. Higher smoothing values produce smoother functions that respond less to fluctuations in the data, while the smaller the span, the more closely the regression function follows the data. Using too small a value of the smoothing parameter is not desirable, because the regression function will begin to capture the random error in the data. Useful values of the smoothing parameter are generally in the range of 0.25 to 0.5 for most LOESS applications. As an example of this smoothing difference, we will use different values of span for the same regression, in order to compare the results, using the following code:
#Estimation:
loessMod10 <- loess(Sales ~ Period, data=Train, span=0.10)  # 10% smoothing span
loessMod25 <- loess(Sales ~ Period, data=Train, span=0.25)  # 25% smoothing span
loessMod50 <- loess(Sales ~ Period, data=Train, span=0.50)  # 50% smoothing span
loessMod75 <- loess(Sales ~ Period, data=Train, span=0.75)  # 75% smoothing span
We save the results of the predictions in data frames that allow us to plot each prediction as a comparison along with the actual training data. To perform the LOESS regression estimation, it was necessary to select an explanatory variable and an explained variable. Since this is a time series, we use as the explanatory variable the dummy variable we created with the name Period, and the variable to explain is the level of wine sales. These variables were selected in this order because we seek to find the relationship (or, in this case, the effect) that time has on the level of wine sales.
#Predictions:
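# Sketch: these predictions assume the Train data frame and the four loess fits above.
smoothed10 <- predict(loessMod10)
smoothed25 <- predict(loessMod25)
smoothed50 <- predict(loessMod50)
smoothed75 <- predict(loessMod75)

comparison <- data.frame(Period = Train$Period, Actual = Train$Sales,
                         Span10 = smoothed10, Span25 = smoothed25,
                         Span50 = smoothed50, Span75 = smoothed75)

plot(Train$Period, Train$Sales, type = "l", xlab = "Period", ylab = "Sales")
lines(Train$Period, smoothed25, col = "red")   # one of the smoothed fits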
We can compare the different span values separately. Part of the work of the data scientist is to find the value that gives the best estimation among the different models, and in this way avoid problems of overfitting or underfitting. We will therefore seek to minimize the estimation error across different span values for the series. To achieve this, we will use the loess.as function from the fANCOVA package.
The loess.as function selects the optimal smoothing value using one of two methods: the bias-corrected Akaike information criterion (aicc) or generalized cross-validation (gcv). The code to calculate the optimal span value is sketched below:
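# Sketch of the optimal-span selection with fANCOVA::loess.as(); the exact call
# used for the series is an assumption based on the variables defined above.
library(fANCOVA)

opt <- loess.as(Train$Period, Train$Sales, degree = 1,
                criterion = "aicc", plot = TRUE)
opt$pars$span   # optimal smoothing value (used later as t.window = 0.7906048)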
The analysis of the residuals lets us draw interesting and useful conclusions for the general analysis of the series. In the first place, the series presents a seasonal behavior: the residuals rise and fall in specific periods of time. This is not surprising, since we are so far assuming that the series is composed only of the trend component.
Secondly, the residuals present a distribution different from the normal, since there is a pronounced peak relative to the Gaussian bell curve in the plot. According to these results, it is therefore necessary to also estimate the seasonal component.
Trend + Seasonal:
The stlf function allows the seasonal component to be calculated by selecting a model that handles this specific task. The most common options are the ARIMA model and the ETS model. Both models facilitate the calculation of seasonality once the trend has already been obtained (which was already calculated from LOESS). To be sure that the appropriate model is selected to model the behavior of the seasonal component, both the Akaike criterion and the RMSE of the two models will be compared, and we will select the one that best suits our purposes.
As a first step, it is necessary to define the training series as a time series with a periodicity of 12 (since we are considering a monthly seasonality that repeats year after year). For the 12-month forecast of the series, we select s.window = 12, because we are looking for the behavior of the seasonal component with a periodicity of 12 months. In turn, since the optimal span value for the trend estimation has already been calculated, it is passed to the model through the t.window argument. We will start by making the forecast with the ETS model.
Ts<-ts(Train$Sales, freq=12)
ForecastEts<-forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048, method = c("ets"))
ForecastEts[["model"]][["aic"]]
We can check that the Akaike information selection criterion tells us that the
value is 3333. Now, we perform the same process that was done, but changing to an
ARIMA model with the following code:
ForecastArima<-forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048, method = c("arima"))
ForecastArima[["model"]][["aic"]]
## [1] 2944.866
Applying the selection criterion to the proposed ARIMA model, we find that, according to this criterion, the ARIMA model is better for forecasting the series than the ETS model. Now we will contrast this with the RMSE and verify whether the ARIMA model also has greater predictive power than the ETS model. For this we will use the following code:
ForecastArimaMean<-as.numeric(ForecastArima$mean)
TestSales<-as.numeric(Test$Sales)
A<-data.frame(forecast::accuracy(ForecastArimaMean, TestSales))
ForecastEtsMean<-as.numeric(ForecastEts$mean)
B<-data.frame(forecast::accuracy(ForecastEtsMean, TestSales))
The model that gave the best results was the one with the seasonal component calculated from the ETS model, with an RMSE lower than that of the ARIMA model. In this way, we proceed to make the forecast of the series using the ETS model for the seasonal component.
The comparison between the predictions of the series and the real data is made using code such as the following sketch:
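# Sketch of the comparison plot, assuming the ForecastEtsMean and TestSales
# vectors defined above (the original plotting code is a plausible reconstruction).
plot(TestSales, type = "l", col = "black",
     xlab = "Month", ylab = "Sales", main = "Forecast vs. actual")
lines(ForecastEtsMean, col = "blue", lty = 2)   # ETS-based forecast
legend("topleft", legend = c("Actual", "ETS forecast"),
       col = c("black", "blue"), lty = c(1, 2))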
To perform the analysis of the residuals we use, in the same way, the following code:
forecast::checkresiduals(ForecastEts$residuals)
We can now observe that the residuals behave closer to normal than in the previously proposed model, since the graph shows no values far beyond what the Gaussian distribution would suggest. On the other hand, there are no significant problems of autocorrelation between the residuals. As another important analysis, we compute the Q-Q plot, which helps us check the normality of the residuals. A quantile-quantile (Q-Q) chart allows you to see how close the distribution of a data set is to some ideal distribution, or to compare the distributions of two data sets. When the comparison is with the Gaussian distribution, it is called a normal probability plot. The data are sorted and the i-th value is plotted against the corresponding Gaussian quantile. A sketch of the code to produce this graph is the following:
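# Sketch of the Q-Q plot for the ETS model residuals, assuming the ForecastEts
# object defined above.
res <- as.numeric(ForecastEts$residuals)
qqnorm(res, main = "Q-Q plot of residuals")
qqline(res, col = "red")   # reference line for a perfectly normal sample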
As we can see in the Q-Q plot, the residual series behaves approximately normally. The objective of time series analysis is to decompose the observed series into two parts: one that depends on the past and another that is unpredictable. The use of the ETS model for the seasonal component, together with the LOESS trend, captured the systematic behavior of the series effectively, in such a way that the only thing that "remains" of the series is white noise, that is, random variations that cannot be predicted.
Data Visualization
In pixel-oriented visualization techniques, space-filling curves are often used to fill the windows. A space-filling curve is a curve whose range covers the entire n-dimensional unit hypercube. Since the visualization windows are 2-D, we can use any 2-D space-filling curve. Figure 2.11 shows some frequently used 2-D space-filling curves. Note that the windows do not have to be rectangular. For example, the circle segment technique uses windows in the shape of segments of a circle, as illustrated in Figure 2.12.
This technique can ease the comparison of dimensions because the dimension
windows are located side by side and form a circle.
Chernoff faces make use of the ability of the human mind to recognize small
differences in facial characteristics and to assimilate many facial characteristics at
once.
Viewing large tables of data can be tedious. By condensing the data, Chernoff faces make the data easier for users to digest. In this way, they facilitate the visualization of regularities and irregularities present in the data.