Chapter 2: Preparing to Model

Q1. What are the main activities involved when you are preparing to start with
modelling in machine learning?

Ans: The following are the typical preparation activities done once the input data
comes into the machine learning system:

• Understand the type of data in the given input data set.
• Explore the data to understand its nature and quality.
• Explore the relationships amongst the data elements, e.g. inter-feature relationships.
• Find potential issues in the data.
• Do the necessary remediation, e.g. impute missing data values, if needed.
• Apply pre-processing steps, as necessary.
• Once the data is prepared for modelling, the learning tasks start. As a part of this, do the following activities:
❖ The input data is first divided into two parts – the training data and the test data (called the holdout). This step is applicable to supervised learning only (see the sketch after this list).
❖ Consider different models or learning algorithms for selection.
❖ For a supervised learning problem, train the model on the training data and then apply it to unknown data. For an unsupervised learning problem, directly apply the chosen unsupervised model to the input data.
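As a minimal sketch of the holdout step, scikit-learn's train_test_split can create the training and test partitions. The synthetic data and the 70/30 split ratio below are illustrative assumptions, not prescribed by the text:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic data standing in for a real input data set.
X, y = make_classification(n_samples=100, n_features=4, random_state=42)

# Hold out 30% of the records as test data (the 'holdout').
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (70, 4) (30, 4)
```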

Q2. What are the basic data types in machine learning? Give an example of each
one of them.
Ans: Data can broadly be divided into the following two types:

1. Qualitative data
2. Quantitative data

1. Qualitative data
Qualitative data provides information about the quality of an object, or
information which cannot be measured.
For example, if we consider the quality of performance of students in terms of
‘Good’, ‘Average’, and ‘Poor’, it falls under the category of qualitative data.
Also, the name or roll number of students is information that cannot be measured
using some scale of measurement, so it too falls under qualitative data.
Qualitative data is also called categorical data.
Qualitative data can be further subdivided into two types as follows:

1. Nominal data
2. Ordinal data
1. Nominal data
Nominal data is data which has no numeric value, but a named value. It is used for
assigning named values to attributes. Nominal values cannot be quantified.
Examples of nominal data are:
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other

Note: A special case of nominal data is when only two labels are possible, e.g.
pass/fail as a result of an examination. This sub-type of nominal data is called
‘dichotomous’.

It is obvious that mathematical operations such as addition, subtraction,
multiplication, etc. cannot be performed on nominal data.
For that reason, statistical functions such as mean, variance, etc. can also not be
applied to nominal data.
However, a basic count is possible. So the mode, i.e. the most frequently occurring
value, can be identified for nominal data.
2. Ordinal data
Ordinal data also assigns named values to attributes, but unlike nominal data,
these values can be arranged in a sequence of increasing or decreasing value, so
that we can say whether one value is better than or greater than another.

Examples of ordinal data are


1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.

2. Quantitative data
Quantitative data relates to information about the quantity of an object – hence
it can be measured.
For example, if we consider the attribute ‘marks’, it can be measured using a
scale of measurement. Quantitative data is also termed numeric data.
There are two types of quantitative data:
1. Interval data
2. Ratio data
1. Interval data is numeric data for which not only the order is known, but the
exact difference between values is also known.
An ideal example of interval data is Celsius temperature, where equal intervals
on the scale represent equal differences in temperature.
For example, the difference between 12°C and 18°C is measurable and is 6°C,
just as in the case of the difference between 15.5°C and 21.5°C.
Other examples include date, time, etc.
For interval data, mathematical operations such as addition and subtraction
are possible. For that reason, for interval data, the central tendency can be
measured by mean, median, or mode. Standard deviation can also be
calculated.

2. Ratio data represents numeric data for which the exact value can be measured.
An absolute zero is available for ratio data. Also, these variables can be added,
subtracted, multiplied, or divided. The central tendency can be measured by
mean, median, or mode, along with measures of dispersion such as standard
deviation.
Examples of ratio data include height, weight, age, salary, etc.
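As a small illustration (with made-up values) of which summary statistics apply to which data type – only counts and mode for nominal data, the full set of statistics for ratio data:

```python
import pandas as pd

# Nominal data: only counting is meaningful, so mode is the sole
# measure of central tendency.
blood_group = pd.Series(["A", "B", "O", "AB", "O", "O", "A"])
print(blood_group.value_counts())   # frequency table per category
print(blood_group.mode()[0])        # mode: 'O'

# Ratio data: mean, median, mode and dispersion all apply.
marks = pd.Series([56, 72, 72, 81, 90])
print(marks.mean(), marks.median(), marks.std())
```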

[Refer to the Auto MPG data set, which is used in the examples that follow; the data table itself is not reproduced here.]


Q3. Write short notes on Histogram.

Ans. A histogram is another plot which helps in the effective visualization of
numeric attributes. It helps in understanding the distribution of numeric data
over a series of intervals, also termed ‘bins’.

Histograms might be of different shapes depending on the nature of the data, e.g.
skewness. Figure 2.11 provides a depiction of different shapes of the histogram
that are generally created. These patterns give us a quick understanding of the
data and thus act as a great data exploration tool.

Let’s now examine the histograms for the different attributes of the Auto MPG
data set presented in Figure 2.12. The histograms for ‘mpg’ and ‘weight’ are
right-skewed.
The histogram for ‘acceleration’ is symmetric and unimodal, whereas the one for
‘model.year’ is symmetric and uniform. For the remaining attributes, histograms
are multimodal in nature.

Now let’s dig deeper into one of the histograms, say the one for the attribute
‘acceleration’. The histogram is composed of a number of bars, one bar appearing
for each of the ‘bins’. The height of a bar reflects the total count of data
elements whose value falls within the specific bin’s range, i.e. the frequency.
In the context of the histogram for acceleration, each ‘bin’ represents an
acceleration value interval of 2 units. So the second bin, for example, covers
acceleration values of 10 to 12 units.
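A minimal sketch of drawing such a histogram with matplotlib, assuming the Auto MPG data has been loaded into a pandas DataFrame (the file name ‘auto-mpg.csv’ and the exact bin edges are assumptions; the bin width of 2 follows the text):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("auto-mpg.csv")  # assumed file name for the Auto MPG data

# Bins of width 2 units, matching the description above; the range is illustrative.
plt.hist(df["acceleration"], bins=range(8, 28, 2), edgecolor="black")
plt.xlabel("acceleration")
plt.ylabel("frequency")
plt.title("Histogram of acceleration")
plt.show()
```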
Q4. Write short notes on Scatter plot.

Ans: A scatter plot helps in visualizing bivariate relationships, i.e. the relationship
between two variables. It is a two-dimensional plot in which points or dots are drawn
at coordinates provided by the values of the attributes.

For example, in a data set there are two attributes – attr_1 and attr_2. We want to
understand the relationship between the two attributes, i.e. with a change in the value
of one attribute, say attr_1, how does the value of the other attribute, say attr_2, change.

We can draw a scatter plot with attr_1 mapped to the x-axis and attr_2 mapped to the
y-axis. So, every point in the plot will have the value of attr_1 as its x-coordinate and
the value of attr_2 as its y-coordinate. In such a two-dimensional plot, attr_1 is said to
be the independent variable and attr_2 the dependent variable.

Let’s take a real example in this context. In the data set Auto MPG, there is expected to
be some relation between the attributes ‘displacement’ and ‘mpg’. Let’s try to verify our
intuition using the scatter plot of ‘displacement’ and ‘mpg’. Let’s map ‘displacement’ as
the x-coordinate and ‘mpg’ as the y-coordinate. The scatter plot comes as in Figure 2.13.
In Figure 2.14, the pairwise relationships among the features – ‘mpg’,
‘displacement’, ‘horsepower’, ‘weight’, and ‘acceleration’ – have been captured. As
you can see, in most of the cases there is a significant relationship between the
attribute pairs. However, in some cases, e.g. between the attributes ‘weight’ and
‘acceleration’, the relationship doesn’t seem to be very strong.
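A minimal sketch of both plots with pandas and matplotlib, assuming the Auto MPG data has been loaded into a DataFrame with numeric columns (the file name and column names follow the text but are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("auto-mpg.csv")  # assumed file name for the Auto MPG data

# Scatter plot: 'displacement' on the x-axis, 'mpg' on the y-axis (cf. Figure 2.13).
plt.scatter(df["displacement"], df["mpg"], s=10)
plt.xlabel("displacement")
plt.ylabel("mpg")
plt.show()

# Pairwise scatter plots for several attributes (cf. Figure 2.14).
pd.plotting.scatter_matrix(
    df[["mpg", "displacement", "horsepower", "weight", "acceleration"]],
    figsize=(10, 10)
)
plt.show()
```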
Q5. Write a short note on PCA.
Ans. Dimensionality reduction refers to the techniques of reducing the
dimensionality of a data set by creating new attributes by combining the original
attributes. The most common approach for dimensionality reduction is known
as Principal Component Analysis (PCA).
PCA is a statistical technique to convert a set of correlated variables into a set of
transformed, uncorrelated variables called principal components.
The principal components are a linear combination of the original variables.
They are orthogonal to each other.
Since the principal components are uncorrelated, they capture the maximum
amount of variability in the data.
However, the only challenge is that the original attributes are lost due to the
transformation. PCA is one of the most widely used tools in exploratory data
analysis and in machine learning for predictive models.
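A minimal sketch of PCA using scikit-learn; the synthetic correlated data and the choice of two components are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative correlated data standing in for a real data set.
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     2 * x1 + rng.normal(scale=0.5, size=200),
                     rng.normal(size=200)])

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of total variability captured by each principal component.
print(pca.explained_variance_ratio_)
```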

Q6. What are the different techniques for data pre-processing? Explain, in
brief, dimensionality reduction and feature selection.
Ans: The following are the techniques for data pre-processing:
1. Dimensionality Reduction
2. Feature subset selection
1. Dimensionality Reduction:
Till the end of the 1990s, very few of the explored domains involved
data sets with a high number of attributes or features. In general,
the data sets used in machine learning had only a few tens of features.
However, in the last two decades there has been a rapid advent of
computational biology, e.g. genome projects. These projects have
produced extremely high-dimensional data sets, with 20,000 or more
features being very common. Also, there has been widespread adoption
of social networking, leading to a need for text classification for
customer behaviour analysis.
High-dimensional data sets need a high amount of computational
space and time. At the same time, not all features are useful –
irrelevant and redundant features degrade the performance of machine
learning algorithms.
Most machine learning algorithms perform better if the dimensionality
of the data set, i.e. the number of features in the data set, is reduced.
Dimensionality reduction helps in reducing irrelevance and
redundancy in features. Also, it is easier to understand a model if the
number of features involved in the learning activity is less.
Dimensionality reduction refers to the techniques of reducing the
dimensionality of a data set by creating new attributes by combining
the original attributes.
The most common approach for dimensionality reduction is known
as Principal Component Analysis (PCA). PCA is a statistical technique
to convert a set of correlated variables into a set of transformed,
uncorrelated variables called principal components.
The principal components are a linear combination of the original
variables. They are orthogonal to each other.
Since principal components are uncorrelated, they capture the
maximum amount of variability in the data. However, the only
challenge is that the original attributes are lost due to the
transformation. Another commonly used technique for dimensionality
reduction is Singular Value Decomposition (SVD).

2. Feature subset selection:
Feature subset selection, or simply feature selection, both for
supervised as well as unsupervised learning, tries to find the
optimal subset of the entire feature set which significantly reduces
computational cost without any major impact on the learning
accuracy.
It may seem that selecting a feature subset may lead to loss of useful
information, as certain features are going to be excluded from the
final set of features used for learning.
However, only features which are irrelevant or redundant are selected
for elimination.
A feature is considered irrelevant if it plays an insignificant role in
classifying or grouping together a set of data instances. All irrelevant
features are eliminated while selecting the final feature subset.
A feature is potentially redundant when the information contributed
by the feature is more or less the same as that of one or more other
features. Among a group of potentially redundant features, a small
number of features can be selected as part of the final feature subset
without causing any negative impact on the accuracy of the learned
model.
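As a minimal sketch of one common feature-selection approach (a univariate filter via scikit-learn's SelectKBest; the text does not prescribe a specific method, so this is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative data: 20 features, of which only a few are informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=42)

# Keep the 4 features scoring highest on a univariate ANOVA F-test;
# low-scoring (irrelevant) features are dropped from the final subset.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (200, 20) -> (200, 4)
```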

Q.7 What is IQR? How is it measured?

Ans: The interquartile range (IQR), also called the midspread, middle 50%, or
technically the H-spread, is the difference between the third quartile (Q3) and the
first quartile (Q1). It covers the center of the distribution and contains 50% of the
observations.

IQR = Q3 – Q1

A box plot visualizes the five-number summary statistics: minimum, first quartile (Q1),
median (Q2), third quartile (Q3), and maximum. Below is a detailed interpretation of a box plot.

• The central rectangle, or the box, spans from the first to the third quartile (i.e. Q1 to Q3), thus
giving the inter-quartile range (IQR).

• The median is given by the line or band within the box.

• The lower whisker extends up to 1.5 times the inter-quartile range (IQR) from
the bottom of the box, i.e. the first quartile or Q1.

However, the actual length of the lower whisker depends on the lowest data value that
falls within (Q1 − 1.5 times the IQR). Let’s try to understand this with an example. Say for
a specific set of data, Q1 = 73, median = 76, and Q3 = 79. Hence, the IQR will be 6 (i.e. Q3 –
Q1).

So, the lower whisker can extend at most till (Q1 – 1.5 × IQR) = 73 – 1.5 × 6 = 64.
However, say there are lower-range data values such as 70, 63, and 60. Then the lower
whisker will come at 70, as this is the lowest data value larger than 64.
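A minimal computational sketch of the IQR and the whisker limits with numpy; the data values are made up for illustration:

```python
import numpy as np

data = np.array([60, 63, 70, 73, 74, 76, 78, 79, 82, 84])

# Note: several quartile conventions exist; numpy's default uses linear
# interpolation, so results may differ slightly from hand calculations
# based on other conventions.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Maximum possible extent of the whiskers under the 1.5 * IQR rule.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"whisker limits: [{lower_limit}, {upper_limit}]")
```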
Q8. Explain the detailed process of machine learning.
Ans. The detailed process of machine learning includes four steps:
Step 1: Preparing to Model
Step 2: Learning
Step 3: Performance Evaluation
Step 4: Performance Improvement
[The figure depicting these four steps, and the table listing the machine learning activities involved in each step, are not reproduced here.]
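A hedged sketch of how the four steps might map onto scikit-learn calls for a supervised problem; the model choice, data, and parameter grid are illustrative assumptions, not prescribed by the text:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Preparing to model - load the data and hold out a test set.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Step 2: Learning - train a chosen model on the training data.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Step 3: Performance evaluation - measure accuracy on the held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: Performance improvement - e.g. tune hyperparameters.
tuned = GridSearchCV(DecisionTreeClassifier(random_state=42),
                     {"max_depth": [2, 4, 8]}, cv=5).fit(X_train, y_train)
print("best params:", tuned.best_params_)
```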
Q9. Explain, with proper examples, different ways of exploring categorical data.
Ans: In the Auto MPG data set, the attribute ‘car.name’ is categorical in nature. We may
also consider ‘cylinders’ as a categorical variable instead of a numeric variable. The first
summary which we may be interested in noting is how many unique names there are
for the attribute ‘car name’, or how many unique values there are for the ‘cylinders’
attribute. We can get this as follows:
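One possible way to obtain these summaries with pandas (the file name ‘auto-mpg.csv’ is an assumption; the column names ‘car.name’ and ‘cylinders’ follow the text):

```python
import pandas as pd

df = pd.read_csv("auto-mpg.csv")  # assumed file name for the Auto MPG data

# Number of unique values for each categorical attribute.
print(df["car.name"].nunique())
print(df["cylinders"].nunique())

# Frequency table per category (cf. Tables 2.4 and 2.5).
print(df["cylinders"].value_counts())

# Proportion of data elements per category (cf. Tables 2.6 and 2.7).
print(df["cylinders"].value_counts(normalize=True))

# Mode: the most frequently occurring category.
print(df["cylinders"].mode()[0])
```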

We may also look for a little more detail and want to get a table consisting of the
categories of the attribute and the count of the data elements falling into each
category. Tables 2.4 and 2.5 contain these details.

In the same way, we may also be interested in knowing the proportion (or
percentage) of the count of data elements belonging to a category. For example, for
the attribute ‘cylinders’, the proportion of data elements belonging to the category
4 is 204 ÷ 398 = 0.513, i.e. 51.3%. Tables 2.6 and 2.7 contain the summarization
of the categorical attributes by proportion of data elements.
The mode, in the context of a categorical attribute, is the category which has the
highest number of data values. Since mean and median cannot be applied to
categorical variables, mode is the sole measure of central tendency.
Let’s try to find out the mode for the attributes ‘car name’ and ‘cylinders’. For
cylinders, since the number of categories is small and we have the entire table
listed above, we can see that the mode is 4, as that is the data value with the
highest frequency.
More than 50% of the data elements belong to the category 4. However, the mode
is not so evident for the attribute ‘car name’ from the information given above.
When we probe and try to find the mode, it is found to be the category ‘ford pinto’,
which has the highest frequency, 6.

Q.10 Write a short note on Interval Data.

Ans. Interval data is numeric data for which not only the order is known, but the exact
difference between values is also known.

An ideal example of interval data is Celsius temperature, where equal intervals on the
scale represent equal differences in temperature.

For example, the difference between 12°C and 18°C is measurable and is 6°C, just as
in the case of the difference between 15.5°C and 21.5°C.

Other examples include date, time, etc. For interval data, mathematical operations such
as addition and subtraction are possible. For that reason, for interval data, the central
tendency can be measured by mean, median, or mode. Standard deviation can also be
calculated.
However, interval data does not have a ‘true zero’ value. For example, there is
nothing called ‘0 temperature’ or ‘no temperature’ on the Celsius scale.

Hence, only addition and subtraction apply for interval data; ratios cannot be
applied. This means we can say a temperature of 40°C is equal to a temperature of
20°C plus a temperature of 20°C. However, we cannot say that a temperature of 40°C
is twice as hot as a temperature of 20°C.
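A small sketch of this point: the naive Celsius ratio is misleading, while on an absolute scale such as Kelvin (which has a true zero) the ratio becomes physically meaningful:

```python
# Differences are meaningful on the Celsius (interval) scale...
t1_c, t2_c = 20.0, 40.0
print(t2_c - t1_c)  # 20.0 - a valid interval comparison

# ...but the naive ratio is not: 40 degC is NOT "twice as hot" as 20 degC.
print(t2_c / t1_c)  # 2.0 - misleading

# On the Kelvin (ratio) scale, which has an absolute zero, ratios are valid.
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15
print(t2_k / t1_k)  # ~1.07 - the physically meaningful ratio
```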

Q11. Write a short note on Inter-quartile range.


Ans. The interquartile range (IQR), also called the midspread, middle 50%, or
technically the H-spread, is the difference between the third quartile (Q3) and the
first quartile (Q1). It covers the center of the distribution and contains 50% of the
observations.

IQR = Q3 – Q1

A box plot visualizes the five-number summary statistics: minimum, first quartile (Q1),
median (Q2), third quartile (Q3), and maximum. Below is a detailed interpretation of a box plot.

• The central rectangle, or the box, spans from the first to the third quartile (i.e. Q1 to Q3), thus
giving the inter-quartile range (IQR).

• The median is given by the line or band within the box.

• The lower whisker extends up to 1.5 times the inter-quartile range (IQR) from
the bottom of the box, i.e. the first quartile or Q1.

However, the actual length of the lower whisker depends on the lowest data value that
falls within (Q1 − 1.5 times the IQR). Let’s try to understand this with an example. Say for
a specific set of data, Q1 = 73, median = 76, and Q3 = 79. Hence, the IQR will be 6 (i.e. Q3 –
Q1).

So, the lower whisker can extend at most till (Q1 – 1.5 × IQR) = 73 – 1.5 × 6 = 64.
However, say there are lower-range data values such as 70, 63, and 60. Then the lower
whisker will come at 70, as this is the lowest data value larger than 64.

The upper whisker extends up to 1.5 times the inter-quartile range (IQR) from
the top of the box, i.e. the third quartile or Q3. Similar to the lower whisker, the actual
length of the upper whisker will also depend on the highest data value that falls within
(Q3 + 1.5 times the IQR).

Let’s try to understand this with an example. For the same set of data mentioned in the
above point, the upper whisker can extend at most till (Q3 + 1.5 × IQR) = 79 + 1.5 × 6 =
88. Say there are higher-range data values such as 82, 84, and 89. Then the upper
whisker will come at 84, as this is the highest data value lower than 88.
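A minimal plotting sketch with matplotlib; the data values are made up to roughly match the worked example (exact quartiles may differ slightly under matplotlib's quartile convention):

```python
import matplotlib.pyplot as plt

# Illustrative data, including the low and high values from the example.
data = [60, 63, 70, 73, 74, 76, 78, 79, 82, 84, 89]

# whis=1.5 is the default: each whisker extends to the furthest data
# point within 1.5 * IQR of the box; points beyond are drawn as outliers.
plt.boxplot(data, whis=1.5)
plt.ylabel("value")
plt.show()
```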

Q12. Write a short note on Cross-tab.


Ans. A two-way cross-tabulation (also called a cross-tab or contingency table) is used to
understand the relationship between two categorical attributes in a concise way. It has a
matrix format that presents a summarized view of the bivariate frequency distribution.
A cross-tab, very much like a scatter plot, helps us understand how much the data values
of one attribute change with a change in the data values of another attribute. Let’s try to
see this with examples, in the context of the Auto MPG data set.

Let’s assume the attributes ‘cylinders’, ‘model.year’, and ‘origin’ are categorical and try to
examine the variation of one with respect to another. As we understand, the attribute
‘cylinders’ reflects the number of cylinders in a car and assumes the values 3, 4, 5, 6, and 8.
The attribute ‘model.year’ captures the model year of each car, and ‘origin’ gives the
region of the car, with the origin values 1, 2, and 3 corresponding to North America,
Europe, and Asia. Below are the cross-tabs. Let’s try to understand what information
they actually provide.
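A minimal sketch of producing these cross-tabs with pandas (the file name is an assumption; the column names follow the text):

```python
import pandas as pd

df = pd.read_csv("auto-mpg.csv")  # assumed file name for the Auto MPG data

# Cross-tab of model year vs. origin: vehicles per year in each region.
print(pd.crosstab(df["model.year"], df["origin"]))

# Cross-tab of cylinders vs. origin, and of cylinders vs. model year.
print(pd.crosstab(df["cylinders"], df["origin"]))
print(pd.crosstab(df["cylinders"], df["model.year"]))
```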

The first cross-tab, i.e. the one showing the relationship between the attributes
‘model.year’ and ‘origin’, helps us understand the number of vehicles per year in each of
the regions North America, Europe, and Asia.

Looking at it in another way, we can get the count of vehicles per region over the
different years. All these are in the context of the sample data given in the Auto MPG
data set.

Moving to the second cross-tab, it gives the number of 3, 4, 5, 6, or 8 cylinder cars in
every region present in the sample data set. The last cross-tab presents the number of
3, 4, 5, 6, or 8 cylinder cars every year.

Tables 2.8–2.10 present cross-tabs for different attribute combinations.


Q13. Difference between Nominal Data and Ordinal Data
Ans.
Nominal Data:
• Nominal data has no numeric value, but a named value.
• Nominal values cannot be quantified.
• Nominal values cannot be assigned any type of order.
• Examples:
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other

Ordinal Data:
• Ordinal data also assigns named values to attributes, but unlike nominal data, the values can be placed into some kind of order.
• Ordinal values show sequence.
• Examples:
1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
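A small illustration of the distinction using pandas categoricals: an ordinal variable carries an explicit ordering, which enables comparison and sorting, while a nominal one does not (the category values are made up for illustration):

```python
import pandas as pd

# Nominal: categories have no order; order comparisons are undefined.
blood = pd.Categorical(["A", "O", "AB", "B"], ordered=False)

# Ordinal: an explicit increasing order lets us compare and rank values.
satisfaction = pd.Categorical(
    ["Happy", "Unhappy", "Very Happy"],
    categories=["Unhappy", "Happy", "Very Happy"],
    ordered=True,
)
print(satisfaction.max())           # 'Very Happy'
print(satisfaction < "Very Happy")  # element-wise order comparisons
```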

Q14. Difference between Box plot and Histogram
Ans.
Box plot:
• A box plot is an extremely effective mechanism to get a one-shot view of the data and understand its nature.
• The focus of a box plot is to divide the data elements in a data set into four equal portions, each portion containing an equal number of data elements.
• Box plots therefore always have the same basic shape, since the data is always split into four equal-count portions.

Histogram:
• A histogram is another plot which helps in the effective visualization of numeric attributes.
• The focus of a histogram is to plot ranges of data values (acting as ‘bins’); the number of data elements in each range depends on the data distribution, so the size of each bar varies.
• Histograms might take different shapes depending on the nature of the data.

Q15. Difference between Mean and Median
Ans.
Mean:
• Mean is the arithmetic average – the sum of all data values divided by the count of values.
• It takes every value into account, so it is strongly affected by outliers.

Median:
• Median is the middle value when the data values are arranged in order (the second quartile, Q2).
• It depends only on the ordering of the values, so it is robust to outliers.
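A small illustration (with made-up numbers) of how a single outlier shifts the mean noticeably while the median barely moves:

```python
import numpy as np

marks = np.array([56, 60, 62, 65, 70])
print(np.mean(marks), np.median(marks))  # 62.6  62.0

# Add one extreme value: the mean jumps, the median barely changes.
marks_outlier = np.append(marks, 300)
print(np.mean(marks_outlier), np.median(marks_outlier))  # ~102.2  63.5
```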

Note: All the examples above refer to the Auto MPG data set.
