Rohit Unit 1 ML Notes
Analytics is a collection of techniques and tools used for creating value from data. These
techniques include artificial intelligence (AI), machine learning (ML), and deep learning (DL)
algorithms. Machine learning is defined as follows:
Machine learning is a set of algorithms that have the capability to learn to perform tasks such as
prediction and classification effectively using data.
Machine learning algorithms are classified into four categories as defined below:
1. Supervised Learning Algorithms: These algorithms require the knowledge of both the
outcome variable (dependent variable) and the features (independent variable or input
variables). The algorithm learns (i.e., estimates the values of the model parameters or
feature weights) by defining a loss function which is usually a function of the difference
between the predicted value and the actual value of the outcome variable. Algorithms such as
linear regression, logistic regression, and discriminant analysis are examples of supervised
learning algorithms.
2. Unsupervised Learning Algorithms: These algorithms do not have knowledge of the
outcome variable in the dataset. They must find the possible values of the outcome
variable on their own. Algorithms such as clustering and principal component analysis are
examples of unsupervised learning algorithms.
3. Reinforcement Learning Algorithms: Reinforcement learning algorithms are
algorithms that must take sequential actions (decisions) to maximize a cumulative
reward. In many datasets, there could be uncertainty around both the input and the
output variables. For example, consider the case of spell check in various text editors. If a
person types “buutiful” in Microsoft Word, the spell check in Microsoft Word will
immediately identify this as a spelling mistake and give options such as “beautiful”,
“bountiful”, and “dutiful”. Here the prediction is not one single value, but a set of values.
Techniques such as Markov chain and Markov decision process are examples of
reinforcement learning algorithms.
4. Evolutionary Learning Algorithms: Evolutionary algorithms imitate natural evolution to
solve a problem. Techniques such as genetic algorithms and ant colony optimization fall
under the category of evolutionary learning algorithms.
---------------------------------------------------------------------------------------------------------------------
3. List the frameworks which are used for developing machine learning models.
The framework for ML algorithm development can be divided into five integrated stages:
problem and opportunity identification, collection of relevant data, data pre-processing, ML
model building, and model deployment.
---------------------------------------------------------------------------------------------------------------------
(Screenshot omitted here: it lists the Python libraries and the purpose each one is used for — this can be asked in MCQs or viva.)
---------------------------------------------------------------------------------------------------------------------
3. Training Data: Training data is the dataset used to train a machine learning model. It
includes input-output pairs where the output is known.
4. Test Data: Test data is a separate dataset used to evaluate the performance of a trained
model. It helps measure the model’s ability to generalize to new data.
5. Validation Data: Validation data is used during the training process to tune
hyperparameters and avoid overfitting. It provides an unbiased evaluation of the model
fit.
6. Overfitting: Overfitting occurs when a model learns the training data too well, capturing
noise and irrelevant patterns, leading to poor performance on new data.
7. Underfitting: Underfitting occurs when a model is too simple to capture the underlying
patterns in the data, resulting in poor performance on both training and test data.
9. Label: A label is the output variable in supervised learning, representing the ground truth
that the model aims to predict.
10. Supervised Learning: A type of machine learning where the model is trained on labeled
data, i.e., data that includes both input features and the corresponding output labels.
11. Unsupervised Learning: A type of machine learning where the model is trained on
unlabeled data, aiming to find hidden patterns or intrinsic structures in the input data.
12. Reinforcement Learning: A type of machine learning where an agent learns to make
decisions by performing actions in an environment to maximize cumulative rewards.
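As a small illustration of how training, validation, and test data are typically separated in practice, here is a minimal Python sketch using scikit-learn (the dataset and the 60/20/20 split ratio are assumptions chosen only for illustration):

# Minimal sketch: splitting a dataset into training, validation, and test sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out 20% of the data as the test set (used only for final evaluation).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# From the remaining 80%, carve out 25% (i.e., 20% of the original) as the validation set,
# which is used for hyperparameter tuning and detecting overfitting.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 90 30 30 for the 150-sample iris data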
---------------------------------------------------------------------------------------------------------------------
1. Classification: The task of predicting a categorical label for given input data. Examples
include spam detection and image classification.
2. Regression: The task of predicting a continuous value for given input data. Examples
include predicting house prices and stock prices.
3. Clustering: The task of grouping similar data points together based on certain
characteristics. Examples include customer segmentation and image segmentation.
4. Dimensionality Reduction: The task of reducing the number of input variables in a
dataset. Examples include Principal Component Analysis (PCA) and t-SNE.
5. Anomaly Detection: The task of identifying rare items, events, or observations that differ
significantly from the majority of the data. Examples include fraud detection and network
security.
6. Reinforcement Learning Tasks: Tasks where an agent interacts with an environment to
learn a policy that maximizes cumulative rewards. Examples include game playing and
robotics.
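As one concrete illustration of these task types, the minimal sketch below performs a clustering task with K-means on synthetic data (the data and the choice of three clusters are assumptions for illustration):

# Minimal sketch of a clustering task: grouping similar points with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three blobs of points around different centers
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)   # approximate centers of the three groups
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points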
---------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
1. Problem Definition: Clearly define the problem and objectives. Understand the business
context and the requirements for the machine learning application.
2. Data Collection and Preprocessing: Collect relevant data and preprocess it. This
includes cleaning the data, handling missing values, normalizing/standardizing features,
and splitting the data into training, validation, and test sets.
3. Feature Engineering: Create and select relevant features that will help the model learn
patterns from the data. This may involve domain knowledge and various techniques like
encoding categorical variables and creating interaction features.
4. Model Selection and Training: Select appropriate machine learning algorithms and train
the models on the training data. Tune hyperparameters using the validation set.
5. Model Evaluation: Evaluate the trained models using performance metrics relevant to
the problem (e.g., accuracy, precision, recall, F1-score for classification; RMSE, MAE
for regression).
6. Model Deployment: Deploy the model to a production environment where it can make
predictions on new data. Ensure the model can handle real-time data and integrate with
existing systems.
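A minimal end-to-end sketch of steps 2 to 5, assuming scikit-learn and an illustrative built-in dataset (the model and hyperparameter grid are arbitrary choices, not prescriptions):

# Minimal sketch of steps 2-5 of the workflow (data split, training, tuning, evaluation).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# 2. Data collection and preprocessing: load data and split into train/test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 3./4. Feature scaling and model selection: tune the regularization strength C by cross-validation.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
grid = GridSearchCV(pipeline, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# 5. Model evaluation on the held-out test set.
y_pred = grid.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))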
---------------------------------------------------------------------------------------------------------------------
Feature engineering involves creating new features or transforming existing ones to improve
model performance. It leverages domain knowledge and data manipulation techniques.
Model selection involves choosing the best model from a set of candidate models based on their
performance on a validation set. It ensures that the chosen model generalizes well to unseen data.
Cross-Validating Models
K-Fold Cross-Validation:
o Description: The dataset is split into k equal parts. The model is trained k times,
each time leaving out one part for validation and using the remaining k-1 parts for
training. The results are averaged to produce a final performance metric.
o Example: Using 5-fold cross-validation for a dataset of 100 samples. Each fold
contains 20 samples.
Stratified K-Fold Cross-Validation:
o Description: Similar to k-fold, but ensures that each fold is representative of the
overall class distribution.
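A minimal sketch of both variants using scikit-learn (the dataset and classifier are arbitrary illustration choices):

# Minimal sketch of k-fold and stratified k-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Plain 5-fold CV: each of the 5 folds is used once as the validation fold.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified 5-fold CV: each fold preserves the overall class proportions.
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("K-fold mean accuracy:", kfold_scores.mean())
print("Stratified k-fold mean accuracy:", strat_scores.mean())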
---------------------------------------------------------------------------------------------------------------------
Data visualization is useful to gain insights and understand what happened in the past in a
given context. It is also helpful for feature engineering.
Matplotlib: Matplotlib is a Python 2D plotting library and the most widely used library for data
visualization. It provides an extensive set of plotting APIs to create various plots such as scatter,
bar, box, and distribution plots with custom styling and annotation. It is a library for creating 2D
plots of arrays in Python, is itself written in Python, and makes use of NumPy arrays. Matplotlib
is well integrated with pandas to read columns and create plots.
Seaborn: Seaborn is also a Python data visualization library, based on matplotlib. It provides a
high-level interface for drawing attractive and informative statistical charts.
To create graphs and plots, we need to import the matplotlib.pyplot and seaborn modules. To
display the plots inside a Jupyter Notebook, we need to provide the directive %matplotlib inline;
the plots are displayed in the notebook only if this directive is provided.
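A typical setup cell might look like the sketch below; the CSV file name is a hypothetical placeholder, and ipl_auction_df is the DataFrame used in the plotting examples that follow:

# Sketch of a typical visualization setup cell in a Jupyter Notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd

# Hypothetical file name: load the IPL auction dataset used in the examples below.
ipl_auction_df = pd.read_csv("IPL_auction.csv")
ipl_auction_df.head()   # displays the first five rows in the notebook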
---------------------------------------------------------------------------------------------------------------------
2. Bar Chart.
A bar chart, also known as a bar graph, is a visual representation of data where individual bars
represent different categories or values. Each bar's length or height is proportional to the value it
represents. Bar charts are commonly used to compare quantities across different categories or to
show changes over time when the data is categorized.
To draw a bar chart, call barplot() of the seaborn library. The DataFrame should be passed in the
data parameter. To display the average SOLD PRICE for each AGE category, pass SOLD PRICE
as the y parameter and AGE as the x parameter, as in the sketch below.
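A minimal sketch, assuming ipl_auction_df has been loaded as in the setup above and contains AGE and SOLD PRICE columns:

# Sketch: average SOLD PRICE for each AGE category (barplot shows the mean by default).
sn.barplot(x='AGE', y='SOLD PRICE', data=ipl_auction_df)
plt.show()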
3. Histogram
A histogram is a plot that shows the frequency distribution of a continuous variable. A
histogram gives insight into the underlying distribution (e.g., normal distribution) of the
variable, outliers, skewness, etc. To draw a histogram, invoke the hist() method of the matplotlib
library.
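A minimal sketch, assuming the same ipl_auction_df DataFrame (the number of bins is an arbitrary choice):

# Sketch: frequency distribution of SOLD PRICE using matplotlib's hist().
plt.hist(ipl_auction_df['SOLD PRICE'], bins=20)
plt.xlabel('SOLD PRICE')
plt.ylabel('Frequency')
plt.show()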
---------------------------------------------------------------------------------------------------------------------
A distribution or density plot depicts the distribution of data over a continuous interval. It is
like a smoothed histogram, so a density plot also gives insight into what the distribution of the
underlying population might be. To draw the distribution plot, we can use distplot() of the
seaborn library.
sn.distplot(ipl_auction_df['SOLD PRICE']);
5. Box Plot
Box plot (aka Box and Whisker plot) is a graphical representation of numerical data that can
be used to understand the variability of the data and the existence of outliers. Box plot is
designed by identifying the following descriptive statistics: 1. Lower quartile (1st quartile),
median and upper quartile (3rd quartile). 2. Lowest and highest values. 3. Inter-quartile range
(IQR). Box plot is constructed using IQR, minimum and maximum values. IQR is the distance
(difference) between the 3rd quartile and 1st quartile. The length of the box is equivalent to IQR.
It is possible that the data may contain values beyond Q1 – 1.5IQR and Q3 + 1.5IQR.
The whiskers of the box plot extend to Q1 – 1.5IQR and Q3 + 1.5IQR; observations
beyond these two limits are potential outliers. To draw the box plot, call boxplot() of the seaborn
library.
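A minimal sketch for the SOLD PRICE column, assuming the same DataFrame:

# Sketch: box plot of SOLD PRICE to inspect spread, IQR, and potential outliers.
sn.boxplot(y=ipl_auction_df['SOLD PRICE'])
plt.show()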
---------------------------------------------------------------------------------------------------------------------
6. Scatter Plot
In a scatter plot, the values of two variables are plotted along the two axes, and the resulting
pattern can reveal any correlation present between the variables. The relationship could be linear
or non-linear. A scatter plot is also useful for assessing the strength of the relationship and for
finding whether there are any outliers in the data. Scatter plots are used during regression model
building to decide on the initial model, that is, whether or not to include a variable in the
regression model.
Since IPL is T20 cricket, it is believed that the number of sixes a player has hit in the past would
have influenced his SOLD PRICE. A scatter plot between the SOLD PRICE of batsmen and the
number of sixes the player has hit can reveal this correlation. The scatter() method of matplotlib
can be used to draw the scatter plot; it takes both variables as parameters.
To show the direction of the relationship between the variables, regplot() of seaborn can be used.
The resulting figure (see the sketch below) shows that there is a positive correlation between the
number of sixes hit by a batsman and the SOLD PRICE.
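A minimal sketch of both plots, assuming ipl_auction_df contains SIXERS and SOLD PRICE columns:

# Sketch: relationship between SIXERS and SOLD PRICE.
plt.scatter(ipl_auction_df['SIXERS'], ipl_auction_df['SOLD PRICE'])   # plain scatter plot
plt.xlabel('SIXERS')
plt.ylabel('SOLD PRICE')
plt.show()

# regplot() additionally fits and draws a regression line showing the direction of the relationship.
sn.regplot(x='SIXERS', y='SOLD PRICE', data=ipl_auction_df)
plt.show()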
---------------------------------------------------------------------------------------------------------------------
7. Pair plot
If there are many variables, it is not convenient to draw scatter plots for each pair of variables
to understand the relationships. So, a pair plot can be used to depict the relationships in a single
diagram which can be plotted using pairplot() method.
The plot is drawn like a matrix, and each row and column is represented by a variable. Each cell
depicts the relationship between the two variables represented by that row and column. For
example, the cell in the second row and first column shows the relationship between AVE and SR-B.
The diagonal of the matrix shows the distribution of each variable. Among these correlations, AVE
and SIXERS appear to be more strongly correlated with SOLD PRICE than SR-B is.
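A minimal sketch, assuming the columns discussed above are present in ipl_auction_df:

# Sketch: pair plot of a few selected features against SOLD PRICE.
influential_features = ['SR-B', 'AVE', 'SIXERS', 'SOLD PRICE']
sn.pairplot(ipl_auction_df[influential_features])
plt.show()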
8. Correlation and Heatmap.
Correlation is used for measuring the strength and direction of the linear relationship between
two continuous random variables X and Y. It is a statistical measure that indicates the extent to
which two variables change together. A positive correlation means the variables increase or
decrease together; a negative correlation means if one variable increases, the other decreases.
1. The correlation value lies between −1.0 and 1.0. The sign indicates whether it is positive
or negative correlation.
2. −1.0 indicates a perfect negative correlation, whereas +1.0 indicates perfect positive
correlation. Correlation values can be computed using corr() method of the DataFrame
and rendered using a heat map.
The color map scale is shown alongside the heatmap. Setting the annot attribute to True prints the
correlation values in each box of the heatmap and improves its readability. Here the heatmap
shows that AVE and SIXERS have a positive correlation with SOLD PRICE, while SOLD PRICE
and SR-B are not so strongly correlated.
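A minimal sketch, assuming the same columns as in the pair plot example:

# Sketch: correlation matrix of the selected features rendered as an annotated heatmap.
influential_features = ['SR-B', 'AVE', 'SIXERS', 'SOLD PRICE']
sn.heatmap(ipl_auction_df[influential_features].corr(), annot=True)
plt.show()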
UNIT 1 CHAPTER 3: PROBABILITY DISTRIBUTIONS AND
HYPOTHESIS TESTS
1. Explain the terminologies that are used in probability theory.
3. Event: An event (E) is a subset of a sample space, and probability is usually calculated with
respect to an event. Examples of events include:
a. The number of cancellations of orders placed at an e-commerce portal exceeding
10%.
b. The number of fraudulent credit card transactions exceeding 1%.
c. The life of a capital equipment item being less than one year.
d. The number of warranty claims being less than 10 for a vehicle manufacturer with a
fleet of 2000 vehicles under warranty.
---------------------------------------------------------------------------------------------------------------------
2. What are random variables? Explain how the random variable can be classified.
A random variable is a function that maps every outcome in the sample space to a real
number. It plays an important role in describing, measuring, and analyzing uncertain events such
as customer churn, employee attrition, demand for a product, and so on. A random variable can
be classified as discrete or continuous depending on the values it can take.
If the random variable X can assume only a finite or countably infinite set of values, then it is
called a discrete random variable. Examples of discrete random variables are as follows:
a. Credit rating (usually classified into different categories such as low, medium, and
high or using labels such as AAA, AA, A, BBB, etc.).
b. Number of orders received at an e-commerce retailer which can be countably infinite.
c. Customer churn [the random variables take binary values: (a) Churn and (b) Do not
churn].
d. Fraud [the random variables take binary values: (a) Fraudulent transaction and (b)
Genuine transaction].
A random variable X that can take any value from an uncountably infinite set of values (e.g., an
interval of real numbers) is called a continuous random variable. Examples of continuous random
variables are as follows:
a. Market share of a company (which take any value from an infinite set of values between
0 and 100%).
b. Percentage of attrition of employees of an organization.
c. Time-to-failure of an engineering system.
d. Time taken to complete an order placed at an e-commerce portal.
Discrete random variables are described using probability mass function (PMF) and cumulative
distribution function (CDF). PMF is the probability that a random variable X takes a specific
value k; for example, the number of fraudulent transactions at an e-commerce platform is 10,
written as P(X = 10). On the other hand, CDF is the probability that a random variable X takes a
value less than or equal to 10, which is written as P(X ≤ 10). Continuous random variables are
described using probability density function (PDF) and cumulative distribution function (CDF).
PDF describes the probability that a continuous random variable takes a value in a small
neighborhood of “x” and is given by f(x) = lim(δx→0) P(x ≤ X ≤ x + δx) / δx.
The CDF of a continuous random variable is the probability that the random variable takes a value
less than or equal to a value “a”, that is, F(a) = P(X ≤ a), the integral of f(x) from −∞ to a.
---------------------------------------------------------------------------------------------------------------------
Binomial Distribution: A binomial distribution applies under the following conditions:
a. The random variable can have only two outcomes − success and failure (trials with such
outcomes are known as Bernoulli trials).
b. The objective is to find the probability of getting x successes out of n trials.
c. The probability of success is p and thus the probability of failure is (1 − p).
d. The probability p is constant and does not change between trials.
Success and failure are generic terminologies used in the binomial distribution; based on the
context, we interpret what counts as success and failure. A few examples of business problems
with two possible outcomes are as follows:
a. Customer churn where the outcomes are: (a) Customer churn and (b) No customer churn.
b. Fraudulent insurance claims where the outcomes are: (a) Fraudulent claim and (b)
Genuine claim.
c. Loan repayment default by a customer where the outcomes are: (a) Default and
(b) No default.
The PMF of the binomial distribution (the probability that the number of successes will be
exactly x out of n trials) is given by P(X = x) = C(n, x) · p^x · (1 − p)^(n − x), where
C(n, x) = n! / (x! (n − x)!).
The CDF of the binomial distribution (the probability that the number of successes will be x or
fewer out of n trials) is given by P(X ≤ x) = Σ (k = 0 to x) C(n, k) · p^k · (1 − p)^(n − k).
In Python, the scipy.stats.binom class provides methods to work with binomial distribution.
Example: Fashion Trends Online (FTO) is an e-commerce company that sells women's apparel. It
is observed that 10% of their customers return the items purchased by them for many reasons
(such as size, color, and material mismatch). On a specific day, 20 customers purchased items
from FTO.
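The specific question is truncated in the notes; as an illustrative sketch, assuming we want the probability that exactly 5, or at most 5, of the 20 customers return their items (p = 0.1), scipy.stats.binom can be used:

# Sketch: binomial probabilities for the FTO example (n = 20 trials, p = 0.1 probability of return).
from scipy import stats

n, p = 20, 0.1
print(stats.binom.pmf(5, n, p))   # P(X = 5): exactly 5 customers return items
print(stats.binom.cdf(5, n, p))   # P(X <= 5): at most 5 customers return items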
---------------------------------------------------------------------------------------------------------------------
The Poisson distribution is a discrete probability distribution that expresses the probability of
a given number of events occurring in a fixed interval of time or space. These events must
happen with a known constant mean rate and independently of the time since the last event. For
example, the number of cancellations of orders by customers at an e-commerce portal, the number
of customer complaints, the number of cash withdrawals at an ATM, the number of typographical
errors in a book, the number of potholes on Bangalore roads, etc. To find the probability of a given
number of events, we use the Poisson distribution. The PMF of a Poisson distribution is given by
P(X = k) = (λ^k · e^−λ) / k!
where:
λ is the average number of events in the interval, e is Euler's number (≈ 2.71828), and
k! is the factorial of k.
Example:
Suppose a call center receives an average of 5 calls per hour. To find the probability that exactly
3 calls will be received in an hour, we use λ = 5 and k = 3:
P(X = 3) = (5³ · e^−5) / 3! ≈ 0.1404
Thus, the probability of receiving exactly 3 calls in an hour is approximately 0.1404 or 14.04%.
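The same result can be checked with scipy, as in this small sketch:

# Sketch: verifying the call-center example (lambda = 5 calls/hour, k = 3 calls).
from scipy import stats

print(stats.poisson.pmf(3, 5))   # ~0.1404, matching the hand calculation above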
---------------------------------------------------------------------------------------------------------------------
The exponential distribution is a continuous distribution often used to model the time between
events (e.g., time-to-failure). Its PDF is given by f(x) = λ · e^−λx for x ≥ 0, where:
a. The parameter λ is the rate parameter and represents the rate of occurrence of the event.
b. The mean of the exponential distribution is given by 1/λ.
Example:
The time-to-failure of an avionic system follows an exponential distribution with a mean time
between failures (MTBF) of 1000 hours. Calculate
a. The probability that the system will fail before 1000 hours.
b. The probability that it will not fail up to 2000 hours.
c. The time by which 10% of the system will fail (i.e., calculate P10 life).
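A small sketch of how these three quantities could be computed with scipy.stats.expon (MTBF = 1000 hours, i.e., λ = 1/1000, which scipy parameterizes via scale = 1/λ):

# Sketch: exponential-distribution answers for the avionic-system example.
from scipy import stats

mtbf = 1000
dist = stats.expon(scale=mtbf)

print(dist.cdf(1000))       # (a) P(failure before 1000 hours) = 1 - e^-1, approx. 0.632
print(1 - dist.cdf(2000))   # (b) P(no failure up to 2000 hours) = e^-2, approx. 0.135
print(dist.ppf(0.10))       # (c) P10 life: time by which 10% of systems fail, approx. 105.4 hours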
---------------------------------------------------------------------------------------------------------------------
Normal distribution, also known as Gaussian distribution, is one of the most popular
continuous distributions in the field of analytics, especially due to its use in multiple contexts.
Normal distribution is observed across many naturally occurring measures such as age, salary,
sales volume, birth weight, height, etc. It is also popularly known as the bell curve (as it is shaped
like a bell). The normal distribution is parameterized by two parameters: the mean of the
distribution μ and the variance σ². The sample mean is given by x̄ = (1/n) Σ xᵢ, and the sample
variance is given by s² = Σ (xᵢ − x̄)² / (n − 1).
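A small sketch of basic normal-distribution calculations with scipy.stats.norm (the mean and standard deviation values are arbitrary illustration choices):

# Sketch: probabilities and percentiles for a normal distribution.
from scipy import stats

mu, sigma = 50, 10
dist = stats.norm(loc=mu, scale=sigma)

print(dist.cdf(60))       # P(X <= 60)
print(1 - dist.cdf(70))   # P(X > 70)
print(dist.ppf(0.95))     # value below which 95% of observations fall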
---------------------------------------------------------------------------------------------------------------------
7. What is Central Limit Theorem?
The Central Limit Theorem (CLT) states that, for a sufficiently large sample size, the sampling
distribution of the sample mean of independent, identically distributed random variables
approaches a normal distribution, regardless of the shape of the underlying population
distribution. In practice, sample sizes larger than 30 are usually considered large enough for the
CLT to hold, which is why the normal distribution plays a central role in estimation and
hypothesis testing.
---------------------------------------------------------------------------------------------------------------------
Hypothesis is a claim and the objective of hypothesis testing is to either reject or retain a null
hypothesis (current belief) with the help of data. Hypothesis testing consists of two
complementary statements called null hypothesis and alternative hypothesis. Null hypothesis is
an existing belief and the alternative hypothesis is what we intend to establish with new evidence
(samples). Hypothesis tests are broadly classified into parametric tests and non-parametric tests.
Parametric tests are about population parameters of a distribution such as mean, proportion,
standard deviation, etc., whereas non-parametric tests are not about parameters, but about other
characteristics such as independence of events or whether data follow a certain distribution such
as the normal distribution. Examples of hypotheses (claims) include:
a. Children who drink the health drink Complan (a health drink produced by the company
Heinz in India) are likely to grow taller.
b. Women use camera phones more than men (Freier, 2016).
c. Vegetarians miss fewer flights (Siegel, 2016).
d. Smokers are better sales people.
---------------------------------------------------------------------------------------------------------------------
9. Explain the Z-Test, One Sample t-Test, Two Sample t-Test, Paired Sample t-Test, and
Chi-Square Goodness of Fit Test.
Z-Test: A Z-test is used in the following situations:
a. We need to test the value of the population mean, given that the population variance is known.
b. The population is normally distributed and the population variance is known.
c. The sample size is large and the population variance is known; that is, the assumption of
normality can be relaxed for large samples (n > 30).
The Z-statistic is calculated as Z = (X̄ − μ) / (σ / √n), where X̄ is the sample mean, μ is the
hypothesized population mean, σ is the population standard deviation, and n is the sample size.
One Sample t-Test: The t-test is used when the population standard deviation is unknown and is
therefore estimated from the sample (denoted S). Mathematically, the t-statistic is
t = (X̄ − μ) / (S / √n), which follows a t-distribution with (n − 1) degrees of freedom. The null
hypothesis is that the expected value (mean) of a sample of independent observations is equal to
the given population mean.
Two Sample t-test: A two-sample t-test is required to test difference between two
population means where standard deviations are unknown. The parameters are estimated from
the samples.
Paired Sample t-test: This test is used to check whether the difference in the parameter values
before and after an intervention, or between two different types of interventions, is statistically
significant. It is called a paired sample t-test because the two measurements are taken on the
same sample.
Chi-Square Goodness of Fit Test: The chi-square goodness of fit test is a non-parametric test used
for comparing the observed distribution of data with the expected distribution, to decide whether
there is any statistically significant difference between the observed distribution and a theoretical
distribution. The chi-square statistic is given by χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed
frequency and Eᵢ is the expected frequency of category i.
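A small sketch of how a one-sample t-test and a chi-square goodness-of-fit test can be run with scipy.stats (the sample values and expected frequencies below are made-up numbers for illustration):

# Sketch: one-sample t-test and chi-square goodness-of-fit test using scipy.
import numpy as np
from scipy import stats

# One-sample t-test: is the sample mean significantly different from a claimed mean of 50?
sample = np.array([48.2, 51.5, 49.8, 50.9, 47.6, 52.3, 49.1, 50.4])
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("t =", t_stat, "p =", p_value)   # reject the null hypothesis if p < 0.05

# Chi-square goodness-of-fit: do observed counts match the expected distribution?
observed = np.array([18, 22, 20, 40])
expected = np.array([25, 25, 25, 25])
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print("chi2 =", chi2_stat, "p =", p_value)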
---------------------------------------------------------------------------------------------------------------------
A machine learning model is a program that has been trained on a set of data to recognize
certain types of patterns. These models use statistical techniques to infer the relationships within
the data and make predictions or decisions without being explicitly programmed for specific
tasks.
a. Logical Models: Logical models use a logical expression to divide the instance space
into segments and hence construct grouping models. A logical expression is an
expression that returns a Boolean value, i.e., a True or False outcome. Once the data is
grouped using a logical expression, the data is divided into homogeneous groupings for
the problem we are trying to solve. For example, for a classification problem, all the
instances in the group belong to one class. There are mainly two kinds of logical models:
Tree models and Rule models. Rule models consist of a collection of implications or IF-
THEN rules. For tree-based models, the ‘if-part’ defines a segment and the ‘then-part’
defines the behavior of the model for this segment. Rule models follow the same
reasoning. Tree models can be seen as a particular type of rule model where the if-parts
of the rules are organized in a tree structure. Both Tree models and Rule models use the
same approach to supervised learning. The approach can be summarized in two
strategies: we could first find the body of the rule (the concept) that covers a sufficiently
homogeneous set of examples and then find a label to represent the body. Alternatively,
we could approach it from the other direction, i.e., first select a class we want to learn and
then find rules that cover examples of the class. A well-known example of a tree-based model is
the Titanic survival tree, which shows survival numbers of passengers on the Titanic ("sibsp" is
the number of spouses or siblings aboard). The values under the leaves show the probability of
survival and the percentage of observations in the leaf. The model can be summarized as: your
chances of survival were good if you were (i) a female or (ii) a male younger than 9.5 years with
fewer than three siblings or spouses aboard (sibsp < 2.5).
b. Linear Models: Linear models are relatively simple. In this case, the function is
represented as a linear combination of its inputs. Thus, if x1 and x2 are two scalars or
vectors of the same dimension and a and b are arbitrary scalars, then ax1 + bx2 represents a
linear combination of x1 and x2. In the simplest case where f(x) represents a straight line,
we have an equation of the form f (x) = mx + c where c represents the intercept and m
represents the slope.
Linear models are parametric, which means that they have a fixed form with a small number of
numeric parameters that need to be learned from data. For example, in f (x) = mx + c, m and c are
the parameters that we are trying to learn from the data. This technique is different from tree or
rule models, where the structure of the model (e.g., which features to use in the tree, and where)
is not fixed in advance.
Linear models are stable, i.e., small variations in the training data have only a limited impact on
the learned model. In contrast, tree models tend to vary more with the training data, as the
choice of a different split at the root of the tree typically means that the rest of the tree is
different as well. As a result of having relatively few parameters, Linear models have low
variance and high bias. This implies that Linear models are less likely to overfit the training
data than some other models.
c. Distance-Based Models: Distance-based models work on the concept of distance between data
points. Distance is applied through the concept of neighbors and exemplars. Neighbors are points
in proximity with respect to the chosen distance measure, expressed through exemplars. Exemplars
are either centroids, which find a center of mass according to a chosen distance metric, or medoids,
which find the most centrally located data point. The most commonly used centroid is the
arithmetic mean, which minimizes squared Euclidean distance to all other points.
Notes:
The centroid represents the geometric center of a plane figure, i.e., the arithmetic mean
position of all the points in the figure. This definition extends to any object in n-dimensional
space: its centroid is the mean position of all the points.
Medoids are similar in concept to means or centroids. Medoids are most commonly used
on data when a mean or centroid cannot be defined. They are used in contexts where the
centroid is not representative of the dataset, such as in image data.
Examples of distance-based models include the nearest-neighbor models, which use the training
data as exemplars – for example, in classification. The K-means clustering algorithm also uses
exemplars to create clusters of similar data points.
d. Probabilistic Models: The goal of a probabilistic classifier is, given a set of features (x_0
through x_n) and a set of classes (c_0 through c_k), to determine the probability of the features
occurring in each class and to return the most likely class. Therefore, for each class, we need to
calculate P(c_i | x_0, …, x_n).
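As an illustrative sketch of a probabilistic classifier (using Gaussian Naive Bayes from scikit-learn, an example not named in the notes above), which estimates P(c_i | x) for each class and returns the most likely one:

# Sketch: a simple probabilistic classifier that estimates class probabilities.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# predict_proba returns P(c_i | x) for each class; predict returns the most likely class.
print(model.predict_proba(X[:1]))   # class probabilities for the first sample
print(model.predict(X[:1]))         # the class with the highest probability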
---------------------------------------------------------------------------------------------------------------------
Feature extraction involves transforming the data into a new space with fewer dimensions. This
new space captures the most important information from the original data.
o Description: PCA is a statistical method that transforms the original features into
a new set of orthogonal (uncorrelated) features called principal components. The
first principal component captures the most variance in the data, and each
subsequent component captures the remaining variance.
4. Autoencoders:
o Description: An autoencoder is a neural network trained to reconstruct its own
input; the narrow hidden (bottleneck) layer learns a compressed, lower-dimensional
representation of the data that can be used as extracted features.
Feature selection involves selecting a subset of relevant features for use in model construction. It
aims to retain the most informative features while discarding irrelevant or redundant ones.
1. Filter Methods:
o Description: Filter methods rank features based on statistical measures and select
the top-ranking features. Common criteria include correlation coefficients, chi-
square tests, and mutual information.
2. Wrapper Methods:
o Description: Wrapper methods evaluate subsets of features by repeatedly training a
model on candidate subsets and keeping the subset that yields the best performance.
Recursive Feature Elimination (RFE) is a common wrapper technique.
o Example: Using RFE with a decision tree classifier to select the most important
features for predicting loan defaults.
3. Embedded Methods:
o Description: Embedded methods perform feature selection as part of the model
training process. Techniques include regularization methods such as Lasso (L1
regularization), which can shrink some feature coefficients exactly to zero, and
Ridge (L2 regularization), which penalizes large coefficients.
o Description: Tree-based methods like Random Forests and Gradient Boosting can
provide feature importance scores based on how often a feature is used to split the
data.
o Example: Using feature importance scores from a Random Forest model to select
the top features for a classification task.
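A small combined sketch of feature extraction with PCA and embedded feature selection via Random Forest importances (the dataset and the number of components/features kept are arbitrary illustration choices):

# Sketch: feature extraction with PCA and feature selection via Random Forest importances.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Feature extraction: project the 30 original features onto the first 5 principal components.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Feature selection (embedded method): rank original features by Random Forest importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_features = np.argsort(forest.feature_importances_)[::-1][:5]
print("Indices of the 5 most important features:", top_features)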