
UNIT 1 CHAPTER 1: INTRODUCTION TO MACHINE LEARNING

1. Introduction to analytics and machine learning:

Analytics is a collection of techniques and tools used for creating value from data. These techniques include artificial intelligence (AI), machine learning (ML), and deep learning (DL) algorithms. AI, ML, and DL are defined as follows:

1. Artificial Intelligence: Algorithms and systems that exhibit human-like intelligence.


2. Machine Learning: Subset of AI that can learn to perform a task with extracted data
and/or models.
3. Deep Learning: Subset of machine learning that imitates the functioning of the human
brain to solve problems.

Machine learning is a set of algorithms that have the capability to learn to perform tasks such as
prediction and classification effectively using data.

Machine learning algorithms are classified into four categories as defined below:

1. Supervised Learning Algorithms: These algorithms require the knowledge of both the
outcome variable (dependent variable) and the features (independent variable or input
variables). The algorithm learns (i.e., estimates the values of the model parameters or
feature weights) by defining a loss function which is usually a function of the difference
between the predicted value and actual value of the outcome variable. Algorithms such as
linear regression, logistic regression, and discriminant analysis are examples of supervised learning algorithms.
2. Unsupervised Learning Algorithms: These algorithms do not require knowledge of the outcome variable in the dataset; they must find the possible values of the outcome variable on their own. Algorithms such as clustering and principal
component analysis are examples of unsupervised learning algorithms.
3. Reinforcement Learning Algorithms: Reinforcement learning algorithms are
algorithms that must take sequential actions (decisions) to maximize a cumulative
reward. In many datasets, there could be uncertainty around both input as well as the
output variables. For example, consider the case of spell check in various text editors. If a
person types “buutiful” in Microsoft Word, the spell check in Microsoft Word will
immediately identify this as a spelling mistake and give options such as “beautiful”,
“bountiful”, and “dutiful”. Here the prediction is not one single value, but a set of values.
Techniques such as Markov chain and Markov decision process are examples of
reinforcement learning algorithms.
4. Evolutionary Learning Algorithms: Evolutionary algorithms are algorithms that imitate
natural evolution to solve a problem. Techniques such as genetic algorithm and ant
colony optimization fall under the category of evolutionary learning algorithms.

---------------------------------------------------------------------------------------------------------------------

2. What are the steps for a typical ML algorithm?

 A typical ML algorithm uses the following steps:

1. Identify the problem or opportunity for value creation.

2. Identify sources of data (primary as well as secondary data sources) and create a data lake
(an integrated dataset from different sources).
3. Pre-process the data for issues such as missing and incorrect data. Generate derived
variables (feature engineering) and transform the data if necessary. Prepare the data for
ML model building.
4. Divide the dataset into training and validation subsets (see the code sketch after this list).
5. Build ML models and identify the best model(s) using model performance in validation
data.
6. Implement Solution/Decision/Develop Product.
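
A minimal scikit-learn sketch of steps 4 and 5 on synthetic data (the dataset, model choice, and metric here are arbitrary illustrations, not a prescribed recipe):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a pre-processed dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Step 4: split into training and validation subsets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: build a candidate model and check its performance on the validation data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_val, model.predict(X_val)))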

---------------------------------------------------------------------------------------
3. List the frameworks which are used for developing machine learning models.

 The framework for ML algorithm development can be divided into five integrated stages:
problem and opportunity identification, collection of relevant data, data pre-processing, ML
model building, and model deployment.

The success of ML projects will depend on the following activities:


1. Feature Extraction: Feature extraction is a process of extracting features from different
sources. For a given problem, it is important to identify the features or independent
variables that may be necessary for building the ML algorithm. Organizations store data
captured by them in enterprise resource planning (ERP) systems, but there is no
guarantee that the organization would have identified all important features while
designing the ERP system. It is also possible that the problem being addressed using the
ML algorithm may require data that is not captured by the organization. For example,
consider a company that is interested in predicting the warranty cost for the vehicles manufactured by it. The number of warranty claims may depend on weather
conditions such as rainfall, humidity, and so on. In many cases, feature extraction itself
can be an iterative process.
2. Feature Engineering: Once the data is made available (after feature extraction), an
important step in machine learning is feature engineering. The model developer should
decide how to use the captured data by deriving new features. For example, if X1 and X2 are two features captured in the original data, new features can be derived by taking their ratio (X1/X2) and product (X1 × X2).
3. Model Building and Feature Selection: During model building, the objective is to
identify the model that is more suitable for the given problem context. The selected
model may not be always the most accurate model, as accurate models may take more
time to compute and may require expensive infrastructure. The final model for
deployment will be based on multiple criteria such as accuracy, computing speed, cost of
deployment, and so on.
4. Model Deployment: Once the final model is chosen, then the organization must decide
the strategy for model deployment. Model deployment can be in the form of simple
business rules, chatbots, real-time actions, robots, and so on.

---------------------------------------------------------------------------------------------------------------------


4. Explain the key terminologies of machine learning.

1. Model: A model is a mathematical representation of a real-world process, trained on data to make predictions or decisions.

2. Algorithm: An algorithm is a set of rules or instructions used by a model to learn patterns
from data.

3. Training Data: Training data is the dataset used to train a machine learning model. It
includes input-output pairs where the output is known.

4. Test Data: Test data is a separate dataset used to evaluate the performance of a trained
model. It helps measure the model’s ability to generalize to new data.

5. Validation Data: Validation data is used during the training process to tune
hyperparameters and avoid overfitting. It provides an unbiased evaluation of the model
fit.

6. Overfitting: Overfitting occurs when a model learns the training data too well, capturing
noise and irrelevant patterns, leading to poor performance on new data.

7. Underfitting: Underfitting occurs when a model is too simple to capture the underlying
patterns in the data, resulting in poor performance on both training and test data.

8. Feature: A feature is an individual measurable property or characteristic of a phenomenon being observed. Features are the input variables used for training a model.

9. Label: A label is the output variable in supervised learning, representing the ground truth
that the model aims to predict.

10. Supervised Learning: A type of machine learning where the model is trained on labeled
data, i.e., data that includes both input features and the corresponding output labels.

11. Unsupervised Learning: A type of machine learning where the model is trained on
unlabeled data, aiming to find hidden patterns or intrinsic structures in the input data.

12. Reinforcement Learning: A type of machine learning where an agent learns to make
decisions by performing actions in an environment to maximize cumulative rewards.

---------------------------------------------------------------------------------------------------------------------

5. Explain the key tasks of machine learning.

1. Classification: The task of predicting a categorical label for given input data. Examples
include spam detection and image classification.
2. Regression: The task of predicting a continuous value for given input data. Examples
include predicting house prices and stock prices.
3. Clustering: The task of grouping similar data points together based on certain
characteristics. Examples include customer segmentation and image segmentation.
4. Dimensionality Reduction: The task of reducing the number of input variables in a
dataset. Examples include Principal Component Analysis (PCA) and t-SNE.
5. Anomaly Detection: The task of identifying rare items, events, or observations that differ
significantly from the majority of the data. Examples include fraud detection and network
security.
6. Reinforcement Learning Tasks: Tasks where an agent interacts with an environment to
learn a policy that maximizes cumulative rewards. Examples include game playing and
robotics.

---------------------------------------------------------------------------------------------------------------------

6. Write a note on selection of machine learning algorithms.

1. Type of Problem: Identify whether the problem is classification, regression, clustering, etc., and choose an algorithm suited for that task.
2. Data Characteristics: Consider the size, dimensionality, and nature of the data. Some
algorithms perform better with large datasets or specific data distributions.
3. Accuracy and Interpretability: Trade-off between accuracy and interpretability. Some
models like neural networks are highly accurate but less interpretable, while models like
decision trees are more interpretable.
4. Computational Efficiency: Consider the computational resources required for training
and prediction. Some algorithms may require more memory and processing power.
5. Scalability: Choose algorithms that scale well with increasing data size and complexity.
6. Model Complexity: Consider the complexity of the model and the risk of overfitting.
Simple models may generalize better but may not capture complex patterns.

---------------------------------------------------------------------------------------------------------------------

7. Explain the development process of machine learning algorithms

1. Problem Definition: Clearly define the problem and objectives. Understand the business
context and the requirements for the machine learning application.

2. Data Collection and Preprocessing: Collect relevant data and preprocess it. This
includes cleaning the data, handling missing values, normalizing/standardizing features,
and splitting the data into training, validation, and test sets.

3. Feature Engineering: Create and select relevant features that will help the model learn
patterns from the data. This may involve domain knowledge and various techniques like
encoding categorical variables and creating interaction features.
4. Model Selection and Training: Select appropriate machine learning algorithms and train
the models on the training data. Tune hyperparameters using the validation set.

5. Model Evaluation: Evaluate the trained models using performance metrics relevant to
the problem (e.g., accuracy, precision, recall, F1-score for classification; RMSE, MAE
for regression).

6. Model Deployment: Deploy the model to a production environment where it can make
predictions on new data. Ensure the model can handle real-time data and integrate with
existing systems.

7. Monitoring and Maintenance: Continuously monitor the model’s performance and retrain or update it as necessary. Address issues like model drift and ensure the model adapts to changing data patterns.

8. Iterative Improvement: Continuously improve the model by iterating on the previous steps, incorporating feedback, and exploring new features or algorithms.

---------------------------------------------------------------------------------------------------------------------

8. Explain Feature Engineering in detail.

 Feature engineering involves creating new features or transforming existing ones to improve model performance. It leverages domain knowledge and data manipulation techniques; a short code sketch after the list below illustrates the transformations.

1. Creating Interaction Features: Combine two or more features to capture interactions between them. Example: Multiplying age and income to create an interaction feature in a financial dataset.

2. Binning and Discretization: Transform continuous features into categorical ones by grouping values into bins. Example: Binning ages into groups like 0-18, 19-35, 36-50, 51+.

3. Polynomial Features: Create new features by raising existing features to a power. Example: Adding squared terms of existing features to capture non-linear relationships.

4. Log Transformations: Apply logarithmic transformations to skewed features to normalize their distribution. Example: Applying a log transformation to highly skewed income data.
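
A minimal pandas sketch of these transformations on a small made-up DataFrame (the age and income columns are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 47, 60], 'income': [30000, 52000, 80000, 65000]})

df['age_x_income'] = df['age'] * df['income']          # interaction feature
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 120],
                         labels=['0-18', '19-35', '36-50', '51+'])   # binning
df['income_squared'] = df['income'] ** 2               # polynomial feature
df['log_income'] = np.log(df['income'])                # log transformation
print(df)

---------------------------------------------------------------------------------------------------------------------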
9. What is model evaluation and model selection?

 Model evaluation involves assessing the performance of a machine learning model on a dataset. It helps to understand how well the model generalizes to unseen data and to compare different models.

Model selection involves choosing the best model from a set of candidate models based on their
performance on a validation set. It ensures that the chosen model generalizes well to unseen data.

Cross-Validating Models

Cross-validation is a technique for assessing how a model generalizes to an independent dataset.


The most common type is k-fold cross-validation, where the dataset is divided into k subsets.
The model is trained on k-1 subsets and tested on the remaining subset, and this process is
repeated k times.

 K-Fold Cross-Validation:

o Description: The dataset is split into k equal parts. The model is trained k times,
each time leaving out one part for validation and using the remaining k-1 parts for
training. The results are averaged to produce a final performance metric.

o Example: Using 5-fold cross-validation for a dataset of 100 samples. Each fold
contains 20 samples.

 Stratified K-Fold Cross-Validation:

o Description: Similar to k-fold but ensures that each fold is representative of the
overall class distribution.

o Example: Using stratified 5-fold cross-validation for imbalanced classification problems; a short code sketch follows.
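
A minimal scikit-learn sketch of both variants on synthetic data (the model and dataset below are placeholders for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000)

# Plain 5-fold cross-validation: each fold holds 20 of the 100 samples
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))

# Stratified 5-fold: each fold preserves the overall class distribution
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print(kfold_scores.mean(), strat_scores.mean())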

---------------------------------------------------------------------------------------------------------------------

UNIT 1 CHAPTER 2: DESCRIPTIVE ANALYTICS


1. Exploration of Data using Visualization.

 Data visualization is useful to gain insights and understand what happened in the past in a
given context. It is also helpful for feature engineering.

Matplotlib: Matplotlib is a Python 2D plotting library and the most widely used library for data visualization. It provides an extensive set of plotting APIs to create various plots such as scatter, bar, box, and distribution plots with custom styling and annotation. It is a library for creating 2D plots of arrays in Python. It is written in Python and makes use of NumPy arrays. Matplotlib is well integrated with pandas to read columns and create plots.

Seaborn: Seaborn is also a Python data visualization library, based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical charts.

To create graphs and plots, we need to import the matplotlib.pyplot and seaborn modules. To display the plots inside the Jupyter Notebook, we need to provide the directive %matplotlib inline; the plots are rendered in the notebook only if this directive is provided.
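
A minimal sketch of these imports (the sn alias matches the plotting examples that follow; %matplotlib inline is an IPython directive and works only inside a notebook):

# In a Jupyter Notebook cell
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline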

---------------------------------------------------------------------------------------------------------------------

2. Bar Chart.

 A bar chart, also known as a bar graph, is a visual representation of data where individual bars
represent different categories or values. Each bar's length or height is proportional to the value it
represents. Bar charts are commonly used to compare quantities across different categories or to
show changes over time when the data is categorized.

To draw a bar chart, call the barplot() method of the seaborn library. The DataFrame should be passed in the data parameter. To display the average sold price for each age category, pass SOLD PRICE as the y parameter and AGE as the x parameter.

sn.barplot(x='AGE', y='SOLD PRICE', data=soldprice_by_age);
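
The soldprice_by_age DataFrame above is assumed to be a grouped aggregate of the IPL auction data used in these notes; a hedged sketch of how it could be prepared (the CSV file name is a placeholder):

import pandas as pd

ipl_auction_df = pd.read_csv('ipl_auction.csv')   # placeholder path for the IPL auction dataset
# Average SOLD PRICE for each AGE category, as used by the bar chart above
soldprice_by_age = ipl_auction_df.groupby('AGE')['SOLD PRICE'].mean().reset_index()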

3. Histogram

 A histogram is a plot that shows the frequency distribution of a continuous variable. A histogram gives an insight into the underlying distribution of the variable (e.g., normal distribution), outliers, skewness, etc. To draw a histogram, invoke the hist() method of the matplotlib library.

plt.hist(ipl_auction_df['SOLD PRICE']);


The histogram shows that SOLD PRICE is right skewed. Most players are auctioned in the low price range of 250,000 to 500,000, whereas a few players are paid very highly, more than 1 million dollars.

---------------------------------------------------------------------------------------------------------------------

4. Distribution or density plot

 A distribution or density plot depicts the distribution of data over a continuous interval. A density plot is like a smoothed histogram and gives insight into what the distribution of the underlying population might be. To draw the distribution plot, we can use distplot() of the seaborn library.

sn.distplot(ipl_auction_df['SOLD PRICE']);

5. Box Plot

 Box plot (aka Box and Whisker plot) is a graphical representation of numerical data that can
be used to understand the variability of the data and the existence of outliers. Box plot is
designed by identifying the following descriptive statistics: 1. Lower quartile (1st quartile),
median and upper quartile (3rd quartile). 2. Lowest and highest values. 3. Inter-quartile range
(IQR). Box plot is constructed using IQR, minimum and maximum values. IQR is the distance
(difference) between the 3rd quartile and 1st quartile. The length of the box is equivalent to IQR.
It is possible that the data may contain values beyond Q1 − 1.5 IQR and Q3 + 1.5 IQR. The whiskers of the box plot extend up to Q1 − 1.5 IQR and Q3 + 1.5 IQR; observations beyond these two limits are potential outliers. To draw the box plot, call boxplot() of the seaborn library.

box = sn.boxplot(ipl_auction_df['SOLD PRICE']);

---------------------------------------------------------------------------------------------------------------------

6. Scatter Plot

 In a scatter plot, the values of two variables are plotted along two axes and resulting pattern
can reveal correlation present between the variables, if any. The relationship could be linear or
non-linear. A scatter plot is also useful for assessing the strength of the relationship and to find if
there are any outliers in the data. Scatter plots are used during regression model building to
decide on the initial model, that is whether to include a variable in a regression model or not.
Since IPL is T20 cricket, it is believed that the number of sixers a player has hit in past would
have influenced his SOLD PRICE. A scatter plot between SOLD PRICE of batsman and number
of sixes the player has hit can establish this correlation. The scatter() method of matplotlib can be
used to draw the scatter plot which takes both the variables.

To show the direction of the relationship between the variables, regplot() of seaborn can be used. The resulting plot shows that there is a positive correlation between the number of sixes hit by a batsman and the SOLD PRICE.
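
A minimal sketch of the two plots described above, assuming the ipl_auction_df DataFrame and the plt/sn imports from the earlier examples:

# Scatter plot of the two variables
plt.scatter(ipl_auction_df['SIXERS'], ipl_auction_df['SOLD PRICE']);
plt.xlabel('SIXERS'); plt.ylabel('SOLD PRICE');

# regplot() adds a fitted regression line showing the direction of the relationship
sn.regplot(x='SIXERS', y='SOLD PRICE', data=ipl_auction_df);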

---------------------------------------------------------------------------------------------------------------------

7. Pair plot

 If there are many variables, it is not convenient to draw scatter plots for each pair of variables
to understand the relationships. So, a pair plot can be used to depict the relationships in a single
diagram which can be plotted using pairplot() method.

influential_features = ['SR-B', 'AVE', 'SIXERS', 'SOLD PRICE']


sn.pairplot(ipl_auction_df[influential_features], size=2)

The plot is drawn like a matrix and each row and column is represented by a variable. Each cell
depicts the relationship between two variables, represented by that row and column variable. For
example, the cell on second row and first column shows the relationship between AVE and SR-B.
The diagonal of the matrix shows the distribution of the variable. For all the correlations, AVE
and SIXERS seem to be highly correlated with SOLD PRICE compared to SR-B.
8. Correlation and Heatmap.

 Correlation is used for measuring the strength and direction of the linear relationship between
two continuous random variables X and Y. It is a statistical measure that indicates the extent to
which two variables change together. A positive correlation means the variables increase or
decrease together; a negative correlation means if one variable increases, the other decreases.

1. The correlation value lies between −1.0 and 1.0. The sign indicates whether it is positive
or negative correlation.
2. −1.0 indicates a perfect negative correlation, whereas +1.0 indicates perfect positive
correlation. Correlation values can be computed using corr() method of the DataFrame
and rendered using a heat map.

The color map scale is shown along the heatmap. Setting the annot attribute to True prints the correlation values in each box of the heatmap and improves its readability. Here the heatmap shows that AVE and SIXERS show a positive correlation with SOLD PRICE, while SR-B is not so strongly correlated with SOLD PRICE.
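
A minimal sketch of computing the correlations and rendering the heatmap, assuming the influential_features list from the pair plot example:

corr = ipl_auction_df[influential_features].corr()   # correlation matrix via DataFrame.corr()
sn.heatmap(corr, annot=True);                        # annot=True prints the values in each cell

---------------------------------------------------------------------------------------------------------------------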
UNIT 1 CHAPTER 3: PROBABILITY DISTRIBUTIONS AND
HYPOTHESIS TESTS
1. Explain the terminologies that are used in probability theory.

 There are three basic terminologies used in probability theory:

1. Random Experiment: A random experiment is an experiment in which the outcome is not known with certainty. That is, the output of a random experiment cannot be predicted with certainty.
2. Sample Space: Sample space is the universal set that consists of all possible outcomes of
an experiment. Sample space is usually represented using the letter “S” and individual
outcomes are called the elementary events. The sample space can be finite or infinite.
Few random experiments and their sample spaces are discussed below:
a. Experiment: Outcome of a college application.
Sample Space = S = {admitted, not admitted}
b. Experiment: Predicting customer churn at an individual customer level.
Sample Space = S = {Churn, No Churn}

3. Event: Event (E) is a subset of a sample space and probability is usually calculated with
respect to an event. Examples of events include:
a. The number of cancellations of orders placed at an e-commerce portal exceeding 10%.
b. The number of fraudulent credit card transactions exceeding 1%.
c. The life of a capital equipment being less than one year.
d. Number of warranty claims less than 10 for a vehicle manufacturer with a fleet of
2000 vehicles under warranty.

---------------------------------------------------------------------------------------------------------------------

2. What are random variables? Explain how the random variable can be classified.

 A random variable is a function that maps every outcome in the sample space to a real
number. It plays an important role in describing, measuring, and analyzing uncertain events such
as customer churn, employee attrition, demand for a product, and so on. A random variable can
be classified as discrete or continuous depending on the values it can take.

If the random variable X can assume only a finite or countably infinite set of values, then it is
called a discrete random variable. Examples of discrete random variables are as follows:
a. Credit rating (usually classified into different categories such as low, medium, and
high or using labels such as AAA, AA, A, BBB, etc.).
b. Number of orders received at an e-commerce retailer which can be countably infinite.
c. Customer churn [the random variables take binary values: (a) Churn and (b) Do not
churn].
d. Fraud [the random variables take binary values: (a) Fraudulent transaction and (b)
Genuine transaction].

A random variable X which can take a value from an infinite set of values is called a continuous
random variable. Examples of continuous random variables are as follows:

a. Market share of a company (which take any value from an infinite set of values between
0 and 100%).
b. Percentage of attrition of employees of an organization.
c. Time-to-failure of an engineering system.
d. Time taken to complete an order placed at an e-commerce portal.

Discrete random variables are described using probability mass function (PMF) and cumulative
distribution function (CDF). PMF is the probability that a random variable X takes a specific
value k; for example, the number of fraudulent transactions at an e-commerce platform is 10,
written as P(X = 10). On the other hand, CDF is the probability that a random variable X takes a
value less than or equal to 10, which is written as P(X ≤ 10). Continuous random variables are
described using probability density function (PDF) and cumulative distribution function (CDF).
PDF is the probability that a continuous random variable takes a value in a small neighborhood of "x" and is given by

f(x) = lim(δx → 0) P(x ≤ X ≤ x + δx) / δx

The CDF of a continuous random variable is the probability that the random variable takes a value less than or equal to a value "a", that is, F(a) = P(X ≤ a) = ∫ from −∞ to a of f(x) dx.

---------------------------------------------------------------------------------------------------------------------

3. Explain in detail about Binomial Distribution with an example

 Binomial distribution is a discrete probability distribution and has several applications in many business contexts. A random variable X is said to follow a binomial distribution when:

a. The random variable can have only two outcomes − success and failure (also known as
Bernoulli trials).
b. The objective is to find the probability of getting x successes out of n trials.
c. The probability of success is p and thus the probability of failure is (1 − p).
d. The probability p is constant and does not change between trials.

Success and failure are generic terminologies used in the binomial distribution; based on the context, we interpret what success and failure mean. A few examples of business problems with two possible outcomes are as follows:

a. Customer churn where the outcomes are: (a) Customer churn and (b) No customer churn.
b. Fraudulent insurance claims where the outcomes are: (a) Fraudulent claim and (b)
Genuine claim.
c. Loan repayment default by a customer where the outcomes are: (a) Default and (b) No default.

The PMF of the binomial distribution (the probability that the number of successes will be exactly x out of n trials) is given by

P(X = x) = C(n, x) × p^x × (1 − p)^(n − x), where C(n, x) = n! / (x! (n − x)!)

The CDF of the binomial distribution (the probability that the number of successes will be x or fewer out of n trials) is given by

P(X ≤ x) = Σ (k = 0 to x) C(n, k) × p^k × (1 − p)^(n − k)

In Python, the scipy.stats.binom class provides methods to work with binomial distribution.

Example: Fashion Trends Online (FTO) is an e-commerce company that sells women's apparel. It is observed that 10% of its customers return the items purchased by them for various reasons (such as size, color, and material mismatch). On a specific day, 20 customers purchased items from FTO.
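
A hedged sketch using scipy.stats.binom for this example, computing (as an illustration) the probability that exactly 5, and that at most 5, of the 20 customers return their items:

from scipy import stats

n, p = 20, 0.10                     # 20 customers, 10% probability of a return
print(stats.binom.pmf(5, n, p))     # P(exactly 5 returns)  ≈ 0.0319
print(stats.binom.cdf(5, n, p))     # P(at most 5 returns)  ≈ 0.9887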

---------------------------------------------------------------------------------------------------------------------

4. Explain in detail about Poisson Distribution.

 The Poisson distribution is a discrete probability distribution that expresses the probability of
a given number of events occurring in a fixed interval of time or space. These events must
happen with a known constant mean rate and independently of the time since the last event. For
example, the number of cancellations of orders by customers at an e-commerce portal, the number of customer complaints, the number of cash withdrawals at an ATM, the number of typographical errors in a book, the number of potholes on Bangalore roads, etc. To find the probability of a given number of events, we use the Poisson distribution. The PMF of a Poisson distribution is given by

P(X = k) = (λ^k × e^(−λ)) / k!

where:

 X is the random variable representing the number of events,

 k is the actual number of events,

 λ is the average rate (mean) of events per interval,

 e is the base of the natural logarithm (approximately equal to 2.71828),

 k! is the factorial of k.

Example:

Suppose a call center receives an average of 5 calls per hour. To find the probability that exactly 3 calls will be received in an hour, we use λ = 5 and k = 3:

P(X = 3) = (5³ × e⁻⁵) / 3! = (125 × 0.006738) / 6 ≈ 0.1404

Thus, the probability of receiving exactly 3 calls in an hour is approximately 0.1404 or 14.04%.
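
The same computation with scipy.stats.poisson, as a quick check:

from scipy import stats

print(stats.poisson.pmf(3, 5))   # P(X = 3) with λ = 5  ≈ 0.1404
print(stats.poisson.cdf(3, 5))   # P(X ≤ 3) with λ = 5  ≈ 0.2650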

---------------------------------------------------------------------------------------------------------------------

5. Explain in detail about Exponential Distribution.

 Exponential distribution is a single-parameter continuous distribution that is traditionally used for modeling the time-to-failure of electronic components. The exponential distribution represents a process in which events occur continuously and independently at a constant average rate. The probability density function is given by

f(x) = λ × e^(−λx), for x ≥ 0

where:

a. The parameter λ is the rate parameter and represents the rate of occurrence of the event.
b. The mean of the exponential distribution is given by 1/λ.

Example:

The time-to-failure of an avionic system follows an exponential distribution with a mean time between failures (MTBF) of 1000 hours. Calculate the following (a code sketch follows this list):

a. The probability that the system will fail before 1000 hours.
b. The probability that it will not fail up to 2000 hours.
c. The time by which 10% of the system will fail (i.e., calculate P10 life).
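
A hedged sketch using scipy.stats.expon for this example; an MTBF of 1000 hours corresponds to λ = 1/1000, which scipy expresses through scale = 1/λ = 1000:

from scipy import stats

mtbf = 1000                               # mean time between failures (hours)
print(stats.expon.cdf(1000, scale=mtbf))  # (a) P(failure before 1000 h)   ≈ 0.632
print(stats.expon.sf(2000, scale=mtbf))   # (b) P(no failure up to 2000 h) ≈ 0.135
print(stats.expon.ppf(0.10, scale=mtbf))  # (c) P10 life                   ≈ 105.4 hours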

---------------------------------------------------------------------------------------------------------------------

6. What is Normal Distribution?

 Normal distribution, also known as Gaussian distribution, is one of the most popular continuous distributions in the field of analytics, especially due to its use in multiple contexts. Normal distribution is observed across many naturally occurring measures such as age, salary, sales volume, birth weight, height, etc. It is also popularly known as the bell curve (as it is shaped like a bell). The normal distribution is parameterized by two parameters: the mean of the distribution μ and the variance σ². The sample mean of a normal distribution is given by

x̄ = (1/n) × Σ xᵢ

and the sample variance is given by

s² = (1/(n − 1)) × Σ (xᵢ − x̄)²
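
A hedged sketch using scipy.stats.norm with hypothetical parameters (mean 50, standard deviation 10), only to illustrate how normal probabilities are computed:

from scipy import stats

mu, sigma = 50, 10                      # hypothetical mean and standard deviation
print(stats.norm.cdf(60, mu, sigma))    # P(X ≤ 60)        ≈ 0.841
print(stats.norm.ppf(0.95, mu, sigma))  # 95th percentile  ≈ 66.4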

---------------------------------------------------------------------------------------------------------------------
7. What is Central Limit Theorem?

 The Central Limit Theorem (CLT) states that, when sufficiently large random samples of size n are drawn from a population with mean μ and finite variance σ², the sampling distribution of the sample mean approaches a normal distribution with mean μ and variance σ²/n, irrespective of the shape of the population distribution. A sample size of n > 30 is usually considered large enough for the CLT to hold, which is why the normality assumption can be relaxed for large samples in tests such as the Z-test.
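
A small simulation sketch of the theorem: sample means drawn from a skewed (exponential) population become approximately normal; the population and sample sizes here are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)   # skewed population with mean 2

# Draw 1000 samples of size 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(1000)]

print(np.mean(sample_means))   # close to the population mean (≈ 2)
print(np.std(sample_means))    # close to sigma/sqrt(n) = 2/sqrt(50) ≈ 0.28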

---------------------------------------------------------------------------------------------------------------------

8. Explain what is hypothesis test in detail with example.

 A hypothesis is a claim, and the objective of hypothesis testing is to either reject or retain a null hypothesis (the current belief) with the help of data. Hypothesis testing consists of two complementary statements called the null hypothesis and the alternative hypothesis. The null hypothesis is an existing belief and the alternative hypothesis is what we intend to establish with new evidence (samples). Hypothesis tests are broadly classified into parametric tests and non-parametric tests.
Parametric tests are about population parameters of a distribution such as mean, proportion,
standard deviation, etc., whereas non-parametric tests are not about parameters, but about other
characteristics such as independence of events or data following certain distributions such as
normal distribution.

A few examples of the null hypothesis are as follows:

a. Children who drink Complan (a health drink produced by the company Heinz in India) are likely to grow taller.
b. Women use camera phones more than men (Freier, 2016).
c. Vegetarians miss fewer flights (Siegel, 2016).
d. Smokers are better sales people.

---------------------------------------------------------------------------------------------------------------------

9. Explain what is Z-Test, One Sample t-Test, Two Sample t-Test, Paired Sample t-Test,
Chi-Square Goodness of Fit Test.

 Z-test: Z-test is used when

a. We need to test the value of population mean, given that population variance is known.
b. The population is a normal distribution and the population variance is known.
c. The sample size is large and the population variance is known. That is, the assumption of normal distribution can be relaxed for large samples (n > 30). The Z-statistic is calculated as

Z = (X̄ − μ) / (σ / √n)

One Sample t-Test: The t-test is used when the population standard deviation is unknown and is therefore estimated from the sample standard deviation S. The null hypothesis is that the expected value (mean) of a sample of independent observations equals the given population mean μ, and the t-statistic (with n − 1 degrees of freedom) is

t = (X̄ − μ) / (S / √n)

Two Sample t-test: A two-sample t-test is required to test the difference between two population means when the standard deviations are unknown and are estimated from the samples.

Paired Sample t-test: A paired sample t-test is used to check whether the difference in the parameter values is statistically significant before and after an intervention, or between two different types of interventions applied on the same sample.

Chi-Square Goodness of Fit Test: The chi-square goodness of fit test is a non-parametric test used for comparing the observed distribution of the data with the expected distribution, to decide whether there is any statistically significant difference between the observed distribution and a theoretical distribution. The chi-square statistic is given by

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency and Eᵢ is the expected frequency in category i.
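
A hedged sketch of how these tests can be run with scipy.stats on small made-up samples (the data below is purely illustrative; scipy has no dedicated z-test function, so with a known σ the Z-statistic is usually computed directly from the formula above):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample1 = rng.normal(loc=52, scale=10, size=40)
sample2 = rng.normal(loc=48, scale=10, size=40)

print(stats.ttest_1samp(sample1, popmean=50))    # one-sample t-test against mean 50
print(stats.ttest_ind(sample1, sample2))         # two-sample (independent) t-test
print(stats.ttest_rel(sample1, sample2))         # paired sample t-test (equal-length samples)
print(stats.chisquare(f_obs=[18, 22, 20, 40],
                      f_exp=[25, 25, 25, 25]))   # chi-square goodness of fit test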

---------------------------------------------------------------------------------------------------------------------

10. Explain what are machine learning models in detail.

 A machine learning model is a program that has been trained on a set of data to recognize
certain types of patterns. These models use statistical techniques to infer the relationships within
the data and make predictions or decisions without being explicitly programmed for specific
tasks.

a. Logical Models: Logical models use a logical expression to divide the instance space
into segments and hence construct grouping models. A logical expression is an
expression that returns a Boolean value, i.e., a True or False outcome. Once the data is
grouped using a logical expression, the data is divided into homogeneous groupings for
the problem we are trying to solve. For example, for a classification problem, all the
instances in the group belong to one class. There are mainly two kinds of logical models:
Tree models and Rule models. Rule models consist of a collection of implications or IF-
THEN rules. For tree-based models, the ‘if-part’ defines a segment and the ‘then-part’
defines the behavior of the model for this segment. Rule models follow the same
reasoning. Tree models can be seen as a particular type of rule model where the if-parts
of the rules are organized in a tree structure. Both Tree models and Rule models use the
same approach to supervised learning. The approach can be summarized in two
strategies: we could first find the body of the rule (the concept) that covers a sufficiently
homogeneous set of examples and then find a label to represent the body. Alternatively,
we could approach it from the other direction, i.e., first select a class we want to learn and
then find rules that cover examples of the class. A commonly cited example is a tree-based model built on the Titanic passenger data, which shows survival numbers of the passengers ("sibsp" is the number of spouses or siblings aboard). The values under the leaves show the probability of survival and the percentage of observations in the leaf. The model can be summarized as: your chances of survival were good if you were (i) a female or (ii) a male younger than 9.5 years with fewer than 2.5 siblings.

b. Linear Models: Linear models are relatively simple. In this case, the function is
represented as a linear combination of its inputs. Thus, if x1 and x2 are two scalars or
vectors of the same dimension and a and b are arbitrary scalars, then ax1 + bx2 represents a
linear combination of x1 and x2. In the simplest case where f(x) represents a straight line,
we have an equation of the form f (x) = mx + c where c represents the intercept and m
represents the slope.
Linear models are parametric, which means that they have a fixed form with a small number of
numeric parameters that need to be learned from data. For example, in f (x) = mx + c, m and c are
the parameters that we are trying to learn from the data. This technique is different from tree or
rule models, where the structure of the model (e.g., which features to use in the tree, and where)
is not fixed in advance.

Linear models are stable, i.e., small variations in the training data have only a limited impact on
the learned model. In contrast, tree models tend to vary more with the training data, as the
choice of a different split at the root of the tree typically means that the rest of the tree is
different as well. As a result of having relatively few parameters, Linear models have low
variance and high bias. This implies that Linear models are less likely to overfit the training
data than some other models.

c. Distance-based models: Distance-based models are the second class of Geometric models. Like Linear models, distance-based models are based on the geometry of data.
As the name implies, distance-based models work on the concept of distance. In the
context of Machine learning, the concept of distance is not based on merely the physical
distance between two points. Instead, we could think of the distance between two points
considering the mode of transport between two points. Traveling between two cities by
plane covers less distance physically than by train because a plane is unrestricted.
Similarly, in chess, the concept of distance depends on the piece used – for example, a
Bishop can move diagonally. Thus, depending on the entity and the mode of travel, the
concept of distance can be experienced differently. The distance metrics commonly used
are Euclidean, Minkowski, Manhattan, and Mahalanobis.

Distance is applied through the concept of neighbors and examples. Neighbors are points in
proximity with respect to the distance measure expressed through examples. Examples are either
centroids that find a center of mass according to a chosen distance metric or medoids that find
the most centrally located data point. The most commonly used centroid is the arithmetic mean,
which minimizes squared Euclidean distance to all other points.

Notes:
 The centroid represents the geometric center of a plane figure, i.e., the arithmetic mean
position of all the points in the figure from the centroid point. This definition extends to
any object in n-dimensional space: its centroid is the mean position of all the points.

 Medoids are similar in concept to means or centroids. Medoids are most commonly used
on data when a mean or centroid cannot be defined. They are used in contexts where the
centroid is not representative of the dataset, such as in image data.

Examples of distance-based models include the nearest-neighbor models, which use the training
data as examples – for example, in classification. The K-means clustering algorithm also uses
examples to create clusters of similar data points.

d. Probabilistic Models: The third family of machine learning algorithms is the probabilistic models. Probabilistic models use the idea of probability to classify new entities. Probabilistic models see features and target variables as random variables. The
process of modeling represents and manipulates the level of uncertainty with respect to
these variables. There are two types of probabilistic models: Predictive and Generative.
Predictive probability models use the idea of a conditional probability distribution P (Y |
X) from which Y can be predicted from X. Generative models estimate the joint
distribution P (Y, X). Once we know the joint distribution for the generative models, we
can derive any conditional or marginal distribution involving the same variables. Thus,
the generative model is capable of creating new data points and their labels, knowing the
joint probability distribution. The joint distribution looks for a relationship between two
variables. Once this relationship is inferred, it is possible to infer new data points.

Naïve Bayes is an example of a probabilistic classifier.

Given a set of features (x_0 through x_n) and a set of classes (c_0 through c_k), the goal of any probabilistic classifier is to determine the probability of the features occurring in each class and to return the most likely class. Therefore, for each class, we need to calculate P(c_i | x_0, …, x_n).

We can do this using Bayes' rule, defined as

P(c_i | x_0, …, x_n) = P(x_0, …, x_n | c_i) × P(c_i) / P(x_0, …, x_n)
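
A minimal scikit-learn sketch of a probabilistic classifier (Gaussian Naïve Bayes) on synthetic data, shown only to illustrate predicting class probabilities P(c_i | features):

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
clf = GaussianNB().fit(X, y)

print(clf.predict(X[:3]))         # most likely class for the first three instances
print(clf.predict_proba(X[:3]))   # estimated P(c_i | features) for each class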

---------------------------------------------------------------------------------------------------------------------

11. What is Dimensionality Reduction? Explain Dimensionality Reduction using Feature Extraction and Feature Selection.

 Dimensionality reduction is the process of reducing the number of input variables in a dataset.
It helps in simplifying models, reducing computational cost, and addressing the curse of
dimensionality. There are two main approaches to dimensionality reduction: feature extraction
and feature selection.

 Dimensionality Reduction using Feature Extraction

Feature extraction involves transforming the data into a new space with fewer dimensions. This
new space captures the most important information from the original data.

1. Principal Component Analysis (PCA):

o Description: PCA is a statistical method that transforms the original features into
a new set of orthogonal (uncorrelated) features called principal components. The
first principal component captures the most variance in the data, and each
subsequent component captures the remaining variance.

o Application: Useful for noise reduction, visualization, and speeding up machine learning algorithms.

o Example: Reducing the dimensionality of image data for faster processing in computer vision tasks (see the code sketch after this list).

2. Linear Discriminant Analysis (LDA):

o Description: LDA is used for supervised dimensionality reduction. It finds a linear combination of features that best separates two or more classes. Unlike PCA, LDA considers class labels and aims to maximize the ratio of between-class variance to within-class variance.

o Application: Commonly used in classification problems.

o Example: Reducing dimensions in a dataset with multiple classes while preserving class separability.

3. t-Distributed Stochastic Neighbor Embedding (t-SNE):

o Description: t-SNE is a non-linear technique for reducing dimensions, particularly used for visualizing high-dimensional data. It converts high-dimensional Euclidean distances into conditional probabilities that represent similarities.

o Application: Mainly used for data visualization.

o Example: Visualizing high-dimensional data like word embeddings or complex image datasets.

4. Autoencoders:

o Description: Autoencoders are a type of neural network used to learn efficient codings of input data. They consist of an encoder that compresses the input into a latent-space representation and a decoder that reconstructs the input from this representation.

o Application: Data compression, noise reduction, and feature extraction for unsupervised learning.

o Example: Reducing the dimensionality of image data while preserving essential features.
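
A minimal scikit-learn sketch of feature extraction with PCA, reducing the four features of the Iris dataset to two principal components (the dataset choice is just for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)       # 150 samples, 4 original features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # projected onto 2 principal components

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component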

 Dimensionality Reduction using Feature Selection

Feature selection involves selecting a subset of relevant features for use in model construction. It
aims to retain the most informative features while discarding irrelevant or redundant ones.

1. Filter Methods:

o Description: Filter methods rank features based on statistical measures and select
the top-ranking features. Common criteria include correlation coefficients, chi-
square tests, and mutual information.

o Application: Preprocessing step before applying machine learning algorithms.

o Example: Using correlation to remove highly correlated features in a dataset to avoid multicollinearity.

2. Wrapper Methods:

o Description: Wrapper methods evaluate feature subsets based on the performance of a specific machine learning algorithm. Techniques like recursive feature elimination (RFE) iteratively remove features and build models to identify the best subset (see the RFE code sketch at the end of this list).

o Application: Model-specific feature selection to improve performance.

o Example: Using RFE with a decision tree classifier to select the most important
features for predicting loan defaults.

3. Embedded Methods:
o Description: Embedded methods perform feature selection during the model
training process. Techniques include regularization methods like Lasso (L1
regularization) and Ridge (L2 regularization) that add penalties to the model for
having too many features.

o Application: Integrated into model training to improve generalization.

o Example: Using Lasso regression to perform feature selection and shrinkage simultaneously in a regression model.

4. Feature Importance from Tree-based Methods:

o Description: Tree-based methods like Random Forests and Gradient Boosting can
provide feature importance scores based on how often a feature is used to split the
data.

o Application: Feature selection in ensemble learning models.

o Example: Using feature importance scores from a Random Forest model to select
the top features for a classification task.
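
A minimal scikit-learn sketch of wrapper-style feature selection with RFE on synthetic data (the feature counts below are arbitrary):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=42)

selector = RFE(estimator=DecisionTreeClassifier(random_state=42), n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)    # boolean mask of the selected features
print(selector.ranking_)    # rank 1 = selected feature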
