machineLearning-unit1
Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on the development of
algorithms and models that allow computers to learn from and make decisions or predictions based
on data. Instead of being explicitly programmed for every task, machine learning models are trained
on datasets to identify patterns and improve performance over time. This ability to learn and adapt
makes machine learning a powerful tool in various applications, from medical diagnosis to
autonomous driving.
Key Terminology
1. Algorithm: A set of rules or statistical procedures that a computer follows to learn patterns from data.
2. Model: A trained machine learning algorithm that can make predictions or decisions based
on new data.
3. Training: The process of feeding data to an algorithm and allowing it to learn patterns and
relationships.
4. Training Data: The dataset used to train the machine learning model. It consists of input
data and the corresponding expected output.
5. Feature: An individual measurable property of the data that serves as an input to the model.
6. Label: The output or target variable in supervised learning, representing the expected result
of a prediction.
7. Overfitting: A scenario where a model learns the training data too well, including the noise,
leading to poor performance on unseen data.
8. Underfitting: A scenario where a model fails to capture the underlying patterns in the
training data, resulting in poor performance on both training and unseen data.
9. Validation Set: A subset of the dataset used to tune the model's hyperparameters and
prevent overfitting.
10. Test Set: A separate set of data used to evaluate the performance of a trained model.
Machine learning has transformed many industries, but several perspectives and issues need to be
considered:
• Data Privacy: Machine learning models require vast amounts of data, which raises concerns
about privacy, especially in sensitive areas such as healthcare and finance.
• Bias and Fairness: If the training data contains biases, the model can perpetuate these
biases in its predictions, leading to unfair outcomes.
• Interpretability: Some machine learning models, particularly deep learning models, act as
"black boxes," making it dif cult to understand how decisions are made.
• Computational Costs: Training machine learning models, especially with large datasets and
complex algorithms, requires substantial computational resources.
• Generalization: Ensuring that a model performs well on unseen data, rather than just
memorizing the training data, is critical for practical applications.
Types of Machine Learning
1. Supervised Learning
In supervised learning, the model is trained on labeled data, where each input is paired with
its expected output; a minimal sketch follows the examples below.
Examples:
◦ Predicting house prices based on features like size, location, and number of
bedrooms.
◦ Classifying emails as spam or not spam.
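As a minimal illustration of supervised learning, the sketch below trains a regression model on made-up house data using scikit-learn; the feature values and prices are invented for illustration only.

```python
# Minimal supervised-learning sketch (illustrative data, not a real dataset).
# The model is trained on labeled examples, then predicts labels for new inputs.
from sklearn.linear_model import LinearRegression

# Features: [size_sqft, bedrooms]; labels: price (all values are made up).
X_train = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [1100, 2]]
y_train = [245000, 312000, 279000, 308000, 199000]

model = LinearRegression()
model.fit(X_train, y_train)        # training: learn the feature-to-label mapping

print(model.predict([[1500, 3]]))  # predict the price of an unseen house
```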
2. Unsupervised Learning
In unsupervised learning, the model is given data without any labeled outputs. The goal is to
find hidden patterns or structures in the data. Common techniques include clustering and
dimensionality reduction.
3. Semi-Supervised Learning
In semi-supervised learning, the model is trained on a small amount of labeled data together
with a large amount of unlabeled data.
Examples:
◦ Image classification where only a few images are labeled, but a large set of unlabeled
images is available.
◦ Speech recognition systems trained with a small amount of transcribed audio and a
large amount of untranscribed audio.
Review of Probability in Machine Learning
Probability plays a key role in machine learning, especially in models that deal with uncertainty,
such as classification problems and probabilistic models. A basic understanding of probability is
crucial for interpreting machine learning algorithms.
1. Random Variable: A variable whose value depends on the outcome of a random process; it
may be discrete (taking distinct values) or continuous (taking any value in a range).
2. Probability Distribution: Describes how probabilities are distributed over the possible
values of a random variable. Common distributions include:
◦ Discrete Distributions: Examples include the binomial distribution and the Poisson
distribution, where probabilities are assigned to distinct outcomes.
◦ Continuous Distributions: Examples include the normal (Gaussian) distribution
and the uniform distribution, where probabilities are assigned over continuous
ranges.
3. Conditional Probability: The probability of an event occurring given that another event has
already occurred. Conditional probability is fundamental in machine learning models like
Naive Bayes.
4. Bayes’ Theorem: A key principle in probabilistic models, used to update the probability
estimate for an event as more evidence becomes available.
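In standard notation, with A the hypothesis and B the observed evidence, the theorem reads:

```latex
% Bayes' theorem: updating the probability of A after observing evidence B
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```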
• Expectation (Mean): The average value of a random variable, used to predict the expected
outcome.
• Variance: Measures how much the random variable varies from its expected value. It is key
in understanding the spread or uncertainty in predictions.
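For a discrete random variable X, these two quantities are defined as:

```latex
% Expectation and variance of a discrete random variable X
\mathbb{E}[X] = \sum_i x_i \, p(x_i)
\qquad
\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]
```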
Basic Linear Algebra in Machine Learning Techniques
Linear algebra forms the foundation of many machine learning techniques, especially those dealing
with high-dimensional data.
1. Vectors: An ordered list of numbers, typically used to represent a single data point or its
feature values.
2. Matrices: A matrix is a rectangular array of numbers. Matrices are used in machine learning
to represent datasets, where rows typically represent data points and columns represent
features.
3. Matrix Operations: Addition, multiplication, and transposition are used throughout machine
learning, for example to apply weights to inputs or to transform datasets.
4. Dot Product: A scalar product of two vectors, used in calculating distances and similarities in
machine learning.
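A minimal sketch of the dot product, and the cosine similarity built from it, using NumPy:

```python
# Dot product and cosine similarity between two feature vectors.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = np.dot(a, b)                                       # scalar product: 32.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # similarity in [-1, 1]
print(dot, cosine)
```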
Datasets in Machine Learning
In machine learning, a dataset is a collection of data points used to train and evaluate models. A
dataset generally consists of features (input) and labels (output).
1. Training Dataset: The data used to train the machine learning model. It consists of features
and corresponding labels (in supervised learning).
2. Validation Dataset: A subset of data used to tune hyperparameters and prevent overfitting
during training.
3. Test Dataset: This dataset is used to evaluate the model after it has been trained. It should
consist of data that the model has never seen before to give an unbiased estimate of
performance.
4. Types of Datasets:
◦ Structured Data: Data that is organized in rows and columns, like in spreadsheets
or databases (e.g., CSV files). This includes tabular data with defined feature
columns.
◦ Unstructured Data: Data that does not have a predefined structure, such as text,
images, videos, and audio.
◦ Semi-Structured Data: Data that does not conform to the structure of traditional
databases but has some organization, like JSON or XML files.
Data Preprocessing
Data preprocessing is an essential step in the machine learning pipeline, as raw data is often noisy,
incomplete, or inconsistent. Preprocessing transforms this raw data into a format suitable for
machine learning models.
1. Data Cleaning:
◦ Handling Missing Data: Methods include removing data points with missing values
or imputing missing values using techniques like the mean, median, or more
sophisticated approaches like k-nearest neighbors (KNN) imputation.
◦ Handling Outliers: Outliers can skew model performance. They can be removed or
transformed using techniques like winsorization or scaling.
2. Data Transformation:
◦ Normalization: This technique scales the features so that they are within a specific
range, usually [0, 1]. Normalization is particularly important for algorithms like k-
NN and neural networks.
◦ Standardization: This technique rescales the features to zero mean and unit
variance, which is important for scale-sensitive models such as SVMs and logistic
regression.
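As an illustration of the cleaning and transformation steps above, here is a minimal sketch assuming scikit-learn is available; the data values are invented:

```python
# Minimal preprocessing sketch: impute missing values, then standardize.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],    # missing value to be imputed
              [3.0, 260.0]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # mean imputation
X_scaled = StandardScaler().fit_transform(X_imputed)         # zero mean, unit variance
print(X_scaled)
```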
• Encoding Categorical Data: For algorithms that require numerical input, categorical
variables need to be converted into numerical format, typically with one-hot encoding (one
binary column per category) or label encoding (one integer per category); a short sketch
follows below.
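A minimal one-hot encoding sketch, assuming pandas is available; the column name and values are invented:

```python
# One-hot encoding of a categorical feature with pandas (toy data).
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"])  # one binary column per category
print(encoded)
```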
Feature Selection:
• The process of selecting the most important features to use in the model. Techniques include
correlation analysis, recursive feature elimination (RFE), and feature importance from tree-
based methods like random forests.
Dimensionality Reduction:
• Principal Component Analysis (PCA): Reduces the number of features while retaining the
variance in the data, improving computational efficiency and performance.
• t-SNE: Useful for visualizing high-dimensional data by reducing it to two or three
dimensions.
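A minimal PCA sketch, assuming scikit-learn; the data here is random and purely illustrative:

```python
# PCA sketch: project toy 3-D data down to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy high-dimensional data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # keep the 2 directions of highest variance
print(pca.explained_variance_ratio_)   # fraction of variance retained
```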
Data Splitting:
• Dividing the dataset into training, validation, and test sets. A typical split is 70% for training,
15% for validation, and 15% for testing.
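One common way to get a 70/15/15 split is two successive calls to scikit-learn's train_test_split; a minimal sketch with toy arrays:

```python
# 70/15/15 split using two calls to train_test_split (toy data).
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)

# First carve off 30% as a temporary holdout...
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
# ...then split that holdout evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 70, 15, 15
```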
Bias and variance are two key sources of error in machine learning models, and understanding their
trade-off is essential to creating models that generalize well to unseen data.
1. Bias:
◦ Bias refers to the error due to overly simplistic assumptions in the learning
algorithm. When a model has high bias, it cannot capture the underlying patterns in
the data well, leading to underfitting.
◦ Models with high bias are too rigid and tend to ignore the complexities of the data.
◦ Example: A linear regression model applied to non-linear data will likely have high
bias because it assumes a linear relationship that may not exist in the data.
◦ Examples of high-bias models include linear regression and logistic regression
(when applied to data that is not linearly separable).
2. Variance:
◦ Variance refers to the error due to excessive sensitivity to small fluctuations in the
training data. A model with high variance learns the noise in the training data,
leading to overfitting.
◦ Models with high variance are too flexible and adapt to every detail in the training
data, including noise, making them perform poorly on new, unseen data.
◦ Examples of high-variance models include decision trees and k-nearest neighbors
(k-NN) with small values of k.
3. Bias-Variance Trade-off:
◦ Reducing bias typically increases variance and vice versa; the goal is a model
complex enough to capture the true patterns but simple enough not to fit noise.
◦ Irreducible error is the noise inherent in the data that cannot be reduced by any model.
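For squared-error loss, this trade-off can be written as the standard decomposition of expected prediction error:

```latex
% Bias-variance decomposition for squared-error loss
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \mathrm{Bias}\big[\hat{f}(x)\big]^2
  + \mathrm{Var}\big[\hat{f}(x)\big]
  + \sigma^2   % irreducible error
```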
Function Approximation
Function approximation in machine learning involves creating a model that approximates the
underlying relationship between input features (independent variables) and the target variable
(dependent variable). The goal is to find a function f(x) that maps the input data x to the output
labels y.
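Formally, training can be viewed as searching a hypothesis space for the function that minimizes the average loss over the training examples:

```latex
% Learning as function approximation: pick f from hypothesis space F
% that minimizes the average loss L over the n training examples.
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big)
```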
Generalization:
• Regularization techniques like L1 (Lasso) and L2 (Ridge) regularization are used to prevent
overfitting by adding a penalty term to the model’s objective function that discourages
overly complex models.
◦ L1 Regularization adds the sum of the absolute values of the coefficients to the loss function.
◦ L2 Regularization adds the sum of the squares of the coefficients to the loss function.
These techniques help control the complexity of the function approximation, leading to better
generalization.
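A minimal sketch of both penalties using scikit-learn's Lasso (L1) and Ridge (L2) estimators; the data is synthetic and the alpha values are arbitrary:

```python
# L1 (Lasso) and L2 (Ridge) regularization sketch; alpha sets penalty strength.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: tends to drive weak coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward 0
print(lasso.coef_)
print(ridge.coef_)
```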
Overfitting
Overfitting is one of the most common issues in machine learning and occurs when a model learns
not only the underlying pattern of the training data but also the noise and outliers. This results in
poor performance on new, unseen data.
1. Causes of Overfitting:
◦ Complex Models: When the model has too many parameters or is too flexible, it can
fit the training data too closely.
◦ Insufficient Training Data: With a limited amount of data, the model may
"memorize" the specific details of the training set rather than generalize to new data.
◦ Training for Too Long: In iterative algorithms like neural networks, training the
model for too many epochs can lead to overfitting.
2. Indicators of Overfitting:
◦ The model performs very well on the training data but significantly worse on the test
data.
◦ A large difference between training accuracy and validation/test accuracy.
3. Methods to Prevent Overfitting:
◦ Pruning (for Decision Trees): In decision trees, pruning is used to remove branches
that have little importance and are likely fitting noise in the data.
◦ Early Stopping: In algorithms that learn iteratively (e.g., neural networks), early
stopping involves monitoring the model’s performance on the validation set and
stopping training when performance no longer improves (a minimal sketch appears
after this list).
4. Consequences of Overfitting:
◦ The model captures random noise in the training data, making it highly specific to
that dataset.
◦ Predictions on new data are unreliable, leading to poor generalization.
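To illustrate early stopping, here is a minimal sketch using scikit-learn's SGDClassifier trained incrementally; the patience value and dataset are invented for illustration:

```python
# Early-stopping sketch: train incrementally and stop when validation accuracy
# has not improved for `patience` epochs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
best_score, patience, stall = 0.0, 5, 0
for epoch in range(100):
    model.partial_fit(X_train, y_train, classes=np.unique(y))  # one pass of updates
    score = model.score(X_val, y_val)                          # monitor validation set
    if score > best_score:
        best_score, stall = score, 0
    else:
        stall += 1
    if stall >= patience:   # no improvement for `patience` epochs: stop training
        print(f"early stop at epoch {epoch}, best val accuracy {best_score:.3f}")
        break
```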
Summary
• Bias refers to a model's tendency to make systematic errors due to overly simplistic
assumptions, while variance refers to a model's sensitivity to fluctuations in the training
data.
• Function approximation involves finding a mathematical function that maps input features
to output labels. It is essential for supervised learning and can range from simple linear
functions to more complex non-linear functions.
• Overfitting occurs when a model becomes too complex and learns noise in the data, leading
to poor performance on new data. Various techniques like regularization, early stopping, and
cross-validation help mitigate overfitting.