Seminar Presentation
PREDICTION USING MACHINE LEARNING
1. Machine Learning
● Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving in accuracy.
● Machine learning is an important component of the growing
field of data science.
● Through the use of statistical methods, algorithms are trained
to make classifications or predictions, uncovering key insights
within data mining projects.
2. Ensemble Learning
● Ensemble learning helps improve machine learning results by
combining several models.
● This approach generally produces better predictive performance than any single model.
● The basic idea is to learn a set of classifiers (experts) and allow them to vote, as sketched below.
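A minimal sketch of the voting idea, using scikit-learn's VotingClassifier on a synthetic dataset; the base models and data below are illustrative assumptions, not the exact setup of this work:

# Illustrative only: combine several classifiers and let them vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each base model acts as an "expert"; hard voting takes the majority class.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))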
3. Basic Algorithms Used
3.1 Linear Regression
● Linear regression is one of the easiest and most
popular Machine Learning algorithms.
● It is a statistical method that is used for predictive
analysis.
● Linear regression makes predictions for
continuous/real or numeric variables such as sales,
salary, age, product price, etc.
● The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression.
● Because linear regression models a linear relationship, it finds how the value of the dependent variable changes with the value of the independent variable.
● The linear regression model provides a sloped straight line representing
the relationship between the variables.
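A minimal sketch of fitting a linear regression with scikit-learn; the synthetic data and coefficients below are illustrative assumptions:

# Illustrative only: fit a straight line y = w*x + b to noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # independent variable
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)  # dependent variable with noise

model = LinearRegression()
model.fit(X, y)

print("slope:", model.coef_[0])        # should be close to 3.0
print("intercept:", model.intercept_)  # should be close to 5.0
print("prediction at x=4:", model.predict([[4.0]])[0])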
3.2 Random Forest
● Random Forest is a popular machine learning algorithm that
belongs to the supervised learning technique.
● It can be used for both Classification and Regression problems
in ML.
● It is based on the concept of ensemble learning, which is a
process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
● "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and
takes the average to improve the predictive accuracy of
that dataset.“
● Instead of relying on a single decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of those predictions.
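A minimal sketch of a random forest regressor in scikit-learn; the synthetic data and hyperparameters are illustrative assumptions, not this project's actual configuration:

# Illustrative only: random forest regression on a synthetic dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each trained on a bootstrap sample of the data;
# the forest averages their individual predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("R2 on test set:", forest.score(X_test, y_test))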
3.3 Gradient Boost
● Gradient boosting is a machine learning technique used
in regression and classification tasks, among others.
● When a decision tree is the weak learner, the resulting algorithm is called
gradient-boosted trees; it usually outperforms random forest.
● It is similar to grid search, yet it has proven to yield comparatively better results.
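A minimal sketch of gradient-boosted trees with scikit-learn; the synthetic data and hyperparameters are illustrative assumptions:

# Illustrative only: gradient boosting adds shallow trees sequentially,
# each new tree fitting the residual errors of the current ensemble.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

booster = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting stages (weak learners)
    learning_rate=0.05,  # shrinks each tree's contribution
    max_depth=3,         # shallow trees serve as weak learners
    random_state=42,
)
booster.fit(X_train, y_train)
print("R2 on test set:", booster.score(X_test, y_test))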
4.3 Matplotlib
● Matplotlib is an easy-to-use and powerful visualization library in Python.
● It is built on NumPy arrays and designed to work with the
broader SciPy stack and consists of several plots like line, bar,
scatter, histogram, etc.
● Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
● Matplotlib is open source and can be used freely.
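A minimal sketch of a Matplotlib plot; the data below is made up purely for illustration:

# Illustrative only: a scatter plot with a trend line.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
y = 2 * x + np.random.normal(0, 2, size=50)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, label="observed data")        # scatter plot of the points
ax.plot(x, 2 * x, color="red", label="trend")  # straight reference line
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()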
4.4 Seaborn
● Seaborn is a data visualization library built on top of
matplotlib and closely integrated with pandas data
structures in Python.
● Visualization is the central part of Seaborn, which helps in the exploration and understanding of data.
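A minimal sketch of Seaborn working on a pandas DataFrame; the column names and values are illustrative assumptions:

# Illustrative only: scatter plot with a fitted regression line and confidence band.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 74, 79, 85],
})

sns.regplot(data=df, x="hours_studied", y="score")
plt.show()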
4.6 Sklearn
● Scikit-learn (Sklearn) is the most useful and robust library
for machine learning in Python.
● It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python.
● This library, which is largely written in Python, is built upon
NumPy, SciPy and Matplotlib.
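A minimal sketch of the consistent interface: different estimators can be swapped in behind the same fit/score calls. The synthetic data is an illustrative assumption:

# Illustrative only: every scikit-learn estimator follows the same fit/predict API.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(),
              RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X_train, y_train)  # identical call for every model
    print(type(model).__name__, model.score(X_test, y_test))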
4.6.1 Mean Squared Error
● The Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator measures the average of the squared errors, i.e. the average squared difference between the estimated values and the true values.
● It is a risk function, corresponding to the expected value of the squared error loss.
● It is always non-negative, and values close to zero are better.
● The MSE is the second moment of the error (about the origin) and thus incorporates both the variance of the estimator and its bias.
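A minimal sketch of computing MSE both by hand and with scikit-learn's mean_squared_error; the example values are made up:

# Illustrative only: MSE = (1/n) * sum((y_true - y_pred) ** 2)
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # both print 0.375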
4.6.2 R2 Score
● The coefficient of determination, also called the R2 score, is used to evaluate the performance of a linear regression model.
● It is the proportion of the variation in the dependent (output) attribute that is predictable from the independent (input) variable(s).
● It is used to check how well the observed results are reproduced by the model, based on the proportion of the total variation that the model explains.
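A minimal sketch of the R2 score computed by hand and with scikit-learn's r2_score; the example values are made up:

# Illustrative only: R2 = 1 - SS_res / SS_tot
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both are about 0.88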
5. Results
Plots
R2 value for various models