Unit-1
INTRODUCTION
ARTIFICIAL INTELLIGENCE
Artificial Intelligence is one of the booming technologies of computer science, and it is ready to create a
new revolution in the world by making intelligent machines. AI aims to make a machine work like a
human. The term "Artificial Intelligence" was first coined by the American computer scientist John McCarthy at
the Dartmouth Conference in 1956.
Goals of Artificial Intelligence: Following are the main goals of Artificial Intelligence:
1. Replicate human intelligence
2. Solve Knowledge-intensive tasks
3. An intelligent connection of perception and action
4. Building a machine which can perform tasks that require human intelligence, such as:
Proving a theorem
Playing chess
Planning a surgical operation
Driving a car in traffic
5. Creating systems which can exhibit intelligent behavior, learn new things by themselves, demonstrate,
explain, and give advice to their users.
Definition: “AI is the study of how to make computers do things at which, at the moment, people are better”.
AI has now developed to a remarkable level. Concepts such as deep learning, big data, and data science
are currently trending strongly. Companies like Google, Facebook, IBM, and Amazon are working
with AI and creating impressive devices and services. The future of Artificial Intelligence is inspiring and will bring even
higher levels of intelligence.
Applications of AI:
1. Gaming: AI plays a crucial role in strategic games such as chess, poker, tic-tac-toe, etc., where a machine can
think of a large number of possible positions based on heuristic knowledge.
2. Natural Language Processing: It makes it possible to interact with a computer that understands the natural
language spoken by humans.
3. Expert Systems: These applications integrate machines, software, and specialized information
to impart reasoning and advice. They provide explanations and advice to their users.
4. Vision Systems: These systems understand, interpret and comprehend visual input on the computer. For
example,
a. A spy aeroplane takes photographs, which are used to figure out spatial information or a map of the
area.
b. Doctors use a clinical expert system to diagnose patients.
c. Police use computer software that can recognize the face of a criminal against the stored portrait made by
a forensic artist.
5. Speech Recognition: Some intelligent systems are capable of hearing and comprehending language in
terms of sentences and their meanings while a human talks to them. They can handle different accents, slang
words, background noise, changes in a human's voice due to a cold, etc.
6. Handwriting Recognition: Handwriting recognition software reads text written on paper with a pen
or on a screen with a stylus. It can recognize the shapes of the letters and convert them into editable text.
7. Intelligent Robots: Robots are able to perform the tasks given by humans. They have special sensors to
detect physical data from the real world such as light, heat, temperature, movement, sound, bump, and
pressure. They have efficient processors, multiple sensors and huge memory, to exhibit intelligence. In
addition, they are capable of learning from their mistakes and they can adapt to the new environment.
MACHINE LEARNING
Machine learning is a growing technology which enables computers to learn automatically from past
data. Machine learning uses various algorithms for building mathematical models and making predictions using
historical data or information. Currently, it is being used for various tasks such as image recognition, speech
recognition, email filtering, Facebook auto-tagging, recommender system, and many more.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experiences on its own.
The term machine learning was first introduced by Arthur Samuel in 1959. We can define it as
“Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.”
A Machine Learning system learns from historical data, builds the prediction models, and whenever it
receives new data, predicts the output for it.
The accuracy of predicted output depends upon the amount of data, as the huge amount of data helps to
build a better model which predicts the output more accurately.
Suppose we have a complex problem where we need to perform some predictions. Instead of writing
code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms, the
machine builds the logic as per the data and predicts the output.
Machine learning has changed our way of thinking about the problem.
Definition: “A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”.
Features of Machine Learning:
Machine learning uses data to detect various patterns in a given dataset
It can learn from past data and improve automatically
It is a data-driven technology
Machine learning is similar to data mining, as both deal with huge amounts of data
Supervised Learning: Supervised learning is a type of machine learning method in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the datasets and learn about each data point; once
the training and processing are done, we test the model by providing sample data to check whether it
predicts the correct output or not.
The goal of supervised learning is to map input data to output data. Supervised learning is
based on supervision, just as a student learns under the supervision of a teacher. An
example of supervised learning is spam filtering. Supervised learning can be grouped further into two categories of
algorithms (a short code sketch follows the lists below):
Regression: Regression algorithms are used if there is a relationship between the input variable and the
output variable. They are used for the prediction of continuous variables, such as weather forecasting,
market trends, etc.
Linear Regression
Regression Trees
Non-Linear Regression
Bayesian Linear Regression
Polynomial Regression
Classification: Classification algorithms are used when the output variable is categorical, which means
the output belongs to classes such as Yes-No, Male-Female, True-False, etc.
Random Forest
Decision Trees
Logistic Regression
Support vector Machines
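For concreteness, here is a brief, illustrative sketch (assuming scikit-learn is available; the tiny arrays are made up for illustration) that fits one regression model for a continuous target and one classification model for a categorical target:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: predict a continuous value from toy numeric data
X_reg = [[1], [2], [3], [4]]
y_reg = [1.5, 3.1, 4.4, 6.2]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))            # continuous prediction

# Classification: predict a category from toy labeled data
X_clf = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_clf = ["No", "Yes", "No", "Yes"]
clf = DecisionTreeClassifier().fit(X_clf, y_clf)
print(clf.predict([[1, 1]]))         # categorical prediction ("Yes" or "No")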
Unsupervised Learning: Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been labeled, classified, or
categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised
learning is to restructure the input data into new features or a group of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights
from the huge amount of data. It can be further classified into two categories of algorithms (a short clustering
sketch follows the lists below):
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of another group. Cluster
analysis finds the commonalities between the data objects and categorizes them as per the presence and
absence of those commonalities.
K-Means Clustering algorithm
Mean-shift algorithm
DBSCAN Algorithm
Principal Component Analysis
Independent Component Analysis
Association: Association rule learning is an unsupervised learning method which is used for finding the
relationships between variables in a large database. It determines the sets of items that occur together
in the dataset. Association rules make marketing strategies more effective; for example, people who buy
item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of association
rules is Market Basket Analysis.
Apriori algorithm
FP-growth algorithm
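As a brief, hypothetical sketch (assuming scikit-learn), K-Means can group a handful of unlabeled points into clusters without any supervision:

import numpy as np
from sklearn.cluster import KMeans

# Six unlabeled points that form two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the two cluster centres

Association rules (Apriori, FP-growth) can be mined in a similarly compact way with, for example, the apriori and association_rules helpers of the mlxtend library.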
Semi-Supervised Learning: To overcome the drawbacks of supervised learning and unsupervised learning
algorithms, the concept of semi-supervised learning was introduced. The main aim of semi-supervised learning is
to effectively use all the available data, rather than only the labeled data as in supervised learning. Initially,
similar data is clustered using an unsupervised learning algorithm, which then helps to label the
unlabeled data.
Machine Learning Life Cycle: The machine learning life cycle involves seven major steps (a compressed end-to-end sketch follows the list):
1. Gathering Data:
The goal of this step is to identify and obtain all data-related problems
In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, database, internet, or mobile devices
The more data we have, the more accurate the prediction will be
2. Data Preparation:
Data preparation is a step where we put our data into a suitable place and prepare it to use in our
machine learning training
First, we put all data together, and then randomize the ordering of data
This step can be further divided into two processes:
Data exploration: It is used to understand the nature of the data we have to work with.
We need to understand the characteristics, format, and quality of the data. A better
understanding of the data leads to an effective outcome
Data pre-processing: The next step is pre-processing of the data for its analysis
3. Data Wrangling:
Data wrangling is the process of cleaning and converting raw data into a usable format
Cleaning of data is required to address the quality issues
In real-world applications, collected data may have various issues like Missing Values, Duplicate
data, Invalid data, Noise
4. Analyse Data:
This step involves: selection of analytical techniques, building models, and reviewing the result
The aim of this step is to build a machine learning model to analyze the data using various
analytical techniques and review the outcome
5. Train the model:
In this step, we train our model to improve its performance for a better outcome of the problem
We use datasets to train the model using various machine learning algorithms
Training a model is required so that it can understand the various patterns, rules, and features
6. Test the model:
In this step, we check for the accuracy of our model by providing a test dataset to it
Testing the model determines the percentage accuracy of the model as per the
requirement of project or problem
7. Deployment:
Here, we deploy the model in the real-world system
If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system.
But before deploying the project, we will check whether it is improving its performance using
available data or not
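As a rough, end-to-end sketch of the life cycle above (assuming scikit-learn; the built-in breast-cancer dataset merely stands in for the "gathering data" step):

from sklearn.datasets import load_breast_cancer          # stand-in for "gathering data"
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)                    # data preparation / wrangling
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)             # train the model
accuracy = model.score(scaler.transform(X_test), y_test)  # test the model

if accuracy >= 0.9:                                       # deploy only if it meets the requirement
    print("Acceptable accuracy, ready to deploy:", accuracy)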
Training Set: A subset of the dataset used to train the machine learning model, for which we already know the
output.
Test Set: A subset of the dataset used to test the machine learning model; using the test set, the model
predicts the output.
Feature Scaling: A technique to standardize the independent variables of the dataset within a specific
range. In feature scaling, we put our variables on the same range and scale so that no variable
dominates another (see the sketch below).
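A small sketch of feature scaling (assuming scikit-learn; the two-column array is made up to show two features on very different scales):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200000.0],
              [2.0, 300000.0],
              [3.0, 400000.0]])                  # two features on very different scales

print(StandardScaler().fit_transform(X))         # each column gets zero mean, unit variance
print(MinMaxScaler().fit_transform(X))           # each column rescaled to the [0, 1] range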
DEEP LEARNING
Deep learning is a branch of machine learning, which in turn is a subset of artificial intelligence.
Just as neural networks imitate the human brain, so does deep learning. In deep learning, nothing is
programmed explicitly. Basically, it is a class of machine learning that makes use of numerous nonlinear
processing units to perform feature extraction as well as transformation. The output of each preceding
layer is taken as the input of the successive layer.
Deep learning algorithms are used especially when we have a huge number of inputs and outputs. Deep
learning is implemented with the help of neural networks, and the idea behind neural
networks is motivated by biological neurons, which are nothing but brain cells.
Definition: “Deep learning is a collection of statistical techniques of machine learning for learning feature
hierarchies that are actually based on artificial neural networks”.
Consider a network that recognizes faces. We provide the raw image data to the input layer, the first layer of the network.
The input layer then determines patterns of local contrast, that is, it differentiates on the
basis of colors, luminosity, etc.
Then the 1st hidden layer determines face features, i.e., it focuses on the eyes, nose, lips, etc.,
and matches those features against the correct face template.
In the 2nd hidden layer, the network then determines the correct face, after which the result is sent to the
output layer.
Likewise, more hidden layers can be added to solve more complex problems, for example, if you want
to find a particular kind of face with a large or light complexion.
As the number of hidden layers increases, we are able to solve more and more complex problems (a minimal sketch of such a layered network follows).
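As a minimal, illustrative sketch (assuming TensorFlow/Keras is installed; the random arrays only stand in for real images), a small feed-forward network with two hidden layers stacked between the input and output layers might look like this:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical flattened image inputs (e.g. 28x28 grayscale -> 784 features) and binary labels
X = np.random.rand(100, 784).astype("float32")
y = np.random.randint(0, 2, size=(100,))

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),   # 1st hidden layer: lower-level patterns
    layers.Dense(64, activation="relu"),    # 2nd hidden layer: higher-level features
    layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)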
Applications of Convolutional Neural Networks (CNNs) include identifying faces, street signs and tumors, image
recognition, video analysis, NLP, anomaly detection, drug discovery, game playing (e.g., checkers), and time-series forecasting.
Autoencoders:
An autoencoder neural network is another kind of unsupervised machine learning algorithm
Here the number of hidden cells is smaller than the number of input cells
But the number of input cells is equal to the number of output cells
Autoencoders are mainly used to produce a smaller (compressed) representation of the input
This helps in the reconstruction of the original data from the compressed data (a minimal sketch follows)
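A minimal autoencoder sketch (again assuming TensorFlow/Keras; the 64-feature random input is only illustrative), with fewer hidden units than input units and an output the same size as the input, as described above:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 64).astype("float32")             # hypothetical 64-feature inputs

inputs = keras.Input(shape=(64,))
encoded = layers.Dense(16, activation="relu")(inputs)      # compressed (smaller) representation
decoded = layers.Dense(64, activation="sigmoid")(encoded)  # reconstruction of the input

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)  # the input is also the target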
Disadvantages of Deep Learning:
It requires an ample amount of data
It is quite expensive to train
It does not have strong theoretical groundwork
Difference between AI, ML and DL:

AI: AI is the broader family consisting of ML and DL as its components.
ML: ML is the subset of AI.
DL: DL is the subset of ML.

AI: AI is a computer algorithm which exhibits intelligence through decision making.
ML: ML is an AI algorithm which allows a system to learn from data.
DL: DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data and provide output accordingly.

AI: Search trees and much complex math is involved in AI.
ML: If you have a clear idea about the logic (math) involved behind it and you can visualize the complex functionalities like K-Means, Support Vector Machines, etc., then it defines the ML aspect.
DL: If you are clear about the math involved in it but don't have an idea about the features, so you break the complex functionalities into linear/lower-dimension features by adding more layers, then it defines the DL aspect.

AI: The aim is basically to increase the chances of success and not accuracy.
ML: The aim is to increase accuracy, not caring much about the success ratio.
DL: It attains the highest rank in terms of accuracy when it is trained with a large amount of data.

AI: Three broad categories/types of AI are: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI).
ML: Three broad categories/types of ML are: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
DL: DL can be considered as neural networks with a large number of parameters and layers, lying in one of the four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.

AI: The efficiency of AI is basically the efficiency provided by ML and DL respectively.
ML: Less efficient than DL as it can't work for larger dimensions or higher amounts of data.
DL: More powerful than ML as it can easily work for larger sets of data.

AI: Examples of AI applications include: Google's AI-powered predictions, ridesharing apps like Uber and Lyft, commercial flights using an AI autopilot, etc.
ML: Examples of ML applications include: virtual personal assistants (Siri, Alexa, Google, etc.), email spam and malware filtering.
DL: Examples of DL applications include: sentiment-based news aggregation, image analysis and caption generation, etc.
Increase model complexity
Remove noise from the data
Train on more and better features
Reduce the constraints
Increase the number of epochs to get better results
5. Monitoring and maintenance: Different results for different actions require changes to the data; hence editing of
code, as well as resources for monitoring it, also becomes necessary.
6. Getting bad recommendations: A machine learning model operates under a specific context, which can
result in bad recommendations and concept drift in the model. For example, at a specific time a customer
is looking for some gadgets, but the customer's requirements change over time while the machine learning
model keeps showing the same recommendations even though the customer's expectations have changed.
This incident is called a Data Drift.
7. Lack of skilled resources: Although Machine Learning is continuously growing in the market, the
absence of skilled resources in the form of manpower is also an issue. Hence, we need manpower
having in-depth knowledge of mathematics, science, and technologies for developing and managing
scientific substances for machine learning.
8. Customer Segmentation: Customer segmentation is also an important challenge: identifying the customers
who paid for the recommendations shown by the model and those who don't even check them. Hence, an
algorithm is necessary to recognize customer behavior and trigger a relevant recommendation for the
user based on past experience.
9. Process Complexity of Machine Learning: The machine learning process is very complex. It also
includes analyzing the data, removing data bias, training data, applying complex mathematical
calculations, etc., making the procedure more complicated and quite tedious.
10. Data Bias: These errors exist when certain elements of the dataset are heavily weighted or given more
importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical
errors. The methods to remove data bias are:
Research more for customer segmentation
Be aware of your general use cases and potential outliers
Combine inputs from multiple sources to ensure data diversity
Include bias testing in the development process
Analyze data regularly and keep tracking errors to resolve them easily
Review the collected and annotated data
11. Lack of Explainability: This basically means that the outputs cannot be easily comprehended, as the model is
programmed in specific ways to deliver outputs for certain conditions. Hence, a lack of explainability is also
found in machine learning algorithms, which reduces the credibility of the algorithms.
12. Slow implementations and results: Machine learning models are highly efficient in producing accurate
results but are time-consuming. Slow programming, excessive requirements, and overloaded data take
more time to provide accurate results than expected.
13. Irrelevant features: Although machine learning models are intended to give the best possible outcome, if
we feed garbage data as input, then the result will also be garbage. Hence, we should use relevant
features in our training sample.
1. STATISTICAL LEARNING
Training Dataset:
The training data is the biggest (in size) subset of the original dataset, which is used to train or fit the
machine learning model.
Firstly, the training data is fed to the ML algorithms, which lets them learn how to make predictions for
the given task.
The training data varies depending on whether we are using Supervised Learning or Unsupervised
Learning Algorithms.
The type of training data that we provide to the model is highly responsible for the model's accuracy and
prediction ability.
Training data is approximately more than or equal to 60% of the total data for an ML project.
Test Dataset:
Once we train the model with the training dataset, it's time to test the model with the test dataset.
The test dataset is another subset of original data, which is independent of the training dataset.
Usually, the test dataset is approximately 20-25% of the total original data for an ML project.
Need for Splitting the Dataset into Train and Test Sets: Splitting the dataset into train and test sets is one of the
important parts of data pre-processing, as by doing so we can improve the performance of our model and hence
obtain better predictability.
If we train our model on one dataset and then test it with a completely different dataset, the model
will not be able to capture the correlations between the features. Therefore, training and testing the
model on two unrelated datasets decreases the performance of the model. Hence it is important to
split a single dataset into two parts, i.e., a train set and a test set.
In this way, we can easily evaluate the performance of our model. For example, if it performs well on the
training data but does not perform well on the test dataset, then it is estimated that the model may be
overfitted.
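A minimal sketch of such a split (assuming scikit-learn; the built-in iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)        # ~75% train, ~25% test

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))                # accuracy on unseen test data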
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's
prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is
greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across
all examples.
If we plot two candidate models over the same data and draw an arrow from each point to the fitted line to
represent its loss, the arrows for a poorly fitting model are much longer than those for a well fitting one; the
model with the shorter arrows is clearly the better predictive model.
Mean square error (MSE): MSE is a popular loss function. It is the average squared loss per example over the
whole dataset. To calculate MSE, sum up all the squared losses for the individual examples and then divide by the
number of examples:
MSE = (1/N) * Σ_{(x, y) in D} (y - prediction(x))²
Where,
(x, y) is an example in which x is the set of features that the model uses to make predictions and y is the
example's label
prediction(x) is a function of the weights and bias in combination with the set of features x
D is a data set containing many labeled examples, which are (x, y) pairs
N is the number of examples in D
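A quick numerical check of this formula (using numpy on made-up labels and predictions):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # the labels y for each example
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # prediction(x) for each example

mse = np.mean((y_true - y_pred) ** 2)     # average squared loss over the N examples
print(mse)                                # 0.875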
TRADEOFFS IN STATISTICAL LEARNING
Bias:
Bias is the difference between the values predicted by the ML model and the correct values. High bias gives a
large error on training as well as test data. It is recommended that an algorithm always be low-biased to avoid
the problem of underfitting.
With high bias, the predicted values follow a straight-line format and thus do not fit the data in the dataset
accurately. Such fitting is known as Underfitting of Data. This happens when the hypothesis is too simple or
linear in nature.
Variance:
The variability of model prediction for a given data point, which tells us the spread of our predictions, is called
the variance of the model. A model with high variance has a very complex fit to the training data and thus is not
able to fit accurately on data which it hasn't seen before. As a result, such models perform very well on
training data but have high error rates on test data.
When a model has high variance, it is said to Overfit the Data. Overfitting fits the training set accurately via a
complex curve and a high-order hypothesis, but it is not the solution, as the error on unseen data is high.
While training a model, the variance should be kept low.
The best fit is given by the hypothesis at the tradeoff point between bias and variance. On an
error-versus-complexity graph, this tradeoff point is the best point chosen for training the algorithm, since it gives
low error on training as well as test data (a short sketch contrasting underfitting and overfitting follows).
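A rough sketch of this tradeoff (assuming scikit-learn; the noisy sine data and the chosen polynomial degrees are arbitrary) that contrasts an underfitting, high-bias model with an overfitting, high-variance one:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.rand(60, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 15):                     # degree 1 tends to underfit, degree 15 to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),   # training error
          mean_squared_error(y_test, model.predict(X_test)))     # test error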
ESTIMATING RISK STATISTICS
Evaluating the performance of a Machine learning model is one of the important steps while building an
effective ML model. To evaluate the performance or quality of the model, different metrics are used, and these
metrics are known as performance metrics or evaluation metrics. These performance metrics help us understand
how well our model has performed for the given data.
In machine learning, each task or problem is divided into classification and Regression. Different
evaluation metrics are used for both Regression and Classification tasks.
Confusion Matrix: The confusion matrix is a matrix used to determine the performance of the
classification models for a given set of test data. It can only be determined if the true values for test data
are known.
N=total Predictions Actual : Positive Actual : Negative
Predicted : Positive True Positive False Positive
Predicted : Negative False Negative True Negative
True Negative: Model has given prediction No, and the real or actual value was also No.
True Positive: The model has predicted Yes, and the actual value was also Yes.
False Negative: The model has predicted No, but the actual value was Yes; it is also called a
Type-II error.
False Positive: The model has predicted Yes, but the actual value was No. It is also called a Type-I
error.
Example of a confusion matrix: suppose a classifier makes a total of 165 predictions for a disease test, of which it
predicted Yes 110 times and No 55 times; in reality, there are 60 cases in which the patients don't have the
disease and 105 cases in which they do.
Classification Accuracy: It is one of the important parameters to determine the accuracy of
classification problems. It defines how often the model predicts the correct output. It can be calculated
as the ratio of the number of correct predictions made by the classifier to the total number of predictions made
by the classifier.
Precision: It can be defined as, out of all the instances the model predicted as positive, how many were
actually positive.
Recall (or) Sensitivity: It is defined as, out of all the actual positive classes, how many the model predicted
correctly. The recall should be as high as possible.
F-Score: If one model has low precision and high recall or vice versa, it is difficult to compare it with
another. For this purpose, we can use the F-score, which helps us evaluate recall and precision at the same
time. The F-score is maximum when the recall equals the precision. (The standard formulas for these metrics
are collected after the AUC-ROC discussion below.)
AUC(Area Under the Curve)-ROC: Sometimes we need to visualize the performance of the
classification model on charts; then, we can use the AUC-ROC curve. Firstly, ROC means Receiver
Operating Characteristic curve. ROC represents a graph to show the performance of a classification
model at different threshold levels. The curve is plotted between two parameters, which are:
True Positive Rate: TPR or True Positive Rate is a synonym for Recall and can be calculated
as TPR = TP / (TP + FN).
False Positive Rate: FPR or False Positive Rate can be calculated as FPR = FP / (FP + TN).
AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional
area under the entire ROC curve.
AUC summarizes the performance across all thresholds and provides an aggregate measure.
The value of AUC ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC
of 0.0, whereas a model whose predictions are 100% correct has an AUC of 1.0.
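For reference, the standard formulas for these metrics in terms of the confusion-matrix counts are:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-score = 2 * (Precision * Recall) / (Precision + Recall)

A small sketch of the ROC/AUC computation (assuming scikit-learn; the label and score arrays are made up for illustration):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                      # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]     # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_true, y_score))                    # area under the ROC curve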
Mean Squared Error: For regression tasks, Mean Squared Error (MSE) is one of the most suitable metrics.
It measures the average of the squared difference between the values predicted by the model and the actual
values:
MSE = (1/N) * Σ (Y - Y')²
Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.
R² Score: R-squared, also known as the Coefficient of Determination, is another popular metric used for
regression model evaluation. The R-squared metric enables us to compare the model with a constant baseline
to determine the performance of the model. To select the constant baseline, we take the mean of the data and
draw a line at the mean; R² then compares the model's squared error with the squared error of that baseline:
R² = 1 - Σ(Y - Y')² / Σ(Y - Ymean)², where Ymean is the mean of the actual values. The R-squared score will
always be less than or equal to 1, regardless of whether the values are large or small.
Adjusted R²: Adjusted R-squared, as the name suggests, is the improved version of R-squared. R-squared has
the limitation that its score can improve as more terms are added, even when the model is not actually
improving, which may mislead data scientists.
To overcome this issue, adjusted R-squared is used, which will always show a lower value than R². This is
because it adjusts for the number of added predictors and only shows an improvement if there is a real
improvement. It is given by
Ra² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]
where n is the number of observations, k denotes the number of independent variables, and Ra² denotes the
adjusted R².
The standard deviation of the sampling distribution of the mean (the standard error of the mean) is σ/√n,
where σ is the population standard deviation and n is the sample size; this quantity is the standard deviation
of the sampling mean.
As the size of the sample increases, the spread of the sampling distribution of the mean decreases, but the
mean of the distribution remains the same and is not affected by the sample size. The sampling distribution
of the standard deviation likewise has a standard error, which is approximately σ/√(2n) (a quick numerical
check of the first formula follows).
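A quick numerical check (using numpy; the chosen population standard deviation and sample sizes are arbitrary) that the spread of sample means shrinks roughly like σ/√n:

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
for n in (10, 100, 1000):
    # 5000 samples of size n; take each sample's mean, then look at the spread of those means
    means = rng.normal(loc=0.0, scale=sigma, size=(5000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))   # empirical spread vs. theoretical standard error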
EMPIRICAL RISK MINIMIZATION
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of
learning algorithms and is used to give theoretical bounds on their performance.
The idea is that we don't know exactly how well an algorithm will work in practice, because we don't know the
true distribution of the data that the algorithm will work on; as an alternative, we can measure its performance
on a known set of training data.
We assume that our samples come from this unknown distribution and use our dataset as an approximation. If we
compute the loss using the data points in our dataset, it is called the empirical risk. It is "empirical" and not "true"
because we are using a dataset that is a subset of the whole population.
When our learning model is built, we have to pick a function that minimizes the empirical risk, that is, the
gap between the predicted output and the actual output over the data points in the dataset. This process of finding
such a function is called empirical risk minimization (ERM). We would like to minimize the true risk, but we don't
have the information that would allow us to do that, so we hope that the empirical risk will be almost the same as
the true risk.
However, we can compute an approximation, called the empirical risk, by averaging the loss function on the
training set; more formally, by computing the expectation with respect to the empirical measure:
R_emp(f) = (1/n) * Σ_{i=1..n} L(f(x_i), y_i)
where L is the loss function and (x_1, y_1), …, (x_n, y_n) are the training examples.
For example, suppose we want to build a model that can differentiate between males and females based on specific
features. If we select 150 random people in which the women happen to be very short and the men very tall, the
model might incorrectly conclude that height is the differentiating feature. For building a truly accurate model, we
would have to gather all the women and men in the world to extract the differentiating features. Unfortunately,
that is not possible, so we select a small number of people and hope that this sample is representative of the whole
population (a toy sketch of picking the lowest-empirical-risk hypothesis follows).
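A toy sketch of empirical risk minimization in this spirit (the heights, labels, and candidate thresholds are invented for illustration): among a few candidate height thresholds, pick the one with the lowest average 0-1 loss on the training sample.

import numpy as np

heights = np.array([150, 155, 160, 165, 170, 175, 180, 185])   # cm, illustrative sample
labels  = np.array([0,   0,   0,   0,   1,   1,   1,   1])     # 0 = female, 1 = male (toy data)

def empirical_risk(threshold):
    predictions = (heights >= threshold).astype(int)
    return np.mean(predictions != labels)        # average 0-1 loss on the sample

candidates = [155, 165, 175]
best = min(candidates, key=empirical_risk)       # hypothesis with the lowest empirical risk
print(best, empirical_risk(best))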