ML-Unit-1

The document provides an introduction to Machine Learning, covering its definitions, importance, types, applications, and tools. It explains how Machine Learning allows systems to learn from data and improve performance over time, with examples from various industries such as banking, insurance, and healthcare. Additionally, it highlights the significance of Machine Learning in decision-making and data analysis, along with popular programming languages and libraries used in the field.

MACHINE LEARNING

UNIT-I
By
B.RUPA
Assistant Professor, Dept of CSE(DS)
Vardhaman College of Engineering
Unit-I: Contents
Introduction to Machine Learning:
▪ Types of Machine Learning
▪ Problems not to be solved using Machine Learning
▪ Applications of Machine Learning
▪ Tools in Machine Learning
▪ Issues in Machine Learning
▪ Machine learning Activities
▪ Basic Types of Data in Machine Learning
▪ Exploring Structure of data
▪ Data Quality & Remediation
▪ Data Pre-Processing

B.RUPA, Asst.Professor, Dept of CSE(DS), Vardhaman College of Engineering UNIT-I


Introduction to
Machine Learning
Definitions of Machine Learning

➢ Learning is any process by which a system improves its performance from experience.

➢ Machine learning (ML) is a subset / branch of Artificial Intelligence.

1. Machine learning is the "Field of study that gives computers the ability to learn without being explicitly programmed," as defined by Arthur Samuel in 1959.

In machine learning, algorithms are trained to find patterns and correlations in large data
sets and to make the best decisions and predictions based on that analysis.

(OR)

2. "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." – Tom Mitchell, 1997

In short, Machine Learning is the study of algorithms that improve their performance P at some task T with experience E.
Continued..
 Machine learning behaves similarly to the growth of a child. As a child grows, her experience E in
performing task T increases, which results in higher performance measure (P).

 For instance, we give a “shape sorting block” toy to a child. (We know that the toy has different
shapes and shape holes).

 In this case, our task T is to find an appropriate shape hole for a shape. Afterward, the child
observes the shape and tries to fit it in a shaped hole.

 Let us say that this toy has three shapes: a circle, a triangle, and a square. In her first attempt at
finding a shaped hole, her performance measure(P) is 1/3, which means that the child found 1 out
of 3 correct shape holes.

 Second, the child tries it another time and notices that she is a little experienced in this task.
Considering the experience gained (E), the child tries this task another time, and when measuring
the performance (P), it turns out to be 2/3. After repeating this task (T) 100 times, the child has figured out which shape goes into which shape hole.



Continued..
 So as her experience (E) increased, her performance (P) also increased: the more attempts she made at this toy, the higher her accuracy became.

 Such execution is similar to machine learning. What a machine does is take a task (T), execute it, and measure its performance (P). Now a machine has a large amount of data, so as it processes that data, its experience (E) increases over time, resulting in a higher performance measure (P). So after going through all the data, our machine learning model's accuracy increases, which means that the predictions made by our model will be very accurate.
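The T, P, E framing above can be sketched in a few lines of Python. Everything here is invented for illustration: the improvement curve simply starts at random guessing (1/3 for three shapes) and rises with experience.

```python
# Toy sketch of Mitchell's framing: task T = sorting shapes,
# experience E = number of attempts, performance P = fraction correct.
# The linear improvement curve is made up purely for illustration.

def performance(experience: int, shapes: int = 3) -> float:
    """P starts at 1/shapes (random guessing) and rises with E, capped at 1."""
    return min(1.0, 1 / shapes + (1 - 1 / shapes) * experience / 100)

print(performance(0))    # random guessing: 1/3
print(performance(100))  # after 100 attempts: 1.0
```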

 3. Machine Learning is the ability of systems to learn from data, identify patterns, and
enact lessons from that data without human interaction or with minimal human interaction.

 Machine learning makes day-to-day and repetitive work much easier!
Machine Learning
 Need for Machine Learning: Data → Information → Knowledge

 Machine learning is a tool for turning information into knowledge.


 Ever since the technical revolution, we’ve been generating an immeasurable amount of
data.
 Google gets over 3.5 billion searches daily.
 WhatsApp users exchange up to 65 billion messages daily.
 As per research, we generate around 2.5 quintillion bytes of data every single day! It is
estimated that by 2020, 1.7MB of data will be created every second for every person.
 Facebook generates 4 petabytes of data per day
 With the availability of so much data, it is finally possible to build predictive models that
can study and analyze complex data to find useful insights and deliver more accurate
results.
List of reasons why Machine Learning is so important:

• Increase in Data Generation: Due to the excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data. This is where Machine Learning comes in. It uses data to solve problems and find solutions to the most complex tasks faced by organizations.
• Improve Decision Making: By making use of various algorithms, Machine Learning can be
used to make better business decisions. For example, Machine Learning is used to
forecast sales, predict downfalls in the stock market, identify risks and anomalies, etc.
• Uncover patterns & trends in data: Finding hidden patterns and extracting key insights
from data is the most essential part of Machine Learning. By building predictive models
and using statistical techniques, Machine Learning allows you to dig beneath the surface
and explore the data at a minute scale. Understanding data and extracting patterns
manually will take days, whereas Machine Learning algorithms can perform such
computations in less than a second.
• Solve complex problems: From detecting the genes linked to the deadly ALS disease to
building self-driving cars, Machine Learning can be used to solve the most complex
problems.



Features of Machine Learning

Machine Learning Vs Traditional Programming
 Traditional Programming:
Data and program are run on the computer to produce the output.
    Data + Program → Computer → Output

 Machine Learning:
Data and output are run on the computer to create a program. This program can then be used in traditional programming.
    Data + Output → Computer → Program
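This contrast can be made concrete with a small sketch (the doubling rule below is an invented example): traditional programming writes the rule by hand, while machine learning recovers the rule from data/output pairs.

```python
# Traditional programming: data + program -> output (the rule is hand-written).
def program(x):
    return 2 * x                       # the rule: "double the input"

data = [1, 2, 3, 4]
output = [program(x) for x in data]    # [2, 4, 6, 8]

# Machine learning: data + output -> program. Here "learning" is just
# estimating the multiplier that maps the data to the output.
learned_factor = sum(o / x for x, o in zip(data, output)) / len(data)

def learned_program(x):
    return learned_factor * x          # usable later like a normal program

print(learned_factor)        # → 2.0
print(learned_program(10))   # → 20.0
```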
Application of ML
Wherever there is a substantial amount of past data, machine learning can be
used to generate actionable insight from the data.
Though machine learning is adopted in multiple forms in every business domain, the three major domains below give some idea of the types of actions that can be performed using machine learning.

 Banking and finance


 Insurance
 Healthcare



Application of ML
 Banking and finance: In the banking industry, fraudulent transactions, especially those related to credit cards, are extremely prevalent. The models work on a real-time basis, i.e. fraudulent transactions are spotted and prevented right at the time of occurrence.
Customers of a bank are often offered lucrative proposals by competitor banks, such as higher interest rates, lower loan-processing charges, zero-balance savings accounts, and no overdraft penalties, with the intent that the customer switches over to the competitor bank.
 Insurance: Insurance industry is extremely data intensive. For that reason, machine
learning is extensively used in the insurance industry. Two major areas in the insurance
industry where machine learning is used are risk prediction during new customer
onboarding and claims management.
 Healthcare: Wearable device data form a rich source for applying machine learning to predict a person's health condition in real time. If the learning model predicts a health issue, the person is immediately alerted to take preventive action. Suppose an elderly person goes for a morning walk in a park close to his house.
Suddenly, while walking, his blood pressure shoots up beyond a certain limit, which is tracked by the wearable. The wearable data is sent to a remote server, where a machine learning algorithm constantly analyzes the streaming data.
Importance of Machine Learning and Machine Learning Applications:

• Netflix's Recommendation Engine: At the core of Netflix is its famous recommendation engine. Over 75% of what you watch is recommended by Netflix, and these recommendations are made by implementing Machine Learning.
• Facebook's Auto-tagging feature: The logic behind Facebook's DeepFace face-verification system is Machine Learning and Neural Networks. DeepFace studies the facial features in an image to tag your friends and family.
• Amazon's Alexa: The famous Alexa, which is based on Natural Language Processing and Machine Learning, is an advanced virtual assistant that does more than just play songs from your playlist. It can book you an Uber, connect with other IoT devices at home, track your health, etc.
• Google’s Spam Filter: Gmail makes use of Machine Learning to filter out spam
messages. It uses Machine Learning algorithms and Natural Language Processing
to analyze emails in real-time and classify them as either spam or non-spam.

ML Tools

The algorithms for the different machine learning tasks are well established and can be implemented using any language/platform.

Python: Python is one of the most popular open-source programming languages, widely adopted by the machine learning community.
• Python has very strong libraries for advanced mathematical functionality (NumPy), algorithms and mathematical tools (SciPy) and numerical plotting (matplotlib). Built on these libraries, there is a machine learning library named scikit-learn, which has various classification, regression, and clustering algorithms embedded in it.
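The fit/predict interface popularized by scikit-learn can be illustrated with a minimal nearest-neighbour classifier written in plain Python, so the sketch runs without the library installed; with scikit-learn available, `KNeighborsClassifier` exposes the same two methods.

```python
# Minimal sketch of the scikit-learn-style fit/predict pattern.
# The classifier below is a hand-rolled 1-nearest-neighbour, not the
# library implementation; the toy data is invented.

class NearestNeighborClassifier:
    def fit(self, X, y):
        self.X, self.y = X, y          # "training" just stores the data
        return self

    def predict(self, X):
        def closest(x):
            # label of the training point with the smallest squared distance
            dists = [sum((a - b) ** 2 for a, b in zip(x, xt)) for xt in self.X]
            return self.y[dists.index(min(dists))]
        return [closest(x) for x in X]

# Two toy clusters labelled 0 and 1
X_train = [(0, 0), (0, 1), (5, 5), (5, 6)]
y_train = [0, 0, 1, 1]
clf = NearestNeighborClassifier().fit(X_train, y_train)
print(clf.predict([(0.2, 0.1), (4.8, 5.5)]))  # → [0, 1]
```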

R: R is a language for statistical computing and data analysis. It is an open source language,
extremely popular in the academic community – especially among statisticians and data
miners.
• R is a very simple programming language with a huge set of libraries available for
different stages of machine learning.
• Some of the libraries standing out in terms of popularity are plyr/dplyr (for data
transformation), caret (‘Classification and Regression Training’ for classification), RJava
(to facilitate integration with Java), tm (for text mining), ggplot2 (for data
visualization).
• Other than the libraries, certain packages like Shiny and R Markdown have been developed around R to build interactive web applications, documents and dashboards without much effort.
ML Tools

MATLAB: MATLAB (matrix laboratory) is licensed commercial software with robust support for a wide range of numerical computing. MATLAB has a huge user base across industry and academia.
• MATLAB also provides extensive support of statistical functions and has a huge number of
machine learning algorithms in-built.
• It also has the ability to scale up for large datasets by parallel processing on clusters and in the cloud.

SAS: SAS (earlier known as 'Statistical Analysis System') is another licensed commercial software suite which provides strong support for machine learning functionality.
• SAS is a software suite comprising different components.
• The basic data management functionality is embedded in the Base SAS component, whereas the other components like SAS/INSIGHT, Enterprise Miner, SAS/STAT, etc. help in specialized functions related to data mining and statistical analysis.

Types of Machine Learning
 Machine learning: provides systems the ability to learn on their own and improve from experiences
without being programmed externally.
 Machine Learning enables systems to learn from vast volumes of data and solve specific problems. It
uses computer algorithms that improve their efficiency automatically through experience.



Types of Machine Learning
 How Machine Learning Works

 Consider a system with input data that contains photos of various kinds of fruits.
Now the system wants to group the data according to the different types of
fruits.
 First, the system will analyze the input data. Next, it tries to find patterns, like shape, size, and color.
 Based on these patterns, the system will try to predict the different types of fruit
and segregate them.
 Finally, it keeps track of all the decisions it made during the process to ensure it
is learning. The next time you ask the same system to predict and segregate the
different types of fruits, it won't have to go through the entire process again.
That’s how machine learning works.



Types of Machine Learning

▪ Supervised Learning
  • Classification: Decision Trees, KNN, Naïve Bayes, SVM, Logistic Regression, Multinomial Logistic Regression
  • Regression: Simple Linear, Multiple Linear, Polynomial
▪ Unsupervised Learning
  • Clustering: K-Means, K-Modes, K-Medoids, DBScan, Agglomerative, Divisive
▪ Reinforcement Learning
  • Q-Learning, Markov Decision Process
▪ Deep Learning
  • Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks
Types of Machine Learning
 There are primarily three types of machine learning: Supervised, Unsupervised, and
Reinforcement Learning.
• Supervised machine learning: The user supervises the machine while training it to work on its own. This requires labeled training data.
• Unsupervised learning: There is training data, but it won’t be labeled
• Reinforcement learning: The system learns on its own



1.Supervised Learning
 Supervised learning is a type of machine learning that uses labeled data to train
machine learning models. In labeled data, the output is already known. The model
just needs to map the inputs to the respective outputs.
 An example of supervised learning is to train a system that identifies the image of
an animal.
 Supervised learning algorithms take labeled inputs and map them to the known
outputs, which means you already know the target variable.
 Supervised Learning methods need external supervision to train machine learning
models. Hence, the name supervised. They need guidance and additional
information to return the desired result.
 First, you have to provide a data set that contains pictures of a kind of fruit, e.g.,
apples.
 Then, provide another data set that lets the model know that these are pictures of
apples. This completes the training phase.
 Next, provide a new set of data that only contains pictures of apples. At this point, the system can recognize the fruit and will remember it.
B.RUPA, Asst.Professor, Dept of CSE(DS), Vardhaman College of Engineering UNIT-I 37
1.Supervised Learning Process

• Labelled training data containing past information comes as an input.
• Based on the training data, the machine builds a predictive model that can be used on test data to assign a label to each record in the test data.
1.Supervised Learning
Some examples of supervised learning are:
 Predicting the results of a game
 Predicting whether a tumour is malignant or benign
 Predicting the price of domains like real estate, stocks, etc.
 Classifying texts, such as classifying a set of emails as spam or non-spam
 Risk Assessment: to assess risk in the financial services or insurance domains
 Image Classification: Facebook can recognize your friend in a picture from an album of tagged photos
 Fraud Detection: to identify whether the transactions made by a user are authentic or not
 Visual Recognition: the ability of a machine learning model to identify objects, places, people, actions, and images



1.Supervised Learning
 Supervised learning algorithms are generally used for solving classification and regression problems.
• Classification: predicts a class label (categorical)
• Regression: predicts a numerical value (continuous)
 Classification: Classification is used when the output variable is categorical, i.e. with 2 or more classes. For example, yes or no, male or female, true or false, etc.
 In order to predict whether a mail is spam or not, we need to first teach the machine what a spam mail is. This is done based on a lot of spam filters: reviewing the content of the mail, reviewing the mail header, and then checking whether it contains any false information. Certain keyword and blacklist filters are used to flag mail from already blacklisted spammers.
 All of these features are used to score the mail and give it a spam score. The lower the total spam score of the email, the more likely that it is not spam.
 Based on the content, label, and the spam score of the new incoming mail, the algorithm decides whether it should land in the inbox or the spam folder.
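The keyword-scoring idea described above can be sketched as follows. The keywords, weights, and threshold are invented for illustration; a real filter learns its scoring from data.

```python
# Hypothetical spam scorer: each suspicious keyword adds to the score,
# and mail at or above the threshold is routed to the spam folder.
SPAM_KEYWORDS = {"winner": 3, "free": 2, "prize": 3, "urgent": 2}

def spam_score(mail: str) -> int:
    return sum(SPAM_KEYWORDS.get(w, 0) for w in mail.lower().split())

def route(mail: str, threshold: int = 4) -> str:
    return "spam folder" if spam_score(mail) >= threshold else "inbox"

print(route("You are a winner claim your free prize"))  # → spam folder
print(route("Meeting moved to 3pm tomorrow"))           # → inbox
```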



1.Supervised Learning
 Regression:
 Regression is used when the output variable is a real
or continuous value. In this case, there is a
relationship between two or more variables i.e., a
change in one variable is associated with a change in
the other variable. For example, salary based on work
experience or weight based on height, etc.
 Let’s consider two variables - humidity and
temperature. Here, ‘temperature’ is the independent
variable and ‘humidity' is the dependent variable. If
the temperature increases, then the humidity
decreases.
 These two variables are fed to the model and the machine learns the relationship between them. After the machine is trained, it can easily predict the humidity for a given temperature.
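The temperature-to-humidity relationship above can be sketched as a simple least-squares line fit. The readings are invented, chosen so that humidity falls as temperature rises.

```python
# Invented readings: humidity drops as temperature rises.
temps    = [20.0, 25.0, 30.0, 35.0]   # independent variable
humidity = [80.0, 70.0, 60.0, 50.0]   # dependent variable

# Closed-form least-squares fit of humidity = slope * temp + intercept
n = len(temps)
mx, my = sum(temps) / n, sum(humidity) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(temps, humidity))
         / sum((x - mx) ** 2 for x in temps))
intercept = my - slope * mx

def predict(temp):
    return slope * temp + intercept

print(slope, intercept)   # → -2.0 120.0 for this toy data
print(predict(28.0))      # → 64.0
```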



1.Supervised Learning

Typical classification problems include:
• Image classification
• Prediction of disease
• Win–loss prediction of games
• Prediction of natural calamities like earthquakes, floods, etc.
• Recognition of handwriting

Typical applications of regression:
• Demand forecasting in retail
• Sales prediction for managers
• Price prediction in real estate
• Weather forecasting
• Skill demand forecasting in the job market
2.Unsupervised Learning
 Unsupervised learning is a type of machine learning that uses unlabeled data to train machines.
 Unlabeled data doesn’t have a fixed output variable.
 The model learns from the data, discovers the patterns and features in the data, and returns the
output.
 Consider a cluttered dataset: a collection of pictures of different spoons.
 Feed this data to the model, and the model analyzes it to recognize any patterns.
 The machine categorizes the photos into two types, as shown in the image, based on their
similarities.
 Flipkart uses this model to find and recommend products that are well suited for you.



2.Unsupervised Learning
 Depicted below is an example of an unsupervised learning technique that uses the images
of vehicles to classify if it’s a bus or a truck.
 The model learns by identifying the parts of a vehicle, such as the length and width of the vehicle, the front and rear end covers, roof hoods, the types of wheels used, etc.
 Based on these features, the model classifies if the vehicle is a bus or a truck.



2.Unsupervised Learning
 Unsupervised learning finds patterns and understands the trends in the data to
discover the output. So, the model tries to label the data based on the features
of the input data.
 The training process used in unsupervised learning techniques does not need
any supervision to build models. They learn on their own and predict the output.
 Unsupervised learning can be further grouped into types:
1. Clustering
2. Association



2.Unsupervised Learning
1. Clustering: Clustering is the method of dividing objects into clusters such that objects within a cluster are similar to each other and dissimilar to objects belonging to other clusters. For example, finding out which customers made similar product purchases.

• Suppose a telecom company wants to reduce its customer churn rate by providing personalized call and data plans.
• The behavior of the customers is studied, and the model segments the customers with similar traits. Several strategies are adopted to minimize churn rate and maximize profit through suitable promotions and campaigns.

• On the right side of the image, you can see a graph where customers are grouped.
• Group A customers use more data and also have high call durations.
• Group B customers are heavy Internet users, while Group C customers have high call durations.
• So, Group B will be given more data-benefit plans, Group C will be given cheaper call-rate plans, and Group A will be given the benefit of both.
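The segmentation described above can be sketched as a tiny two-cluster k-means over invented (call minutes, data GB) points; a real project would use a robust implementation such as scikit-learn's KMeans.

```python
# Invented customers: (monthly call minutes, data GB). Two clear groups:
# heavy callers and heavy data users.
customers = [(300, 1), (320, 2), (50, 10), (60, 12)]

def sqdist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

def kmeans2(points, iters=10):
    c1, c2 = points[0], points[2]          # deterministic initial centroids
    for _ in range(iters):
        g1 = [p for p in points if sqdist(p, c1) <= sqdist(p, c2)]
        g2 = [p for p in points if p not in g1]
        c1, c2 = mean(g1), mean(g2)        # recompute centroids
    return g1, g2

callers, data_users = kmeans2(customers)
print(callers)     # → [(300, 1), (320, 2)]
print(data_users)  # → [(50, 10), (60, 12)]
```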



2.Unsupervised Learning
2. Association:
 Association is a rule-based machine learning method for discovering the probability of the co-occurrence of items in a collection. For example, finding out which products were purchased together.
 Let’s say that a customer goes to a supermarket and buys bread, milk, fruits, and wheat.
 Another customer comes and buys bread, milk, rice, and butter.
 Now, when another customer comes, it is highly likely that if he buys bread, he will buy
milk too.
 Hence, a relationship is established based on customer behavior and recommendations
are made.
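The bread-and-milk example can be sketched by counting co-occurrences and computing the confidence of the rule bread → milk. The baskets are invented; real association mining uses algorithms such as Apriori.

```python
from itertools import combinations
from collections import Counter

# Invented baskets echoing the example above
baskets = [
    {"bread", "milk", "fruits", "wheat"},
    {"bread", "milk", "rice", "butter"},
    {"bread", "butter"},
]

item_counts, pair_counts = Counter(), Counter()
for b in baskets:
    item_counts.update(b)
    pair_counts.update(frozenset(p) for p in combinations(sorted(b), 2))

# confidence(bread -> milk) = P(milk | bread)
confidence = pair_counts[frozenset({"bread", "milk"})] / item_counts["bread"]
print(confidence)  # 2 of the 3 bread baskets also contain milk
```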



2.Unsupervised Learning Process



2.Unsupervised Learning
Applications of Unsupervised Learning:
• Market Basket Analysis: a machine learning model based on the observation that if you buy a certain group of items, you are more or less likely to buy another group of items.
• Semantic Clustering: Semantically similar words share a similar context. People post
their queries on websites in their own ways. Semantic clustering groups all these
responses with the same meaning in a cluster to ensure that the customer finds the
information they want quickly and easily. It plays an important role in information
retrieval, good browsing experience, and comprehension.
• Delivery Store Optimization: Machine learning models are used to predict demand and keep up with supply. They are also used to open stores where demand is higher and to optimize routes for more efficient deliveries, according to past data and behavior.
• Identifying Accident Prone Areas: Unsupervised machine learning models can be used
to identify accident-prone areas and introduce safety measures based on the intensity
of those accidents.
Difference between Supervised and Unsupervised Learning:

1. Data: Supervised learning uses labeled data; the system learns from the labeled data and makes future predictions. Unsupervised learning does not require labeled data, because its job is to look for patterns in the input data and organize it.
2. Feedback: In supervised learning we get feedback: once you receive the output, the system remembers it and uses it for the next operation. That does not happen with unsupervised learning.
3. Purpose: Supervised learning is mostly used to predict data, whereas unsupervised learning is used to find hidden patterns or structures in data.


Reinforcement learning:

 Reinforcement learning is a sub-branch of Machine Learning that trains a model to return an optimum solution for a problem by taking a sequence of decisions by itself.
 Reinforcement Learning is a feedback-based Machine learning technique in
which an agent learns to behave in an environment by performing the actions
and seeing the results of actions.
 For each good action, the agent gets positive feedback, and for each bad
action, the agent gets negative feedback or penalty.
 In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data.
 Since there is no labeled data, the agent is bound to learn from its experience alone.



Reinforcement learning:

• RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary
goal of an agent in reinforcement learning is to improve the performance by
getting the maximum positive rewards.
• The agent learns through trial and error and, based on the experience gained, it learns to perform the task in a better way.
• Hence, we can say that "Reinforcement learning is a type of machine
learning method where an intelligent agent (computer program) interacts
with the environment and learns to act within that."
• How a Robotic dog learns the movement of his arms is an example of
Reinforcement learning.



Reinforcement learning:
• It is a core part of Artificial Intelligence, and AI agents often work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
• Example: Suppose there is an AI agent present within a maze environment, and
his goal is to find the diamond. The agent interacts with the environment by
performing some actions, and based on those actions, the state of the agent gets
changed, and it also receives a reward or penalty as feedback.
• The agent continues doing these three things (take action, change state/remain
in the same state, and get feedback), and by doing these actions, he learns and
explores the environment.
• The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. For a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.



Reinforcement learning Process

• One contemporary example of reinforcement learning is self-driving cars.


• The critical information which it needs to take care of are speed and speed limit
in different road segments, traffic conditions, road conditions, weather
conditions, etc.
• The tasks that have to be taken care of are start/stop, accelerate/decelerate, turn left/right, etc.
Terms used in Reinforcement Learning:
• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The situation in which an agent is present or surrounded by. In RL, we assume a stochastic environment, which means it is random in nature.
• Action: Actions are the moves taken by an agent within the environment.
• State: The situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent from the environment to evaluate the action of the agent.
• Policy: The strategy applied by the agent to choose the next action based on the current state.
• Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
• Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).
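The terms above can be seen working together in a minimal tabular Q-learning sketch. The environment (a five-cell corridor with a +1 reward at the rightmost cell), the rewards, and the hyperparameters are all invented for illustration.

```python
import random

random.seed(0)                             # reproducible exploration
N_STATES, ACTIONS = 5, (-1, +1)            # states 0..4; move left / right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2          # learning rate, discount, exploration

for _ in range(500):                       # episodes
    s = random.randrange(N_STATES - 1)     # start anywhere but the goal
    while s != N_STATES - 1:
        # epsilon-greedy policy: mostly exploit, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)      # environment transition
        r = 1.0 if s2 == N_STATES - 1 else 0.0     # reward only at the goal
        # Q-value update toward reward + discounted best future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned policy: from every state the best action is "move right"
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

Each piece maps to a term above: `s` is the state, `a` the action, `r` the reward, `Q` the Q-values, and the epsilon-greedy rule the policy.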
Reinforcement Learning:
 Key Features of Reinforcement Learning
• In RL, the agent is not instructed about the environment or what actions need to be taken.
• It is based on a trial-and-error process.
• The agent takes the next action and changes states according to the feedback from the previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it to get the maximum positive rewards.
 Applications of Reinforcement Learning



Difference between Supervised, Unsupervised and Reinforcement Learning:
1. Data: Supervised – labeled data with output values specified; Unsupervised – unlabeled data, output values not specified, the machine makes its own predictions; Reinforcement – the machine learns from its environment using rewards and errors.
2. Problems solved: Supervised – classification and regression problems; Unsupervised – clustering and association problems; Reinforcement – reward-based problems.
3. Data used: Supervised – labeled data; Unsupervised – unlabeled data; Reinforcement – no predefined data.
4. Supervision: Supervised – external supervision; Unsupervised – no supervision; Reinforcement – no supervision.
5. Approach: Supervised – solves problems by mapping labeled input to known output; Unsupervised – solves problems by understanding patterns and discovering output; Reinforcement – follows a trial and error problem-solving approach.
6. Algorithms: Supervised – KNN, SVM, etc.; Unsupervised – K-means, DBSCAN, etc.; Reinforcement – Q-learning, etc.
7. Applications: Supervised – handwriting recognition, stock prediction, etc.; Unsupervised – market basket analysis, recommender systems, etc.; Reinforcement – self-driving cars, intelligent robots, etc.
Machine Learning algorithms



PROBLEMS NOT TO BE SOLVED USING MACHINE LEARNING
 Step 1: What is the Problem? Some information should be collected to understand the
problem. Informal description of the problem, e.g. I need a program that will prompt the
next word as and when I type a word.
 Assumptions: Create a list of assumptions about the problem. Similar problems: What
other problems have you seen, or can you think of, that are similar to the problem you
are trying to solve?
 Step 2: Why does the problem need to be solved?
 What is the motivation for solving the problem? What requirement will it fulfil?
 For example, does this problem solve any long-standing business issue like finding out
potentially fraudulent transactions?
 Or the purpose may be more trivial, like trying to suggest some movies for the upcoming weekend.
 Solution benefits: Consider the benefits of solving the problem. What capabilities does it
enable? It is important to clearly understand the benefits of solving the problem. These
benefits can be articulated to sell the project.
 Solution use: How will the solution to the problem be used, and what lifetime is the
solution expected to have?
 Step 3: How would I solve the problem? Try to explore how to solve the problem manually.
Detail out step-by-step data collection, data preparation, and program design to solve the
problem. Collect all these details and update the previous sections of the problem definition,
especially the assumptions.
PROBLEMS NOT TO BE SOLVED USING MACHINE LEARNING
 Machine learning should not be applied to tasks in which humans are very effective or
frequent human intervention is needed.
 For example, air traffic control is a very complex task needing intense human
involvement.
 At the same time, for very simple tasks which can be implemented using traditional
programming paradigms, there is no sense of using machine learning.
 For example, simple rule-driven or formula-based applications like price calculator
engine, dispute tracking application, etc. do not need machine learning techniques.
 Machine learning should be used only when the business process has some lapses.
 If the task is already optimized, incorporating machine learning will not serve to justify
the return on investment.
 For situations where training data is not sufficient, machine learning cannot be used
effectively.
 This is because, with small training data sets, the impact of bad data is exponentially
worse.
 For the quality of prediction or recommendation to be good, the training data should be
sizeable.
ISSUES IN MACHINE LEARNING
 Machine learning is a field which is relatively new and still evolving.
 Also, the level of research and kind of use of machine learning tools and technologies varies
drastically from country to country.
 The laws and regulations, cultural background, emotional maturity of people differ
drastically in different countries.
 All these factors make the use of machine learning, and the issues originating from
machine learning usage, quite different across countries.
 The biggest fear and issue arising out of machine learning is related to privacy and the
breach of it.
 The primary focus of learning is on analyzing data, both past and current, and coming up
with insight from the data.
 This insight may be related to people and the facts revealed might be private enough to be
kept confidential.
 Also, different people have a different preference when it comes to sharing of information.
While some people may be open to sharing some level of information publicly, some other
people may not want to share it even to all friends and keep it restricted just to family
members.
 Classic examples are a birth date (not the day, but the date as a whole), photographs of a
dinner date with family, educational background, etc. Some people share them with all in the
social platforms like Facebook while others do not, or if they do, they may restrict it to
friends only.



ISSUES IN MACHINE LEARNING
 When machine learning algorithms are implemented using such information,
people may inadvertently get upset.
 For example, if there is a learning algorithm to do preference-based customer
segmentation and the output of the analysis is used for sending targeted marketing
campaigns, it will hurt the emotion of people and actually do more harm than good.
 In certain countries, such events may result in legal actions to be taken by the people
affected.
 Even if there is no breach of privacy, there may be situations where actions were taken
based on machine learning may create an adverse reaction.
 Let’s take the example of knowledge discovery exercise done before starting an
election campaign.
 If a specific area reveals an ethnic majority or skewness of a certain demographic
factor, and the campaign pitch carries a message keeping that in mind, it might actually
upset the voters and cause an adverse result.
 So a very critical consideration before applying machine learning is that proper human
judgement should be exercised before using any outcome from machine learning.
 Only then will the decision taken be beneficial and not result in any adverse
impact.



ML Activities

 The first step in machine learning activity starts with data.


 In case of supervised learning, it is the labelled training data set followed
by test data which is not labelled.
 In case of unsupervised learning, there is no question of labelled data but
the task is to find patterns in the input data.
 A thorough review and exploration of the data is needed to understand
the type of the data, the quality of the data and relationship between
the different data elements.
 Based on that, multiple pre-processing activities may need to be done on
the input data before we can go ahead with core machine learning
activities.



ML Activities

Following are the typical preparation activities done once the input data comes
into the machine learning system:
 Understand the type of data in the given input data set.
 Explore the data to understand the nature and quality.
 Explore the relationships amongst the data elements, e.g. inter-feature
relationship.
 Find potential issues in data.
 Do the necessary remediation, e.g. impute missing data values, etc., if needed.
 Apply pre-processing steps, as necessary.
 Once the data is prepared for modelling, then the learning tasks start off.



ML Activities
 As a part of it, do the following activities:
❑ The input data is first divided into two parts – the training data and the test
data (called holdout).
❑ This step is applicable for supervised learning only.
❑ Consider different models or learning algorithms for selection.
❑ Train the model based on the training data for supervised learning problem
and apply to unknown data.
❑ Directly apply the chosen unsupervised model on the input data for
unsupervised learning problem.
 After the model is selected, trained (for supervised learning), and
applied on input data, the performance of the model is evaluated.
 Based on options available, specific actions can be taken to
improve the performance of the model, if possible.

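The holdout split in the first step above can be sketched as follows — the function name, test fraction, and fixed seed are illustrative choices, not part of the slides:

```python
import random

def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle the records and split them into a training set and a held-out test set."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = records[:]            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

data = list(range(100))              # 100 illustrative records
train, test = holdout_split(data)    # 80 training records, 20 held out
```

The model is then trained only on `train`; `test` is kept aside to evaluate performance on data the model has never seen.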


ML Activities
Detailed Process of ML:



Basic Types of Data in ML
What is Dataset?
A data set is an organized collection of data.
(Or) A data set is a collection of related information
or records.
• The information may be on some entity or some
subject area.
• Each data set has one output variable and
one/more input variables.
• Each row of a data set is called an
instance/observation/record/sample/object.
• Each data set also has multiple
columns/features/attributes/variables/fields/
dimensions, also called characteristics.

• Independent variables: input variables / predictor variables
• Dependent variables: output variable / target variable / response variable
Data
 Data: It can be any unprocessed fact, value, text, sound, or picture that is not being
interpreted and analyzed.
 Data is the most important part of all Data Analytics, Machine Learning, Artificial
Intelligence.
Without data, we can’t train any model, and all modern research and automation would go
in vain. Big enterprises spend lots of money just to gather as much relevant data
as possible.
 Data is typically divided into two types: labeled and unlabeled. Labeled data includes a
label or target variable that the model (supervised learning) is trying to predict, whereas
unlabeled data does not include a label or target variable (unsupervised learning).

 The data used in machine learning is typically numerical or categorical.


• Numerical data includes values that can be ordered and measured, such as age or
income.(Regression-if target variable is numerical)
• Categorical data/Nominal data: includes values that represent categories, such as
gender or type of fruit.(Classification-if target variable is Categorical)
Types of Data
 Let us understand the different types of data that we generally come across
in machine learning problems.
 Data can broadly be divided into following two types:
1. Qualitative data
2. Quantitative data



Qualitative Data/ Categorical
Qualitative data: provides information about the quality of an object or information
which cannot be measured.
 For example, if we consider the quality of performance of students in terms of
‘Good’, ‘Average’, and ‘Poor’, it falls under the category of qualitative data.
 Also, name or roll number of students are information that cannot be measured using
some scale of measurement. So they would fall under qualitative data.
 Qualitative data is also called categorical data.
 Qualitative data can be further subdivided into two types as follows:
1. Nominal data 2. Ordinal data
1. Nominal data: is one which has no numeric value, but a named value. It is used for
assigning named values to attributes. Nominal values cannot be quantified.
 Examples of nominal data are 1. Blood group: A, B, O, AB, etc. 2. Nationality: Indian,
American, British, etc. 3. Gender: Male, Female, Other
 Obviously, mathematical operations such as addition, subtraction, multiplication,
etc. cannot be performed on nominal data. For that reason, statistical functions such
as mean, variance, etc. can also not be applied to nominal data.



Qualitative Data/ Categorical
2. Ordinal data: in addition to possessing the properties of nominal data, can
also be naturally ordered.
 This means ordinal data also assigns named values to attributes but unlike
nominal data, they can be arranged in a sequence of increasing or decreasing
value so that we can say whether a value is better than or greater than
another value.
 Examples of ordinal data are 1. Customer satisfaction: ‘Very Happy’, ‘Happy’,
‘Unhappy’, etc. 2. Grades: A, B, C, etc. 3. Hardness of Metal: ‘Very Hard’,
‘Hard’, ‘Soft’, etc.
 Like nominal data, basic counting is possible for ordinal data. Hence, the
mode can be identified. Since ordering is possible in case of ordinal data,
median, and quartiles can be identified in addition. Mean can still not be
calculated.



Quantitative Data/ Numerical
Quantitative data: relates to information about the quantity of an object – hence it
can be measured.
 For example, if we consider the attribute ‘marks’, it can be measured using a
scale of measurement.
 Quantitative data is also termed as numeric data. There are two types of
quantitative data:
1. Interval data 2. Ratio data
1.Interval data: is numeric data for which not only the order is known, but the exact
difference between values is also known.
 An ideal example of interval data is Celsius temperature: the difference between
successive values remains the same. Dates and times are other examples.
 For interval data, mathematical operations such as addition and subtraction are
possible. For that reason, for interval data, the central tendency can be
measured by mean, median, or mode. Standard deviation can also be calculated.
Quantitative Data/ Numerical

2. Ratio data: represents numeric data for which exact value can be measured.
Absolute zero is available for ratio data.
 Also, these variables can be added, subtracted, multiplied, or divided. The
central tendency can be measured by mean, median, or mode and methods of
dispersion such as standard deviation.
 Examples of ratio data include height, weight, age, salary, etc.



Types of Data
➢ Apart from the approach detailed above, attributes can also be
categorized into types based on a number of values that can be
assigned.
➢ The attributes can be either discrete or continuous
Discrete attributes: can assume a finite or countably infinite number of
values.
➢ Nominal attributes such as roll number, street number, pin code, etc.
can have a finite number of values whereas numeric attributes such as
count, rank of students, etc. can have countably infinite values.
➢ A special type of discrete attribute which can assume two values only
is called binary attribute.
➢ Examples of binary attribute include male/ female, positive/negative,
yes/no, etc.
Continuous attributes: can assume any possible value which is a real
number.
➢ Examples of continuous attribute include length, height, weight,
price, etc.
Example datasets:
 1.Data set consists of only numerical attributes
 2.Data set consists of only categorical attributes
 3.Data set consists of both numerical and categorical attributes
Dataset 1 (only numerical attributes):
  age | income | height | weight
  20  | 12000  | 6.3    | 30
  40  | 15000  | 5.2    | 70
  35  | 20000  | 5.6    | 65
  60  | 100000 | 5.4    | 59

Dataset 2 (only categorical attributes):
  age    | credit rating | student
  youth  | fair          | yes
  youth  | good          | no
  senior | excellent     | yes
  middle | good          | yes
  senior | fair          | no
  middle | good          | no

Dataset 3 (numerical and categorical attributes):
  age    | income | student
  youth  | 12000  | yes
  senior | 15000  | no
  middle | 20000  | yes
  youth  | 100000 | yes
Types of Data
Structured Data:
 This type of data is either numbers or words, expressed in tabular format. Coded
category values can take numerical form, but mathematical operations cannot be
meaningfully performed on them.
 Ex: Sunny=1, Cloudy=2, Windy=3, or binary-form data like 0 or 1, Good or Bad, etc.

Unstructured Data:
 This type of data does not have a predefined format and is therefore known
as unstructured data.
 Ex: This comprises textual data, sounds, images,
videos, etc.



Exploring Structure of data
➢ we come across two basic data types – numeric and categorical.
➢ With this context in mind, we can go deeper into understanding a data set.
➢ We need to understand that in a data set, which of the attributes are
numeric and which are categorical in nature.
➢ This is because the approach for (I) exploring numeric data is different from
the approach for (II) exploring categorical data.
I. Exploring numerical data: The two most effective plots to explore numerical
data are the box plot and the histogram.
1. Understanding central tendency
2. Understanding data spread



1.Understanding central tendency
To understand the nature of numeric variables, we can apply the measures of central
tendency of data, i.e. mean and median.
➢ In statistics, measures of central tendency help us understand the central point of
a set of data.
➢ Mean, by definition, is a sum of all data values divided by the count of data
elements.
➢ For example, the mean of a set of observations 21, 89, 34, 67, and 96 is calculated as:
   mean = (21 + 89 + 34 + 67 + 96) / 5 = 307 / 5 = 61.4
 Median, on the contrary, is the value of the element appearing in the middle of an
ordered list of data elements.
➢ If we consider the above 5 data elements, the ordered list would be 21, 34, 67,
89, and 96.
➢ Since there are 5 data elements, the 3rd element in the ordered list is considered
as the median. Hence, the median value of this set of data is 67.

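Both measures from the example above can be verified directly in Python:

```python
# Mean and median of the five observations from the slide.
data = [21, 89, 34, 67, 96]

mean = sum(data) / len(data)          # (21 + 89 + 34 + 67 + 96) / 5 = 307 / 5 = 61.4
ordered = sorted(data)                # [21, 34, 67, 89, 96]
median = ordered[len(ordered) // 2]   # middle element of an odd-length list -> 67
```

For an even number of elements, the median is conventionally the mean of the two middle values; that case is omitted here for brevity.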


1.Understanding central tendency

➢ There might be a natural curiosity to understand why two measures of central


tendency are reviewed.
➢ The reason is mean and median are impacted differently by data values appearing at
the beginning or at the end of the range.
➢ Mean being calculated from the cumulative sum of data values, is impacted if too
many data elements are having values closer to the far end of the range, i.e. close to
the maximum or minimum values.
➢ It is especially sensitive to outliers, i.e. the values which are unusually high or low,
compared to the other values.
➢ Mean is likely to get shifted drastically even due to the presence of a small number
of outliers.
➢ If we observe that for certain attributes the deviation between the values of mean and
median is quite high, we should investigate those attributes further and try to find
out the root cause along with the need for remediation.



2. Understanding data spread
 Now that we have explored the central tendency of the different numeric
attributes, we have a clear idea of which attributes have a large deviation
between mean and median.
 Let’s look closely at those attributes. To drill down more, we need to look
at the entire range of values of the attributes, though not at the level of
data elements as that may be too vast to review manually.
 So we will take a granular view of the data spread in the form of
1. Dispersion of data
2. Position of the different data values



2. Understanding data spread
1. Measuring data dispersion:
 Consider the data values of two attributes
1. Attribute 1 values: 44, 46, 48, 45, and 47 (sum 230, mean = 230/5 = 46)
2. Attribute 2 values: 34, 46, 59, 39, and 52 (sum 230, mean = 230/5 = 46)
 Both the set of values have a mean and median of 46. However, the first set of values that
is of attribute 1 is more concentrated or clustered around the mean/median value whereas
the second set of values of attribute 2 is quite spread out or dispersed.
 To measure the extent of dispersion of a data, or to find out how much the different values
of a data are spread out, the variance of the data is measured.
 The variance of a data is measured using the formula given below:
   variance = (1/n) × Σ (xᵢ − x̄)²
   where x̄ is the mean of the n data values; standard deviation is the square
   root of the variance.
 A larger value of variance or standard deviation indicates more dispersion in
the data, and vice versa.
2.Understanding data spread
For Attribute 1: variance = (4 + 0 + 4 + 1 + 1) / 5 = 10 / 5 = 2, standard deviation ≈ 1.41.
For Attribute 2: variance = (144 + 0 + 169 + 49 + 36) / 5 = 398 / 5 = 79.6, standard deviation ≈ 8.92.
 So it is quite clear from the measure that attribute 1 values are quite concentrated
around the mean while attribute 2 values are extremely spread out.
 Since this data was small, a visual inspection and understanding were possible and
that matches with the measured value.
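The dispersion comparison above can be reproduced in a few lines of Python, using the population variance (dividing by n; some texts divide by n − 1 for sample variance):

```python
import math

def variance(xs):
    """Population variance: the mean of squared deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

attr1 = [44, 46, 48, 45, 47]   # clustered tightly around the mean of 46
attr2 = [34, 46, 59, 39, 52]   # same mean of 46, but spread out

var1, var2 = variance(attr1), variance(attr2)
sd1, sd2 = math.sqrt(var1), math.sqrt(var2)   # standard deviations
```

Attribute 2's variance comes out roughly forty times larger than attribute 1's, matching the visual impression of spread.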


2.Understanding data spread
2. Measuring data value position
 When the data values of an attribute are arranged in an increasing order, we have seen
earlier that median gives the central data value, which divides the entire data set into
two halves.
 Similarly, if the first half of the data is divided into two halves so that each half
consists of one-quarter of the data set, then the median of the first half is known as
the first quartile or Q1.
 In the same way, if the second half of the data is divided into two halves, then the
median of the second half is known as the third quartile or Q3.
 The overall median is also known as the second quartile or Q2. So, any data set has
five key values: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
 Quantiles refer to specific points in a data set which divide the data set into equal
parts or equally sized quantities.
 There are specific variants of quantile, the one dividing data set into four parts being
termed as quartile. Another such popular variant is percentile, which divides the data
set into 100 parts.

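The quartiles just described can be computed with Python's standard library. Note that different tools interpolate quantiles slightly differently, so the exact cut points can vary between libraries:

```python
import statistics

data = [21, 34, 67, 89, 96]                  # the ordered observations used earlier

# statistics.quantiles with n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Five-number summary: minimum, Q1, median (Q2), Q3, maximum.
summary = (min(data), q1, q2, q3, max(data))
```

Passing n=100 instead of n=4 would return the 99 percentile cut points, the other quantile variant mentioned above.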


Plotting and exploring numerical data
Box plots:
 A box plot is an extremely effective mechanism to get a one-shot view and
understand the nature of the data.
Histogram:
 Histogram is another plot which helps in effective visualization of numeric
attributes.
 It helps in understanding the distribution of a numeric data into series of
intervals, also termed as ‘bins’.
 The important difference between histogram and box plot is
 The focus of histogram is to plot ranges of data values (acting as ‘bins’), the
number of data elements in each range will depend on the data distribution.
Based on that, the size of each bar corresponding to the different ranges will
vary.
 The focus of box plot is to divide the data elements in a data set into four
equal portions, such that each portion contains an equal number of data
elements
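The histogram's 'bins' can be illustrated without any plotting library — the data values, range, and bin count below are illustrative assumptions:

```python
# Bin a numeric attribute into equal-width intervals ('bins'), as a histogram does.
data = [21, 34, 67, 89, 96, 45, 52, 70, 88, 30]

lo, hi, n_bins = 20, 100, 4                  # illustrative range and bin count
width = (hi - lo) / n_bins                   # each bin covers 20 units
counts = [0] * n_bins
for x in data:
    idx = min(int((x - lo) // width), n_bins - 1)   # clamp the maximum into the last bin
    counts[idx] += 1
# counts[i] is the bar height for bin [lo + i*width, lo + (i+1)*width)
```

Unlike a box plot, which always splits the data into four equally sized portions, the bin counts here depend entirely on how the data is distributed across the fixed-width ranges.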
II.Exploring categorical data
 We have seen there are multiple ways to explore numeric data.
 However, there are not many options for exploring categorical data.
 We may also look for a little more detail and get a table consisting of the
categories of the attribute and the count of the data elements falling into
each category.
 From this count we get the proportion of data elements belonging to each
category.
 The statistical measure ‘mode’ is applicable to categorical attributes.
 As we know, like mean and median, mode is also a statistical
measure for central tendency of a data.
 Mode of a data is the data value which appears most often. In
context of categorical attribute, it is the category which has highest
number of data values.
 Since mean and median cannot be applied to categorical variables,
mode is the sole measure of central tendency.


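The mode of a categorical attribute is simply its most frequent category; a small sketch (the cylinder values here are illustrative, not the full data set from the slides):

```python
from collections import Counter

# Illustrative values of a categorical attribute 'cylinders'.
cylinders = [4, 4, 4, 6, 8, 4, 6, 4]

freq = Counter(cylinders)                 # frequency table: category -> count
mode, count = freq.most_common(1)[0]      # the category with the highest frequency
proportion = count / len(cylinders)       # share of elements in that category
```

Here the mode is 4, covering more than half of the values — the same kind of observation made for the 'cylinders' attribute below.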
II.Exploring categorical data

• Find out the mode for the attributes


‘car name’ and ‘cylinders’.
• For cylinders, we can see that the
mode is 4, as that is the data value
for which frequency is highest. More
than 50% of data elements belong to
the category 4.
• However, it is not so evident for the
attribute ‘car name’ from the
information given.
• When we probe and try to find the
mode, it is found to be category ‘ford
pinto’ for which frequency is of
highest value 6.



II.Exploring categorical data
 An attribute may have one or more modes.
 Frequency distribution of an attribute having single mode is called
‘unimodal’, two modes are called ‘bimodal’ and multiple modes are
called ‘multimodal’.
 One more important angle of data exploration is to explore
relationship between attributes. There are multiple plots to enable us
explore the relationship between variables.
 The basic and most commonly used plot is scatter plot.
 A scatter plot helps in visualizing bivariate relationships, i.e.
relationship between two variables.
 It is a two-dimensional plot in which points or dots are drawn on
coordinates provided by values of the attributes.





Summary
Exploring Structure of data
1. Exploring Numerical Data
Understanding central tendency
statistical measures: Mean, Median
Understanding data spread
1. Dispersion of data
statistical measure: variance
2. Position of the different data values
statistical measure: quartile
Plotting- Box plot and Histogram

2. Exploring Categorical Data


statistical measure: mode
Plotting- Scatter Plot



DATA QUALITY AND REMEDIATION
Data quality:
Success of machine learning depends largely on the quality of data. A data which has the
right quality helps to achieve better prediction accuracy, in case of supervised learning.
However, it is not realistic to expect that the data will be flawless.
We have already come across at least two types of problems:
1. Certain data elements without a value or data with a missing value.
2. Data elements having value surprisingly different from the other elements, which we
term as outliers.

There are multiple factors which lead to these data quality issues.
Following are some of them:
1. Incorrect sample set selection
2. Errors in data collection



DATA QUALITY AND REMEDIATION
1. Incorrect sample set selection:
➢ The data may not reflect normal or regular quality due to incorrect selection of sample
set.
➢ For example, if we are selecting a sample set of sales transactions from a festive period
and trying to use that data to predict sales in future.
➢ In this case, the prediction will be far apart from the actual scenario, just because the
sample set has been selected in a wrong time.
➢ Similarly, if we are trying to predict poll results using a training data which doesn’t
comprise of a right mix of voters from different segments such as age, sex, ethnic
diversities, etc., the prediction is bound to be a failure.
➢ It may also happen due to incorrect sample size.
➢ For example, a sample of small size may not be able to capture all aspects or
information needed for right learning of the model.



DATA QUALITY AND REMEDIATION
2. Errors in data collection:
➢ In many cases, a person or a group of persons is responsible for the collection of
data to be used in a learning activity, resulting in outliers and missing values.
➢ In this manual process, there is the possibility of wrongly recording data either
in terms of value (say 20.67 is wrongly recorded as 206.7 or 2.067) or in terms
of a unit of measurement (say cm. is wrongly recorded as m. or mm.).
➢ This may result in data elements which have abnormally high or low value from
other elements. Such records are termed as outliers.
➢ It may also happen that the data is not recorded at all. In case of a survey
conducted to collect data, it is all the more possible as survey responders may
choose not to respond to a certain question.
➢ So the data value for that data element in that responder’s record is missing



DATA QUALITY AND REMEDIATION
Data remediation:
 The issues in data quality, as mentioned above, need to be remediated, if the right
amount of efficiency has to be achieved in the learning activity.
 Of the two major areas mentioned above, the first one can be remedied by
proper sampling techniques (random sampling, stratified sampling, systematic
sampling, cluster sampling, quota sampling, etc.).
 This is a completely different area, covered as a specialized subject in statistics.
 These techniques are not covered here.
 However, human errors are bound to happen, no matter whatever checks and
balances we put in.
 Hence, proper remedial steps need to be taken for the second area mentioned
above. We will discuss how to handle outliers and missing values.
1. Handling outliers
2. Handling missing values



DATA QUALITY AND REMEDIATION
1. Handling outliers:
 Outliers are data elements with an abnormally high or low value which may impact
prediction accuracy, especially in regression models.
 Once the outliers are identified and the decision has been taken to amend those
values, you may consider one of the following approaches.
 However, if the outliers are natural, i.e. the value of the data element is surprisingly
high or low because of a valid reason, then we should not amend it.
Remove outliers: If the number of records which are outliers is not many, a simple
approach may be to remove them.
Imputation: One other way is to impute the value with mean or median or mode. The
value of the most similar data element may also be used for imputation.
Capping: For values that lie outside the limits, we can cap them by replacing those
observations below the lower limit with the value of 5th percentile and those that lie
above the upper limit, with the value of 95th percentile.
If there is a significant number of outliers, they should be treated separately in the
statistical model: treat the outliers and the remaining data as two different groups,
build a model for each group, and then combine the outputs.
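The remove/impute/cap options above can be sketched with pandas; the series values are illustrative, and the IQR rule is used here just as one common way to flag outliers:

```python
import pandas as pd

# Illustrative data: one abnormally high value
s = pd.Series([12, 15, 14, 13, 16, 15, 14, 300])

# Flag outliers with the IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Option 1 -- remove: drop the outlier records
removed = s[~is_outlier]

# Option 2 -- impute: replace outliers with the median
imputed = s.mask(is_outlier, s.median())

# Option 3 -- cap: clip to the 5th/95th percentiles
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
```

Which option to pick depends on how many outliers there are and whether they are natural, as discussed above.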
2. Handling missing values
 In a data set, one or more data elements may have missing values in multiple records.
 As discussed above, this can be caused by omission on the part of the surveyor or the
person collecting sample data, or by the responder, primarily due to his/her
unwillingness to respond or lack of the understanding needed to provide a response.
 It may happen that a specific question (based on which the value of a data element
originates) is not applicable to a person or object with respect to which data is
collected.
 There are multiple strategies to handle missing value of data elements. Some of those
strategies have been discussed below.
1. Eliminate records having a missing value of data elements
 In case the proportion of data elements having missing values is within a tolerable
limit, a simple but effective approach is to remove the records having such data
elements.
 This is possible if the quantum of data left after removing the data elements having
missing values is sizeable.
 So, we can very well eliminate the records and keep working with the remaining
data set.
 However, this will not be possible if the proportion of records having data elements
with missing values is really high, as that will reduce the power of the model because
of the reduction in training data size.
2. Imputing missing values:
 Imputation is a method to assign a value to the data elements having missing
values. The mean/median/mode is the most frequently assigned value.
 For quantitative attributes, all missing values are imputed with the mean, median,
or mode of the remaining values under the same attribute.
 For qualitative attributes, all missing values are imputed by the mode of all
remaining values of the same attribute.
 However, another strategy may be to identify similar types of observations whose
values are known and use the mean/median/mode of those known values.
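A minimal pandas sketch of mean/mode imputation; the column names and values are made up:

```python
import pandas as pd

# Illustrative dataset with one missing value per column
df = pd.DataFrame({
    "age":  [25, 30, None, 22, 28],                # quantitative attribute
    "city": ["Hyd", "Hyd", None, "Pune", "Hyd"],   # qualitative attribute
})

# Quantitative attribute: impute with the mean (median works the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# Qualitative attribute: impute with the mode
df["city"] = df["city"].fillna(df["city"].mode()[0])
```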
3. Estimate missing values:
 If there are data points similar to the ones with missing attribute values, then the
attribute values from those similar data points can be used in place of the
missing value.
 For finding similar data points or observations, a distance function can be used. For
example, let’s assume that the weight of a Russian student having age 12 years
and height 5 ft. is missing.
 Then the weight of any other Russian student having age close to 12 years and
height close to 5 ft. can be assigned.
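The idea above can be sketched as a simple nearest-neighbour lookup in NumPy; the numbers are illustrative, and libraries such as scikit-learn offer a ready-made KNNImputer for the same purpose:

```python
import numpy as np

# Columns: age (years), height (ft), weight (kg); NaN marks the missing weight
data = np.array([
    [12.0, 5.0, np.nan],   # the student whose weight is missing
    [12.5, 5.1, 41.0],
    [11.8, 4.9, 39.0],
    [16.0, 5.8, 60.0],
])

target = data[0, :2]
known = data[1:]

# Euclidean distance over the known attributes (age, height)
dists = np.linalg.norm(known[:, :2] - target, axis=1)

# Assign the weight of the closest student
data[0, 2] = known[np.argmin(dists), 2]
```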
Data Preprocessing
 Data preprocessing is the process of transforming raw data into a useful, understandable format.
Real-world or raw data usually has inconsistent formatting, human errors, and can also be
incomplete. Data preprocessing resolves such issues and makes datasets more complete and
efficient to perform data analysis.
 In other words, data preprocessing is transforming data into a form that computers can easily work
on. It makes data analysis or visualization easier and increases the accuracy and speed of the
machine learning algorithms that train on the data.
Why is data preprocessing required?
 A database is a collection of data points. Data points are also called observations, data samples,
events, and records.
 Each sample is described using different characteristics, also known as features or attributes. Data
preprocessing is essential to effectively build models with these features.
 If you’re aggregating data from two or more independent datasets, the gender field may have two
different values for men: man and male. Likewise, if you’re aggregating data from ten different
datasets, a field that’s present in eight of them may be missing in the remaining two.
 By preprocessing data, we make it easier to interpret and use. This process eliminates
inconsistencies or duplicates in data, which can otherwise negatively affect a model’s accuracy.
Data preprocessing also ensures that there aren’t any incorrect or missing values due to human
error or bugs. In short, employing data preprocessing techniques makes the database more
complete and accurate.
The four stages of data preprocessing
 There are four stages of data preprocessing: cleaning, integration, reduction, and transformation.
1. Data cleaning: or cleansing/scrubbing is the process of cleaning datasets by accounting for
missing values, removing outliers, correcting inconsistent data points, and smoothing noisy
data. In essence, the motive behind data cleaning is to offer complete and accurate samples for
machine learning models.
• Missing values
• Noisy data
i) Missing values:
 The problem of missing data values is quite common. It may happen during data collection or
due to some specific data validation rule. In such cases, you need to collect additional data
samples or look for additional datasets.
 The issue of missing values can also arise when you concatenate two or more datasets to form
a bigger dataset. If not all fields are present in both datasets, it’s better to delete such fields
before merging.
 Here are some ways to account for missing data:
• Manually fill in the missing values. This can be a tedious and time-consuming approach and is
not recommended for large datasets.
• Make use of a standard value to replace the missing data value. You can use a global constant
like “unknown” or “N/A” to replace the missing value. Although a straightforward approach, it
isn’t foolproof.
• Fill the missing value with the most probable value. To predict the probable value, you can use
algorithms like logistic regression or decision trees.
• Use a central tendency to replace the missing value. Central tendency is the tendency of a
value to cluster around its mean, mode, or median.
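The last three strategies can be sketched in a few lines of pandas/NumPy; here a plain linear fit stands in for the logistic regression or decision tree mentioned above, and all column names and values are made up:

```python
import numpy as np
import pandas as pd

# Illustrative data: income missing for one record, dept for another
df = pd.DataFrame({
    "years_exp": [1, 2, 3, 4, 5],
    "income":    [30.0, 40.0, 50.0, np.nan, 70.0],
    "dept":      ["HR", None, "IT", "IT", "HR"],
})

# Standard value: fill a qualitative gap with a global constant
df["dept"] = df["dept"].fillna("unknown")

# Most probable value: fit a line on the known rows, predict the missing one
known = df.dropna(subset=["income"])
slope, intercept = np.polyfit(known["years_exp"], known["income"], 1)
missing = df["income"].isna()
df.loc[missing, "income"] = slope * df.loc[missing, "years_exp"] + intercept
```

A central-tendency fill would simply use `df["income"].fillna(df["income"].mean())` instead of the fitted line.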
ii) Noisy data
 A large amount of meaningless data is called noise. More precisely, it’s the random variance in a
measured variable or data having incorrect attribute values. Noise includes duplicates or
semi-duplicates of data points, data segments of no value for a specific research process, or
unwanted information fields.
 For example, if you need to predict whether a person can drive, information about their hair
color, height, or weight will be irrelevant.
 An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re
training an algorithm to detect tortoises in pictures. The image dataset may contain images of
turtles wrongly labeled as tortoises. This can be considered noise.
 However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That
sample can be considered an outlier and not necessarily noise. This is because we want to
teach the algorithm all possible ways to detect tortoises, and so, deviation from the group is
essential.
 For numeric values, you can use a scatter plot or box plot to identify outliers.
 The following are some methods used to solve the problem of noise:
• Regression: Regression analysis can help determine the variables that have an impact. This
will enable you to work with only the essential features instead of analyzing large volumes of
data. Both linear regression and multiple linear regression can be used for smoothing the data.
• Binning: Binning methods can be used for a collection of sorted data. They smoothen a sorted
value by looking at the values around it. The sorted values are then divided into “bins,” which
means sorting data into smaller segments of the same size. There are different techniques for
binning, including smoothing by bin means and smoothing by bin medians.
• Clustering: Clustering algorithms such as k-means clustering can be used to group data and
detect outliers in the process.
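Smoothing by bin means can be sketched as follows; the sorted values and the equal bin size of four are chosen purely for illustration:

```python
import numpy as np

# Sorted, noisy values
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Divide the sorted data into equal-size bins of 4 values each
bins = values.reshape(-1, 4)

# Smoothing by bin means: every value is replaced by its bin's mean
smoothed = np.repeat(bins.mean(axis=1), 4)
```

Smoothing by bin medians works the same way with `np.median(bins, axis=1)`.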
2. Data integration

Data integration combines data from multiple sources into a coherent data store. These sources may
include multiple databases. How can the data be matched up? A data analyst may find Customer_ID in
one database and cust_id in another; how can they be sure that these two refer to the same entity?
Databases and data warehouses have metadata (data about data), which helps in avoiding such errors.
Since data is collected from various sources, data integration is a crucial part of data preparation.
Integration may lead to several inconsistent and redundant data points, ultimately leading to
models with inferior accuracy.
 Here are some approaches to integrate data:
• Data consolidation: Data is physically brought together and stored in a single place. Having all
data in one place increases efficiency and productivity. This step typically involves using data
warehouse software.
• Data virtualization: In this approach, an interface provides a unified and real-time view of data
from multiple sources. In other words, data can be viewed from a single point of view.
• Data propagation: Involves copying data from one location to another with the help of specific
applications. This process can be synchronous or asynchronous and is usually event-driven.
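A minimal consolidation sketch with pandas, reusing the Customer_ID/cust_id example above (table contents are made up):

```python
import pandas as pd

# Two sources name the same key differently
orders = pd.DataFrame({"Customer_ID": [1, 2], "amount": [250, 125]})
profiles = pd.DataFrame({"cust_id": [1, 2], "city": ["Hyd", "Pune"]})

# Harmonize the key name, then consolidate into a single store
profiles = profiles.rename(columns={"cust_id": "Customer_ID"})
combined = orders.merge(profiles, on="Customer_ID")
```

In practice the knowledge that `Customer_ID` and `cust_id` refer to the same entity would come from metadata, as noted above.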
3. Data reduction

 As the name suggests, data reduction is used to reduce the amount of data and thereby reduce
the costs associated with data mining or data analysis.
 It offers a condensed representation of the dataset. Although this step reduces the volume, it
maintains the integrity of the original data. This data preprocessing step is especially crucial
when working with big data as the amount of data involved would be gigantic.
 The following are some techniques used for data reduction.
 Dimensionality reduction, also known as dimension reduction, reduces the number of features
or input variables in a dataset.
 The number of features or input variables of a dataset is called its dimensionality. The higher
the number of features, the more troublesome it is to visualize the training dataset and create
a predictive model.
 In some cases, most of these attributes are correlated, hence redundant; therefore,
dimensionality reduction algorithms can be used to reduce the number of random variables and
obtain a set of principal variables.
 There are two segments of dimensionality reduction: feature selection and feature extraction.
i. Feature selection (selecting a subset of the variables): tries to find a subset of the original set of
features. This gives a smaller subset that can be used to visualize the problem using data modeling.
ii. Feature extraction (extracting new variables from the data): reduces the data in a high-dimensional
space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
 The following are some ways to perform dimensionality reduction:
• Principal component analysis (PCA): A statistical technique used to extract a new set of variables
from a large set of variables. The newly extracted variables are called principal components. This
method works only for features with numerical values.
• High correlation filter: A technique used to find highly correlated features and remove them;
otherwise, a pair of highly correlated variables can increase the multicollinearity in the dataset.
• Missing values ratio: This method removes attributes having missing values more than a specified
threshold.
• Low variance filter: Involves removing normalized attributes having variance less than a threshold
value as minor changes in data translate to less information.
• Random forest: This technique is used to assess the importance of each feature in a dataset,
allowing us to keep just the most important features.
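Two of the techniques above, a low variance filter followed by PCA, can be sketched as follows; the variance threshold and data are illustrative, and scikit-learn's PCA would normally be used instead of this hand-rolled SVD:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [5.0, 5.0, 5.0, 5.0],   # constant feature: carries no information
    "f3": [2.0, 4.0, 6.0, 8.0],
})

# Low variance filter: drop features whose variance is below a threshold
keep = df.columns[df.var() > 0.1]
reduced = df[keep]

# PCA via SVD: center the data and project onto the first principal component
X = (reduced - reduced.mean()).values
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ vt[0]
```

Here f3 is perfectly correlated with f1, so a high correlation filter would also drop one of the pair; the first principal component captures all the remaining variance.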
4. Data Transformation
 Data transformation is the process of converting data from one format to another. In essence, it involves
methods for transforming data into appropriate formats that the computer can learn efficiently from.
 For example, the speed units can be miles per hour, meters per second, or kilometers per hour. Therefore a
dataset may store values of the speed of a car in different units as such. Before feeding this data to an
algorithm, we need to transform the data into the same unit.
 The following are some strategies for data transformation.
 Smoothing
 This statistical approach is used to remove noise from the data with the help of algorithms. It helps highlight the
most valuable features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to
make the patterns more visible.
 Aggregation
 Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or
analysis. Aggregating data from various sources to increase the number of data points is essential as only then
the ML model will have enough examples to learn from.
 Discretization
 Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient
to place people in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age
values.
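The age-group example above can be sketched with pandas; the bin edges are illustrative:

```python
import pandas as pd

ages = pd.Series([13, 17, 25, 34, 52, 70])

# Continuous ages -> a small set of categorical intervals
groups = pd.cut(
    ages,
    bins=[0, 19, 35, 60, 120],
    labels=["teen", "young adult", "middle age", "senior"],
)
```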
 Generalization
Generalization involves converting low-level data features into high-level data features. For instance,
categorical attributes such as home address can be generalized to higher-level definitions such as city or state.
 Normalization
 Normalization refers to the process of converting all data variables into a specific range. In other
words, it’s used to scale the values of an attribute so that it falls within a smaller range, for example,
0 to 1. Decimal scaling, min-max normalization, and z-score normalization are some methods of data
normalization.
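Min-max and z-score normalization can be sketched directly with NumPy (the sample values are illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: scale the values into the range [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
zscore = (x - x.mean()) / x.std()
```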
 Feature construction
 Feature construction involves constructing new features from the given set of features. This
method simplifies the original dataset and makes it easier to analyze, mine, or visualize the data.
 Concept hierarchy generation
 Concept hierarchy generation lets you create a hierarchy between features, although it isn’t
specified. For example, if you have a house address dataset containing data about the street, city,
state, and country, this method can be used to organize the data in hierarchical forms.
 Accurate data, accurate results
 Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or
unfavorable. Just as kids start repeating foul language picked up from adults, inaccurate or
inconsistent data easily influences ML models. The key is to feed them high-quality, accurate data,
for which data preprocessing is an essential step.
THANK YOU