ML-Unit-1
UNIT-I
By
B.RUPA
Assistant Professor, Dept of CSE(DS)
Vardhaman College of Engineering
Unit-I: Contents
Introduction to Machine Learning:
▪ Types of Machine Learning
▪ Problems not to be solved using Machine Learning
▪ Applications of Machine Learning
▪ Tools in Machine Learning
▪ Issues in Machine Learning
▪ Machine learning Activities
▪ Basic Types of Data in Machine Learning
▪ Exploring Structure of data
▪ Data Quality & Remediation
▪ Data Pre-Processing
1. Machine learning is the "field of study that gives computers the ability to learn without being
explicitly programmed", as defined by Arthur Samuel in 1959.
In machine learning, algorithms are trained to find patterns and correlations in large data
sets and to make the best decisions and predictions based on that analysis.
(OR)
2. "A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E" – by Tom Mitchell in 1998.
Machine Learning is the study of algorithms that improve their performance P at some task T with
experience E.
Continued..
Machine learning behaves similarly to the growth of a child. As a child grows, her experience E in
performing task T increases, which results in a higher performance measure (P).
For instance, we give a “shape sorting block” toy to a child. (We know that the toy has different
shapes and shape holes).
In this case, our task T is to find an appropriate shape hole for a shape. Afterward, the child
observes the shape and tries to fit it in a shaped hole.
Let us say that this toy has three shapes: a circle, a triangle, and a square. In her first attempt at
finding a shaped hole, her performance measure(P) is 1/3, which means that the child found 1 out
of 3 correct shape holes.
Next, the child tries it another time and notices that she is a little more experienced at this task.
Considering the experience gained (E), the child tries the task again, and when measuring the
performance (P), it turns out to be 2/3. After repeating this task (T) 100 times, the child has
figured out which shape goes into which shape hole.
Such execution is similar to machine learning. What a machine does is, it takes a task (T),
executes it, and measures its performance (P). Now a machine has a large amount of data,
so as it processes that data, its experience (E) increases over time, resulting in a higher
performance measure (P). So after going through all the data, our machine learning model’s
accuracy increases, which means that the predictions made by our model will be very
accurate.
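The idea that performance P at a task T improves with experience E can be illustrated with a small, hedged scikit-learn sketch (the dataset and model below are illustrative choices, not from the slides): the same model is trained on increasingly large portions of a labelled dataset, and its accuracy on a held-out test set generally rises.

```python
# Minimal sketch: accuracy (P) grows as the amount of training data (E) grows
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 200, 800):                                # growing "experience" E
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train[:n], y_train[:n])                 # task T: classify digit images
    p = accuracy_score(y_test, model.predict(X_test))   # performance measure P
    print(f"trained on {n} samples -> accuracy {p:.2f}")
```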
3. Machine Learning is the ability of systems to learn from data, identify patterns, and
act on what they have learned, with minimal or no human interaction.
Traditional Programming:
Data and a program are run on the computer to produce the output.
    Data + Program → Computer → Output
Machine Learning:
Data and output are run on the computer to create a program.
This program can be used in traditional programming.
    Data + Output → Computer → Program
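A hypothetical sketch of the contrast: in traditional programming the rule is hand-written, while in machine learning the "program" (a model) is created from data and the corresponding output labels. The feature, threshold and labels below are made up for illustration.

```python
# Traditional programming: the rule itself is written by hand
def is_spam_rule(num_links: int) -> bool:
    return num_links > 3                      # hand-coded threshold

# Machine learning: the "program" (model) is created from data + output
from sklearn.tree import DecisionTreeClassifier

data = [[0], [1], [2], [5], [7], [9]]         # feature: number of links in a mail
output = [0, 0, 0, 1, 1, 1]                   # labels supplied by humans (0 = ham, 1 = spam)

model = DecisionTreeClassifier().fit(data, output)
print(is_spam_rule(4), model.predict([[4]])[0])   # hand-written rule vs. learned program
```
The learned model can then be reused exactly like a hand-written program, as the diagram above suggests.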
Application of ML
Wherever there is a substantial amount of past data, machine learning can be
used to generate actionable insight from the data.
Though machine learning is adopted in multiple forms in every business domain, three major
domains are outlined below just to give some idea about what type of actions can be done using
machine learning.
The algorithms related to different machine learning tasks are known to all and can be
implemented using any language/platform.
Python: Python is one of the most popular open-source programming languages, widely
adopted by the machine learning community.
• Python has very strong libraries for advanced mathematical functionalities (NumPy),
algorithms and mathematical tools (SciPy) and numerical plotting (matplotlib). Built on
these libraries, there is a machine learning library named scikit-learn, which has various
classification, regression, and clustering algorithms embedded in it.
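As a small illustrative sketch of this stack (not from the slides; the data is synthetic): NumPy generates the arrays, scikit-learn fits a simple regression model, and matplotlib plots the result.

```python
# NumPy for arrays, scikit-learn for the model, matplotlib for numerical plotting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))          # synthetic feature values
y = 3 * X.ravel() + rng.normal(0, 2, 50)      # noisy linear target

model = LinearRegression().fit(X, y)          # scikit-learn regression
plt.scatter(X, y, label="data")
plt.plot(X, model.predict(X), color="red", label="fitted line")
plt.legend()
plt.show()
```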
R: R is a language for statistical computing and data analysis. It is an open source language,
extremely popular in the academic community – especially among statisticians and data
miners.
• R is a very simple programming language with a huge set of libraries available for
different stages of machine learning.
• Some of the libraries standing out in terms of popularity are plyr/dplyr (for data
transformation), caret (‘Classification and Regression Training’ for classification), RJava
(to facilitate integration with Java), tm (for text mining), ggplot2 (for data
visualization).
• Other than the libraries, certain packages like Shiny and R Markdown have been developed
around R to build interactive web applications, documents and dashboards on R without much
effort.
ML Tools
SAS: SAS (earlier known as 'Statistical Analysis System') is another licensed commercial
software suite which provides strong support for machine learning functionalities.
• SAS is a software suite comprising different components.
• The basic data management functionalities are embedded in the Base SAS component,
whereas the other components like SAS/INSIGHT, Enterprise Miner, SAS/STAT, etc. help in
specialized functions related to data mining and statistical analysis.
[Figure: examples of machine learning algorithms – Supervised: Simple Linear, Multiple Linear and
Multinomial Logistic Regression, Decision trees, KNN; Unsupervised: K-Means, K-Modes, Divisive
clustering; Reinforcement: Markov Decision Process; Deep Learning: Artificial Neural Networks,
Convolutional Neural Networks, Recurrent Neural Networks]
Types of Machine Learning
There are primarily three types of machine learning: Supervised, Unsupervised, and
Reinforcement Learning.
• Supervised machine learning: the user supervises the machine while training it to work on its
own. This requires labeled training data.
• Unsupervised learning: there is training data, but it is not labeled.
• Reinforcement learning: the system learns on its own.
• In the accompanying graph, customers are grouped into three clusters.
• Group A customers use more data and also have high call durations.
• Group B customers are heavy Internet users, while
• Group C customers have high call durations.
• So, Group B will be given more data benefit plans, Group C will be given cheaper call rate
plans, and Group A will be given the benefit of both.
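A hypothetical sketch of this kind of customer grouping with scikit-learn's KMeans; the data-usage and call-duration figures below are made up.

```python
# Hypothetical customer segmentation (unsupervised learning)
import numpy as np
from sklearn.cluster import KMeans

# columns: monthly data usage (GB), average call duration (min) -- made-up values
customers = np.array([
    [20, 60], [22, 55], [25, 65],   # heavy data use and long calls
    [30, 5],  [28, 8],  [35, 4],    # heavy data use, short calls
    [2, 70],  [1, 65],  [3, 80],    # little data, long calls
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)               # one cluster id per customer, learned without labels
```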
Supervised vs. unsupervised learning:
1. The data used in supervised learning is labeled. The system learns from the labeled data and
makes future predictions. An unsupervised algorithm does not require any labeled data because
its job is to look for patterns in the input data and organize it.
2. In supervised learning we get feedback: once you receive the output, the system remembers it
and uses it for the next operation. That does not happen with unsupervised learning.
Following are the typical preparation activities done once the input data comes
into the machine learning system:
Understand the type of data in the given input data set.
Explore the data to understand the nature and quality.
Explore the relationships amongst the data elements, e.g. inter-feature
relationship.
Find potential issues in data.
Do the necessary remediation, e.g. impute missing data values, etc., if needed.
Apply pre-processing steps, as necessary.
Once the data is prepared for modelling, the learning tasks start off.
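A hedged pandas sketch of these preparation activities on a small made-up data set (the column names and values are illustrative only):

```python
# Typical preparation activities, sketched with pandas
import pandas as pd

df = pd.DataFrame({                                     # made-up input data set
    "age": [25, 31, None, 40, 22],
    "salary": [30000, 45000, 52000, None, 28000],
    "city": ["Hyd", "Hyd", "Blr", "Blr", None],
})

print(df.dtypes)                              # understand the type of each attribute
print(df.describe(include="all"))             # explore the nature and quality of the data
print(df.corr(numeric_only=True))             # inter-feature relationships
print(df.isna().sum())                        # find potential issues (missing values)

df["age"] = df["age"].fillna(df["age"].median())          # remediation: impute missing values
df["salary"] = df["salary"].fillna(df["salary"].median())
```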
2. Ratio data: represents numeric data for which the exact value can be measured.
An absolute zero is available for ratio data.
Also, these variables can be added, subtracted, multiplied, or divided. The
central tendency can be measured by mean, median, or mode, and the dispersion by
measures such as standard deviation.
Examples of ratio data include height, weight, age, salary, etc.
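A small NumPy sketch (with made-up salary figures) of measuring central tendency and dispersion for a ratio attribute; the ratio computed at the end is meaningful only because ratio data has an absolute zero.

```python
# Central tendency and dispersion of a ratio attribute (made-up salaries)
import numpy as np

salary = np.array([30000, 32000, 35000, 35000, 40000, 52000, 90000])

values, counts = np.unique(salary, return_counts=True)
print("mean   :", salary.mean())
print("median :", np.median(salary))
print("mode   :", values[counts.argmax()])
print("std    :", salary.std())
print("ratio  :", salary.max() / salary.min())   # valid because of the absolute zero
```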
Unstructured Data:
This type of data does not have a predefined format
and is therefore known as unstructured data.
Ex: This comprises textual data, sounds, images,
videos, etc.
So it is quite clear from the measure that the values of attribute 1 are quite concentrated around
the mean while the values of attribute 2 are extremely spread out.
Since this data was small, a visual inspection and understanding were possible, and that matches
the measured value.
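As a hedged illustration (the attribute values from the slide's example are not reproduced here, so the numbers below are made up), two attributes with the same mean but very different spread:

```python
# Same mean, very different dispersion
import numpy as np

attribute_1 = np.array([48, 49, 50, 51, 52])     # concentrated around the mean
attribute_2 = np.array([5, 20, 50, 80, 95])      # extremely spread out

for name, attr in (("attribute 1", attribute_1), ("attribute 2", attribute_2)):
    print(f"{name}: mean = {attr.mean():.1f}, std = {attr.std():.1f}")
```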
There are multiple factors which lead to these data quality issues.
Following are some of them:
1. Incorrect sample set selection
2. Errors in data collection
Data Preprocessing
Data preprocessing is the process of transforming raw data into a useful, understandable format.
Real-world or raw data usually has inconsistent formatting, human errors, and can also be
incomplete. Data preprocessing resolves such issues and makes datasets more complete and
efficient to perform data analysis.
In other words, data preprocessing is transforming data into a form that computers can easily work
on. It makes data analysis or visualization easier and increases the accuracy and speed of the
machine learning algorithms that train on the data.
Why is data preprocessing required?
A database is a collection of data points. Data points are also called observations, data samples,
events, and records.
Each sample is described using different characteristics, also known as features or attributes. Data
preprocessing is essential to effectively build models with these features.
If you're aggregating data from two or more independent datasets, the gender field may have two
different values for men: man and male. Likewise, if you're aggregating data from ten different
datasets, a field that's present in eight of them may be missing in the remaining two.
By preprocessing data, we make it easier to interpret and use. This process eliminates
inconsistencies or duplicates in data, which can otherwise negatively affect a model’s accuracy.
Data preprocessing also ensures that there aren’t any incorrect or missing values due to human
error or bugs. In short, employing data preprocessing techniques makes the database more
complete and accurate.
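A hypothetical pandas sketch of the inconsistencies described above; the column names and values are made up.

```python
# Harmonizing inconsistent values after aggregating independent datasets
import pandas as pd

df = pd.DataFrame({
    "gender": ["man", "male", "female", "male", None],
    "age": [25, 31, None, 40, 22],
})

# the same entity recorded under two different values
df["gender"] = df["gender"].replace({"man": "male"})

# a field missing in some source datasets shows up as NaN; impute or flag it
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```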
An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re
training an algorithm to detect tortoises in pictures. The image dataset may contain images of
turtles wrongly labeled as tortoises. This can be considered noise.
However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That
sample can be considered an outlier and not necessarily noise. This is because we want to
teach the algorithm all possible ways to detect tortoises, and so, deviation from the group is
essential.
For numeric values, you can use a scatter plot or box plot to identify outliers.
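A hedged NumPy sketch of flagging numeric outliers with the box-plot (IQR) rule, using the common 1.5 × IQR convention on made-up values:

```python
# Box-plot (IQR) rule for flagging numeric outliers
import numpy as np

values = np.array([12, 13, 12, 14, 15, 13, 12, 95])    # 95 looks suspicious

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)                            # -> [95]
```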
The following are some methods used to solve the problem of noise:
• Regression: Regression analysis can help determine the variables that have an impact. This
will enable you to work with only the essential features instead of analyzing large volumes of
data. Both linear regression and multiple linear regression can be used for smoothing the data.
• Binning: Binning methods can be used for a collection of sorted data. They smoothen a sorted
value by looking at the values around it. The sorted values are then divided into “bins,” which
means sorting data into smaller segments of the same size. There are different techniques for
binning, including smoothing by bin means and smoothing by bin medians.
• Clustering: Clustering algorithms such as k-means clustering can be used to group data and
detect outliers in the process.
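A hypothetical pandas sketch of smoothing by bin means (the sorted values below are illustrative):

```python
# Smoothing by bin means: replace each value by the mean of its bin
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # already sorted

bin_ids = pd.qcut(prices, q=3, labels=False)          # equal-frequency bins (ids 0, 1, 2)
smoothed = prices.groupby(bin_ids).transform("mean")  # each value -> mean of its bin
print(pd.DataFrame({"original": prices, "smoothed": smoothed.round(2)}))
```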
2. Data integration
Data integration combines data from multiple sources into a coherent data store as part of a data
analysis task. These sources may include multiple databases. How can the data be matched up? In one
database a data analyst finds Customer_ID and in another cust_id; how can he be sure that these two
refer to the same entity? Databases and data warehouses have metadata (data about the data), which
helps in avoiding such errors.
Since data is collected from various sources, data integration is a crucial part of data preparation.
Integration may lead to several inconsistent and redundant data points, ultimately leading to
models with inferior accuracy.
Here are some approaches to integrate data:
• Data consolidation: Data is physically brought together and stored in a single place. Having all
data in one place increases efficiency and productivity. This step typically involves using data
warehouse software.
• Data virtualization: In this approach, an interface provides a unified and real-time view of data
from multiple sources. In other words, data can be viewed from a single point of view.
• Data propagation: Involves copying data from one location to another with the help of specific
applications. This process can be synchronous or asynchronous and is usually event-driven.
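A hypothetical pandas sketch of the Customer_ID / cust_id situation described above; the tables and values are made up.

```python
# Integrating two sources that name the same key differently
import pandas as pd

billing = pd.DataFrame({"Customer_ID": [1, 2, 3], "plan": ["A", "B", "C"]})
usage = pd.DataFrame({"cust_id": [1, 2, 3], "data_gb": [20, 2, 35]})

# metadata tells us Customer_ID and cust_id refer to the same entity
merged = billing.merge(usage, left_on="Customer_ID", right_on="cust_id")
merged = merged.drop(columns="cust_id")
print(merged)
```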
3. Data reduction
As the name suggests, data reduction is used to reduce the amount of data and thereby reduce
the costs associated with data mining or data analysis.
It offers a condensed representation of the dataset. Although this step reduces the volume, it
maintains the integrity of the original data. This data preprocessing step is especially crucial
when working with big data as the amount of data involved would be gigantic.
The following are some techniques used for data reduction.
Dimensionality reduction, also known as dimension reduction, reduces the number of features
or input variables in a dataset.
The number of features or input variables of a dataset is called its dimensionality. The higher
the number of features, the more troublesome it is to visualize the training dataset and create
a predictive model.
In some cases, most of these attributes are correlated, hence redundant; therefore,
dimensionality reduction algorithms can be used to reduce the number of random variables and
obtain a set of principal variables.
3. Data reduction (Continued)
There are two approaches to dimensionality reduction: feature selection and feature extraction.
i. Feature selection (selecting a subset of the variables): tries to find a subset of the original set of
features. This gives a smaller subset that can be used to visualize the problem and to model the data.
ii. Feature extraction (extracting new variables from the data): reduces the data in a high-
dimensional space to a lower-dimensional space, in other words, a space with a smaller number of
dimensions.
The following are some ways to perform dimensionality reduction:
• Principal component analysis (PCA): A statistical technique used to extract a new set of variables
from a large set of variables. The newly extracted variables are called principal components. This
method works only for features with numerical values.
• High correlation filter: A technique used to find highly correlated features and remove them;
otherwise, a pair of highly correlated variables can increase the multicollinearity in the dataset.
• Missing values ratio: This method removes attributes having missing values more than a specified
threshold.
• Low variance filter: Involves removing normalized attributes having variance less than a threshold
value as minor changes in data translate to less information.
• Random forest: This technique is used to assess the importance of each feature in a dataset,
allowing us to keep just the top most important features.
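A hedged scikit-learn sketch of dimensionality reduction with PCA on a built-in toy dataset (the choice of dataset and the number of components are illustrative):

```python
# PCA: project a 4-feature dataset onto 2 principal components
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)               # 4 numerical features
X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)           # (150, 4) -> (150, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)
```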
4. Data Transformation
Data transformation is the process of converting data from one format to another. In essence, it involves
methods for transforming data into appropriate formats that the computer can learn efficiently from.
For example, speed can be recorded in miles per hour, meters per second, or kilometers per hour, so a
dataset may store the speed of a car in any of these units. Before feeding this data to an
algorithm, we need to transform the data into the same unit.
The following are some strategies for data transformation.
Smoothing
This statistical approach is used to remove noise from the data with the help of algorithms. It helps highlight the
most valuable features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to
make the patterns more visible.
Aggregation
Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or
analysis. Aggregating data from various sources to increase the number of data points is essential as only then
the ML model will have enough examples to learn from.
Discretization
Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient
to place people in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age
values.
Generalization
Generalization involves converting low-level data features into high-level data features. For instance,
categorical attributes such as home address can be generalized to higher-level definitions such as
city or state.
4. Data Transformation (Continued)
Normalization
Normalization refers to the process of converting all data variables into a specific range. In other
words, it's used to scale the values of an attribute so that they fall within a smaller range, for example,
0 to 1. Decimal scaling, min-max normalization, and z-score normalization are some methods of data
normalization (a brief sketch of min-max scaling follows the list of strategies below).
Feature construction
Feature construction involves constructing new features from the given set of features. This
method simplifies the original dataset and makes it easier to analyze, mine, or visualize the data.
Concept hierarchy generation
Concept hierarchy generation lets you create a hierarchy between features even when one isn't
explicitly specified. For example, if you have a house address dataset containing data about the street,
city, state, and country, this method can be used to organize the data in hierarchical forms.
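A hedged pandas/scikit-learn sketch of two of the strategies above, discretization and min-max normalization, on made-up age and salary values:

```python
# Discretization (binning ages into categories) and min-max normalization
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [15, 23, 37, 45, 52, 70],
                   "salary": [0, 30000, 45000, 60000, 52000, 20000]})

# Discretization: continuous ages -> "teen", "young adult", "middle age", "senior"
df["age_group"] = pd.cut(df["age"], bins=[0, 19, 35, 55, 120],
                         labels=["teen", "young adult", "middle age", "senior"])

# Normalization: scale salary into the range 0 to 1
df["salary_scaled"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()
print(df)
```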
Accurate data, accurate results
Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or
unfavorable. Like how kids start repeating foul language picked up from adults, inaccurate or
inconsistent data easily influences ML models. The key is to feed them high-quality, accurate data,
for which data preprocessing is an essential step.