AML LAB MANUAL Yash
Laboratory Manual
Enrolment No: 210310116016
LIST OF EXPERIMENTS
Course Code: 3171617
Course Title: Applied Machine Learning
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
5. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
6. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering few test
data sets.
7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.
8. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data
set. Print both correct and wrong predictions. Java/Python ML library classes can be
used for this problem.
Experiment No: 1
Numeric Types:
1. Integers:
In Python 3, there is effectively no limit to how long an integer value can be. Of course,
it is constrained by the amount of memory your system has.
>>> print(10)
10
>>> type(10)
<class 'int'>
Strings:
Strings are sequences of character data. The string type in Python is called str. String literals
may be delimited using either single or double quotes. All the characters between the opening
delimiter and matching closing delimiter are part of the string.
A string in Python can contain as many characters as you wish. The only limit is
your machine's memory resources. A string can also be empty.
>>> print("I am a string.")
I am a string.
>>> type("I am a string.")
<class 'str'>
>>> ''
''
A raw string literal is preceded by r or R, which specifies that escape sequences in the
associated string are not translated. The backslash character is left in the string.
>>> print('foo\nbar')
foo
bar
>>> print(r'foo\nbar')
foo\nbar
>>> print('foo\\bar')
foo\bar
>>> print(R'foo\\bar')
foo\\bar
Boolean Type:
Python 3 provides a Boolean data type. Objects of Boolean type may have one of two values,
True or False.
>>> type(True)
<class 'bool'>
>>> type(False)
<class 'bool'>
Python List:
A list is an ordered sequence of items. It is one of the most used data types in Python and is very
flexible. The items in a list do not need to be of the same type. Declaring a list is
straightforward: items separated by commas are enclosed within brackets [ ]. We can use the
slicing operator [ ] to extract an item or a range of items from a list. Indexing starts from 0 in
Python.
Lists are mutable, meaning the values of the elements of a list can be altered.
>>> a = [1, 2.2, 'python']
Python Tuple:
A tuple is an ordered sequence of items, the same as a list. The only difference is that tuples are
immutable: once created, a tuple cannot be modified. Tuples are used to write-protect data and
are usually faster than lists because they cannot change dynamically. A tuple is defined within parentheses (),
with items separated by commas. We can use the slicing operator [] to extract items, but
we cannot change their values.
>>> t = (5, 'program', 1+3j)
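The difference between mutable lists and immutable tuples can be sketched as follows. This is a minimal illustration that reuses the values of `a` and `t` from the examples above; the names `first_two` and `tuple_changed` are introduced here only for demonstration.

```python
# Lists are mutable: elements can be replaced, and slices can be extracted.
a = [1, 2.2, 'python']
a[0] = 100            # change the first element in place
first_two = a[0:2]    # slicing returns a new list

# Tuples support indexing and slicing but not item assignment.
t = (5, 'program', 1 + 3j)
try:
    t[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False  # tuples are immutable, so we end up here

print(a)
print(first_two)
print(t[1], tuple_changed)
```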
Python Set:
A set is an unordered collection of unique items. A set is defined by values separated by commas
inside braces { }. We can perform set operations like union and
intersection on two sets. Sets hold unique values, so duplicates are eliminated. Since a set is an
unordered collection, indexing has no meaning. Hence the slicing operator [] does not work.
>>> a = {1,2,2,3,3,3}
>>> a
{1, 2, 3}
Python Dictionary:
A dictionary is a collection of key-value pairs. Since Python 3.7 it preserves insertion order, but
items are accessed by key, not by position. It is generally used when we have a
huge amount of data. Dictionaries are optimized for retrieving data. We must know the key
to retrieve the value. In Python, dictionaries are defined within braces {} with each item
being a pair in the form key:value. Values can be of any type, while keys must be of a hashable
(immutable) type such as a string, number or tuple. We use the key to retrieve
the respective value, but not the other way around.
>>> d = {1:'value','key':2}
>>> type(d)
<class 'dict'>
Operators in Python
Operators are used to perform operations on variables and values. Python divides the operators
in the following groups:
1. Arithmetic operators
Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication and division.
Operator  Description                                                              Syntax
+         Addition: adds two operands                                              x + y
-         Subtraction: subtracts the second operand from the first                 x - y
*         Multiplication: multiplies two operands                                  x * y
/         Division (float): divides the first operand by the second                x / y
//        Division (floor): divides the first operand by the second, rounding down x // y
%         Modulus: returns the remainder when the first operand is divided
          by the second                                                            x % y
2. Comparison/Relational operators
Relational operators compare values. Each returns either True or False according to the
condition.
Operator  Description                                                              Syntax
>         Greater than: True if left operand is greater than the right             x > y
<         Less than: True if left operand is less than the right                   x < y
==        Equal to: True if both operands are equal                                x == y
!=        Not equal to: True if operands are not equal                             x != y
>=        Greater than or equal to: True if left operand is greater than or
          equal to the right                                                       x >= y
<=        Less than or equal to: True if left operand is less than or equal
          to the right                                                             x <= y
3. Logical operators
Logical operators perform Logical AND, Logical OR and Logical NOT operations.
Operator  Description                                                              Syntax
and       Logical AND: True if both the operands are true                          x and y
or        Logical OR: True if either of the operands is true                       x or y
not       Logical NOT: True if operand is false                                    not x
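A short sketch of the three operator groups in action. The values of `x` and `y` and the variable names are chosen here only for illustration.

```python
x, y = 7, 2

# Arithmetic operators
true_div = x / y         # float division
floor_div = x // y       # floor division
remainder = x % y        # modulus
neg_floor = -7 // 2      # floor division rounds toward negative infinity

# Comparison operators return a Boolean
is_greater = x > y
is_equal = x == y

# Logical operators combine Boolean values
both = (x > y) and (y > 0)
either = (x < y) or (y > 0)
negated = not is_equal

print(true_div, floor_div, remainder, neg_floor)
print(is_greater, is_equal, both, either, negated)
```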
The execution environment is the environment in which Python programs are executed. It
covers the runtime behavior of the interpreter, including program startup,
configuration, and program termination.
Anaconda Navigator – Jupyter Notebook
Anaconda is a free and open-source distribution of the Python and R programming
languages for scientific computing, machine learning and data science that aims to simplify
package management and deployment.
The notebook extends the console-based approach to interactive computing in a qualitatively
new direction, providing a web-based application suitable for capturing the whole
computation process: developing, documenting, and executing code, as well as
communicating the results. A notebook kernel is a "computational engine" that executes
the code contained in a Notebook document. The IPython kernel, referenced in this guide,
executes Python code.
The Jupyter notebook combines two components:
o A web application: a browser-based tool for interactive authoring of documents
which combine explanatory text, mathematics, computations and their rich media
output.
o Notebook documents: a representation of all content visible in the web application,
including inputs and outputs of the computations, explanatory text, mathematics, images,
and rich media representations of objects.
PyCharm IDE
Google COLAB
Colaboratory (Colab) is a tool for machine learning education and research. It's a Jupyter
notebook environment that requires no setup to use. Google Colab is a free cloud service that
also offers free GPU access, and you can use it to improve your Python coding
skills. To start working with Colab, first log in to your Google account, then go to
https://colab.research.google.com.
There are three main measures of central tendency: the mode, the median and the mean. Each
of these measures describes a different indication of the typical or central value in the
distribution.
1. Mean
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average. Consider this retirement age
distribution:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of
observations (11), which equals 56.6 years.
2. Median
The median is the middle value in a distribution when the values are arranged in ascending or
descending order.
In a distribution with an odd number of observations, the median value is the middle value.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Looking at the retirement age distribution (which has 11 observations), the median is the
middle value, which is 57 years.
When the distribution has an even number of observations, the median value is the mean of
the two middle values. In the following distribution,
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The middle two values are 56 and 57; therefore the median equals 56.5 years.
3. Mode
The mode is the most commonly occurring value in a distribution. Consider this dataset
showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
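The three measures above can be checked with Python's standard statistics module. This is a small sketch using the retirement age data from the text; the variable names are chosen here for illustration.

```python
import statistics

# Retirement ages from the text (11 observations)
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean_age = statistics.mean(ages)      # 623 / 11
median_age = statistics.median(ages)  # middle (6th) value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

# With an even number of observations the median is the mean
# of the two middle values.
ages_even = [52] + ages
median_even = statistics.median(ages_even)

print(round(mean_age, 1), median_age, mode_age, median_even)
```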
Measure of Dispersion
Measures of spread describe how similar or varied the set of observed values are for a
particular variable (data item). Measures of spread include the range, quartiles and the
interquartile range, variance and standard deviation.
The spread of the values can be measured for quantitative data, as the variables are numeric
and can be arranged into a logical order with a low end value and a high end value.
The variance and the standard deviation are measures of the spread of the data around the
mean. They summarise how close each observed data value is to the mean value.
In datasets with a small spread all values are very close to the mean, resulting in a small
variance and standard deviation. Where a dataset is more dispersed, values are spread further
away from the mean, leading to a larger variance and standard deviation.
The smaller the variance and standard deviation, the more the mean value is indicative of the
whole dataset. In the extreme case where all values of a dataset are the same, the standard
deviation and variance are zero.
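To make the definition concrete, the population variance can be computed directly as the average squared distance from the mean. The sketch below is one possible implementation; the helper name `population_variance` is introduced here for illustration and is not part of any library.

```python
import math

def population_variance(values):
    """Average squared deviation of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
variance = population_variance(ages)
std_dev = math.sqrt(variance)  # standard deviation is the square root of the variance

# If every value is identical, there is no spread at all.
constant_variance = population_variance([57, 57, 57])

print(variance, std_dev, constant_variance)
```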
Experiment No: 2
Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
Python Libraries
There are many reasons why Python is popular among developers, and one of them is that it has
a very large collection of libraries that users can work with. In this experiment,
we will discuss the Python standard library and other libraries available for the Python
programming language: SciPy, NumPy, etc.
We know that a module is a file with some Python code, and a package is a directory of sub-
packages and modules. A Python library is a reusable chunk of code that you may want to
include in your programs or projects. Here, a 'library' loosely describes a collection of core
modules. Essentially, then, a library is a collection of modules. A package is a library that can
be installed using a package manager like pip or conda.
To display a list of all available modules, use the following command in the Python console:
>>> help('modules')
How to install additional Python libraries such as NumPy, SciPy, Pandas and Matplotlib
In an Anaconda environment, use this command:
conda install numpy scipy pandas matplotlib
The math module is a standard module in Python and is always available. To use
mathematical functions under this module, you have to import the module using import
math. It gives access to the underlying C library functions. This module does not support
complex datatypes. The cmath module is the complex counterpart.
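A brief sketch of the math module and its complex counterpart cmath. The particular values are chosen here only for illustration.

```python
import math
import cmath

root = math.sqrt(16)        # square root of a real number
area = math.pi * 2 ** 2     # area of a circle with radius 2
log_8 = math.log(8, 2)      # logarithm of 8 to base 2

# math works on real numbers only, so a negative argument raises ValueError.
try:
    math.sqrt(-1)
    real_sqrt_ok = True
except ValueError:
    real_sqrt_ok = False

# cmath handles the same case by returning a complex result.
complex_root = cmath.sqrt(-1)

print(root, area, log_8, real_sqrt_ok, complex_root)
```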
The statistics module provides functions for calculating mathematical statistics of numeric
(real-valued) data. It comes with very useful functions such as mean, median,
mode, standard deviation, and variance.
The four functions we'll use in this experiment are common in statistics:
1. mean - average value
2. median - middle value
3. mode - most frequent value
4. standard deviation - spread of values
Measures of spread
These functions calculate a measure of how much the population or sample tends to
deviate from the typical or average values.
pstdev()     Population standard deviation of data.
pvariance()  Population variance of data.
stdev()      Sample standard deviation of data.
variance()   Sample variance of data.
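The four functions listed above can be compared on a small data set. In this sketch the data values are made up for illustration; the population versions divide by n, while the sample versions divide by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

pop_var = statistics.pvariance(data)   # sum of squared deviations / n
pop_std = statistics.pstdev(data)      # square root of the population variance
samp_var = statistics.variance(data)   # sum of squared deviations / (n - 1)
samp_std = statistics.stdev(data)      # square root of the sample variance

print(pop_var, pop_std, samp_var, samp_std)
```

The sample statistics are slightly larger than the population statistics because of the smaller divisor.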
NumPy is the fundamental package for scientific computing with Python. It contains
among other things:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary data-types can be defined. This allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
The SciPy library is one of the core packages that make up the SciPy stack. It provides
many user-friendly and efficient numerical routines such as routines for numerical
integration, interpolation, optimization, linear algebra and statistics.
NumPy is an open source library available in Python that aids in mathematical, scientific,
engineering, and data science programming. NumPy is a very useful library for performing
mathematical and statistical operations. It works well for multi-dimensional arrays
and matrix multiplication.
For any scientific project, NumPy is the tool to know. It has been built to work with
N-dimensional arrays, linear algebra, random numbers, Fourier transforms, etc. It can be
integrated with C/C++ and Fortran.
NumPy is a Python library that deals with multi-dimensional arrays and matrices.
On top of the arrays and matrices, NumPy supports a large number of mathematical
operations.
NumPy is memory efficient, meaning it can handle large amounts of data more easily
than plain Python lists. Besides, NumPy is very convenient to work with, especially for matrix
multiplication and reshaping. On top of that, NumPy is fast. In fact, TensorFlow and
scikit-learn use NumPy arrays to compute matrix multiplications in the back end.
A NumPy array is a table of elements (usually numbers), all of the same type, indexed by a tuple of
non-negative integers.
In NumPy, dimensions are called axes. The number of axes is the rank.
NumPy's array class is called ndarray. It is also known by the alias array.
We use python numpy array instead of a list because of the below three reasons:
1. Less Memory
2. Fast
3. Convenient
Numpy Functions
NumPy arrays carry attributes around with them. The most important ones are:
ndim: The number of axes or rank of the array
shape: A tuple containing the length in each dimension
size: The total number of elements
import numpy
x = numpy.array([[1,2,3], [4,5,6], [7,8,9]])  # 3x3 matrix
print(x.ndim)   # Prints 2
print(x.shape)  # Prints (3, 3)
print(x.size)   # Prints 9
Built-in Methods
Many standard numerical functions are available as methods out of the box:
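import numpy
x = numpy.array([1,2,3,4,5])
avg = x.mean()      # arithmetic mean of the elements
total = x.sum()     # named total to avoid shadowing the built-in sum()
sx = numpy.sin(x)   # element-wise sine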
Here is a list of some useful NumPy functions and methods names ordered in categories.
Array Creation
arange, array, copy, empty, empty_like, eye, fromfile, fromfunction, identity, linspace,
logspace, mgrid, ogrid, ones, ones_like, r_, zeros, zeros_like
Conversions
ndarray.astype, atleast_1d, atleast_2d, atleast_3d, mat
Manipulations
array_split, column_stack, concatenate, diagonal, dsplit, dstack, hsplit, hstack,
ndarray.item, newaxis, ravel, repeat, reshape, resize, squeeze, swapaxes, take,
transpose, vsplit, vstack
Questions
all, any, nonzero, where
Ordering
argmax, argmin, argsort, max, min, ptp, searchsorted, sort
Operations
choose, compress, cumprod, cumsum, inner, ndarray.fill, imag, prod, put, putmask, real,
sum
Basic Statistics
cov, mean, std, var
Basic Linear Algebra
cross, dot, outer, linalg.svd, vdot
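A few of the functions listed above, one or two per category, can be tried out as follows. This is only a small sampler with values chosen for illustration, not a complete tour of the API.

```python
import numpy as np

a = np.arange(6)                             # array creation: [0 1 2 3 4 5]
m = a.reshape(2, 3)                          # manipulation: 2 rows, 3 columns
stacked = np.vstack([m, m])                  # manipulation: stack rows vertically
evens = np.where(a % 2 == 0)[0]              # question: indices of even elements
order = np.argsort(np.array([30, 10, 20]))   # ordering: indices that would sort the array
running = np.cumsum(a)                       # operation: cumulative sum
product = np.dot(m, m.T)                     # linear algebra: (2x3) times (3x2)

print(m)
print(stacked.shape, evens, order)
print(running)
print(product)
```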
SciPy contains a variety of sub-packages which help to solve the most common
problems in scientific computation.
SciPy is one of the most used scientific libraries, second only to the GNU Scientific Library
for C/C++ or MATLAB's.
It is easy to use and understand, and it is computationally fast.
It can operate on arrays from the NumPy library.
NumPy vs SciPy
NumPy:
1. NumPy is written in C and is used for mathematical or numeric calculation.
2. Because its core is compiled C code, it is typically faster than equivalent pure-Python code.
3. NumPy is the most useful library for Data Science to perform basic calculations.
4. NumPy mainly provides the array data type, which supports basic operations like
sorting, shaping, indexing, etc.
SciPy:
1. SciPy is built on top of NumPy.
2. SciPy contains a fully-featured linear algebra module, while NumPy contains only a few
linear algebra features.
3. Most new Data Science features are available in SciPy rather than NumPy.
Example,
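The sketch below is one possible illustration of SciPy routines operating on NumPy arrays, using `scipy.linalg` for linear algebra and `scipy.integrate` for numerical integration. The matrix `A`, the vector `b` and the integrand are made-up values chosen for demonstration.

```python
import numpy as np
from scipy import integrate, linalg

# Solve the linear system A x = b:
#   3x + y = 9
#    x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
det = linalg.det(A)

# Numerically integrate sin(t) from 0 to pi (the exact value is 2).
area, abs_error = integrate.quad(np.sin, 0, np.pi)

print(x, det, area)
```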
Experiment No: 3
Thus, using Pandas helps to shorten the procedure of handling data. With the time
saved, we can focus more on data analysis algorithms.
3. An extensive set of features
Pandas is really powerful. It provides you with a huge set of important commands
and features which are used to easily analyze your data. We can use Pandas to perform
various tasks like filtering your data according to certain conditions, or segmenting and
segregating the data according to preference, etc.
4. Efficiently handles large data
Wes McKinney, the creator of Pandas, made the Python library mainly to handle large
datasets efficiently. Pandas helps to save a lot of time by importing large amounts of
data very fast.
5. Makes data flexible and customizable
Pandas provides a huge feature set to apply to the data you have so that you can
customize, edit and pivot it as you wish. This helps to get
the most out of your data.
6. Made for Python
Python has become one of the most sought-after programming languages
in the world, with its extensive set of features and the sheer amount of productivity
it provides. Therefore, being able to use Pandas in Python enables you to tap into the
power of the various other features and libraries which are used with Python. Some of
these libraries are NumPy, SciPy, Matplotlib, etc.
Pandas Library
The primary two components of pandas are the Series and DataFrame.
A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of
a collection of Series.
DataFrames and Series are quite similar in that many operations that you can do with one
you can do with the other, such as filling in null values and calculating the mean.
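The relationship between a Series and a DataFrame can be sketched as follows. The fruit purchase data below is hypothetical and only serves to show that operations such as filling null values and calculating the mean work on both types.

```python
import pandas as pd

# Two Series (columns); the apples column has one missing value.
apples = pd.Series([3, 2, None, 1], name='apples')
oranges = pd.Series([0, 3, 7, 2], name='oranges')

# A DataFrame is a table made up of a collection of Series.
purchases = pd.DataFrame({'apples': apples, 'oranges': oranges})

filled = purchases.fillna(0)            # fill null values in the whole DataFrame
series_mean = filled['apples'].mean()   # mean of one Series gives a single number
frame_mean = filled.mean()              # mean of a DataFrame gives one value per column

print(purchases)
print(series_mean)
print(frame_mean)
```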
With CSV files all you need is a single line to load in the data:
import pandas as pd
df = pd.read_csv('purchases.csv')
df
Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):
movies_df.shape
Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So
we have 1000 rows and 11 columns in our movies DataFrame.
You'll be going to .shape a lot when cleaning and transforming data. For example, you
might filter some rows based on some criteria and then want to know quickly how many
rows were removed.
Handling duplicates
This dataset does not have duplicate rows, but it is always important to verify you
aren't aggregating duplicate rows.
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape
Out:
(2000, 11)
Using pd.concat() returns a copy without affecting the original DataFrame (the older
DataFrame.append() method was removed in pandas 2.0). We are
capturing this copy in temp_df so we aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled.
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
The drop_duplicates() method also returns a copy of your DataFrame,
but this time with duplicates removed. Calling .shape confirms we're back to the 1000
rows of our original dataset.
It's a little verbose to keep assigning DataFrames to the same variable like in this
example. For this reason, pandas has the inplace keyword argument on many of its
methods. Using inplace=True will modify the DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Another important argument for drop_duplicates() is keep, which has three possible
options:
first: (default) drop duplicates except for the first occurrence.
last: drop duplicates except for the last occurrence.
False: drop all duplicates.
Since we didn't define the keep argument in the previous example it defaulted to
first. This means that if two rows are the same pandas will drop the second row and
keep the first row. Using last has the opposite effect: the first row is dropped.
keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will
be dropped. Watch what happens to temp_df:
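temp_df = pd.concat([movies_df, movies_df])  # start again from a copy in which every row is duplicated
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape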
Out:
(0, 11)
Since all rows were duplicates, keep=False dropped them all resulting in zero rows being
left over. If you're wondering why you would want to do this, one reason is that it allows
you to locate all duplicates in your dataset. When conditional selections are shown below
you'll see how to do that.
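The effect of the three keep options can be seen on a tiny DataFrame. The titles and years below are made up for illustration and are not taken from the movies dataset.

```python
import pandas as pd

# Rows 0 and 2 are identical.
df = pd.DataFrame({
    'title': ['A', 'B', 'A', 'C'],
    'year': [2001, 2002, 2001, 2003],
})

keep_first = df.drop_duplicates(keep='first')  # drops row 2
keep_last = df.drop_duplicates(keep='last')    # drops row 0
keep_none = df.drop_duplicates(keep=False)     # drops rows 0 and 2

# duplicated(keep=False) marks every row that has a duplicate,
# which is one way to locate all duplicates in a dataset.
all_duplicates = df[df.duplicated(keep=False)]

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
print(all_duplicates)
```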
Column cleanup
Many times datasets will have verbose column names with symbols, upper and
lowercase words, spaces, and typos. To make selecting data by column name easier we
can spend a little time cleaning up their names.
Here's how to print the column names of our dataset:
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime (Minutes)',
'Rating', 'Votes', 'Revenue (Millions)', 'Metascore'],
dtype='object')
Not only does .columns come in handy if you want to rename columns by allowing for
simple copy and paste, it's also useful if you need to understand why you are receiving
a KeyError when selecting data by column.
We can use the .rename() method to rename certain or all columns via a dict. We don't
want parentheses, so let's rename those:
movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
dtype='object')
Excellent. But what if we want to lowercase all names? Instead of using .rename() we
could also set a list of names to the columns like so:
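movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
                     'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns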
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'], dtype='object')
But that's too much work. Instead of renaming each column manually, we can use a
list comprehension over the existing column names.
List (and dict) comprehensions come in handy a lot when working with pandas and
data in general.
It's a good idea to lowercase, remove special characters, and replace spaces with
underscores if you'll be working with a dataset for some time.
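A minimal sketch of this kind of column cleanup with list comprehensions. The empty DataFrame and its three column names are stand-ins chosen for illustration; the same assignments would apply to a real dataset.

```python
import pandas as pd

# Hypothetical DataFrame with verbose column names (no rows needed here).
df = pd.DataFrame(columns=['Rank', 'Runtime (Minutes)', 'Revenue (Millions)'])

# Lowercase every column name in one step.
df.columns = [col.lower() for col in df.columns]

# Replace spaces with underscores and remove the parentheses.
df.columns = [
    col.replace(' ', '_').replace('(', '').replace(')', '')
    for col in df.columns
]

print(list(df.columns))
```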
Let's calculate the total number of nulls in each column of our dataset. The first step is
to check which cells in our DataFrame are null:
movies_df.isnull()
Notice isnull() returns a DataFrame where each cell is either True or False depending
on that cell's null status.
To count the number of nulls in each column we use an aggregate function for summing:
movies_df.isnull().sum()
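The two steps can be tried on a small, made-up DataFrame. The column names echo the movies dataset but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'revenue_millions': [10.5, np.nan, 3.2, np.nan],
    'metascore': [76.0, 65.0, np.nan, 59.0],
})

null_mask = df.isnull()        # DataFrame of True/False, one entry per cell
null_counts = null_mask.sum()  # True counts as 1, so the sum is the number of nulls per column

print(null_mask)
print(null_counts)
```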
Up until now we've focused on some basic summaries of our data. We've learned
about simple column extraction using single brackets, and we imputed null values in a
column using fillna(). Below are the other methods of slicing, selecting, and extracting
you'll need to use constantly.
It's important to note that, although many methods are the same, DataFrames and
Series have different attributes, so you'll need to be sure to know which type you are
working with or else you will receive attribute errors.
Let's look at working with columns first.
By column
You already saw how to extract a column using square brackets like this:
genre_col = movies_df['genre']
type(genre_col)
To make the necessary statistical inferences, it becomes necessary to visualize your data, and
Matplotlib is one such solution for Python users. It is a very powerful plotting library
useful for those working with Python and NumPy. The most used module of Matplotlib is
Pyplot, which provides a MATLAB-like interface but uses Python and is
open source.
General Concepts
Matplotlib Library
Pyplot is a module of Matplotlib which provides simple functions to add plot elements
like lines, images, text, etc. to the current axes in the current figure.
plot(x-axis values, y-axis values): plots a simple line graph with x-axis
values against y-axis values
show(): displays the graph
title("string"): sets the title of the plot as specified by the string
xlabel("string"): sets the label for the x-axis as specified by the string
ylabel("string"): sets the label for the y-axis as specified by the string
figure(): used to control figure-level attributes
subplot(nrows, ncols, index): adds a subplot to the current figure
suptitle("string"): adds a common title to the figure as specified by the string
subplots(nrows, ncols, figsize): a convenient way to create subplots in a single
call. It returns a tuple of a figure and an axes object (or an array of axes).
set_title("string"): an axes-level method used to set the title of a subplot in a figure
bar(categorical variables, values, color): used to create vertical bar graphs
barh(categorical variables, values, color): used to create horizontal bar graphs
legend(loc): used to add a legend to the graph
xticks(index, categorical variables): gets or sets the current tick locations and
labels of the x-axis
pie(value, categorical variables): used to create a pie chart
hist(values, number of bins): used to create a histogram
xlim(start value, end value): used to set the limits of the x-axis
ylim(start value, end value): used to set the limits of the y-axis
scatter(x-axis values, y-axis values): plots a scatter plot with x-axis values
against y-axis values
axes(): adds an axes to the current figure
set_xlabel("string"): axes-level method used to set the x-label of the plot
specified as a string
set_ylabel("string"): axes-level method used to set the y-label of the plot
specified as a string
scatter3D(x-axis values, y-axis values, z-axis values): plots a three-dimensional
scatter plot of the given x, y and z values
plot3D(x-axis values, y-axis values, z-axis values): plots a three-dimensional line
graph of the given x, y and z values
Here we import Matplotlib's Pyplot module and the NumPy library, as most of the data that
we will be working with will be in the form of arrays.
We pass two arrays as our input arguments to Pyplot's plot() method and use the show()
method to display the plot. Note that the first array appears on the x-axis and the
second array appears on the y-axis of the plot. Now that our first plot is ready, let us add
the title, and name the x-axis and y-axis using the methods title(), xlabel() and ylabel()
respectively.
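The steps described above can be sketched as follows. The data values and label strings are chosen for illustration, and the Agg backend is selected only as an assumption so that the sketch also runs on a machine without a display; remove that line to open an interactive window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])
y = x ** 2

plt.figure()
plt.plot(x, y)               # first array on the x-axis, second on the y-axis
plt.title("First Plot")
plt.xlabel("x values")
plt.ylabel("y = x squared")
ax = plt.gca()               # keep a handle on the current axes for inspection
plt.show()
```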
We can also specify the size of the figure using the figure() method, passing the width
and height of the figure in inches as a tuple to the figsize argument.
With every X and Y argument, you can also pass an optional third argument in the form
of a string which indicates the colour and line type of the plot. The default format is b-,
which means a solid blue line. For example, go means green circles.
Likewise, we can make many such combinations to format our plot.
We can also plot multiple sets of data by passing in multiple sets of arguments of X and Y
axis in the plot() method as shown.
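The figure size, format strings and multiple data sets described above can be combined in one short sketch. The data and the chosen formats ("go" for green circles, "r^" for red triangles) are illustrative, and the Agg backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])

fig = plt.figure(figsize=(8, 4))  # 8 inches wide, 4 inches high

# Two (x, y, format) triples in one call draw two data sets on the same axes.
lines = plt.plot(x, x ** 2, "go", x, x ** 3, "r^")
plt.title("Squares and cubes")
plt.show()
```

plot() returns one line object per data set, which can be used later to adjust each series individually.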
We can use the subplot() method to add more than one plot to a figure. For example, we
can use this method to separate two graphs which were plotted on the same axes in
the previous example. The subplot() method takes three arguments: nrows, ncols
and index. They indicate the number of rows, the number of columns and the index number
of the sub-plot. For instance, if we want to create two sub-plots in one
figure such that they sit in one row and two columns, we pass the arguments
(1,2,1) and (1,2,2) to the subplot() method. Note that the title() method is called
separately for each of the subplots. We use the suptitle() method to make a centralized title for
the figure.
If we want our sub-plots in two rows and a single column, we can pass the arguments (2,1,1)
and (2,1,2).
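A sketch of the one-row, two-column layout described above. The data and title strings are made up for illustration, and the Agg backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])

fig = plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)          # 1 row, 2 columns, first sub-plot
plt.plot(x, x ** 2, "go")
plt.title("Squares")

plt.subplot(1, 2, 2)          # 1 row, 2 columns, second sub-plot
plt.plot(x, x ** 3, "r^")
plt.title("Cubes")

plt.suptitle("Two sub-plots in one row")  # common title for the whole figure
plt.show()
```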
The above way of creating subplots becomes a bit tedious when we want many subplots
in our figure. A more convenient way is to use the subplots() method. Notice the extra
's' that distinguishes the two method names. This method takes two arguments, nrows and ncols,
the number of rows and number of columns respectively. This method creates two objects, figure and
axes, which we store in variables fig and ax and which can be used to change the figure-level and
axes-level attributes respectively. Note that these variable names are chosen arbitrarily.
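A sketch of the subplots() approach with a 2 x 2 grid. The plotted functions and titles are illustrative, and the Agg backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])

# One call creates the figure and a 2 x 2 array of axes.
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(6, 6))

ax[0, 0].plot(x, x, "b-")
ax[0, 0].set_title("Linear")
ax[0, 1].plot(x, x ** 2, "go")
ax[0, 1].set_title("Squares")
ax[1, 0].plot(x, x ** 3, "r^")
ax[1, 0].set_title("Cubes")
ax[1, 1].plot(x, np.sqrt(x), "k--")
ax[1, 1].set_title("Square roots")

fig.suptitle("Four sub-plots created with subplots()")
plt.show()
```

Each sub-plot is addressed by its row and column index in `ax`, and axes-level methods such as set_title() are called on that element.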
Experiment No: 4
The library is built upon SciPy (Scientific Python), which must be installed before
you can use scikit-learn, together with the rest of the SciPy stack.
Extensions or modules for SciPy are conventionally named SciKits. As such, the module
provides learning algorithms and is named scikit-learn.
The vision for the library is a level of robustness and support required for use in
production systems. This means a deep focus on concerns such as ease of use, code
quality, collaboration, documentation and performance.
The library is focused on modeling data. It is not focused on loading, manipulating and
summarizing data. For these features, refer to NumPy and Pandas.
Multiple linear regression attempts to model the relationship between two or more
features and a response by fitting a linear equation to observed data. The steps to
perform multiple linear regression are almost the same as those for simple linear regression.
The difference lies in the evaluation. We can use it to find out which factor has the
highest impact on the predicted output and how different variables relate to each other.
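The idea of fitting a linear equation with two features can be sketched with ordinary least squares. To keep the example dependency-light it uses NumPy's `np.linalg.lstsq` rather than scikit-learn; scikit-learn's `LinearRegression` fits the same kind of model. The data is synthetic and noise-free, generated from y = 1 + 2*x1 + 3*x2, so the fit should recover those coefficients.

```python
import numpy as np

# Hypothetical observations: two features per row.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]  # synthetic response without noise

# Add a column of ones so the model can learn an intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ coeffs = y
coeffs, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
intercept, b1, b2 = coeffs

# Predict the response for a new sample with x1 = 6 and x2 = 7.
prediction = np.array([1.0, 6.0, 7.0]) @ coeffs

print(intercept, b1, b2, prediction)
```

With standardized features, comparing the magnitudes of the fitted coefficients gives a rough indication of which factor has the larger impact on the prediction.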
First assumption
Linear regression needs the relationship between the independent and dependent variables
to be linear. It is also important to check for outliers since linear regression is sensitive to
outlier effects. The linearity assumption can best be tested with scatter plots, the
following two examples depict two cases, where no and little linearity is present.
Checking for collinearity helps you get rid of variables that are skewing your data by
having a significant relationship with another variable.
Correlation describes the relationship between two variables. If they
are extremely correlated, then they are collinear.
Autocorrelation occurs when a variable‘s data affects another instance of that same
variable (same column, different row). Linear regression only works if there is little or no
autocorrelation in the dataset, and each instance is independent of each other. If instances
are autocorrelated then your residuals are not independent from each other, and will show
a pattern.
The simplest method to detect collinearity would be to plot it out in graphs or to view a
correlation matrix to check out pairwise correlation (correlation between 2 variables). If
you have two variables that are highly correlated, your best course of action is to just
remove one of them.
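The correlation-matrix check described above can be sketched with NumPy as follows. The three feature columns, the 0.9 cut-off and the "keep the first, drop the second" rule are assumptions made for this example.

```python
import numpy as np

# Hypothetical feature columns: x2 is almost a multiple of x1,
# while x3 is only weakly related to either.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2.0 * x1 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 2.0])
names = ["x1", "x2", "x3"]

# np.corrcoef treats each row as one variable and returns the
# matrix of pairwise Pearson correlations.
corr = np.corrcoef(np.vstack([x1, x2, x3]))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = set()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > threshold and names[i] not in to_drop:
            to_drop.add(names[j])  # keep the first of the pair, drop the second

print(np.round(corr, 2))
print("drop:", sorted(to_drop))
```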
Why is multicollinearity an issue with regression? Well, the regression equation is the
best-fit line to represent the effects of your predictors and the dependent variable, and
aligned. Data sets with values of r close to zero show little to no straight-line relationship.
Experiment No: 5
DESCRIPTION:
Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
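import numpy as np
import math
from data_loader import read_data


class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""


# The header and set-up lines are missing from the printed listing; they are
# reconstructed from the call subtables(data, split, delete=True) below.
def subtables(data, col, delete):
    """Split data into one sub-table per distinct value in column col."""
    tables = {}
    items = np.unique(data[:, col])
    count = np.zeros(items.shape[0], dtype=np.int32)

    for x in range(items.shape[0]):
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                count[x] += 1

    for x in range(items.shape[0]):
        tables[items[x]] = np.empty((int(count[x]), data.shape[1]), dtype="|S32")
        pos = 0
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                tables[items[x]][pos] = data[y]
                pos += 1
        if delete:
            tables[items[x]] = np.delete(tables[items[x]], col, 1)

    return items, tables


def entropy(S):
    """Shannon entropy (base 2) of the class labels in S."""
    items = np.unique(S)
    if items.size == 1:
        return 0

    # The loop body is missing from the printed listing; the standard
    # definition -sum(p * log2(p)) is used here.
    sums = 0.0
    for x in range(items.shape[0]):
        p = np.sum(S == items[x]) / (S.size * 1.0)
        sums -= p * math.log(p, 2)
    return sums


# The function header is missing from the printed listing; the name
# gain_ratio and the definitions of total_entropy and iv are assumed.
def gain_ratio(data, col):
    """Information gain of column col divided by its intrinsic value."""
    items, tables = subtables(data, col, delete=False)

    total_size = data.shape[0]
    entropies = np.zeros(items.shape[0])
    intrinsic = np.zeros(items.shape[0])
    for x in range(items.shape[0]):
        ratio = tables[items[x]].shape[0] / (total_size * 1.0)
        entropies[x] = ratio * entropy(tables[items[x]][:, -1])
        intrinsic[x] = ratio * math.log(ratio, 2)

    total_entropy = entropy(data[:, -1])
    iv = -1 * np.sum(intrinsic)

    for x in range(entropies.shape[0]):
        total_entropy -= entropies[x]

    if iv == 0:
        return 0.0  # a single-valued attribute cannot split the data
    return total_entropy / iv


def create_node(data, metadata):
    # All samples share one class: make a leaf.
    if (np.unique(data[:, -1])).shape[0] == 1:
        node = Node("")
        node.answer = np.unique(data[:, -1])[0]
        return node

    # The lines computing gains are missing from the printed listing; every
    # attribute column is scored with gain_ratio (assumed).
    gains = np.zeros(data.shape[1] - 1)
    for col in range(data.shape[1] - 1):
        gains[col] = gain_ratio(data, col)

    split = np.argmax(gains)

    node = Node(metadata[split])
    metadata = np.delete(metadata, split, 0)

    items, tables = subtables(data, split, delete=True)

    for x in range(items.shape[0]):
        child = create_node(tables[items[x]], metadata)
        node.children.append((items[x], child))

    return node


def empty(size):
    s = ""
    for x in range(size):
        s += "   "
    return s


# The header and leaf handling are missing from the printed listing (assumed).
def print_tree(node, level):
    if node.answer != "":
        print(empty(level), node.answer)
        return
    print(empty(level), node.attribute)
    for value, n in node.children:
        print(empty(level + 1), value)
        print_tree(n, level + 2)


# The driver is missing from the printed listing; the file name is assumed.
metadata, traindata = read_data("tennis.csv")
data = np.array(traindata)
node = create_node(data, metadata)
print_tree(node, 0)

# data_loader.py (the module imported above)
import csv


def read_data(filename):
    with open(filename, 'r') as csvfile:
        datareader = csv.reader(csvfile, delimiter=',')
        headers = next(datareader)
        metadata = []
        traindata = []
        for name in headers:
            metadata.append(name)
        for row in datareader:
            traindata.append(row)
    return (metadata, traindata)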
outlook,temperature,humidity,wind,answer
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no
Output
 outlook
    overcast
       b'yes'
    rain
       wind
          b'strong'
             b'no'
          b'weak'
             b'yes'
    sunny
       humidity
          b'high'
             b'no'
          b'normal'
             b'yes'
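The choice of outlook as the root in the output above can be checked by hand. The sketch below computes the entropy of the answer column and the plain information gain of outlook, using only the (outlook, answer) pairs copied from the data set above. Note that it uses plain information gain as in textbook ID3, whereas the listing divides by an intrinsic value; the function names here are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, col):
    """Entropy of the class column minus the weighted entropy after
    splitting the rows on the attribute in column `col`."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# (outlook, answer) pairs copied from the 14 rows of the data set.
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
        ("rain", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rain", "no")]

base = entropy([r[-1] for r in rows])
gain_outlook = information_gain(rows, 0)
print(round(base, 3))          # 0.94
print(round(gain_outlook, 3))  # 0.247
```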
Experiment No: 6
Write a program to implement the naïve Bayesian classifier for a sample training
data set stored as a .CSV file. Compute the accuracy of the classifier, considering
few test data sets.
import csv
import math


def loadCsv(filename):
    with open(filename, "r") as f:
        dataset = list(csv.reader(f))
    for i in range(len(dataset)):
        # converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
def separateByClass(dataset):
    separated = {}
    # creates a dictionary of classes 1 and 0 where the values are the
    # instances belonging to each class
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated
def mean(numbers):
    return sum(numbers)/float(len(numbers))


def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg, 2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)


def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]  # drop the summary of the class column
    return summaries


def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        # summaries is a dict of (mean, std) tuples for each class value
        summaries[classValue] = summarize(instances)
    return summaries
def main():
    filename = '5data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    # The split step is missing from the printed listing; a helper named
    # splitDataset is assumed here.
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy of the classifier is : {0}%'.format(accuracy))


main()
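The listing above calls getPredictions and getAccuracy and uses a training/test split, but the definitions of these helpers are not shown. The following is a minimal sketch of how such helpers could look for a Gaussian naive Bayes model built from per-class (mean, stdev) summaries. The names splitDataset, calculateProbability and predict, and the toy summaries at the end, are assumptions for illustration and not the manual's own code. Class priors are ignored, which amounts to assuming equal priors.

```python
import math
import random


def splitDataset(dataset, splitRatio):
    """Shuffle a copy of the data and split it into (train, test)."""
    trainSize = int(len(dataset) * splitRatio)
    copy = list(dataset)
    random.shuffle(copy)
    return copy[:trainSize], copy[trainSize:]


def calculateProbability(x, mean, stdev):
    """Gaussian probability density of x for the given mean and stdev."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


def predict(summaries, inputVector):
    """Return the class whose per-attribute densities have the largest product."""
    bestLabel, bestProb = None, -1.0
    for classValue, classSummaries in summaries.items():
        prob = 1.0
        for i, (mean, stdev) in enumerate(classSummaries):
            prob *= calculateProbability(inputVector[i], mean, stdev)
        if prob > bestProb:
            bestLabel, bestProb = classValue, prob
    return bestLabel


def getPredictions(summaries, testSet):
    return [predict(summaries, row) for row in testSet]


def getAccuracy(testSet, predictions):
    correct = sum(1 for row, p in zip(testSet, predictions) if row[-1] == p)
    return correct / float(len(testSet)) * 100.0


# Hypothetical summaries: one attribute, two classes (0.0 and 1.0).
summaries = {0.0: [(1.0, 0.5)], 1.0: [(5.0, 0.5)]}
# Each test row is [attribute, actual class]; the last row is closer to
# class 0.0 than to its actual class, so it is misclassified.
testSet = [[1.2, 0.0], [4.8, 1.0], [0.7, 0.0], [2.9, 1.0]]
predictions = getPredictions(summaries, testSet)
accuracy = getAccuracy(testSet, predictions)
print(predictions, accuracy)

train, test = splitDataset(list(range(10)), 0.67)
print(len(train), len(test))
```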
Output
Confusion matrix is as follows
[[17  0  0]
 [ 0 17  0]
 [ 0  0 11]]
Accuracy metrics
precision recall f1-score support
Experiment No: 7
# The classes used below come from the pomegranate library (pre-1.0 API).
# The definitions of asia, tuberculosis, smoking and lung are not shown in
# this listing.
bronchitis = ConditionalProbabilityTable(
    [['True', 'True', 0.92],
     ['True', 'False', 0.08],
     # note: 0.03 + 0.98 does not sum to 1; values kept as printed
     ['False', 'True', 0.03],
     ['False', 'False', 0.98]], [smoking])
tuberculosis_or_cancer = ConditionalProbabilityTable(
    [['True', 'True', 'True', 1.0],
     ['True', 'True', 'False', 0.0],
     ['True', 'False', 'True', 1.0],
     ['True', 'False', 'False', 0.0],
     ['False', 'True', 'True', 1.0],
     ['False', 'True', 'False', 0.0],
     ['False', 'False', 'True', 1.0],
     ['False', 'False', 'False', 0.0]], [tuberculosis, lung])
xray = ConditionalProbabilityTable(
    [['True', 'True', 0.885],
     ['True', 'False', 0.115],
     ['False', 'True', 0.04],
     ['False', 'False', 0.96]], [tuberculosis_or_cancer])
dyspnea = ConditionalProbabilityTable(
    [['True', 'True', 'True', 0.96],
     ['True', 'True', 'False', 0.04],
     ['True', 'False', 'True', 0.89],
     ['True', 'False', 'False', 0.11],
     ['False', 'True', 'True', 0.96],
     ['False', 'True', 'False', 0.04],
     ['False', 'False', 'True', 0.89],
     ['False', 'False', 'False', 0.11]], [tuberculosis_or_cancer, bronchitis])
s0 = State(asia, name="asia")
s1 = State(tuberculosis, name="tuberculosis")
s2 = State(smoking, name="smoker")
network = BayesianNetwork("asia")
network.add_nodes(s0, s1, s2)
network.add_edge(s0, s1)
network.add_edge(s1, s2)
network.bake()
print(network.predict_proba({'tuberculosis': 'True'}))
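To see what a query on a network like this computes, the posterior for a single parent and child can be worked out by hand with Bayes' rule. The sketch below uses the two X-ray probabilities from the Xray table above; the prior probability is a made-up value for illustration only and does not come from the manual.

```python
# P(xray = True | tuberculosis_or_cancer), taken from the Xray table above.
p_xray_given_disease = 0.885
p_xray_given_healthy = 0.04

# ASSUMPTION: prior probability of tuberculosis_or_cancer being True.
# This value is hypothetical; the listing does not give it.
p_disease = 0.05

# Total probability of a positive X-ray.
p_xray = (p_xray_given_disease * p_disease
          + p_xray_given_healthy * (1.0 - p_disease))

# Bayes' rule: P(tuberculosis_or_cancer = True | xray = True).
posterior = p_xray_given_disease * p_disease / p_xray
print(round(posterior, 3))  # 0.538
```

With this assumed prior, a positive X-ray raises the probability from 5% to roughly 54%, which is the kind of update that predict_proba performs across the whole network.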
Experiment No: 8
import operator


def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # dict.iteritems() exists only in Python 2; items() works in Python 3
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]


def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('knndat.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')


main()
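The program above calls euclideanDistance and loadDataset, whose definitions are not shown in the listing. A minimal sketch of how they could be written is given below, assuming a CSV file in which every column except the last is numeric and the last column is the class label. The bodies and the small sample file are assumptions for illustration.

```python
import csv
import math
import os
import random
import tempfile


def euclideanDistance(instance1, instance2, length):
    """Euclidean distance over the first `length` attributes."""
    return math.sqrt(sum((instance1[i] - instance2[i]) ** 2
                         for i in range(length)))


def loadDataset(filename, split, trainingSet, testSet):
    """Read a CSV file, convert all but the last column to float, and
    append each row to trainingSet with probability `split`, else to testSet."""
    with open(filename, "r", newline="") as csvfile:
        for row in csv.reader(csvfile):
            if not row:
                continue  # skip blank lines
            instance = [float(v) for v in row[:-1]] + [row[-1]]
            if random.random() < split:
                trainingSet.append(instance)
            else:
                testSet.append(instance)


# Usage with a small hypothetical file in the iris layout.
sample = ("5.1,3.5,1.4,0.2,setosa\n"
          "7.0,3.2,4.7,1.4,versicolor\n"
          "6.3,3.3,6.0,2.5,virginica\n")
with tempfile.NamedTemporaryFile("w", suffix=".data", delete=False) as f:
    f.write(sample)
    path = f.name

train, test = [], []
loadDataset(path, 0.67, train, test)
os.remove(path)
print(len(train) + len(test))                          # 3
print(euclideanDistance([0, 0, "a"], [3, 4, "b"], 2))  # 5.0
```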
OUTPUT
Confusion matrix is as follows
[[11  0  0]
 [ 0  9  1]
 [ 0  1  8]]
Accuracy metrics
   precision  recall  f1-score  support
0       1.00    1.00      1.00       11
1       0.90    0.90      0.90       10
2       0.89    0.89      0.89        9
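The per-class figures above follow directly from the confusion matrix. The sketch below recomputes precision, recall, f1-score and support in plain Python, assuming the usual convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted.
cm = [[11, 0, 0],
      [0, 9, 1],
      [0, 1, 8]]

n = len(cm)
total = sum(sum(row) for row in cm)
# Accuracy is the share of samples on the diagonal.
accuracy = sum(cm[i][i] for i in range(n)) / total

report = []
for c in range(n):
    tp = cm[c][c]
    predicted = sum(cm[r][c] for r in range(n))  # column sum
    actual = sum(cm[c])                          # row sum (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    report.append((c, precision, recall, f1, actual))
    print(c, round(precision, 2), round(recall, 2), round(f1, 2), actual)

print("accuracy:", round(accuracy, 3))
```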