Artificial Intelligence - Handbook
Artificial Intelligence
Disclaimer: The content is curated for educational purposes only.
© Edunet Foundation. All rights reserved.
Table of Contents
Table of Contents ................................................................................................... 2
Learning Outcome .................................................................................................. 4
Chapter 1: What is AI and ML? ............................................................................... 5
1.1 Introduction to Artificial Intelligence .................................................................. 5
1.2 Artificial Intelligence Market Trends ................................................................ 13
1.3 Introduction to Machine Learning.................................................................... 18
1.4 PRACTICALS: Microsoft AI Demos ................................................................ 21
Chapter 2: Data Analysis with Python .................................................................. 37
2.1 What is Data Analysis? ................................................................................... 37
2.2 Anaconda Software and Introduction to Python .............................................. 40
2.3 Python Libraries .............................................................................................. 67
2.4 NumPy Library ................................................................................................ 72
Chapter 3: Data Analysis with Pandas and Matplotlib ........................................ 83
3.1 What is Data Visualization? ............................................................................ 83
3.2 Plotting with Matplotlib .................................................................................... 87
3.3 Data Manipulation with Pandas ...................................................................... 98
3.4 Pandas Plotting............................................................................................. 126
Chapter 4: Building Machine Learning Models.................................................. 136
4.1 Machine Learning Basics.............................................................................. 136
4.2 Linear Regression......................................................................................... 142
4.3 Logistic Regression ...................................................................................... 159
4.4 Naïve Bayes Theorem .................................................................................. 166
4.5 Bag of Words Approach................................................................................ 168
Chapter 5: Building Deep Learning Models ....................................................... 175
5.1 Deep Learning Basics................................................................................... 175
5.2 Concepts of Neural Networks ....................................................................... 176
5.3 Computer Vision Basics................................................................................ 184
5.4 Convolutional Neural Networks..................................................................... 203
Reference ........................................................................................................... 211
It is the branch of computer science that aims to create intelligent machines. It has
become an essential part of the technology industry. Research associated with
artificial intelligence is highly technical and specialized. The core problems of
artificial intelligence include programming computers for certain traits such as:
Knowledge, Reasoning, Problem solving, Perception, Learning, Planning, and Ability
to manipulate and move objects. Knowledge engineering is a core part of AI
research. Machines can often act and react like humans only if they have abundant
information relating to the world. To implement knowledge engineering, artificial intelligence must have access to objects, categories, properties and the relations between them. Instilling common sense, reasoning and problem-solving ability in machines is a difficult and tedious task. Machine learning is another core part of AI.
Sense: This is one of the properties of AI. It not only identifies worthy materials and objects but also recognizes real-time operational activities. Sensors or sensing devices can detect and quickly differentiate between wrong and correct objects.
Reason: This property works like a human brain does to complete a task successfully: it understands, judges and prepares to execute. Reasoning enables AI to deal with internal and external properties of resources, such as condition, time frame, behaviour and other parameters of the entities involved in completing the task.
Adapt: This is the property that works with the highest intelligence, much like the way a human brain remembers the result of a past event. The system re-trains, debugs and even discovers previously unknown properties to make the operation more accurate. It remembers past events and manages its functionality accordingly.
For AI, data is the essential element that underpins its logic. Without data, data processing for AI is not possible. With data mining's cleaning, integration, reduction and other pre-processing techniques, AI can have adequate data
for learning. As AI technologies iterate, the production, collection, storage,
calculation, transmission and application of data will all be completed by machines.
1.1.1 Applications of AI
Here are some real-world applications of Artificial Intelligence.
1. Healthcare
One of the deepest impacts AI has created is in the healthcare space. A device as common as a Fitbit or an iWatch collects a lot of data, such as an individual's sleep patterns, the calories burnt, heart rate and more, which can help with early detection, personalization and even disease diagnosis.
When powered with AI, such a device can easily monitor and flag abnormal trends. It can even schedule a visit to the closest doctor by itself, and it is also of great help to doctors, who can use AI to support decision making and research. AI has been used to predict ICU transfers, improve clinical workflows and even pinpoint a patient's risk of hospital-acquired infections.
2. Automobile
At this stage, where automobiles are changing from an engine with a chassis around it into software-controlled intelligent machines, the role of AI cannot be underestimated. In the pursuit of self-driving cars, for which Autopilot by Tesla has been the frontrunner, data is taken from all the Teslas running on the road and fed into machine learning algorithms. The assessments of both chips are then matched by the system and acted upon only if the input from both is the same.
4. Surveillance
AI has made it possible to develop face recognition tools which may be used for surveillance and security purposes. This empowers systems to monitor footage in real time and can be a pathbreaking development with regard to public safety.
Manual monitoring of a CCTV camera requires constant human intervention, so it is prone to errors and fatigue. AI-based surveillance is automated, works 24/7 and provides real-time insights. According to a report by the Carnegie Endowment for International Peace, a minimum of 75 out of 176 countries are using AI tools for surveillance purposes. Across the country, 400 million CCTV cameras are already in place, powered by AI technologies, primarily face recognition.
5. Social Media
Social media is not just a platform for networking and expressing oneself. It subconsciously shapes our choices, ideologies, and temperament.
This is due to the Artificial Intelligence tools which work silently in the background, showing us posts that we “might” like and advertising products that “might” be useful based on our search and browsing history.
This helps with social media advertising because of its unprecedented ability to run paid ads to platform users based on highly granular demographic and behavioural targeting.
6. Entertainment
The entertainment business, with the arrival of online streaming services like Netflix and Amazon Prime, relies heavily on the data collected from its users.
This helps with recommendations based on previously viewed content. This is done not only to deliver accurate suggestions but also to create content that would be liked by a majority of the viewers.
With new content being created every minute, it is very difficult to classify it and make it easy to search. AI tools analyse the contents of videos frame by frame and identify objects to add appropriate tags. AI is additionally helping media companies to make strategic decisions.
7. Education
In the education sector too, there are a number of problems which can be solved by the implementation of AI.
A few of them are automated marking software, content retention techniques and suggesting the improvements that are required. This can help teachers monitor not just the academic performance but also the psychological, mental and physical well-being of the students, as well as their all-round development.
This would also help in extending the reach of education to areas where quality educators cannot be present physically.
8. Space Exploration
AI systems are being developed to reduce the risk to human life involved in venturing into the vast realms of the undiscovered and unravelled universe, a very risky task that astronauts have to take up.
As a result, unmanned space exploration missions like the Mars Rover are possible due to the use of AI. It has helped us discover numerous exoplanets, stars, galaxies and, more recently, two new planets in our very own solar system.
NASA is also working with AI applications for space exploration to automate image analysis, to develop autonomous spacecraft that can avoid space debris without human intervention, and to make communication networks more efficient and distortion-free by using AI-based devices.
9. Gaming
In the gaming industry too, video game systems powered by AI are ushering us into a new era of immersive gaming experiences. Here AI serves to enhance the game-player experience rather than being used purely for machine learning or decision making. AI has also been playing a huge role in creating video games and making them more tailored to players' preferences.
10. Robotics
With increasing developments within the field of AI, robots are becoming more
efficient in performing tasks that earlier were too complex.
AI in robotics helps the robots to learn the processes and perform the tasks with
complete autonomy, without any human intervention. This is because robots are
designed to perform repetitive tasks with utmost precision and increased speed.
AI has been introducing flexibility and learning capabilities in previously rigid
applications of robots. These benefits are expected to reinforce the market growth.
11. Agriculture
Artificial Intelligence is changing the way we practise one of our most primitive and basic professions: farming. The use of AI in agriculture can be attributed to agricultural robots, predictive analysis, and crop and soil monitoring.
In addition, drones are also used for spraying insecticides and detecting weed formation in large farms. This is going to help firms like Blue River Technologies better manage their farms.
12. E-Commerce
This is one of the most widely used applications of Artificial Intelligence. Different departments of e-commerce, including logistics, demand prediction, intelligent marketing, better personalization and the use of chatbots, are being disrupted by AI. The e-commerce industry, with Amazon as a prominent player, was one of the first industries to embrace AI, and it is likely to make even greater use of AI with time.
E-commerce retailers are increasingly turning towards chatbots or digital assistants to provide 24×7 support to their online buyers.
Artificial Intelligence is not a new word and not a new technology for researchers. This technology is much older than you would imagine; there are even myths of mechanical men in ancient Greek and Egyptian mythology. Following are some milestones in the history of AI which define the journey from the earliest work on AI to its development till date.
Image: History of AI
Reference: https://www.javatpoint.com/history-of-artificial-intelligence
Year 1943: The first work which is now recognized as AI was done by Warren McCulloch and Walter Pitts in 1943. They proposed a model of artificial neurons.
Year 1949: Donald Hebb demonstrated an updating rule for modifying the connection strength between neurons. His rule is now called Hebbian learning.
Year 1950: Alan Turing, an English mathematician who pioneered machine learning, published "Computing Machinery and Intelligence" in 1950, in which he proposed a test that checks a machine's ability to exhibit intelligent behaviour equivalent to human intelligence, now called the Turing test.
A boom of AI (1980-1987)
Year 1980: After the AI winter, AI came back with "Expert Systems". Expert systems were programs that emulate the decision-making ability of a human expert. In the year 1980, the first national conference of the American Association of Artificial Intelligence was held at Stanford University.
Let's get started with some practical demonstrations of AI applications. The very first practical is based on Text Analytics, which determines the sentiment of your typed or spoken message and predicts whether the message is Positive, Negative or Neutral.
1. Enterprise change
AI is engaged in the management and production processes of enterprises, with a trend of becoming increasingly commercialized, and some enterprises have realized relatively mature intelligent applications. These enterprises have been able to collect and make use of user information from multiple dimensions via various technological means and provide consumers with pertinent products and services, while at the same time satisfying their potential needs through insights into development trends gained via data optimization.
2. Industry change
Investment and financing data from recent years show that investment frequencies and amounts raised in business services, robotics, healthcare, industry solutions, basic components and finance are all higher than those in other sectors. From an enterprise perspective, those with a top global team, financial strength and a high-tech gene are more favoured by secondary-market investors. From an industry perspective, however, new retail, autonomous driving, healthcare and education, all easy to deploy, indicate more opportunities, and companies engaged in such sectors could see more investment opportunities.
3. Labor change
The World Economic Forum’s “The Future of Jobs 2018” aims to base this debate on
facts rather than speculation. By tracking the acceleration of technological change as
it gives rise to new job roles, occupations and industries, the report evaluates the
changing contours of work in the Fourth Industrial Revolution.
One of the primary drivers of change identified is the role of emerging technologies,
such as artificial intelligence (AI) and automation. The report seeks to shed more
light on the role of new technologies in the labour market, and to bring more clarity to
the debate about how AI could both create and limit economic opportunity. With 575
million members globally, LinkedIn’s platform provides a unique vantage point into
global labour-market developments, enabling us to support the Forum's examination
of the trends that will shape the future of work.
Our analysis uncovered two concurrent trends: the continued rise of tech jobs and skills and, in parallel, a growth in what we call “human-centric” jobs and skills, that is, those that depend on intrinsically human qualities.
Tech jobs like software engineers and data analysts, along with technical skills such
as cloud computing, mobile application development, software testing and AI, are on
the rise in most industries and across all regions. But several highly “automatable”
jobs fall into the top 10 most declining occupations – i.e., jobs that have seen the
largest decreases in share of hiring over the past five years. These occupations
include administrative assistants, customer service representatives, accountants,
and electrical/mechanical technicians, many of which depend on more repetitive
tasks.
Reference - https://static.javatpoint.com/tutorial/ai/images/subsets-of-ai.png
Classification is used to determine what category something belongs in, after seeing
a number of examples of things from several categories. Regression is the attempt
to produce a function that describes the relationship between inputs and
outputs and predicts how the outputs should change as the inputs change. In
reinforcement learning the agent is rewarded for good responses and punished for
bad ones.
A machine learning pipeline helps to automate the ML workflow by enabling data to be transformed and correlated together in a model to analyse and achieve outputs. An ML pipeline is constructed to allow the flow of data from its raw format to some valuable information. It provides a mechanism to build a multi-ML parallel pipeline system to examine the outcomes of different ML methods. The objective of the machine learning pipeline is to exercise control over the ML model. A well-planned pipeline helps to make the implementation more flexible; it is like having an overview of the code so that faults can be picked out and replaced with correct code.
1. Pre-processing
Data preprocessing is a Data Mining technique that involves transforming raw data
into an understandable format. Real-world data is usually incomplete, inconsistent,
and lacks certain behaviors or trends, most likely to contain many inaccuracies. The
process of getting usable data for a Machine Learning algorithm follows steps such
as Feature Extraction and Scaling, Feature Selection, Dimensionality reduction, and
sampling. The product of Data Pre-processing is the final dataset used for training
the model and testing purposes.
2. Learning
A learning algorithm is used to process understandable data to extract patterns
appropriate for application in a new situation. In particular, the aim is to utilize a
system for a specific input-output transformation task. For this, choose the best-
performing model from a set of models produced by different hyperparameter
settings, metrics, and cross-validation techniques.
3. Evaluation
To Evaluate the Machine Learning model’s performance, fit a model to the training
data, and predict the labels of the test set. Further, count the number of wrong
predictions on the test dataset to compute the model’s prediction accuracy.
4. Prediction
Finally, the model is used to predict outcomes on a test data set that was not used for any training or cross-validation activities.
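The four stages above can be strung together in code. Below is a minimal sketch using scikit-learn's Pipeline on the built-in Iris dataset; the dataset, scaler and model are illustrative choices, not part of the original text.
# A minimal sketch of the four pipeline stages using scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# pre-processing and learning are chained in one pipeline object
pipe = Pipeline([
    ("scale", StandardScaler()),                  # pre-processing: feature scaling
    ("model", LogisticRegression(max_iter=200)),  # learning algorithm
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe.fit(X_train, y_train)                 # learning
y_pred = pipe.predict(X_test)              # prediction on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))   # evaluation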
Machine learning is one of the most exciting technologies that one would have ever
come across. As it is evident from the name, it gives the computer that which makes
it more similar to humans: The ability to learn. Machine learning is actively being
used today, perhaps in many more places than one would expect. We probably use
a learning algorithm dozens of times without even knowing it. Applications of Machine
Learning include:
c. Spam Detector
Mail agents like Gmail or Hotmail do a lot of hard work for us by classifying mails and moving spam mails to the spam folder. This is achieved by a spam classifier running in the back end of the mail application.
Deep learning is a type of ML that can determine for itself whether its predictions
are accurate. It also uses algorithms to analyze data, but it does so on a larger scale
than ML. Deep learning uses artificial neural networks, which consist of multiple
layers of algorithms. Each layer looks at the incoming data, performs its own
specialized analysis, and produces an output that other layers can understand. This
output is then passed to the next layer, where a different algorithm does its own
analysis, and so on.
With many layers in each neural network, and sometimes using multiple neural networks, a machine can learn through its own data processing. This requires much more data and much more computing power than ML.
We will next discuss some interactive demos related to text analytics and language understanding on the Microsoft AI platform. Let's start with text analytics:
1. Go to website https://aidemos.microsoft.com/
2. Select Text Analytics and click on “Try it out>” as shown in the figure
6. The API will select the key words as entities and link them to Wikipedia
Then Click on Next Step
Now, After Text Analytics, Let's try another AI application “Language Understanding”
where you can give commands in the format of text or voice and after understanding
the command, it takes decisions accordingly. So, it's an application of Natural
Language Processing. Let's get started
1. Go to website https://aidemos.microsoft.com/
2. Select Language Understanding and click on “Try it out>” as shown in the figure
4. You can give your Commands (Either by text or voice) and Switches will glow
accordingly in the house next to it.
6. Result
7. Click on Agreed
15. Click on Import images for Testing (Select an image for testing)
Data analytics encompasses six phases: data discovery, data aggregation, planning of the data models, data model execution, communication of the results, and operationalization. Let us understand the six phases below:
There are different types of data analytics techniques. We will discuss about them
next.
Predictive Analytics: Predictive analytics turns the data into valuable, actionable
information. Predictive analytics uses data to determine the probable outcome of an
event or a likelihood of a situation occurring. Techniques that are used for predictive
analytics are:
Linear Regression
Time series analysis and forecasting
Data Mining
Descriptive Analytics: Descriptive analytics looks at data and analyses past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to understand the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations and finance, uses this type of analysis. Common examples of descriptive analytics are company reports that provide historic reviews, such as: 1) Data Queries 2) Reports 3) Descriptive Statistics 4) Data dashboards
Diagnostic Analytics: In this analysis, we generally use historical data over other
data to answer any question or for the solution of any problem. We try to find any
dependency and pattern in the historical data of the particular problem.
For example, companies go for this analysis because it gives a great insight into a
problem. Common techniques used for Diagnostic Analytics are: 1) Data
discovery 2) Data mining 3) Correlations
Many tools are available in the market, which make it easier for us:
The package manager of Anaconda is conda, which manages package versions. Anaconda is written in Python, and the conda package manager checks the requirements of the dependencies and installs them if they are required. More importantly, warning signs are given if the dependencies already exist. Anaconda comes pre-built with more than 1,500 Python or R data science packages. Anaconda has specific tools to collect data using machine learning and artificial intelligence. The distribution includes data-science packages suitable for Windows, Linux, and macOS.
Package Manager: Anaconda has conda as its package manager, whereas Python has pip as its package manager.
User Applications: Anaconda is primarily developed to support data science and machine learning tasks, whereas Python is not only used in data science and machine learning but also in a variety of applications in embedded systems, web development and networking programs.
Package Management: The conda package manager allows Python as well as non-Python library dependencies to be installed, whereas pip allows only Python dependencies to be installed.
Now let's get started with a very popular programming language used in data science and a variety of other tasks like website building and server-side programming: Python.
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
Python has a simple syntax like the English language.
Python has syntax that allows developers to write programs with fewer lines
than some other programming languages.
Python runs on an interpreter system, meaning that code can be executed as
soon as it is written. This means that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a
functional way.
Python was designed for readability and has some similarities to the English
language with influence from mathematics.
Python uses new lines to complete a command, as opposed to other
programming languages which often use semicolons or parentheses.
Python relies on indentation, using whitespace, to define scope; such as the
scope of loops, functions and classes. Other programming languages often
use curly-brackets for this purpose.
a. Click on Next
d. Browse the location, where you want to install. Keep it default, & Next.
f. Click on Install.
g. The installation process may take a few minutes.
i. Other packages.
To get hold of Anaconda, we need to get used to both the CLI and the GUI of the software. We previously used the CLI version; now we need to see the GUI version.
Jupyter Notebook
So, after creating our environment for programming let's start with some basic
building blocks of Python language.
Variable Memory
A variable in Python is created as soon as a value is assigned to it. No additional command is needed to declare a variable in Python. There are some basic rules for naming a variable in Python.
Data type defines the format, sets the upper & lower bounds of the data so that a
program could use it appropriately. In Python, we don't need to declare a variable by explicitly mentioning the data type. This feature is famously known as dynamic
typing. Python determines the type of a literal directly from the syntax at runtime. For
example – the quotes mark the declaration of a string value, square brackets
represent a list and curly brackets for a dictionary. Also, the non-decimal numbers
will get assigned to Integer type whereas the ones with a decimal point will be a float.
In Python, numeric data type represents the data which has numeric value. Numeric
value can be integer, floating number or even complex numbers. These values are
defined as int, float and complex class in Python.
Integers – This value is represented by int class. It contains positive or
negative whole numbers (without fraction or decimal). In Python there is
no limit to how long an integer value can be.
Float – This value is represented by float class. It is a real number with
floating point representation. It is specified by a decimal point. Optionally,
the character e or E followed by a positive or negative integer may be
appended to specify scientific notation.
Complex Numbers – Complex number is represented by complex class.
It is specified as (real part) + (imaginary part)j. For example – 2+3j
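For example, the values below (chosen arbitrarily) show all three numeric classes:
a = 10          # int: a whole number
b = 3.14        # float: a real number with a decimal point
c = 2.5e3       # float in scientific notation (2500.0)
d = 2 + 3j      # complex: real part 2, imaginary part 3
print(type(a), type(b), type(c), type(d))
# <class 'int'> <class 'float'> <class 'float'> <class 'complex'>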
Creating String: Strings in Python can be created using single quotes or double
quotes or even triple quotes.
3) Tuple
Just like list, tuple is also an ordered collection of Python objects. The only difference
between a tuple and a list is that tuples are immutable, i.e., tuples cannot be modified after they are created, and a tuple uses () brackets. It is represented by the tuple class. Tuples
can contain any number of elements and of any datatype (like strings, integers, list,
etc.).
Note: Tuples can also be created with a single element, but it is a bit tricky. Having
one element in the parentheses is not sufficient, there must be a trailing ‘comma’ to
make it a tuple.
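For example, note how the trailing comma changes the type (the element values are arbitrary):
t = ("apple", 10, [1, 2, 3])   # a tuple can mix data types
print(t[0])                    # indexing works like a list: 'apple'

single = ("apple",)            # the trailing comma makes this a tuple
not_a_tuple = ("apple")        # without the comma this is just a string
print(type(single), type(not_a_tuple))
# <class 'tuple'> <class 'str'>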
5) Set
In Python, a Set is an unordered collection of data that is iterable, mutable and has no duplicate elements. The order of elements in a set is undefined, though it may consist of various elements.
Creating Sets
Sets can be created by using the built-in set() function with an iterable object or a
sequence by placing the sequence inside curly braces, separated by ‘comma’. Type
of elements in a set need not be the same, various mixed-up data type values can
also be passed to the set.
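For example (the elements are arbitrary):
s1 = {1, 2, 2, 3, "hello"}      # duplicates are removed automatically
s2 = set([3, 4, 5])             # set() built from an iterable (a list here)

print(s1)                       # e.g. {1, 2, 3, 'hello'} (order not guaranteed)
print(s1.union(s2))             # set operations such as union are available
print(s1.intersection(s2))      # {3}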
Defining a Function
You can define functions to provide the required functionality. Here are simple rules
to define a function in Python.
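As a quick illustration (the function name, default argument and message below are arbitrary choices), a simple function definition and call look like this:
def greet(name, greeting="Hello"):
    """Return a greeting message for the given name."""
    return f"{greeting}, {name}!"

print(greet("Asha"))              # Hello, Asha!
print(greet("Asha", "Welcome"))   # Welcome, Asha!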
2.2.11 Method
A method in Python is somewhat similar to a function, except that it is associated with objects/classes. Methods in Python are very similar to functions except for two major
differences.
The method is implicitly used for an object for which it is called.
The method is accessible to data that is contained within the class.
Now, after understanding the concept of function in python let's get familiar
with conditional operators and looping statements in Python.
Equals: a == b
Not Equals: a != b
Less than: a < b
Less than or equal to: a <= b
Greater than: a > b
Greater than or equal to: a >= b
These conditions can be used in several ways, most commonly in "if statements"
and loops.
The elif keyword is python's way of saying "if the previous conditions were not
true, then try this condition".
The else keyword catches anything which isn't caught by the preceding
conditions.
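A small example putting these keywords together (the values of a and b are arbitrary):
a, b = 33, 200

if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")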
Loops
A for loop is used for iterating over a sequence (that is either a list, a tuple, a
dictionary, a set, or a string). This is less like the for keyword in other programming
languages, and works more like an iterator method as found in other object-oriented programming languages. With the for loop we can execute a set of
statements, once for each item in a list, tuple, set etc.
With the continue statement we can stop the current iteration of the loop, and
continue with the next.
range() Function
To loop through a set of code a specified number of times, we can use
the range() function. The range() function returns a sequence of numbers, starting
from 0 by default, and increments by 1 (by default), and ends at a specified number.
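For example, two typical uses of range() (the numbers are arbitrary):
for x in range(5):          # 0, 1, 2, 3, 4
    print(x)

for x in range(2, 10, 3):   # start at 2, stop before 10, step by 3 -> 2, 5, 8
    print(x)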
Nested Loops
A nested loop is a loop inside a loop.
The "inner loop" will be executed one time for each iteration of the "outer loop":
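For example (the two lists are arbitrary):
adjectives = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]

for adj in adjectives:        # outer loop
    for fruit in fruits:      # inner loop runs fully for each outer item
        print(adj, fruit)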
Python provides lots of built-in methods which we can use on strings. Below is a list of some string methods available in Python 3.
1. capitalize()
Returns a copy of the string with its first character capitalized and the rest
lowercased.
2. casefold()
Returns a casefolded copy of the string. Casefolding is a more aggressive form of lowercasing, intended for caseless matching.
7. upper()
Returns a copy of the string with all characters converted to uppercase.
The format() method has been introduced for handling complex string formatting
more efficiently. This method of the built-in string class provides functionality for
complex variable substitutions and value formatting. This new formatting technique
is regarded as more elegant. The general syntax of format();
Formatting Types
Inside the placeholders you can add a formatting type to format the result. For example, a number in the format specifier sets the available space for the value; inserting the number 8 sets the available space to 8 characters. A space in the specifier inserts an extra space before positive numbers.
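A few hedged examples of format() in use (the numbers and texts are arbitrary):
txt = "We have {:<8} chairs."                  # left-align the value in 8 characters
print(txt.format(49))

price = 59
print("The price is {:.2f} dollars.".format(price))   # fixed-point with 2 decimals
print("Temperature: {: } degrees.".format(7))          # space flag: extra space before positive numbers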
The Python Standard Library contains the exact syntax, semantics, and tokens of
Python. It contains built-in modules that provide access to basic system functionality
like I/O and some other core modules. Most of the Python Libraries are written in the
C programming language. The Python standard library consists of more than 200
core modules. All these work together to make Python a high-level programming
language. Python Standard Library plays a very important role. Without it, the
programmers can’t have access to the functionalities of Python. But other than this,
there are several other libraries in Python that make a programmer’s life easier. Let’s
have a look at some of the commonly used libraries:
1. TensorFlow: This library was developed by Google in collaboration with the
Brain Team. It is an open-source library used for high-level computations. It is
also used in machine learning and deep learning algorithms. It contains a
large number of tensor operations. Researchers also use this Python library to
solve complex computations in Mathematics and Physics.
2. Matplotlib: This library is responsible for plotting numerical data. And that’s
why it is used in data analysis. It is also an open-source library and plots high-
defined figures like pie charts, histograms, scatterplots, graphs, etc.
3. Pandas: Pandas is an important library for data scientists. It is an open-
source machine learning library that provides flexible high-level data
structures and a variety of analysis tools. It eases data analysis, data
manipulation, and cleaning of data. Pandas support operations like Sorting,
Re-indexing, Iteration, Concatenation, Conversion of data, Visualizations,
Aggregations, etc.
4. Numpy: The name “Numpy” stands for “Numerical Python”. It is the
commonly used library. It is a popular machine learning library that supports
large matrices and multi-dimensional data. It consists of in-built mathematical
functions for easy computations. Even libraries like TensorFlow use Numpy
internally to perform several operations on tensors. Array Interface is one of
the key features of this library.
5. SciPy: The name “SciPy” stands for “Scientific Python”. It is an open-source
library used for high-level scientific computations. This library is built over an
extension of Numpy. It works with Numpy to handle complex computations.
While Numpy allows sorting and indexing of array data, the numerical data
code is stored in SciPy. It is also widely used by application developers and
engineers.
6. Scrapy: It is an open-source library that is used for extracting data from
websites. It provides very fast web crawling and high-level screen scraping. It
can also be used for data mining and automated testing of data.
7. Scikit-learn: It is a famous Python library to work with complex data. Scikit-
learn is an open-source library that supports machine learning. It supports
variously supervised and unsupervised algorithms like linear regression,
classification, clustering, etc. This library works in association with Numpy and
SciPy.
8. PyGame: This library provides an easy interface to the Standard Directmedia
Library (SDL) platform-independent graphics, audio, and input libraries. It is
used for developing video games using computer graphics and audio libraries
along with Python programming language.
9. PyTorch: PyTorch is the largest machine learning library that optimizes tensor
computations. It has rich APIs to perform tensor computations with strong
GPU acceleration. It also helps to solve application issues related to neural
networks.
10. PyBrain: The name “PyBrain” stands for Python Based Reinforcement
Learning, Artificial Intelligence, and Neural Networks library. It is an open-
source library built for beginners in the field of Machine Learning. It provides
fast and easy-to-use algorithms for machine learning tasks. It is so flexible
and easily understandable and that’s why is really helpful for developers that
are new in research fields.
11. OpenCV: Open-Source Computer Vision is used for image processing. It is a Python package that provides a broad set of functions focused on real-time computer vision. OpenCV provides several inbuilt functions with the help of which you can learn Computer Vision. It allows images to be both read and written. Objects such as faces, trees, etc., can be detected in any video or image.
There are many more libraries in Python. We can use a suitable library for our
purposes. Hence, Python libraries play a very crucial role and are very helpful to the
developers.
Using the Module: Now we can use the module we just created, by using
the import statement.
Example
Import the module named mymodule, and call the greeting function:
import mymodule
mymodule.greeting("Jonathan")
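For this to run, a file named mymodule.py must exist in the same folder; a minimal sketch of its contents (the greeting text is an assumption) could be:
# mymodule.py
def greeting(name):
    print("Hello, " + name)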
pip is the standard package manager for Python. It allows you to install and manage additional packages that are not part of the Python standard library.
2.3.2 Attributes
Class attributes belong to the class itself; they are shared by all the instances. Such attributes are defined in the class body, usually at the top, for legibility. Unlike class attributes, instance attributes are not shared by objects. Every object has its own copy of the instance attribute (in the case of class attributes, all objects refer to a single copy).
To list the attributes of an instance/object, we have two functions:-
1. vars() – This function displays the attributes of an instance in the form of a dictionary.
2. dir() – This function displays more attributes than vars(), as it is not limited to the instance. It displays the class attributes as well, and also the attributes of its ancestor classes.
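A short sketch illustrating class versus instance attributes and the two functions (the class name and values are arbitrary):
class Dog:
    species = "Canis familiaris"   # class attribute, shared by all instances

    def __init__(self, name):
        self.name = name           # instance attribute, unique to each object

d = Dog("Bruno")
print(vars(d))    # {'name': 'Bruno'} -> instance attributes only
print(dir(d))     # also lists 'species', 'name' and inherited attributes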
So, we explored the fundamentals of Linux operating system and basics of Python
programming language. Happy Learning.
Why NumPy?
In Python we have lists that serve the purpose of arrays, but they are slow to
process. NumPy aims to provide an array object that is up to 50x faster than
traditional Python lists. The array object in NumPy is called ndarray; it provides a lot of supporting functions that make working with ndarray very easy. Arrays are very
frequently used in data science, where speed and resources are very important.
NumPy arrays are stored in a single contiguous (continuous) block of memory. There
are two key concepts relating to memory: dimensions and strides.
Firstly, many Numpy functions use strides to make things fast. Examples include
integer slicing (e.g. X[1,0:2]) and broadcasting. Understanding strides helps us better
understand how NumPy operates.
Secondly, we can directly use strides to make our own code faster. This can be
particularly useful for data pre-processing in machine learning. NumPy is a Python
library and is written partially in Python, but most of the parts that require fast
computation are written in C or C++.
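As a quick illustration of dimensions and strides (the array values are arbitrary, and the printed stride values assume the default C-ordered, 4-byte int32 layout):
import numpy as np

X = np.arange(12, dtype=np.int32).reshape(3, 4)   # a 3x4 ndarray
print(X.shape)     # (3, 4)  -> dimensions
print(X.strides)   # (16, 4) -> bytes to step to the next row / next column
print(X[1, 0:2])   # integer slicing uses strides internally: [4 5]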
Integers
The randint() method takes a size parameter where you can specify the shape of an
array.
Ex: Generate a 1-D array containing 5 random integers from 0 to 100
Ex: Generate a 2-D array with 3 rows, each row containing 5 random integers from 0
to 100
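A minimal sketch of these two examples using numpy.random.randint (note that the upper bound 100 is exclusive, so values run from 0 up to 99):
from numpy import random

x = random.randint(100, size=(5))      # 1-D array of 5 random integers
print(x)

y = random.randint(100, size=(3, 5))   # 2-D array: 3 rows, 5 random integers each
print(y)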
Statistics is concerned with collecting data and then analyzing it. It includes methods for collecting samples, describing the data, and then drawing conclusions from that data. NumPy is the fundamental package for scientific calculations and hence goes hand-in-hand with its statistical functions.
NumPy contains various statistical functions that are used to perform statistical data
analysis. These statistical functions are useful when finding a maximum or minimum
of elements. It is also used to find basic statistical concepts like standard deviation,
variance, etc.
It calculates the mean by adding all the items of the arrays and then divides it by the
number of elements. We can also mention the axis along which the mean can be
calculated.
It can calculate the median for both one-dimensional and multi-dimensional arrays.
Median separates the higher and lower range of data values.
Standard Deviation
Standard deviation is the square root of the average of squared deviations from the mean. The formula for standard deviation is:
std = sqrt(mean((x - x.mean())**2))
Variance
Variance is the average of the squared differences from the mean. The formula for variance is:
var = mean((x - x.mean())**2)
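These functions can be tried directly in NumPy; the array below is an arbitrary example:
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(data))     # 5.0
print(np.median(data))   # 4.5
print(np.std(data))      # 2.0  (square root of the variance)
print(np.var(data))      # 4.0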
Quartiles:
A quartile is a type of quantile. The first quartile (Q1), is defined as the middle
number between the smallest number and the median of the data set, the second
quartile (Q2) – median of the given data set while the third quartile (Q3), is the middle
number between the median and the largest value of the data set.
Uses:
Decision Making
The data set having a higher value of interquartile range (IQR) has more variability.
The data set having a lower value of interquartile range (IQR) is preferable.
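Quartiles and the IQR can be computed with numpy.percentile; the data values below are illustrative:
import numpy as np

data = np.array([1, 3, 4, 7, 8, 9, 12, 15])

q1 = np.percentile(data, 25)   # first quartile
q2 = np.percentile(data, 50)   # second quartile (the median)
q3 = np.percentile(data, 75)   # third quartile
iqr = q3 - q1                  # interquartile range
print(q1, q2, q3, iqr)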
After knowing the various statistical functions in NumPy which are used in
descriptive analytics, let us see some other interesting functionalities of NumPy.
The term broadcasting describes how NumPy treats arrays with different shapes
during arithmetic operations. Subject to certain constraints, the smaller array is
“broadcast” across the larger array so that they have compatible shapes.
The result is equivalent to the case where b is an array. We can think of the scalar b being stretched during the arithmetic operation into an array with the same shape as a. The new elements in b are simply copies of the original scalar.
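A short sketch of broadcasting in action (the arrays are arbitrary examples):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2.0                      # a scalar

print(a * b)                 # [2. 4. 6.] -> b is "stretched" to match a's shape

M = np.ones((3, 3))
print(M + a)                 # the 1-D array a is broadcast across each row of M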
Sorting Arrays
numpy.sort(): This function returns a sorted copy of an array.
Parameters:
arr: Array to be sorted.
axis: Axis along which we need the array to be sorted.
order: This argument specifies which fields to compare first.
kind: [‘quicksort’{default}, ‘mergesort’, ‘heapsort’]. These are the sorting
algorithms.
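A small example of these parameters in use (the array values are arbitrary):
import numpy as np

arr = np.array([[12, 15], [10, 1]])

print(np.sort(arr))                    # sort each row (last axis by default)
print(np.sort(arr, axis=0))            # sort along the columns
print(np.sort(arr, axis=None))         # sort the flattened array
print(np.sort(arr, kind='mergesort'))  # choose the sorting algorithm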
Use Matplotlib library and its various functions to visualize the data
Analyse different types of data using Pandas
Understand the use of data analytics in improved decision making
Classification of Data:
The Pandas library is one of the most preferred tools for data scientists to do data
manipulation and analysis, next to matplotlib for data visualization and NumPy, the
fundamental library for scientific computing in Python on which Pandas was built.
The fast, flexible, and expressive Pandas data structures are designed to make real-
world data analysis significantly easier, but this might not be immediately the case
for those who are just getting started with it. This is because there is so much
functionality built into this package that the options are overwhelming.
Our eyes are drawn to colors and patterns. We can quickly identify red from blue,
square from circle. Our culture is visual, including everything from art and
advertisements to TV and movies. Data visualization is another form of visual art that
grabs our interest and keeps our eyes on the message. When we see a chart,
we quickly see trends and outliers. If we can see something, we internalize it quickly.
It’s storytelling with a purpose. If you’ve ever stared at a massive spreadsheet of
data and couldn’t see a trend, you know how much more effective a visualization can
be. Data visualization helps transform your data into an engaging story with details
and patterns. It is used for:
Better Analysis
Speed up decision making process
Quick action
Identifying patterns
Story telling is more engaging
Grasping the latest trends
Finding errors
Data visualization for idea illustration assists in conveying an idea, such as a tactic or
process. It is commonly used to spur idea generation across teams. In the early
days of visualization, the most common visualization technique was using a
Microsoft Excel spreadsheet to transform the information into a table, bar graph or
pie chart. While Microsoft Excel continues to be a popular tool for data visualization,
others have been created that provide us with more sophisticated abilities.
Data visualization is a part of exploratory data analysis, where the main objective is
to analyse data and summarize the entire characteristics, often with visual methods.
We perform analysis on the data that we collect, find important metrics/features by
using some nice and pretty visualizations. It is usually performed using the following
methods:
Multivariate Analysis: This analysis is used, when we have more than two variables
in the dataset. It is a hard task for the human brain to visualize the relationship
among more than 3 variables in a graph and thus multivariate analysis is used to
study these complex data types. Ex: - Cluster Analysis, Pair Plot and 3D scatter plot.
Matplotlib
Line Chart: A line chart displays the evolution of one or more numeric variables. It
is one of the most common chart types. It is a type of graph or chart which displays information as a series of data points called 'markers', connected by straight line segments. A line plot can be used for both ordered and unordered data.
Line Color:
You can use the keyword argument color or the shorter c to set the color of the line.
The default colors used in matplotlib are - b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white.
Marker Color
You can use the keyword argument markeredgecolor or the shorter mec to set the color of the edge of the markers, and markerfacecolor or the shorter mfc to set the color of the face of the markers.
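A minimal sketch combining these keyword arguments (the y-values are arbitrary):
import matplotlib.pyplot as plt
import numpy as np

y = np.array([3, 8, 1, 10])

# 'o' draws circular markers; c sets the line colour,
# mec/mfc set the marker edge and face colours
plt.plot(y, marker='o', c='r', mec='k', mfc='y', linestyle='--')
plt.show()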
The Matplotlib subplot() function can be called to plot two or more plots in one
figure. Matplotlib supports all kind of subplots including 2x1 vertical, 2x1 horizontal or
a 2x2 grid. The subplot() function takes three arguments that describe the layout of the figure. The layout is organized in rows and columns, which are represented by
the first and second argument. The third argument represents the index of the
current plot.
Example: Draw 2 plots on top of each other.
plt.subplot(x, y, 1)
#the figure has x rows, y columns, and this plot is the first plot.
plt.subplot(x, y, 2)
#the figure has x rows, y columns, and this plot is the second plot.
Code: subplot theory.ipynb
You can draw as many plots as you like on one figure; just describe the number of rows, columns, and the index of the plot, as in the sketch below.
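A runnable sketch of a 2x1 vertical layout (the data values are arbitrary):
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

plt.subplot(2, 1, 1)                       # 2 rows, 1 column, first plot
plt.plot(x, np.array([3, 8, 1, 10]))

plt.subplot(2, 1, 2)                       # 2 rows, 1 column, second plot
plt.plot(x, np.array([10, 20, 30, 40]))

plt.show()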
Bar Plot: Bar plots are arguably the simplest data visualization. They map
categories to numbers; it is very convenient while comparing categories of data or
different groups of data. Bar plots are very flexible: The height can represent
anything, as long as it is a number. And each bar can represent anything, as long as
it is a category.
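A minimal bar-plot sketch (the categories and heights are made up for illustration):
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [3, 8, 1, 10]          # the height of each bar

plt.bar(categories, values)     # plt.barh() draws horizontal bars instead
plt.show()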
A legend is an area describing the elements of the graph. In the matplotlib library,
there’s a function called legend() which is used to place a legend on the axes.
Matplotlib.pyplot.legend()
Write the code given below in Jupyter notebook to add a legend to the graph and click on Run. We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame, or change the padding around the text.
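A hedged sketch of a legend using these options (the sine and cosine curves are illustrative data):
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)')
plt.plot(x, np.cos(x), label='cos(x)')

# fancybox gives rounded corners; shadow and framealpha alter the frame;
# borderpad changes the padding around the legend text
plt.legend(loc='upper right', fancybox=True, shadow=True,
           framealpha=0.7, borderpad=1)
plt.show()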
Result:
In the above example you can observe that text annotations mark the local maximum and local minimum with arrows. We next consider another example.
Three-dimensional plots are enabled by importing the mplot3d toolkit, included with
the main Matplotlib installation:
from mpl_toolkits import mplot3d
We can plot a variety of three-dimensional plot types with the above three-dimensional axes enabled. Remember that, compared to 2D plots, it is greatly beneficial to view 3D plots interactively rather than statically in the notebook. We will therefore use %matplotlib notebook rather than %matplotlib inline when running the code. ax.plot3D and ax.scatter are the functions used to plot line and point graphs respectively. Write down the code given below in Jupyter notebook and click on Run.
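A minimal sketch of such a 3D line and scatter plot (the helix line and the random scatter data are illustrative):
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt
import numpy as np

ax = plt.axes(projection='3d')

# a 3-D line
zline = np.linspace(0, 15, 1000)
ax.plot3D(np.sin(zline), np.cos(zline), zline, 'gray')

# 3-D scattered points
zdata = 15 * np.random.random(100)
ax.scatter3D(np.sin(zdata), np.cos(zdata), zdata, c=zdata)
plt.show()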
Output:
We have learned and explored the various functions in Matplotlib library which are
used to visualize the data in various format and make it available to help in decision
making process. Now, let us see another powerful library called Pandas which is
very popular in data analytics.
Pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool. Pandas is a newer package built on top of NumPy, and
provides an efficient implementation of a DataFrame.
DataFrames are essentially multidimensional arrays with attached row and
column labels, and often with heterogeneous types and/or missing data. Apart
from offering a convenient storage interface for labelled data, Pandas also
implements a number of powerful data operations familiar to users of both
database frameworks and spreadsheet programs.
Image: DataFrame
For example, we need to store the passengers' data of the Titanic. For a number of passengers, we know the name (characters), age (integers) and sex (male/female) data.
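A minimal sketch of how such a table becomes a pandas DataFrame (the three passenger rows below are illustrative values, not the full Titanic data):
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})
print(df)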
3.3.2 How to manipulate textual data?
Here we use the Titanic data set, which is stored as CSV file. A screenshot of the
first few columns of the dataset is given below.
Task 2: Create a new column Surname that contains the surname of the passengers
by extracting the part before the comma.
By this method and various functions, you can perform textual data manipulation
using pandas.
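A hedged sketch of how this task could be done with the pandas string accessor, assuming the data has been loaded into a DataFrame named titanic from a file such as titanic.csv and that the Name column holds values like "Braund, Mr. Owen Harris":
import pandas as pd

titanic = pd.read_csv("titanic.csv")   # the filename/path is an example
# split each name at the comma and keep the first part as the surname
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
print(titanic["Surname"].head())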
3.3.3 Introducing Pandas Objects
As we see in the output, the Series wraps both a sequence of values and a
sequence of indices, which we can access with the values and index attributes.
The values are simply a familiar NumPy array:
Like with a NumPy array, data can be accessed by the associated index via the
familiar Python square-bracket notation:
Now that we have the area and the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:
In order to create a Series from an array, we have to import the NumPy module and use the array() function.
In order to create a Series from a list, we first create the list, and then we can create a Series from it.
In order to create a Series from a dictionary, we first create the dictionary, and then we can make a Series using it. The dictionary keys are used to construct the index.
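A small sketch showing the three constructions (the values and labels are arbitrary):
import pandas as pd
import numpy as np

s_list = pd.Series([10, 20, 30])                 # from a list (default integer index)
s_arr  = pd.Series(np.array(['g', 'e', 'k']))    # from a NumPy array
s_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})     # from a dict: keys become the index

print(s_list.values)    # the underlying NumPy array
print(s_dict.index)     # Index(['a', 'b', 'c'], dtype='object')
print(s_dict['b'])      # access by label: 2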
The different data storage formats available to be manipulated by Pandas library are
text, binary and SQL. Below is a table containing available ‘readers’ and ‘writers’
functions of the pandas I/O API set with data format and description.
Format Type Data Description Reader Writer
text CSV read_csv to_csv
text Fixed-Width Text File read_fwf
text JSON read_json to_json
text HTML read_html to_html
These files serve a number of different business purposes. For instance, they help companies export a high volume of data to a more concentrated database. They also serve two other primary business functions:
CSV files are plain-text files, making them easier for the website developer to
create
Since they're plain text, they're easier to import into a spreadsheet or another
storage database, regardless of the specific software you're using.
To better organize large amounts of data.
Saving CSV files is relatively easy; you just need to know where to change the file type. Under the "File name" section in the "Save As" tab, you can select "Save as type" and change it to "CSV (Comma delimited) (*.csv)". Once that option is selected, you are on your way to quicker and easier data organization. This should be the same for both Apple and Microsoft operating systems.
A JSON file is a file that stores simple data structures and objects in JavaScript
Object Notation (JSON) format, which is a standard data interchange format. It is
primarily used for transmitting data between a web application and a server. JSON
files are lightweight, text-based, human-readable, and can be edited using a text
editor.
Because JSON files are plain text files, you can open them in any text editor,
including:
Microsoft Notepad (Windows)
Apple TextEdit (Mac)
Vim (Linux)
GitHub Atom (cross-platform)
You can also open a JSON file in the Google Chrome and Mozilla Firefox web
browsers by dragging and dropping the file into your browser window.
Structures of JSON
JSON supports two widely used (amongst programming languages) data structures.
Since data structure supported by JSON is also supported by most of the modern
programming languages, it makes JSON a very useful data-interchange format.
Explanation of Syntax
An object starts and ends with '{' and '}'. Between them, a number of string-value pairs can reside. A string and its value are separated by a ':', and if there is more than one string-value pair, they are separated by ','.
Example
{
"firstName": "John",
"lastName": "Maxwell",
"age": 40,
"email":"[email protected]"
}
In JSON, objects can nest arrays (starts and ends with '[' and ']') within it. The
following example shows that.
{
"Students": [
{ "Name":"Amit Goenka" ,
"Major":"Physics" },
{ "Name":"Smita Pallod" ,
"Major":"Chemistry" },
{ "Name":"Rajeev Sen" ,
"Major":"Mathematics" }
]
}
Array:
Syntax:
[ value, .......]
Explanation of Syntax:
An Array starts and ends with '[' and ']'. Between them, a number of values can reside. If there is more than one value, they are separated by ','.
Example
[100, 200, 300, 400]
The JSON data can also describe an array in which each element is an object.
[
{
"name": "John Maxwell",
Remember that even arrays can also be nested within an object. The following
shows that.
{
"firstName": "John",
"lastName": "Maxwell",
"age": 40,
"address":
{
"streetAddress": "144 J B Queens Road",
"city": "Dallas",
"state": "Washington",
"postalCode": "75001"
},
"phoneNumber":
[
{
"type": "personal",
"number": "(214)5096995"
},
{
"type": "fax",
"number": "13235551234"
}
]
}
Value
Syntax:
String || Number || Object || Array || TRUE || FALSE || NULL
A value can be a string, a number, an object, an Array, a Boolean value (i.e., true or
false) or Null. This structure can be nested.
Number
The following table shows supported number types.
Number Types and Description
Integer: positive or negative digits (1-9 and 0).
Fraction: fractions like .8.
Exponent: e, e+, e-, E, E+, E-
Whitespace
Whitespace can be placed between any pair of supported data-types.
The basic process of loading data from a CSV file into a Pandas DataFrame is achieved using the "read_csv" function in Pandas:
# Load the Pandas library with alias 'pd'
import pandas as pd
# Read data from the file (the filename 'data.csv' is an example)
data = pd.read_csv("data.csv")
A “CSV” file, that is, a file with a “csv” filetype, is a basic text file. Any text editor such
as NotePad on windows or TextEdit on Mac, can open a CSV file and show the
contents. Sublime Text is a wonderful and multi-functional text editor option for any
platform.
CSV is a standard form for storing tabular data in text format, where commas are
used to separate the different columns, and newlines (carriage return / press enter)
are used to separate rows. Typically, the first row in a CSV file contains the names of
the columns for the data.
An example of a table data set and the corresponding CSV-format data is shown in
the diagram below.
Note that almost any tabular data can be stored in CSV format – the format is
popular because of its simplicity and flexibility. You can create a text file in a text
editor, save it with a .csv extension, and open that file in Excel or Google Sheets to
see the table form.
Figure: Other than commas in CSV files, Tab-separated and Semicolon-separated data is popular also. Quote
characters are used if the data in a column may contain the separating character. In this case, the ‘NickName’
column contains semicolon characters, and so this column is “quoted”. Specify the separator and quote character
in pandas.read_csv
Figure: Pandas searches your ‘current working directory’ for the filename that you specify when opening or
loading files. The FileNotFoundError can be due to a misspelled filename, or an incorrect working directory.
Below are the steps to load JSON String into Pandas DataFrame
Step 1: Prepare the JSON String
To start with a simple example, let’s say that you have the following data about
different products and their prices:
Product Price
Desktop Computer 700
Tablet 250
iPhone 800
Laptop 1200
Then, save the notepad with your desired file name and add the .json extension at
the end of the file name. Here, I named the file as data.json:
In this case, The JSON file is stored on the Desktop, under this path:
C:\Users\XYZ\Desktop\data.json
So, this is the code that is used to load the JSON file into the DataFrame:
import pandas as pd
df = pd.read_json (r'C:\Users\XYZ\Desktop\data.json')
print (df)
Run the code in Python (adjusted to your path), and you’ll get the following
DataFrame:
Figure: Output
3 different JSON strings
Below are 3 different ways that you could capture the data as JSON strings.
Each of those strings would generate a DataFrame with a different orientation when
loading the files into Python.
2. Values orientation
[["Desktop Computer",700],["Tablet",250],["iPhone",800],["Laptop",1200]]
3. Columns orientation
{"Product":{"0":"Desktop
Computer","1":"Tablet","2":"iPhone","3":"Laptop"},"Price":{"0":700,"1":250,"2":800,"3":
1200}}
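As a rough sketch of loading the orientations above, the strings can be wrapped in StringIO so the example is self-contained (output details may vary slightly between Pandas versions):
from io import StringIO
import pandas as pd

values_json = '[["Desktop Computer",700],["Tablet",250],["iPhone",800],["Laptop",1200]]'
columns_json = '{"Product":{"0":"Desktop Computer","1":"Tablet","2":"iPhone","3":"Laptop"},"Price":{"0":700,"1":250,"2":800,"3":1200}}'

df_values = pd.read_json(StringIO(values_json), orient='values')     # columns get default names 0, 1
df_columns = pd.read_json(StringIO(columns_json), orient='columns')  # columns named Product and Price
print(df_values)
print(df_columns)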
HTML (Hypertext Markup Language) is mainly used for creating web applications and pages. HTML uses tags to define each block of content, such as a <p></p> tag for the start and end of a paragraph and <h2></h2> for the start and end of a heading; many such tags together collate to form an HTML web page.
In order to read an HTML file, Pandas looks for table tags: a table is defined by the <table> element, and its cells by <td></td> tags. The Pandas library provides functions like read_html() and to_html() to import and export data to DataFrames. We will discuss below how to read tabular data from an HTML file into a Pandas DataFrame, as well as how to write data from a Pandas DataFrame to an HTML file.
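A minimal sketch of that round trip might look as follows (the filename products.html is hypothetical, and read_html() needs an HTML parser such as lxml or html5lib to be installed):
import pandas as pd

df = pd.DataFrame({'Product': ['Tablet', 'Laptop'], 'Price': [250, 1200]})
df.to_html('products.html', index=False)   # write the DataFrame as an HTML <table>

tables = pd.read_html('products.html')     # returns a list of DataFrames, one per <table> found
print(tables[0])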
Before applying the groupby function to a dataset, let's go over a visual example. Assume we have two features: one is color, which is a categorical feature, and the other is a numerical feature, values. We want to group values by color and calculate the mean (or any other aggregation) of values for each color, and then finally sort the colors based on their average values. The following figure shows the steps of this process.
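A small sketch of that figure in code, with made-up color/values data, could look like this:
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue'],
                   'values': [10, 4, 14, 7, 6]})

# Group by color, take the mean of values, then sort the averages
result = df.groupby('color')['values'].mean().sort_values(ascending=False)
print(result)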
A Sample DataFrame
Once the data has been loaded into Python, Pandas makes the calculation of
different statistics very simple. For example, mean, max, min, standard deviations
and more for columns are easily calculable:
# How many rows are in the dataset?
data['item'].count()
Out: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out: 9
The .describe() function as discussed before is a useful summarisation tool that will
quickly display statistics for any variable or group it is applied to. The describe()
output varies depending on whether you apply it to a numeric or character column.
The output from a groupby and aggregation operation varies between Pandas Series
and Pandas Dataframes. As a rule of thumb, if you calculate more than one column
of results, your result will be a Dataframe. For a single column of results, the agg
function, by default, will produce a Series.
You can change this by selecting your operation column differently:
# produces Pandas Series
data.groupby('month')['duration'].sum()
# Produces Pandas DataFrame
data.groupby('month')[['duration']].sum()
The groupby output will have an index or multi-index on rows corresponding to your
chosen grouping variables. To avoid setting this index, pass “as_index=False” to the
groupby operation.
data.groupby('month', as_index=False).agg({"duration": "sum"})
Figure: Using the as_index parameter while Grouping data in pandas prevents setting a row index on the result.
The aggregation functionality provided by the agg() function allows multiple statistics
to be calculated per group in one calculation.
Applying a single function to columns in groups
Instructions for aggregation are provided in the form of a python dictionary or list.
The dictionary keys are used to specify the columns upon which you’d like to perform
operations, and the dictionary values to specify the function to run.
For example:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg(
    {
        'duration': 'sum',        # sum the duration per group
        'network_type': 'count',  # get the count of networks
        'date': 'first'           # get the first date per group
    }
)
3.3.11 Pivot Tables
You may be familiar with pivot tables in Excel to generate easy insights into your
data. The function is quite similar to the group by function available in Pandas.
It’s a table of statistics that helps summarize the data of a larger table by “pivoting”
that data. In Pandas, we can construct a pivot table using the following syntax,
pandas.pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All',
observed=False)
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes)
on the index and columns of the result DataFrame.
Parameters
data: DataFrame
values: column to aggregate, optional
index: column, Grouper, array, or list of the previous
columns: column, Grouper, array, or list of the previous
aggfunc: function, list of functions, or dict; default numpy.mean
fill_value: scalar, default None. Value to replace missing values with (in the resulting pivot table, after aggregation).
margins: bool, default False
Returns: DataFrame (an Excel-style pivot table).
We’ll use Pandas to import the data into a dataframe called df. We’ll also print out
the first five rows using the .head() function:
import pandas as pd
df = pd.read_excel('https://github.com/datagy/pivot_table_pandas/raw/master/sample_pivot.xlsx', parse_dates=['Date'])
print(df.head())
We’ll begin by aggregating the Sales values by the Region the sale took place in:
sales_by_region = pd.pivot_table(df, index = 'Region', values = 'Sales')
print(sales_by_region)
This returns the following output:
This gave us a summary of the Sales field by Region. The default parameter for
aggfunc is mean. Because of this, the Sales field in the resulting dataframe is the
average of Sales per Region. If we wanted to change the type of function used, we
could use the aggfunc parameter. For example, if we wanted to return the sum of all
Sales across a region, we could write:
total_by_region = pd.pivot_table(df, index='Region', values='Sales', aggfunc='sum')
print(total_by_region)
This returns:
Let’s create a dataframe that generates the mean Sale price by Region:
avg_region_price = pd.pivot_table(df, index = 'Region', values = 'Sales')
The values in this dataframe are:
Now, say we wanted to filter the dataframe to only include Regions where the
average sale price was over 450, we could write:
avg_region_price[avg_region_price['Sales'] > 450]
We can also apply multiple conditions, such as filtering to show only sales greater
than 450 or less than 430.
avg_region_price[(avg_region_price['Sales'] > 450) | (avg_region_price['Sales'] < 430)]
We have wrapped each condition in brackets and separated the conditions by a pipe
( | ) symbol. This returns the following:
Adding columns to a pivot table in Pandas can add another dimension to the tables.
The Columns parameter allows us to add a key to aggregate by. For example, if we
wanted to see the number of units sold by Type and by Region, we could write:
columns_example = pd.pivot_table(df, index='Type', columns='Region', values='Units', aggfunc='sum')
print(columns_example)
Columns are optional as we indicated above and provide the keys by which to
separate the data. The pivot table aggregates the values in the values parameter.
We will understand the univariate plotting using the Wine reviews dataset. This
dataset contains 10 columns and 150k rows of wine reviews. We will first import the
dataset and then start with the analysis.
Bar Chart
A bar chart of review counts by province shows that California produces almost a third of the wines reviewed in Wine Magazine! A bar chart can likewise show the number of reviews at each score allotted by Wine Magazine:
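A hedged sketch of how such bar charts might be produced is shown below; it assumes the Wine Magazine reviews have been downloaded as a CSV (the filename winemag-data.csv and the column names 'province' and 'points' are assumptions):
import pandas as pd
import matplotlib.pyplot as plt

reviews = pd.read_csv('winemag-data.csv', index_col=0)     # hypothetical filename

reviews['province'].value_counts().head(10).plot.bar()     # reviews per province (top 10)
plt.show()

reviews['points'].value_counts().sort_index().plot.bar()   # number of reviews per score
plt.show()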
A line chart can pass over any number of individual values, making it the tool of first choice for distributions with many unique values or categories. However, line charts have an important weakness: unlike bar charts, they are not appropriate for nominal categorical data. While bar charts distinguish between every "type" of point, line charts mush them together. So, a line chart asserts an order to the values on the horizontal axis, and the order won't make sense with some data. After all, a "descent" from California to Washington to Tuscany doesn't mean much! Line charts also make it harder to distinguish between individual values.
Area Chart
Histogram
The bivariate plotting as discussed before, compares two sets of data to find a
relationship between the two variables. We will consider the same Wine Dataset
used for univariate analysis.
Scatter Plot
The simplest bivariate plot is the scatter plot.
Another interesting way to do this that's built right into pandas is to use our next plot
type, a hexplot.
Hex Plot
A hex plot aggregates points in space into hexagons, and then colors those
hexagons based on the values within them:
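A rough sketch of both plot types, assuming the same hypothetical reviews DataFrame with 'price' and 'points' columns:
import pandas as pd
import matplotlib.pyplot as plt

reviews = pd.read_csv('winemag-data.csv', index_col=0)     # hypothetical filename
subset = reviews[reviews['price'] < 100]                   # assumes a 'price' column

subset.sample(100).plot.scatter(x='price', y='points')     # simple scatter plot
plt.show()

subset.plot.hexbin(x='price', y='points', gridsize=15)     # hex plot: points aggregated into hexagons
plt.show()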
Stacked bar plots share the strengths and weaknesses of univariate bar charts. They
work best for nominal categorical or small ordinal categorical variables. Another
simple example is the area plot, which lends itself very naturally to this form of manipulation.
The color and label in the above plot indicate the amount of correlation between the two variables of interest. We can see from the above heatmap which pairs of variables are most strongly correlated.
The plot in this case demonstrates conclusively that within our datasets goalkeepers
(at least, those with an overall score between 80 and 85) have much lower
Aggression scores than Strikers do. In this plot, the horizontal axis encodes the
Overall score, the vertical axis encodes the Aggression score, and the grouping
encodes the Position. We show the output below.
So, in this chapter we have explored various important concepts of data analytics and the libraries used in data analysis, namely NumPy, Pandas and Matplotlib. Using these libraries, we can analyse our data and make sense of it.
Terminology of ML
1. Dataset: A set of data examples, that contain features important to solving the
problem.
2. Features: Important pieces of data that help us understand a problem. These
are the input which are fed in to a Machine Learning algorithm to help it learn.
3. Target: The information the machine learns to predict. The prediction is the model's "guess" of what the target value should be, based on the given features.
1. Data Collection: This is the first step, and the goal of this step is to collect the
data that the algorithm will learn from.
2. Data Preparation: Format and engineer the data into the optimal format,
extracting important features and performing dimensionality reduction.
3. Data Wrangling: It is the process of cleaning and converting raw data into a
useable format.
4. Training: Also known as the fitting stage, this is where the ML algorithm
actually learns by showing it the data that has been collected and prepared.
5. Evaluation: Test the model to see how well it performs and then fine tune the
model to maximize its performance.
6. Deployment: This is the last step of machine learning cycle, where we deploy
the model in a real-world system.
Let us now see some real-life applications of machine learning in our day-to-day life.
4.1.2 Real time Application of Machine Learning
Machine learning is relevant in many fields, industries, and has the capability to grow
over time. Here are six real-life examples of how machine learning is being used.
1. Image recognition
5. Extraction
Machine learning can extract structured information from unstructured data. A machine learning algorithm automates the process of annotating datasets for predictive analytics tools. In healthcare, for example, this helps develop methods to prevent, diagnose, and treat disorders.
Supervised learning is the most popular type of machine learning in which machines
are trained using "labelled" training data, and on the basis of that data, machines
predict the output. The labelled data means some input data is already tagged with
the correct output. In supervised learning, the training data provided to the machines
work as the supervisor that teaches the machines to predict the output correctly. The
aim of supervised learning algorithm is to find a mapping function to map the input
variable (x) with the output variable (y). Common algorithms used during supervised
learning include neural networks, decision trees, linear regression, and logistic
regression.
In the real world, supervised learning can be used for predicting real estate prices, finding disease risk factors, image classification, fraud detection, spam filtering, etc.
Unsupervised learning, by contrast, works on unlabelled data. A few example use cases include creating customer groups based on purchase behaviour and grouping inventory according to sales and manufacturing metrics.
Reinforcement learning takes direct inspiration from how human beings learn from experience in their lives. It is a type of algorithm that improves upon itself and learns from new situations using a system of rewards and penalties. The learning system, called an agent in this context, learns by interacting with an environment. The agent selects and performs actions and receives rewards for performing correctly and penalties for performing incorrectly. In reinforcement learning the agent learns by itself, without intervention from a human, and is trained to give the best possible solution for the best possible reward. Examples include teaching cars to park themselves and drive autonomously, dynamically controlling traffic lights to reduce traffic jams, training robots, etc.
Now, after this brief overview of machine learning and its types, let us explore a very popular machine learning library named Scikit-learn, which is used by many machine learning engineers and data scientists for various data science and AI projects.
4.1.4 Scikit Learn library overview
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed with many Linux distributions, encouraging academic and commercial use. Scikit-learn is built on top of several common data and math Python libraries, a design that makes it easy to integrate them all: you can pass NumPy arrays and Pandas DataFrames directly to the ML algorithms of Scikit-learn! It uses the following libraries:
NumPy: For any work with matrices, especially math operations
SciPy: Scientific and technical computing
Matplotlib: Data visualization
IPython: Interactive console for Python
Sympy: Symbolic mathematics
Pandas: Data handling, manipulation, and analysis
Now that we have discussed the machine learning library sklearn, let us start with the basic algorithms in machine learning. We will first focus on the regression and classification algorithms of supervised machine learning, which work with labelled datasets.
Classification
i) Logistic Regression
ii) K-Nearest Neighbours
iii) Support Vector Machines
iv) Naïve Bayes
v) Decision Tree Classification
Regression:
Linear regression can be further divided into two types of algorithm:
1. Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
2. Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then the algorithm is called Multiple Linear Regression.
The linear regression model provides a sloped straight line representing the
relationship between the variables, where a scatter plot can be a helpful tool in
determining the strength of the relationship between the two variables.
When working with linear regression, our main goal is to find the best fit line that
means the error between predicted values and actual values should be minimized.
The best fit line will have the least error. The different values for weights or the
coefficient of lines (a, b) gives a different line of regression, so we need to calculate
the best values for a and b to find the best fit line.
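As a minimal sketch with made-up numbers, scikit-learn's LinearRegression estimates these coefficients for us:
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])      # single independent variable
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])     # roughly linear target values

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)         # estimated a (intercept) and b (slope)
print(model.predict([[6]]))                  # prediction for a new value of x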
4.2.2 Ordinary Least Square Method
Ordinary least squares is a common technique to determine the coefficients of linear
regression. This method draws a line through the data points that minimizes the sum
of the squared differences between the observed values and the corresponding fitted
values. This approach treats the data as a matrix and uses linear algebra operations
to estimate the optimal values for the coefficients. It means that all of the data must
be available and we must have enough memory to fit the data and perform matrix
operations.
In the case of Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) * Σ (yi - (a + b*xi))^2
where N is the total number of observations, yi is the actual value, and (a + b*xi) is the predicted value.
Model Performance:
The Goodness of fit determines how the line of regression fits the set of
observations. The process of finding the best model out of various models is called
optimization. It can be achieved by the following method:
R-squared = Explained Variation / Total Variation
Machine Learning is the foundation for most artificial intelligence solutions, and the
creation of an intelligent solution often begins with the use of machine learning to
train a predictive model using historic data that we have collected. Azure Machine
Learning is a cloud service that we can use to train and manage machine learning
models. It allows building no-code machine learning models through a drag and drop
visual interface. It's designed to help data scientists and machine learning engineers build, train and deploy models more productively.
Now, you will be redirected into the following Microsoft Azure Machine
Learning Studio (Classic) and your free workspace will be created as below:
Sign in to Microsoft Azure Machine Learning Studio (classic) and create a workspace as discussed above.
First select Experiment and then New at the bottom of the page.
For example: Normalized-losses has 41 missing values, which is the maximum.
Make a connection between the dragged items, press on the red sign and then Launch column selector to choose the relevant column.
Remove the column which has the missing values by selecting With Rules > All Columns > Exclude > column names > normalized-losses, and click on the tick.
Select Data Transformation > Manipulation > Clean Missing Data, drag it into the panel and make the connection.
Select Machine Learning > Train > Train Model, drag it into the panel and connect it.
Select the machine learning algorithm from Machine Learning > Initialize Model > Regression > Linear Regression, drag it into the panel and make the connection.
Now Select price as output for the prediction and press the tick mark.
The logistic regression model passes the outcome through a logistic function to
calculate the probability of an occurrence. The model then maps the probability to
binary outcomes. The logistic function is a type of sigmoid, a mathematical function resulting in an S-shaped curve that takes any real number and maps it to a probability between 0 and 1. The formula of the sigmoid function is:
sigmoid(x) = 1 / (1 + e^(-x))
where e is the base of the natural logarithms and x is the actual numerical value we want to transform. Below is the image showing the logistic function:
Binomial: In binomial Logistic regression, we have only two possible outcomes of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
Ordinal: In ordinal Logistic regression, the outcome will be ordered. The dependent
variable can have 3 or more possible ordered types, having a quantitative
significance. These variables may represent "low", "medium", or "high", and each category can have a score like 0, 1, 2 or 3.
1. True Negative (TN): TN represents the number of patients who have been classified by the algorithm as healthy and who are healthy.
2. True Positive (TP): TP represents the number of patients who have been classified as having the disease and who do have the disease.
3. False Positive (FP): FP represents the number of patients who have been classified as having the disease but who are actually healthy.
4. False Negative (FN): FN represents the number of patients who have been predicted by the algorithm as healthy but who are actually suffering from the disease.
F1 score also known as the F-measure states the equilibrium between the
precision and the recall. We use F-measure when we have to compare two
models with low precision and high recall or vice versa.
We next discuss the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve). The ROC curve and AUC are performance measures that provide a comprehensive evaluation of classification models. As discussed before, algorithms like logistic regression return probabilities rather than discrete outputs. A threshold value is set on the probabilities to distinguish between the classes. Depending on the threshold value, the values of metrics such as precision and recall also change. We cannot maximize both precision and recall together, as increasing recall tends to decrease precision and vice versa. Our aim will be to balance precision and recall depending on the task.
The ROC curve is used for this case, which is a graph that summarizes the
performance of a classification model at all classification thresholds. The curve has
two axes True Positive Rate (TPR) vs False Positive Rate (FPR), both of which takes
values between 0 and 1.
We show below a typical ROC curve, where the points on the curve are obtained by varying the classification threshold and computing the TPR and FPR at each threshold.
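A hedged sketch of computing these metrics with scikit-learn on synthetic data (not the handbook's dataset) could look like this:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)        # synthetic binary data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)                   # labels at the default 0.5 threshold
y_prob = clf.predict_proba(X_test)[:, 1]       # probabilities for the positive class

print(confusion_matrix(y_test, y_pred))        # [[TN, FP], [FN, TP]]
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(roc_auc_score(y_test, y_prob))           # area under the ROC curve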
Sign in to Microsoft Azure Machine Learning Studio (classic) and create a workspace.
First select Experiment and then New at the bottom of the page.
Select Blank Experiment.
Give the title for the project as Logistic Regression Model. Select Sample
option from the Saved Dataset.
Select the Breast Cancer dataset and drag the selected dataset onto the panel. Right-click on its output port (1) and choose the Visualize option. Visualize the dataset and then close it. For example: in the Class column, 0 represents no cancer while 1 represents cancer.
Now, after creating a model on azure platform let us learn about another
classification algorithm.
This algorithm is called naïve because the model assumes that the input features
that go into the model are independent of each other, i.e., there is no correlation
between the input features. The assumptions may or may not be true, therefore the
name naïve. We will first discuss a bit about probability, conditional probability and
Bayes Theorem before we go into the working of Naïve Bayes.
The events for which we want the probability of their happening are known as the
“favourable events”. The probability always lies in the range of 0 to 1, with 0 meaning
there is no probability of that event happening and 1 meaning there is a 100%
possibility it will happen. When we restrict the idea of probability to create a
dependency on a specific event, that is known as conditional probability.
4.4.2 Conditional Probability
Conditional Probability is the probability of one (or more) event given the occurrence
of another event. Let us consider two events, A and B. The conditional probability of
event B, will be defined as the probability that event B will occur, given the
knowledge that event A has already happened. Mathematically, it is denoted by
P(B|A) = P(A∩B) / P(A)
We next discuss Bayes' theorem, which follows directly from conditional probability. It states:
P(A|B) = P(B|A) * P(A) / P(B)
where,
P(A) and P(B) are called marginal/prior probability and evidence respectively.
They are the probabilities of events A and B occurring, irrespective of the
outcomes of the other.
P(A|B) is called the posterior probability. It is the probability of event A
occurring, given that event B has occurred.
P(B|A) is called the likelihood probability. It is the probability of event B
occurring given that event A has occurred.
P(A∩B) is the joint probability of both events A and B.
Let us ask the question, “What is the probability that there will be rain, given the
weather is cloudy?”
P(Rain) is the prior, P(Cloud|Rain) is the likelihood, and P(Cloud) is the evidence; therefore P(Rain|Cloud) = P(Cloud|Rain) * P(Rain) / P(Cloud).
The Bayes Rule provides the formula to compute the probability of the output (Y) given the input (X). In real-world problems, unlike the hypothetical assumption of having a single input feature, we usually have several features X1 ... Xn. Assuming the features are independent of each other, the Naïve Bayes formula becomes:
P(Y=k | X1...Xn) = ( P(X1|Y=k) * P(X2|Y=k) * ... * P(Xn|Y=k) ) * P(Y=k) / ( P(X1) * P(X2) * ... * P(Xn) )
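As a small sketch, scikit-learn's GaussianNB applies this rule to continuous features; here it is run on the built-in breast cancer dataset, as a stand-in for the Azure example above rather than the same workflow:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)   # learns P(X|Y) and P(Y) from the training data
print(nb.score(X_test, y_test))           # accuracy on held-out data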
It is called a “bag” of words, because any information about the order or structure of
words in the document is discarded. The model is only concerned with whether
known words occur in the document, not where in the document.
We can see that there are some contrasting reviews about the movie as well as the
length and pace of the movie. Imagine looking at a thousand reviews like these.
Clearly, there are a lot of interesting insights we can draw from them and build upon
them to gauge how well the movie performed. However, as we saw above, we
cannot simply give these sentences to a machine learning model and ask it to tell us
whether a review was positive or negative. We need to perform certain text pre-
processing steps.
We will first build a vocabulary from all the unique words in the above three reviews.
The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’,
‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’. We can now take each of these words and mark
their occurrence in the three movie reviews above with 1s and 0s. This will give us 3
vectors for 3 reviews:
Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector of Review 2: [1 1 2 0 0 1 1 0 1 0 0]
Vector of Review 3: [1 1 1 0 0 0 1 0 0 1 1]
And that’s the core idea behind a Bag of Words (BoW) model. The method we used
till now, takes the count of each word and represents the word in the vector by the
number of counts of that particular word. So, what does a word having a high word
count signify? Can we interpret that this particular word is important in retrieving
information about the documents? The answer to that is No. This is because if that
particular word occurs many times in the dataset, maybe it is because this word is
just a frequent word, not because it is meaningful or relevant.
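A hedged sketch using scikit-learn's CountVectorizer is shown below; the exact wording of Reviews 1 and 3 is reconstructed from the vocabulary above and is an assumption, and get_feature_names_out() requires a reasonably recent scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This movie is very scary and long",
           "This movie is not scary and is slow",
           "This movie is spooky and good"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)     # document-term count matrix
print(vectorizer.get_feature_names_out())   # vocabulary (in alphabetical order, unlike the hand-made list)
print(bow.toarray())                        # one count vector per review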
There are approaches, such as the Tf-Idf approach, to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like "the" that are also frequent across all documents are penalized. Term frequency-inverse document frequency (Tf-Idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is calculated by multiplying two different metrics: the Term Frequency (TF) and the Inverse Document Frequency (IDF). The term frequency is computed as TF(t, d) = n / (total number of terms in document d).
Here, in the numerator, n is the number of times the term “t” appears in the
document “d”. Thus, each document and term would have its own TF value. Take
the same vocabulary we had built in the Bag-of-Words model to show how to
calculate the TF for Review #2: This movie is not scary and is slow
Here, the vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’,
‘spooky’, ‘good’
We can calculate the term frequencies for all the terms and all the reviews in this
manner:
Inverse Document Frequency (IDF): IDF is a measure of how important a term is.
We need the IDF value because computing just the TF alone is not sufficient to
understand the importance of words.
We can calculate the IDF values for the all the words in Review 2: IDF(‘this’) = log
(number of documents/number of documents containing the word ‘this’) = log (3/3) =
log (1) = 0
Similarly: IDF (‘movie’,) = log (3/3) = 0, IDF(‘is’) = log (3/3) = 0, IDF(‘not’) = log (3/1)
= log (3) = 0.48, IDF(‘scary’) = log (3/2) = 0.18, IDF(‘and’) = log (3/3) = 0, IDF(‘slow’)
= log (3/1) = 0.48.
We can calculate the IDF values for each word like this. Thus, the IDF values for the
entire vocabulary would be:
Hence, we see that words like “is”, “this”, “and”, etc., are reduced to 0 and have little
importance; while words like “scary”, “long”, “good”, etc. are words with more
importance and thus have a higher value. We can now compute the TF-IDF score for
each word in the corpus. Words with a higher score are more important, and those
with a lower score are less important:
We can now calculate the TF-IDF score for every word in Review 2:
Similarly, we can calculate the TF-IDF scores for all the words with respect to all the
reviews, as shown in the figure below. We have now obtained the TF-IDF scores for
our vocabulary. TF-IDF also gives larger values for less frequent words and is high
when both IDF and TF values are high i.e. the word is rare in all the documents
combined but frequent in a single document.
Summarizing, we found in the BoW model, a text is represented as the bag of its
words, disregarding grammar and even word order but keeping multiplicity. The Tf-
Idf score is a numerical statistic that is intended to reflect how important a word is to
a document in a collection or corpus.
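A minimal sketch with scikit-learn's TfidfVectorizer follows; note that scikit-learn applies smoothing and normalisation, so the numbers will not match the hand calculation exactly:
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["This movie is very scary and long",
           "This movie is not scary and is slow",
           "This movie is spooky and good"]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(reviews)
print(tfidf.get_feature_names_out())
print(scores.toarray().round(2))   # higher scores mark words frequent in one review but rare overall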
Afinn: It is one of the simplest yet most popular lexicons used for sentiment analysis, developed by Finn Årup Nielsen. It contains 3300+ words, each with an associated polarity score. In Python, the afinn package provides ready access to this lexicon.
Polarity is a float that lies between [-1,1], -1 indicates negative sentiment and +1
indicates positive sentiments. Polarity is related to the emotion of a given text.
Subjectivity is also a float which lies in the range of [0,1] (0.0 being very objective
and 1.0 being very subjective). Subjective sentences generally refer to personal
opinion, emotion, or judgment. A subjective sentence may or may not carry any
emotion.
So, in this chapter we explored various algorithms which fall under the categories of supervised and unsupervised machine learning. But these algorithms have some drawbacks or limitations. So, in the next chapter we are going to discuss a subset of machine learning named deep learning and how it helps to overcome the limitations of machine learning. So let us get started with deep learning.
Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep learning models are not limited to hand-engineered features; instead, they take learning a step further. Deep learning models work directly with audio, image and video data to deliver real-time analysis. The data being fed to the deep learning model does not need any external intervention: you can feed raw data to the model and receive actionable insights.
Deep-learning architectures such as deep neural networks, deep belief networks,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, machine vision, speech recognition, natural
language processing, audio recognition, social network filtering, machine translation,
bioinformatics, drug design, medical image analysis, material inspection and board
game programs, where they have produced results comparable to and in some
cases surpassing human expert performance.
Artificial neural networks (ANNs) were inspired by information processing and
distributed communication nodes in biological systems. ANNs have various
differences from biological brains. Specifically, neural networks tend to be static and
symbolic, while the biological brain of most living organisms is dynamic (plastic) and
analogue. The adjective "deep" in deep learning refers to the use of multiple layers in
the network. Deep learning is a modern variation which is concerned with an
unbounded number of layers of bounded size, which permits practical application
and optimized implementation, while retaining theoretical universality under mild
conditions. In deep learning the layers are also permitted to be heterogeneous and
to deviate widely from biologically informed connectionist models, for the sake of
efficiency, trainability and understandability. In deep learning, each level learns to
transform its input data into a slightly more abstract and composite representation.
We list below the differences between AI, ML and DL.
Let us understand the core part of artificial neural networks which is Neuron.
Neurons, as we know them in the biological sense, are the basic functional units of the nervous system, and they generate electrical signals called action potentials, which allow them to quickly transmit information over long distances. Almost all neurons have three basic functions essential for the normal functioning of all the cells in the body. These are to:
1. Receive signals (information) from other neurons or sensory receptors.
2. Integrate the incoming signals to determine whether the information should be passed along.
3. Communicate signals to target cells such as other neurons, muscles or glands.
Artificial neuron also known as perceptron is the basic unit of the neural network. In
simple terms, it is a mathematical function based on a model of biological neurons. It
can also be seen as a simple logic gate with binary outputs. Each artificial neuron
has the following main functions:
1. Takes inputs from the input layer
2. Weighs them separately and sums them up
3. Pass this sum through a nonlinear function to produce output.
1. The values of the two inputs (x1, x2) are 0.8 and 1.2.
2. We have a set of weights (1.0, 0.75) corresponding to the two inputs.
3. Then we have a bias with value 0.5 which needs to be added to the sum.
4. The input to the activation function, the combination C, is then calculated using the formula:
C = (x1 * w1) + (x2 * w2) + bias = (0.8 * 1.0) + (1.2 * 0.75) + 0.5 = 2.2
Now the combination (C) can be fed to the activation function. Let us first understand
the logic of Rectified linear (ReLU) activation function which we are currently using in
our example. In our case, the combination value we got was 2.2 which is greater
than 0 so the output value of our activation function will be 2.2. This will be the final
output value of our single layer neuron.
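The whole calculation fits in a few lines of NumPy; this sketch simply reproduces the numbers above:
import numpy as np

def relu(z):
    # ReLU returns z when z > 0, otherwise 0
    return np.maximum(0, z)

x = np.array([0.8, 1.2])       # the two inputs
w = np.array([1.0, 0.75])      # the corresponding weights
bias = 0.5

combination = np.dot(x, w) + bias   # 0.8*1.0 + 1.2*0.75 + 0.5 = 2.2
output = relu(combination)          # 2.2, since the combination is positive
print(combination, output)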
Since we have learnt a bit about both biological and artificial neurons, we can now
draw comparisons between both as follows:
Forward Propagation
Forward propagation is how neural networks make predictions. Input data is “forward
propagated” through the network layer by layer to the final layer which outputs a
prediction.
Backpropagation
After understanding concepts like neuron, bias, weights, forward propagation and backward propagation, which play a vital role in predicting the final output, let's explore some problems like overfitting and underfitting of a model.
The possibility of over-fitting exists because the criterion used for selecting the model
is not the same as the criterion used to judge the suitability of a model. For example,
a model might be selected by maximizing its performance on some set of training
data, and yet its suitability might be determined by its ability to perform well on
unseen data; then overfitting occurs when a model begins to "memorize" training
data rather than "learning" to generalize from a trend.
The potential for overfitting depends not only on the number of parameters and data
but also the conformability of the model structure with the data shape, and the
magnitude of model error compared to the expected level of noise or error in the
data. Even when the fitted model does not have an excessive number of parameters,
it is to be expected that the fitted relationship will appear to perform less well on a
new data set than on the data set used for fitting (a phenomenon sometimes known
as shrinkage). In particular, the value of the coefficient of determination will shrink
relative to the original data.
To lessen the chance of, or amount of, overfitting, several techniques are available
(e.g., model comparison, cross-validation, regularization, early stopping, pruning,
Bayesian priors, or dropout). The basis of some techniques is either (1) to explicitly
penalize overly complex models or (2) to test the model's ability to generalize by
evaluating its performance on a set of data not used for training, which is assumed to
approximate the typical unseen data that a model will encounter.
1. Use more data for training, so that the model learns as many of the hidden patterns in the training data as possible and generalizes better.
2. Use regularization techniques, for example L1, L2, dropout, or early stopping (in the case of neural networks).
The data that we collect or generate is mostly raw data, i.e., it is not fit to be used in
applications directly due to a number of possible reasons. Therefore, we need to
analyse it first, perform the necessary pre-processing, and then use it.
For instance, let's assume that we were trying to build a cat classifier. Our program
would take an image as input and then tell us whether the image contains a cat or
not. The first step for building this classifier would be to collect hundreds of cat
pictures. One common issue is that all the pictures we have scraped would not be of
the same size/dimensions, so before feeding them to the model for training, we
would need to resize/pre-process them all to a standard size. This is just one of
many reasons why image processing is essential to any computer vision application.
Computer Vision is an interdisciplinary field that deals with how computers can be
made to gain a high-level understanding from digital images or videos. The idea here
is to automate tasks that the human visual systems can do. So, a computer should
be able to recognize objects such as that of a face of a human being or a lamppost
or even a statue.
It is a multidisciplinary field that could broadly be called a subfield of artificial
intelligence and machine learning, which may involve the use of specialized methods
and make use of general learning algorithms. The goal of computer vision is to
extract useful information from images.
1. Image Classification
2. Object Detection
3. Optical Character Recognition
4. Image Segmentation
The more pixels used to represent an image, the closer the result can resemble the original. The number of pixels in an image is sometimes called the resolution, though resolution has a more specific definition. Pixel counts can be expressed as a single number, as in a "three-megapixel" camera, or as a pair of numbers, such as 640 x 480.
Image as Matrix
Images are represented in rows and columns. For example, a digital grayscale image is represented in the computer by a matrix of pixels. Each pixel of such an image is represented by one matrix element: an integer from the set {0, 1, ..., 255}. The numeric values in the pixel representation range uniformly from zero (black pixels) to 255 (white pixels).
3. Colour images: Colour images are three-band monochrome images in which each band contains a different colour, and the actual information is stored in the digital image. Colour images contain gray-level information in each spectral band. The images are represented as red, green and blue (RGB images), and each colour image has 24 bits/pixel, that is, 8 bits for each of the three colour bands (RGB).
The computer reads any image as a range of values between 0 and 255. For any
colour image, there are 3 primary channels – Red, green and blue. How it works is
pretty simple. A matrix is formed for every primary colour and later these matrices
combine to provide a Pixel value for the individual R, G, B colours. Each element of
the matrices provides data pertaining to the intensity of brightness of the
pixel. Consider the following image:
As shown, the size of the image here can be calculated as B x A x 3, where 3 is the
number of channels. Note: For a black-white image, there is only one single
channel.
Installation
Note: Since we are going to use OpenCV via Python, it is an implicit requirement that you already have Python (version 3) installed on your workstation. OpenCV itself can then be installed with pip, for example: pip install opencv-python
To check if your installation was successful or not, run the following command in either a Python shell or your command prompt:
import cv2
NumPy library: The computer processes images in the form of a matrix, for which NumPy is used; OpenCV relies on it in the background.
OpenCV-Python: the Python binding of the OpenCV library, used to manipulate images and videos. The module was previously named cv, but the updated version is cv2.
cv2.imread() method is used to load an image from the specified file.
Syntax: cv2.imread(path, flag)
Parameters: path: a string representing the path of the image to be read. flag: specifies the way in which the image should be read; the default flag is cv2.IMREAD_COLOR.
Return Value: This method returns an image that is loaded from the specified file.
Note: The image should be in the working directory, or a full path to the image should be given. By default, OpenCV stores coloured images in BGR (Blue, Green and Red) order.
cv2.imwrite() method is used to save an image to any storage device. This will save
the image according to the specified format in current working directory.
Syntax: cv2.imwrite(filename, image)
Parameters: filename: A string representing the file name. The filename must
include image format like .jpg, .png, etc.
image: It is the image that is to be saved.
Return Value: It returns true if image is saved successfully.
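A small sketch combining the two calls (input.jpg is a hypothetical file in the working directory):
import cv2

img = cv2.imread('input.jpg')        # returns None if the file cannot be found
if img is not None:
    print(img.shape)                 # (height, width, 3) for a colour image, stored in BGR order
    cv2.imwrite('copy.png', img)     # save the same image in PNG format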
Arithmetic operations like addition and subtraction, and bitwise operations (AND, OR, NOT, XOR), can be applied to input images. These operations can be helpful in enhancing the properties of the input images. Image arithmetic is important for analyzing the input image properties. The operated images can be further used as enhanced input images, and many more operations can be applied for clarifying, thresholding, dilating the image, and so on.
Addition of Images:
We can add two images by using the function cv2.add(). This directly adds up the pixel values of the two images.
Syntax: cv2.add(img1, img2)
But simply adding the pixels is not always ideal, so we often use cv2.addWeighted() instead. Remember, both images should be of equal size and depth.
Subtraction of Image:
Just like addition, we can subtract the pixel values in two images and merge them
with the help of cv2.subtract(). The images should be of equal size and depth.
Syntax: cv2.subtract(src1, src2)
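A hedged sketch of these arithmetic operations, assuming two hypothetical images of the same size and depth:
import cv2

img1 = cv2.imread('scene.jpg')      # hypothetical filenames; both images must match in size and depth
img2 = cv2.imread('logo.jpg')

added = cv2.add(img1, img2)                          # per-pixel (saturated) addition
blended = cv2.addWeighted(img1, 0.7, img2, 0.3, 0)   # weighted blend: 0.7*img1 + 0.3*img2
diff = cv2.subtract(img1, img2)                      # per-pixel subtraction
cv2.imwrite('blended.png', blended)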
Bitwise operations are used in image manipulation and for extracting essential parts of an image. The bitwise operations used here are AND, OR, XOR and NOT.
Bitwise operations also help in image masking, and image creation can be enabled with the help of these operations. These operations can be helpful in enhancing the properties of the input images.
NOTE: The bitwise operations should be applied to input images of the same dimensions.
cv2.arrowedLine() method is used to draw arrow segment pointing from the start
point to the end point.
Syntax: cv2.arrowedLine(image, start_point, end_point, color, thickness,
line_type, shift, tipLength)
Parameters: image, start_point, end_point, color, thickness are same as defined in
cv2.line()
line_type: It denotes the type of the line for drawing.
shift: It denotes number of fractional bits in the point coordinates.
tipLength: It denotes the length of the arrow tip in relation to the arrow length.
Return Value: It returns an image.
cv2.circle() method is used to draw a circle on any image. The syntax of cv2.circle()
method is:
Syntax:
cv2.circle(image, center_coordinates, radius, color, thickness)
Parameters:
image: It is the image on which the circle is to be drawn.
center_coordinates: It is the center coordinates of the circle. The coordinates are
represented as tuples of two values i.e. (X coordinate value, Y coordinate value).
radius: It is the radius of the circle.
color: It is the color of the borderline of a circle to be drawn. For BGR, we pass a
tuple. e.g.: (255, 0, 0) for blue color.
thickness: It is the thickness of the circle border line in px. Thickness of -1 px will fill
the circle shape by the specified color.
Return Value: It returns an image.
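A tiny sketch drawing both shapes on a blank canvas:
import cv2
import numpy as np

canvas = np.zeros((300, 300, 3), dtype=np.uint8)               # blank black image

cv2.circle(canvas, (150, 150), 60, (255, 0, 0), 2)             # blue circle outline, 2 px thick
cv2.arrowedLine(canvas, (20, 20), (120, 120), (0, 255, 0), 3)  # green arrow
cv2.imwrite('shapes.png', canvas)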
Syntax – cv2.resize()
The syntax of the resize function in OpenCV is:
resized = cv2.resize(src, dsize[, dst[, fx[, fy[, interpolation]]]])
where,
Parameter Description
src [required] source/input image
dsize [required] desired size for the output image
fx [optional] scale factor along the horizontal axis
fy [optional] scale factor along the vertical axis
interpolation [optional] flag that takes one of the following methods.
INTER_NEAREST – a nearest-neighbor interpolation
INTER_LINEAR – a bilinear interpolation (used by default)
INTER_AREA – resampling using pixel area relation. It may be a preferred method for image decimation, as it gives moiré-free results. But when the image is zoomed, it is similar to the INTER_NEAREST method.
INTER_CUBIC – a bicubic interpolation over a 4x4 pixel neighborhood
INTER_LANCZOS4 – a Lanczos interpolation over an 8x8 pixel neighborhood
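A brief sketch of both ways of calling resize (again assuming a hypothetical input.jpg exists):
import cv2

img = cv2.imread('input.jpg')
half = cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)   # shrink by scale factors
fixed = cv2.resize(img, (640, 480), interpolation=cv2.INTER_LINEAR)          # resize to an exact size
print(half.shape, fixed.shape)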
Syntax – cv2.Canny()
The syntax of the OpenCV Canny edge detection function is:
edges = cv2.Canny(img, minVal, maxVal, apertureSize, L2gradient)
where:
Parameter Description
img (Mandatory) The input image as an array (e.g. loaded with cv2.imread()), not a file path
minVal (Mandatory) Minimum intensity gradient
maxVal (Mandatory) Maximum intensity gradient
apertureSize (Optional) Aperture size of the Sobel operator used to compute the gradients
L2gradient (Optional) (Default Value : false)
If true, Canny() uses a much more computationally expensive equation to detect
edges, which provides more accuracy at the cost of resources.
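A minimal sketch of Canny edge detection on a hypothetical image:
import cv2

img = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)   # Canny expects a single-channel 8-bit image
edges = cv2.Canny(img, 100, 200)                      # minVal=100, maxVal=200
cv2.imwrite('edges.png', edges)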
Advantages of Blurring
The benefits of blurring are the following:
It removes low-intensity edges.
It helps in smoothing the image.
It is beneficial in hiding details; for example, blurring is required in many cases, such as when the police intentionally want to hide a victim's face.
OpenCV Averaging
In this technique, the image is convolved with a normalized box filter. It calculates the average of all the pixels under the kernel area and replaces the central element with the calculated average. OpenCV provides cv2.blur() and cv2.boxFilter() to perform this operation. We should define the width and height of the kernel. The syntax of the cv2.blur() function is the following:
cv2.blur(src, ksize[, dst[, anchor[, borderType]]])
Parameters:
src - It represents the source (input) image.
dst - It represents the destination (output) image.
ksize - It represents the size of the kernel.
anchor - It denotes the anchor points.
borderType - It represents the type of border to be used to the output.
OpenCV Gaussian Blur
Image smoothing is a technique which helps in reducing the noise in the images.
Image may contain various type of noise because of camera sensor. It basically
eliminates the high frequency (noise, edge) content from the image so edges are
slightly blurred in this operation. OpenCV provides the GaussianBlur() function to apply smoothing to images. The syntax is the following:
dst = cv2.GaussianBlur(src, ksize, sigmaX[, dst[, sigmaY[, borderType=BORDER_DEFAULT]]])
Parameters:
src -It is used to input an Image.
dst -It is a variable which stores an output Image.
ksize -It defines the Gaussian Kernel Size[height width ]. Height and width must be
odd (1,3,5,..) and can have different values. If ksize is set to [0,0], then ksize is
computed from sigma value.
sigmaX - Kernel standard derivation along X-axis.(horizontal direction).
sigmaY - Kernel standard derivation along Y-axis (vertical direction). If sigmaY = 0
then sigmaX value is taken for sigmaY.
borderType - These are the specified image boundaries while the kernel is applied
on the image borders. Possible border type is:
cv.BORDER_CONSTANT
cv.BORDER_REPLICATE
cv.BORDER_REFLECT
cv.BORDER_WRAP
cv.BORDER_REFLECT_101
cv.BORDER_TRANSPARENT
cv.BORDER_REFLECT101
cv.BORDER_DEFAULT
cv.BORDER_ISOLATED
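A short sketch contrasting the box filter with the Gaussian blur (input.jpg is hypothetical):
import cv2

img = cv2.imread('input.jpg')
averaged = cv2.blur(img, (5, 5))               # normalized box filter with a 5x5 kernel
gaussian = cv2.GaussianBlur(img, (5, 5), 0)    # 5x5 Gaussian kernel; sigma derived from the kernel size
cv2.imwrite('blurred.png', gaussian)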
Can we still easily classify the images? I believe, yes, we can clearly see there are
two cars, two animals and a person. But what is the difference between these two
sets of images? Well, in the second case we removed the colour, the background,
and the other minute details from the pictures. We only have the edges, and you are still
able to identify the objects in the image. So, for any given image, if we are able to
extract only the edges and remove the noise from the image, we would still be able
to classify the image.
Once we have the idea of the edges, now let’s understand how can we extract the
edges from an image. Say, we take a small part of the image. We can compare the
pixel values with its surrounding pixels, to find out if a particular pixel lies on the
edge.
For example, if I take the target pixel 16 and compare the values at its left and right.
Here the values are 10 and 119 respectively. Clearly, there is a significant change in
the pixel values. So, we can say the pixel lies on the edge. Whereas, if you look at
the pixels in the following image. The pixel values to the left and the right of the
selected pixel don’t have a significant difference. Hence, we can say that this pixel is
not at the edge.
Now the question is: do we have to sit and manually compare these values to find the edges? Well, obviously not. For the task, we can use a matrix known as the kernel and perform an element-wise multiplication.
Let’s say, in the selected portion of the image, I multiply all the numbers on left with -
1, all the numbers on right with 1. Also, all the numbers in the middle row with 0. In
simple terms, I am trying to find the difference between the left and right pixels.
When this difference is higher than a threshold, we can conclude it’s an edge. In the
above case, the number is 31 which is not a large number. Hence this pixel doesn’t
lie on edge.
Let’s take another case, here the highlighted pixel is my target.
Filter/kernel
This matrix, that we use to calculate the difference is known as the filter or the
kernel. This filter slides through the image to generate a new matrix called a feature map. The values of the feature map tell whether a particular pixel lies on an edge or not.
For this example, we are using 3*3 Prewitt filter as shown in the above image. As
shown below, when we apply the filter to perform detection on the given 6*6 image
(we have highlighted it in purple for our understanding) the output image will contain
((a11*1) +(a12*0)+ (a13*(-1)) + (a21*1)+(a22*0)+(a23*(-1))+(a31*1)+(a32*0)+(a33*(-
1))) in the purple square. We repeat the convolutions horizontally and then vertically
to obtain the output image.
We would continue the above procedure to get the processed image after edge-
detection. But, in the real world, we deal with very high-resolution images for Artificial
Intelligence applications. Hence, we opt for an algorithm to perform the convolutions,
and even use Deep Learning to decide on the best values of the filter.
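A hand-rolled NumPy sketch of this sliding-window convolution, using one common Prewitt orientation (the sign convention only flips the direction of the detected edge):
import numpy as np

image = np.random.randint(0, 256, (6, 6))     # hypothetical 6x6 grayscale patch
prewitt = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]])              # left column weighted +1, right column -1

h, w = image.shape
feature_map = np.zeros((h - 2, w - 2))        # a 3x3 filter over a 6x6 image gives a 4x4 feature map
for i in range(h - 2):
    for j in range(w - 2):
        window = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(window * prewitt)   # element-wise multiply, then sum

print(feature_map)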
There are various methods, and the following are some of the most commonly used
methods-
Prewitt edge detection
Sobel edge detection
Laplacian edge detection
Canny edge detection
Prewitt Edge Detection: This method is a commonly used edge detector, mostly for detecting the horizontal and vertical edges in images. The figures above show the Prewitt edge detection filters.
Sobel Edge Detection: This uses a filter that gives more emphasis to the centre of the filter. It is one of the most commonly used edge detectors; it helps reduce noise and provides a differencing and smoothing effect at the same time. The following are the filters used in this method.
Canny Edge Detection: This is the most commonly used method; it is highly effective but more complex than many other methods. It is a multi-stage algorithm used to detect/identify a wide range of edges. Its stages are to:
Convert the image to grayscale
Reduce noise – as the edge detection that using derivatives is sensitive to
noise, we reduce it.
Calculate the gradient – helps identify the edge intensity and direction.
Non-maximum suppression – to thin the edges of the image.
Double threshold – to identify the strong, weak and irrelevant pixels in the
images.
Hysteresis edge tracking – helps convert the weak pixels into strong ones
only if they have a strong pixel around them.
Convolutional layers are the major building blocks used in convolutional neural
networks. A convolution is the simple application of a filter to an input that results in
an activation. Repeated application of the same filter to an input results in a map of
activations called a feature map, indicating the locations and strength of a detected
feature in an input, such as an image.
Once a feature map is created, we can pass each value in the feature map through a
nonlinearity, such as a ReLU, much like we do for the outputs of a fully connected
layer.
The layer used for convolution of images is 2D Convolution layer. Most important
parameters of Conv2D Layer are:
Filters
The first required Conv2D parameter is the number of filters that the convolutional
layer will learn. Layers early in the network architecture (i.e., closer to the actual input
image) learn fewer convolutional filters while layers deeper in the network (i.e., closer
to the output predictions) will learn more filters. Conv2D layers in between will learn more filters than the early Conv2D layers but fewer filters than the layers closer to the output.
Kernel Size
The second required parameter you need to provide to the Keras Conv2D class is
the kernel size, a 2-tuple specifying the width and height of the 2D convolution
window. The kernel size must be an odd integer as well. Typical values for kernel
size include: (1, 1), (3, 3), (5, 5), (7, 7). It’s rare to see kernel sizes larger than 7×7.
Strides
The strides parameter is a 2-tuple of integers, specifying the “step” of the convolution
along the x and y axis of the input volume. The strides value defaults to (1, 1),
implying that:
1. A given convolutional filter is applied to the current location of the input volume.
2. The filter takes a 1-pixel step to the right and again the filter is applied to the input
volume.
3. This process is performed until we reach the far-right border of the volume in
which we move our filter one pixel down and then start again from the far left.
Typically, you’ll leave the strides parameter with the default (1, 1) value; however,
you may occasionally increase it to (2, 2) to help reduce the size of the output
volume (since the step size of the filter is larger).
Padding
If the size of the previous layer is not cleanly divisible by the size of the filters
receptive field and the size of the stride then it is possible for the receptive field to
attempt to read off the edge of the input feature map. In this case, techniques like
zero padding can be used to invent mock inputs for the receptive field to read. The
padding parameter to the Keras Conv2D class can take on one of two values: valid
or same.
Padding 'valid' is the first figure: the filter window stays inside the image, so the output feature map is slightly smaller. Padding 'same' zero-pads the input so that the output feature map has the same spatial size as the input.
The pooling layers down-sample the previous layer's feature map. Pooling layers follow a sequence of one or more convolutional layers and are intended to consolidate the features learned and expressed in the previous layer's feature map.
Pooling layers are often very simple, taking the average or the maximum of the input
value in order to create its own feature map. The pooling operation is specified,
rather than learned. Two common functions used in the pooling operation are:
Average Pooling: Calculate the average value for each patch on the feature
map.
Maximum Pooling (or Max Pooling): Calculate the maximum value for each
patch of the feature map.
The result of using a pooling layer and creating down sampled or pooled feature
maps is a summarized version of the features detected in the input. They are useful
as small changes in the location of the feature in the input detected by the
convolutional layer will result in a pooled feature map with the feature in the same
location.
Fully connected layers are the normal flat feed-forward neural network layers. These layers may have a non-linear activation function or a softmax activation in order to output probabilities of class predictions. Fully connected layers are used at the end of the network, after feature extraction and consolidation has been performed by the convolutional and pooling layers. They are used to create final non-linear combinations of features and to make the network's predictions. Now that we have been introduced to artificial neural networks and convolutional neural networks (a minimal sketch combining these layers is given below), let's get started with another deep learning technique named transfer learning.
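The sketch below pulls the pieces together with Keras (assuming TensorFlow is installed; the layer sizes, input shape and the 10 output classes are arbitrary choices for illustration):
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),                 # down-sample the feature maps
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                            # flatten feature maps for the fully connected layers
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),      # 10 hypothetical output classes
])
model.summary()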
VGGNet architecture
MobileNet
The MobileNet model is designed to be used in mobile applications, and it is
TensorFlow’s first mobile computer vision model. MobileNet uses depthwise
separable convolutions. It significantly reduces the number of parameters when
compared to the network with regular convolutions with the same depth in the nets.
This results in lightweight deep neural networks.
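A hedged sketch of loading MobileNet as a frozen feature extractor for transfer learning (assuming TensorFlow/Keras; the ImageNet weights are downloaded on first use):
from tensorflow.keras.applications import MobileNet

base = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False     # freeze the convolutional base so only newly added layers are trained
print(len(base.layers))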