
2022

Artificial Intelligence

Disclaimer: The content is curated for educational purposes only.
© Edunet Foundation. All rights reserved.
Table of Contents
Learning Outcome
Chapter 1: What is AI and ML?
  1.1 Introduction to Artificial Intelligence
  1.2 Artificial Intelligence Market Trends
  1.3 Introduction to Machine Learning
  1.4 PRACTICALS: Microsoft AI Demos
Chapter 2: Data Analysis with Python
  2.1 What is Data Analysis?
  2.2 Anaconda Software and Introduction to Python
  2.3 Python Libraries
  2.4 NumPy Library
Chapter 3: Data Analysis with Pandas and Matplotlib
  3.1 What is Data Visualization?
  3.2 Plotting with Matplotlib
  3.3 Data Manipulation with Pandas
  3.4 Pandas Plotting
Chapter 4: Building Machine Learning Models
  4.1 Machine Learning Basics
  4.2 Linear Regression
  4.3 Logistic Regression
  4.4 Naïve Bayes Theorem
  4.5 Bag of Words Approach
Chapter 5: Building Deep Learning Models
  5.1 Deep Learning Basics
  5.2 Concepts of Neural Networks
  5.3 Computer Vision Basics
  5.4 Convolutional Neural Networks
Reference



This course booklet has been designed by the Edunet
Foundation for the Tech-Saksham programme in
partnership with Microsoft and SAP



Learning Outcome
After completing this handbook, learners will be able to:

 Demonstrate a fundamental understanding of the history of artificial intelligence and its foundations.
 Apply the basic principles, models, and algorithms of AI to recognize, model, and solve problems in the analysis and design of information systems.
 Analyze the structures and algorithms of a selection of techniques related to machine learning and artificial intelligence.
 Design and implement various machine learning algorithms in a range of real-world applications.
 Appreciate the underlying mathematical relationships within and across machine learning algorithms and the paradigms of supervised and unsupervised learning.
 Identify new application requirements in the field of computer vision using deep learning.



Chapter 1: What is AI and ML?
Learning Outcomes:

 Understand the AI market trends, investments and career opportunities
 Understand fundamental concepts of Artificial Intelligence and Machine Learning
 Create a Machine Learning model to differentiate between images

1.1 Introduction to Artificial Intelligence

Artificial intelligence (AI) is the intelligence exhibited by machines or software. It is also the name of the academic field that studies how to create computers and computer software capable of intelligent behaviour. Specifically, it is an area of computer science that emphasizes the creation of intelligent machines that work and react like humans. Some of the activities computers with artificial intelligence are designed for include speech recognition, learning, planning and problem solving. In this topic we shall discuss subjects such as deep learning, machine learning, computer programming and AI in the medical field.

It is the branch of computer science that aims to create intelligent machines. It has
become an essential part of the technology industry. Research associated with
artificial intelligence is highly technical and specialized. The core problems of
artificial intelligence include programming computers for certain traits such as:
Knowledge, Reasoning, Problem solving, Perception, Learning, Planning, and Ability
to manipulate and move objects. Knowledge engineering is a core part of AI
research. Machines can often act and react like humans only if they have abundant
information relating to the world. Artificial intelligence must have access to objects,
categories, properties and relations between all of them to implement knowledge
engineering. Instilling common sense, reasoning and problem-solving power in machines is a difficult and tedious task. Machine learning is another core part of AI.

Learning without any kind of supervision requires an ability to identify patterns in streams of inputs, whereas learning with adequate supervision involves classification and numerical regression. Classification determines the category an object belongs to, and regression deals with obtaining a set of numerical input or output examples, thereby discovering functions enabling the generation of suitable outputs from respective inputs.



Mathematical analysis of machine learning algorithms and their performance is a
well-defined branch of theoretical computer science often referred to as
computational learning theory. Machine perception deals with the capability to use
sensory inputs to deduce the different aspects of the world, while computer vision is
the power to analyse visual inputs with a few sub-problems such as facial, object
and gesture recognition. Robotics is also a major field related to AI. Robots require
intelligence to handle tasks such as object manipulation and navigation, along with
sub-problems of localization, motion planning and mapping.
Let's simplify the definition of AI: a machine, robot or android that has an intelligent
system like a human brain which can sense, reason, act & adapt according to the
operational instructions. It works based on the data, stored as well as collected, and
configures itself with real-time instructions.

Image: Basic Block diagram of AI

 Sense: This is one of the properties of AI. It not only identifies relevant materials and objects but also recognizes real-time operational activities. Sensors or sensing devices can trace and quickly differentiate between wrong and correct objects.

 Reason: This property works the way a human brain does to complete a task successfully. It understands, judges and prepares to execute. Reasoning enables AI to deal with internal and external properties of resources such as condition, time-frame, behaviour and other parameters of the entities involved during completion of the task.



 Act: This is a decisive property that enables AI to execute an action, or to send instructions to others to execute the action instead. The act is the part where the functionality and operational activities come together.

 Adapt: This is the property that works with the highest intelligence, much like the way a human brain remembers the result of any past event. It retrains, debugs and even finds previously uncovered properties to make the operation more accurate. It remembers past events and manages the functionalities accordingly.

For AI, data is an essential element that underpins its underlying logic. Without data, data processing for AI is not possible. With data mining's cleaning, integration, reduction and other pre-treatment means, AI can have adequate data for learning. As AI technologies iterate, the production, collection, storage, calculation, transmission and application of data will all be completed by machines.

Image: Development stages of data processing


Reference: Public information, Deloitte Research

1.1.1 Applications of AI
Here we have some of the Artificial Intelligence Applications in real world.



Image: Applications of Artificial Intelligence
Reference: https://techvidvan.com/tutorials/artificial-intelligence-applications/

1. Healthcare
One of the deepest impacts AI has created is in the healthcare space. A device as common as a Fitbit or an Apple Watch collects a lot of data, such as an individual's sleep patterns, calories burnt and heart rate, which can help with early detection, personalization and even disease diagnosis.
Such a device, when powered with AI, can easily monitor and notify abnormal trends. It can even schedule a visit to the closest doctor by itself, and it is of great help to doctors, who can use AI to support decision-making and research. AI has been used to predict ICU transfers, improve clinical workflows and even pinpoint a patient's risk of hospital-acquired infections.

2. Automobile
At a stage where automobiles are changing from an engine with a chassis around it to a software-controlled intelligent machine, the role of AI cannot be underestimated. In the push towards self-driving cars, in which Autopilot by Tesla has been the frontrunner, data is taken from all the Teslas running on the road and used in machine learning algorithms. The assessments of both onboard chips are matched by the system and followed only if the input from both is the same.

3. Banking and Finance


One of the early adopters of Artificial Intelligence is the banking and finance industry. Features like AI bots, digital payment advisers and biometric fraud detection mechanisms deliver a higher quality of service to a wider customer base. The adoption of AI in banking continues to rework companies within the industry, providing greater value and more personalized experiences to their customers, reducing risks, and increasing opportunities involving the financial engines of our modern economy.

4. Surveillance
AI has made it possible to develop face recognition tools which may be used for surveillance and security purposes. This empowers systems to monitor footage in real time and can be a pathbreaking development with regard to public safety.
Manual monitoring of a CCTV camera requires constant human intervention, so it is prone to errors and fatigue. AI-based surveillance is automated and works 24/7, providing real-time insights. According to a report by the Carnegie Endowment for International Peace, at least 75 out of 176 countries are using AI tools for surveillance purposes. Across the country, 400 million CCTV cameras are already in place, powered by AI technologies, primarily face recognition.

5. Social Media
Social media is not just a platform for networking and expressing oneself. It subconsciously shapes our choices, ideologies and temperament.
This is due to the artificial intelligence tools which work silently in the background, showing us posts that we "might" like and advertising products that "might" be useful based on our search and browsing history.
This helps with social media advertising because of its unprecedented ability to run paid ads to platform users based on highly granular demographic and behavioural targeting.

6. Entertainment
The entertainment business, with the arrival of online streaming services like Netflix and Amazon Prime, relies heavily on the data collected from its users.
This helps with recommendations based upon previously viewed content. This is done not only to deliver accurate suggestions but also to create content that would be liked by a majority of viewers.
With new content being created every minute, it is very difficult to classify it and make it easy to search. AI tools analyse the contents of videos frame by frame and identify objects to add appropriate tags. AI is additionally helping media companies to make strategic decisions.

7. Education
In the education sector too, there are a number of problems that can be solved by the implementation of AI.
A few of them are automated marking software, content retention techniques and suggesting improvements that are required. This can help teachers monitor not just the academic performance but also the psychological, mental and physical well-being of students, as well as their all-round development.
This would also help in extending the reach of education to areas where quality educators cannot be present physically.

8. Space Exploration
AI systems are being developed to reduce the risk to human life involved in venturing into the vast realms of the undiscovered and unravelled universe, which is a very risky task for astronauts to take up.
As a result, unmanned space exploration missions like the Mars rovers are possible thanks to AI. It has helped us discover numerous exoplanets, stars, galaxies and, more recently, two new planets in our very own solar system.
NASA is also working with AI applications for space exploration to automate image analysis, to develop autonomous spacecraft that could avoid space debris without human intervention, and to make communication networks more efficient and distortion-free using AI-based devices.

9. Gaming
In the gaming industry too, game systems powered by AI are ushering us into a new era of immersive experience. Here AI serves to enhance the game-player experience rather than being used purely for machine learning or decision-making. AI has also been playing a huge role in creating video games and making them more tailored to players' preferences.

10. Robotics
With increasing developments within the field of AI, robots are becoming more
efficient in performing tasks that earlier were too complex.
AI in robotics helps the robots to learn the processes and perform the tasks with
complete autonomy, without any human intervention. This is because robots are
designed to perform repetitive tasks with utmost precision and increased speed.
AI has been introducing flexibility and learning capabilities in previously rigid
applications of robots. These benefits are expected to reinforce the market growth.

11. Agriculture
Artificial Intelligence is changing the way we practise one of our most primitive and basic professions: farming. The use of AI in agriculture can be attributed to agriculture robots, predictive analysis, and crop and soil monitoring.
In addition, drones are also used for spraying insecticides and detecting weed formation in large farms. This is helping firms like Blue River Technology better manage their farms.



AI has also enhanced crop production and improved real-time monitoring,
harvesting, processing and marketing.

12. E-Commerce
This is one of the most widely used applications of Artificial Intelligence. Different departments of e-commerce, including logistics, demand prediction, intelligent marketing, better personalization and the use of chatbots, are being disrupted by AI. The e-commerce industry, with Amazon as a prominent player, was one of the first industries to embrace AI, and its use of AI will only grow with time.
E-commerce retailers are increasingly turning towards chatbots or digital assistants to provide 24x7 support to their online buyers.

1.1.2 Evolution of Artificial Intelligence

Artificial Intelligence is not a new word or a new technology for researchers. The idea is much older than you might imagine: there are even myths of mechanical men in ancient Greek and Egyptian mythology. The following are some milestones in the history of AI which trace the journey from AI's origins to its development to date.

History of Artificial Intelligence

Image: History of AI
Reference: https://www.javatpoint.com/history-of-artificial-intelligence



Maturation of Artificial Intelligence (1943-1952)

Year 1943: The first work now recognized as AI was done by Warren McCulloch and Walter Pitts in 1943. They proposed a model of artificial neurons.
Year 1949: Donald Hebb demonstrated an updating rule for modifying the connection strength between neurons. His rule is now called Hebbian learning.
Year 1950: Alan Turing, an English mathematician, pioneered machine learning in 1950. He published "Computing Machinery and Intelligence", in which he proposed a test that checks a machine's ability to exhibit intelligent behaviour equivalent to human intelligence, now called the Turing test.

The birth of Artificial Intelligence (1952-1956)


Year 1955: Allen Newell and Herbert A. Simon created the first artificial intelligence program, named "Logic Theorist". This program proved 38 of 52 mathematics theorems and found new and more elegant proofs for some of them.
Year 1956: The term "Artificial Intelligence" was first adopted by the American computer scientist John McCarthy at the Dartmouth Conference, and for the first time AI was coined as an academic field. At that time high-level computer languages such as FORTRAN, LISP and COBOL were being invented, and enthusiasm for AI was very high.

The golden years-Early enthusiasm (1956-1974)


Year 1966: Researchers emphasized developing algorithms which could solve mathematical problems. Joseph Weizenbaum created the first chatbot in 1966, which was named ELIZA.
Year 1972: The first intelligent humanoid robot, named WABOT-1, was built in Japan.

The first AI winter (1974-1980)


The duration between 1974 and 1980 was the first AI winter. An AI winter refers to a period in which computer scientists dealt with a severe shortage of government funding for AI research. During AI winters, public interest in artificial intelligence decreased.

A boom of AI (1980-1987)
Year 1980: After the AI winter, AI came back with "expert systems". Expert systems were programs that emulate the decision-making ability of a human expert. In the year 1980, the first national conference of the American Association of Artificial Intelligence was held at Stanford University.

The second AI winter (1987-1993)


The duration between 1987 and 1993 was the second AI winter. Investors and governments again stopped funding AI research due to high costs and inefficient results. Even expert systems such as XCON proved very expensive to maintain.
The emergence of intelligent agents (1993-2011)
Year 1997: In 1997, IBM's Deep Blue beat world chess champion Garry Kasparov, becoming the first computer to beat a world chess champion.
Year 2002: For the first time, AI entered the home in the form of Roomba, a vacuum cleaner.
Year 2006: By 2006, AI had entered the business world. Companies like Facebook, Twitter and Netflix started using AI.

Deep learning, big data and artificial general intelligence (2011-present)

Year 2011: In 2011, IBM's Watson won Jeopardy!, a quiz show in which it had to solve complex questions as well as riddles. Watson proved that it could understand natural language and solve tricky questions quickly.
Year 2012: Google launched an Android app feature, "Google Now", which was able to provide information to the user as predictions.
Year 2014: In 2014, the chatbot "Eugene Goostman" won a competition based on the famous Turing test.
Year 2018: IBM's "Project Debater" debated complex topics with two master debaters and performed extremely well. Google demonstrated an AI program, "Duplex", a virtual assistant that booked a hairdresser appointment over a phone call, and the lady on the other side did not notice that she was talking to a machine.
Now AI has developed to a remarkable level. Concepts such as deep learning, big data and data science are now trending like a boom. Nowadays companies like Google, Facebook, IBM and Amazon are working with AI and creating amazing devices. The future of Artificial Intelligence is inspiring and will come with high intelligence.

Let's get started with some practical demonstrations of AI applications. The very first practical is based on text analytics, which determines the sentiment of your message, typed or spoken, and predicts whether the message is positive, negative or neutral.

1.2 Artificial Intelligence Market Trends


AI has entered a new stage to become fully commercialized, exerting different
impacts on players of traditional industries and driving changes in the ecosystems of
these industries. Such changes are mainly seen at three levels.

Enterprise change

AI is engaged in the management and production processes of the enterprise, with a trend of becoming increasingly commercialized, and some enterprises have realized relatively mature intelligent applications. These enterprises have been able to collect and make use of user information from multiple dimensions via various technological means and provide consumers with pertinent products and services, while at the same time satisfying their potential needs through insights into development trends gained via data optimization.

1. Enterprise change: Sales, Security, Anti-fraud, HR management, Marketing, Personal assistant, Smart tools

Industry change

The change brought by AI will drive fundamental changes in the relationship between the upstream and downstream sectors of the traditional industry chain. The engagement of AI has expanded the types of upstream product providers, and users may also shift from individual consumers to enterprise consumers.

2. Industry change: Finance, Healthcare, Education, Autonomous driving, Retail, Manufacturing, Digital government, Media, Legal, Agriculture, Logistics, Oil & gas
Labour change

The application of new technologies such as AI is enhancing the efficiency of information use and reducing the number of employees needed. In addition, the wider use of robots will also replace labour in repetitive tasks and increase the percentage of technological and management personnel, bringing changes to the labour structures of enterprises.

3. Labour change: Augmented reality, Gesture recognition, Robotics, Emotion recognition

1.2.1 AI Market Investment

As the capital market deepens its understanding of AI, market investment in AI is maturing and returning to rationality. During the past five years, China's investment in AI grew rapidly, with a total investment of RMB45 billion in 2015, the starting year of China's AI development, and investment frequency continued to increase in 2016 and 2017. The first half of 2019 saw a total investment of over RMB47.8 billion in China's AI sector, with great achievements.

Image: Changes of AI investment and financing


Reference: Public information, Deloitte Research

1.2.2 AI Market Opportunities

Investment and financing data in recent years show that investment frequencies and amounts raised in business services, robotics, healthcare, industry solutions, basic components and finance are all higher than those in other sectors. From the enterprise perspective, those with a top global team, financial strength and strong technology genes are more favoured by secondary market investors. From the industry perspective, however, new retail, autonomous driving, healthcare and education, all easy to deploy, indicate more opportunities, and companies engaged in such sectors could see more investment opportunities.



Image: Market opportunities and career in AI
Reference: http://www.oreilly.com/data/free/the-new-artificial-intelligence-market.csp

1.2.3 AI Career Opportunities

The World Economic Forum's "The Future of Jobs 2018" report aims to base the debate about the future of work on facts rather than speculation. By tracking the acceleration of technological change as it gives rise to new job roles, occupations and industries, the report evaluates the changing contours of work in the Fourth Industrial Revolution.

One of the primary drivers of change identified is the role of emerging technologies,
such as artificial intelligence (AI) and automation. The report seeks to shed more
light on the role of new technologies in the labour market, and to bring more clarity to
the debate about how AI could both create and limit economic opportunity. With 575
million members globally, LinkedIn’s platform provides a unique vantage point into
global labour-market developments, enabling us to support the Forum's examination
of the trends that will shape the future of work.

Our analysis uncovered two concurrent trends: the continued rise of tech jobs and
skills, and, in parallel, a growth in what we call “human-centric” jobs and skills. That
is, those that depend on intrinsically human qualities.



Image: Career opportunities in AI
Reference: https://www.weforum.org/agenda/2018/09/artificial-intelligence-shaking-up-job-market/

Tech jobs like software engineers and data analysts, along with technical skills such
as cloud computing, mobile application development, software testing and AI, are on
the rise in most industries and across all regions. But several highly “automatable”
jobs fall into the top 10 most declining occupations – i.e., jobs that have seen the
largest decreases in share of hiring over the past five years. These occupations
include administrative assistants, customer service representatives, accountants,
and electrical/mechanical technicians, many of which depend on more repetitive
tasks.



AI Subdomains

Reference - https://static.javatpoint.com/tutorial/ai/images/subsets-of-ai.png

1.3 Introduction to Machine Learning


Machine learning is the study of computer algorithms that improve automatically
through experience and has been central to AI research since the field’s inception.
ML is broadly classified into supervised, unsupervised and reinforcement learning.
Unsupervised learning is the ability to find patterns in a stream of input. Supervised
learning includes both classification and numerical regression.

Classification is used to determine what category something belongs in, after seeing
a number of examples of things from several categories. Regression is the attempt
to produce a function that describes the relationship between inputs and
outputs and predicts how the outputs should change as the inputs change. In
reinforcement learning the agent is rewarded for good responses and punished for
bad ones.

A machine learning pipeline helps to automate the ML workflow, enabling data to be transformed and correlated together in a model to analyze and achieve outputs. An ML pipeline is constructed to allow the flow of data from its raw format to some valuable information. It provides a mechanism to build a multi-ML parallel pipeline system to examine the outcomes of different ML methods. The objective of the machine learning pipeline is to exercise control over the ML model. A well-planned pipeline makes the implementation more flexible; it is like having an overview of the code, so faults can be picked out and replaced with correct code.



A pipeline consists of several stages. Each stage of a pipeline is fed with the data processed by its preceding stage; that is, the output of one processing unit is supplied as input to the next step. A machine learning pipeline consists of four main stages: pre-processing, learning, evaluation and prediction, described below and followed by a short code sketch.

1. Pre-processing
Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is usually incomplete, inconsistent, lacking certain behaviours or trends, and likely to contain many inaccuracies. The process of getting usable data for a machine learning algorithm follows steps such as feature extraction and scaling, feature selection, dimensionality reduction and sampling. The product of data pre-processing is the final dataset used for training the model and for testing purposes.

2. Learning
A learning algorithm is used to process understandable data to extract patterns
appropriate for application in a new situation. In particular, the aim is to utilize a
system for a specific input-output transformation task. For this, choose the best-
performing model from a set of models produced by different hyperparameter
settings, metrics, and cross-validation techniques.

3. Evaluation
To Evaluate the Machine Learning model’s performance, fit a model to the training
data, and predict the labels of the test set. Further, count the number of wrong
predictions on the test dataset to compute the model’s prediction accuracy.

4. Prediction
The model's performance is finally determined on a test data set that was not used for any training or cross-validation activities.
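
To make these four stages concrete, here is a minimal sketch using scikit-learn, which ships with the Anaconda distribution introduced in the next chapter. The iris dataset, the scaler and the logistic regression model are illustrative choices made for this sketch, not steps prescribed by the handbook.

# A minimal sketch of the four pipeline stages with scikit-learn.
# The dataset and model choices are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Pre-processing and learning are chained in one pipeline object
pipe = Pipeline([
    ("scale", StandardScaler()),                   # pre-processing: feature scaling
    ("model", LogisticRegression(max_iter=200)),   # learning
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe.fit(X_train, y_train)                         # learning on the training split

y_pred = pipe.predict(X_test)                      # prediction on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred)) # evaluation

Chaining pre-processing and learning in a single Pipeline object keeps the same transformations applied consistently at training time and at prediction time.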

1.3.1 Machine Learning Applications

Machine learning is one of the most exciting technologies that one could come across. As is evident from the name, it gives the computer the quality that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect. We probably use a learning algorithm dozens of times without even knowing it. Applications of machine learning include:

a. Web Search Engine


One of the reasons why search engines like Google and Bing work so well is because the system has learnt how to rank pages through a complex learning algorithm.
b. Photo tagging Applications
Be it Facebook or any other photo tagging application, the ability to tag friends makes it even more engaging. This is all possible because of a face recognition algorithm that runs behind the application.

c. Spam Detector
Mail agents like Gmail or Hotmail do a lot of hard work for us in classifying mails and moving spam mails to the spam folder. This is achieved by a spam classifier running in the back end of the mail application.

The key differences between AI and ML are:

1. AI stands for Artificial Intelligence, where intelligence is defined as the ability to acquire and apply knowledge. ML stands for Machine Learning, which is defined as the acquisition of knowledge or skill.
2. The aim of AI is to increase the chance of success, not accuracy. The aim of ML is to increase accuracy; it does not care about success.
3. AI works as a computer program that does smart work. ML is a simple concept: a machine takes data and learns from the data.
4. The goal of AI is to simulate natural intelligence to solve complex problems. The goal of ML is to learn from data on a certain task to maximize the machine's performance on that task.
5. AI is about decision making. ML allows a system to learn new things from data.
6. AI leads to developing a system that mimics how a human responds and behaves in a circumstance. ML involves creating self-learning algorithms.
7. AI will go for finding the optimal solution. ML will go for a solution, whether it is optimal or not.
8. AI leads to intelligence or wisdom. ML leads to knowledge.



Image: Artificial Intelligence
AI is the capability of a computer to imitate intelligent human behaviour. Through AI,
machines can analyse images, comprehend speech, interact in natural ways, and
make predictions using data.

ML is an AI technique that uses mathematical algorithms to create predictive models. An algorithm is used to parse data fields and to "learn" from that data by using patterns found within it to generate models. Those models are then used to make informed predictions or decisions about new data. The predictive models are validated against known data, measured by performance metrics selected for specific business scenarios, and then adjusted as needed. This process of learning and validation is called training. Through periodic retraining, ML models are improved over time.

Deep learning is a type of ML that can determine for itself whether its predictions
are accurate. It also uses algorithms to analyze data, but it does so on a larger scale
than ML. Deep learning uses artificial neural networks, which consist of multiple
layers of algorithms. Each layer looks at the incoming data, performs its own
specialized analysis, and produces an output that other layers can understand. This
output is then passed to the next layer, where a different algorithm does its own
analysis, and so on.

With many layers in each neural network, and sometimes using multiple neural networks, a machine can learn through its own data processing. This requires much more data and much more computing power than ML.

1.4 PRACTICALS: Microsoft AI Demos

We will next discuss some interactive demos related to text analytics and language understanding on the Microsoft AI platform. In the context of text analytics, the Microsoft Cognitive Services Text Analytics API determines the sentiment of your message, typed or spoken. Microsoft Cognitive Services Language Understanding interprets human language and understands the intent.
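
For readers who would rather call the service from code than use the web demo, the sketch below shows roughly how a sentiment request could be sent from Python with the requests library. The endpoint, API version (v3.0) and key shown are placeholders and assumptions about the Azure Cognitive Services REST interface; take the exact values from your own Azure resource.

# Hedged sketch: calling the Text Analytics sentiment endpoint.
# Replace the placeholder endpoint and key with values from your own
# Azure resource; the v3.0 path is an assumption about the service version.
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"   # placeholder
key = "<your-subscription-key>"                                    # placeholder

documents = {"documents": [
    {"id": "1", "language": "en", "text": "I love this handbook!"}
]}

response = requests.post(
    endpoint + "/text/analytics/v3.0/sentiment",
    headers={"Ocp-Apim-Subscription-Key": key},
    json=documents,
)
print(response.json())   # expected to contain a sentiment label and confidence scores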

1.4.1 Text Analytics

1. Go to website https://aidemos.microsoft.com/
2. Select Text Analytics and click on “Try it out>” as shown in figure

3. Enter any Message that you want for Text Analytics

4. Click on Next Step



5. The API analyzes your message to identify the keywords and understand the sentiment; you can see the selected keywords on the screen. Then click on Next Step.

6. The API will select the keywords as entities and link them to Wikipedia. Then click on Next Step.



7. Click on Next Step

Now, after text analytics, let's try another AI application, "Language Understanding", where you can give commands as text or voice and, after understanding the command, it takes decisions accordingly. It is thus an application of Natural Language Processing. Let's get started.

1.4.2 Language Understanding

1. Go to website https://aidemos.microsoft.com/
2. Select Language Understanding and click on “Try it out>” as shown in figure



3. Click on See it in Action

4. You can give your Commands (Either by text or voice) and Switches will glow
accordingly in the house next to it.



5. Give Command Lights off and Click on Apply Button.

6. Result



7. Give Command Call Batman, and Click on Apply.

Result



We will next try out an interesting machine learning application called Lobe, which can perform image classification in a few steps: labelling your images, training a model and understanding your results. Let's get started.

1.4.3 Image Recognition with Lobe

1. Visit the https://lobe.ai/ website


2. Click on Download



3. Enter Name, Email ID, Purpose. Select the Country and Click Download.

4. File is being Downloaded

5. Start Installing the Downloaded file. And Click NEXT



6. Installation will take place. After that Click on Finish

7. Click on Agreed



8. Click on Okay

9. Click on Get Started



10. Name the project and click on Import to upload images from the system.
(NOTE: This example project trains the system to tell the difference between fruits and vegetables.)

11. Click on Images



12. Select all your project images

13. Label them accordingly, e.g., Vegetables and Fruits



14. Training will be done automatically. Then click on USE.

15. Click on Import images for Testing (Select an image for testing)



16. The result will be shown below. If it is correct, click on Right; if it is wrong, click on Wrong, which will help the system learn more.



So, in this chapter we have seen the history, market trends and opportunities in the very exciting field of Artificial Intelligence, and discussed the fundamental concepts of AI and machine learning. We also practically explored some applications of AI and machine learning, such as sentiment analysis, Natural Language Processing and image classification, and gained an overall understanding of what AI and ML are and their real-life applications.
In the next chapter we will discuss data analysis, the Anaconda software and key concepts of the Python programming language.



Chapter 2: Data Analysis with Python
Learning Outcomes:

 Explore the fundamentals of Data Analysis
 Learn how to use Anaconda Software and Python Libraries
 Understand key concepts of Python Programming

2.1 What is Data Analysis?

Data analysis is a process of inspecting, cleansing, transforming, modelling and questioning data with the goal of discovering useful information, informing conclusions and supporting decision-making. It is a subset of data analytics that includes specific processes. Data analytics, on the other hand, is a broader field of using data and tools to make business decisions. Data is the oil of our time, the new electricity: it gets collected, moved and refined. Data in its raw form has no value; instead, it is what you do with that data that provides value. Data analytics includes all the steps you take to discover, interpret, visualize and tell the story of patterns in your data in order to drive business strategy and outcomes. Data analytics, if done well, will help to find trends, uncover opportunities, predict actions, triggers or events, and make decisions.
Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science and social science domains. In today's business world, data analysis plays a role in making more scientific decisions and helping businesses operate more effectively.

The data analytics lifecycle encompasses six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment. Let us understand these six phases below:



Fig: Data Analytics Process

1. Business Understanding: The fundamental requirement is to understand the client and business objectives and define the data mining goals. Resources, constraints and assumptions in the current data mining scenario should also be taken into consideration.
2. Data Understanding: In this stage, the data is collected from various sources within the organization. A sanity check is conducted to understand whether it is appropriate for the data mining goals. This is a highly complex process, since data and processes from various sources are unlikely to match easily.
3. Data Preparation: The data is made production-ready in this stage, as the data from diverse sources is cleaned, transformed, formatted and anonymized. The data is cleaned by smoothing noisy data and filling in missing values.
4. Modelling: Mathematical models as well as modelling techniques are chosen to determine the data patterns. After that, a scenario is created to validate the model, and the model is run on the prepared data set.
5. Evaluation: In this stage, the patterns recognized are examined against the business objectives. A go or no-go decision should be taken to move the model to the deployment phase.
6. Deployment: A thorough deployment plan for shipping, maintenance and monitoring of the data mining discoveries is created.

There are different types of data analytics techniques. We will discuss them next.



2.1.1 Data Analytics and its types

Business firms nowadays commonly emphasize business data to drive business decisions and improve business performance. But data alone, with its facts and figures, is meaningless unless we gain valuable insights that lead to more-informed actions. Analytics offers a convenient way to get actionable insights into the data for making better decisions, but the number of solutions on the market, belonging to different categories of analytics, can be daunting. It is also critical to design and build a data warehouse or Business Intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse data sets. Organizations can make sense of it all by understanding the different types of data analytics, which are broadly classified into the following four types:

1. Predictive (what is likely to happen?)
2. Descriptive (what is happening?)
3. Prescriptive (what do I need to do?)
4. Diagnostic (why is it happening?)

Predictive Analytics: Predictive analytics turns the data into valuable, actionable
information. Predictive analytics uses data to determine the probable outcome of an
event or a likelihood of a situation occurring. Techniques that are used for predictive
analytics are:
 Linear Regression
 Time series analysis and forecasting
 Data Mining

Descriptive Analytics: Descriptive analytics looks at data and analyzes past events for insight as to how to approach future events. It looks at past performance and understands it by mining historical data to understand the cause of success or failure in the past. Almost all management reporting, such as sales, marketing, operations and finance, uses this type of analysis. Common examples of descriptive analytics are company reports that provide historic reviews, such as: 1) Data queries 2) Reports 3) Descriptive statistics 4) Data dashboards

Prescriptive Analytics: Prescriptive analytics automatically synthesizes big data, mathematical science, business rules and machine learning to make a prediction, and then suggests a decision option to take advantage of the prediction. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demography, etc.

Diagnostic Analytics: In this analysis, we generally use historical data over other data to answer a question or to solve a problem. We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem. Common techniques used for diagnostic analytics are: 1) Data discovery 2) Data mining 3) Correlations
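
As a small illustration of the correlation technique listed above, the sketch below computes a pairwise correlation matrix with pandas. The column names and figures are invented purely for illustration.

# Illustrative only: a tiny made-up dataset to show how correlations
# can support diagnostic analytics with pandas.
import pandas as pd

data = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "visits":   [110, 190, 320, 390, 510],
    "sales":    [12, 22, 29, 41, 48],
})

# The pairwise correlation matrix helps reveal which factors move together
print(data.corr())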

Tools for Data Analysis

Many tools are available in the market which make it easier for us:

1) To process and manipulate data
2) To analyze the relationships and correlations between data sets
3) To identify patterns and trends for interpretation

 Python
 Tableau
 R Programming
 Power BI

We next discuss Python, which we will be using for data analysis.

2.2 Anaconda Software and Introduction to Python


Anaconda is a free, open-source data science tool that focuses on the distribution of the R and Python programming languages for data science and machine learning tasks. Anaconda aims at simplifying data management and deployment. Anaconda is popular because it brings many of the tools used in data science and machine learning with just one install, so it's great for a short and simple setup. Anaconda also uses the concept of environments to isolate different libraries and versions.

The package manager of Anaconda is conda, which manages package versions. Anaconda is written in Python, and the package manager conda checks for required dependencies and installs them if needed. More importantly, warnings are given if the dependencies already exist. Anaconda comes pre-built with more than 1,500 Python or R data science packages. Anaconda has specific tools to collect data using machine learning and artificial intelligence. The distribution includes data science packages suitable for Windows, Linux and macOS.

Anaconda Individual Edition contains Conda and Anaconda Navigator, as well as


Python and hundreds of scientific packages. When you installed Anaconda, you
installed all these too. Conda works on your command line interface such as
Anaconda Prompt on Windows and terminal on macOS and Linux.



Navigator is a desktop graphical user interface that allows you to launch applications
and easily manage conda packages, environments, and channels without using
command-line commands.

Comparison between Anaconda and Python

Definition: Anaconda is an enterprise data science platform that distributes R and Python for machine learning and data science. Python is a high-level, general-purpose programming language used for machine learning and data science.

Category: Anaconda belongs to data science tools. Python belongs to computer languages.

Package Manager: Anaconda has conda as its package manager. Python has pip as its package manager.

User Applications: Anaconda is primarily developed to support data science and machine learning tasks. Python is used not only in data science and machine learning but also in a variety of applications in embedded systems, web development and networking programs.

Package Management: The package manager conda allows Python as well as non-Python library dependencies to be installed. The package manager pip allows Python dependencies to be installed.

2.2.1 Anaconda Navigator

Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that allows users to launch applications and manage conda packages, environments and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run the packages and update them. It is available for Windows, macOS and Linux. The following applications are available by default in Navigator:
 JupyterLab
 Jupyter Notebook
 QtConsole
 Spyder
 Glue
 Orange
 RStudio
 Visual Studio Code

2.2.2 Anaconda Cloud


Anaconda Cloud is a package management service by Anaconda where users can
find, access, store and share public and private notebooks, environments, and conda
and PyPI packages. Cloud hosts useful Python packages, notebooks and
environments for a wide variety of applications. Users do not need to log in or to
have a Cloud account, to search for public packages, download and install them.
Users can build new packages using the Anaconda Client command line interface
(CLI), then manually or automatically upload the packages to Cloud.

Now let's get started with Python, a very popular programming language used in data science and a variety of other tasks such as website building and server-side programming.

2.2.3 Introduction to Python

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse.
Released in 1991, Python is used for:

 Web development (server-side),
 Software development,
 Mathematics,
 System scripting.

What can Python do?

 Python can be used on a server to create web applications.
 Python can be used alongside software to create workflows.
 Python can connect to database systems. It can also read and modify files.
 Python can be used to handle big data and perform complex mathematics.
 Python can be used for rapid prototyping, or for production-ready software development.



Why Python?

 Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
 Python has a simple syntax like the English language.
 Python has syntax that allows developers to write programs with fewer lines
than some other programming languages.
 Python runs on an interpreter system, meaning that code can be executed as
soon as it is written. This means that prototyping can be very quick.
 Python can be treated in a procedural way, an object-oriented way or a
functional way.

Python Syntax compared to other programming languages

 Python was designed for readability and has some similarities to the English
language with influence from mathematics.
 Python uses new lines to complete a command, as opposed to other
programming languages which often use semicolons or parentheses.
 Python relies on indentation, using whitespace, to define scope; such as the
scope of loops, functions and classes. Other programming languages often
use curly-brackets for this purpose.
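
The short, hypothetical snippet below illustrates the last point: indentation alone defines the scope of the function, the loop and the if block.

# Indentation (not curly brackets) defines scope in Python
def greet(names):
    for name in names:
        if name:                      # inside the loop
            print("Hello,", name)     # inside the if block
    print("Done")                     # back at function level

greet(["Asha", "Ravi"])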



2.2.4 Installing Anaconda
1. Go to anaconda.com, and click on Download. You can choose your operating
system.

2. Locate the downloaded file



3. Run the setup.

a. Click on Next

b. Agree to the Terms.



c. For beginners, or individual users “Just me”.

d. Browse the location, where you want to install. Keep it default, & Next.



e. Let it be added to PATH Environment Variables, to make it readily
available through command prompt.

f. Click on Install.
g. The installation process may take a few minutes.



h. Once installation is complete, press next.

i. Other packages.



j. Finish Setup

2.2.5 Anaconda Prompt


1. Start Anaconda prompt from Start Menu
2. Use the following command to check your conda installation: conda info



3. Check the environments available:
conda info --envs
conda env list

4. Welcome to Anaconda Prompt


5. Update the version, if needed
conda update -n base -c defaults conda
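
6. Optionally, create and activate a separate environment for this course so its libraries stay isolated; the environment name and Python version below are only examples:
conda create -n ai-course python=3.9
conda activate ai-course
conda install numpy pandas matplotlib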



2.2.6 Anaconda Navigator

To get hold of Anaconda, we need to get used to both the CLI and the GUI of the software. We previously used the CLI; now we will look at the GUI.

1. Start ANACONDA NAVIGATOR from Start Menu or Search menu

2. It provides you with the following GUI



3. Welcome to Anaconda Navigator. You can launch the different packages from
the Navigator.

Jupyter Notebook

Jupyter is a free, open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document; in simple words, an IDE. Its purpose is "to support interactive data science and scientific computing across all programming languages." Since we already have Jupyter Notebook installed on our systems, we can launch it through Anaconda Navigator:

1. Open Anaconda Navigator and launch Jupyter Notebook

2. This opens a local Jupyter Notebook on your machine, in the browser.



3. Alternatively, you may try the online Jupyter notebook at https://jupyter.org/try

So, after setting up our programming environment, let's start with some basic building blocks of the Python language.

2.2.7 Variables in Python


Variables are containers for storing data values.



What Are Variables in Python?
Variables, as the name suggests, are values that can vary. In a programming language, a variable is a memory location where you store a value. The value that you have stored may change in the future according to the specifications.

Variable Memory

A variable in Python is created as soon as a value is assigned to it. No additional command is needed to declare a variable in Python. There are some basic rules for naming a variable in Python:

 Must begin with a letter (a - z, A - Z) or underscore (_)
 Other characters can be letters, numbers or _
 Names are case sensitive
 Can be any (reasonable) length
 There are some reserved words which you cannot use as a variable name because Python uses them for other things, e.g., for, while, continue, break, etc.
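
A short illustrative example of creating variables (the names and values below are arbitrary):

# Variables are created the moment a value is assigned
name = "Asha"        # string
age = 21             # integer
height_m = 1.62      # float
_is_student = True   # boolean; names may start with an underscore

age = 22             # the stored value can change later
print(name, age, height_m, _is_student)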

2.2.8 Advanced Datatypes in Python

Data type defines the format, sets the upper & lower bounds of the data so that a
program could use it appropriately. In Python, we don’t need to declare a variable
with explicitly mentioning the data type. This feature is famously known as dynamic
typing. Python determines the type of a literal directly from the syntax at runtime. For
example – the quotes mark the declaration of a string value, square brackets
represent a list and curly brackets for a dictionary. Also, the non-decimal numbers
will get assigned to Integer type whereas the ones with a decimal point will be a float.



Below is the list of important data types that are commonly used in Python.
 Booleans
 Numbers
 Strings
 Bytes
 Lists
 Tuples
 Sets
 Dictionaries

Fig: Datatypes in Python

In Python, numeric data type represents the data which has numeric value. Numeric
value can be integer, floating number or even complex numbers. These values are
defined as int, float and complex class in Python.
 Integers – This value is represented by int class. It contains positive or
negative whole numbers (without fraction or decimal). In Python there is
no limit to how long an integer value can be.
 Float – This value is represented by float class. It is a real number with
floating point representation. It is specified by a decimal point. Optionally,
the character e or E followed by a positive or negative integer may be
appended to specify scientific notation.
 Complex Numbers – Complex number is represented by complex class.
It is specified as (real part) + (imaginary part)j. For example – 2+3j
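A small sketch showing the three numeric classes described above (the values are arbitrary):

a = 42            # int
b = 3.14          # float
c = 1.5e3         # float written in scientific notation (1500.0)
d = 2 + 3j        # complex
print(type(a), type(b), type(c), type(d))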



2.2.9 Sequence Type
In Python, a sequence is an ordered collection of similar or different data types.
Sequences allow us to store multiple values in an organized and efficient fashion.
There are several sequence types in Python –
 String
 List
 Tuple
1) String
In Python, strings are sequences of Unicode characters. A string is a collection of one or more characters enclosed in single quotes, double quotes or triple quotes. In Python there is no separate character data type; a character is simply a string of length one. Strings are represented by the str class.

Creating String: Strings in Python can be created using single quotes or double
quotes or even triple quotes.



2) List
Lists are just like the arrays declared in other languages: an ordered collection of data. They are very flexible, as the items in a list do not need to be of the same type. Lists in Python can be created by just placing the sequence inside square brackets []. Lists are mutable, which means they can be changed.

3) Tuple
Just like a list, a tuple is an ordered collection of Python objects. The only difference between a tuple and a list is that tuples are immutable, i.e., a tuple cannot be modified after it is created, and a tuple uses () brackets. It is represented by the tuple class. Tuples can contain any number of elements of any datatype (like strings, integers, lists, etc.).
Note: Tuples can also be created with a single element, but it is a bit tricky. Having
one element in the parentheses is not sufficient, there must be a trailing ‘comma’ to
make it a tuple.
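The sketch below illustrates the three sequence types discussed above, including the trailing comma needed for a single-element tuple (the values are illustrative only):

s = 'Hello'                 # string (single quotes)
s2 = "World"                # string (double quotes)
my_list = [1, "two", 3.0]   # list: mutable, mixed types allowed
my_list[0] = 100            # lists can be changed
my_tuple = (1, "two", 3.0)  # tuple: immutable
single = (5,)               # trailing comma makes this a tuple
not_a_tuple = (5)           # this is just the integer 5
print(s, s2, my_list, my_tuple, type(single), type(not_a_tuple))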



4) Boolean
Data type with one of the two built-in values, True or False. Integers and floating
point numbers can be converted to the boolean data type using Python's bool()
function. An int, float or complex number set to zero returns False. An integer, float
or complex number set to any other number, positive or negative, returns True. It is
denoted by the class bool.
Note – The keywords True and False must have an UpperCase first letter. Using
lowercase true will throw an error.

5) Set
In Python, a set is an unordered collection of data that is iterable, mutable and has no duplicate elements. The order of elements in a set is undefined, though it may consist of various elements.
Creating Sets
Sets can be created by using the built-in set() function with an iterable object or a
sequence by placing the sequence inside curly braces, separated by ‘comma’. Type
of elements in a set need not be the same, various mixed-up data type values can
also be passed to the set.



6) Dictionary
Dictionary in Python is an unordered collection of data values, used to store data like a map. Unlike other data types that hold only a single value as an element, a dictionary holds key:value pairs, which makes lookups more optimized. Within each key-value pair, the key and the value are separated by a colon (:), whereas the pairs themselves are separated by a ‘comma’.
Creating Dictionary
In Python, a Dictionary can be created by placing a sequence of elements within
curly {} braces, separated by ‘comma’. Values in a dictionary can be of any datatype
and can be duplicated, whereas keys can’t be repeated and must be immutable.
Dictionary can also be created by the built-in function dict(). An empty dictionary
can be created by just placing it to curly braces{}.
Note – Dictionary keys are case sensitive, same name but different cases of Key will
be treated distinctly.

2.2.10 Functions & Methods in Python


A function is a block of organized, reusable code that is used to perform a single,
related action. Functions provide better modularity for your application and a high
degree of code reusing. Python gives us many built-in functions like print(), etc. but
you can also create your own functions. These functions are called user-defined
functions.

Defining a Function
You can define functions to provide the required functionality. Here are simple rules
to define a function in Python.



 Function blocks begin with the keyword def followed by the function name
and parentheses ( ).
 Any input parameters or arguments should be placed within these
parentheses. You can also define parameters inside these parentheses.
 The first statement of a function can be an optional statement - the
documentation string of the function or docstring.
 The code block within every function starts with a colon (:) and is indented.
 The statement return [expression] exits a function, optionally passing back an
expression to the caller. A return statement with no arguments is the same as
return None.
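Putting these rules together, a minimal user-defined function might look like this (the name and logic are only an example):

def add_numbers(a, b):
    """Return the sum of a and b."""   # optional docstring
    result = a + b
    return result                      # return exits the function

print(add_numbers(3, 4))   # 7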

2.2.11 Method
A method in python is somewhat similar to a function, except it is associated with
object/classes. Methods in python are very similar to functions except for two major
differences.
 The method is implicitly used for an object for which it is called.
 The method is accessible to data that is contained within the class.
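For contrast, here is a hedged sketch of a method, i.e. a function defined inside a class and called on an object (the class name is invented for illustration):

class Greeter:
    def greet(self, name):      # a method: the first parameter is the object itself
        return "Hello, " + name

g = Greeter()                   # create an object of the class
print(g.greet("Asha"))          # the method is called on the object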

Now, after understanding the concept of function in python let's get familiar
with conditional operators and looping statements in Python.



2.2.12 Condition & Loop in Python

Python supports the usual logical conditions from mathematics:

 Equals: a == b
 Not Equals: a != b
 Less than: a < b
 Less than or equal to: a <= b
 Greater than: a > b
 Greater than or equal to: a >= b

These conditions can be used in several ways, most commonly in "if statements"
and loops.

 An "if statement" is written by using the if keyword.

 The elif keyword is python's way of saying "if the previous conditions were not
true, then try this condition".

 The else keyword catches anything which isn't caught by the preceding
conditions.
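A small sketch combining these three keywords (the values are arbitrary):

a = 33
b = 200
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")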

Loops

A for loop is used for iterating over a sequence (that is either a list, a tuple, a
dictionary, a set, or a string). This is less like the for keyword in other programming
languages, and works more like an iterator method as found in other object-
oriented programming languages. With the for loop we can execute a set of
statements, once for each item in a list, tuple, set etc.



break Statement
With the break statement we can stop the loop before it has looped through all the
items.
continue Statement

With the continue statement we can stop the current iteration of the loop, and
continue with the next.

range() Function
To loop through a set of code a specified number of times, we can use
the range() function. The range() function returns a sequence of numbers, starting
from 0 by default, and increments by 1 (by default), and ends at a specified number.
Nested Loops
A nested loop is a loop inside a loop.
The "inner loop" will be executed one time for each iteration of the "outer loop":



pass Statement
for loops cannot be empty, but if you for some reason have a for loop with no
content, put in the pass statement to avoid getting an error.
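The sketch below ties these loop constructs together (purely illustrative values):

fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    if fruit == "banana":
        continue          # skip this iteration
    if fruit == "cherry":
        break             # stop the loop entirely
    print(fruit)          # prints only "apple"

for i in range(3):        # 0, 1, 2
    for j in range(2):    # nested loop: runs twice per outer iteration
        print(i, j)

for x in []:
    pass                  # an empty loop body needs pass to avoid an error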

2.2.13 Strings & Methods

Python provides lots of built-in methods which we can use on strings. Below are the
list of some string methods available in Python 3.

1. capitalize()

Returns a copy of the string with its first character capitalized and the rest
lowercased.

2. casefold()

Returns a casefolded copy of the string. Casefolded strings may be used


for caseless matching.



3. center(width, [fillchar])

Returns the string centered in a string of length width. Padding can be


done using the specified fillchar (the default padding uses an ASCII
space). The original string is returned if width is less than or equal to len(s)

4. count(sub, [start], [end])

Returns the number of non-overlapping occurrences of substring (sub) in


the range [start, end]. Optional arguments start and end are interpreted as
in slice notation.

5. encode(encoding="utf-8", errors="strict")


Returns an encoded version of the string as a bytes object. The default encoding is
utf-8. errors may be given to set a different error handling scheme. The possible
value for errors are:
 strict (encoding errors raise a UnicodeError)
 ignore
 replace
 xmlcharrefreplace
 backslashreplace
 Any other name registered via codecs.register_error()



6. endswith()

Returns True if the string ends with the specified suffix, otherwise it returns
False.

7. upper()

Converts a string into upper case
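A quick sketch exercising a few of the methods listed above (the sample text is arbitrary):

text = "python is fun"
print(text.capitalize())        # 'Python is fun'
print(text.upper())             # 'PYTHON IS FUN'
print(text.center(20, "*"))     # padded with * to a total width of 20
print(text.count("n"))          # 2
print(text.endswith("fun"))     # True
print(text.encode("utf-8"))     # b'python is fun'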

2.2.14 String Formatting in Python

1. format() function in Python

The format() method has been introduced for handling complex string formatting
more efficiently. This method of the built-in string class provides functionality for
complex variable substitutions and value formatting. This new formatting technique
is regarded as more elegant. The general syntax of format();



2. The Placeholders
The placeholders can be identified using named indexes {price}, numbered
indexes {0}, or even empty placeholders {}

3. Formatting Types
Inside the placeholders you can add a formatting type to format the result:

1. :< Left align

#To demonstrate, we insert the number 8 to set the available space for the
value to 8 characters.

#Use "<" to left-align the value:

2. :> Right align



3. :^ Center align

3. := Places the sign at the leftmost position

5. :+ To indicate if the result is positive or negative

6. : Extra space
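A hedged sketch of format() with the placeholder styles and alignment options mentioned above (prices and widths are arbitrary):

# named, numbered and empty placeholders
print("Item: {name}, price: {price}".format(name="Pen", price=10.5))
print("{0} costs {1} rupees".format("Pen", 10.5))
print("{} costs {} rupees".format("Pen", 10.5))

# formatting types inside the placeholder
print("{:<8}|".format(7))    # left align in 8 characters
print("{:>8}|".format(7))    # right align
print("{:^8}|".format(7))    # centre align
print("{:+}".format(7))      # show the sign of the number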

2.3 Python Libraries


Normally, a library is a collection of books or is a room or place where many books
are stored to be used later. Similarly, in the programming world, a library is a
collection of precompiled codes that can be used later on in a program for some



specific well-defined operations. Other than pre-compiled codes, a library may
contain documentation, configuration data, message templates, classes, and values,
etc.
A Python library is a collection of related modules. It contains bundles of code that
can be used repeatedly in different programs. It makes Python Programming simpler
and convenient for the programmer. As we don’t need to write the same code again
and again for different programs. Python libraries play a very vital role in fields of
Machine Learning, Data Science, Data Visualization, etc.

Working of Python Library

As is stated above, a Python library is simply a collection of codes or modules of


codes that we can use in a program for specific operations. But how does it work? In the MS Windows environment, for example, compiled library files have a DLL extension (Dynamic Link Libraries). When we link a library with our program and run that program, the linker automatically searches for that library, loads its functionality and makes it available to the program. That is how we use the methods of a library in our program. We will see further how we bring libraries into our Python programs.

Python standard library

The Python Standard Library contains the exact syntax, semantics, and tokens of
Python. It contains built-in modules that provide access to basic system functionality
like I/O and some other core modules. Most of the Python Libraries are written in the
C programming language. The Python standard library consists of more than 200
core modules. All these work together to make Python a high-level programming
language. Python Standard Library plays a very important role. Without it, the
programmers can’t have access to the functionalities of Python. But other than this,
there are several other libraries in Python that make a programmer’s life easier. Let’s
have a look at some of the commonly used libraries:
1. TensorFlow: This library was developed by Google in collaboration with the
Brain Team. It is an open-source library used for high-level computations. It is
also used in machine learning and deep learning algorithms. It contains a
large number of tensor operations. Researchers also use this Python library to
solve complex computations in Mathematics and Physics.
2. Matplotlib: This library is responsible for plotting numerical data. And that’s
why it is used in data analysis. It is also an open-source library and plots high-quality figures like pie charts, histograms, scatterplots, graphs, etc.
3. Pandas: Pandas is an important library for data scientists. It is an open-source data analysis library that provides flexible, high-level data structures and a variety of analysis tools. It eases data analysis, data manipulation, and cleaning of data. Pandas supports operations like sorting, re-indexing, iteration, concatenation, conversion of data, visualizations, aggregations, etc.
4. Numpy: The name “Numpy” stands for “Numerical Python”. It is one of the most commonly used libraries. It is a popular machine learning library that supports large matrices and multi-dimensional data. It consists of in-built mathematical functions for easy computations. Even libraries like TensorFlow use Numpy internally to perform several operations on tensors. The array interface is one of the key features of this library.
5. SciPy: The name “SciPy” stands for “Scientific Python”. It is an open-source
library used for high-level scientific computations. This library is built over an
extension of Numpy. It works with Numpy to handle complex computations.
While NumPy provides basic operations such as sorting and indexing of array data, the higher-level numerical routines live in SciPy. It is also widely used by application developers and engineers.
6. Scrapy: It is an open-source library that is used for extracting data from
websites. It provides very fast web crawling and high-level screen scraping. It
can also be used for data mining and automated testing of data.
7. Scikit-learn: It is a famous Python library to work with complex data. Scikit-
learn is an open-source library that supports machine learning. It supports
variously supervised and unsupervised algorithms like linear regression,
classification, clustering, etc. This library works in association with Numpy and
SciPy.
8. PyGame: This library provides an easy interface to the Simple DirectMedia Layer (SDL) platform-independent graphics, audio, and input libraries. It is
used for developing video games using computer graphics and audio libraries
along with Python programming language.
9. PyTorch: PyTorch is one of the largest machine learning libraries and is optimized for tensor computations. It has rich APIs to perform tensor computations with strong
GPU acceleration. It also helps to solve application issues related to neural
networks.
10. PyBrain: The name “PyBrain” stands for Python Based Reinforcement
Learning, Artificial Intelligence, and Neural Networks library. It is an open-
source library built for beginners in the field of Machine Learning. It provides
fast and easy-to-use algorithms for machine learning tasks. It is so flexible
and easily understandable and that’s why is really helpful for developers that
are new in research fields.
11. OpenCV: Open-Source Computer Vision is used for image processing. It is a Python package focused on real-time computer vision. OpenCV provides several inbuilt functions with the help of which you can learn computer vision. It allows both reading and writing of images. Objects such as faces, trees, etc., can be detected in any video or image.
There are many more libraries in Python. We can use a suitable library for our
purposes. Hence, Python libraries play a very crucial role and are very helpful to the
developers.

Use of Libraries in Python Program


As we write large programs in Python, we want to maintain the code’s modularity. For easy maintenance, we split the code into different parts and reuse those parts whenever we need them. In Python, modules play that role. Instead of repeating the same code in different programs and making each program more complex, we define the most-used functions in modules and simply import them into a program wherever they are required. We don’t need to rewrite that code; we can still use its functionality by importing its module. Multiple interrelated
modules are stored in a library. And whenever we need to use a module, we import it
from its library. In Python, it’s a very simple job to do due to its easy syntax. We just
need to use import.
Let’s have a look at exemplar code.
# Importing math library
import math
A = 16
print(math.sqrt(A))
Output
4.0
Here in the above code, we imported the math library and used one of its methods
i.e. sqrt (square root) without writing the actual code to calculate the square root of a
number. That’s how a library makes the programmer’s job easier. But here we needed only the sqrt method of the math library, yet we imported the whole library. Instead of this, we can also import specific items from a library module, as shown below. After understanding the predefined libraries, let's learn how to create our own library and how to use it.
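A minimal sketch of importing only the item we need instead of the whole library:

# Importing only sqrt from the math library
from math import sqrt

A = 16
print(sqrt(A))   # 4.0 - no math. prefix needed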

2.3.1 User-Defined Library


Creating a user defined module.
To create a module just save the code you want in a file with the file extension .py:
Example;
Save this code in a file named mymodule.py
def greeting(name):
    print("Hello, " + name)

Using the Module: Now we can use the module we just created, by using
the import statement.
Example
Import the module named mymodule, and call the greeting function:
import mymodule

mymodule.greeting("Jonathan")

Installing Packages with pip

pip is the standard package manager for Python. It allows you to install and manage
additional packages that are not part of the Python standard library.

1. Check, whether you have an installed version


py --version
pip --version

2. If you need help, please try help



pip help

3. Install the required packages: pip install requests

Let's see the object-oriented feature of Python - the class.

An object is simply a collection of data (variables) and methods (functions) that act on those data. Similarly, a class is a blueprint for that object. We can think of a class as a sketch (prototype) of a house. It contains all the details about the floors, doors, windows etc. Based on these descriptions we build the house. The house is the object.

2.3.2 Attributes

Class attributes belong to the class itself; they are shared by all the instances. Such attributes are defined in the class body, usually at the top, for legibility.

2.3.3 Instance Attributes

Unlike class attributes, instance attributes are not shared by objects. Every object
has its own copy of the instance attribute (In case of class attributes all object refer
to single copy).
To list the attributes of an instance/object, we have two functions:-
1. vars() – This function displays the attributes of an instance in the form of a dictionary.
2. dir() – This function displays more attributes than the vars() function, as it is not limited to the instance. It displays the class attributes as well. It also displays the attributes of its ancestor classes.

So, we explored the fundamentals of the Anaconda environment and the basics of the Python programming language. Happy Learning.

2.4 NumPy Library


NumPy is an open-source Python library used for working with arrays. It also has
functions for working in domain of linear algebra, Fourier transform, and matrices.
NumPy stands for Numerical Python.

Why NumPy?
In Python we have lists that serve the purpose of arrays, but they are slow to
process. NumPy aims to provide an array object that is up to 50x faster than
traditional Python lists. The array object in NumPy is called ndarray, it provides a lot
of supporting functions that make working with ndarray very easy. Arrays are very
frequently used in data science, where speed and resources are very important.

NumPy arrays are stored in a single contiguous (continuous) block of memory. There
are two key concepts relating to memory: dimensions and strides.

Firstly, many Numpy functions use strides to make things fast. Examples include
integer slicing (e.g. X[1,0:2]) and broadcasting. Understanding strides helps us better
understand how NumPy operates.

Secondly, we can directly use strides to make our own code faster. This can be
particularly useful for data pre-processing in machine learning. NumPy is a Python
library and is written partially in Python, but most of the parts that require fast
computation are written in C or C++.

Installing Numpy Module


1. To install numpy library: pip install numpy

2. To verify the libraries already installed: pip list

Create a NumPy ndarray Object


NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.
1. Create a NumPy ndarray Object
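A minimal sketch of creating an ndarray with the array() function (the values are arbitrary):

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr)          # [1 2 3 4 5]
print(type(arr))    # <class 'numpy.ndarray'>
print(arr.shape)    # (5,)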



Min, Max & Everything in between

Random Number Generator


 Generate Random Number: NumPy offers the random module to work with
random numbers. Ex: Generate a random integer from 0 to 100:

 Generate Random Float: The random module's rand() method returns a


random float between 0 and 1. Ex: Generate a random float from 0 to 1:



 Generate Random Array: In NumPy we work with arrays, and you can use the
two methods from the above examples to make random arrays.

Integers
The randint() method takes a size parameter where you can specify the shape of an
array.
Ex: Generate a 1-D array containing 5 random integers from 0 to 100

Ex: Generate a 2-D array with 3 rows, each row containing 5 random integers from 0
to 100
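A sketch of the random-number calls described above (the bounds and shapes follow the text; outputs will differ on every run):

from numpy import random

print(random.randint(100))               # one random integer from 0 up to (but not including) 100
print(random.rand())                     # one random float between 0 and 1
print(random.randint(100, size=(5)))     # 1-D array of 5 random integers
print(random.randint(100, size=(3, 5)))  # 2-D array: 3 rows of 5 random integers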

2.4.1 Creating scalars in NumPy



Creating Vector in NumPy

2.4.2 Creating Matrix in NumPy

Matrix Multiplication in NumPy



Matrix multiplication is an operation that takes two matrices as input and produces a single matrix by multiplying the rows of the first matrix with the columns of the second matrix. In matrix multiplication, make sure that the number of columns of the first matrix is equal to the number of rows of the second matrix.
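A hedged sketch of matrix multiplication with NumPy, using a 2x3 matrix times a 3x2 matrix (values are arbitrary):

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])         # shape (3, 2): columns of A == rows of B

C = np.matmul(A, B)              # equivalently A @ B or np.dot(A, B)
print(C)                         # [[ 58  64]
                                 #  [139 154]]  - shape (2, 2)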

2.4.3 NumPy Statistical Functions

Statistics is concerned with collecting and analyzing data. It includes methods for collecting samples, describing the data, and then drawing conclusions from that data. NumPy is the fundamental package for scientific calculations and hence goes hand-in-hand with these statistical tasks.
NumPy contains various statistical functions that are used to perform statistical data
analysis. These statistical functions are useful when finding a maximum or minimum
of elements. It is also used to find basic statistical concepts like standard deviation,
variance, etc.

Fig: NumPy Statistical Functions



NumPy Statistical Functions
NumPy is equipped with the following statistical functions, let us see one by one:
1. np.amin()- This function determines the minimum value of the element along a
specified axis.
2. np.amax()- This function determines the maximum value of the element along
a specified axis.
3. np.mean()- It determines the mean value of the data set.
4. np.median()- It determines the median value of the data set.
5. np.std()- It determines the standard deviation
6. np.var() – It determines the variance.
7. np.ptp()- It returns a range of values along an axis.
8. np.average()- It determines the weighted average
9. np.percentile()- It determines the nth percentile of data along the specified
axis.

1. Finding maximum and minimum of array in NumPy


NumPy's np.amin() and np.amax() functions are useful to determine the minimum and maximum value of array elements along a specified axis.

2. Finding Mean, Median, Standard Deviation and Variance in NumPy


Mean
Mean is the sum of the elements divided by the number of elements and is given by the following formula:
mean = (x1 + x2 + ... + xn) / n
np.mean() calculates the mean by adding all the items of the array and then dividing the sum by the number of elements. We can also mention the axis along which the mean should be calculated.



Median
Median is the middle element of the sorted data. The formula differs for odd and even numbers of elements: for a data set with n observations, the median is the ((n + 1)/2)-th value when n is odd, and the average of the (n/2)-th and ((n/2) + 1)-th values when n is even.
np.median() can calculate the median for both one-dimensional and multi-dimensional arrays. The median separates the higher and lower range of data values.

Standard Deviation
Standard deviation is the square root of the average of squared deviations from the mean. The formula for standard deviation is:
std = sqrt( sum((xi - mean)^2) / n )

Variance
Variance is the average of the squared differences from the mean. The formula is:
var = sum((xi - mean)^2) / n
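A minimal sketch applying the functions above to a small data set (values chosen only for illustration):

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(np.amin(data), np.amax(data))   # 2 9
print(np.mean(data))                  # 5.0
print(np.median(data))                # 4.5
print(np.std(data))                   # 2.0
print(np.var(data))                   # 4.0
print(np.ptp(data))                   # 7 (range: max - min)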



2.4.4 Percentile & Interquartile in NumPy

Quartiles:

A quartile is a type of quantile. The first quartile (Q1), is defined as the middle
number between the smallest number and the median of the data set, the second
quartile (Q2) – median of the given data set while the third quartile (Q3), is the middle
number between the median and the largest value of the data set.

Algorithm to find Quartiles:


Quartiles are calculated by the help of the median. If the number of entries is an
even number i.e., of the form 2n, then, first quartile (Q1) is equal to the median of
the n smallest entries and the third quartile (Q3) is equal to the median of
the n largest entries.
If the number of entries is an odd number i.e., of the form (2n + 1), then
 the first quartile (Q1) is equal to the median of the n smallest entries
 the third quartile (Q3) is equal to the median of the n largest entries
 the second quartile(Q2) is the same as the ordinary median.
Range: It is the difference between the largest value and the smallest value in the
given data set.

Interquartile Range: The interquartile range (IQR), also called the mid-spread, middle 50%, or technically the H-spread, is the difference between the third quartile (Q3) and the first quartile (Q1). It covers the centre of the distribution and contains 50% of the observations. IQR = Q3 – Q1

Uses;

 The interquartile range has a breakdown point of 25% due to which it is


often preferred over the total range.
 The IQR is used to build box plots, simple graphical representations of
a probability distribution.
 The IQR can also be used to identify the outliers in the given data set.
 The IQR gives an idea of the spread of the middle 50% of the data.

Decision Making
 The data set having a higher value of interquartile range (IQR) has
more variability.
 The data set having a lower value of interquartile range (IQR) is
preferable.

Interquartile range using numpy.median

Interquartile range using numpy.percentile
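A sketch of both approaches mentioned above, using a small made-up data set (note that the two quartile conventions may give slightly different values):

import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# Using numpy.percentile
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
print("IQR via percentile:", q3 - q1)

# Using numpy.median on the lower and upper halves
lower, upper = data[:len(data)//2], data[len(data)//2:]
print("IQR via median:", np.median(upper) - np.median(lower))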

After knowing the various statistical functions in NumPy which are used in
descriptive analytics, let us see some other interesting functionalities of NumPy.

2.4.5 Array Broadcasting in NumPy

The term broadcasting describes how NumPy treats arrays with different shapes
during arithmetic operations. Subject to certain constraints, the smaller array is
“broadcast” across the larger array so that they have compatible shapes.



Fig: Broadcast Array

In the simplest example of broadcasting, the scalar b is stretched to become an array of the same shape as a, so that the shapes are compatible for element-by-element multiplication.

The result is equivalent to the previous example where b was an array. We can think
of the scalar b being stretched during the arithmetic operation into an array with the
same shape as a. The new elements in b, as shown in the above figure, are simply
copies of the original scalar.
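A minimal sketch of the scalar-broadcast example described above:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2.0                 # scalar: conceptually stretched to [2.0, 2.0, 2.0]
print(a * b)            # [2. 4. 6.]

# broadcasting also works between arrays of compatible shapes
m = np.ones((3, 3))
v = np.array([0, 1, 2])
print(m + v)            # v is broadcast across each row of m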

Sorting Arrays
numpy.sort(): This function returns a sorted copy of an array.
Parameters:
arr: Array to be sorted.
axis: Axis along which we need the array to be sorted.
order: This argument specifies which fields to compare first.
kind: [‘quicksort’{default}, ‘mergesort’, ‘heapsort’]. These are the sorting
algorithms.
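A short sketch of numpy.sort() with the axis and kind parameters (values arbitrary):

import numpy as np

arr = np.array([[12, 15], [10, 1]])
print(np.sort(arr))                     # sorts along the last axis by default
print(np.sort(arr, axis=0))             # sort each column
print(np.sort(arr, axis=None))          # flatten, then sort
print(np.sort(arr, kind='mergesort'))   # choose the sorting algorithm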



Concluding, we have discussed Data Creation, with arrays & matrices using NumPy
through the last few sections. We next discuss about data analysis with Pandas, and
data visualization with Matplotlib.



Chapter 3: Data Analysis with Pandas
and Matplotlib
Learning Outcomes:

 Use Matplotlib library and its various functions to visualize the data
 Analyse different types of data using Pandas
 Understand the use of data analytics in improved decision making

3.1 What is Data Visualization?

3.1.1 Understanding Data

Data is a collection of information gathered by observations, measurements,


research or analysis. They may consist of facts, numbers, names, figures or even descriptions of things. Data is organized in the form of graphs, charts or tables. Data scientists mine this data and, with its help, analyse our world.

Classification of Data:

Qualitative: It describes the quality of something or someone. It is descriptive


information. For example, the skin colour, eye colour, hair texture, etc. gives us the
qualitative information about a person.

Quantitative: It provides numerical information. Example, the height and weight of a


person.

The Pandas library is one of the most preferred tools for data scientists to do data
manipulation and analysis, next to matplotlib for data visualization and NumPy, the
fundamental library for scientific computing in Python on which Pandas was built.

The fast, flexible, and expressive Pandas data structures are designed to make real-
world data analysis significantly easier, but this might not be immediately the case
for those who are just getting started with it. This is because there is so much
functionality built into this package that the options are overwhelming.



Pandas will be introduced in detail at a later stage. We show below how to classify data and how to select suitable data visualization techniques.

3.1.2 Understanding Data Visualization?

Data visualization is the graphical representation of information and data. By


using visual elements like charts, graphs, and maps, data visualization tools provide
an accessible way to see and understand trends, outliers, and patterns in data.
These visual displays of information communicate complex data relationships and
data-driven insights in a way that is easy to understand. In the world of Big Data,
data visualization tools and technologies are essential to analyze massive amounts
of information and make data-driven decisions.

Our eyes are drawn to colors and patterns. We can quickly identify red from blue,
square from circle. Our culture is visual, including everything from art and
advertisements to TV and movies. Data visualization is another form of visual art that
grabs our interest and keeps our eyes on the message. When we see a chart,
we quickly see trends and outliers. If we can see something, we internalize it quickly.
It’s storytelling with a purpose. If you’ve ever stared at a massive spreadsheet of
data and couldn’t see a trend, you know how much more effective a visualization can
be. Data visualization helps transform your data into an engaging story with details
and patterns. It is used for:
 Better Analysis
 Speed up decision making process
 Quick action
 Identifying patterns
 Story telling is more engaging
 Grasping the latest trends
 Finding errors

Data visualization for idea illustration assists in conveying an idea, such as a tactic or
process. It is commonly used to spur idea generation across teams. In the early
days of visualization, the most common visualization technique was using a
Microsoft Excel spreadsheet to transform the information into a table, bar graph or
pie chart. While Microsoft Excel continues to be a popular tool for data visualization,
others have been created that provide us with more sophisticated abilities.

Image: Sample Data visualization

3.1.3 Different Types of Analysis

Data visualization is a part of exploratory data analysis, where the main objective is
to analyse data and summarize the entire characteristics, often with visual methods.
We perform analysis on the data that we collect, find important metrics/features by
using some nice and pretty visualizations. It is usually performed using the following
methods:

Univariate Analysis: Univariate is commonly used to describe a type of data that


contains only one attribute or characteristic. The salaries of people in the industry
could be a univariate analysis example. The univariate data could also be used to
calculate the mean age of the population in a village. Ex: - Box plot, Violin plot, Bar



chart, Area chart, Line chart, Pie chart and histogram. We will discuss in details
about these plots afterwards.

Bivariate Analysis: This type of analysis is performed to find the relationship


between each variable in the dataset and the target variable of interest. In simple
terms it is used to find the interactions between variables. Ex:- Scatter Plot, Hex
Plot, etc.

Multivariate Analysis: This analysis is used, when we have more than two variables
in the dataset. It is a hard task for the human brain to visualize the relationship
among more than 3 variables in a graph and thus multivariate analysis is used to
study these complex data types. Ex: - Cluster Analysis, Pair Plot and 3D scatter plot.

3.1.4 Plotting and Visualization


Plotting means drawing a chart or graph that shows how data behaves. A plot is a graphical technique for representing a data set, usually as a graph showing the relationship between two or more variables. Python offers multiple great graphing libraries that come packed with lots of different features.

To get a little overview here are a few popular plotting libraries:

 Matplotlib: low level, provides lots of freedom


 Pandas Visualization: easy to use interface, built on Matplotlib
 Seaborn: high-level interface, great default styles
 ggplot: based on R’s ggplot2, uses Grammar of Graphics
 Plotly: can create interactive plots

Let us explore the Matplotlib library:

Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive


visualizations in Python.

How to install matplotlib in python?

To install Matplotlib, pip or conda can be used:

pip install matplotlib
or
conda install matplotlib

Matplotlib is specifically good for creating basic graphs like line charts, bar charts, histograms and many more. It can be imported by typing:
import matplotlib.pyplot as plt

matplotlib.pyplot is a collection of command style functions that make matplotlib work


like MATLAB. We next discuss about the different types of graphs and how to plot
them using matplotlib.

3.2 Plotting with Matplotlib

Line Chart: A line chart displays the evolution of one or more numeric variables. It
is one of the most common chart types. It is a type of graph or chart which displays
information as a series of data points called ‘markers’, connected by straight line segments. A line plot can be used for both ordered and unordered data.



Output:

The code is self-explanatory. The following steps were followed:


 Define the x-axis and corresponding y-axis values as lists.
 Plot them on canvas using .plot() function.
 Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
 Give a title to your plot using .title() function.
 Finally, to view your plot, we use .show() function.
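A minimal sketch that follows the steps listed above (the data points are arbitrary):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]          # x-axis values
y = [2, 4, 1, 5, 3]          # corresponding y-axis values

plt.plot(x, y)               # plot them on the canvas
plt.xlabel('x - axis')       # name the x-axis
plt.ylabel('y - axis')       # name the y-axis
plt.title('My first line chart')
plt.show()                   # display the plot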

3.2.1 Matplotlib Markers


You can use the keyword argument marker to emphasize each point with a specified
marker. There are different possible marker options available with matplotlib.



Matplotlib Line:
The linestyle keyword is used to change the style of the plotted line. The shorter form of linestyle is ls. The line styles allowed are solid, dashed, dotted and dashdot.

Line Color:
You can use the keyword argument color or the shorter c to set the color of the line.
The default color codes used in matplotlib are - b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white.

Marker Color
You can use the keyword argument markeredgecolor or the shorter mec to set the color of the edge of the markers, and markerfacecolor or the shorter mfc to set the color of the face of the markers.
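A small sketch combining the marker, linestyle and colour keywords described above:

import matplotlib.pyplot as plt

y = [3, 8, 1, 10]
plt.plot(y, marker='o', linestyle='dashed', color='r',
         markeredgecolor='k', markerfacecolor='y')   # or the short forms ls, c, mec, mfc
plt.show()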

3.2.2 Subplots in matplotlib

The Matplotlib subplot() function can be called to plot two or more plots in one
figure. Matplotlib supports all kind of subplots including 2x1 vertical, 2x1 horizontal or
a 2x2 grid. The subplot() function takes three arguments that describes the layout of
the figure. The layout is organized in rows and columns, which are represented by
the first and second argument. The third argument represents the index of the
current plot.
plt.subplot(x, y, 1)
#the figure has x rows, y columns, and this plot is the first plot.
plt.subplot(x, y, 2)
#the figure has x rows, y columns, and this plot is the second plot.

Example
Draw 2 plots on top of each other:
Code: subplot theory.ipynb



Output

You can draw as many plots you like on one figure, just describe the number of
rows, columns, and the index of the plot.
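A hedged sketch of two plots stacked vertically with subplot() (the data is arbitrary):

import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

plt.subplot(2, 1, 1)             # 2 rows, 1 column, first plot
plt.plot(x, [y ** 2 for y in x])
plt.title('squares')

plt.subplot(2, 1, 2)             # 2 rows, 1 column, second plot
plt.plot(x, [y ** 3 for y in x])
plt.title('cubes')

plt.tight_layout()
plt.show()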

3.2.3 Bar Plot, Histogram, Scatter Plot, Pie Chart

Bar Plot: Bar plots are arguably the simplest data visualization. They map
categories to numbers; it is very convenient while comparing categories of data or
different groups of data. Bar plots are very flexible: The height can represent
anything, as long as it is a number. And each bar can represent anything, as long as
it is a category.



The categories can be of two types: 1) Nominal categories: "pure" categories that
don't make a lot of sense to order. Nominal categorical variables include things like
countries, ZIP codes, types of cheese, and lunar landers. 2) Ordinal categories:
things that do make sense to compare, like earthquake magnitudes, housing
complexes with certain numbers of apartments, and the sizes of bags of chips at
your local deli.
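A minimal bar-plot sketch with made-up categories and values:

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 36]
plt.bar(categories, values)      # one bar per category, height = value
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()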

Histogram: The histogram is a popular graphing tool. It is used to summarize


discrete or continuous data that are measured on an interval scale. It is often used to
illustrate the major features of the distribution of the data in a convenient form. A
histogram looks, trivially, like a bar plot. And it basically is! In fact, a histogram is
special kind of bar plot that splits your data into even intervals known as bins and
displays how many rows are in each interval with bars. The only analytical difference
is that instead of each bar representing a single value, it represents a range of
values and is best for visualizing continuous data.
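A histogram sketch using randomly generated data, split into bins as described above:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(170, 10, 250)   # 250 made-up height values
plt.hist(data, bins=10)                 # 10 equal-width intervals
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.show()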



Scatter Plot: A scatter plot represents individual pieces of data using dots. These
plots make it easier to see if two variables are related to each other. The resulting
pattern indicates the type (linear or non-linear) and strength of the relationship
between two variables. If the value along the Y axis seem to increase as X axis
increases (or decreases), it could indicate a positive (or negative) linear relationship.
Whereas, if the points are randomly distributed with no obvious pattern, it could
possibly indicate a lack of dependent relationship.
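A scatter-plot sketch with two illustrative variables:

import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [35, 45, 50, 58, 62, 70, 74, 85]
plt.scatter(hours_studied, marks)        # each dot is one observation
plt.xlabel('Hours studied')
plt.ylabel('Marks')
plt.show()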



Pie Chart: Pie chart is a circle that is divided into areas or slices. It is mainly used to
comprehend how a group is broken down into smaller pieces. The whole pie
represents 100 percent, and the slices denote the relative size of that particular
category. Pie charts can be helpful for showing the relationship of parts to the whole
when there are a small number of levels. We show below a pie chart, which displays
how different brands of a product line contribute to revenue.
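A pie-chart sketch with hypothetical brand revenue shares (the brands and numbers are invented):

import matplotlib.pyplot as plt

brands = ['Brand A', 'Brand B', 'Brand C', 'Brand D']
revenue = [40, 30, 20, 10]               # shares of the whole (100%)
plt.pie(revenue, labels=brands, autopct='%1.1f%%')
plt.show()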



3.2.4 Legends and Text Annotations

A legend is an area describing the elements of the graph. In the matplotlib library, there is a function called legend() which is used to place a legend on the axes: matplotlib.pyplot.legend(). Write the code given below in a Jupyter notebook to add a legend to the graph and click on Run. We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame, or change the padding around the text.
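A hedged sketch of the kind of code referred to above, adding a legend with a fancybox, shadow and transparency:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
plt.plot(x, [i * 2 for i in x], label='2x')
plt.plot(x, [i * 3 for i in x], label='3x')
plt.legend(loc='upper left', fancybox=True, shadow=True,
           framealpha=0.7, borderpad=1)
plt.show()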

Result:

We next consider annotations, a piece of text referring to a data point. Creating a


good visualization involves guiding the reader so that the figure tells a story. In some
cases, this story can be told in an entirely visual manner, without the need for added
text, but in others, small textual cues and labels are necessary. Perhaps the most
basic types of annotations you will use are axes labels and titles, but the options go
beyond this. Let's take a look at some data and how we might visualize and annotate
it to help convey interesting information. Let's try to understand it with an example. The annotate() function in the pyplot module of the matplotlib library is used to annotate the point xy with text. Here the annotation mark is a simple arrow.
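A minimal annotate() sketch marking the maximum of a sine curve with an arrow (the values are illustrative):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
plt.plot(x, y)
plt.annotate('local maximum', xy=(np.pi / 2, 1),      # the point being marked
             xytext=(3, 1.2),                         # where the text sits
             arrowprops=dict(arrowstyle='->'))
plt.ylim(-1.5, 1.5)
plt.show()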



Output

In above example you can observe text annotate local maximum and local minimum
by arrow. We next consider another example.



Output:

3.2.5 Three-Dimensional Plotting in Matplotlib

Three-dimensional plots are enabled by importing the mplot3d toolkit, included with
the main Matplotlib installation:
from mpl_toolkits import mplot3d

Once this submodule is imported, three-dimensional axes can be created by passing


the keyword projection='3d' to any of the normal axes creation routines. We first
create a simple 3D graph.

We can plot a variety of three-dimensional plot types, with the above three-
dimensional axes enabled. Remember that compared to the 2d plots, it will be
greatly beneficial if we view the 3d plots interactively rather than statically in the
notebook. We will therefore use %matplotlib notebook rather than %matplotlib inline
when running the code. ax.plot3D and ax.scatter3D are the functions to plot line and point graphs respectively. Write down the code given below in a Jupyter notebook and click on Run.
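A small sketch of a three-dimensional line and scatter plot along the lines described above:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d   # enables the 3d projection

fig = plt.figure()
ax = plt.axes(projection='3d')

zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')          # 3D line plot

zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata)      # 3D scatter plot
plt.show()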

Output:

We have learned and explored the various functions in Matplotlib library which are
used to visualize the data in various format and make it available to help in decision
making process. Now, let us see another powerful library called Pandas which is
very popular in data analytics.



3.3 Data Manipulation with Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool. Pandas is a newer package built on top of NumPy, and
provides an efficient implementation of a DataFrame.
DataFrames are essentially multidimensional arrays with attached row and
column labels, and often with heterogeneous types and/or missing data. Apart
from offering a convenient storage interface for labelled data, Pandas also
implements a number of powerful data operations familiar to users of both
database frameworks and spreadsheet programs.

Image: DataFrame

3.3.1 How to install pandas in Python

Use the command, pip install pandas and install it.


You can quickly check if the package was successfully installed in Python, by
opening the Python IDLE and then running the command “import pandas”. If no
errors appear, then the package was successfully installed. For more details on
Python packages you can refer to https://packaging.python.org/.
To load the pandas package and start working with it, import the package
import pandas as pd

For e.g., we need to store passengers’ data of the Titanic. For a number of
passengers, we know the name (characters), age (integers) and sex (male/female)
data.
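A minimal sketch of such a table, built from a Python dictionary of lists (the passenger details are invented):

import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"]
})
print(df)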
Output:

To manually store data in a table, create a DataFrame. When using a Python


dictionary of lists, the dictionary keys will be used as column headers and the values
in each list as columns of the DataFrame. Now we can perform more operation to
get more insights into data. Let us explore more commands.

When selecting a single column of a pandas DataFrame, the result is a


pandas Series. A Series is a one-dimensional array holding data of any type. Pandas provides the to_frame() method to easily convert a Series into a DataFrame. To select the column, use the column label in between square brackets [].

Find maximum Age of the passengers.



The describe() method provides a quick overview of the numerical data in
a DataFrame. As the Name and Sex columns are textual data, these are by default
not taken into account by the describe() method. The describe() command is used to
view some basic statistical details like percentile, mean, std etc. from the numerical
columns of a DataFrame.
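Continuing with the small DataFrame sketched above, the selection, max and describe operations look roughly like this:

ages = df["Age"]          # selecting one column gives a pandas Series
print(type(ages))         # <class 'pandas.core.series.Series'>
print(df["Age"].max())    # 58

print(df.describe())      # count, mean, std, min, quartiles and max of the numeric columns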

3.3.2 How to manipulate textual data?

Here we use the Titanic data set, which is stored as CSV file. A screenshot of the
first few columns of the dataset is given below.

The data consists of the following data columns:

 PassengerId: Id of every passenger.


 Survived: These features have value 0 and 1. 0 for not survived and 1 for
survived.
 Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
 Name: Name of passenger.
 Sex: Gender of passenger.
 Age: Age of passenger.
 SibSp: The number of siblings/spouses of the passenger aboard.
 Parch: The number of parents/children of the passenger aboard, i.e., whether the passenger was alone or with family.
 Ticket: Ticket number of passenger.
 Fare: Indicating the fare.
 Cabin: The cabin of passenger.
 Embarked: The embarked category.
We first import pandas and read the dataset. Then we print the first five rows of the
dataset.

Code: pandas data manipulation.ipynb
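A hedged sketch of that first step, assuming the data set is saved locally as titanic.csv (the file name is an assumption):

import pandas as pd

# read the CSV file into a DataFrame (file name assumed to be titanic.csv)
titanic = pd.read_csv("titanic.csv")
print(titanic.head())     # first five rows of the dataset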


Task 1: Make all name characters lowercase.

Code: pandas data manipulation.ipynb

Task 2: Create a new column Surname that contains the surname of the passengers
by extracting the part before the comma.



Code: pandas data manipulation.ipynb
Using the series.str.split() method, each of the values is returned as a list of 2
elements. The first element is the part before the comma and the second element is
the part after the comma.
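The two tasks above could be sketched as follows, using the pandas string methods (the column names follow the data description given earlier):

# Task 1: make all name characters lowercase
titanic["Name"] = titanic["Name"].str.lower()

# Task 2: extract the surname (the part before the comma) into a new column
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
print(titanic[["Name", "Surname"]].head())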

By this method and various functions, you can perform textual data manipulation
using pandas.
3.3.3 Introducing Pandas Objects

Pandas objects can be thought of as enhanced versions of NumPy structured arrays


in which the rows and columns are identified with labels rather than simple integer
indices. There are three fundamental Pandas data structures:
1. Series,
2. DataFrame
3. Index.



The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from
a list or array as follows:

As we see in the output, the Series wraps both a sequence of values and a
sequence of indices, which we can access with the values and index attributes.
The values are simply a familiar NumPy array:

Like with a NumPy array, data can be accessed by the associated index via the
familiar Python square-bracket notation:
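A small sketch of a Series and its values, index and square-bracket access (the numbers are arbitrary):

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
print(data.values)     # the underlying NumPy array
print(data.index)      # RangeIndex(start=0, stop=4, step=1)
print(data[1])         # 0.5 - access by index, like a NumPy array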

3.3.4 The Pandas DataFrame Object


The next fundamental structure in Pandas is the DataFrame. Like the Series object
discussed in the previous section, the DataFrame can be thought of either as a
generalization of a NumPy array, or as a specialization of a Python dictionary.

DataFrame as a generalized NumPy array


If a Series is an analog of a one-dimensional array with flexible indices,
a DataFrame is an analog of a two-dimensional array with both flexible row indices



and flexible column names. Just as you might think of a two-dimensional array as an
ordered sequence of aligned one-dimensional columns, you can think of
a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean
that they share the same index.
To demonstrate this, let's first construct two new Series listing the area and the population of, say, five states of the USA:

Now that we have the area and the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:
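A hedged sketch of that construction; the state names and figures below are placeholders (approximate values), not exact census data:

import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})                 # approximate area in km^2
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})         # approximate population

states = pd.DataFrame({'area': area, 'population': population})
print(states)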

3.3.5 Creating Series from simple datatypes

Creating a Pandas Series


The Pandas Series can be defined as a one-dimensional array that is capable of
storing various data types. We can easily convert the list, tuple, and dictionary into
series using "series' method. The row labels of series are called the index. A Series
cannot contain multiple columns. It has the following parameter:
 data: It can be any list, dictionary, or scalar value.
 index: The value of the index should be unique and hashable. It must
be of the same length as data. If we do not pass any index,
the default np.arange(n) will be used.
 dtype: It refers to the data type of series.
 copy: It is used for copying the data.

1. Create an Empty Series:


We can easily create an empty series in Pandas which means it will not have any
value.

2. Creating a series from array:

In order to create a series from an array, we have to import the numpy module and use the array() function.



3. Creating a series from Lists:

In order to create a series from a list, we first create the list and then pass it to the Series constructor.

4. Creating a series from Dictionary:

In order to create a series from a dictionary, we first create a dictionary and then build the series from it. The dictionary keys are used to construct the index.
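A compact sketch of the four ways of creating a Series described above:

import pandas as pd
import numpy as np

s_empty = pd.Series(dtype=float)                # 1. empty Series
s_array = pd.Series(np.array(['g', 'e', 'k']))  # 2. from a NumPy array
s_list = pd.Series([10, 20, 30])                # 3. from a list
s_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})    # 4. from a dictionary (keys become the index)
print(s_empty, s_array, s_list, s_dict, sep="\n")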

3.3.6 Data Storage Formats in Pandas

The different data storage formats available to be manipulated by Pandas library are
text, binary and SQL. Below is a table containing available ‘readers’ and ‘writers’
functions of the pandas I/O API set with data format and description.
Format Type   Data Description        Reader           Writer
text          CSV                     read_csv         to_csv
text          Fixed-Width Text File   read_fwf
text          JSON                    read_json        to_json
text          HTML                    read_html        to_html
text          Local clipboard         read_clipboard   to_clipboard
binary        MS Excel                read_excel       to_excel
binary        OpenDocument            read_excel
binary        HDF5 Format             read_hdf         to_hdf
binary        Feather Format          read_feather     to_feather
binary        Parquet Format          read_parquet     to_parquet
binary        ORC Format              read_orc
binary        Msgpack                 read_msgpack     to_msgpack
binary        Stata                   read_stata       to_stata
binary        SAS                     read_sas
binary        SPSS                    read_spss
binary        Python Pickle Format    read_pickle      to_pickle
SQL           SQL                     read_sql         to_sql
SQL           Google BigQuery         read_gbq         to_gbq

Figure: “readers” and “writers” functions in pandas


3.3.7 CSV file and JSON file

What is CSV file?

A CSV is a comma-separated values file, which allows data to be saved in a tabular


format. CSV files can be used with many spreadsheet programs, such as Microsoft Excel or Google Sheets. They differ from other spreadsheet file types because you can only have a single sheet in a file, they cannot save cell, column, or row formatting, and you cannot save formulas in this format.

Why are .CSV files used?

These files serve a number of different business purposes. For instance, they help companies export a high volume of data to a more concentrated database.
They also serve two other primary business functions:
 CSV files are plain-text files, making them easier for the website developer to
create
 Since they're plain text, they're easier to import into a spreadsheet or another
storage database, regardless of the specific software you're using.
 To better organize large amounts of data.



How do I save CSV files?

Saving CSV files is relatively easy, you just need to know where to change the file
type. Under the "File name" section in the "Save As" tab, you can select "Save as
type" and change it to "CSV (Comma delimited) (*.csv). Once that option is selected,
you are on your way to quicker and easier data organization. This should be the
same for both Apple and Microsoft operating systems.

What is a JSON file?

A JSON file is a file that stores simple data structures and objects in JavaScript
Object Notation (JSON) format, which is a standard data interchange format. It is
primarily used for transmitting data between a web application and a server. JSON
files are lightweight, text-based, human-readable, and can be edited using a text
editor.

How do I open a JSON file?

Because JSON files are plain text files, you can open them in any text editor,
including:
 Microsoft Notepad (Windows)
 Apple TextEdit (Mac)
 Vim (Linux)
 GitHub Atom (cross-platform)

You can also open a JSON file in the Google Chrome and Mozilla Firefox web
browsers by dragging and dropping the file into your browser window.

Structures of JSON

JSON supports two widely used (amongst programming languages) data structures.

 A collection of name/value pairs. Different programming languages support


this data structure in different names. Like object, record, struct, dictionary,
hash table, keyed list, or associative array.

 An ordered list of values. In various programming languages, it is called as


array, vector, list, or sequence.

Since data structure supported by JSON is also supported by most of the modern
programming languages, it makes JSON a very useful data-interchange format.



Syntax:
{ string : value, .......}

Explanation of Syntax
An object starts and ends with '{' and '}'. Between them, a number of string value
pairs can reside. String and value is separated by a ':' and if there are more than one
string value pairs, they are separated by ','.
Example
{
"firstName": "John",
"lastName": "Maxwell",
"age": 40,
"email":"[email protected]"
}
In JSON, objects can nest arrays (starts and ends with '[' and ']') within it. The
following example shows that.
{
"Students": [

{ "Name":"Amit Goenka" ,
"Major":"Physics" },
{ "Name":"Smita Pallod" ,
"Major":"Chemistry" },
{ "Name":"Rajeev Sen" ,
"Major":"Mathematics" }
]
}

Array:
Syntax:
[ value, .......]

Explanation of Syntax:
An Array starts and ends with '[' and ']'. Between them, a number of values can
reside. If there are more than one values, they are separated by ','.
Example
[100, 200, 300, 400]
If the JSON data describes an array, and each element of that array is an object.

[
{
"name": "John Maxwell",



"email": "[email protected]"
},
{
"name": "Dale Carnegie",
"email": "[email protected]"
}
]

Remember that even arrays can also be nested within an object. The following
shows that.

{
"firstName": "John",
"lastName": "Maxwell",
"age": 40,
"address":
{
"streetAddress": "144 J B Queens Road",
"city": "Dallas",
"state": "Washington",
"postalCode": "75001"
},
"phoneNumber":
[
{
"type": "personal",
"number": "(214)5096995"
},
{
"type": "fax",
"number": "13235551234"
}
]
}

Value
Syntax:
String || Number || Object || Array || TRUE || FALSE || NULL
A value can be a string, a number, an object, an Array, a Boolean value (i.e., true or
false) or Null. This structure can be nested.



String
A string is a sequence of zero or more Unicode characters, enclosed in double
quotes, using backslash escapes. A character is represented as a single-character
string, similar to a C or Java string.
The following table shows the supported escape sequences.

Escape Sequence    Description
\"                 A double quotation mark
\\                 Reverse solidus (backslash)
\/                 Solidus (forward slash)
\b                 Backspace
\f                 Form feed
\n                 Newline
\r                 Carriage return
\t                 Horizontal tab
\uXXXX             Four hexadecimal digits (a Unicode character)

Number
The following table shows the supported number types.

Number Type    Description
Integer        Positive or negative whole numbers, using the digits 1-9 and 0
Fraction       A fractional part, such as .8
Exponent       An exponent part: e, e+, e-, E, E+, E-

Whitespace
Whitespace can be placed between any pair of supported data-types.

3.3.8 Reading data from files

Load CSV files to Python Pandas

The basic process of loading data from a CSV file into a Pandas DataFrame is
achieved using the “read_csv” function in Pandas:
# Load the Pandas library with alias 'pd'
import pandas as pd

# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("filename.csv")

# Preview the first 5 lines of the loaded data
data.head()



While this code seems simple, an understanding of three fundamental concepts is
required to fully grasp and debug the operation of the data loading procedure if you
run into issues:
File Extensions and File Types
The first step to working with comma-separated-value (CSV) files is understanding
the concept of file types and file extensions.

1. Data is stored on your computer in individual “files”, or containers, each with a
different name.
2. Each file contains data of different types – the internals of a Word document is
quite different from the internals of an image.
3. Computers determine how to read files using the “file extension”, that is the
code that follows the dot (“.”) in the filename.
4. So, a filename is typically in the form “<random name>.<file extension>”.
Examples:
 project1.DOCX – a Microsoft Word file called project1.
 shanes_file.TXT – a simple text file called shanes_file
 IMG_5673.JPG – An image file called IMG_5673.
 Other well-known file types and extensions include: XLSX – Excel, PDF –
Portable Document Format, PNG – images, ZIP – compressed file format,
GIF – animation, MPEG – video, MP3 – music, etc. A complete list of
extensions can be found online.
5. A CSV file is a file with a “.csv” file extension, e.g. “data.csv”,
“super_information.csv”. The “CSV” in this case lets the computer know that
the data contained in the file is in “comma separated value” format.

Data Representation in CSV files

A “CSV” file, that is, a file with a “csv” filetype, is a basic text file. Any text editor such
as NotePad on windows or TextEdit on Mac, can open a CSV file and show the
contents. Sublime Text is a wonderful and multi-functional text editor option for any
platform.
CSV is a standard form for storing tabular data in text format, where commas are
used to separate the different columns, and newlines (carriage return / press enter)
are used to separate rows. Typically, the first row in a CSV file contains the names of
the columns for the data.
An example of a table data set and the corresponding CSV-format data is shown in
the diagram below.



Figure: Comma-separated value files, or CSV files, are simple text files where commas and newlines are used to
define tabular data in a structured way.

Note that almost any tabular data can be stored in CSV format – the format is
popular because of its simplicity and flexibility. You can create a text file in a text
editor, save it with a .csv extension, and open that file in Excel or Google Sheets to
see the table form.
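For illustration, here is a minimal sketch that writes a small CSV file from Python and reads it back with Pandas (the file name sample.csv and the data are made up for this example):

import pandas as pd

# Write a small comma-separated file (the file name is arbitrary)
csv_text = "name,age,city\nAlice,34,Dublin\nBob,29,Cork\n"
with open("sample.csv", "w") as f:
    f.write(csv_text)

# Read it back into a DataFrame; the first row becomes the column names
data = pd.read_csv("sample.csv")
print(data)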

Other Delimiters / Separators – TSV files


The comma separation scheme is by far the most popular method of storing tabular
data in text files. The choice of the comma (',') character to delimit columns,
however, is arbitrary, and it can be substituted where needed. Popular alternatives
include the tab ("\t") and the semicolon (";"). Tab-separated files are known as TSV (Tab-
Separated Value) files.
When loading data with Pandas, the read_csv function can read any delimited text
file; you simply change the delimiter using the sep parameter.
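As a minimal sketch (the file names here are hypothetical), a tab-separated or semicolon-separated file can be loaded by passing sep to read_csv:

import pandas as pd

# Load a tab-separated file by overriding the default comma delimiter
data = pd.read_csv("data.tsv", sep="\t")

# A semicolon-separated file works the same way
data_semi = pd.read_csv("data_semicolon.csv", sep=";")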

Delimiters in Text Fields – Quotechar


One complication in creating CSV files is if you have commas, semicolons, or tabs
actually in one of the text fields that you want to store. In this case, it’s important to
use a “quote character” in the CSV file to create these fields.
The quote character can be specified in Pandas.read_csv using the quotechar
argument. By default (as with many systems), it’s set as the standard quotation
marks (“). Any commas (or other delimiters as demonstrated below) that occur
between two quote characters will be ignored as column separators.
In the example shown, a semicolon-delimited file with quotation marks as the
quotechar is loaded into Pandas and shown in Excel. The use of the quotechar
allows the “NickName” column to contain semicolons without being split into more
columns.

Figure: Other than commas in CSV files, Tab-separated and Semicolon-separated data is popular also. Quote
characters are used if the data in a column may contain the separating character. In this case, the ‘NickName’
column contains semicolon characters, and so this column is “quoted”. Specify the separator and quote character
in pandas.read_csv
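A short sketch of this idea (the file name and column contents are hypothetical):

import pandas as pd

# Semicolon-separated file where text fields are wrapped in double quotes;
# semicolons inside a quoted "NickName" value are not treated as separators
data = pd.read_csv("people.csv", sep=";", quotechar='"')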

Python – Paths, Folders, Files


When you specify a filename to pandas.read_csv, Python will look in your “current
working directory”. Your working directory is typically the directory from which you
started your Python process or Jupyter notebook.

Figure: Pandas searches your ‘current working directory’ for the filename that you specify when opening or
loading files. The FileNotFoundError can be due to a misspelled filename, or an incorrect working directory.



Finding your Python Path
Your Python path can be displayed using the built-in os module. The os module provides
operating-system-dependent functionality to Python programs and scripts.
To find your current working directory, the function required is os.getcwd(). The
os.listdir() function can be used to display all files in a directory, which is a good
check to see if the CSV file you are loading is in the directory as expected.
Instead of moving the required data files to your working directory, you can also
change your current working directory to the directory where the files reside using
os.chdir().
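A quick sketch of these three calls (the directory name is hypothetical):

import os

print(os.getcwd())        # show the current working directory
print(os.listdir("."))    # list the files in it - is your CSV there?

# Optionally switch to the folder that holds the data files
os.chdir("C:/Users/example/data")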

File Loading: Absolute and Relative Paths


When specifying file names to the read_csv function, you can supply either an absolute
or a relative file path.
 A relative path is the path to the file if you start from your current working
directory. In relative paths, typically the file will be in a subdirectory of the
working directory and the path will not start with a drive specifier, e.g.
(data/test_file.csv). The characters ‘..’ are used to move to a parent directory
in a relative path.
 An absolute path is the complete path from the base of your file system to the
file that you want to load, e.g., c:/Documents/Shane/data/test_file.csv.
Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or ‘/’ in
Mac or Linux)
It’s recommended and preferred to use relative paths where possible in applications,
because absolute paths are unlikely to work on different computers due to different
directory structures.
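Both styles in one sketch (the paths are illustrative only):

import pandas as pd

# Relative path: resolved from the current working directory
df_rel = pd.read_csv("data/test_file.csv")

# Absolute path: complete path from the base of the file system
df_abs = pd.read_csv("C:/Documents/Shane/data/test_file.csv")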

Pandas CSV File Loading Errors


The most common errors you’ll get while loading data from CSV files into Pandas
are:
1. FileNotFoundError: File b'filename.csv' does not exist
A FileNotFoundError is typically an issue with the path setup, the current directory, or file-
name confusion (the file extension can play a part here!)
2. UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid
continuation byte.
A UnicodeDecodeError is typically caused by not specifying the encoding of the
file, and happens when you have a file with non-standard characters. For a quick fix,
try opening the file in Sublime Text and re-saving it with encoding 'UTF-8'.
3. pandas.parser.CParserError: Error tokenizing data.
Parse errors can be caused in unusual circumstances to do with your data format –
try adding the parameter engine='python' to the read_csv function call; this changes
the data-reading function internally to a slower but more stable method.
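A hedged sketch of the usual workarounds (the encoding and file name are assumptions for illustration):

import pandas as pd

# Work around a UnicodeDecodeError by stating the file's encoding explicitly
data = pd.read_csv("filename.csv", encoding="latin-1")

# Work around a parser/tokenizing error with the slower but more tolerant engine
data = pd.read_csv("filename.csv", engine="python")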



Advanced Read CSV Files
There are some additional flexible parameters in the Pandas read_csv() function that
are useful to have in your arsenal of data science techniques:
 Specifying Data Types
As mentioned before, CSV files do not contain any type information for data. Data
types are inferred through examination of the top rows of the file, which can lead to
errors. To manually specify the data types for different columns, the dtype parameter
can be used with a dictionary of column names and data types to be applied, for
example: dtype={"name": str, "age": np.int32}
Note that for dates and datetimes, the format, columns, and other behaviour can be
adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
 Skipping and Picking Rows and Columns from File
The nrows parameter specifies how many rows from the top of CSV file to read,
which is useful to take a sample of a large file without loading completely. Similarly,
the skiprows parameter allows you to specify rows to leave out, either at the start of
the file (provide an int), or throughout the file (provide a list of row indices). Similarly,
the usecols parameter can be used to specify which columns in the data to load.
 Custom Missing Value Symbols
When data is exported to CSV from different systems, missing values can be
specified with different tokens. The na_values parameter allows you to customise the
characters that are recognised as missing values. The default values interpreted as
NA/NaN are: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’,
‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

# Advanced CSV loading example
data = pd.read_csv(
    "data/files/complex_data_example.tsv",    # relative python path to subdirectory
    sep='\t',                                  # Tab-separated value file
    quotechar="'",                             # single quote allowed as quote character
    dtype={"salary": int},                     # Parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified
    parse_dates=['birth_date'],                # Interpret the birth_date column as a date
    skiprows=10,                               # Skip the first 10 rows of the file
    na_values=['.', '??']                      # Take any '.' or '??' values as NA
)

Load JSON files to Python Pandas

Below are the steps to load JSON String into Pandas DataFrame
Step 1: Prepare the JSON String
To start with a simple example, let’s say that you have the following data about
different products and their prices:
Product              Price
Desktop Computer     700
Tablet               250
iPhone               800
Laptop               1200

This data can be captured as a JSON string:


{"Product":{"0":"Desktop
Computer","1":"Tablet","2":"iPhone","3":"Laptop"},"Price":{"0":700,"1":250,"2":800,"3":
1200}}

Step 2: Create the JSON File


Once you have your JSON string ready, save it within a JSON file.
Alternatively, you can copy the JSON string into Notepad, and then save that file with
a .json file extension.
For example, open Notepad, and then copy the JSON string into it:

Figure: Copy the JSON string into the notepad.

Then, save the notepad with your desired file name and add the .json extension at
the end of the file name. Here, I named the file as data.json:



Figure: Saving the file with.json extension.

Step 3: Load the JSON File into Pandas DataFrame


Finally, load your JSON file into Pandas DataFrame.
import pandas as pd
pd.read_json(r'Path where you saved the JSON file\File Name.json')

In this case, the JSON file is stored on the Desktop, under this path:
C:\Users\XYZ\Desktop\data.json
So, this is the code that is used to load the JSON file into the DataFrame:

import pandas as pd
df = pd.read_json(r'C:\Users\XYZ\Desktop\data.json')
print(df)

Run the code in Python (adjusted to your path), and you’ll get the following
DataFrame:

Figure: Output
3 different JSON strings
Below are 3 different ways that you could capture the data as JSON strings.
Each of those strings would generate a DataFrame with a different orientation when
loading the files into Python.



1. Index orientation
{"0":{"Product":"Desktop
Computer","Price":700},"1":{"Product":"Tablet","Price":250},"2":{"Product":"iPhone","
Price":800},"3":{"Product":"Laptop","Price":1200}}

2. Values orientation
[["Desktop Computer",700],["Tablet",250],["iPhone",800],["Laptop",1200]]

3. Columns orientation
{"Product":{"0":"Desktop
Computer","1":"Tablet","2":"iPhone","3":"Laptop"},"Price":{"0":700,"1":250,"2":800,"3":
1200}}
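When loading such strings, read_json takes an orient parameter that matches the layout. A short hedged sketch using the values-oriented string above (column names must be supplied by hand in that case):

import pandas as pd
from io import StringIO

records = '[["Desktop Computer",700],["Tablet",250],["iPhone",800],["Laptop",1200]]'

# Values orientation: a plain list of rows, so we name the columns ourselves
df = pd.read_json(StringIO(records), orient='values')
df.columns = ['Product', 'Price']
print(df)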

3.3.9 Interacting with HTML tables

HTML, the Hypertext Markup Language, is mainly used for creating web applications
and pages. HTML uses tags to define each block of content, like a <p></p> tag for the
start and end of a paragraph and <h2></h2> for the start and end of a heading; many
such tags together make up an HTML web page.

In order to read tabular data from an HTML file, Pandas looks for the <table></table>
tag, which defines a table in HTML (individual cells are marked with <td></td> tags).
The Pandas library provides the functions read_html() and to_html() to import and
export tables as DataFrames. Below we discuss how to read tabular data from an HTML
file and load it into a Pandas DataFrame, as well as how to write data from a Pandas
DataFrame to an HTML file.
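A minimal hedged sketch of both directions (the file names are placeholders, and read_html requires an HTML parser such as lxml or html5lib to be installed):

import pandas as pd

# read_html returns a list of DataFrames, one per <table> found in the page
tables = pd.read_html("page_with_tables.html")
first_table = tables[0]

# to_html writes a DataFrame back out as an HTML <table>
first_table.to_html("exported_table.html")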

3.3.10 Groupby Methods

Before applying groupby function to the dataset, let’s go over a visual example.
Assume we have two features. One is color which is a categorical feature and the
other one is a numerical feature, values. We want to group values by color and
calculate the mean (or any other aggregation) of values for different colors. Then
finally sort the colors based on average values. The following figure shows the steps
of this process.

Figure: groupby in Pandas
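That visual example translates directly into code; a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "values": [10, 4, 14, 7, 6, 9],
})

# Group by color, take the mean of values, then sort the result
result = df.groupby("color")["values"].mean().sort_values(ascending=False)
print(result)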

A Sample DataFrame

In order to demonstrate the effectiveness and simplicity of the grouping commands,
we will need some data. The dataset contains 830 entries from a mobile phone log
spanning a total time of 5 months. The CSV file can be loaded into a Pandas
DataFrame using the pandas.read_csv() function (which supersedes the older,
deprecated DataFrame.from_csv()), and looks like this:

    date            duration   item  month    network    network_type
0   15/10/14 06:58  34.429     data  2014-11  data       data
1   15/10/14 06:58  13.000     call  2014-11  Vodafone   mobile
2   15/10/14 14:46  23.000     call  2014-11  Meteor     mobile
3   15/10/14 14:48  4.000      call  2014-11  Tesco      mobile
4   15/10/14 17:27  4.000      call  2014-11  Tesco      mobile
5   15/10/14 18:55  4.000      call  2014-11  Tesco      mobile
6   16/10/14 06:58  34.429     data  2014-11  data       data
7   16/10/14 15:01  602.000    call  2014-11  Three      mobile
8   16/10/14 15:12  1050.000   call  2014-11  Three      mobile
9   16/10/14 15:30  19.000     call  2014-11  voicemail  voicemail
10  16/10/14 16:21  1183.000   call  2014-11  Three      mobile
11  16/10/14 22:18  1.000      sms   2014-11  Meteor     mobile
…   …               …          …     …        …          …

The main columns in the file are:


1. date: The date and time of the entry
2. duration: The duration (in seconds) for each call, the amount of data (in MB)
for each data entry, and the number of texts sent (usually 1) for each sms
entry.
3. item: A description of the event occurring – can be one of call, sms, or data.
4. month: The billing month that each entry belongs to – of form ‘YYYY-MM’.
5. network: The mobile network that was called/texted for each entry.
6. network_type: Whether the number being called was a mobile, international
(‘world’), voicemail, landline, or other (‘special’) number.
The date column can be parsed using the dateutil library.
import pandas as pd
import dateutil

# Load data from the csv file
# (read_csv replaces the deprecated DataFrame.from_csv used in older pandas;
#  index_col=0 keeps the first column as the index, as from_csv did)
data = pd.read_csv('phone_data.csv', index_col=0)

# Convert date from string to datetime objects
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)

Summarizing the DataFrame

Once the data has been loaded into Python, Pandas makes the calculation of
different statistics very simple. For example, mean, max, min, standard deviations
and more for columns are easily calculable:
# How many rows are in the dataset?
data['item'].count()
Out: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out: 9
The .describe() function as discussed before is a useful summarisation tool that will
quickly display statistics for any variable or group it is applied to. The describe()
output varies depending on whether you apply it to a numeric or character column.

Groupby output format – Series or DataFrame?

The output from a groupby and aggregation operation varies between Pandas Series
and Pandas Dataframes. As a rule of thumb, if you calculate more than one column
of results, your result will be a Dataframe. For a single column of results, the agg
function, by default, will produce a Series.
You can change this by selecting your operation column differently:
# produces Pandas Series
data.groupby('month')['duration'].sum()
# Produces Pandas DataFrame
data.groupby('month')[['duration']].sum()

The groupby output will have an index or multi-index on rows corresponding to your
chosen grouping variables. To avoid setting this index, pass “as_index=False” to the
groupby operation.
data.groupby('month', as_index=False).agg({"duration": "sum"})

Figure: Using the as_index parameter while Grouping data in pandas prevents setting a row index on the result.



Multiple Statistics per Group

The aggregation functionality provided by the agg() function allows multiple statistics
to be calculated per group in one calculation.
Applying a single function to columns in groups
Instructions for aggregation are provided in the form of a python dictionary or list.
The dictionary keys are used to specify the columns upon which you’d like to perform
operations, and the dictionary values to specify the function to run.
For example:
# Group the data frame by month and item and extract a number of stats from each
group
data.groupby(
['month', 'item']
).agg(
{
'duration':sum, # Sum duration per group
'network_type': "count", # get the count of networks
'date': 'first' # get the first date per group
}
)
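The agg() function can also apply several statistics to the same column by passing a list of functions. A short sketch, continuing with the phone-log data loaded above:

# Several statistics for the duration column, per month
data.groupby('month').agg({'duration': ['min', 'max', 'sum', 'mean']})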
3.3.11 Pivot Tables

You may be familiar with pivot tables in Excel to generate easy insights into your
data. The function is quite similar to the group by function available in Pandas.

How to Build a Pivot Table in Python

It’s a table of statistics that helps summarize the data of a larger table by “pivoting”
that data. In Pandas, we can construct a pivot table using the following syntax,
pandas.pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All',
observed=False)
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes)
on the index and columns of the result DataFrame.
Parameters
data: DataFrame
values: column to aggregate, optional
index: column, Grouper, array, or list of the previous
columns: column, Grouper, array, or list of the previous
aggfunc: function, list of functions, dict; default numpy.mean
fill_value: scalar, default None
    Value to replace missing values with (in the resulting pivot table, after aggregation).
margins: bool, default False
    Add all rows / columns (e.g. for subtotals / grand totals).
dropna: bool, default True
    Do not include columns whose entries are all NaN.
margins_name: str, default ‘All’
    Name of the row / column that will contain the totals when margins is True.
observed: bool, default False
    This only applies if any of the groupers are Categoricals. If True: only show observed
    values for categorical groupers. If False: show all values for categorical groupers.

Returns: DataFrame
    An Excel-style pivot table.

We’ll use Pandas to import the data into a dataframe called df. We’ll also print out
the first five rows using the .head() function:
import pandas as pd
df = pd.read_excel('https://github.com/datagy/pivot_table_pandas/raw/master/sample_pivot.xlsx',
                   parse_dates=['Date'])
print(df.head())

Creating a Pivot Table in Pandas

We’ll begin by aggregating the Sales values by the Region the sale took place in:
sales_by_region = pd.pivot_table(df, index = 'Region', values = 'Sales')
print(sales_by_region)
This returns the following output:

This gave us a summary of the Sales field by Region. The default parameter for
aggfunc is mean. Because of this, the Sales field in the resulting dataframe is the
average of Sales per Region. If we wanted to change the type of function used, we
could use the aggfunc parameter. For example, if we wanted to return the sum of all
Sales across a region, we could write:
total_by_region = pd.pivot_table(df, index='Region', values='Sales', aggfunc='sum')
print(total_by_region)
This returns:

Filtering Python Pivot Tables

Let’s create a dataframe that generates the mean Sale price by Region:
avg_region_price = pd.pivot_table(df, index = 'Region', values = 'Sales')
The values in this dataframe are:

Now, say we wanted to filter the dataframe to only include Regions where the
average sale price was over 450, we could write:
avg_region_price[avg_region_price['Sales'] > 450]

We can also apply multiple conditions, such as filtering to show only sales greater
than 450 or less than 430.
avg_region_price[(avg_region_price['Sales'] > 450) | (avg_region_price['Sales'] <
430)]
We have wrapped each condition in brackets and separated the conditions by a pipe
( | ) symbol. This returns the following:



Adding Columns to a Pandas Pivot Table

Adding columns to a pivot table in Pandas can add another dimension to the tables.
The Columns parameter allows us to add a key to aggregate by. For example, if we
wanted to see the number of units sold by Type and by Region, we could write:
columns_example = pd.pivot_table(df, index='Type', columns='Region', values='Units', aggfunc='sum')
print(columns_example)

Columns are optional as we indicated above and provide the keys by which to
separate the data. The pivot table aggregates the values in the values parameter.

3.4 Pandas Plotting


We now understand the basic concepts of Pandas. In the first part of the module, we
discussed the different types of plotting, i.e., univariate, bivariate and
multivariate plotting. Let us discuss each of these plotting types with different
datasets.

3.4.1 Univariate Plotting

We will understand the univariate plotting using the Wine reviews dataset. This
dataset contains 10 columns and 150k rows of wine reviews. We will first import the
dataset and then start with the analysis.

Bar Chart



The above plot says California produces far more wine than any other province of
the world! We might ask what percent of the total is Californian vintage? This bar
chart tells us absolute numbers, but it's more useful to know relative proportions. No
problem, we use the following command

California produces almost a third of wines reviewed in Wine Magazine! The number
of reviews of a certain score allotted by Wine Magazine:
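The plots described here can be produced with one-liners like the following sketch (assuming the reviews DataFrame loaded above):

# Absolute counts: the ten provinces with the most reviewed wines
reviews['province'].value_counts().head(10).plot.bar()

# Relative proportions: divide by the total number of reviews
(reviews['province'].value_counts().head(10) / len(reviews)).plot.bar()

# Number of reviews for each score awarded by Wine Magazine
reviews['points'].value_counts().sort_index().plot.bar()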



Every vintage is allotted an overall score between 80 and 100; and, if we are to
believe that Wine Magazine is an arbiter of good taste, then a 92 is somehow
meaningfully "better" than a 91.
Line Chart
The wine review scorecard has 20 different unique values to fill, for which our bar
chart is just barely enough. What would we do if the magazine rated things 0-100?
We'd have 100 different categories; simply too many to fit a bar in for each one!

A line chart can pass over any number of individual values, making it the tool
of first choice for distributions with many unique values or categories. However, line
charts have an important weakness: unlike bar charts, they are not appropriate for
nominal categorical data. While bar charts distinguish between every "type" of point,
line charts mush them together. So, a line chart asserts an order to the values on
the horizontal axis, and that order won’t make sense with some data. After all, a
"descent" from California to Washington to Tuscany doesn't mean much! Line charts
also make it harder to distinguish between individual values.
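As a sketch, the same score distribution drawn as a line chart:

# The review score distribution as a line chart
reviews['points'].value_counts().sort_index().plot.line()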
Area Chart



Area charts are just line charts, but with the bottom shaded in. When plotting only
one variable, the difference between an area chart and a line chart is mostly visual.
In this context, they can be used interchangeably.
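The corresponding sketch simply swaps the plot type:

# The same distribution, with the area under the line shaded in
reviews['points'].value_counts().sort_index().plot.area()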

Histogram
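A histogram bins a numeric variable and counts the values in each bin. A short sketch for wine prices; filtering out very expensive bottles keeps the x-axis readable (the $200 cut-off is an arbitrary choice):

# Histogram of wine prices under $200
reviews[reviews['price'] < 200]['price'].plot.hist()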

3.4.2 Bivariate Plotting

The bivariate plotting as discussed before, compares two sets of data to find a
relationship between the two variables. We will consider the same Wine Dataset
used for univariate analysis.
Scatter Plot
The simplest bivariate plot is the scatter plot.



The above plot shows us that price and points are weakly correlated: that is,
more expensive wines do generally earn more points when reviewed. Scatter
plots work best with small datasets, and with variables which have a large number of
unique values, as they have a tendency of overplotting. There are a few ways to deal
with overplotting. We've already demonstrated one way: sampling the points. We
downsampled the data above by taking just 100 points from the full set. Without
that step, with enough points the distribution starts to look like a shapeless
blob.
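A sketch of the scatter plot being described, including the 100-point sample (the $100 price filter is an assumption that mirrors the narrowing of the data mentioned above):

# Sample 100 reasonably priced wines and plot price against points
reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')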

Another interesting way to do this that's built right into pandas is to use our next plot
type, a hexplot.
Hex Plot
A hex plot aggregates points in space into hexagons, and then colors those
hexagons based on the values within them:



The data in this plot is directly comparable with that in the scatter plot from earlier,
but the story it tells us is very different. From this hexplot we can see that the bottles
of wine reviewed by Wine Magazine cluster around 87.5 points and around $20. We
did not see this effect by looking at the scatter plot, because too many similarly-
priced, similarly-scoring wines were overplotted. By doing away with this problem,
this hexplot presents us a much more useful view of the dataset. Hexplots and
scatter plots can be applied to combinations of interval variables and/or ordinal
categorical variables.
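Pandas exposes this plot type as plot.hexbin; a sketch (gridsize is a tuning knob, and 15 is an arbitrary choice):

# Aggregate the price/points scatter into hexagonal bins
reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)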
Stacked Bar Chart
We next consider a stacked chart, where we plot the variables one on top of the
other.

Stacked bar plots share the strengths and weaknesses of univariate bar charts. They
work best for nominal categorical or small ordinal categorical variables. Another
simple example is the area plot, which lends itself very naturally to this form of
manipulation, as sketched below.
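A hedged sketch of how such plots are built: assume a derived DataFrame wine_counts whose rows are review scores and whose columns are five wine varieties, each cell holding a count (this intermediate table is an assumption, not shown in the text):

# Each column becomes one layer of the stack
wine_counts.plot.bar(stacked=True)

# The same data as a stacked area chart
wine_counts.plot.area()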



Like single-variable area charts, multivariate area charts are meant for nominal
categorical or interval variables. Stacked plots are visually very pretty. However, they
have two major limitations.
The first limitation is that the second variable in a stacked plot must be a variable
with a very limited number of possible values (probably an ordinal categorical, as
here). Five different types of wine are a good number because it keeps the result
interpretable; eight is sometimes mentioned as a suggested upper bound. Many
dataset fields will not fit this criterion naturally, so you have to "make do", as here, by
selecting a group of interest.
The second limitation is one of interpretability. As easy as they are to make, and as
pretty as they look, stacked plots make it really hard to distinguish concrete values.
For example, looking at the plots above, can you tell which wine got a score of 87
more often: Red Blends (in purple), Pinot Noir (in red), or Chardonnay (in green)? It's
actually really hard to tell!
Bivariate Line Chart
One plot type we've seen already that remains highly effective when made bivariate
is the line chart. Because the line in this chart takes up so little visual space, it's
really easy and effective to overplot multiple lines on the same chart.



Using a line chart this way makes inroads against the second limitation of stacked
plotting: interpretability. Bivariate line charts are much more interpretable because
the lines themselves don't take up much space. Their values remain readable when
we place multiple lines side-by-side, as here. For example, in this chart we can
easily answer our question from the previous example: which wine most commonly
scores an 87. We can see here that the Chardonnay, in green, narrowly beats out
the Pinot Noir, in red.
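Continuing the hedged wine_counts assumption from above, the bivariate line chart is simply:

# One line per wine variety, sharing the same axes
wine_counts.plot.line()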

3.4.3 Multivariate Plotting

In an effort to understand the different concepts of multivariate plotting, we will
consider the FIFA 18 Complete Player Dataset. We will explore the different
multivariate plots here. First, let us import the dataset.

We next perform some data pre-processing to make some modifications to the
dataset. We show here a small part of the code.
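A hedged sketch of the import step (the CSV file name is an assumption based on the public Kaggle dataset):

import pandas as pd

# Load the FIFA 18 complete player dataset
footballers = pd.read_csv("CompleteDataset.csv", index_col=0)
footballers.head()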

Multivariate Scatter Plots


We will start with the scatter plot, where we are interested in seeing which type of
offensive player tends to get paid the most.



The x-axis plots how well the players are paid and the y-axis tracks the Overall
score of the player. The hue parameter tracks which of the three categories of
interest the player represented by each point belongs to.
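Since the hue parameter comes from seaborn rather than plain Pandas, here is a hedged sketch; offensive_players and the column names Wage, Overall and Position are hypothetical stand-ins for the pre-processed data described above:

import seaborn as sns

# Scatter wages against overall rating, coloured by player position
sns.scatterplot(data=offensive_players, x='Wage', y='Overall', hue='Position')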
Heatmap
The most heavily used summarization visualization is the correlation plot, which
measures the correlation between every pair of values in a dataset and plots the
result in color.

The color and label in the above plot indicate the amount of correlation between
the two variables of interest. We can see from the above heatmap that the variables
Agility and Acceleration are highly correlated, whereas the variables Aggression and
Balance are largely uncorrelated.
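A hedged sketch of such a correlation heatmap with seaborn (the list of skill columns is illustrative, and the to_numeric conversion is an assumption about how the raw columns are cleaned):

import pandas as pd
import seaborn as sns

# Correlation matrix of a few skill columns, drawn as a heatmap
skills = footballers[['Agility', 'Acceleration', 'Aggression', 'Balance']].apply(
    pd.to_numeric, errors='coerce')
sns.heatmap(skills.corr(), annot=True)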
Grouped Box Plot
The grouped box plot takes advantage of grouping. It is mainly used when we are
interested in a question such as: “Do strikers score higher on Aggression than
goalkeepers do?”

The plot in this case demonstrates conclusively that within our dataset goalkeepers
(at least, those with an overall score between 80 and 85) have much lower
Aggression scores than strikers do. In this plot, the horizontal axis encodes the
Overall score, the vertical axis encodes the Aggression score, and the grouping
encodes the Position. We show the output below.
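A hedged seaborn sketch that would produce such a plot (the filtering to overall scores of 80-85, the position labels GK and ST, and the column names are assumptions describing the figure):

import seaborn as sns

# Compare Aggression across Overall score, grouped by position (GK vs ST)
subset = footballers[footballers['Overall'].between(80, 85)]
subset = subset[subset['Position'].isin(['GK', 'ST'])]
sns.boxplot(data=subset, x='Overall', y='Aggression', hue='Position')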

So, in this chapter we have explored various important concepts of data
analytics and the libraries that are used in data analysis, such as NumPy, Pandas and
Matplotlib. Using these libraries, we can analyse our data and make sense of
it.



Chapter 4: Building Machine Learning
Models
Learning Outcomes:

 Understand the basics of machine learning, its types and applications


 Create Machine Learning model using various Algorithms
 Demonstrate working of ML model without coding
 Able to differentiate between Supervised and Unsupervised Learning
 Able to identify and apply specific ML algorithm to solve real life problems

4.1 Machine Learning Basics


Machine Learning (ML) is undeniably one of the most influential and powerful
technologies in today’s world. It is a tool for turning information into knowledge. It is
about making predictions (answering questions) and classifications based on data.
The more data you have, the easier it will be to recognize patterns and inferences. In
the past 50 years, there has been an explosion of data. This mass of data is useless
unless we analyze it and find the patterns hidden within. Machine learning
techniques are used to automatically find the valuable underlying patterns within
complex data that we would otherwise struggle to discover. The hidden patterns and
knowledge about a problem can be used to predict future events and perform all
kinds of complex decision making. Summarizing, ML is the scientific study of
algorithms and statistical models that computer systems use to perform a specific
task without using explicit instructions, relying on patterns and inference instead.

Terminology of ML

1. Dataset: A set of data examples, that contain features important to solving the
problem.
2. Features: Important pieces of data that help us understand a problem. These
are the inputs which are fed into a Machine Learning algorithm to help it learn.
3. Target: It is the information the machine learns to predict. The prediction is
what the machine learning model “guesses” what the target value should be
based on the given features.



4. Model: A model defines the relationship between the target and the features.
It learns this from the data it is shown during training. The model is the output
you get after training an algorithm. For example, a decision tree algorithm
would be trained and produce a decision tree model.
4.1.1 Machine Learning Life Cycle:
The machine learning life cycle is a cyclical process that the ML projects follow. This
is a cycle iterating between improving the data, model, and evaluation that is never
really finished. This cycle is crucial in developing an ML model because it focuses on
using model results and evaluation in order to refine our dataset. The most important
thing in the complete process is to understand the problem and to know the purpose
of the problem. Therefore, before starting the life cycle, we need to understand the
problem, because a good result depends on a good understanding of the
problem.

1. Data Collection: This is the first step, and the goal of this step is to collect the
data that the algorithm will learn from.
2. Data Preparation: Format and engineer the data into the optimal format,
extracting important features and performing dimensionality reduction.
3. Data Wrangling: It is the process of cleaning and converting raw data into a
useable format.
4. Training: Also known as the fitting stage, this is where the ML algorithm
actually learns by showing it the data that has been collected and prepared.
5. Evaluation: Test the model to see how well it performs and then fine tune the
model to maximize its performance.
6. Deployment: This is the last step of machine learning cycle, where we deploy
the model in a real-world system.

Let us now see some real-life applications of machine learning in our day-to-day life.
4.1.2 Real time Application of Machine Learning

Machine learning is relevant in many fields and industries, and it has the capability to grow
over time. Here are five real-life examples of how machine learning is being used.

1. Image recognition

Image recognition is a well-known and widespread example of machine learning in
the real world. It can identify an object in a digital image, based on the intensity of
the pixels in black-and-white or colour images. ML is also frequently used for
facial recognition within an image.
2. Speech recognition
Machine learning can translate speech into text. Certain software applications can
convert live voice and recorded speech into a text file. The speech can be



segmented by intensities on time-frequency bands as well. Some of the most
common uses of speech recognition software are devices like Google
Home or Amazon Alexa.
3. Medical diagnosis
Machine learning can help with the diagnosis of diseases. Many physicians use
chatbots with speech recognition capabilities to discern patterns in symptoms. In the
case of rare diseases, the joint use of facial recognition software and machine
learning helps scan patient photos and identify phenotypes that correlate with rare
genetic diseases.
4. Statistical arbitrage
Arbitrage is an automated trading strategy that’s used in finance to manage a large
volume of securities. The strategy uses a trading algorithm to analyze a set of
securities using economic variables and correlations.

5. Extraction
Machine learning can extract structured information from unstructured data. A
machine learning algorithm automates the process of annotating datasets for
predictive analytics tools; in healthcare, for example, this helps develop methods to
prevent, diagnose, and treat disorders.

Let us now discuss the different types of machine learning.

4.1.3 Types of Machine Learning

There are three main categories of machine learning techniques: supervised


learning, unsupervised learning and reinforcement learning.
Supervised Machine Learning

Supervised learning is the most popular type of machine learning in which machines
are trained using "labelled" training data, and on the basis of that data, machines
predict the output. The labelled data means some input data is already tagged with
the correct output. In supervised learning, the training data provided to the machines
work as the supervisor that teaches the machines to predict the output correctly. The
aim of supervised learning algorithm is to find a mapping function to map the input
variable (x) with the output variable (y). Common algorithms used during supervised
learning include neural networks, decision trees, linear regression, and logistic
regression.
In the real-world, supervised learning can be used for predicting real estate prices,
finding disease risk factors, image classification, fraud Detection, spam filtering, etc.



Image: Working of Supervised learning

Unsupervised Machine Learning

Unsupervised learning is a machine learning technique where, unlike supervised
learning, the model works with unlabelled data. Here the model, instead of finding the exact
nature of the relationship between any two data points, finds the hidden patterns and
insights in the given data. It can be compared to the learning which takes place in the
human brain while learning new things. The goal of unsupervised learning is to find
the underlying structure of dataset, group that data according to similarities, and
represent that dataset in a compressed format. Common algorithms used in
unsupervised learning include Hidden Markov models, k-means, hierarchical
clustering, and Gaussian mixture models.

A few example use cases include creating customer groups based on purchase
behaviour, and grouping inventory according to sales and manufacturing metrics.

Image: Working of unsupervised learning



Reinforcement Machine Learning

Reinforcement learning directly takes inspiration from how human beings learn from
data in their lives. It is a sort of algorithm that improves upon itself and learns from
new situations by using a system of rewards and penalties. The learning system,
called the agent in this context, learns by interacting with an environment. The agent selects
and performs actions, receiving rewards for performing correctly and penalties for
performing incorrectly. In reinforcement learning the agent learns by itself, without
intervention from a human, and is trained to give the best possible solution for the
best possible reward. For example, teaching cars to park themselves and drive
autonomously, dynamically controlling traffic lights to reduce traffic jams, training
robots etc.

Image: Reinforcement Machine Learning

Now, after understanding the brief overview of machine learning and its types, let us
explore a very popular library in machine learning which is used by many machine
learning engineers and data scientists to perform various data science and AI
projects, named as Scikit-learn.
4.1.4 Scikit Learn library overview
Scikit-learn provides a range of supervised and unsupervised learning algorithms via
a consistent interface in Python. It is licensed under a permissive simplified BSD
license and is distributed under many Linux distributions, encouraging academic and
commercial use. Scikit Learn is built on top of several common data and math
Python libraries. Such a design makes it super easy to integrate between them all.
You can pass numpy arrays and pandas data frames directly to the ML algorithms of
Scikit! It uses the following libraries:
 NumPy: For any work with matrices, especially math operations
 SciPy: Scientific and technical computing
 Matplotlib: Data visualization
 IPython: Interactive console for Python
 Sympy: Symbolic mathematics
 Pandas: Data handling, manipulation, and analysis
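As a small hedged sketch of that interoperability, a scikit-learn estimator can be fitted directly on NumPy arrays (the data here is made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset: one feature, five samples
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

model = LinearRegression()
model.fit(X, y)                 # train on NumPy arrays
print(model.coef_, model.intercept_)
print(model.predict([[6.0]]))   # predict for a new value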
Now we have discussed about the machine learning library sklearn, let us start with
the basic algorithms in machine learning. We will first focus on the regression and
classification algorithms of Supervised machine learning, which works with labelled
datasets.

4.1.5 Regression vs Classification

Supervised machine learning is classified into two types: Regression and
Classification. Both work with labelled datasets; however, their different
approaches to machine learning problems are their point of divergence. Classification
separates the data, whereas Regression fits the data. Regression algorithms are
used to predict continuous values such as price, salary, age, etc., whereas
Classification algorithms are used to predict/classify discrete values such as
Male or Female, True or False, Spam or Not Spam, etc. Let us discuss them in some
detail.

Image: Classification vs Regression

Classification

Classification is an algorithm that approximates a mapping function from input
variables to discrete output variables, which can be labels or categories. A
classification algorithm can have both discrete and real-valued input variables, but it
requires that the examples be classified into one of two or more classes.



Classification algorithms are used for things like email and spam classification,
predicting the willingness of bank customers to pay their loans, and identifying
cancer tumor cells. Classification Algorithms can be further divided into the following
types:

i) Logistic Regression
ii) K-Nearest Neighbours
iii) Support Vector Machines
iv) Naïve Bayes
v) Decision Tree Classification
Regression:

Regression finds the correlations between dependent and independent variables. It
helps in predicting a continuous value based on the input variables, such as house
prices, market trends, weather patterns, and oil and gas prices. The main goal in
regression analysis is to estimate a mapping function based on the input variables
and the continuous output variable. Regression algorithms can be further classified
into the following types:

i) Simple Linear Regression
ii) Multiple Linear Regression
iii) Polynomial Regression
iv) Decision Tree Regression
v) Random Forest Regression

4.2 Linear Regression


Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis, i.e., to predict
the value of a variable, based on the value of another variable. Linear regression
makes predictions for continuous/real or numeric variables such as sales, salary,
age, product price, etc. The variable we want to predict is called the dependent (y)
variable and the variable(s) we are using to predict the value of y are called the
independent variable(s).

Linear regression can be further divided into two types of the algorithm:
1. Simple Linear Regression: If a single independent variable is used to predict the
value of a numerical dependent variable, then such a Linear Regression
algorithm is called Simple Linear Regression.



2. Multiple Linear regression: If more than one independent variable is used to
predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Multiple Linear Regression.

The linear regression model provides a sloped straight line representing the
relationship between the variables, where a scatter plot can be a helpful tool in
determining the strength of the relationship between the two variables.

Image: Relationship between the variables in linear regression

Mathematically, a linear regression line has an equation of the form:


y= a + b x
where, y is the Dependent Variable (Target Variable), x is the Independent Variable
(predictor Variable). The slope of the line is b and a is the intercept (the value of y,
when x = 0). A regression line can show two types of relationship:
 Positive Linear Relationship: If the dependent variable increases on the Y-axis
and independent variable increases on X-axis, then such a relationship is termed
as a Positive linear relationship. The line equation will be y = a + b x.
 Negative Linear Relationship: If the dependent variable decreases on the Y-axis
as the independent variable increases on the X-axis, then such a relationship is
called a negative linear relationship. The line equation will be y = a - b x (i.e., the slope is negative).

4.2.1 Assumptions of Linear Regression


Before we attempt to perform linear regression, we need to make sure that our data
can be analyzed using this procedure. Our data must pass through certain required
assumptions, which ensures to get the best possible result from the given dataset.

1. The output variable should be measured at a continuous level, such as time,
prices, test scores, or sales, to name a few.
2. Linear regression assumes the linear relationship between the dependent and
independent variables, therefore use a scatterplot to find out the relationship
quickly.
3. The observations or the input variables should be independent of each other.
There should be no correlation between the independent variables.
4. The data should have no significant outliers.
5. Check for homoscedasticity — a statistical concept in which the variances
along the best-fit linear-regression line remain similar all through that line.
6. The residuals (errors) of the best-fit regression line should follow the normal
distribution pattern. If error terms are not normally distributed, then confidence
intervals will become either too wide or too narrow, which may cause
difficulties in finding coefficients.

When working with linear regression, our main goal is to find the best-fit line, which
means the error between the predicted values and the actual values should be minimized.
The best-fit line will have the least error. Different values for the weights or
coefficients of the line (a, b) give different regression lines, so we need to calculate
the best values for a and b to find the best-fit line.
4.2.2 Ordinary Least Square Method
Ordinary least squares is a common technique to determine the coefficients of linear
regression. This method draws a line through the data points that minimizes the sum
of the squared differences between the observed values and the corresponding fitted
values. This approach treats the data as a matrix and uses linear algebra operations
to estimate the optimal values for the coefficients. It means that all of the data must
be available and we must have enough memory to hold the data and perform the matrix
operations.

Image: Ordinary least squares line
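For a single feature, the OLS slope and intercept have closed-form solutions; a small sketch with made-up numbers:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Closed-form ordinary least squares for y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print("intercept a =", a, "slope b =", b)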



Cost function
Cost function measures how well a machine learning model performs. It is a calculation
of the error between the predicted and the actual values, and it is minimized to find the
optimal regression coefficients or weights (a, b).

In the case of Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and
the actual values. It can be written as:

MSE = (1/N) * Σ (yi − (a + b xi))²

where N is the total number of observations, yi is the actual value, and (a + b xi) is the
predicted value.
Model Performance:
The Goodness of fit determines how the line of regression fits the set of
observations. The process of finding the best model out of various models is called
optimization. It can be achieved by the following method:

R-squared method: R-squared is a statistical method that determines the goodness
of fit. It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%. A high value of R-squared means a
smaller difference between the predicted values and actual values and hence
represents a good model. It can be calculated from the formula below:

R-squared = Explained Variation / Total Variation
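Both metrics are available in scikit-learn; a hedged sketch using made-up actual and predicted values:

from sklearn.metrics import mean_squared_error, r2_score

y_actual = [3.0, 5.0, 7.5, 9.0]
y_predicted = [2.8, 5.3, 7.1, 9.4]

print("MSE:", mean_squared_error(y_actual, y_predicted))
print("R-squared:", r2_score(y_actual, y_predicted))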

We have discussed regression, the ordinary least squares method, the optimization
technique, and metrics to measure the performance of regression. Now, let’s get
started with the cloud platform where we can train, test and deploy our machine
learning model.
4.2.3 Azure ML No Code Platform

Machine Learning is the foundation for most artificial intelligence solutions, and the
creation of an intelligent solution often begins with the use of machine learning to
train a predictive model using historic data that we have collected. Azure Machine
Learning is a cloud service that we can use to train and manage machine learning
models. It allows building no-code machine learning models through a drag and drop
visual interface. It’s designed to help data scientists and machine learning engineers



to leverage their existing data processing and model development skills &
frameworks.

Image: Azure Machine Learning

Azure ML Studio Briefing


Azure Machine Learning Studio is web-based integrated development environment
(IDE) for developing data experiments. It is closely knit with the rest of Azure’s cloud
services and that simplifies development and deployment of machine learning
models and services.

Setting up an account in Azure ML Studio and Creating Workspace

 Open the following web link in your web browser:
https://studio.azureml.net/



 Click on Sign In. Type your Microsoft email ID and then press Next. If you
do not have a Microsoft account, first create one using the “create one” option
(https://signup.live.com/).
 Type the password for your Microsoft account and press Sign in. Press Yes to
stay signed in.

 Now, you will be redirected into the following Microsoft Azure Machine
Learning Studio (Classic) and your free workspace will be created as below:



4.2.4 Create a Regression model with Azure ML Studio

 Sign in into Microsoft Azure Machine Learning Studio (classic) and create
workspace as discussed above.

 First select Experiment and then New at the bottom of the page.

 Select Blank Experiment.



 Give the title for the project.

 Select Sample option from the Saved Dataset.

 Select Automobile Price Data Dataset



 Drag selected dataset on Panel.

 Right-click on the dataset output port (labelled 1 in the screenshot) and choose the Visualize option.



 Visualize the Dataset and then close it.

For example: the normalized-losses column has 41 missing values, which is the maximum.



 Select the Data Transformation  Manipulation  Select Columns in Dataset
and drag it into Panel.

 Make connection between dragged item, press on red sign and then Launch
column selector to choose the relevant column.

 Remove the column which has the missing values by selecting With Rules 
All Columns  Exclude  column names  normalized-losses and click on
the tick.
 Select the Data Transformation  Manipulation  Clean missing data, drag
it into Panel and make connection.



 Click on Clean missing data, set Remove entire row in cleaning mode and
press Run.

 If there is green tick, it means no error.



 Select Data Transformation  Sample and Split  Split Data and drag it into the
Panel. Select the value 0.75 to split the data.

 Select Machine Learning  Train  Train Model and drag it into the Panel
and connect it.
Select the machine learning Algorithm from Machine Learning  Initialize
Model Regression  Linear Regression, drag it into panel and make
connection.



 Click on Train Model and Launch Column Selector

 Now Select price as output for the prediction and press the tick mark.



 For Score calculation, Choose Machine Learning  Score  Score Model
and drag it into panel and connect it.

 Connect Split Data with Score Model for Testing.

 For Evaluation of the model choose, Machine Learning  Evaluate 


Evaluate Model, drag it into panel and connect. Run the model.



 Green tick shows that model has run successfully. Entire model would be
display as below:

 To see the visualization result, right-click on Evaluate Model  Evaluation
Results  Visualize.
We just explored Microsoft Azure Machine Learning Studio, a GUI-based
integrated development environment for constructing and operationalising machine
learning workflows, and we also constructed a linear regression model. Now let us get
started with the logistic regression algorithm, which is mainly used for classification.

4.3 Logistic Regression


Logistic regression is one of the most popular supervised machine learning
algorithms. It is used to calculate or predict the probability of the categorical
dependent variable using a given set of independent variables. Logistic regression is
used to solve classification problems and the most common used case is the binary
logistic regression where the outcome is binary (yes or no). We see logistic
regression applied across multiple areas and fields such as disease
detection/prediction, spam detection, fraudulent transaction etc.

The logistic regression model passes the outcome through a logistic function to
calculate the probability of an occurrence. The model then maps the probability to
binary outcomes. The logistic function is a type of sigmoid, a mathematical function
resulting in an S-shaped curve that takes any real number and maps it to a probability
between 0 and 1. The formula of the sigmoid function is:

f(x) = 1 / (1 + e^(-x))

where e is the base of the natural logarithms and x is the actual numerical value we
want to transform. Below is the image showing the logistic function:



Image: logistic function

Type of Logistic Regression

On the basis of the number of possible outcomes, Logistic Regression can be


classified into three types:

Binomial: In binomial Logistic regression, we have only two possible outcomes of the
dependent variables, such as 0 or 1, Pass or Fail, etc.

Multinomial: In multinomial Logistic regression, we have multiple outcomes. There


can be 3 or more possible unordered types of the dependent variable having no
quantitative significance, such as "Type A", or "Type B", or "Type C".

Ordinal: In ordinal Logistic regression, the outcome will be ordered. The dependent
variable can have 3 or more possible ordered types, having a quantitative
significance. These variables may represent "low", "Medium", or "High" and each
category can have the scores like 0,1,2,3.
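As a hedged scikit-learn sketch of binomial logistic regression (the tiny dataset of hours studied versus pass/fail is made up):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied (feature) and pass/fail outcome (0 = fail, 1 = pass)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict([[4.5]]))        # predicted class
print(clf.predict_proba([[4.5]]))  # probability of each class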

4.3.1 Confusion Matrix, ROC & AUC Curve


A confusion matrix is a performance measurement for machine learning
classification problems, where the output can be two or more classes. It is a table that
visualizes and summarizes the performance of a classification algorithm.



The confusion matrix consists of four basic characteristics (numbers) that are used to define the measurement metrics of the classifier. Let us try to understand them in the context of a disease-prediction classifier. These are:

1. True Negative (TN): TN represents the number of patients who have been classified by the algorithm as healthy and who are indeed healthy.
2. True Positive (TP): TP represents the number of patients who have been correctly classified as having the disease and who do have the disease.
3. False Positive (FP): FP represents the number of patients who have been classified as having the disease but who are actually healthy.
4. False Negative (FN): FN represents the number of patients who have been predicted by the algorithm as healthy but who are actually suffering from the disease.

The performance metrics of a classification algorithm, such as accuracy, precision, recall and F1, can be constructed from the confusion matrix based on the above stated TP, TN, FP and FN; a short Python sketch computing these metrics follows the list below.

• Accuracy of an algorithm is defined as the ratio of the patients who were correctly classified (TP + TN) to the total number of patients (TP + TN + FP + FN).

• Precision of an algorithm is defined as the ratio of the patients correctly classified as having the disease (TP) to the total number of patients predicted to have the disease (TP + FP).

• Recall, also known as sensitivity, is defined as the ratio of the patients correctly classified as having the disease (TP) to the patients who actually have the disease (TP + FN).

• F1 score, also known as the F-measure, states the equilibrium between precision and recall. We use the F-measure when we have to compare two models, one with low precision and high recall or vice versa.
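To make these formulas concrete, here is a minimal Python sketch that computes the four metrics from hypothetical TP, TN, FP and FN counts (the numbers are invented for illustration):

# Hypothetical counts for a disease-prediction classifier
TP, TN, FP, FN = 80, 90, 10, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)                      # also called sensitivity
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, F1={f1:.2f}")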

We next discuss the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve). The ROC curve and AUC are performance measures that provide a comprehensive evaluation of classification models. As discussed before, algorithms like logistic regression return probabilities rather than discrete outputs. A threshold value is set on the probabilities to distinguish between the classes. Depending on the threshold value, the values of metrics such as precision and recall also change. We cannot maximize both precision and recall together, as increasing recall typically decreases precision and vice versa. Our aim will be to trade off precision and recall appropriately, depending on the task.

The ROC curve is used for this case; it is a graph that summarizes the performance of a classification model at all classification thresholds. The curve has two axes, True Positive Rate (TPR) vs False Positive Rate (FPR), both of which take values between 0 and 1.

We show below a typical ROC curve, where the points on the curve are calculated by evaluating the classification model many times with different thresholds. Our main aim should be to increase the TPR while keeping the FPR low. We next move on to AUC, which aggregates the performance of the model over all threshold values. It is the area under the ROC curve and can be calculated using integral calculus. The best possible value of AUC is one for a perfect classifier, while a value close to zero indicates that all the predictions are wrong.
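As a hedged illustration (the dataset, split and solver settings below are arbitrary choices, not part of the Azure workflow above), scikit-learn can compute the ROC curve and AUC as follows:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, probs))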

After understanding the fundamentals of logistic regression and the ROC/AUC curve, let us get started with building a model using ML Studio.

4.3.2 Logistic Regression model with ML Studio

We will follow the steps, as we did for linear regression.

• Sign in to Microsoft Azure Machine Learning Studio (classic) and create a workspace.
• First select Experiment and then New at the bottom of the page.
• Select Blank Experiment.
• Give the title for the project as Logistic Regression Model. Select the Sample option from the Saved Datasets.
• Select the Breast Cancer dataset and drag the selected dataset onto the panel. Right-click on its output port (1) and choose the Visualize option. Visualize the dataset and then close it. For example, in the Class column, 0 represents no cancer while 1 represents cancer.

• Select Data Transformation → Sample and Split → Split Data, drag it into the panel and connect it. Select the value 0.8 to split the data between train and test. Set Random seed = 123.
• Select Machine Learning → Train → Train Model, drag it into the panel and connect it.
• Select the machine learning algorithm from Machine Learning → Initialize Model → Classification → Two-Class Logistic Regression, drag it into the panel and make the connection.
• Click on Train Model and launch the Column Selector.
• Now select Class as the output for the prediction and press the tick mark.
• For score calculation, choose Machine Learning → Score → Score Model, drag it into the panel and connect it. Also connect Split Data with Score Model for testing.
• For evaluation of the model choose Machine Learning → Evaluate → Evaluate Model, drag it into the panel and connect it. Run the model.
• A green tick shows that the model has run successfully. The entire model would be displayed as below:

• To see the visualization result, right-click on Evaluate Model → Evaluation Results → Visualize.

Now, after creating a model on the Azure platform, let us learn about another classification algorithm.



4.4 Naïve Bayes Theorem
Naïve Bayes is one of the simplest supervised machine learning algorithms and is used as a classifier. The working of this model is based on Bayes' theorem, which was proposed by Reverend Thomas Bayes back in the 1760s. The algorithm is widely used to tackle problems such as text classification, spam filtering, sentiment analysis, and recommendation systems, to name a few.

This algorithm is called naïve because the model assumes that the input features
that go into the model are independent of each other, i.e., there is no correlation
between the input features. The assumptions may or may not be true, therefore the
name naïve. We will first discuss a bit about probability, conditional probability and
Bayes Theorem before we go into the working of Naïve Bayes.

4.4.1 What is Probability?


Probability helps us to predict how likely an event X is to happen, considering the total number of potential outcomes. Mathematically, probability can be represented by the following equation:

Probability of an event = Number of Favourable Events / Total Number of Outcomes

The events for which we want the probability of their happening are known as the
“favourable events”. The probability always lies in the range of 0 to 1, with 0 meaning
there is no probability of that event happening and 1 meaning there is a 100%
possibility it will happen. When we restrict the idea of probability to create a
dependency on a specific event, that is known as conditional probability.
4.4.2 Conditional Probability
Conditional probability is the probability of one (or more) event given the occurrence of another event. Let us consider two events, A and B. The conditional probability of event B is defined as the probability that event B will occur, given the knowledge that event A has already happened. Mathematically, it is denoted by

P(B|A) = P(A and B)/P(A).
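As a quick worked example, consider rolling a fair die: let A be the event "the roll is even" and B the event "the roll is greater than 3". Then P(A) = 3/6 = 1/2 and P(A and B) = P({4, 6}) = 2/6 = 1/3, so P(B|A) = (1/3) / (1/2) = 2/3.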

We next discuss about the Bayes theorem which follows in the footsteps of the
conditional probability.



4.4.3 Bayes Theorem
Bayes' theorem, as mentioned above, works on the concept of conditional probability. In conditional probability we know that the occurrence of a particular outcome is conditioned on the outcome of another event occurring. Given two events A and B, Bayes' theorem states that

P(A|B) = P(B|A) * P(A) / P(B)

where,

• P(A) and P(B) are called the marginal/prior probability and the evidence respectively. They are the probabilities of events A and B occurring, irrespective of the outcome of the other.
• P(A|B) is called the posterior probability. It is the probability of event A occurring, given that event B has occurred.
• P(B|A) is called the likelihood probability. It is the probability of event B occurring given that event A has occurred.
• P(A∩B) is the joint probability of both events A and B.

The above definitions allow Bayes Theorem to be restated as:

Posterior = Likelihood * Prior / Evidence

Let us ask the question, “What is the probability that there will be rain, given the
weather is cloudy?”

P(Rain) is the prior, P(Cloud|Rain) is the likelihood, and P(Cloud) is the evidence; therefore P(Rain|Cloud) = P(Cloud|Rain) * P(Rain) / P(Cloud).
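To make the rain/cloud example concrete, here is a tiny sketch with made-up probabilities; the numbers are assumptions chosen only for illustration:

# Assumed (made-up) probabilities, purely for illustration
p_rain = 0.2                 # prior P(Rain)
p_cloud_given_rain = 0.85    # likelihood P(Cloud|Rain)
p_cloud = 0.4                # evidence P(Cloud)

# Bayes theorem: posterior = likelihood * prior / evidence
p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print(f"P(Rain|Cloud) = {p_rain_given_cloud:.3f}")   # 0.425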

If we want to understand Bayes theorem from a machine learning perspective, we will call our input features the evidence and the labels the outcomes in our training data.
Using conditional probability, we calculate the probability of the evidence given the
outcomes, denoted as P(Evidence|Outcome). Our final goal will be to find the
probability of an outcome with respect to the evidence, denoted as
P(Outcome|Evidence) from the testing data. If the problem at hand has two
outcomes, then we calculate the probability of each outcome and say the highest
one wins. But what if we have multiple input features? This is when Naïve Bayes
comes into picture.

The Bayes rule provides the formula to compute the probability of the output (Y) given the input (X). In real-world problems, unlike the hypothetical assumption of having a single input feature, we have multiple X variables. When we can assume the features are independent of each other, we extend the Bayes rule to what is called Naive Bayes. Consider a case where there are multiple inputs (X1, X2, X3, ..., Xn). We predict the outcome (Y) using the Naive Bayes equation as follows:

P(Y=k | X1...Xn) = ( P(X1 | Y=k) * P(X2 | Y=k) * P(X3 | Y=k) * ... * P(Xn | Y=k) ) * P(Y=k) / ( P(X1) * P(X2) * P(X3) * ... * P(Xn) )

In the above formula:

• P(Y=k | X1...Xn) is called the posterior probability, which is the probability of an outcome given the evidence.
• P(X1 | Y=k) * P(X2 | Y=k) * ... * P(Xn | Y=k) is the likelihood of the evidence.
• P(Y=k) is the prior probability.
• P(X1) * P(X2) * ... * P(Xn) is the probability of the evidence.

4.4.4 Types of Naïve Bayes Model


There are three types of Naive Bayes model, which are given below; a brief scikit-learn sketch follows the list.

1. Gaussian: It is used for numerical/continuous features. The Gaussian model assumes that the features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, then the model assumes these values are sampled from a Gaussian distribution.
2. Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e. determining which category a particular document belongs to, such as sports, politics, education, etc.
3. Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
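As a brief, hedged illustration, the sketch below fits a Gaussian Naïve Bayes classifier with scikit-learn on the Iris dataset; the dataset and split parameters are chosen only for demonstration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB   # MultinomialNB / BernoulliNB are used similarly

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB().fit(X_train, y_train)   # continuous features -> Gaussian model
print("Test accuracy:", model.score(X_test, y_test))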

4.5 Bag of Words Approach


Machine learning algorithms cannot work with raw text directly. The text has to be converted into numbers, specifically vectors of numbers; this process is called feature extraction. The bag-of-words (BoW) model is a way of representing text data when modelling textual data with machine learning algorithms. The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modelling and document classification.



One of the biggest problems with textual data is that it is messy and unstructured, and machine learning algorithms, as we know, prefer structured, well-defined, fixed-length inputs. Using the BoW technique, we can convert variable-length texts into a fixed-length vector. The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

• A vocabulary of known words.
• A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of
words in the document is discarded. The model is only concerned with whether
known words occur in the document, not where in the document.

Here's a sample of reviews about a particular horror movie.

• Review 1: This movie is very scary and long
• Review 2: This movie is not scary and is slow
• Review 3: This movie is spooky and good

We can see that there are some contrasting reviews about the movie as well as the
length and pace of the movie. Imagine looking at a thousand reviews like these.
Clearly, there is a lot of interesting insights we can draw from them and build upon
them to gauge how well the movie performed. However, as we saw above, we
cannot simply give these sentences to a machine learning model and ask it to tell us
whether a review was positive or negative. We need to perform certain text pre-processing steps.

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'. We can now take each of these words and mark how many times it occurs in each of the three movie reviews above. This will give us 3 vectors for 3 reviews:



Image: Bag of Words Approach

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
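The same idea can be reproduced programmatically; here is a minimal sketch using scikit-learn's CountVectorizer. Note that it lowercases the text and orders the vocabulary alphabetically, so the columns will not appear in the same order as the hand-built vocabulary above:

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)       # sparse matrix of word counts

print(vectorizer.get_feature_names_out())     # learned vocabulary (alphabetical order)
print(bow.toarray())                          # one count vector per review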

And that's the core idea behind a Bag of Words (BoW) model. The method we have used so far takes the count of each word and represents the word in the vector by that count. So, what does a high word count signify? Can we interpret that this particular word is important for retrieving information about the documents? The answer is no: if a particular word occurs many times in the dataset, it may simply be a frequent word, not a meaningful or relevant one.

There are approaches such as TF-IDF that rescale the frequency of words by how often they appear in all documents, so that the scores for words like "the", which are frequent across all documents, are penalized. Term frequency–inverse document frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is calculated by multiplying two different metrics:

Term Frequency (TF): It is a measure of how frequently a term, t, appears in a document, d:

TF(t, d) = n / (total number of terms in document d)

Here, in the numerator, n is the number of times the term "t" appears in the document "d". Thus, each document and term would have its own TF value. Let us take the same vocabulary we built in the Bag-of-Words model to show how to calculate the TF for Review 2: "This movie is not scary and is slow".

Here, the vocabulary is: 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'.

• Number of words in Review 2 = 8
• TF for the word 'this' = (number of times 'this' appears in Review 2) / (number of terms in Review 2) = 1/8

Similarly: TF('movie') = 1/8, TF('is') = 2/8 = 1/4, TF('very') = 0/8 = 0, TF('scary') = 1/8, TF('and') = 1/8, TF('long') = 0/8 = 0, TF('not') = 1/8, TF('slow') = 1/8, TF('spooky') = 0/8 = 0, TF('good') = 0/8 = 0.

We can calculate the term frequencies for all the terms and all the reviews in this
manner:

Image: Term Frequency (TF)

Inverse Document Frequency (IDF): IDF is a measure of how important a term is. We need the IDF value because computing just the TF alone is not sufficient to understand the importance of words:

IDF(t) = log(number of documents / number of documents containing the term t)

We can calculate the IDF values for all the words in Review 2: IDF('this') = log(number of documents / number of documents containing the word 'this') = log(3/3) = log(1) = 0

Similarly (using base-10 logarithms): IDF('movie') = log(3/3) = 0, IDF('is') = log(3/3) = 0, IDF('not') = log(3/1) = log(3) = 0.48, IDF('scary') = log(3/2) = 0.18, IDF('and') = log(3/3) = 0, IDF('slow') = log(3/1) = 0.48.

We can calculate the IDF values for each word like this. Thus, the IDF values for the
entire vocabulary would be:



Image: IDF values for each word

Hence, we see that words like “is”, “this”, “and”, etc., are reduced to 0 and have little
importance; while words like “scary”, “long”, “good”, etc. are words with more
importance and thus have a higher value. We can now compute the TF-IDF score for
each word in the corpus. Words with a higher score are more important, and those
with a lower score are less important:

We can now calculate the TF-IDF score for every word in Review 2:

TF-IDF (‘this’, Review 2) = TF (‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0

Similarly: TF-IDF('movie', Review 2) = 1/8 * 0 = 0, TF-IDF('is', Review 2) = 1/4 * 0 = 0, TF-IDF('not', Review 2) = 1/8 * 0.48 = 0.06, TF-IDF('scary', Review 2) = 1/8 * 0.18 = 0.023, TF-IDF('and', Review 2) = 1/8 * 0 = 0, TF-IDF('slow', Review 2) = 1/8 * 0.48 = 0.06.

Similarly, we can calculate the TF-IDF scores for all the words with respect to all the
reviews, as shown in the figure below. We have now obtained the TF-IDF scores for
our vocabulary. TF-IDF also gives larger values for less frequent words and is high
when both IDF and TF values are high i.e. the word is rare in all the documents
combined but frequent in a single document.
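If you prefer to compute these scores with a library rather than by hand, scikit-learn's TfidfVectorizer is a common choice; note that its values will differ from the hand calculation above because it uses a smoothed, natural-log IDF and L2-normalizes each vector by default:

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(reviews)         # one TF-IDF vector per review

print(tfidf.get_feature_names_out())
print(scores.toarray().round(2))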

Summarizing: in the BoW model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity. The TF-IDF score is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.



Image: TF-IDF scores
Reference: https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/TF_IDF-matrix.png

Spam/ham classification is one of the classical applications of machine learning. There are two types of data present in this repository: ham (non-spam) and spam data. Furthermore, within the ham data there are easy and hard examples, which means there is some non-spam data that has a very high similarity with spam data. This might pose a difficulty for our system to make a decision. Let us use the concept of Bag of Words to convert the text to numbers and then apply a classification algorithm on it.
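As a hedged sketch of that idea (the tiny inline dataset below is invented purely for illustration; a real project would load the actual spam/ham corpus), a bag-of-words vectorizer can be chained with a Multinomial Naïve Bayes classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages; a real project would load the spam/ham corpus instead
messages = [
    "Win a free prize now, click here",
    "Lowest price guaranteed, limited time offer",
    "Are we still meeting for lunch tomorrow?",
    "Please review the attached project report",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["Free offer, click now", "See you at the meeting"]))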

4.5.1 Lexicons for Sentiment Analysis

Sentiment analysis is a process by which information is analysed through the use of natural language processing (NLP). It refers to finding patterns in data and inferring the emotion of the given piece of information, which can be classified as positive, negative or neutral. Sentiment analysis is widely used to gauge market sentiment, such as the popularity of books, movies, songs and services. Real-world data is much more complex and nuanced; for example, the text may contain sarcasm (where positive words can carry negative meaning or vice versa), shorthand, abbreviations, misspelled words, punctuation, slang and of course emojis. We therefore take the help of lexicons, which can take into consideration the intensity, the subjectivity or objectivity of a word, and also the context. There are several lexicons available; we discuss below some of the popular ones:

Afinn: It is one of the simplest yet most popular lexicons used for sentiment analysis, developed by Finn Årup Nielsen. It contains 3300+ words, each with an associated polarity score. In Python, a ready-made package exposes this lexicon.



Textblob: It is a simple Python library that offers API access to different NLP tasks such as sentiment analysis, spelling correction, etc. The TextBlob sentiment analyzer returns two properties for a given input sentence:

• Polarity is a float that lies in the range [-1, 1]; -1 indicates negative sentiment and +1 indicates positive sentiment. Polarity is related to the emotion of the given text.

• Subjectivity is also a float, which lies in the range [0, 1] (0.0 being very objective and 1.0 being very subjective). Subjective sentences generally refer to personal opinion, emotion, or judgment. A subjective sentence may or may not carry any emotion.

VADER sentiment: Valence Aware Dictionary and Sentiment Reasoner is another popular rule-based sentiment analyzer. It uses a list of lexical features (e.g. words) which are labelled as positive or negative according to their semantic orientation to calculate the text sentiment. VADER returns the probability of a given input sentence being positive, negative, and neutral. It is widely used for analysing sentiment in social media text because it has been specially attuned to sentiments expressed on social media.

For example, for the sentence "The food was great!" it might return:

Positive: 99%, Negative: 1%, Neutral: 0%.

These three probabilities will add up to 100%.
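Assuming the third-party textblob and vaderSentiment packages are installed (both are pip-installable), a minimal usage sketch for the example sentence looks like this:

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "The food was great!"

# TextBlob: polarity in [-1, 1] and subjectivity in [0, 1]
print(TextBlob(text).sentiment)

# VADER: proportions of negative/neutral/positive text plus a compound score
analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores(text))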

So, in this chapter we explored various algorithms which fall under the categories of supervised and unsupervised machine learning. But these algorithms have some drawbacks or limitations. In the next chapter we are going to discuss the subset of machine learning named deep learning and how it helps to overcome the limitations of machine learning. So let us get started with deep learning.



Chapter 5: Building Deep Learning
Models
Learning Outcomes:

• Understand basic key concepts of Deep Learning
• Understand the process of automatic feature extraction using Artificial Neural Networks
• Understand the basics of Convolutional Neural Networks
• Be able to design ANNs and CNNs

5.1 Deep Learning Basics

Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep learning models take learning a step further: rather than relying on hand-engineered features, they can work directly with audio, image and video data for real-time analysis. The data being fed to a deep learning model needs little external intervention; you can feed raw data to the model and receive actionable insights.
Deep-learning architectures such as deep neural networks, deep belief networks,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, machine vision, speech recognition, natural
language processing, audio recognition, social network filtering, machine translation,
bioinformatics, drug design, medical image analysis, material inspection and board
game programs, where they have produced results comparable to and in some
cases surpassing human expert performance.
Artificial neural networks (ANNs) were inspired by information processing and
distributed communication nodes in biological systems. ANNs have various
differences from biological brains. Specifically, neural networks tend to be static and
symbolic, while the biological brain of most living organisms is dynamic (plastic) and
analogue. The adjective "deep" in deep learning refers to the use of multiple layers in
the network. Deep learning is a modern variation which is concerned with an
unbounded number of layers of bounded size, which permits practical application
and optimized implementation, while retaining theoretical universality under mild
conditions. In deep learning the layers are also permitted to be heterogeneous and
to deviate widely from biologically informed connectionist models, for the sake of
efficiency, trainability and understandability. In deep learning, each level learns to
transform its input data into a slightly more abstract and composite representation.
The figure below shows the relationship between AI, ML and DL.

Fig: Relation between Artificial Intelligence, Machine Learning, and Deep Learning

5.2 Concepts of Neural Networks


Artificial neural networks (ANNs), usually simply called neural networks (NNs), are
computing systems vaguely inspired by the biological neural networks that constitute
human brains. An ANN is based on a collection of connected units or nodes called
artificial neurons, which loosely model the neurons in a biological brain. Each
connection, like the synapses in a biological brain, can transmit a signal to other
neurons. An artificial neuron that receives a signal then processes it and can signal
neurons connected to it. The "signal" at a connection is a real number, and the
output of each neuron is computed by some non-linear function of the sum of its
inputs. The connections are called edges. Neurons and edges typically have a
weight that adjusts as learning proceeds. The weight increases or decreases the
strength of the signal at a connection. Neurons may have a threshold such that a
signal is sent only if the aggregate signal crosses that threshold. Typically, neurons
are aggregated into layers. Different layers may perform different transformations on
their inputs. Signals travel from the first layer (the input layer), to the last layer (the
output layer), possibly after traversing the layers multiple times.



Neural networks learn (or are trained) by processing examples, each of which
contains a known "input" and "result," forming probability-weighted associations
between the two, which are stored within the data structure of the net itself. The
training of a neural network from a given example is usually conducted by
determining the difference between the processed output of the network (often a
prediction) and a target output. This is the error. The network then adjusts its
weighted associations according to a learning rule and using the error value.
Successive adjustments will cause the neural network to produce output which is
increasingly similar to the target output. After a sufficient number of these
adjustments the training can be terminated based upon certain criteria.
Such systems "learn" to perform tasks by considering examples, generally without
being programmed with task-specific rules. For example, in image recognition, they
might learn to identify images that contain cats by analyzing example images that
have been manually labeled as "cat" or "no cat" and using the results to identify cats
in other images. They do this without any prior knowledge of cats, for example, that
the cats have fur, tails, whiskers, and cat-like faces. Instead, they automatically
generate identifying characteristics from the examples that they process.

Let us understand the core component of artificial neural networks: the neuron.

5.2.1 Understanding Neurons

Neurons as we know it in the biological concept are the basic functional units of the
nervous system, and they generate electrical signals called action potentials, which
allows them to quickly transmit information over long distances. Almost all the
neurons have three basic functions essential for the normal functioning of all the
cells in the body. These are to:

1. Receive signals (or information) from outside.


2. Process the incoming signals and determine whether or not the information
should be passed along.
3. Communicate signals to target cells which might be other neurons or muscles
or glands.

Artificial neuron also known as perceptron is the basic unit of the neural network. In
simple terms, it is a mathematical function based on a model of biological neurons. It
can also be seen as a simple logic gate with binary outputs. Each artificial neuron
has the following main functions:
1. Takes inputs from the input layer
2. Weighs them separately and sums them up
3. Pass this sum through a nonlinear function to produce output.



Fig: Neural Network
The perceptron(neuron) consists of 4 parts:
1. Input values or One input layer: We pass input values to a neuron using this
layer. It might be something as simple as a collection of array values. It is
similar to a dendrite in biological neurons.
2. Weights and Bias: Weights are a collection of array values which are
multiplied to the respective input values. We then take a sum of all these
multiplied values which is called a weighted sum. Next, we add a bias value to
the weighted sum to get final value for prediction by our neuron.
3. Activation Function: Activation Function decides whether or not a neuron is
fired. It decides which of the two output values should be generated by the
neuron.
4. Output Layer: Output layer gives the final output of a neuron which can then
be passed to other neurons in the network or taken as the final output value.

Let's understand the working of an artificial neuron with an example. Consider a neuron with two inputs (x1, x2) as shown below:

Fig: Single Layer Neuron

1. The values of the two inputs (x1, x2) are 0.8 and 1.2.
2. We have a set of weights (1.0,0.75) corresponding to the two inputs.
3. Then we have a bias with value 0.5 which needs to be added to the sum.
4. The input to the activation function is then calculated as the weighted sum of the inputs plus the bias:

   C = (x1 × w1) + (x2 × w2) + bias = (0.8 × 1.0) + (1.2 × 0.75) + 0.5 = 2.2
Now the combination (C) can be fed to the activation function. Let us first understand the logic of the Rectified Linear Unit (ReLU) activation function, which we are using in our example: ReLU simply outputs its input if the input is positive, and 0 otherwise, i.e. ReLU(x) = max(0, x). In our case, the combination value we got was 2.2, which is greater than 0, so the output value of our activation function will be 2.2. This will be the final output value of our single-layer neuron.
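The same computation can be written in a few lines of NumPy, using the inputs, weights and bias from the example above:

import numpy as np

x = np.array([0.8, 1.2])      # the two inputs (x1, x2)
w = np.array([1.0, 0.75])     # the corresponding weights
b = 0.5                       # the bias

c = np.dot(x, w) + b          # weighted sum plus bias
output = max(0.0, c)          # ReLU activation: max(0, c)

print(round(float(c), 2), round(float(output), 2))   # 2.2 2.2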

Biological Neuron vs. Artificial Neuron

Since we have learnt a bit about both biological and artificial neurons, we can now draw comparisons between the two: the inputs of an artificial neuron play the role of dendrites, the weights play the role of synaptic strengths, the weighted sum and activation function play the role of the cell body, and the output plays the role of the axon.


The activation function of a node defines the output of that node given an input or set
of inputs.

Forward Propagation
Forward propagation is how neural networks make predictions. Input data is “forward
propagated” through the network layer by layer to the final layer which outputs a
prediction.

Backpropagation



In machine learning, backpropagation is a widely used algorithm for training
feedforward neural networks. Generalizations of backpropagation exist for other
artificial neural networks (ANNs), and for functions generally. These classes of
algorithms are all referred to generically as "backpropagation". In fitting a neural
network, backpropagation computes the gradient of the loss function with respect to
the weights of the network for a single input–output example, and does so efficiently,
unlike a naive direct computation of the gradient with respect to each weight
individually.
This efficiency makes it feasible to use gradient methods for training multilayer
networks, updating weights to minimize loss; gradient descent, or variants such as
stochastic gradient descent, are commonly used. The backpropagation algorithm
works by computing the gradient of the loss function with respect to each weight by
the chain rule, computing the gradient one layer at a time, iterating backward from
the last layer to avoid redundant calculations of intermediate terms in the chain rule;
this is an example of dynamic programming.

Fig: Forward and backward propagation in Neural Networks


Reference - Jorge Guerra Pires, CC BY-SA 3.0, via Wikimedia Commons
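To make forward and backward propagation concrete, here is a tiny self-contained sketch (one neuron, one training example, made-up numbers, and a squared-error loss) that runs a forward pass, computes the gradients with the chain rule, and updates the parameters by gradient descent:

import numpy as np

x, y_true = 1.5, 1.0          # hypothetical input and target
w, b, lr = 0.2, 0.0, 0.1      # hypothetical initial weight, bias, learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5):
    # Forward propagation: input -> weighted sum -> activation -> prediction
    z = w * x + b
    y_pred = sigmoid(z)
    loss = 0.5 * (y_pred - y_true) ** 2

    # Backpropagation: chain rule gives dLoss/dw and dLoss/db
    dloss_dy = (y_pred - y_true)
    dy_dz = y_pred * (1 - y_pred)        # derivative of the sigmoid
    grad_w = dloss_dy * dy_dz * x
    grad_b = dloss_dy * dy_dz * 1.0

    # Gradient descent update
    w -= lr * grad_w
    b -= lr * grad_b
    print(f"step {step}: loss={loss:.4f}, w={w:.4f}, b={b:.4f}")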

After understanding concepts like the neuron, bias, weights, forward propagation and backward propagation, which play a vital role in producing the final prediction, let's explore some problems like overfitting and underfitting of a model.

5.2.2 Overfitting in Deep Learning

Overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure.



Underfitting occurs when a statistical model cannot adequately capture the
underlying structure of the data. An under-fitted model is a model where some
parameters or terms that would appear in a correctly specified model are missing.
Under-fitting would occur, for example, when fitting a linear model to non-linear data.
Such a model will tend to have poor predictive performance.

The possibility of over-fitting exists because the criterion used for selecting the model
is not the same as the criterion used to judge the suitability of a model. For example,
a model might be selected by maximizing its performance on some set of training
data, and yet its suitability might be determined by its ability to perform well on
unseen data; then overfitting occurs when a model begins to "memorize" training
data rather than "learning" to generalize from a trend.

As an extreme example, if the number of parameters is the same as or greater than the number of observations, then a model can perfectly predict the training data simply by memorizing the data in its entirety. Such a model, though, will typically fail severely when making predictions.

The potential for overfitting depends not only on the number of parameters and data
but also the conformability of the model structure with the data shape, and the
magnitude of model error compared to the expected level of noise or error in the
data. Even when the fitted model does not have an excessive number of parameters,
it is to be expected that the fitted relationship will appear to perform less well on a
new data set than on the data set used for fitting (a phenomenon sometimes known
as shrinkage). In particular, the value of the coefficient of determination will shrink
relative to the original data.

To lessen the chance of, or amount of, overfitting, several techniques are available
(e.g., model comparison, cross-validation, regularization, early stopping, pruning,
Bayesian priors, or dropout). The basis of some techniques is either (1) to explicitly
penalize overly complex models or (2) to test the model's ability to generalize by
evaluating its performance on a set of data not used for training, which is assumed to
approximate the typical unseen data that a model will encounter.

Fig: Green line shows Overfitting


Reference - Chabacano, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia
Commons



Measures to prevent overfitting

1. Use more data for training, so that the model learns as much of the hidden pattern in the training data as possible and becomes more generalized.

2. Use regularization techniques, for example L1, L2, dropout and early stopping (in the case of neural networks); a minimal Keras sketch of dropout and early stopping is shown below.

3. Tune hyperparameters to avoid overfitting, for example a higher value of K in KNN, tuning C and gamma for SVM, or limiting the depth of a decision tree.

4. Use fewer features, selected manually, with feature selection algorithms, or automatically using L1/L2 regularization.

5. Reduce the complexity of the model, for example by reducing the polynomial degree in the case of polynomial regression and logistic regression.
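As a hedged illustration of point 2 (assuming TensorFlow/Keras is installed; the synthetic data and layer sizes are arbitrary), dropout and early stopping can be added like this:

import numpy as np
import tensorflow as tf

# Synthetic data, invented purely so the example can run end to end
X = np.random.rand(500, 20)
y = (X.sum(axis=1) > 10).astype(int)

# Arbitrary small architecture for illustration
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),              # regularization: randomly drop units while training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: halt training when the validation loss stops improving for 3 epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)

model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)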

5.2.3 Importance of validation split


The definitions of training, validation, and test sets can be fairly nuanced, and the
terms are sometimes inconsistently used. In the deep learning community, “test-time
inference” is often used to refer to evaluation on data in production, which is not the
technical definition of a test set.
One of the most likely culprits for this disconnect between results in development vs
results in production is a poorly chosen validation set (or even worse, no validation
set at all). Depending on the nature of your data, choosing a validation set can be
the most important step. Although sklearn offers a train-test split method, this method takes a random subset of the data, which is a poor choice for many real-world problems.
When creating a machine learning model, the ultimate goal is for it to be accurate on
new data, not just the data you are using to build it. Consider the below example of 3
different models for a set of data:

Fig: Bias Variance Tradeoff


Reference - https://www.quora.com/What-is-an-intuitive-explanation-for-bias-variance-tradeoff



The error for the pictured data points is lowest for the model on the far right (the blue
curve passes through the red points almost perfectly), yet it’s not the best choice.
Why is that? If you were to gather some new data points, they most likely would not
be on that curve in the graph on the right, but would be closer to the curve in the
middle graph. The underlying idea (a code sketch of the splits follows this list) is that:
• the training set is used to train a given model
• the validation set is used to choose between models (for instance, does a random forest or a neural network work better for your problem? do you want a random forest with 40 trees or 50 trees?)
• the test set tells you how you have done. If you have tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.
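To make the three roles concrete, here is a minimal scikit-learn sketch that carves one dataset into train, validation and test portions (the 60/20/20 proportions are an arbitrary choice); as discussed above, a purely random split like this can be a poor choice for time-ordered or grouped data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Carve off 20% as the held-out test set first...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remainder into train and validation (0.25 of 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                  random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20% of the data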
A key property of the validation and test sets is that they must be representative of
the new data you will see in the future. This may sound like an impossible order! By
definition, you haven’t seen this data yet. But there are still a few things you know
about it.
We frequently take insights from validation errors to tune our models. So, we are
implicitly leaking information from our validation data to our model. Advanced
validation methods have obscured the importance of a single-split validation set. K-fold cross-validation is quite robust and probably the current industry standard for model performance validation and parameter tuning. So, if you are using cross-validation techniques in your analysis, you may ignore the validation data split.
The primary objective of test data is to give an unbiased estimate of model accuracy.
It should be used at the very end, and only a couple of times. If you tune your model after looking at the test accuracies, you are technically leaking information and hence cheating.

For the very same reason as above (leakage of information), in spite of the programming convenience, we should not combine the train, validation and test datasets into one common preprocessing flow. Some might argue that, according to the base hypothesis, train, validation and test data come from the same population distribution and hence there should be no harm in combining them for a common preprocessing flow. This is true in idealistic scenarios, but real life is far from it, as you never know when your real-time production system starts receiving evolving data (whose distribution is slightly different from the training data). As a good data scientist, you should strive to build a model flow that is generalizable and performs well (without any additional changes) irrespective of the uncertainties in future data.
We should therefore develop two separate preprocessing pipelines: (A) for the training data and (B) for the validation and test data. However, it should be noted that these pipelines aren't completely independent: you learn the transformation parameters (mean/range/standard deviation) from the training data and use them to transform your validation and test data. Now let us start with the basics of computer vision.

5.3 Computer Vision Basics


What is Image Processing?



It is important to know what exactly image processing is and what is its role in the
bigger picture before diving into its how's. Image Processing is most commonly
termed as 'Digital Image Processing' and the domain in which it is frequently used is
'Computer Vision'. Don't be confused - we are going to talk about both of these
terms and how they connect. Both Image Processing algorithms and Computer
Vision (CV) algorithms take an image as input; however, in image processing, the
output is also an image, whereas in computer vision the output can be some
features/information about the image.

Why do we need it?

The data that we collect or generate is mostly raw data, i.e., it is not fit to be used in
applications directly due to a number of possible reasons. Therefore, we need to
analyse it first, perform the necessary pre-processing, and then use it.
For instance, let's assume that we were trying to build a cat classifier. Our program
would take an image as input and then tell us whether the image contains a cat or
not. The first step for building this classifier would be to collect hundreds of cat
pictures. One common issue is that all the pictures we have scraped would not be of
the same size/dimensions, so before feeding them to the model for training, we
would need to resize/pre-process them all to a standard size. This is just one of
many reasons why image processing is essential to any computer vision application.

What Is Computer Vision?

Computer Vision is an interdisciplinary field that deals with how computers can be
made to gain a high-level understanding from digital images or videos. The idea here
is to automate tasks that the human visual systems can do. So, a computer should
be able to recognize objects such as that of a face of a human being or a lamppost
or even a statue.
It is a multidisciplinary field that could broadly be called a subfield of artificial
intelligence and machine learning, which may involve the use of specialized methods
and make use of general learning algorithms. The goal of computer vision is to
extract useful information from images.

Fig: Overview of the Relationship of Artificial Intelligence and Computer Vision



Many popular computer vision applications involve trying to recognize some high-
level problems; for example:

1. Image Classification
2. Object Detection
3. Optical Character Recognition
4. Image Segmentation

Python Libraries for Computer Vision

1. OpenCV (Open-Source Computer Vision Library: http://opencv.org) is an open-source, BSD-licensed library that includes several hundred computer vision algorithms.
2. Scikit-Image - A collection of algorithms for image processing in Python.
3. SimpleCV - An open-source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.
4. face_recognition - A face recognition library that recognizes and manipulates faces from Python or from the command line.
5. pytesseract - Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and "read" the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR engine.

What are Pixels?

A pixel, or picture element, is the smallest addressable element in an image. Each pixel is a sample of an original image; more samples typically provide a more accurate representation of the original. The intensity of each pixel is variable. Pixels can be used as a unit of measure, such as 2400 pixels per inch, 640 pixels per line, or spaced 10 pixels apart.

Fig: Image enlarged showing the pixels, rendered as small squares

The more pixels used to represent an image, the closer the result can resemble the
original. The number of pixels in an image is sometimes called the resolution, though
resolution has a more specific definition. Pixel counts can be expressed as a single



number, as in a "three-megapixel" digital camera, which has a nominal three million
pixels, or as a pair of numbers, as in a "640 by 480 display", which has 640 pixels
from side to side and 480 from top to bottom (as in a VGA display) and therefore has
a total number of 640 × 480 = 307,200 pixels, or 0.3 megapixels.

Image as Matrix

Images are represented in rows and columns. For example, a digital grayscale image is represented in the computer by a matrix of pixels. Each pixel of such an image is represented by one matrix element, an integer from the set {0, 1, ..., 255}. The numeric values range uniformly from zero (black pixels) to 255 (white pixels).

Fig: Image with Pixels

5.3.1 Types of Images


Images can be divided in 3 different categories
1. Binary Images: It is the simplest type of image. It takes only two values i.e., Black
and White or 0 and 1. The binary image consists of a 1-bit image and it takes only 1
binary digit to represent a pixel. Binary images are mostly used for general shape or
outline.
For Example: Optical Character Recognition (OCR).
Binary images are generated using a threshold operation: when a pixel's value is above the threshold it is turned white ('1'), and when it is below the threshold it is turned black ('0').



Image reference: https://static.javatpoint.com/tutorial/dip/images/binary-images.png
2. Grayscale images: Grayscale images are monochrome images, meaning they carry only intensity information and no colour information. Each pixel holds one of the available grey levels. A normal grayscale image contains 8 bits/pixel, which gives 256 different grey levels. In medical imaging and astronomy, 12 or 16 bits/pixel images are used.

3. Colour images: Colour images are three-band images in which each band contains a different colour, and the actual information is stored in the digital image. Colour images contain grey-level information in each spectral band. They are typically represented as red, green and blue (RGB) images, and each colour image has 24 bits/pixel, i.e. 8 bits for each of the three colour bands (RGB).



After understanding the basics of image and pixels, let us get started with exploring
the concepts of Image processing.

How Does a Computer Read an Image?

Consider the image given below:

The computer reads any image as a range of values between 0 and 255. For any
colour image, there are 3 primary channels – Red, green and blue. How it works is
pretty simple. A matrix is formed for every primary colour and later these matrices
combine to provide a Pixel value for the individual R, G, B colours. Each element of
the matrices provides data pertaining to the intensity of brightness of the
pixel. Consider the following image:



Ref: https://medium.com/edureka/python-opencv-tutorial-5549bd4940e3

As shown, the size of the image here can be calculated as B x A x 3, where 3 is the
number of channels. Note: For a black-white image, there is only one single
channel.

5.3.2 What is OpenCV?

OpenCV is a library designed to solve computer vision problems, with bindings for Python. OpenCV was originally developed in 1999 by Intel and was later supported by Willow Garage. OpenCV supports a wide variety of programming languages such as C++, Python and Java, and it runs on multiple platforms including Windows, Linux, and macOS. OpenCV-Python is essentially a wrapper around the original C++ library to be used with Python. With it, all of the OpenCV array structures get converted to/from NumPy arrays.
This makes it easier to integrate it with other libraries which use NumPy. For
example, libraries such as SciPy and Matplotlib.

Installation
Note: Since we are going to use OpenCV via Python, it is an implicit requirement that you already have Python (version 3) installed on your workstation.

Windows: $ pip install opencv-python
MacOS: $ brew install opencv3 --with-contrib --with-python3
Linux: $ sudo apt-get install libopencv-dev python-opencv

To check if your installation was successful or not, run the following command in
either a Python shell or your command prompt:

import cv2



Now let's work with OpenCV. We'll try to open an image using the OpenCV (Open Source Computer Vision) library. The following types of files are supported by the OpenCV library:

Windows bitmaps – *.bmp, *.dib


JPEG files – *.jpeg, *.jpg
Portable Network Graphics – *.png
WebP – *.webp
Sun rasters – *.sr, *.ras
TIFF files – *.tiff, *.tif

To use the OpenCV library in Python, we need to install these libraries as a prerequisite:

NumPy library: The computer processes images in the form of matrices, for which NumPy is used; OpenCV uses it in the background.
OpenCV-Python: The OpenCV module was previously imported as cv, but the updated version is cv2. It is used to manipulate images and videos.

The steps to read and display an image in OpenCV are:

1. Read an image using the imread() function.
2. Create a GUI window and display the image using the imshow() function.
3. Use the function waitKey(n) to hold the image window on the screen for the specified number of milliseconds; passing 0 makes it wait until the user presses a key, which holds the GUI window on the screen.
4. Delete the image window from memory after displaying it, using the destroyAllWindows() function.

Let's start by reading an image using cv2. To read images, the cv2.imread() method is used. This method loads an image from the specified file. If the image cannot be read (because of a missing file, improper permissions, or an unsupported or invalid format) then this method returns an empty matrix.

Syntax: cv2.imread(path, flag)

Parameters:

path: A string representing the path of the image to be read.


flag: It specifies the way in which image should be read. Its default value is
cv2.IMREAD_COLOR

Return Value: This method returns an image that is loaded from the specified file.

Note: The image should be in the working directory or a full path of image should be
given. By default, OpenCV stores coloured images in BGR (Blue Green and Red)
format.

All three types of flags are described below:



cv2.IMREAD_COLOR: It specifies to load a color image. Any transparency of image
will be neglected. It is the default flag. Alternatively, we can pass integer value 1 for
this flag.
cv2.IMREAD_GRAYSCALE: It specifies to load an image in grayscale mode.
Alternatively, we can pass integer value 0 for this flag.
cv2.IMREAD_UNCHANGED: It specifies to load an image as such including alpha
channel. Alternatively, we can pass integer value -1 for this flag.

Display the image: cv2.imshow() method is used to display an image in a window.


The window automatically fits to the image size.
Syntax: cv2.imshow(window_name, image)
Parameters: window_name: A string representing the name of the window in
which image to be displayed.
image: It is the image that is to be displayed.
Return Value: It doesn’t return anything.

cv2.imwrite() method is used to save an image to any storage device. This will save
the image according to the specified format in current working directory.
Syntax: cv2.imwrite(filename, image)
Parameters: filename: A string representing the file name. The filename must
include image format like .jpg, .png, etc.
image: It is the image that is to be saved.
Return Value: It returns true if image is saved successfully.
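Putting the three functions together, a minimal sketch looks like this; "sample.jpg" is a hypothetical file name, so substitute an image that actually exists in your working directory:

import cv2

img = cv2.imread("sample.jpg", cv2.IMREAD_COLOR)    # hypothetical file name
if img is None:
    raise FileNotFoundError("sample.jpg could not be read")

print(img.shape)                  # (height, width, 3); channels are in BGR order

cv2.imshow("My Image", img)       # display in a window titled "My Image"
cv2.waitKey(0)                    # wait for a key press
cv2.destroyAllWindows()           # close the window

cv2.imwrite("sample_copy.png", img)   # save a copy in PNG format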

Arithmetic Operations on Images using OpenCV

Arithmetic operations like addition and subtraction, and bitwise operations (AND, OR, NOT, XOR), can be applied to the input images. These operations can be helpful in enhancing the properties of the input images. Image arithmetic is important for analysing the input image properties. The operated images can be further used as enhanced input images, and many more operations such as clarifying, thresholding and dilating can then be applied to the image.
Addition of Image:
We can add two images by using function cv2.add(). This directly adds up image
pixels in the two images.
Syntax: cv2.add(img1, img2)
But directly adding the pixels is not always ideal, so we often use cv2.addWeighted() instead. Remember, both images should be of equal size and depth.

Syntax: cv2.addWeighted(img1, wt1, img2, wt2, gammaValue)


Parameters:
img1: First Input Image array (Single-channel, 8-bit or floating-point)
wt1: Weight of the first input image elements to be applied to the final image
img2: Second Input Image array (Single-channel, 8-bit or floating-point)
wt2: Weight of the second input image elements to be applied to the final image
gammaValue: Measurement of light.

Subtraction of Image:
Just like addition, we can subtract the pixel values in two images and merge them
with the help of cv2.subtract(). The images should be of equal size and depth.
Syntax: cv2.subtract(src1, src2)

Bitwise operations are used in image manipulation and for extracting essential parts of an image. The bitwise operations used here are AND, OR, XOR and NOT. Bitwise operations also help with image masking, and image creation can be enabled with the help of these operations. These operations can be helpful in enhancing the properties of the input images.
NOTE: The bitwise operations should be applied to input images of the same dimensions.

Bitwise AND operation on Image:


Bit-wise conjunction of input array elements.
Syntax: cv2.bitwise_and (source1, source2, destination, mask)
Parameters:
source1: First Input Image array (Single-channel, 8-bit or floating-point)
source2: Second Input Image array (Single-channel, 8-bit or floating-point)
destination: Output array (Similar to the dimensions and type of Input image array)
mask: Operation mask, Input / output 8-bit single-channel mask

Bitwise OR operation on Image:


Bit-wise disjunction of input array elements.
Syntax: cv2.bitwise_or (source1, source2, destination, mask)
Parameters are same as before, bitwise and operation.

Bitwise XOR operation on Image:


Bit-wise exclusive-OR operation on input array elements.
Syntax: cv2.bitwise_xor (source1, source2, destination, mask)
Parameters are same as before, bitwise and operation.

Bitwise NOT operation on Image:


Inversion of input array elements.
Syntax: cv2.bitwise_not(source, destination, mask)
Parameters are same as before, bitwise and operation
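The sketch below demonstrates a few of these operations on two synthetic single-channel images built with NumPy, so it can be run without any image files; the shapes and values are arbitrary:

import cv2
import numpy as np

# Two synthetic 300x300 single-channel images: a filled rectangle and a filled circle
img1 = np.zeros((300, 300), dtype=np.uint8)
img2 = np.zeros((300, 300), dtype=np.uint8)
cv2.rectangle(img1, (50, 50), (250, 250), 255, -1)    # thickness -1 fills the shape
cv2.circle(img2, (150, 150), 100, 255, -1)

added      = cv2.add(img1, img2)                       # pixel-wise addition, saturates at 255
blended    = cv2.addWeighted(img1, 0.7, img2, 0.3, 0)  # weighted sum plus gamma value
subtracted = cv2.subtract(img1, img2)
overlap    = cv2.bitwise_and(img1, img2)               # white only where both shapes overlap
union      = cv2.bitwise_or(img1, img2)
inverted   = cv2.bitwise_not(img1)

cv2.imshow("Bitwise AND", overlap)
cv2.waitKey(0)
cv2.destroyAllWindows()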

Drawing functions from OpenCV:


cv2.line() method is used to draw a line on any image.
Syntax: cv2.line(image, start_point, end_point, color, thickness)
Parameters: image: It is the image on which the line is to be drawn.
start_point: It is the starting coordinates of the line. The coordinates are
represented as tuples of two values i.e. (X coordinate value, Y coordinate value).
end_point: It is the ending coordinates of the line. The coordinates are represented
as tuples of two values i.e. (X coordinate value, Y coordinate value).
color: It is the color of the line to be drawn. For RGB, we pass a tuple. e.g.: (255, 0,
0) for blue color.
thickness: It is the thickness of the line in px.
Return Value: It returns an image.

cv2.arrowedLine() method is used to draw arrow segment pointing from the start
point to the end point.
Syntax: cv2.arrowedLine(image, start_point, end_point, color, thickness,
line_type, shift, tipLength)
Parameters: image, start_point, end_point, color, thickness are same as defined in
cv2.line()
line_type: It denotes the type of the line for drawing.
shift: It denotes number of fractional bits in the point coordinates.
tipLength: It denotes the length of the arrow tip in relation to the arrow length.
Return Value: It returns an image.

cv2.circle() method is used to draw a circle on any image. The syntax of cv2.circle()
method is:
Syntax:
cv2.circle(image, center_coordinates, radius, color, thickness)
Parameters:
image: It is the image on which the circle is to be drawn.
center_coordinates: It is the center coordinates of the circle. The coordinates are
represented as tuples of two values i.e. (X coordinate value, Y coordinate value).
radius: It is the radius of the circle.
color: It is the color of the borderline of a circle to be drawn. For BGR, we pass a
tuple. e.g.: (255, 0, 0) for blue color.
thickness: It is the thickness of the circle border line in px. A thickness of -1 px will
fill the circle shape with the specified color.
Return Value: It returns an image.

The steps to draw a circle on an image are:
 Read the image using the imread() function.
 Pass this image to the cv2.circle() method along with the other parameters such as center_coordinates, radius, color and thickness.
 Display the image using the cv2.imshow() method.

cv2.rectangle() method is used to draw a rectangle on any image.

Syntax: cv2.rectangle(image, start_point, end_point, color, thickness)


Parameters:
image: It is the image on which rectangle is to be drawn.
start_point: It is the starting coordinates of rectangle. The coordinates are
represented as tuples of two values i.e. (X coordinate value, Y coordinate value).
end_point: It is the ending coordinates of rectangle. The coordinates are
represented as tuples of two values i.e. (X coordinate value, Y coordinate value).
color: It is the color of the border line of the rectangle to be drawn. For BGR, we pass a
tuple, e.g. (255, 0, 0) for blue.
thickness: It is the thickness of the rectangle border line in px. A thickness of -1 px
will fill the rectangle shape with the specified color.
Return Value: It returns an image.



cv2.putText() method is used to draw a text string on any image.

Syntax: cv2.putText(image, text, org, font, fontScale,


color[,thickness[,lineType[, bottomLeftOrigin]]])
Parameters:
image: It is the image on which text is to be drawn.
text: Text string to be drawn.
org: It is the coordinates of the bottom-left corner of the text string in the image. The
coordinates are represented as tuples of two values i.e. (X coordinate value, Y
coordinate value).
font: It denotes the font type. Some of the font types are FONT_HERSHEY_SIMPLEX,
FONT_HERSHEY_PLAIN, etc.
fontScale: Font scale factor that is multiplied by the font-specific base size.
color: It is the color of the text string to be drawn. For BGR, we pass a tuple, e.g. (255, 0, 0) for blue.
thickness: It is the thickness of the line in px.
lineType: This is an optional parameter. It gives the type of the line to be used.
bottomLeftOrigin: This is an optional parameter. When it is true, the image data
origin is at the bottom-left corner. Otherwise, it is at the top-left corner.
Return Value: It returns an image.
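
As a minimal sketch, the drawing functions above can be combined on a blank canvas; all coordinates and colours below are arbitrary example values:

import cv2
import numpy as np

# Blank black canvas: height 400, width 600, 3 colour channels (BGR)
canvas = np.zeros((400, 600, 3), dtype=np.uint8)

cv2.line(canvas, (20, 20), (580, 20), (255, 0, 0), 2)           # blue line
cv2.arrowedLine(canvas, (20, 60), (580, 60), (0, 255, 255), 2)  # yellow arrow
cv2.circle(canvas, (300, 200), 80, (0, 255, 0), 3)              # green circle outline
cv2.rectangle(canvas, (200, 120), (400, 280), (0, 0, 255), 2)   # red rectangle
cv2.putText(canvas, 'OpenCV drawing demo', (150, 370),
            cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)    # white text

cv2.imshow('Drawing functions', canvas)
cv2.waitKey(0)
cv2.destroyAllWindows()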

Image Processing by OpenCV:


Resizing an image means changing its dimensions, be it the width alone, the height
alone, or both of them. The aspect ratio of the original image can also be preserved
in the resized image. To resize an image, OpenCV provides the cv2.resize() function.
In this section, we shall look at the syntax of cv2.resize() and get hands-on with
examples for most of the scenarios encountered in regular usage.

Syntax – cv2.resize()
The syntax of resize function in OpenCV is

cv2.resize(src, dsize[, dst[, fx[, fy[, interpolation]]]])

where,
src - [required] source/input image
dsize - [required] desired size for the output image
fx - [optional] scale factor along the horizontal axis
fy - [optional] scale factor along the vertical axis
interpolation - [optional] flag that takes one of the following methods:
INTER_NEAREST - a nearest-neighbor interpolation
INTER_LINEAR - a bilinear interpolation (used by default)
INTER_AREA - resampling using pixel area relation. It may be a preferred method for image decimation, as it gives moiré-free results. But when the image is zoomed, it is similar to the INTER_NEAREST method.
INTER_CUBIC - a bicubic interpolation over a 4×4 pixel neighborhood
INTER_LANCZOS4 - a Lanczos interpolation over an 8×8 pixel neighborhood
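
A minimal resize sketch is shown below; the file name input.jpg is only a placeholder:

import cv2

img = cv2.imread('input.jpg')   # placeholder file name

# Resize to a fixed size of 400x300 given as (width, height)
fixed = cv2.resize(img, (400, 300), interpolation=cv2.INTER_LINEAR)

# Resize by scale factors, preserving the aspect ratio (half size)
half = cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)

print(img.shape, fixed.shape, half.shape)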



OpenCV – Edge Detection
Edge Detection is an image processing technique to find boundaries of objects in the
image. We shall learn to find edges of focused objects in an image using Canny
Edge Detection Technique.

Syntax – cv2.Canny()
The syntax of the OpenCV Canny Edge Detection function is
edges = cv2.Canny(image, minVal, maxVal, apertureSize, L2gradient)

where:

image (Mandatory) - Input image array (a single-channel image, typically grayscale); note that cv2.Canny() takes an image array, not a file path
minVal (Mandatory) - Minimum intensity gradient (lower threshold)
maxVal (Mandatory) - Maximum intensity gradient (upper threshold)
apertureSize (Optional) - Aperture size of the Sobel operator used internally (default value: 3)
L2gradient (Optional) (Default value: false) - If true, Canny() uses a much more computationally expensive equation to detect
edges, which provides more accuracy at the cost of resources.
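
A minimal Canny sketch is shown below; the file name is a placeholder, and the threshold values are arbitrary and usually need tuning per image:

import cv2

# Canny expects a single-channel image, so read in grayscale (placeholder name)
img = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)

# Lower/upper intensity-gradient thresholds; apertureSize is the Sobel kernel size
edges = cv2.Canny(img, 100, 200, apertureSize=3, L2gradient=True)

cv2.imshow('Canny edges', edges)
cv2.waitKey(0)
cv2.destroyAllWindows()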

Image blurring using OpenCV


Blurring is a commonly used image processing technique for removing noise. It is
generally used to eliminate high-frequency content, such as noise and edges, from
the image; the edges get blurred when we apply blur to an image.

Advantages of Blurring
The benefits of blurring are the following:
It removes low-intensity edges.
It helps in smoothing the image.
It is useful for hiding details; for example, blurring is required in many cases, such as
when the police intentionally want to hide a victim's face.

OpenCV Averaging
In this technique, the image is convolved with a normalized box filter. It calculates
the average of all the pixels under the kernel area and replaces the central element
with the calculated average. OpenCV provides cv2.blur() and cv2.boxFilter() to
perform this operation. We should define the width and height of the kernel. The
syntax of the cv2.blur() function is the following:
cv2.blur(src, ksize[, dst[, anchor[, borderType]]])
Parameters:
src - It represents the source (input) image.
ksize - It represents the size of the kernel as a (width, height) tuple.
dst - It represents the destination (output) image.
anchor - It denotes the anchor point; the default (-1, -1) means the kernel centre.
borderType - It represents the type of border to be used for the output.
OpenCV Gaussian Blur
Image smoothing is a technique which helps in reducing the noise in images. An
image may contain various types of noise because of the camera sensor. Smoothing
basically eliminates the high-frequency content (noise, edges) from the image, so
edges are slightly blurred in this operation. OpenCV provides the cv2.GaussianBlur()
function to apply smoothing to images. The syntax is the following:

dst = cv2.GaussianBlur(src, ksize, sigmaX[, dst[, sigmaY[, borderType=BORDER_DEFAULT]]])
Parameters:
src - It is the input image.
dst - It is the variable which stores the output image.
ksize - It defines the Gaussian kernel size as [width, height]. Width and height must be
odd (1, 3, 5, ...) and can have different values. If ksize is set to [0, 0], then ksize is
computed from the sigma values.
sigmaX - Kernel standard deviation along the X-axis (horizontal direction).
sigmaY - Kernel standard deviation along the Y-axis (vertical direction). If sigmaY = 0,
then the sigmaX value is taken for sigmaY.
borderType - It specifies how the image boundaries are handled while the kernel is
applied on the image borders. Possible border types are:

cv.BORDER_CONSTANT
cv.BORDER_REPLICATE
cv.BORDER_REFLECT
cv.BORDER_WRAP
cv.BORDER_REFLECT_101
cv.BORDER_TRANSPARENT
cv.BORDER_REFLECT101
cv.BORDER_DEFAULT
cv.BORDER_ISOLATED
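
A minimal smoothing sketch comparing the two blurs discussed above is given below (placeholder file name; the kernel sizes are example values):

import cv2

img = cv2.imread('input.jpg')   # placeholder file name

averaged = cv2.blur(img, (5, 5))            # 5x5 normalized box filter
gaussian = cv2.GaussianBlur(img, (5, 5), 0) # sigmaX=0: sigma computed from ksize

cv2.imshow('Averaging', averaged)
cv2.imshow('Gaussian', gaussian)
cv2.waitKey(0)
cv2.destroyAllWindows()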

We will do practicals on all of the above functions and more.

5.3.4 Edge Detection: Extracting the Edges from An Image



Suppose we are given a task to classify a set of images into Cars, Animals, and
Humans. Here is a bunch of images. Can you differentiate between the objects?
Quite simple, right? Yes, we can easily identify the cars, the animals and the human in
the above pictures. Now let's consider another set of images, as shown below.

Can we still easily classify the images? I believe yes: we can clearly see there are
two cars, two animals and a person. But what is the difference between these two
sets of images? Well, in the second case we removed the colour, the background,
and the other minute details from the pictures. We only have the edges, and you are
still able to identify the objects in the image. So, for any given image, if we are able
to extract only the edges and remove the noise from the image, we would still be able
to classify the image.

What is Edge Detection?


As we know, the computer sees the images in the form of matrices, as shown here.
In this case, we can clearly identify the edges by looking at the numbers or the pixel
values. So, if you look closely in the matrix of the numbers, there is a significant
difference between the pixel values around the edge. The black area in the left
image is represented by low values as shown in the second image. Similarly, the
white area is represented by the larger numbers.
Edge detection is an image processing technique for finding the boundaries of an
object in the given image. So, to summarize, the edges are the part of the image that
represents the boundary or the shape of the object in the image. Also, the pixel
values around an edge show a significant difference, or a sudden change, in the pixel
values. Based on this fact, we can identify which pixels lie on an edge.

How to Extract the Edges from an Image?

Once we have an idea of what the edges are, let's understand how we can extract
them from an image. Say we take a small part of the image. We can compare the
pixel values with the surrounding pixels to find out whether a particular pixel lies on an
edge.
For example, if I take the target pixel 16 and compare the values at its left and right.
Here the values are 10 and 119 respectively. Clearly, there is a significant change in
the pixel values. So, we can say the pixel lies on the edge. Whereas, if you look at
the pixels in the following image. The pixel values to the left and the right of the
selected pixel don’t have a significant difference. Hence, we can say that this pixel is
not at the edge.

Now the question is: do we have to sit and manually compare these values to find the
edges? Obviously not. For this task, we can use a matrix known as a kernel and
perform element-wise multiplication.

Let's say, in the selected portion of the image, we multiply all the numbers in the left
column by -1, all the numbers in the right column by 1, and all the numbers in the
middle column by 0. In simple terms, we are trying to find the difference between the
left and right pixels. When this difference is higher than a threshold, we can conclude
it is an edge. In the above case, the number is 31, which is not a large number. Hence
this pixel does not lie on an edge.
Let’s take another case, here the highlighted pixel is my target.



In this example, the result is 354, which is significantly high. Hence, we can say that
the given pixel lies on an edge.

Filter/kernel

This matrix that we use to calculate the difference is known as the filter or the
kernel. The filter slides over the image to generate a new matrix called a feature
map. The values of the feature map tell us whether a particular pixel lies on an edge
or not. For this example, we are using a 3×3 Prewitt filter, as shown in the above
image. As shown below, when we apply the filter to the given 6×6 image (we have
highlighted the region in purple for our understanding), the output image will contain
((a11*1) + (a12*0) + (a13*(-1)) + (a21*1) + (a22*0) + (a23*(-1)) + (a31*1) + (a32*0) + (a33*(-1)))
in the purple square. We repeat the convolutions horizontally and then vertically
to obtain the output image.

We would continue the above procedure to get the processed image after edge-
detection. But, in the real world, we deal with very high-resolution images for Artificial
Intelligence applications. Hence, we opt for an algorithm to perform the convolutions,
and even use Deep Learning to decide on the best values of the filter.



Note: If you notice, in the above example, with an input of a 6×6 image and after applying
a 3×3 filter, the output image is only 4×4. In general, if the size of the input
image is n×n and the filter size is r×r, the output image size will be (n-r+1)×(n-r+1).
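
To make the note concrete, here is a small NumPy sketch with an arbitrary random 6×6 matrix standing in for an image; it slides a 3×3 Prewitt kernel over the input without padding, and the resulting feature map is 4×4, matching the (n-r+1)×(n-r+1) formula:

import numpy as np

# Arbitrary 6x6 example "image" and a 3x3 Prewitt kernel for vertical edges
# (left column -1, middle column 0, right column 1; flipping the sign convention
# only flips the sign of the response, not where the edges are found)
image = np.random.randint(0, 256, (6, 6)).astype(np.float32)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)

n = image.shape[0]   # 6
r = kernel.shape[0]  # 3
feature_map = np.zeros((n - r + 1, n - r + 1), dtype=np.float32)

# Slide the kernel over every valid position and sum the element-wise products
for i in range(n - r + 1):
    for j in range(n - r + 1):
        patch = image[i:i + r, j:j + r]
        feature_map[i, j] = np.sum(patch * kernel)

print(image.shape, '->', feature_map.shape)   # (6, 6) -> (4, 4)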

Methods of Edge Detection

There are various methods, and the following are some of the most commonly used
methods-
 Prewitt edge detection
 Sobel edge detection
 Laplacian edge detection
 Canny edge detection

Prewitt Edge Detection

This method is a commonly used edge detector, mostly used to detect the horizontal
and vertical edges in images. It uses a pair of 3×3 Prewitt filters, one for each
direction (one of these kernels was used in the sliding-window sketch above).

Sobel Edge Detection: This method uses a filter that gives more emphasis to the
centre of the filter. It is one of the most commonly used edge detectors; it reduces
noise and performs differentiation at the same time, giving an edge response. The
filters used in this method are similar to the Prewitt filters, but with a weight of 2 in the
centre of each kernel.



Laplacian Edge Detection: The Laplacian edge detector differs from the previously
discussed edge detectors. This method uses only one filter (also called a kernel) and,
in a single pass, computes second-order derivatives, which makes it sensitive to
noise. To reduce this sensitivity, Gaussian smoothing is applied to the image before
this method; a short sketch using OpenCV's built-in Sobel and Laplacian operators follows.
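
A minimal sketch of these detectors using OpenCV's built-in operators is given below (the file name is a placeholder; the Gaussian pre-smoothing before the Laplacian follows the note above):

import cv2

img = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder file name

# Sobel: first-order derivatives along x and y with a 3x3 kernel
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

# Laplacian: smooth first to reduce its sensitivity to noise, then take
# the second-order derivative in a single pass
smoothed = cv2.GaussianBlur(img, (3, 3), 0)
laplacian = cv2.Laplacian(smoothed, cv2.CV_64F)

cv2.imshow('Sobel X', cv2.convertScaleAbs(sobel_x))
cv2.imshow('Sobel Y', cv2.convertScaleAbs(sobel_y))
cv2.imshow('Laplacian', cv2.convertScaleAbs(laplacian))
cv2.waitKey(0)
cv2.destroyAllWindows()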

Canny Edge Detection: This is the most commonly used method; it is highly effective
but more complex than many other methods. It is a multi-stage algorithm used to
detect/identify a wide range of edges. Its stages are:
 Convert the image to grayscale
 Reduce noise – as edge detection using derivatives is sensitive to noise, we
reduce it first.
 Calculate the gradient – helps identify the edge intensity and direction.
 Non-maximum suppression – to thin the edges of the image.
 Double threshold – to identify the strong, weak and irrelevant pixels in the
images.
 Hysteresis edge tracking – helps convert the weak pixels into strong ones
only if they have a strong pixel around them.

5.4 Convolutional Neural Networks

What are CNNs?


In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep
neural networks, most commonly applied to analyzing visual imagery. A
convolutional neural network is a particularly effective artificial neural network, and it
presents a unique architecture. Layers are organized in three dimensions: width,
height, and depth. The neurons in one layer connect not to all the neurons in the
next layer, but only to a small region of the layer's neurons. The final output is
reduced to a single vector of probability scores, organized along the depth
dimension. Convolutional neural networks have been used in areas such as video
recognition, image recognition, and recommender systems.
They were inspired by biological processes, in which the connectivity pattern
between neurons resembles the organization of the animal visual cortex.
Individual cortical neurons respond to stimuli only in a restricted region of the visual
field known as the receptive field. The receptive fields of different neurons partially
overlap such that they cover the entire visual field. CNNs use relatively little pre-
processing compared to other image classification algorithms. This means that the
network learns to optimize the filters (or kernels) through automated learning,
whereas in traditional algorithms these filters are hand-engineered. This
independence from prior knowledge and human intervention in feature extraction is a
major advantage.
The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution, which was discussed earlier in this
chapter. Convolutional networks are a specialized type of neural network that use
convolution in place of general matrix multiplication in at least one of their layers.

Building Blocks of Convolutional Neural Networks

Fig: Architecture of a Convolutional Neural Network

There are three types of layers in a Convolutional Neural Network: convolutional layers, pooling layers and fully connected layers.

5.4.1 Convolutional layers

Convolutional layers are the major building blocks used in convolutional neural
networks. A convolution is the simple application of a filter to an input that results in
an activation. Repeated application of the same filter to an input results in a map of
activations called a feature map, indicating the locations and strength of a detected
feature in an input, such as an image.

The innovation of convolutional neural networks is the ability to automatically learn a


large number of filters in parallel specific to a training dataset under the constraints
of a specific predictive modeling problem, such as image classification. The result is
highly specific features that can be detected anywhere on input images. The output
from multiplying the filter with the input array one time is a single value. As the filter
is applied multiple times to the input array, the result is a two-dimensional array of
output values that represent a filtering of the input. As such, the two-dimensional
output array from this operation is called a “feature map”.

Once a feature map is created, we can pass each value in the feature map through a
nonlinearity, such as a ReLU, much like we do for the outputs of a fully connected
layer.



Fig: Example of a Filter Applied to a Two-Dimensional Input to Create a Feature Map

Convolutional Layers in Keras

Fig: Convolutional Layers in Keras

The layer used for convolution of images is the 2D convolution layer (Conv2D). The
most important parameters of the Conv2D layer are:

Filters

The first required Conv2D parameter is the number of filters that the convolutional
layer will learn. Layers early in the network architecture (i.e., closer to the actual input
image) learn fewer convolutional filters while layers deeper in the network (i.e., closer
to the output predictions) will learn more filters. Conv2D layers in between will learn
more filters than the early Conv2D layers but fewer filters than the layers closer to
the output. Max pooling is then used to reduce the spatial dimensions of the output
volume.

Kernel Size

The second required parameter you need to provide to the Keras Conv2D class is
the kernel size, a 2-tuple specifying the width and height of the 2D convolution
window. The kernel size is typically an odd integer. Typical values for the kernel
size include (1, 1), (3, 3), (5, 5) and (7, 7). It is rare to see kernel sizes larger than 7×7.

Strides

The strides parameter is a 2-tuple of integers, specifying the “step” of the convolution
along the x and y axis of the input volume. The strides value defaults to (1, 1),
implying that:

1. A given convolutional filter is applied to the current location of the input volume.
2. The filter takes a 1-pixel step to the right and again the filter is applied to the input
volume.
3. This process is performed until we reach the far-right border of the volume in
which we move our filter one pixel down and then start again from the far left.

Typically, you’ll leave the strides parameter with the default (1, 1) value; however,
you may occasionally increase it to (2, 2) to help reduce the size of the output
volume (since the step size of the filter is larger).

Padding

Fig: Padding in CNN

If the size of the previous layer is not cleanly divisible by the size of the filter's
receptive field and the size of the stride, then it is possible for the receptive field to
attempt to read off the edge of the input feature map. In this case, techniques like
zero padding can be used to invent mock inputs for the receptive field to read. The
padding parameter of the Keras Conv2D class can take one of two values, valid or
same (a short Keras sketch follows the two cases below):
 Padding 'valid': the filter window stays entirely inside the image, so the output is smaller than the input.
 Padding 'same': the output has the same spatial size as the input.
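
A minimal Keras sketch tying the four Conv2D parameters together is given below; the filter counts, kernel sizes and input shape are arbitrary example values:

from tensorflow.keras import layers, models

model = models.Sequential([
    # 32 filters, 3x3 kernel, stride (1, 1); 'same' padding keeps the 28x28 size
    layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1),
                  padding='same', activation='relu', input_shape=(28, 28, 1)),
    # 'valid' padding: the 3x3 window stays inside the image, so 28x28 -> 26x26
    layers.Conv2D(filters=64, kernel_size=(3, 3), padding='valid', activation='relu'),
])
model.summary()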

5.4.2 Pooling Layers

The pooling layers down-sample the previous layer's feature map. Pooling layers
follow a sequence of one or more convolutional layers and are intended to
consolidate the features learned and expressed in the previous layer's feature map.

As such, pooling may be considered a technique to compress or generalize feature
representations and generally reduce the overfitting of the training data by the
model. Pooling layers too have a receptive field, often much smaller than that of the
convolutional layer. Also, the stride, the number of inputs that the receptive field is
moved for each activation, is often equal to the size of the receptive field to avoid any
overlap.

Pooling layers are often very simple, taking the average or the maximum of the input
value in order to create its own feature map. The pooling operation is specified,
rather than learned. Two common functions used in the pooling operation are:
 Average Pooling: Calculate the average value for each patch on the feature
map.
 Maximum Pooling (or Max Pooling): Calculate the maximum value for each
patch of the feature map.

The result of using a pooling layer and creating down sampled or pooled feature
maps is a summarized version of the features detected in the input. They are useful
as small changes in the location of the feature in the input detected by the
convolutional layer will result in a pooled feature map with the feature in the same
location.

Max Pooling Layer

Maximum pooling, or max pooling, is a pooling operation that calculates the


maximum, or largest, value in each patch of each feature map. The results are down
sampled or pooled feature maps that highlight the most present feature in the patch,
not the average presence of the feature in the case of average pooling. This has
been found to work better in practice than average pooling for computer vision tasks
like image classification. In modern CNNs, max pooling is typically used, and often of
size 2×2, with a stride of two. This implies that the input is drastically down-sampled,
further improving the computational efficiency.



Fig: Max Pooling Layer in CNN
5.4.3 Fully Connected Layers

Fully connected layers are the normal flat feed-forward neural network layer.
These layers may have a non-linear activation function or a softmax activation in
order to output probabilities of class predictions.

Fully connected layers are used at the end of the network after feature extraction
and consolidation has been performed by the convolutional and pooling layers. They
are used to create final non-linear combinations of features and for making
predictions by the network. Now that we have been introduced to artificial neural
networks and convolutional neural networks, let us look at some practical CNN code
and a few well-known CNN architectures.

Practical code for CNN:

This project demonstrates training a simple Convolutional Neural Network to
classify CIFAR images. Because it uses the Keras Sequential API, creating and
training the model takes just a few lines of code. The CIFAR10 dataset contains
60,000 color images in 10 classes, with 6,000 images in each class. The dataset is
divided into 50,000 training images and 10,000 testing images. The classes are
mutually exclusive and there is no overlap between them. A sketch of such a model
is given below.
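
Since the practical notebook itself is not reproduced here, the following is a minimal sketch of such a model using the Keras Sequential API; the layer sizes and the number of epochs are illustrative choices, not the exact notebook code:

import tensorflow as tf
from tensorflow.keras import layers, models

# Load CIFAR-10: 50,000 training and 10,000 test colour images, 10 classes
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),                      # fully connected head
    layers.Dense(64, activation='relu'),
    layers.Dense(10)                       # one logit per class
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
model.evaluate(x_test, y_test, verbose=2)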

5.4.4 CNN Architectures

A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer
neural network, designed to recognize visual patterns directly from pixel images
with minimal pre-processing. The ImageNet project is a large visual database
designed for use in visual object recognition software research. The ImageNet
project runs an annual software contest, the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC), where software programs compete to correctly
classify and detect objects and scenes. Here we will talk about the CNN architectures
of some of the top ILSVRC competitors.



VGG16 vs ResNet vs MobileNet
VGGNet (2014)
The runner-up of the ILSVRC 2014 competition, dubbed VGGNet by the community,
was developed by Simonyan and Zisserman. VGGNet (VGG16) consists of 16 weight
layers and is very appealing because of its very uniform architecture. Similar to
AlexNet, it uses only 3x3 convolutions, but with lots of filters. Trained on 4 GPUs for
2–3 weeks, it has been a preferred choice in the community for extracting features
from images. The weight configuration of VGGNet is publicly available and has been
used in many other applications and challenges as a baseline feature extractor.
However, VGGNet consists of 138 million parameters, which can be a bit challenging
to handle.

VGGNet architecture



ResNet
At ILSVRC 2015, the so-called Residual Neural Network (ResNet) by Kaiming He
et al. introduced a novel architecture with “skip connections” and heavy use of batch
normalization. Such skip connections are similar to the gating mechanisms that have
been applied successfully in recurrent neural networks. Thanks to this technique, they
were able to train a network with 152 layers while still having lower complexity than
VGGNet. It achieves a top-5 error rate of 3.57%, which beats human-level
performance on this dataset.

MobileNet
The MobileNet model is designed to be used in mobile applications, and it is
TensorFlow's first mobile computer vision model. MobileNet uses depthwise
separable convolutions, which significantly reduce the number of parameters
compared to a network built with regular convolutions of the same depth. This
results in lightweight deep neural networks.
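
These architectures are available pre-trained on ImageNet through tf.keras.applications; the following minimal sketch loads them as feature extractors (the 224×224 input size is just the conventional default for these models):

from tensorflow.keras.applications import VGG16, ResNet50, MobileNet

# Load each architecture with ImageNet weights, without the final classifier,
# so the convolutional base can be reused as a feature extractor
vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
res = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
mob = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

for model in (vgg, res, mob):
    print(model.name, 'parameters:', model.count_params())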



So, in this chapter we explored the concepts of deep learning like artificial neural
networks, image processing, convolutional neural networks, transfer learning and
optical character recognition. Happy Learning.
