Machine Learning for Beginners
The Ultimate Guide to Learn and Understand Machine Learning: A Practical Approach to Master Machine Learning to Improve and Increase Business Results
Anderson Coen
© Copyright 2019 - Anderson Coen - All rights reserved.
The content contained within this book may not be reproduced, duplicated or
transmitted without direct written permission from the author or the
publisher.
Legal Notice:
This book is copyright protected. It is only for personal use. You cannot
amend, distribute, sell, use, quote or paraphrase any part, or the content
within this book, without the consent of the author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational
and entertainment purposes only. All effort has been executed to present
accurate, up to date, reliable, complete information. No warranties of any
kind are declared or implied. Readers acknowledge that the author is not
engaging in the rendering of legal, financial, medical or professional advice.
The content within this book has been derived from various sources. Please
consult a licensed professional before attempting any techniques outlined in
this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for any losses, direct or indirect, that are incurred as a result of the use of information contained within this document, including, but not limited to, errors, omissions, or inaccuracies.
Description
What is machine learning? Does it really help businesses provide better
services and earn more? How can I improve my business processes to
increase ROI (Return On Investment)? I am unable to focus on important
tasks because I am getting bogged down by menial tasks. If you are
confronted by one or more of these questions, this book is for you!
You should also get this book if you have heard of machine learning but couldn't start because it looks too overwhelming. This book will demonstrate that it's quite easy, and many situations have similar solutions – the only things needed are a bit of conceptual knowledge and some free time.
Learn machine learning and data analysis concepts through practical
examples coded in Python. The following libraries are covered in great detail:
● NumPy
● SciPy
● Sklearn (Scikit-learn)
● Pandas
● TensorFlow
● Matplotlib
There's also a chapter dedicated to an introduction to Raspberry Pi. If you are interested in robotics and automation, this chapter is not to be missed.
Machine learning is being used in industries that are not directly related to
computer science. If you are new to Python, there is also an introductory
chapter that covers the basics, so you can start with data analysis and
machine learning as soon as possible. Buy this book and start your machine
learning journey today!
Table of Contents
Introduction
Chapter 1: Machine Learning
What Is Machine Learning Really?
Types of Machine Learning
More Categories of Machine Learning
Machine-Learning Challenges
Present Day Examples
Chapter 2: Coding with Python
Fundamental Programming
Setup
Initializing PyCharm
Data Structures
Advanced Programming
Revisiting Mathematical Concepts
Vectors
Matrices
Basic Statistics
Probability
Distribution
Data Analysis – Foundation of Machine Learning
Python Libraries for Data Analysis
The Pandas Framework
Machine-Learning Projects
Predicting If a Country’s GDP is Related to Its Better Life Index
Predicting Real Estate Prices for Investment
Chapter 3: Working with Raspberry Pi
What is Raspberry Pi?
Selecting the Model
Hardware Components
First Project
Installation and Setup
Remote Access to Raspberry Pi
Using Camera with Raspberry Pi
Sending and Receiving Signals Using GPIO of the Raspberry Pi
Chapter 4: Working with TensorFlow
The Projects
Project #1: Predicting Student Grade
Project #2: Predicting Student Grade
Project #3: Neural Network using TensorFlow
Chapter 5: Advanced Machine Learning
A Corporate Project
Create a Predictive Chat Bot
Conclusion
Where Will Machine Learning Be in the Next 20 Years?
References
Appendix A: Machine-Learning Concepts
Introduction
Flying without wings, breathing underwater without gills, climbing
mountains, inventing cures for incurable diseases, taming elemental powers,
landing on the moon, the list of human feats is never-ending. But, there is one
thing that has always eluded mankind – predicting the future. The biggest
disadvantage humans have is never knowing when they are going to die.
There is no way to predict future events with certainty. With all the
technological advancements, the biggest questions are still unanswered. Will
human beings ever be able to correctly predict the future?
Let me ask you a question. Did anyone in the past come close to predicting the future? How can you know without access to relevant information? The process of analyzing information accumulated in the past to generate the relevant information required to make a decision is called data analysis. If we take the same process a little further, create trends from existing data, and use them to predict future outcomes, we enter the domain called data science.
And, it’s not just about making decisions; for humans, it’s a matter of
survival and growth. Every child relies heavily on data analysis to perform
experiments, gather environment responses, and analyze them to learn and
adapt. When a mother scolds her child because he threw the bowl across the room, he learns not to do that again. When the same mother hugs and kisses the child because his first word is "ma," the child learns that speaking this word brings happiness, and he will feel happy saying it for the rest of his life.
So, where does machine learning fit into all this? Machines are great at doing repetitive tasks, but a programmer has to write a script to tackle every possible scenario for the machine to work efficiently. What if, just like a child, the machine could learn from its own experiences and experiments? Instead of telling the machine what to do for every possible situation, the machine is given a set of rules. When an unfamiliar situation arises, the machine determines an appropriate response using the rules and gauges the environment's response. If it was favorable, the machine records it and applies the same response in a similar situation later on. If an unfavorable response is met, the machine will use the set of rules again and start over the next time a similar situation happens.
We know that real-life scenarios are much more complex than machine logic. Applying the same logic to similar situations isn't suitable when dealing with complex variables, such as humans. Consider a robot, Ms. X, who is the maid of a household. The first time she meets her charge, little Jana, she tries a joke, and Jana bursts into laughter. Ms. X records this as a successful experiment, and during dinner the same night she cracks the same joke. Jana's older brother Jack comments, "That's the lamest joke ever." Ms. X is now confused. A human being wouldn't be confused in these circumstances; we know a joke might be funny to one person but not to someone else. You might think this is just basic, but in fact, learning this intricacy is one of the things that sets humans at the top of the animal kingdom.
You might have seen a sci-fi movie or read a futuristic novel that talks about a hive mind: a collection of robots or machines that have come together to unlock unlimited processing power, gaining abilities like future prediction and imminent threat detection, then eventually conquering and controlling the human race. It may or may not come true, but one thing is certain: we are moving towards fully independent robots that can thrive on their own.
The field of machine learning has one main goal – to give machines the degree of learning and inference capability humans possess. Maybe when machines reach that level, they will be able to accurately predict the demise of humans and this planet.
I assume you had some basic knowledge about computers and programming languages before you started reading this book. Don't worry; every topic I cover in this book, I will explain right from the basics. I am also assuming you have access to a good, stable internet connection and a reasonably up-to-date computer system in good condition. Data analysis and machine-learning algorithms put a lot of strain on computer resources. My Master's thesis involved improving an already well-optimized algorithm. The data set was huge and the algorithm was very complex. I didn't have access to a computer with a powerful configuration, and I still remember how I had to leave the computer on for 20-22 hours for the program to finish processing. If you want to learn machine learning, start saving and invest in an up-to-date computer system.
Chapter 1: Machine Learning
Watch the Terminator movie series if you haven't. It indirectly portrays robots following the age-old saying, "If at first you don't succeed, try, try again!" The machine network, SkyNet, created by scientists to assist humans in research and development, develops delusions of grandeur. Believing machines to be the dominant race, SkyNet attacks humans. One thing leads to another, and SkyNet has to send robots to the past to neutralize the one human threat that was the cause of SkyNet's destruction in the future. SkyNet sends one robot, who, after a lot of trouble and the hero getting killed in the process, gets sent to the scrapyard by the heroine. SkyNet learns from the encounter and the next time sends an upgraded robot. Yes, machine learning at its finest! Well, without recalling anything else from the movie franchise, here's the moral of the entire series: machines are faster at processing in a given direction, but humans are faster at adapting in fluid situations. No matter what kind of robot SkyNet threw out of the production facility, the human protagonists, albeit taking heavy losses and many times aided by some machines, always adapted and defeated the evil machines. The first installment of this movie franchise was released in the 1980s starring Arnold Schwarzenegger as the time-travelling robot. It was an instant hit.
But why am I talking about a movie franchise in a book about machine learning? The general public in the 1980s had no idea about fully automated robots. In 2019, it looks very likely we are going to see fully automated robots in two or three decades. Many things that are now part of everyone's lives, or look very much possible in the near future, have been inspired by movies. Movie and television franchises like Star Wars and Star Trek have influenced scientific development for decades. Therefore, it's important not to ignore this avenue of inspiration.
Do you remember a movie or TV show where the plot revolves around machine learning?
Machine learning is generally divided into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised Learning
A type of machine learning where the machine learns the system by training on a set of sample data that leads to the desired output. Once the machine is ready, live data is fed in, and the machine predicts future outcomes using the relation it established during training. Supervised learning is a popular type of machine learning among beginners. Most people starting with machine learning find supervised learning easier to understand. Why wouldn't they? This is how humans learn too!
Classification
An example would be an email filtering system that directs emails sent to you to either your inbox or spam folder. This type of supervised learning, where labels are assigned to data, is called classification. The system is trained on existing emails that have been correctly directed, so it can build a relation and automatically direct emails in the future. A system that classifies data into two classes (spam or inbox in this case) is called a binary classification system.
If you have the Outlook.com email service from Microsoft, you might have noticed there's no way to set a spam filter the way most email services provide. Instead, you have to manually move emails from the inbox to the spam folder and vice versa. The email filter of Outlook.com learns by monitoring your actions, and with time, the spam filtering becomes smarter. It might feel like a hassle at the start, but it becomes very convenient in the long term, as you don't have to worry about blocking every single email address or every keyword in the spam filter. Of course, in a complex system like Outlook.com, labeling an email as spam or not is not the only classification task. Systems that classify data into more than two classes are called multi-class classification systems. A better example of multi-class classification is gesture recognition, now widely available in smartphones. During setup, the smartphone trains on user gestures to recognize and differentiate personalized versions of gestures.
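To make this concrete, here is a minimal sketch of binary classification using scikit-learn (covered later in this book); the tiny example emails and their spam/inbox labels are invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#tiny, made-up training set: 1 = spam, 0 = inbox
emails = [
    "win a free prize now",
    "cheap loans click here",
    "meeting agenda for tomorrow",
    "project report attached",
]
labels = [1, 1, 0, 0]

#turn each email into word counts, then train a Naive Bayes classifier on the labeled examples
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)

#classify a new, unseen email
new_email = vectorizer.transform(["win a free meeting"])
print(model.predict(new_email)) #1 means the model labels it as spam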
Regression
The machine learning applied to systems that take some inputs and produce
continuous outputs is called regression analysis. The system is defined by a
number of (explanatory) variables that are considered to affect the system in a
way that it continuously produces a response variable (outcome). The
analysis finds the relation between the explanatory and outcome variables to
predict future system behavior.
A real-life example would be to predict the SAT scores of students by sampling their life and study patterns. Using data gathered from students who have already taken the test to train the prediction model, the learning model will be able to predict which future students have a better chance of passing the test based on their behavioral data.
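As a hedged sketch of the idea, here is a tiny linear regression example in scikit-learn; the study-hours and test-score numbers are invented for demonstration only.
import numpy as np
from sklearn.linear_model import LinearRegression

#made-up training data: weekly study hours (explanatory variable)
#and the resulting test score (continuous outcome variable)
hours = np.array([[2], [4], [6], [8], [10]])
scores = np.array([52, 61, 70, 79, 88])

model = LinearRegression()
model.fit(hours, scores)

#predict the score of a future student who studies 7 hours a week
print(model.predict(np.array([[7]]))) #about 74.5 for this toy data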
Reinforcement Learning
In reinforcement learning, there is a sort of feedback mechanism (or function)
that provides reward signals when the machine-learning model interacts with
a given system. The signals are used to gradually improve the learning
model. We can say this is the automated trial-and-error method of machine
learning. Reinforcement learning differs from supervised learning in the sense that the machine directly interacts with the system, rather than learning from previous user-system interactions to predict future outcomes.
A classic example of reinforcement learning is a strategy game where the computer learns from the moves made by the player and gradually learns to counter every move the player makes.
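To give a feel for the reward-signal loop described above, here is a small sketch of the simplest reinforcement-learning setting, a multi-armed bandit with an epsilon-greedy strategy; the hidden reward probabilities are made up.
import random

#three possible actions, each with a hidden probability of giving a reward
reward_probs = [0.2, 0.5, 0.8] #unknown to the learner
estimates = [0.0, 0.0, 0.0]    #learned estimate of each action's value
counts = [0, 0, 0]

for step in range(1000):
    #explore a random action 10% of the time, otherwise exploit the best estimate so far
    if random.random() < 0.1:
        action = random.randrange(3)
    else:
        action = estimates.index(max(estimates))
    #the environment sends back a reward signal (1 or 0)
    reward = 1 if random.random() < reward_probs[action] else 0
    #gradually improve the estimate for the chosen action
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates) #the estimates drift towards the hidden probabilities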
Unsupervised Learning
In both supervised and reinforcement machine learning analysis, we would
already know the system’s response to a given set of inputs either by
sampling previous interactions or having the machine interact with the
system. But, what if we have to deal with a data set without knowing what
outcome it results in or what feedback it gives when applied to a system? We
have to resort to unsupervised learning analysis, where meaningful
information is extracted from a data set without knowing the outcome or
feedback of the system. In other words, the data is analyzed without
considering any of the system’s attributes to predict future behavior of the
system.
Clustering
A technique of categorizing data into subgroups (clusters) without any prior
knowledge of the data is called clustering. For example, online marketers
would like to segment the audience based upon their interests without
knowing how they would interact or have interacted with their marketing
campaign. This helps marketers to predict the audience response and create
targeted marketing campaigns for different audience segments.
Another practical application of clustering is finding anomalies in a system. For example, a bank might deploy a learning model that tracks all credit card transactions of a client, creating a spending profile. Using this profile, the system would be able to detect outlier transactions that might be a case of theft or fraud.
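Here is a minimal clustering sketch using scikit-learn's KMeans; the two-feature customer data (monthly spend and purchases per month) is invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

#made-up customer data: [monthly spend in dollars, purchases per month]
customers = np.array([
    [200, 3], [220, 4], [250, 5],       #low spenders
    [900, 20], [950, 22], [1000, 25],   #high spenders
])

#group the customers into two clusters without giving the model any labels
model = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = model.fit_predict(customers)

print(segments)                 #cluster id assigned to each customer
print(model.cluster_centers_)   #the center of each segment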
Machine-Learning Challenges
Machine learning is heavily dependent on the training data and the model
(and algorithm) chosen to represent the system. Let's look at what can cause a machine-learning project to fail.
Insufficient Training Data
Humans can work with very small data sets and reach the correct conclusions. You touch a hot iron once and you will probably be cautious around irons for the rest of your life. You might see dark clouds and a fearful storm once in your life, and whenever you see those same clouds again, you will know a storm is coming. This process starts right from birth and keeps going until death. Machines are not so smart, you might say. Considerable data is required to train a machine using a particular prediction algorithm. For example, for a successful image recognition project, you might need access to millions of images as training data.
Various studies in the last two decades have reached the same conclusion: instead of spending resources on better algorithms, spend resources on collecting large data sets because, given the right amount of data, many algorithms will reach the same conclusion.
Similar but Not the Same Training Data
For example, you write a script to predict traffic congestion on major New York highways during rush hour but train it using data collected from Los Angeles. The data might be of a similar type but has no relation to the scenario our script will be used to predict. We have to make sure the training data represents the right scenarios.
Poor Quality or Skewed Data
The training data must be free of, or contain minimal, errors, noise, and outliers. Training a model using erroneous data will affect its performance in making the right predictions. A better approach is to clean the training data first, and in fact, most data scientists spend more time preparing the training data than coding the model and algorithm. Important decisions must be made about how to deal with outliers (discard or normalize) and missing data (ignore or fill in with values such as the mean).
Another aspect of sampling training data is making sure the data isn't skewed or biased. For example, how many times have you seen poll results and accompanying predictions that you know are completely wrong? The most obvious reason is that the participants of the poll aren't representative of the whole population. For example, if 1,000 US residents from Idaho took part in a poll, you can't use this data to make predictions for the entire USA.
Say No to Overfitting
Humans tend to blow up the extent of their experience and apply it indiscriminately. How many times has the restaurant waiter or manager come to your table to ask for feedback and you have replied, "Everything was amazing!" even though the fries weren't good? Consider a restaurant where 100 people order fries every day; the waiters and manager are able to collect feedback from 60 people, and only 5% of those (three people) tell them the fries are bad. The restaurant would discard those three reviews, thinking they are outliers, even though they are indicative of a serious issue. This is how overgeneralization can truly ruin a successful business.
Another example would be you going on a tourist vacation to a foreign
country. The taxi driver you hired from the airport to your hotel charges steep
rates. Due to the frustrating experience, you might form the opinion that all
taxi drivers in that country are manipulators, and you might even avoid taking
a taxi for the rest of your trip. Of course, this is a case of overgeneralized
opinion that doesn’t portray the facts.
Machines can also fall into the same pit; in machine learning, it's called overfitting. Overfitting can be a serious issue because the model will perform exceptionally well on the training data but fail miserably when given live data. The process of reducing the risk of overfitting is called regularization.
Underfitting Is Trouble Too
Underfitting is the opposite of overfitting and is caused by the model not being able to capture the subtleties of the training data. Relating money with happiness is an issue that's too complex to accurately predict; there are countless variables that can affect the relationship. Therefore, a model trying to predict this behavior will never be 100% accurate because it cannot accommodate all those variables.
Bonus: Testing
How do you test a prediction model after training? You wouldn't want to ship it to the users for testing, because if they have paid for the product and the predictions turn out to be wrong, your business will suffer. In some industries this is the norm; for example, gaming studios release alpha and beta versions that are free to play and test, and then predict performance once the game is available for purchase. But in other industries, it's important to test before releasing the product to the public. One of the basic procedures for testing is to divide the available data into two parts, a training set and a test set. This way, you train the model on the training set and then evaluate it on the held-out test set, so you don't need to gather separate testing data after training.
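A minimal sketch of this split, using scikit-learn's train_test_split on a made-up data set, could look like the following.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#made-up data: one explanatory variable and a continuous outcome
X = np.arange(20).reshape(-1, 1)
y = 3 * X.ravel() + 5

#hold back 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

#score the model only on data it never saw during training
print(model.score(X_test, y_test)) #1.0 here, because the toy data is perfectly linear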
Chapter 2: Coding with Python
Fundamental Programming
For this book, I have chosen to use PyCharm and Anaconda with Python 3.7.
PyCharm is a very popular and sophisticated Integrated Development
Environment (IDE) that provides various advantages over IDLE, the standard
Python IDE. Anaconda is a specialized distribution that provides all the
libraries related to data analysis and machine learning by default. It helps
programmers avoid wasting time on finding and installing the required
libraries that don’t come with the standard Python distribution.
Anaconda also provides a package management system “conda” that helps
create virtual environments. Virtual environments are used to install different
setups for different projects on the same computer system. It helps avoid
conflicts between packages. If you are going to only work on one application
for the rest of your life (which is rarely the case), you can skip virtual
environment setup. But, it’s highly recommended.
Setup
I installed PyCharm in the same directory where Python is installed but that’s
optional.
Install Anaconda
Go to the URL https://www.anaconda.com/distribution/, download the
correct installer (I chose the one for Windows – 64 bit), and install once the
download is complete. Note that Anaconda is a huge distribution with hundreds, if not thousands, of Python libraries. The after-install size is a staggering 2.5 gigabytes (GB). During the install, you will see the following screen; make sure you check the box that says "Add Anaconda to my PATH environment variable" even though the installer advises against it. Checking this option will make your life a lot easier when using Anaconda in the future.
The installation can take a considerable amount of time depending upon the
computational power of your computer system. After the installation is done,
we can test if Anaconda is working correctly. Open Windows start menu and
type “Anaconda Prompt”. Type “python” and Python will start.
Bonus: Install TensorFlow
TensorFlow is a great platform to build and train machine-learning models.
Let's install it on our system using a virtual environment. What is a virtual environment again? Virtual environments are used to create isolated spaces that keep the required tools, libraries, etc., separate for different projects. This is very useful for programming languages that can be used in multiple application fields. Imagine yourself working on two projects, one on machine learning and one on developing a 2D game. Creating a virtual environment for each would help keep things organized for both projects.
Let’s open Windows command prompt. We are going to use Anaconda to
create a virtual environment and install TensorFlow in it. The latest version
of TensorFlow 2.0 is still not fully compatible with standard Python 3.7. You
can find so many programmers running into issues trying to install
TensorFlow 2.0 on Python 3.7. This is another reason to use Anaconda to
install TensorFlow; you will avoid any compatibility issues. Enter the
following command in the Windows command prompt to create a virtual
environment and install the platform in it.
conda create -n tf tensorflow
"-n" specifies the name of the environment; "tf" is the name we gave to the virtual environment being created, with TensorFlow installed in it. During the install, you might be asked to proceed with installing some new packages. Just type "y" and hit "ENTER" to continue. Once complete, you can activate this newly created virtual environment with the following command.
conda activate tf
You will now see "(tf)" written before the current path in the command prompt. It tells you that you are working inside the virtual environment.
Bonus: Install Keras
Keras is a high-level API for creating neural networks using Python. While
we are in the same virtual environment “tf,” we can invoke the standard
Python package manager “pip” to install Keras.
pip install keras
Once everything is installed and ready, we can jump to PyCharm to finalize the setup and start coding.
Initializing PyCharm
Search “pycharm” in Windows start menu and you will see an option that
says “JetBrains PyCharm Community Edition...”. Select the option and wait
for a window to pop up. This is what it will look like.
Click on “Create New Project” to start. It will ask for a location and name. I
used “tensorEnv” as project name; you can choose whatever name you like.
Now, the PyCharm interface will load. Right-click where it says “tensorEnv”
in the “Project” and select “New” and then “Python File”. Enter a name; I
used “test.py”.
There are two more steps to complete. The first is setting the right Python interpreter for the project, and the second is adding a run configuration. Click on "File",
and select “Settings”. In the new pop up, select “Project Interpreter” under
“Project: tensorEnv”.
Click on the gear icon on the right side of the screen, select “Add...” You will
see a new window on your screen.
Select “Conda Environment” and then select “Existing environment”. In the
above image, you are seeing the following value in the “Interpreter” field.
D:\LearningNewSkills\Python\Anaconda3\envs\tf\python.exe
For you, that field might be empty. We have to navigate and select the correct
interpreter “pythonw.exe” for our project (notice the ‘w’). Hint: “python.exe”
and “pythonw.exe” are usually in the same folder. Remember “tf” in the
above path is the environment we created using Windows command prompt.
Replace "tf" with the environment name you used when looking for the right folder. Once the right file is selected, click "OK" until you are back on
the main PyCharm interface.
Now, we have to set up the configuration. Click on "Add Configuration" on the top-right side of the screen.
In this script, we have involved the user through the input() method. We have used int() to perform string-to-integer conversion because input() always registers the user input as a string. "i -= 1" is a short version of "i = i - 1".
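Based on the description above, a minimal countdown sketch using input(), int(), and "i -= 1" might look like this (the prompt text and loop logic are my assumptions).
#ask the user for a number, then count down to zero
i = int(input("How many greetings would you like? ")) #input() returns a string, so convert it
while i > 0:
    print("Hello! Greetings remaining:", i)
    i -= 1 #short version of i = i - 1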
Data Structures
Lists and Tuples
Lists are the simplest data structure in Python, but that doesn't mean they're primitive.
aLst = [1, 3, 5.55, 2, 4, 6, 'numbers', 'are', 'not', 'letters']
Strings can also be treated like a data structure. Here are a normal string and a raw string (their definitions are inferred from the output shown below).
aStr = "This is a\nsimple string"
bStr = r"This is a\nraw string. Special characters like \n are taken as simple characters."
print(aStr)
print(bStr)
The above will result in the following output:
This is a
simple string
This is a\nraw string. Special characters like \n are taken as simple characters.
Note how the first string takes two lines because the “\n” is taken as a
newline.
We have already seen that when taking input from the user, the input data is
always taken as a string, and in some cases, we have to perform type
conversion to get the correct data.
inp = input("Enter a number: ")
print(type(inp))
Here is a test of the above code.
Enter a number: 110
<class 'str'>
Even though the input looks like a number, it isn’t, and we have to convert to
integer.
inp = int(inp)
Likewise, we can convert other data types into string.
intNum = 245
cStr = str(intNum)
print(type(cStr), cStr)
The output will be.
<class 'str'> 245
Let’s perform some data structure operations on a string.
# get number of characters in a string
print(len(aStr))
print(len(bStr))
print(aDict["a"])
print(aDict[0])
print(aDict[0][2])
The output is.
A member
[2, 4, 6]
6
Let’s change a value and then a key.
aDict["a"] = "First member"
print(aDict["a"])
aDict["1st"] = aDict.pop("a")
print(aDict)
The output for both print methods is below.
First member
{'b': 'B member', 0: [2, 4, 6], 1: [3, 5, 7], '1st': 'First member'}
Note, the new key-value pair now appears at the end of the dictionary.
We can get all the keys and values as separate lists from a dictionary.
print(aDict.keys())
print(aDict.values())
The output is a little strange-looking, see below.
dict_keys(['a', 'b', 0, 1])
dict_values(['A member', 'B member', [2, 4, 6], [3, 5, 7]])
The "dict_keys" and "dict_values" objects are view objects; they are not independent copies but merely windows onto the original dictionary, so they reflect any later changes to it. You can iterate over these view objects and check membership, but you cannot index into them like lists.
print('B member' in aDict.values())
The output is.
True
Checking if a key is in a dictionary is simpler.
print("a" in aDict)
The output is.
True
If we try to get a value using a key that is not present in a dictionary, an error
occurs.
print(aDict["c"])
The output will be.
KeyError: 'c'
There is a safe method, get(). If the key is not found, the default message is
returned instead of raising an error.
print(aDict.get("c","Key not found"))
The output will be.
Key not found
Sets
A dictionary with only keys and no values is called a set. This is the closest
you will get to mathematical sets in Python. Sets are mutable, iterable,
unordered, and all elements must be unique. Because the sets are unordered,
the set elements don’t have fixed indices; therefore, retrieving elements with
indices and slicing doesn’t work.
aSet = {'data collection', 'data analysis', 'data visualization'}
bSet = set([1, 2, 3])
cSet = {1, 1, 2, 3}
print(aSet)
print(bSet)
print(cSet)
print(2 in bSet)
print((bSet == cSet)) #ignored duplicate element leading to cSet and bSet
having the same elements
The output of the above code will be.
{'data collection', 'data analysis', 'data visualization'}
{1, 2, 3}
{1, 2, 3}
True
True
Let’s see what more we can do with sets.
#set operations
print(bSet & cSet) #set intersection – return common elements
print(aSet | bSet) #set union – join sets
print(bSet - cSet) #set difference – return elements of bSet that are not in
cSet, null set in this case
cSet.add(4) #add a single element, ignored if element already present in the
set
print(cSet)
cSet.remove(4) #raises error if element is not found in the set, differs from
pop() as it doesn't return the removed element
print(cSet)
cSet.discard(4) #ignored if element isn't found in the set instead of raising an
error
nums = [4, 5, 6]
cSet.update(nums) #adds multiple elements to the set at the same time
print(cSet)
cSet.clear() #removes all elements making the set empty
print(cSet)
The output is.
{1, 2, 3}
{'data visualization', 1, 2, 'data collection', 3, 'data analysis'}
set()
{1, 2, 3, 4}
{1, 2, 3}
{1, 2, 3, 4, 5, 6}
set()
Advanced Programming
Python is a high-level language that offers various distinct functionalities
such as comprehensions and iterators. But, first, let’s look at how to create
functions and handle exceptions.
Creating and Using Functions
Functions are code blocks that are explicitly called to perform specific tasks.
They can, optionally, take single or multiple inputs and return a single output.
The best practice is that a function should not have more than two inputs
(also called function arguments). The single output can be a data structure if
more than one value needs to be returned.
Here is a complete script with the function “checkodd()” that takes an array
of integer numbers and decides if each element is odd or even.
def checkodd(num):
    arr = []
    for itr in num:
        if(itr % 2 != 0):
            arr.append(str(itr))
    return arr

def main():
    inputArr = []
    for i in range(1, 11):
        inp = int(input("Enter an integer number: "))
        inputArr.append(inp)
    oddNums = checkodd(inputArr)
    print("The odd numbers in the user inputs are: ", ", ".join(oddNums))

if __name__ == "__main__":
    main()
Here is the result of a test I did.
Enter an integer number: 3
Enter an integer number: 5
Enter an integer number: 4
Enter an integer number: 2
Enter an integer number: 6
Enter an integer number: 9
Enter an integer number: 8
Enter an integer number: 7
Enter an integer number: 12
Enter an integer number: 31
The odd numbers in the user inputs are: 3, 5, 9, 7, 31
Let’s discuss the different aspects of this code.
The definition of a function should appear before its call. It's not a strict requirement but rather a good-practice rule. I had to pass 10 inputs to the function, so I combined all the inputs into a list and passed that to the function. I applied the same logic to the function's return value.
The "if __name__ == "__main__":" line is where the execution of the script starts. You can say it acts like a pointer, because every script you write can be imported into another script as a module. In that case, Python needs to know where the script execution should start.
This is a very simple example where we could have achieved the results without writing a custom function.
Variable Scope
When writing functions, it's important to understand the scope of a variable, that is, where you have declared a variable and where it's available. A variable can be either local or global depending upon where it is declared.
In the above script, "arr" is a local variable (list) of the function checkodd(). If we try to access it from the main() function, we will get an error. In the same way, the list "inputArr" is local to the main() function; if we try to access it from checkodd(), we will get an error. Here's a benefit of using PyCharm: if you try to access a variable that isn't available in a function, PyCharm will flag it with a warning.
Here is the same script with "inputArr" declared as a global variable (checkodd() stays exactly as defined earlier).
inputArr = [] #declared globally, so it is accessible from any function

def main():
    for i in range(1, 11):
        inp = int(input("Enter an integer number: "))
        inputArr.append(inp)
    oddNums = checkodd(inputArr)
    print(inputArr)
    print("The odd numbers in the user inputs are: ", ", ".join(oddNums))

if __name__ == "__main__":
    main()
The output now, with some random inputs below.
Enter an integer number: 45
Enter an integer number: 2
Enter an integer number: 36
Enter an integer number: 785
Enter an integer number: 1
Enter an integer number: 5
Enter an integer number: 2
Enter an integer number: 3
Enter an integer number: 5
Enter an integer number: 7
[45, 2, 36, 785, 1, 5, 2, 3, 5, 7]
The odd numbers in the user inputs are: 45, 785, 1, 5, 3, 5, 7
List Comprehensions
In previous examples, we have used loops to iterate over a list. Python offers
a faster way that’s more powerful than using a loop on a list in many
situations. It is called list comprehension.
Let’s create a function to populate an empty list with prime numbers between
1 and 100.
import math

primeNums = []

def populatePrime(num):
    max_divizr = 1 + math.floor(math.sqrt(num))
    for divizr in range(2, max_divizr):
        if(num % divizr == 0):
            return
    primeNums.append(str(num))
    return

def main():
    for i in range(2, 101):
        populatePrime(i)
    print("The prime numbers between 1 and 100 are: ", ", ".join(primeNums))

if __name__ == "__main__":
    main()
The script imports Python's standard "math" library to implement a faster algorithm for finding prime numbers in a given range. Mathematically, we know that an integer greater than 1 is not a prime number if it's divisible by any integer between 2 and its square root (inclusive). We also know that range() never includes the maximum value given. We have declared a global list "primeNums" to collect the prime numbers. We can make the code more dynamic by adding an input() asking the user to enter the maximum value the prime numbers should be found up to. We have to change the main() function definition as shown below.
def main():
    inp = int(input("Enter an integer number. The prime numbers will be output up until this number: ")) + 1
    for i in range(2, inp):
        populatePrime(i)
    print("The prime numbers between 1 and 100 are: ", ", ".join(primeNums))
Notice the 1 we added to the user input so that the variable "inp" yields the correct (inclusive) maximum range value.
We can use a list comprehension instead of the nested for-if in the populatePrime() function. Here's the updated script.
import math

def populatePrime(num):
    return [str(x) for x in range(2, num) if all(x % y for y in range(2, 1 + math.floor(math.sqrt(x))))]

def main():
    inp = int(input("Enter an integer number. The prime numbers will be output up till this number: "))
    primeNums = populatePrime(inp)
    print("The prime numbers between 1 and 100 are: ", ", ".join(primeNums))

if __name__ == "__main__":
    main()
The output of the above code if user enters 100 is.
Enter an integer number. The prime numbers will be output up till this
number: 100
The prime numbers between 1 and 100 are: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97
As you can see, the results are the same. List comprehensions offer, in most cases, faster execution times, but they are hard to read for people who are new to Python. This is evident from the line of code below.
[str(x) for x in range(2, num) if all(x % y for y in range(2, 1 + math.floor(math.sqrt(x))))]
You don’t have to use list comprehensions if you don’t want to, but when
dealing with large data sets, the execution times do matter.
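If you want to see the speed difference for yourself, here is a small timing sketch using Python's standard timeit module; the exact numbers will vary from machine to machine.
import timeit

loop_version = """
squares = []
for n in range(1000):
    squares.append(n * n)
"""

comprehension_version = """
squares = [n * n for n in range(1000)]
"""

#run each snippet 1,000 times and print the total time taken in seconds
print("for loop:           ", timeit.timeit(loop_version, number=1000))
print("list comprehension: ", timeit.timeit(comprehension_version, number=1000))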
Vectors
An object with magnitude and a specific direction is called a vector. We can visualize a vector as an arrow whose length represents the magnitude and whose arrowhead indicates the direction. The counterpart of vector quantities are scalar quantities, which have only magnitude.
One of the easiest examples of scalar versus vector quantities is speed versus velocity. Say a car is travelling 50 kilometers every hour towards the north. The speed of the car is 50 km/h (about 13.89 m/s), while its velocity is 50 km/h north.
There is one unique vector that doesn't have a particular direction, the zero vector. Represented with a bold 0, the zero vector has zero magnitude and doesn't point in a specific direction.
Why are vectors important? Let’s consider distance vs. displacement.
Without direction, distance covered doesn’t portray the complete story. You
might have traveled 10 kms to the mall and 10 kms back to your home,
travelling a total distance of 20 kms. But, your displacement is zero because
you ended up where you started. It is for this reason that displacement, rather than distance, is used for navigational purposes.
Vector Operations
Moving a vector while keeping its magnitude and direction intact is called
translation. Translation doesn’t change a vector and helps in performing
vector operations.
Addition
To add two vectors a and b, written as a + b, we translate vector b so its tail touches the head of vector a. Now, join the tail of vector a to the head of vector b; this new vector is the result of a + b.
The addition operation obeys the commutative law; it means the order of the operands doesn't matter.
Subtraction
Subtraction is also addition between two vectors, but with the direction of the second vector reversed. So, to calculate b - a, we actually perform b + (-a).
Scalar Multiplication
Scalar multiplication is multiplying a vector by a real number (called a scalar). The direction of the vector remains the same, but the magnitude of the vector is multiplied by the scalar value.
Vector Multiplication
There are two different types of vector multiplication, dot product and cross
product.
The dot product is used to find how much one vector points in the same direction as another vector. The sign of the dot product doesn't depend on the magnitudes of the vectors, only on the angle between them. A positive result means both vectors point in broadly the same direction; zero indicates the vectors are perpendicular; a negative result shows the vectors point in roughly opposite directions. The dot product always returns a single number (a scalar), never a vector.
The cross product of two vectors results in a new vector that is perpendicular to both vectors, with a magnitude equal to the area of the parallelogram created by the two vectors. If the two vectors are parallel, the cross product is zero, and if the vectors are perpendicular, the cross product is at its maximum. The direction of the new vector can be easily found using the right-hand rule: if you point your index and middle fingers in the directions of the two vectors, your upright thumb represents the direction of the new vector.
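As a quick sketch, NumPy (used later in this book) can compute these vector operations directly; the two example vectors below are arbitrary.
import numpy as np

a = np.array([1, 2, 3]) #arbitrary example vectors
b = np.array([4, 5, 6])

print(a + b)          #vector addition
print(b - a)          #vector subtraction
print(3 * a)          #scalar multiplication
print(np.dot(a, b))   #dot product: a single number (32 here)
print(np.cross(a, b)) #cross product: a new vector perpendicular to both a and b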
Matrices
An arrangement of numbers, symbols, or their combinations in rows and columns, forming a rectangular array, is called a matrix. A matrix with 3 rows and 3 columns with all elements equal to one is written as follows.
Matrix Operations
Various scalar operations including addition and subtraction can be
performed on a matrix. A matrix can also be operated along with another
matrix.
“17” and “4” comprise the primary diagonal while “6” and “3” comprise the
secondary diagonal.
Identity Matrix
A square matrix that has all elements on the primary diagonal set to one and all other elements set to zero is called an identity matrix. Here is a 2x2 identity matrix.
Scalar Multiplication
If a matrix is multiplied by a number, it’s called scalar multiplication.
Matrix Addition
We can add two matrices, which requires both matrices to have the same number of rows and columns. Each element of one matrix is added to the element at the corresponding position in the other matrix.
Matrix Multiplication
Two matrices can be multiplied only if the number of columns of the first matrix is equal to the number of rows of the second matrix. The resultant matrix will have the number of rows of the first matrix and the number of columns of the second matrix.
Matrix Determinant
The determinant is a special characteristic of a matrix that helps in various applications, such as finding the inverse of a matrix to solve systems of linear equations. The matrix must be square, i.e., have the same number of rows and columns. The determinant can be zero, in which case the matrix has no inverse. Let's consider the matrix below together with its inverse. If we multiply the inverse with the original matrix, we get an identity matrix.
Bonus: we can write a vector as a single column or single row matrix. This
way, all matrix operations can be applied to vectors.
Matrix Transpose
Transposing a matrix involves switching rows and columns: the rows of the original matrix become the columns of the transposed matrix. Let's consider the following matrix.
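As a quick sketch, NumPy can perform all of these matrix operations; the 2x2 example below reuses the 17, 6, 3, 4 values mentioned earlier, and every call shown is a standard numpy or numpy.linalg function.
import numpy as np

A = np.array([[17, 6],
              [3, 4]]) #primary diagonal 17 and 4, secondary diagonal 6 and 3

print(2 * A)            #scalar multiplication
print(A + np.eye(2))    #addition with the 2x2 identity matrix
print(A @ A)            #matrix multiplication
print(np.linalg.det(A)) #determinant: 17*4 - 6*3 = 50
print(np.linalg.inv(A)) #inverse; A @ inv(A) gives the identity matrix
print(A.T)              #transpose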
Basic Statistics
You cannot perform data analysis and machine learning without basic statistics. In many machine-learning applications, you train your script to analyze an entire data set through a single variable at a time and find potentially related predictors. Statistics that involve a single variate/variable are called univariate statistics.
Univariate statistics are largely based on linear models, which are heavily used in machine learning.
Outlier
Have you ever thought about the concept of superheroes? People among us (the data set) but very different? So different that they fall outside the characteristic boundaries of everyone else. An average untrained American male can lift around 155 lbs. How much can Superman lift? According to the comics, 2 billion tons! If we include Superman in the calculation of how much an average American male can lift, the result will not be truly representative of the general population (data set).
Detecting outliers in a data set, and deciding what to do with them, is very important. In most cases, outliers are discarded because they don't represent the data set correctly; outliers often arise from errors made while gathering data. But that's not always the case. Let's take an example. An industrial oven bakes cookies, and a sensor is used to monitor the oven's temperature. The oven has two doors; Door A is used to feed cookie dough into the oven while Door B is used to take out baked cookies. The sensor records the temperature in Fahrenheit every second. Here are the readings.
349.5, 350.5, 350.1, 350, 150, 348.5, 349, 350, 349.5, 149.25, 351.5, 350,
349.5, 350.1, 149.7
Something strange is happening; every 5 seconds, the temperature drops well
below the desired temperature range to bake cookies (ideally the temperature
should be 350°F). There can be two possibilities:
1. The sensor is malfunctioning.
2. Heat is leaking.
Possibility 1 cannot be ruled out, but the readings are too regular to indicate a malfunction; the temperature drops every 5 seconds like clockwork. Possibility 2 is more likely, but why does it keep happening? Something must be happening every 5 seconds to cause this heat loss. After some careful process investigation, it was found that the temperature drops steeply because both oven doors, A and B, are opened at the same time. Opening the two doors a few seconds apart resolves the issue to a great extent.
Average
The concept of average is to find the centerpoint of a data set. Why? The
centerpoint of a data set tells a lot about the important characteristics of the
data set.
Mean
The most common average is the mean. It is calculated by summing all the elements in a data set and dividing that sum by the number of elements in the data set. Remember when we said "an average untrained American male can lift around 155 lbs"? That was the mean of the weights lifted by a specific number of untrained American males. The mean gives us a general idea of the entire data set.
349.5, 350.5, 350.1, 350, 150, 348.5, 349, 350, 349.5, 149.25, 351.5, 350,
349.5, 350.1, 149.7
These were the temperatures recorded every second in our industrial oven example. How do we calculate the mean? We add all the readings and divide by the number of readings. The mean comes out to 4647.15 / 15 = 309.81°F, even though the temperature should have remained around 350°F for the best results.
Median
Median represents the true center by position in a data set. The data set is first
sorted in ascending order. If the data set has an odd number of values, the
median is the value that has an equal number of values on both sides. Let’s
find the median in our industrial oven example. First, we sort the data set in
ascending order.
149.25, 149.7, 150, 348.5, 349, 349.5, 349.5, 349.5, 350, 350, 350, 350.1,
350.1, 350.5, 351.5
There are 15 recorded values, which is an odd number. The median in this case is 349.5, the value at the 8th position, because it has an equal number of values (7) on each side. If we had an even number of values, we would calculate the mean of the two middle values. For example, here's how many kilometers I drove every day for the last eight days.
101, 215, 52, 87, 64, 33, 459, 16
Let’s sort the data set.
16, 33, 52, 64, 87, 101, 215, 459
The two middle values are 64 and 87. The median in this case is ( 64+87 ) / 2
= 75.5 kms
Mode
The most repeated (common) value in a data set is called the mode. If none of the values repeat, the data set doesn't have a mode. In our industrial oven example, both 349.5 and 350 appear three times each, so this data set actually has two modes (it is bimodal).
Variance
Variance is the measure of how much each value in the data set varies from its mean. It is used with investments and revenues to optimize various assets to achieve a target average. Let's take an example. You have a physical store, an online store, and a kiosk in the mall. Your physical store generates a revenue of $30k a month, your online store generates $24k a month, and the kiosk generates $12k a month. Your average revenue is $22k each month. To see how far each asset's contribution sits from the average revenue, we use the formula:
variance = sum of (asset revenue - mean revenue)² / number of assets
For our stores, the variance is ((30 - 22)² + (24 - 22)² + (12 - 22)²) / 3 = (64 + 4 + 100) / 3 = 56.
The lower the variance, the closer the individual revenue contributions are to the average revenue.
Standard Deviation
We find the standard deviation by taking the square root of the variance. For the above example, the standard deviation is √56 ≈ 7.48. The standard deviation shows the extent to which the data set as a whole deviates from its mean.
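As a quick check of these numbers, here is a small sketch using Python's standard statistics module and the store revenues from the example above.
import statistics

revenues = [30, 24, 12] #monthly revenue in $k: physical store, online store, kiosk

print(statistics.mean(revenues))      #22
print(statistics.pvariance(revenues)) #population variance: 56
print(statistics.pstdev(revenues))    #population standard deviation: about 7.48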
Probability
Distribution
Distribution can be of many types, but for this book, whenever we talk about
distribution, we are talking about probability distribution unless explicitly
stated otherwise. Let’s consider an example.
We are going to flip a coin three times, taken as one event, and record the outcome. We will repeat the process until we have all the possible unique outcomes. Here's a list of all unique outcomes.
HHH
HHT
HTH
THH
HTT
THT
TTH
TTT
What is the probability of finding exactly one head? We look at the possible outcomes and count all the outcomes that have exactly one head. Out of the total eight possible outcomes, we have three possibilities of getting exactly one head, which makes the probability ⅜. What's the probability of getting zero heads? It is ⅛. Three heads? Also ⅛. Two heads? The same as one head, which is ⅜.
This is a simple example with a very small set of total possible outcomes. It
is easier to understand the different probabilities as fractions. But, consider a
data set that has millions of possible outcomes, which is the case in most real-
life scenarios. The fractions become too complex to understand. To make the probability of different possible outcomes easier to grasp, probability distributions are created; a visual representation of the distribution makes more sense than looking at raw fractions.
The events in our coin example are random yet discrete, because only known values can occur. The corresponding probability distribution is referred to as discrete. If we denote the event as X, the following chart represents the discrete probability distribution of the random variable X.
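As a sketch, such a chart can be drawn with matplotlib (used later in this chapter) from the probabilities we just worked out.
import matplotlib.pyplot as plt

#number of heads in three coin flips and the probability of each outcome
heads = [0, 1, 2, 3]
probabilities = [1/8, 3/8, 3/8, 1/8]

plt.bar(heads, probabilities)
plt.xlabel("Number of heads (X)")
plt.ylabel("Probability")
plt.title("Discrete Probability Distribution of X")
plt.show()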
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]]
We can also create arrays with elements having a specific sequence.
seqArr = np.arange(0, 257, 32) #create a sequence by specifying the space (step) between elements
print(seqArr)
#the two arrays used below (definitions inferred from the outputs that follow)
aArr = np.array([[2, 4, 6], [6, 4, 2]])
bArr = np.array([[3, 5, 7], [7, 5, 3]])
#addition
print(np.add(aArr, bArr)) #also aArr + bArr
#subtraction
print(np.subtract(aArr, bArr)) #also aArr - bArr
#multiplication - element-wise, not matrix multiplication
print(np.multiply(aArr, bArr)) #also aArr * bArr
#division
print(np.divide(aArr, bArr)) #also aArr / bArr
The output of the above script is:
[[ 5 9 13]
[13 9 5]]
[[-1 -1 -1]
[-1 -1 -1]]
[[ 6 20 42]
[42 20 6]]
[[0.66666667 0.8 0.85714286]
[0.85714286 0.8 0.66666667]]
We can also perform dot and cross products on arrays.
cArr = np.array([
[2, 4, 6],
[6, 4, 2],
[8, 9, 1]
])
dArr = np.array([
[3, 5, 7],
[7, 5, 3],
[8, 9, 1]
])
#dot product
print(np.dot(cArr, dArr))
#cross product
print(np.cross(cArr, dArr))
The outputs are:
[[82 84 32]
[62 68 56]
[95 94 84]]
[[-2 4 -2]
[ 2 -4 2]
[ 0 0 0]]
Transposing an array is very simple.
print(dArr.T) #transpose
The output is:
[[3 7 8]
[5 5 9]
[7 3 1]]
In many situations, you will have to sum the elements present in an array.
#sum all elements in an array
print(np.sum(dArr))
We can now start plotting charts. Here’s an example of a simple line chart.
import matplotlib.pyplot as plt
import numpy as np
arr1 = np.array([2, 4, 6])
arr2 = np.array([3, 5, 7])
plt.plot(arr1, arr2)
plt.show()
The above script will output a straight line chart on the screen like this.
The plot window has a few buttons on the bottom left side. You can reset the view with the "Home" button, cycle through different views using the "Arrow" buttons, and move the plot around using the "Crosshair." The "Zoom" button will let you zoom in (left click and drag) or out (right click and drag) of the plot view. The "Configuration" button gives you more options to change the plot view and looks like the image below. The last button, "Save," lets you save the plot as an image.
We can customize the plots by adding a title, labels, and a legend to our plot
view. Here is the updated script.
import matplotlib.pyplot as plt
import numpy as np

arr1 = np.array([2, 4, 6])
arr2 = np.array([3, 5, 7])

plt.plot(arr1, arr2, label="Line") #the label text here is a placeholder; any label will do
plt.title("Line Chart") #placeholder title
plt.xlabel("x values")  #placeholder axis labels
plt.ylabel("y values")
plt.legend()
plt.show()
The output now is:
Next, let’s work with bar charts and histograms. Many people think both are
the same, but our example will show how they differ. Here’s an example of
bar charts.
import matplotlib.pyplot as plt
import numpy as np
plt.legend()
plt.show()
The output is:
We can see that bar charts can be easily used to visually compare two
different data sets.
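Here is a self-contained sketch of a grouped bar chart comparing two data sets side by side; all of the sales numbers are hypothetical.
import matplotlib.pyplot as plt
import numpy as np

days = np.array([1, 2, 3, 4, 5])
store_a = np.array([12, 15, 9, 17, 14]) #hypothetical daily sales for store A
store_b = np.array([10, 11, 13, 9, 16]) #hypothetical daily sales for store B

#shift the two sets of bars slightly so they sit next to each other for each day
plt.bar(days - 0.2, store_a, width=0.4, label="Store A")
plt.bar(days + 0.2, store_b, width=0.4, label="Store B")
plt.xlabel("Day")
plt.ylabel("Sales")
plt.title("Bar Chart")
plt.legend()
plt.show()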
A histogram is more often used to visualize the distribution of a data set over
some specified period (usually time period). Here’s a code to visualize the
ages of a population sample in age groups of 10 years.
import matplotlib.pyplot as plt
import numpy as np
arr1 = np.array([16, 55, 21, 45, 85, 57, 32, 66, 94, 12, 25, 29, 30, 32, 45, 16,
12, 74, 63, 18])
rangeBin = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.hist(arr1, rangeBin, histtype='bar', rwidth=0.9, label="Age Groups")
plt.legend()
plt.show()
Here is the output chart.
plt.title("Scatter Plot")
plt.legend()
plt.show()
The output of the above script will be:
In a scatter plot, we can choose the marker type and size. Here, we have used a star; for a full list of available marker types, visit
https://matplotlib.org/3.1.1/api/markers_api.html.
Another chart you can create with the Python matplotlib library is the stack plot, which is used to show how several assets contribute towards a certain total. It's much easier to understand if I show you an example.
import matplotlib.pyplot as plt
import numpy as np
plt.title("Stack Plot")
plt.legend()
plt.show()
The output is:
We have used color codes for the stack plot, ‘m’ for magenta, ‘c’ for cyan, ‘r’
for red, and ‘k’ for black.
Let’s move onto pie charts. Here’s our activities example using a pie chart.
import matplotlib.pyplot as plt
import numpy as np
slices = np.array([8, 0.75, 0.5, 14.75]) #taking last values of all activity arrays
to create an array detailing time spent on Sunday in different activities
plt.show()
The output is:
Pretty chill Sunday! Pie charts are a great way of showing how different things add up to complete the whole picture. We have used the "explode" argument to make "Eating" stand out from the rest. The "autopct" argument can take special formatting to show the percentage of my typical Sunday each activity takes up.
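Here is a self-contained sketch of the same pie chart; the slice values come from the code above, while every activity label except "Eating" is a placeholder guess.
import matplotlib.pyplot as plt
import numpy as np

#hours spent on Sunday; only "Eating" is named in the text, the other labels are placeholders
slices = np.array([8, 0.75, 0.5, 14.75])
activities = ["Working", "Eating", "Exercising", "Sleeping and Leisure"]

#pull the "Eating" slice out slightly and print each slice's percentage of the day
plt.pie(slices, labels=activities, explode=(0, 0.1, 0, 0), autopct="%1.1f%%")
plt.title("A Typical Sunday")
plt.show()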
Data Acquisition
Up till now, we have declared arbitrary arrays to showcase different features of the numpy and matplotlib libraries. In real-life tasks, you will need to access data from some source. The data set might be in a comma-separated values (.csv) file, which is sometimes also saved as a simple text (.txt) file, or in another format like an Excel (.xlsx) file, or in XML or JSON format. Also, the source file might be located on your computer or on the internet.
How to acquire data in such cases? Let’s look at a few examples.
Get data from local text file
Navigate to your PyCharm project directory (you can see the path of the project folder beside the project name in the Project pane of PyCharm) and create a file "example.txt". Add the following sample data to it, then save and close the file.
1,45
2,23
3,12
4,54
5,74
6,21
7,5
8,68
9,24
10,8
There are many libraries that can be used to import data from a text/csv file.
We are going to show two approaches, using the standard csv library and
then the numpy library.
Here’s the script using standard csv library to read data and plot the data
using matplotlib.
import csv
import matplotlib.pyplot as plt

x = []
y = []
file = open("example.txt", "r")
reader = csv.reader(file, delimiter=",") #the reading loop below is an assumed reconstruction
for row in reader:
    x.append(int(row[0]))
    y.append(int(row[1]))
plt.plot(x, y)
plt.show()
file.close()
The output is:
Here is the numpy version.
x, y = np.loadtxt("example.txt", delimiter=",", unpack=True) #assumes numpy is imported as np, as in earlier scripts
plt.plot(x, y)
plt.show()
The output is the same. Yes, it’s that simple with numpy! On top of that, it’s
faster. The “unpack” argument tells numpy to map the read data to the given
variables.
Get data from internet
Most data accumulators provide APIs to enable programmers to
communicate with their data. We are going to connect with Quandl using
their API to download stock prices of certain companies and see the price
trend.
Our code will be divided into three parts (each part will be a separate
function).
import urllib.request
import matplotlib.pyplot as plt
from dateutil.parser import parse #imports assumed from the calls used below

api_key = "YOUR_API_KEY" #replace with the API key Quandl assigns you (see below)

def get_data(comp):
    stockPriceURL = "https://www.quandl.com/api/v3/datasets/WIKI/" + comp + ".csv?api_key=" + api_key
    getsource = urllib.request.urlopen(stockPriceURL).read().decode()
    return getsource

def filter_data(rawData):
    stockDate = []
    stockClose = []
    dataSplit = rawData.split('\n')
    for row in dataSplit[1:]:
        if row != "":
            elems = row.split(',')
            stockDate.append(parse(elems[0]))
            stockClose.append(float(elems[4]))
    return stockDate, stockClose

def plot_data(final_data):
    plt.plot(final_data[0], final_data[1])
    plt.xlabel("Year")
    plt.ylabel("Closing Stock Price")
    plt.title("Closing Stock Price Trend for TESLA (TSLA)\nfor the last ten yrs")
    plt.show()

def main():
    #company = input("Enter the company symbol you want to see price trend of: ") #enable this if user input is required
    company = "TSLA"
    rawData = get_data(company)
    final_data = filter_data(rawData)
    plot_data(final_data)

if __name__ == "__main__":
    main()
The output chart will be like this.
Let’s talk a bit about our code.
To use this code, we will have to first sign up with Quandl to use their API.
The sign-up is completely free, so complete it by going to this link:
https://www.quandl.com/sign-up-modal?defaultModal=showSignUp
and activate the account. You will be assigned an API key that you can place
in the above code to use the Quandl API services.
The data provided by Quandl has a header row. We have ignored the header
row by slicing the list containing the parsed data (dataSplit[1:]). In the above
script, I have focused on gathering the dates (elems[0]) and the respective
closing prices (elems[4]) to plot the chart. The data from the .csv file is read as
strings. We have to convert it to the correct format (date and floating point)
before trying to plot it using matplotlib. The date conversion is done through
parse() from the dateutil.parser library. The conversion of closing prices from
string to floating point is straightforward.
Another thing to note in the script is how we return two lists (stockDate and
stockClose) from the function filter_data(); they are collected by the single
variable final_data as a tuple of the two lists.
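Here's a tiny sketch of that behaviour, with made-up values, just to make the tuple packing explicit.
def filter_example():
    dates = ['2019-01-01', '2019-01-02']
    closes = [310.1, 334.9]
    return dates, closes   # returning two values packs them into a tuple

final = filter_example()
print(final[0])   # the list of dates
print(final[1])   # the list of closing prices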
This is not the only way to acquire data from Quandl. We could have used
numpy arrays instead of lists in our code, which would have made things a
little faster. Quandl also offers a special Python library, "quandl", that you can
install using pip and use to access data from Quandl.
Let's set up the quandl library and see how we can perform the actions of the
last script using the new library. Switch to "Terminal" in PyCharm. Activate
the "tf" virtual environment if it's not already active, and then use the
following command to install quandl.
pip install quandl
Here is the code to display closing price trend for Apple (AAPL).
import numpy as np
import matplotlib.pyplot as plt
import quandl

api_key = "YOUR_API_KEY"  # place your Quandl API key here

def get_data(comp):
    getsource = quandl.get("EOD/"+comp+".4", returns="numpy", authtoken=api_key, start_date="2009-01-01", end_date="2019-01-01")
    return getsource

def filter_data(rawData):
    dates = np.array([i[0] for i in rawData])
    closep = np.array([i[1] for i in rawData])
    return dates, closep

def plot_data(final_data):
    #plt.plot(final_data[0], final_data[1], label="Price")
    plt.plot_date(final_data[0], final_data[1], '-', label="Price", color="gold")
    plt.xlabel("Half-Year")
    plt.ylabel("Closing Stock Price")
    plt.xticks(rotation=20)
    plt.title("Closing Stock Price Trend for APPLE (AAPL)\nusing all available data")
    plt.subplots_adjust(bottom=0.15)
    plt.legend()
    plt.show()

def main():
    #company = input("Enter the company symbol you want to see price trend of: ") # enable this if user input is required
    company = "AAPL"
    rawData = get_data(company)
    final_data = filter_data(rawData)
    plot_data(final_data)

if __name__ == "__main__":
    main()
The above will output the following chart.
We can see from the trend that in mid-2014 the stock price crashed for
Apple. This wasn't because the company went out of business; Apple
introduced a stock split, which made individual shares cheaper and opened
the door to smaller investors. Other than that, Apple's stock price has a very
distinctly repetitive trend. Any experienced stock trader can look at this trend
and find opportunities to invest at the right time and make profits.
Let's talk about our script. TSLA stocks are not available on the free Quandl
account, so we will work with the available data for AAPL stocks. Quandl
offers a lot of stock data, but we are only interested in the End of Day closing
stock price, which is the fifth column of the Quandl data. In the get_data()
function, we import the stock data as a numpy recarray. Numpy recarrays are
different from normal numpy arrays because they can contain different data
types wrapped in a tuple. The start and end dates are used to set a range for
the data, but in our example both are redundant because we have a very
limited set of data available on our free Quandl account.
We actually don't need the filter_data() function because the data is ready for
plotting. For the sake of showing a new technique, we have used this function
to create standard numpy arrays from the recarray Quandl generated. The
function returns both arrays, which are collected as a tuple by the main()
function. Note that this is not good practice when dealing with a huge data set.
Can you think of a better way to pass data from filter_data() to main()?
In the plot_data() function, we have shown two different plotting methods:
one we already know, plt.plot(), and the other, plt.plot_date(). The plot_date()
method is very useful for plotting two quantities when one of them is time.
We have also used some advanced plot formatting to make the plot look
nicer. The labels on the x axis are rotated to avoid overlap, and the plot
margin at the bottom is increased so nothing overflows out of the visible
window. One more thing: plot_date() usually plots a scatter plot instead of a
line chart, so we have added the '-' argument to force plot_date() to create a
line chart.
You might see the following warning because we didn't explicitly register the
date converters. We can ignore this warning because our script is correctly
plotting the data.
To register the converters:
>>> from pandas.plotting import
register_matplotlib_converters
>>> register_matplotlib_converters()
warnings.warn(msg, FutureWarning)
The quandl library is a very powerful tool if you want to write scripts for
stocks and other finance applications. The free account has some API call
limitations, but you can download the entire data set of any symbol (a
company traded in a stock market) to your computer using the API URL
method and then keep practicing different analysis and machine-learning
techniques with the offline data set.
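For example, here's a quick sketch of downloading a complete data set once and saving it locally with the API URL method shown earlier; the WIKI/AAPL symbol and the output file name are just placeholders.
import urllib.request

api_key = "YOUR_API_KEY"  # your Quandl API key goes here
url = "https://www.quandl.com/api/v3/datasets/WIKI/AAPL.csv?api_key=" + api_key
urllib.request.urlretrieve(url, "AAPL_offline.csv")  # save the whole data set to a local file
print("Done. Practice on AAPL_offline.csv from now on.")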
A project
We have learned enough to consider a real-world scenario and attempt to
solve the issue(s) using the tools we have learned so far. Here's a scenario.
We have been contracted by a potato chips manufacturing facility that makes
200g packets. We have to find out if a randomly picked chips packet from the
assembly line has a net weight within the accepted tolerance range of ±10g.
We ask the facility to randomly sample a number of packets and record the
mean weight and standard deviation. The values come out as 197.9g and 5.5g
respectively.
We start by assuming the packet weights follow a normal distribution. We
have a complex, randomly sampled system, so it's impossible to ascertain
the probability of a specific event through the usual probability
determination methods. To find the probability in such a complex system,
we create the probability distribution curve and find the area under the curve
between the tolerance bounds.
In our code, we use a library we haven't used so far, the "scipy" library. The
erf() function from the math library is used to evaluate the error function of a
normal distribution. We also use subplot() to draw the distribution curve.
# Import required libraries
from math import erf, sqrt
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
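Here is a minimal sketch of the probability calculation itself, assuming tolerance bounds of 190g and 210g around the 200g target and the sampled mean and standard deviation given above. Running it reproduces the results printed below.
from math import erf, sqrt

mean, std = 197.9, 5.5      # sampled mean weight and standard deviation
low, high = 190.0, 210.0    # 200 g target with a ±10 g tolerance

# signed area under the normal curve between the mean and each bound
lower = 0.5 * erf((low - mean) / (std * sqrt(2)))
upper = 0.5 * erf((high - mean) / (std * sqrt(2)))

pIn = upper - lower
pOut = 1 - pIn

print("===Calculation Results===")
print("Lower Bound = %.4f" % lower)
print("Upper Bound = %.4f" % upper)
print("Probability of finding a chips packet with weight within the tolerance range (pIn) = %.1f%%" % (pIn * 100))
print("Probability of finding a chips packet with weight outside the tolerance range (pOut) = %.1f%%" % (pOut * 100))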
===Calculation Results===
Lower Bound = -0.4246
Upper Bound = 0.4861
Probability of finding a chips packet with weight within the tolerance range
(pIn) = 91.1%
Probability of finding a chips packet with weight outside the tolerance range
(pOut) = 8.9%
There is a 91.1% chance that a chips packet picked from the production line
will have a weight within the tolerance range. Is that a high enough
percentage? It depends upon the production facility and the standards it has to
keep up with. Note that there are industrial standards and government
inspections to make sure the final product matches the claims made, and the
weight of the packet is one such important parameter.
import numpy as np
import pandas as pd
from pandas import Series
from datetime import datetime, timedelta

aSer = Series([1, 2, 3, 4]) # a series with the default integer index (representative definition)
bSer = Series([10, 20, 30], index=['frst', 'scnd', 'thrd']) # a series with custom keys (representative definition)
print(aSer)
print(aSer.values) # get all values as a list
print(aSer.index) # get all indices as a list
print(bSer)
print(bSer['frst']) # get element using key
print(bSer[0]) # get element using index
aDict = {
    'a': 1,
    'b': 2,
    'c': 0
}
cSer = Series(aDict) # creating a series from a dictionary
print(cSer)
dSer = Series(aDict) # another series, used to demonstrate naming (representative definition)
print(dSer)
dSer.name = 'Data'
dSer.index.name = 'Index'
print(dSer)
now = datetime.now()
print(now) # the returned value has year, month, day, hour, minutes, seconds, and microseconds
nexTime = now + timedelta(days=1) # datetime objects support arithmetic with timedeltas
print(nexTime)
ranTime = '1990-03-07'
tSer = Series(np.random.randn(2), index=pd.to_datetime(['1958-01-31', '1969-07-20'])) # a time series indexed by dates
tSer2 = Series(np.random.randn(3), index=pd.to_datetime(['1958-01-31', '1969-07-20', None])) # None becomes NaT (not a time)
print(tSer.index)
print(tSer2.index)
print(tSer)
print(tSer2)
print(tSer + tSer2) # performing arithmetic operations on time series is easy
print(tSer * tSer2) # notice how NaT becomes the first element
The outputs are:
DatetimeIndex(['1958-01-31', '1969-07-20'], dtype='datetime64[ns]',
freq=None)
DatetimeIndex(['1958-01-31', '1969-07-20', 'NaT'], dtype='datetime64[ns]',
freq=None)
1958-01-31 2.520516
1969-07-20 0.305652
dtype: float64
1958-01-31 -0.720893
1969-07-20 1.213476
NaT -0.229652
dtype: float64
NaT NaN
1958-01-31 1.799623
1969-07-20 1.519128
dtype: float64
NaT NaN
1958-01-31 -1.817023
1969-07-20 0.370901
dtype: float64
We can also perform index slicing on a time series. Time series can also
contain duplicate indices. Here's a script that showcases both concepts.
longtSer = Series(np.random.randn(500), index=pd.date_range('7/3/2016', periods=500)) # create a long time series and populate it with random numbers
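Building on longtSer, here is a short, self-contained sketch of both ideas: slicing by date and handling duplicate index labels. The dates and values are arbitrary.
import numpy as np
import pandas as pd
from pandas import Series

longtSer = Series(np.random.randn(500), index=pd.date_range('7/3/2016', periods=500))

# index slicing on a time series: by year, by month, or by a range of dates
print(longtSer['2017'].head())
print(longtSer['2017-05'].head())
print(longtSer['2017-05-01':'2017-05-07'])

# a time series with duplicate index labels
dupSer = Series(np.arange(5), index=pd.DatetimeIndex(['1/1/2018', '1/2/2018', '1/2/2018', '1/2/2018', '1/3/2018']))
print(dupSer.index.is_unique)          # False
print(dupSer['1/2/2018'])              # returns all three entries for that date
print(dupSer.groupby(level=0).mean())  # one way to collapse the duplicates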
siteStats = {
    "Day Number": [1, 2, 3, 4, 5, 6, 7],
    "Visitors": [1405, 24517, 32415, 74512, 9541, 32145, 33],
    "Bounce Rate": [65, 42, 54, 74, 82, 10, 35]
}
dFrame = pd.DataFrame(siteStats, index=['Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']) # build a dataframe from the dictionary, with weekdays as the index
dFrame2 = dFrame.set_index("Day Number") # a second dataframe that uses Day Number as its index
print(dFrame)
print(dFrame2)
The output is:
The output is:
Day Number Visitors Bounce Rate
Mon 1 1405 65
Tue 2 24517 42
Wed 3 32415 54
Thur 4 74512 74
Fri 5 9541 82
Sat 6 32145 10
Sun 7 33 35
Visitors Bounce Rate
Day Number
1 1405 65
2 24517 42
3 32415 54
4 74512 74
5 9541 82
6 32145 10
7 33 35
One important thing to remember with a pandas dataframe is that when you
apply a method such as set_index(), you get a new dataframe and the
original dataframe remains intact. This is a safety feature so you don't
accidentally overwrite the source data. If you want to overwrite the original
dataframe, you have to use the "inplace=True" argument.
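Here's a quick illustration of the difference, reusing the dFrame built above.
dFrame2 = dFrame.set_index("Day Number")      # dFrame2 is a new dataframe; dFrame is untouched
print(dFrame.head())                          # still has the original Mon-Sun index
dFrame.set_index("Day Number", inplace=True)  # now dFrame itself is modified
print(dFrame.head())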
We can access a specific column or columns of a dataframe. We can also
convert those columns to a numpy array, which leads to a question: can we
convert a numpy array to a dataframe? Yes, we can. Here's a script that does
all that. Note that the siteStats dictionary is redefined here with keys that
don't contain spaces.
siteStats = {"Day_Number": [1, 2, 3, 4, 5, 6, 7], "Visitors": [1405, 24517, 32415, 74512, 9541, 32145, 33], "Bounce_Rate": [65, 42, 54, 74, 82, 10, 35]} # same data, keys without spaces this time
dFrame = pd.DataFrame(siteStats, index=['Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'])
print(dFrame['Bounce_Rate']) # returns a single column of the dataframe as a series with the same index as the dataframe
print(dFrame.Bounce_Rate) # only works if column headers don't have spaces; returns the same output as the print above
print(dFrame[['Visitors', 'Bounce_Rate']]) # selecting multiple columns returns a dataframe
npArr = np.array(dFrame[['Visitors', 'Bounce_Rate']]) # convert the selected columns to a numpy array
print(npArr)
dFrame3 = pd.DataFrame(npArr, columns=['Visitors', 'Bounce_Rate']) # and build a dataframe back from the numpy array
print(dFrame3)
The output of the above script is:
Mon 65
Tue 42
Wed 54
Thur 74
Fri 82
Sat 10
Sun 35
Name: Bounce_Rate, dtype: int64
Mon 65
Tue 42
Wed 54
Thur 74
Fri 82
Sat 10
Sun 35
Name: Bounce_Rate, dtype: int64
Visitors Bounce_Rate
Mon 1405 65
Tue 24517 42
Wed 32415 54
Thur 74512 74
Fri 9541 82
Sat 32145 10
Sun 33 35
[[ 1405    65]
 [24517    42]
 [32415    54]
 [74512    74]
 [ 9541    82]
 [32145    10]
 [   33    35]]
As the output suggests, when we access a single column of a dataframe
using the column header, we get a pandas series. In our script, we created the
dataframe from a dictionary whose keys were set without spaces, which
makes each column accessible as an attribute of the dataframe.
Reading Data
The pandas framework also provides out-of-the-box support for getting data
in the most common data set file formats from both local and external
sources. We have already learned how to connect with Quandl using its API
URL or the quandl Python library to access data. What if the data is already
downloaded from Quandl and currently residing on your computer? We can
read data from there too. Log in to your Quandl account by going to their
website and search for "AAPL", which is the stock price data for Apple Inc.
Quandl provides a data export feature in many file formats. Let's download
the data in .csv format to the current PyCharm project folder.
import pandas as pd
dataF = pd.read_csv('EOD-AAPL.csv', index_col='Date') # the file name is whatever you saved the Quandl export as
print(dataF.head())
dataF.rename(columns={'Open': 'Open_Price'}, inplace=True) # work with a slightly different column name
print(dataF.head())
dataF.rename(columns={'Open_Price': 'Open'}, inplace=True) # normalize back to one form
print(dataF.head())
Date         Open    High     Low      ...  Adj_Low     Adj_Close   Adj_Volume
2017-12-28   171.00  171.850  170.480  ...  165.957609  166.541693  16480187.0
2017-12-27   170.10  170.780  169.710  ...  165.208036  166.074426  21498213.0
2017-12-26   170.80  171.470  169.679  ...  165.177858  166.045222  33185536.0
2017-12-22   174.68  175.424  174.500  ...  169.870969  170.367440  16349444.0
2017-12-21   174.17  176.020  174.100  ...  169.481580  170.367440  20949896.0
[5 rows x 12 columns]

Date         Open    High     Low      ...  Adj_Low     Adj_Close   Adj_Volume
2017-12-28   171.00  171.850  170.480  ...  165.957609  166.541693  16480187.0
2017-12-27   170.10  170.780  169.710  ...  165.208036  166.074426  21498213.0
2017-12-26   170.80  171.470  169.679  ...  165.177858  166.045222  33185536.0
2017-12-22   174.68  175.424  174.500  ...  169.870969  170.367440  16349444.0
2017-12-21   174.17  176.020  174.100  ...  169.481580  170.367440  20949896.0
[5 rows x 12 columns]

Date         Open_Price  High     Low      ...  Adj_Low     Adj_Close   Adj_Volume
2017-12-28   171.00      171.850  170.480  ...  165.957609  166.541693  16480187.0
2017-12-27   170.10      170.780  169.710  ...  165.208036  166.074426  21498213.0
2017-12-26   170.80      171.470  169.679  ...  165.177858  166.045222  33185536.0
2017-12-22   174.68      175.424  174.500  ...  169.870969  170.367440  16349444.0
2017-12-21   174.17      176.020  174.100  ...  169.481580  170.367440  20949896.0
[5 rows x 12 columns]

Date         Open    High     Low      ...  Adj_Low     Adj_Close   Adj_Volume
2017-12-28   171.00  171.850  170.480  ...  165.957609  166.541693  16480187.0
2017-12-27   170.10  170.780  169.710  ...  165.208036  166.074426  21498213.0
2017-12-26   170.80  171.470  169.679  ...  165.177858  166.045222  33185536.0
2017-12-22   174.68  175.424  174.500  ...  169.870969  170.367440  16349444.0
2017-12-21   174.17  176.020  174.100  ...  169.481580  170.367440  20949896.0
[5 rows x 12 columns]
As you can see, the print statements give essentially the same data each time.
That was the goal of the above script: to read data from .csv files that arrive
in slightly different forms and to normalize them to one form.
Writing Data
Using the pandas framework, we can save dataframes in file formats other
than the one we imported them from. For example, let's save our dataframe as
an HTML file.
dataF.to_html('dataF.html') # save dataframe as an HTML file
The new HTML file will be saved in the directory of current project. If you
open the file using a browser, you will see the following.
If you open the same HTML file in a text editor, you will see the following.
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Open</th>
<th>High</th>
<th>Low</th>
<th>Close</th>
<th>Volume</th>
<th>Dividend</th>
<th>Split</th>
<th>Adj_Open</th>
<th>Adj_High</th>
<th>Adj_Low</th>
<th>Adj_Close</th>
<th>Adj_Volume</th>
</tr>
...
Getting Data from Internet
Here is some code that gets the stock price data of Google (GOOG) from the
Yahoo Finance site using the special pandas_datareader library. We have to
first install this library using pip. Switch to the Terminal in PyCharm and make
sure the "tf" environment is activated; if not, activate it with the following command.
activate tf
Then run the pip command.
pip install pandas_datareader
Once installed, run the following script.
import pandas_datareader.data as web
import matplotlib.pyplot as plt
from matplotlib import style

rangeStart = '1/1/2009'
rangeStop = '1/1/2019'
style.use('ggplot')

dFrame = web.DataReader('GOOG', 'yahoo', rangeStart, rangeStop) # pull GOOG price data from Yahoo Finance
print(dFrame.head())
dFrame['Adj Close'].plot()
plt.show()
The output will be:
Date High Low ... Volume Adj Close
[5 rows x 6 columns]
The plotted graph will look like the following. Note how it looks different
from our previous charts because we used the preset style "ggplot".
Machine-Learning Projects
We have learned enough tools and procedures to start dealing with real-life
scenarios. Let’s do two projects that will show us how machine learning can
be beneficial in solving problems.
Remember when we talked about whether happiness and money are related?
Let's write a script to find out the relationship between the two using data
from a few countries.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t', encoding='latin1', na_values="n/a")

# Prepare the data (prepare_country_stats() is a small helper that merges the two data sets; a sketch of it follows the script)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]] # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]
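The script above relies on a helper function, prepare_country_stats(), that isn't shown. Here is a simplified sketch of what such a helper could look like; it pivots the OECD life-satisfaction data, joins it with the GDP-per-capita figures, and keeps only the two columns the model needs. The column names (INEQUALITY, Indicator, Value, and the "2015" GDP column) are assumptions about the two downloaded files.
import pandas as pd

def prepare_country_stats(oecd_bli, gdp_per_capita):
    # keep only the "total" rows and turn each indicator into its own column
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    # line the GDP data up on the same country index
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    # merge the two data sets on the country name
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    return full_country_stats[["GDP per capita", "Life satisfaction"]]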
A real estate tycoon, Mr. Gill from New York, wants to expand his investment
portfolio and is actively considering making substantial investments in other
states of the US. Luckily, a common acquaintance introduced you to him and
he offered you a big paycheck contract. You have only one job: to find out
whether he should invest in the real estate of a specific state right away, delay,
or stop thinking about new investment ventures altogether.
The first step is to gather real estate pricing data for the area. We are going to
use Quandl as the data source. After we have the data, we will find the
correlation of the Housing Price Index (HPI) of all the states with each other
and with the national HPI. We will plot all the information to visualize the HPI
trends for the last 44 years.
import pandas as pd
import quandl
import matplotlib.pyplot as plt
from matplotlib import style

api_key = "YOUR_API_KEY"  # place your Quandl API key here

# get the postal abbreviations of all US states from Wikipedia
usStates = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states#List')
usStates = usStates[0]['postal abbreviation[1]']

''' # only run this block once, then afterwards use the pickle files so we don't have to query the quandl API on every script run
supFrame = pd.DataFrame()
for abbv in usStates:
    # query the Freddie Mac HPI data set for each state and join the columns into one big dataframe (reconstructed loop; query format and column name assumed to match the national data set below)
    stateData = quandl.get('FMAC/HPI_' + str(abbv), authtoken=api_key)
    stateData = stateData[['NSA Value']]
    stateData.columns = [str(abbv)]
    if supFrame.empty:
        supFrame = stateData
    else:
        supFrame = supFrame.join(stateData)
supFrame.to_pickle('supFrame.pickle')
# the same data as percent change relative to the starting value
iniPerc = ((supFrame - supFrame.iloc[0]) / supFrame.iloc[0]) * 100
iniPerc.to_pickle('supFrame2.pickle') # enable this line if the percent values are to be saved
'''

supFrame = pd.read_pickle('supFrame.pickle')
iniPercSupFrame = pd.read_pickle('supFrame2.pickle') # this is the dataframe with percent change relative to the starting value
relpercSupFrame = supFrame.pct_change() # this calculates the percent change relative to the previous value

''' # optional inspection prints
print(supFrame.head())
print(supFrame.tail())
print(relpercSupFrame.head())
print(relpercSupFrame.tail())
print(iniPercSupFrame.head())
print(iniPercSupFrame.tail())'''

'''supFrame.plot()
#relpercSupFrame.plot()
#iniPercSupFrame.plot()
style.use('fivethirtyeight')
plt.legend().remove()
plt.show()'''

''' # only run this once, then afterwards use the pickle file so we don't have to query the quandl API on every script run
usFrame = quandl.get('FMAC/HPI_USA.1', authtoken=api_key)
usFrame['NSA Value'] = ((usFrame['NSA Value'] - usFrame['NSA Value'][0]) / usFrame['NSA Value'][0]) * 100 # get percent change relative to the starting value
usFrame.to_pickle('usFrame.pickle')'''
usFrame = pd.read_pickle('usFrame.pickle') # this has the housing price index for the entire USA

# correlation of the state HPI trends with each other
stateCorr = iniPercSupFrame.corr()
print(stateCorr.head())
print(stateCorr.describe())

fig = plt.figure()
ax1 = plt.subplot2grid((1, 1), (0, 0))

#stateYR = iniPercSupFrame['TX'].resample('A').mean()
stateYR = iniPercSupFrame['TX'].resample('A').ohlc() # get open, high, low and close along with the monthly HPI for the state of Texas
print(stateYR.head())

iniPercSupFrame['TX'].plot(ax=ax1, label='Monthly TX HPI')
stateYR.plot(ax=ax1, label='Yearly TX HPI')
plt.legend(loc=4) # show the legend at the bottom right hand side of the plot
plt.show() # show the relation between the monthly HPI and the annual HPI for the state of Texas
Note that I have commented out some parts of the code, uncomment them
when you run the script for the first time. The outputs are below.
AL AK AZ ... WV WI WY
AL 1.000000 0.956590 0.946930 ... 0.986657 0.992753 0.957353
AK 0.956590 1.000000 0.926650 ... 0.977033 0.943584 0.989444
AZ 0.946930 0.926650 1.000000 ... 0.933585 0.945189 0.927794
AR 0.995986 0.974277 0.946899 ... 0.993022 0.988525 0.971418
CA 0.949688 0.935188 0.982345 ... 0.946577 0.951628 0.935299
[5 rows x 50 columns]
AL AK AZ ... WV WI WY
count 50.000000 50.000000 50.000000 ... 50.000000 50.000000 50.000000
mean 0.971637 0.947688 0.943983 ... 0.967408 0.967758 0.948303
std 0.022655 0.034883 0.022095 ... 0.027050 0.022545 0.036489
min 0.895294 0.814556 0.884497 ... 0.862753 0.901193 0.813296
25% 0.958299 0.937193 0.934415 ... 0.956899 0.954704 0.935313
50% 0.976359 0.955945 0.945212 ... 0.973251 0.971728 0.958090
75% 0.987204 0.965108 0.953891 ... 0.985205 0.985719 0.968983
max 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000
[8 rows x 50 columns]
Date open high low close
The next phase is to introduce some more data attributes and then perform
machine learning to make some predictions.
import pandas as pd
import quandl

api_key = "YOUR_API_KEY"  # place your Quandl API key here
iniPercSupFrame = pd.read_pickle('supFrame2.pickle') # the state HPI dataframe prepared in the previous script
usFrame = pd.read_pickle('usFrame.pickle') # the national HPI prepared in the previous script

''' # only run this once, then afterwards use the pickle file so we don't have to query the quandl API on every script run
mortggFrame = quandl.get('FMAC/MORTG', trim_start="1975-01-01", authtoken=api_key)
mortggFrame['Value'] = ((mortggFrame['Value'] - mortggFrame['Value'][0]) / mortggFrame['Value'][0]) * 100 # get percent change relative to the starting value
print(mortggFrame.head())
mortggFrame = mortggFrame.resample('D').mean()
mortggFrame = mortggFrame.resample('M').mean() # these two resampling operations are a hack to shift the data column from the start of the month to the end of the month
print(mortggFrame.head())
mortggFrame.to_pickle('mortggFrame.pickle')'''
mortggFrame = pd.read_pickle('mortggFrame.pickle')
mortggFrame.columns = ['Mort30yr']
print(mortggFrame.head())

''' # only run this once, then afterwards use the pickle file so we don't have to query the quandl API on every script run
unEmp = quandl.get("USMISERY/INDEX.1", start_date="1975-01-31", authtoken=api_key)
unEmp["Unemployment Rate"] = ((unEmp["Unemployment Rate"] - unEmp["Unemployment Rate"][0]) / unEmp["Unemployment Rate"][0]) * 100.0 # get percent change relative to the starting value
unEmp.to_pickle('unEmp.pickle')
'''
unEmp = pd.read_pickle('unEmp.pickle')
print(unEmp.head())

HPIMega = iniPercSupFrame.join([usFrame, mortggFrame, unEmp]) # include the national HPI too; it becomes the US_HPI column later
HPIMega.to_pickle('HPIMega.pickle')
HPIMega = pd.read_pickle('HPIMega.pickle')
The “HPIMega.pickle” is our final prepared data that we will use to perform
machine learning. The outputs generated by this section of code are:
Date Mort30yr
1975-01-31 0.000000
1975-02-28 -3.393425
1975-03-31 -5.620361
1975-04-30 -6.468717
1975-05-31 -5.514316
Date Unemployment Rate
1975-01-31 0.000000
1975-02-28 0.000000
1975-03-31 6.172840
1975-04-30 8.641975
1975-05-31 11.111111
Date AL AK ... Mort30yr Unemployment Rate
1975-01-31 0.000000 0.000000 ... 0.000000 0.000000
1975-02-28 0.572095 1.461717 ... -3.393425 0.000000
1975-03-31 1.238196 2.963317 ... -5.620361 6.172840
1975-04-30 2.043555 4.534809 ... -6.468717 8.641975
1975-05-31 2.754944 6.258447 ... -5.514316 11.111111
... ... ... ... ... ...
2016-05-31 266.360987 411.870585 ... -61.823966 -41.975309
2016-06-30 269.273949 412.844893 ... -62.142100 -39.506173
2016-07-31 271.680625 411.627829 ... -63.520679 -39.506173
2016-08-31 272.869925 409.194858 ... -63.520679 -39.506173
2016-09-30 272.330718 406.662586 ... -63.308590 -38.271605
We also have to install the scikit-learn library to perform machine-learning
operations. Go to the Terminal section of PyCharm and run the command
"activate tf" if the "tf" virtual environment is not active. Then, run the pip
command.
pip install scikit-learn
Here is the code.
import pandas as pd
import numpy as np
from statistics import mean
from sklearn import svm, preprocessing, model_selection

def movAvg(values):
    return mean(values)

HPIMega = pd.read_pickle('HPIMega.pickle')
HPIMega.rename(columns={"NSA Value": "US_HPI"}, inplace=True)
HPIMega = HPIMega.pct_change() # to predict the future we should have the percent change relative to the last value, not the start value
#HPIMega['custom_mort_mean'] = HPIMega['Mort30yr'].rolling(10).apply(movAvg, raw=True) # example of a rolling apply

# build the label (reconstructed step): 1 if the US HPI is higher in the future, otherwise 0
HPIMega['US_HPI_future'] = HPIMega['US_HPI'].shift(-1)
HPIMega.replace([np.inf, -np.inf], np.nan, inplace=True)
HPIMega.dropna(inplace=True)
HPIMega['label'] = (HPIMega['US_HPI_future'] > HPIMega['US_HPI']).astype(int)

# features, labels, and a train/test split (reconstructed step)
X = preprocessing.scale(np.array(HPIMega.drop(['label', 'US_HPI_future'], axis=1)))
y = np.array(HPIMega['label'])
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

cLinFunc = svm.SVC(kernel='linear')
cLinFunc.fit(X_train, y_train)
print("Prediction accuracy: %.2f%%" % (cLinFunc.score(X_test, y_test) * 100))
The above code gives the following output.
Prediction accuracy: 71.23%
The accuracy will fluctuate within a certain range. So, a recap: we have
written a script that predicts the labels 0 and 1. Remember that the label is
only 1 if the future HPI is greater than the current HPI (meaning real estate
prices will go higher in the future). Our script correctly predicted whether the
HPI would rise or drop in roughly 71 out of 100 test instances. We can feed
some more data into the system, for example US GDP, to further enhance the
accuracy.
Are you going to meet the real estate tycoon with your current prediction
model now or ask for more time, get more data, improve accuracy, and then
have a meeting? What are you going to do?
Chapter 3: Working with Raspberry Pi
Ever heard of overkill? Your computer or laptop might be a versatile
beast, but sometimes it can feel like using a tank to go to Walmart.
Depending upon your application, it might be easier to deal with a smaller,
reprogrammable computer that offers a specific set of hardware capabilities.
Various compact reprogrammable computers have been available for
decades, and engineers and coding enthusiasts have used them to develop all
kinds of solutions. The Raspberry Pi is one such computer.
The current model of the Raspberry Pi is the 4 Model B. For this book, we are
going to use the more common Model 3 B because you will find more content
about this model on the internet. You can purchase the minicomputer by
finding authorized resellers for your country on this web page:
https://www.raspberrypi.org/products/raspberry-pi-3-model-b/
You will also have to purchase some peripherals. Here is a list.
1. A breadboard
2. A jumper wires kit (male to female is what we require for this
book)
3. A bunch of LED lights (green, yellow, red)
4. A bunch of resistors (one 1k Ω, one 2k Ω, three 300-1k Ω)
5. Distance sensor (two HC-SR04)
6. Raspberry Pi camera module
7. A USB microSD card reader for your computer if it doesn’t have a
reader slot
8. A microUSB power supply (2.1 A rating)
9. A microSD card with at least 16GB of storage capacity and an
adapter if you have purchased an older model
10. For development, you will need a monitor, keyboard, and mouse
Hardware Components
The Model 3 B replaced the Model 2 B in early 2016. It has the following
hardware configuration:
● Quad-core 64-bit 1.2GHz Broadcom BCM2837 CPU
● 1GB RAM
● Wireless LAN and Bluetooth Low Energy (BLE) BCM43438 on board
● 100 Base Ethernet
● 40-pin GPIO (extended)
● 4 USB 2.0 ports
● Composite video port
● 4-pole stereo audio port
● Full-size HDMI
● CSI port to connect the Raspberry Pi camera
● DSI port to connect the Raspberry Pi touchscreen display
● MicroSD port
● MicroUSB power port that supports up to 2.5 A rated power sources
According to the manufacturer, this particular model complies with the
relevant European standards.
First Project
The code in this section is inspired by the work of the "sentdex" YouTube
channel. The images used are also provided by the channel.
The Raspberry Pi camera module was last updated in spring 2016. It is an
8MP (megapixel) Sony IMX219 sensor camera that supports capturing
high-definition video and still images. You can also perform advanced video
effects like time-lapse and slow motion. The new camera has better image
quality even in low-lit environments and exceptional color fidelity. The
supported video modes are 1080p at 30 frames/sec, 720p at 60 frames/sec,
and the low-quality VGA at 90 frames/sec, which is suitable for long
recordings that would otherwise clog the memory.
Let's plug the camera into the special camera slot and connect all the other
peripherals before powering up the Raspberry Pi. Once everything's ready,
log in to the Pi's OS using the remote connection. Open the terminal
and run the following command.
cd Desktop/
We also have to enable camera interfacing by starting Raspberry
Configuration.
sudo raspi-config
Select “Interfacing Options”, “Camera” is the first option, select it, select
“Yes” to enable camera interface. Select “Finish” to close the configuration.
You might have to reboot the Raspberry Pi, go ahead if it needs to be done.
Once it powers back on, run the following command again if needed.
cd Desktop/
Run another command to create a script file on the desktop.
nano cameraex.py
This will open the script file. Let’s add the following script.
import picamera
import time
cam = picamera.PiCamera()
cam.capture('firstImage.jpg')
Enter CTRL+X to close the script file, save it with the same name, and run
the script using the terminal with the following command.
python cameraex.py
On the desktop, there will be a new image file. Open it to see the result. We
can perform various image manipulations. For example, to vertically flip the
image, we can change the “cameraex.py” script like this.
import picamera
import time
cam = picamera.PiCamera()
cam.vflip = True
cam.capture('firstImage.jpg')
Let’s record a video. We have to replace the cam.capture() with another line
of code.
import picamera
import time
cam = picamera.PiCamera()
#cam.vflip = True
#cam.capture('firstImage.jpg')
cam.start_recording('firstvid.h264')
time.sleep(5) # record for five seconds
cam.stop_recording()
Let’s run the script using the following command.
python cameraex.py
To view the video, we need to use a media player. The Raspbian OS comes
loaded with a media player called "OMX Player". Let's call it in the terminal to
view the recorded video.
omxplayer firstvid.h264
We can also test the camera with a live feed. Remove all the code from the
script file and add the following lines, run the script from the terminal.
import picamera
cam = picamera.PiCamera()
cam.start_preview()
You should see a live video feed on your screen now. It might take a few
seconds to appear. Also, if you are remotely running the Raspberry Pi, make
sure the monitor connected to the Raspberry Pi is turned on, because all the
images and videos only show up on that monitor, not on the remotely
connected system's monitor.
Sending and Receiving Signals Using GPIO of the Raspberry Pi
Sending Signal
Let's set up a series circuit on the breadboard using a 300-1k Ω resistor, a red
LED (you can choose any other color), and two male-to-female jumpers.
If you have no idea how to create circuits on a breadboard, here's a tutorial
that can get you up to speed: https://www.youtube.com/watch?v=6WReFkfrUIk
Make sure you remember the pins where you connect the female ends of the
jumpers on the GPIO. To make things easier, let's connect the female end of
the jumper at the GPIO23 position. Here is the map of the GPIO pins on the
Raspberry Pi to make sense of the positions.
Once the circuit is ready, power up the Raspberry Pi and let's start with the
coding. Open the terminal, navigate to the desktop and, just to double check,
install the RPi.GPIO library using the following command.
sudo apt-get install python-rpi.gpio
If the library is already installed, the terminal will let you know. Let's create a
script file "ledex.py" and add the following script to it.
import RPi.GPIO as GPIO
import time
# female end of jumper connected at GPIO23
GPIO.setmode(GPIO.BCM)
GPIO.setup(23, GPIO.OUT)
GPIO.output(23, GPIO.HIGH)
time.sleep(5)
GPIO.output(23, GPIO.LOW)
GPIO.cleanup()
Save the script file and run it through the terminal using the following
command.
python ledex.py
You will be able to see the LED light up for 5 seconds and then turn off.
Receiving Signal
Let's create a circuit using the HC-SR04 distance sensor, four jumpers, and
three 1k Ω resistors. How does the distance sensor work? The sensor sends
out a sound pulse (the trigger), which strikes a physical object in front
of the sensor and is reflected back to the sensor (the echo). Since we know the
speed of sound, we can measure the time difference between the trigger and
the echo and use it to calculate the distance between the sensor and the physical object.
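For example, sound travels at roughly 34,300 cm/s and the pulse covers the sensor-to-object distance twice, so if the echo arrives 2 milliseconds after the trigger, the object is about (0.002 × 34,300) / 2 ≈ 34 cm away.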
Once everything is connected as shown in the image below, power up the kit
and start the terminal to run a few commands.
cd Desktop/
nano distancesensor.py
In the new script file, add the following script.
import RPi.GPIO as GPIO
import time

GPIO.setmode(GPIO.BCM)

# declaring pin connections
TRIG = 4
ECHO = 18
GREEN = 17
YELLOW = 27
RED = 22

GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)
GPIO.setup(GREEN, GPIO.OUT)
GPIO.setup(YELLOW, GPIO.OUT)
GPIO.setup(RED, GPIO.OUT)

def greenLEDon():
    GPIO.output(GREEN, GPIO.HIGH)
    GPIO.output(YELLOW, GPIO.LOW)
    GPIO.output(RED, GPIO.LOW)

def yellowLEDon():
    GPIO.output(GREEN, GPIO.LOW)
    GPIO.output(YELLOW, GPIO.HIGH)
    GPIO.output(RED, GPIO.LOW)

def redLEDon():
    GPIO.output(GREEN, GPIO.LOW)
    GPIO.output(YELLOW, GPIO.LOW)
    GPIO.output(RED, GPIO.HIGH)

def getDistance():
    # send a short trigger pulse
    GPIO.output(TRIG, True)
    time.sleep(0.0005)
    GPIO.output(TRIG, False)
    # time the echo pulse (standard HC-SR04 approach; this timing loop is a reconstruction)
    while GPIO.input(ECHO) == 0:
        pulseStart = time.time()
    while GPIO.input(ECHO) == 1:
        pulseEnd = time.time()
    # sound travels roughly 34300 cm/s and covers the distance twice
    distance = (pulseEnd - pulseStart) * 17150
    return distance

while True:
    distance = getDistance()
    time.sleep(0.005)
    print(distance)
The Projects
predict = 'G3' # the goal is to predict this as close to the actual values as possible; we will remove this column from stdntFram before training and testing
X = np.array(stdntFram.drop([predict], 1))
y = np.array(stdntFram[predict])
allPred = linMod.predict(X_test)
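For context, this fragment belongs to a linear-regression project on the UCI student performance data set (the Cortez and Silva paper listed in the references). Here is a hedged, self-contained sketch of how such a project could be put together; the file name student-mat.csv and the chosen attribute columns are assumptions.
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# the UCI student performance file uses ';' as its separator
stdntFram = pd.read_csv("student-mat.csv", sep=";")
stdntFram = stdntFram[["G1", "G2", "studytime", "failures", "absences", "G3"]]

predict = 'G3'
X = np.array(stdntFram.drop([predict], axis=1))
y = np.array(stdntFram[predict])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

linMod = linear_model.LinearRegression()
linMod.fit(X_train, y_train)
print("R^2 on the test set:", linMod.score(X_test, y_test))

allPred = linMod.predict(X_test)
for pred, actual in zip(allPred[:5], y_test[:5]):
    print("predicted %.1f, actual %d" % (pred, actual))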
Let's take on another project. This time we are going to implement the K-
Nearest Neighbors (KNN) algorithm, which classifies a data point based on
the classes of its 'k' closest neighbors. We are going to deal with a data set
that requires more preprocessing than the one in the earlier project. The KNN
algorithm is usually used to classify data points; for example, a movie
database can be modeled to learn about good and bad movies and predict
whether a future movie will be classified as bad or good. In our project, we
are going to classify cars based on certain car attributes.
Let's go to https://archive.ics.uci.edu/ml/datasets/Car+Evaluation and
download the "Car Evaluation Data Set" as we did for the first project. We
are going to add a header to this file to make it easier to use pandas to
read the data. Just open the data file in PyCharm and add the following line
as the first line of the file. Don't forget to save the file afterwards.
Let’s start coding.
import sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn import preprocessing

data = pd.read_csv("car.data")
#print(data.head()) # check the data frame to see if we need to prepare the data before building the model

lblEnc = preprocessing.LabelEncoder()
buying = lblEnc.fit_transform(list(data["buying"]))
maint = lblEnc.fit_transform(list(data["maint"]))
door = lblEnc.fit_transform(list(data["door"]))
persons = lblEnc.fit_transform(list(data["persons"]))
lug_boot = lblEnc.fit_transform(list(data["lug_boot"]))
safety = lblEnc.fit_transform(list(data["safety"]))
clasify = lblEnc.fit_transform(list(data["class"]))

# pack the encoded attributes into feature rows and split into training and testing sets (reconstructed step)
X = list(zip(buying, maint, door, persons, lug_boot, safety))
y = list(clasify)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

knnModel = KNeighborsClassifier(n_neighbors=9)
knnModel.fit(X_train, y_train)
print("Accuracy:", knnModel.score(X_test, y_test))
allPred = knnModel.predict(X_test)
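The compile() call below refers to a neuralModel object whose definition is not shown here. As a placeholder, here is a minimal sketch of a model such a call could apply to, built with TensorFlow's Keras API; the input shape and layer sizes are arbitrary assumptions.
import tensorflow as tf
from tensorflow import keras

# a small, generic feed-forward network; the shapes and sizes are placeholders
neuralModel = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),   # e.g. 28x28 input images
    keras.layers.Dense(128, activation="relu"),   # one hidden layer
    keras.layers.Dense(10, activation="softmax")  # one output node per class
])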
# create the model using the above definition; typical values are used for the optimizer, loss, and metrics
neuralModel.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
A Corporate Project
You have an online store and want to optimize the chat service your support
staff uses, so that outside office hours a chat bot takes over the chat service
and offers answers to the most common queries.
The chat bot will learn from the user queries and provide better answers over
time. Let's start coding without any delay. Go ahead and install the "nltk" and
"tflearn" libraries in the virtual environment. I am going to use a json file to
provide training and testing data to our chat bot. I have filled the json file with
some random strings that resemble the most common user queries from past
chats.
Here's the complete code, inspired by the work of the "techwithtim" YouTube
channel. The chat bot will have to process natural language; in Python, we
can use the nltk library for this purpose.
import nltk
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()
import numpy
import tflearn
import tensorflow
import random
import json
import pickle

# the json file holds the training phrases grouped into "intents"; the file name is an assumption
with open("intents.json") as file:
    data = json.load(file)

try:
    with open("data.pickle", "rb") as f:
        words, labels, training, output = pickle.load(f)
except:
    words = []
    labels = []
    docs_x = []
    docs_y = []
    # tokenize every pattern and remember which intent (tag) it belongs to
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            wrds = nltk.word_tokenize(pattern)
            words.extend(wrds)
            docs_x.append(wrds)
            docs_y.append(intent["tag"])
        if intent["tag"] not in labels:
            labels.append(intent["tag"])
    # stem the words and remove duplicates
    words = [stemmer.stem(w.lower()) for w in words if w != "?"]
    words = sorted(list(set(words)))
    labels = sorted(labels)
    training = []
    output = []
    out_empty = [0 for _ in range(len(labels))]
    # turn every tokenized pattern into a bag-of-words vector
    for x, doc in enumerate(docs_x):
        bag = []
        wrds = [stemmer.stem(w.lower()) for w in doc]
        for w in words:
            if w in wrds:
                bag.append(1)
            else:
                bag.append(0)
        output_row = out_empty[:]
        output_row[labels.index(docs_y[x])] = 1
        training.append(bag)
        output.append(output_row)
    training = numpy.array(training)
    output = numpy.array(output)
    with open("data.pickle", "wb") as f:
        pickle.dump((words, labels, training, output), f)

tensorflow.reset_default_graph()

# a simple fully connected network: two small hidden layers and a softmax output
net = tflearn.input_data(shape=[None, len(training[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(output[0]), activation="softmax")
net = tflearn.regression(net)
model = tflearn.DNN(net)

try:
    model.load("model.tflearn")
except:
    model.fit(training, output, n_epoch=1000, batch_size=8, show_metric=True)
    model.save("model.tflearn")
To start using the chat bot, we can add two functions to the script as shown
below.
def bag_of_words(s, words):
    bag = [0 for _ in range(len(words))]
    s_words = nltk.word_tokenize(s)
    s_words = [stemmer.stem(word.lower()) for word in s_words]
    for se in s_words:
        for i, w in enumerate(words):
            if w == se:
                bag[i] = 1
    return numpy.array(bag)

def chat():
    print("Start talking with the bot (type quit to stop)!")
    while True:
        inp = input("You: ")
        if inp.lower() == "quit":
            break
        # feed the user input to the model and pick the most likely intent tag (reconstructed step)
        results = model.predict([bag_of_words(inp, words)])
        results_index = numpy.argmax(results)
        tag = labels[results_index]
        for tg in data["intents"]:
            if tg['tag'] == tag:
                responses = tg['responses']
        print(random.choice(responses))

chat()
Conclusion
I hope you enjoyed this book. We started from the basics of Python and
worked towards implementing some pretty serious machine-learning
concepts. I also hope that after reading this book, you feel content with my
efforts and are motivated towards exploring more about machine learning on
your own.
There are various resources available online that you can follow to further
your journey in this field. You will find several good YouTube channels with
in-depth tutorials on machine learning.
No matter how you proceed now, keep this book handy because it will help
you find answers quickly. One of the problems with learning from online
resources is that you quickly forget where you saw something and waste a lot
of time searching for it. In all honesty, bookmarks are still a thing that makes
books better.
Machine learning is already penetrating every walk of life, and the trend will
continue to grow unless there's a worldwide catastrophe that completely
stops the internet. Ever since someone remarked that "data is the new oil,"
industry experts have argued about how valuable user data truly is for
businesses and why there's a need to give everyone the ability to monetize
access to their own data. The social media platform you joined for free years
ago, just to connect with your friends, now shows you targeted ads, for which
the advertiser is heavily charged, and on top of that it sells your data to data
aggregators. Do you think the transaction is fair?
There is definitely a need to monitor and control what data companies can
access and for what reason. The European General Data Protection
Regulation (GDPR) is a very good initiative in this regard. I hope more
countries follow suit and adapt the GDPR to their local internet industries.
Too much of a good thing can also be bad for you!
But the future is not all gloom and doom. Machine learning is already making
lives easier, and as more industries become aware of its capabilities, it will
become easier for them to gradually optimize their services. Consider a fully
capable, artificially intelligent robot providing medical services in places where
there's an extreme shortage of doctors and health facilities. Imagine a house
that's aware of your needs and prepares everything every day like clockwork
without any manual adjustments. The future feels good!
References
Cortez, P. and Silva A. Using Data Mining to Predict Secondary School
Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th
FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12,
Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7
Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn &
TensorFlow. 2nd ed. 1005 Gravenstein Highway North, Sebastopol, CA
95472: O'Reilly Media, 2016
Raschka, Sebastian. Python Machine Learning. 1st ed. 35 Livery Street,
Birmingham B3 2PB, UK: Packt Publishing Ltd., 2015
Appendix A: Machine-Learning Concepts
This will be a brief chapter where we will explore the theoretical background
of various machine-learning concepts that you will encounter from time to
time.
Data Scrubbing
A process that runs in the background to fix errors in the main memory
through periodic inspections is called data scrubbing. The errors are fixed
using redundant data, such as duplicate copies of the existing data or
checksums. Data scrubbing replaces varied, one-off error instances with a
consistent representation, which makes them easier to filter and fix.
In data analytics and machine learning, we know that data integrity is vital to
getting correct results from the implementation of an algorithm. Whenever
data is accessed from an unreliable source, it should first be error-checked
before any action is performed on it. Data is inherently dirty (it contains
errors) and has a tendency to fall back into dirtiness even after cleaning.
Incorrect data can lead to wrong decisions, which in turn can cause huge
business losses. Most companies that utilize data analysis and machine
learning in Customer Relationship Management (CRM), marketing tools, and
Business Intelligence (BI) software cite dealing with inconsistent data as their
biggest issue. It's like putting low-grade gas in your car: it might seem fine
because of the cheaper price and the lack of any immediate, significant
decrease in performance, but in the long run it's going to ruin the car's engine.
The quality of the data used for analysis and predictions is the difference
between successful and failing businesses.
But how would you error-check without knowing whether some data is
correct or not? You have to find inconsistencies in the data or check for the
possibility of missing data. All computer systems have some sort of error
checking so that these errors don't affect the output, but data analysts and
scientists have to take care of data scrubbing themselves.
The future belongs to the Internet of Things (IoT) where all the systems will
be connected to provide seamless functionality. Every system produces a
mammoth amount of data and joining them together to create a supersystem
would mean every subsystem should be able to understand data generated by
another subsystem. Standardizing data production and transmission will be
necessary along with implementation of highly precise data scrubbing
methods.
In the long term, data scrubbing helps in determining the cause of errors that
must be fixed in order for the system to improve its performance. Even after
the errors are removed, data scrubbing must not be removed from the system.
Using machine learning with data scrubbing results in a self-learning,
completely automatic error detection and correction system that can even
predict when an error will occur by analyzing system parameters. For
example, say you have written a script that continuously downloads data
from a statistics website. The data scrubbing module actively checks the data
to ensure there's no missing or erroneous data. The internet connection goes
down, which leads to missing data (nulls and NaNs) being fed to the system.
The data scrubbing module checks for possible reasons and learns that it's
due to a faulty internet connection. Now, whenever our script starts getting
missing data, the data scrubbing module will check the internet connection.
If it's down, it will try to reconnect, and if that fails, it will pause the data
import and notify the user of the problem through the available methods.
This way, the data scrubbing module keeps self-learning whenever a new
error is encountered. Lastly, here's a data scrubbing checklist for quick reference.
● Identify frequent sources of error and create a pattern. The sources can
be anything from bad internet connection to incorrect data import
settings.
● Check for all common data issues such as finding duplicates and
missing data. Investigate data with respect to the data standardization
rules to identify inconsistencies.
● Clean the data from all errors and inconsistencies.
● Update the data and system regularly, so they are always working with
the latest tools and standards.
● Secure access to the data and the system by deploying stringent data
security processes including staff training.
● Comply with all the regulatory laws and regulations governing the data.
An example is full compliance with GDPR for your website data.
● Enrich user experience by using high quality data that allows for better
user insights.
Neural Network
A set of algorithms created to mimic the working of the human mind is called
a neural network. Humans deduce outcomes using patterns, and neural
networks aim to use the same technique to reach conclusions.
Neural networks are chiefly used in classification and clustering. They can be
combined with other machine-learning algorithms to predict the behavior of
complex systems. Neural networks are made of numerous layers, and each
layer can have different properties (arguments). Each layer is in turn made of
nodes, which are called neurons. The layers act as the input and output of the
neural network, and the neurons act like binary switches that either turn on or
off in response to an input.
The layer structure helps neural networks find relationships between data
attributes that aren't possible to capture with single-layer predictive
algorithms. The input data goes through each layer, just like a car goes
through an automated car washing facility.
Neural networks form the basis of deep learning because neural network
algorithms generally deploy three or more layers. The input of each layer is
the output of the previous layer (except for the first layer, which takes the raw
data). Many machine-learning experts consider neural networks a makeshift
AI because the layers act on the data in a sequence to reach a conclusion.
Neural networks are great at predicting the behavior of a complex system but
often fail to establish the actual relationship between the input and the output.
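As a tiny, hedged illustration of this layered structure, here is a sketch using sklearn's MLPClassifier on a made-up XOR-style data set; the layer size and the toy data are arbitrary choices.
from sklearn.neural_network import MLPClassifier

# four points of a toy XOR problem: the label is 1 only when exactly one input is 1
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# one small hidden layer of neurons between the input and output layers
net = MLPClassifier(hidden_layer_sizes=(4,), solver='lbfgs', random_state=1, max_iter=2000)
net.fit(X, y)
print(net.predict([[0, 1], [1, 1]]))  # with a bit of luck on the initialization, the XOR pattern is learned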
Decision Trees
Decision trees contain a series of nodes, much like a flowchart, that the user
can follow along a node path to predict the probability of an event.
Decision trees are easy to understand for non-technical personnel. They are
not greatly affected by missing data or outliers, which means less time and
fewer resources are needed to clean the data. There are some disadvantages as
well; one of the biggest disadvantages of using decision trees is overfitting.
Decision trees work well for systems that have non-linear relationships
between their attributes. The algorithms mostly rely on supervised machine
learning. Each internal node corresponds to a system attribute and each
leaf (external) node corresponds to a classification label. Decision trees are
also very similar to nested sets of "if", "then", and "else" statements, which
makes them easier to code. There are two types of decision trees.
1. Continuous variable decision tree: a decision tree that has a continuous
target variable (regression algorithms)
2. Categorical variable decision tree: a decision tree where the target
variable has specific categories (classification algorithms)
If the system data has high variance (very frequent changes in the data),
decision trees might be difficult, but not impossible, to implement. Decision
trees are the algorithmic version of Julius Caesar's famous maxim "divide and
conquer." The "sklearn" library provides a module, "tree", that can be used to
create and work with decision trees in Python; a small sketch follows below.
Decision trees are the inverse of natural trees with respect to the location of
roots and leaves: in a decision tree, the leaves (terminal nodes) lie at the
bottom while the root is at the top, which means a decision tree has a
top-to-bottom hierarchy.
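Here is a minimal sketch of the sklearn tree module in action, using a made-up toy data set.
from sklearn import tree

# toy data set: [weight in grams, texture (0 = smooth, 1 = bumpy)] -> fruit label
X = [[140, 0], [130, 0], [150, 1], [170, 1]]
y = ["apple", "apple", "orange", "orange"]

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict([[160, 1]]))  # a heavy, bumpy fruit should come out as an orange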
Algorithm Selection
You have read about and coded various predictive algorithms throughout this
book. You might be wondering which algorithm is best at predicting a given
situation. The answer is: it depends on a lot of factors. There is no "one for all
situations" algorithm. As a data scientist, it will be your responsibility to
analyze the system data and behavior and choose the model you think most
closely represents a given system. You will have to find the relationships
between system attributes; usually this is done by plotting the trends of
different system attributes against each other. Once the relationships are
estimated, the correct model is picked, along with the algorithm ideal for the
specific attribute relationships.
It is important to understand that you can misinterpret the data and pick the
wrong model, leading to an algorithm implementation that will not give
accurate and precise predictions. A good data scientist will always keep this
in mind and rigorously analyze the system attribute data from different
perspectives before picking a predictive algorithm. To automate this
process, the data scientist might train different models using the same data set
and compare prediction results to choose one of them, as the short sketch
below shows.
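This is a hedged sketch of that automated comparison, using a built-in sklearn data set as a stand-in for real system data.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# train each candidate model with 5-fold cross-validation and compare the mean accuracy
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy: %.3f" % scores.mean())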
Data Mining
The process of inspecting large data sets and finding patterns by combining
techniques from machine learning, database systems, and statistics is called
data mining. It is very useful for detecting patterns in data that might not be
easily apparent. When we say data mining, we are actually not mining the
data itself, but mining what the data represents.
The process of data mining usually has six parts (these processes are not
sequential).
1. Anomaly detection: The process of finding an outlier, deviation, spike,
drop, or a huge change from the majority of data points; these might be
errors or interesting instances to investigate.
2. Dependency modeling: Also known as association rule learning; the
purpose is to find relationships between different attributes of a system.
3. Clustering: The procedure of finding groups or clusters of similar data
points in a system without prior knowledge of the data (a short sketch
follows this list).
4. Classification: The process of labeling or generalizing new data
according to a preset data structure. We have already seen this in action
in our book.
5. Regression: The process of finding a function model that fits the
entirety or the majority of the data set. This helps in dependency modeling.
6. Summarization: This process provides a concise representation of the
data set through many methods, including data visualization and report
generation.
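As a tiny illustration of the clustering task, here is a hedged sketch using sklearn's KMeans on a handful of made-up points.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 0.6],   # one loose group of points
                   [8, 8], [9, 11], [8, 9]])     # another loose group
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(points)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # the two cluster centers found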
Data mining can be used wherever you can find digital data. This has led to
several cases of data misuse. It is the responsibility of the data
accumulators/aggregators to make sure the collected data is only available for
the purposes it was collected for. Implementation of stricter rules is a big
problem, but the EU is way ahead of North America in this respect.
Unfortunately, there will be no way to completely stop data misuse in the
future, no matter how strict the implemented rules are. This is one of the
biggest challenges of the Internet of Things (IoT).