Daneyal Anis – Ultimate Step by Step Guide to Data Science Using Python (2021)
Introduction
1. Getting Started
2. Use Case 1 – Web Scraping
3. Use Case 2 – Image Processing
4. Use Case 3 – Different File Type Processing
5. Use Case 4 – Sending and Receiving Emails
6. Use Case 5 – Dynamic Time Warping for Speech Analytics
7. Use Case 6 – Time Series Analysis and Forecasting
8. Use Case 7 – Fraud Analysis
9. Use Case 8 – Processing Geospatial Data
10. Use Case 9 – Creating Recommender Systems
Afterword
11. Post your Review
12. Website and Free Gift (Code to Download)!
13. References
Text and code copyright 2021 Daneyal Anis
Language: English
This book is dedicated to everyone who persevered through these
tough times and kept an eye on their dreams
INTRODUCTION
There are detailed instructions available on the Python website, in its Getting Started section, on how to install Python on your machine, whether you are running Windows, macOS, or Linux.
I recommend you read through it, as there is a lot of good information and many helpful links for beginners. However, installing Python by itself is not very helpful or user friendly. You are better off installing Python alongside an IDE (Integrated Development Environment), as that comes with the tools and development environment needed to execute and debug your code.
For that, I recommend starting with the Anaconda Distribution. It is an open-source tool that installs industry standard IDEs and the foundational Python libraries that we will be describing in more detail in this book. When you click on the above link, it will take you to a page to download the install package for your operating system, e.g., Windows, macOS, or Linux.
We will be using Jupyter Notebook as our development
environment for this book – Jupyter is part of the Anaconda
distribution package and will be installed on your machine
along with Python.
Jupyter is a powerful web-based development environment that we will be using throughout this book to execute our code, and I have made all the source code used in this book available on my website as '.ipynb' Jupyter files.
Once Anaconda is installed on your machine, launch
Anaconda Navigator from your menu. Screenshot below for
reference:
2
USE CASE 1 – WEB SCRAPING
The internet is a massive source of data, but not all websites make this data easily accessible via API calls or downloadable as .csv files. To access and process this data, you need a way to 'scrape' the data off the website. That's where the power of Python comes in.
The Python ecosystem includes a powerful library called BeautifulSoup that allows you to parse and extract data from any web page for your use. Let's cover an example end to end.
The main steps involved in extracting data from a web page end to end are: making a 'get' request to download the page's HTML, parsing that HTML, pulling any tables into Pandas data frames, and exporting the result for further analysis – see the sketch below.
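Here is a minimal sketch of the get call and table extraction steps, assuming a hypothetical Wikipedia page URL (swap in the page you actually want to scrape); the page and dfs variable names match the snippets that follow:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# hypothetical URL of a Wikipedia page containing the movie tables discussed below
url = 'https://en.wikipedia.org/wiki/Example_page'

page = requests.get(url)                            # step 1: download the raw HTML
soup = BeautifulSoup(page.content, 'html.parser')   # step 2: parse the HTML if you need individual tags
dfs = pd.read_html(page.text)                       # step 3: pull every <table> into a list of data frames
print(len(dfs))                                     # prints 2 if the page has two tables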
Now let's see what HTML content we got back from the get call we made.
page.content
Below is the output we get:
This means there are two tables on the wiki page. Let's look at the output for the first table:
df = dfs[0]
print(df)
Note that we used index [0] on the list of data frames to grab the first table. Here are the results we get when we execute the above line of code:
We can also access individual columns by using column
names in the table:
print(df['Film'])
print(df['Year'])
print(df['Awards'])
Here are the results we get when we execute the above
lines of code:
Now, assume we are only interested in movie name and
year and don’t want the rest of the table. We can create a
subset of the data frame, only containing these two columns:
df2 = df[['Film','Year']]
print(df2)
Now that the data is nicely organized in a Pandas data frame, let's export it to Excel for additional analysis by using the 'to_excel' function.
df2.to_excel('movies.xlsx')
3
USE CASE 2 – IMAGE PROCESSING
Image processing is a very popular use case for data science – from simple applications like applying image filters to your Instagram photos to more complex use cases like cancer cell analysis through pattern recognition.
As described in detail in the second best-selling book in
this series, Ultimate Step by Step Guide to Deep Learning
using Python, Convolutional Neural Networks (CNN) are
widely used with image data. The name derives from the convolution operations performed between matrices in each layer. In essence, each image is represented as a matrix of pixel values, and the convolution is a mathematical operation applied to that matrix so the image can be processed and manipulated as required for the purpose at hand.
Luckily, Python comes with several built-in powerful
libraries for image processing that we will use to illustrate
this concept in more detail below.
We will start with matplotlib's pyplot and the scikit-image library, an open-source Python library that works well with NumPy arrays.
import matplotlib.pyplot as plt
%matplotlib inline
We import sample images and filters from the scikit-image library. The full list of available test images can be seen at this URL:
https://scikit-image.org/docs/dev/api/skimage.data.html
from skimage import data,filters
We use the checkerboard test image from the library and display it in grayscale:
image = data.checkerboard()
plt.imshow(image, cmap='gray')
type(image)
The output of the type() call confirms that the image is already stored as a NumPy array:
numpy.ndarray
Now let's mask this image, turning every pixel darker than 85 into white:
mask = image < 85
image[mask] = 255  # 255 is the maximum (white) value for an 8-bit grayscale image
plt.imshow(image, cmap='gray')
As introduced in the first book in the series, Scipy is a
foundational Python library used for mathematical and
scientific calculations. It works well with Numpy data
structures and comes with built-in libraries for image
processing, as part of scipy.ndimage sub-module.
Detailed documentation is available via this URL:
https://docs.scipy.org/doc/scipy/reference/tutorial/ndimage.html#correlation-and-convolution
Let’s see it in action:
from scipy import ndimage
image = data.chelsea()
Original Image:
plt.imshow(image)
Now let's apply a light Gaussian filter to this cat image to
make it blurry. Details of the Gaussian filter and how it
works are included in the second book in the series and you
can also refer to additional documentation on types of filters
available by following this URL:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
blurred_image = ndimage.gaussian_filter(image, sigma=3)
plt.imshow(blurred_image)
And now, let's make it even blurrier by increasing the sigma value for the Gaussian filter:
very_blurred_image = ndimage.gaussian_filter(image, sigma=5)
plt.imshow(very_blurred_image)
4
USE CASE 3 – DIFFERENT FILE TYPE PROCESSING
Data scientists often work with different file types to extract and process data before it is ready for analysis. Knowing how to work with different file types is an important skill to have in your toolkit. In this chapter, we will go over Python processing modes for different file types and how to organize this data into built-in Python data structures. Let's get started!
Python can work with many different file types. In this chapter, we will walk through CSV, Excel, JSON, and XML files.
Let's first start with reading from a CSV file and storing its information in a Pandas data frame. For this example, we will create a .csv file in the same folder where the Python code resides for easy processing.
Let’s first read a CSV file containing comma separated
values for Oscar winning movies. The file looks like this:
We will import contents of the csv file into a Pandas data
frame:
import pandas as pd
df_csv = pd.read_csv('csvtest.csv')
Now let’s display the contents of this file in a data frame:
display(df_csv)
We can do the same with an Excel file. Our sample Excel file looks like this:
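A minimal sketch of the Excel read, assuming the spreadsheet is saved as 'exceltest.xlsx' (an illustrative file name) in the same folder and reusing the pandas import above:
df_excel = pd.read_excel('exceltest.xlsx')   # 'exceltest.xlsx' is an assumed file name
display(df_excel)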
To read a JSON file, just like Excel and CSV files, you can use the Pandas read_json function. For this example, we use a JSON file sample from:
https://json.org/example.html
Our JSON file sample looks like this:
df_json = pd.read_json('jsontest.json')
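To work with XML, we use Python's built-in xml.etree.ElementTree module; a minimal sketch of parsing a sample file, with 'xmltest.xml' as an assumed file name:
import xml.etree.ElementTree as ET

tree = ET.parse('xmltest.xml')   # 'xmltest.xml' is an assumed sample file name
root = tree.getroot()            # root element of the XML document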
Now let's print all children of the root of the tree, along with their corresponding tags and attributes:
for child in root:
    print(child.tag, child.attrib)
5
USE CASE 4 – SENDING AND RECEIVING EMAILS
Python makes it easy to send and receive emails. For that, all you need are the powerful smtplib and imaplib libraries. Now, why would you want to send and receive emails using a programming language, you ask? It lets you automate the mundane activity of sending mass emails to an email list while using a custom template. It also allows you to parse through your different inboxes and look for key information. Python makes it all possible!
Before we get into Python code, let’s first explain what
SMTP is. It stands for Simple Mail Transfer Protocol. It was
created in 1982 and is still in use by big email providers of
the world like Gmail, Yahoo Mail, and others. In simplest
terms, it is the language used by mail servers to
communicate with each other to send and receive emails. Got
it? Let’s move on.
Python library smtplib uses the SMTP protocol and has
built-in functions to send emails.
Let's first import the libraries we are going to use. Notice that we imported the BeautifulSoup library too, as we will be using it later to parse HTML content for our emails. You will also notice that we imported the MIME standards – that's what allows us to send email messages in different formats like binary, ASCII, HTML, and others.
import smtplib
from email import encoders
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from bs4 import BeautifulSoup as bs
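Using the libraries imported above, here is a minimal sketch of sending an HTML email over SMTP; the server, port, addresses, and password are placeholders (Gmail, for example, requires an app password rather than your normal login):
msg = MIMEMultipart('alternative')
msg['Subject'] = 'Hello from Python'
msg['From'] = 'sender@example.com'           # placeholder sender address
msg['To'] = 'recipient@example.com'          # placeholder recipient address
msg.attach(MIMEText('<p>This email was sent by <b>Python</b>.</p>', 'html'))

with smtplib.SMTP('smtp.gmail.com', 587) as server:    # assumed SMTP host and port
    server.starttls()                                   # upgrade the connection to TLS
    server.login('sender@example.com', 'app-password')  # placeholder credentials
    server.send_message(msg)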
6
USE CASE 5 – DYNAMIC TIME WARPING FOR SPEECH ANALYTICS
Now that we have covered some basic Python real world use cases like web scraping, file and image processing, as well as automating sending and receiving emails, it is time to address more advanced use cases. How about warping time?
Now, now…don’t worry…we are not messing with the
space time continuum and don’t want to start new
multiverses like in the comic books. Instead, in this chapter,
we are going to discuss the concept of time warping using
Python. What is time warping, you ask? It is a spin on traditional time series analysis: you want to compare two datasets that occurred over the same period, but the x-axis representing time is not on the same scale between the two data sets, i.e., they do not share the same start and end points or pace.
Ever wonder how AI powered home assistants like
Amazon Alexa and Google Home recognize your voice and a
specific phrase like “Stop” no matter how fast or slow you
say it? Or comparing financial market results for the same month across two years, when one of those months had a different number of days because of a leap year? That's where time warping comes in – you basically 'warp' your time axis to make the two data sets comparable.
In the example in this chapter, we will use two matching audio phrases that are spoken at a different pace and one completely different audio phrase of the same length, and then use time warping libraries in Python to compare the results and find the right match. This type of use case is very common in real world speech and pattern recognition.
The phrase we will use is "One Flew Over the Cuckoo's Nest" and it will be stored in two different audio files, spoken in different voices and with different inflections. The contrasting phrase we will use is "I Love Python And I Can't Stop". I will make the audio files along with the Python code available to you, so you can test it for yourself as well.
Let the fun begin!
First, you need to install the fastdtw Python library for time warping analysis. Run the following command at your command prompt:
pip install fastdtw
Once installed, you are ready to import the FastDTW
library and use it in your code. While you are at it, you can
also import other libraries you will need for your analysis
including scipy wavfile library to process audio files,
matplotlib for plotting the results and numpy for advanced
calculations – which we will get into below.
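Here is a minimal sketch of those imports and of the comparison itself; the .wav file names are placeholders for the recordings described above, and the snippet assumes mono recordings:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from fastdtw import fastdtw

rate1, phrase1 = wavfile.read('cuckoos_nest_voice1.wav')   # assumed file name
rate2, phrase2 = wavfile.read('cuckoos_nest_voice2.wav')   # assumed file name

# fastdtw returns the warped distance and the alignment path between the two signals;
# with 1-D input it uses an absolute-difference cost by default
distance, path = fastdtw(phrase1.astype(float), phrase2.astype(float))
print(distance)   # a smaller distance means the two recordings are a closer match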
7
USE CASE 6 – TIME SERIES ANALYSIS AND FORECASTING
Time series forecasting is a very common use case in predictive analytics – especially in operations and sales, where line managers or salespeople want to know, based on historical sales patterns and seasonality, how much product and staff they should have on hand to meet incoming demand. This is of course based on the assumption that the past is a predictor of the future – that is not always the case when you have a crazy year like 2020, where the pandemic completely threw off several predictive models that rely on this assumption. Nevertheless, for the purposes of this chapter and for simplicity, we will stick with the assumption that the past predicts the future.
We will also introduce a more sophisticated time series forecasting library called Prophet, from Facebook, that makes the kind of forecasting traditionally done with models like ARIMA and Kalman Filters easy to apply at scale, without a master's in statistical analysis.
Now, be warned that the Prophet library is still relatively new and has yet to be fully tested in the market, but it does show promise. I found it slow in practice, even on a very small Kaggle public dataset of e-commerce orders from July 2018 to Dec 2019, used here to predict sales for 2020 (it will be available to you along with the code via a download link later in the book).
Facebook Prophet also is not the most straightforward
library to install as it has several dependencies and requires
a C++ compiler before it installs successfully. You can follow
the link below to complete the installation steps:
https://facebook.github.io/prophet/docs/installation.html
Once you have Facebook Prophet fully installed, you are
ready to import the library and predict the future!
import pandas as pd
from fbprophet import Prophet
import warnings
warnings.simplefilter(action='ignore',
category=FutureWarning)
We are now ready to tell the model to predict 365 days (or
1 year) into the future and then plot the results by putting
the date in the x-axis and e-commerce orders on the y-axis:
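Here is a minimal sketch of the fit-and-forecast step, assuming the order history is loaded in a data frame df with the two column names Prophet expects ('ds' for the date and 'y' for the order count); the forecast_pd name matches the plot_components call further below:
model = Prophet(interval_width=0.95)                  # 95% uncertainty band, as described below
model.fit(df)                                         # df is the assumed training data frame

future_pd = model.make_future_dataframe(periods=365)  # extend the timeline 365 days into the future
forecast_pd = model.predict(future_pd)                # forecast orders for those dates
model.plot(forecast_pd)                               # dots = actuals, trend line + bands = forecast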
In the above visualization, the dots are actual values of e-commerce orders per week and the trend line shows the forecast into the future. The bands around the trend line show our uncertainty levels (in this case we set the interval to 95%).
We can also look at the forecast more closely by executing
this command:
model.plot_components(forecast_pd)
The top visual shows our trend – which is increasing slightly as 2020 progresses – and the bottom visual shows the seasonality, with a huge variation between August and September.
How about that? Predicting the future with just a few
lines of code – using the ever-evolving Facebook Prophet
library to make the time series forecasting simpler to apply!
8
USE CASE 7 – FRAUD ANALYSIS
Fraud analysis is also a very common machine learning use case, especially in the financial services industry. Data scientists pore over millions of records of historical financial transaction data to determine rulesets that define fraudulent transactions and then build algorithms to detect and stop fraud in its tracks. With fraudsters constantly changing their patterns, the algorithms used by data scientists must also be flexible enough to learn from history and experience and adjust accordingly.
In the example in this chapter, we will use a public
dataset for digital financial transactions through Kaggle.
This data will be available to you along with the code in this
chapter via a link later in the book.
The most common algorithms used for fraud analysis are decision trees or variations thereof, including gradient boosting via XGBoost and the like. We will explore the 6 million records in this dataset and then recommend an algorithm that yields the highest level of prediction accuracy.
We will start by importing the libraries we will need for fraud analysis:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns
from sklearn.model_selection import train_test_split,
learning_curve
from sklearn.metrics import average_precision_score
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance, plot_tree
import warnings
warnings.simplefilter(action='ignore',
category=FutureWarning)
warnings.simplefilter(action='ignore',
category=DeprecationWarning)
warnings.simplefilter(action='ignore',
category=UserWarning)
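Here is a minimal sketch of how these imports fit together, assuming the Kaggle transactions are loaded in a data frame df with an 'isFraud' label column and numeric feature columns already prepared (the column names are assumptions, not the book's exact code):
X = df.drop('isFraud', axis=1)                         # features (assumed column layout)
y = df['isFraud']                                      # fraud label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # keep the fraud ratio the same in both splits

model = XGBClassifier(n_estimators=100, max_depth=3, n_jobs=-1)
model.fit(X_train, y_train)

# average precision is a better yardstick than plain accuracy on a heavily imbalanced dataset
fraud_probabilities = model.predict_proba(X_test)[:, 1]
print(average_precision_score(y_test, fraud_probabilities))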
9
USE CASE 8 – PROCESSING GEOSPATIAL DATA
Processing geographical and geospatial data for intelligent automation is becoming more and more mainstream with the advent of intelligent drones and autonomous vehicles capable of detecting and moving around obstacles. In addition, geographical and demographic analysis is also becoming quite common when you are looking at opportunity assessment for new businesses.
Rome is the eternal city and my favorite city in Europe. I
have a lot of fond memories of backpacking through it with my university friends back in the day. Given that it has been so hard hit economically in recent times, I want to pay tribute to it by using it as part of
this book.
Given that Rome is such a popular tourist destination and has so much history, it has expensive real estate as well as high population density. With the impacts to its economy, investors could be looking at boroughs of Rome that have a high population and relatively lower real estate prices. In addition, using FourSquare data, I will also look at the types of businesses in each borough to recommend the best Rome neighborhood to start a business in, and the type of business to start, based on real estate prices and population density.
For this analysis, I used the following data sources:
objects = summary.Borough   # borough names for the x-axis labels
y_pos = np.arange(len(objects))
performance = summary.Count
plt.bar(y_pos, performance, align='center', alpha=0.4)
plt.xticks(y_pos, objects)
plt.ylabel('Venues')
plt.title('Total Number of Venues by Borough')
plt.xticks(rotation=90)
plt.show()
The above bar chart shows us that Centro Storico and Trieste have close to 100 venues, followed by Tremini, Trastavere, Corso Francia, Della Vittoria and Bologna, which have venues in the 40-60 range. The remaining boroughs are less venue rich, like Georgio VII, Balduina, Caracalla and Camillucia. Camillucia especially seems low in venues and potentially ripe for further investment.
Let's find out how many unique categories can be curated from all the returned venues:
print('There are {} unique categories.'.format(len(rome_venues['Venue Category'].unique())))
Now let's create a new data frame and display the top 10 venues for each neighborhood:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
We also create columns according to the number of top venues:
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind + 1))
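The loop below calls a helper named return_most_common_venues; here is a minimal sketch of such a helper, assuming each row of rome_grouped holds the borough name followed by one frequency column per venue category:
def return_most_common_venues(row, num_top_venues):
    # drop the 'Borough' label, then sort the venue category frequencies in descending order
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    # return the names of the top categories for this borough
    return row_categories_sorted.index.values[0:num_top_venues]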
We create a new dataframe with venues and boroughs
sorted:
boroughs_venues_sorted = pd.DataFrame(columns=columns)
boroughs_venues_sorted['Borough'] = rome_grouped['Borough']

for ind in np.arange(rome_grouped.shape[0]):
    boroughs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(rome_grouped.iloc[ind, :], num_top_venues)

boroughs_venues_sorted.head(12)
As you can see from the above analysis, using one hot encoding we categorized the venue types further and sorted them based on occurrence in each borough. That gave us the following results, showing the top 3 most common venues for each borough. This will help us further determine the best investment opportunity in each borough, depending on the types of venues that exist there currently.
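For reference, a minimal sketch of the one hot encoding and grouping step described above, assuming rome_venues has the 'Borough' and 'Venue Category' columns used earlier in this chapter:
# turn each venue category into its own indicator column
rome_onehot = pd.get_dummies(rome_venues[['Venue Category']], prefix="", prefix_sep="")
rome_onehot['Borough'] = rome_venues['Borough']

# average the indicator columns per borough to get the frequency of each venue category
rome_grouped = rome_onehot.groupby('Borough').mean().reset_index()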
10
USE CASE 9 – CREATING RECOMMENDER SYSTEMS
Recommender systems are one of the most widely used applications of machine learning in today's world. Here, technology meets marketing and psychology to find the best matches between consumers and products. All of us have experience with recommender systems – think of Netflix suggesting what to watch next or an online store suggesting what to buy.
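The snippets in this chapter work with the Kaggle Netflix titles dataset; here is a minimal sketch of the setup they assume, with 'netflix_titles.csv' as the usual file name of that dataset and variable names matching those used below (in this dataset a movie's duration is stored as text like '90 min', so it is converted to minutes here):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

netflix_overall = pd.read_csv('netflix_titles.csv')                        # assumed file name
netflix_movies = netflix_overall[netflix_overall['type'] == 'Movie'].copy()
netflix_shows = netflix_overall[netflix_overall['type'] == 'TV Show'].copy()

# convert movie duration from text like '90 min' to a number of minutes
netflix_movies['duration'] = netflix_movies['duration'].str.replace(' min', '').astype(int)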
sns.set(style="whitegrid")
sns.kdeplot(data=netflix_movies['duration'], shade=False)
Based on the above chart, the typical duration of movies on Netflix is between 80 and 150 minutes. Now let's generate a word cloud of the most common genres for movies by using the word cloud library.
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
from collections import Counter

genres = list(netflix_movies['listed_in'])
gen = []
for i in genres:
    i = list(i.split(','))
    for j in i:
        gen.append(j.replace(' ', ""))
g = Counter(gen)
text = list(set(gen))
plt.rcParams['figure.figsize'] = (13, 13)
In the code below, we format the word cloud by selecting
the background color and max number of words that show
up in the word cloud:
wordcloud = WordCloud(max_words=1000000, background_color="white").generate(str(text))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Based on the above word cloud, it appears that dramas, cult movies, and musicals are the most common movie genres on Netflix. For full coverage, let's generate a similar word cloud for TV shows.
genres = list(netflix_shows['listed_in'])
gen = []
for i in genres:
    i = list(i.split(','))
    for j in i:
        gen.append(j.replace(' ', ""))
g = Counter(gen)

text = list(set(gen))
wordcloud = WordCloud(max_words=1000000, background_color="white").generate(str(text))
plt.rcParams['figure.figsize'] = (13, 13)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Based on the above word cloud, Action, Adventure, International, and Mysteries are the most common TV show genres on the platform. Now that we have analyzed Netflix movie and TV show genres and durations, let's create our own recommender system to recommend Netflix content to users.
For that, we will use the TfidfVectorizer class from scikit-learn, a text feature extraction tool commonly used to build content-based recommender systems. The TF-IDF (Term Frequency-Inverse Document Frequency) score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that occur frequently across plot overviews and would otherwise dominate the final similarity score.
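In its standard textbook form, the score is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often term t appears in document d, N is the total number of documents, and df(t) is the number of documents that contain t; scikit-learn's TfidfVectorizer applies a smoothed variant of this formula.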
from sklearn.feature_extraction.text import TfidfVectorizer
We apply some data cleaning to the text by removing stop words and null values. Stop words are generally filtered out before processing natural language text. They are the most common words in any language – articles, prepositions, pronouns, and conjunctions – and do not add much information to the text being analyzed.
tfidf = TfidfVectorizer(stop_words='english')
netflix_overall['description'] = netflix_overall['description'].fillna('')
We then construct the required TF-IDF matrix by fitting
and transforming the data:
tfidf_matrix = tfidf.fit_transform(netflix_overall['description'])
And finally output the shape of tfidf_matrix:
tfidf_matrix.shape
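From here, one common way to turn the TF-IDF matrix into recommendations is to compute cosine similarity between every pair of descriptions and return the titles closest to a given one; a minimal sketch, not necessarily the book's exact code:
from sklearn.metrics.pairwise import linear_kernel

# linear_kernel on TF-IDF vectors is equivalent to cosine similarity,
# because TfidfVectorizer L2-normalizes each row by default
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# map each title to its row index so titles can be looked up quickly
indices = pd.Series(netflix_overall.index, index=netflix_overall['title']).drop_duplicates()

def get_recommendations(title, num_recommendations=10):
    idx = indices[title]
    # similarity of this title to every other title, sorted from most to least similar
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # skip the first entry, which is always the title itself
    top_indices = [i for i, score in sim_scores[1:num_recommendations + 1]]
    return netflix_overall['title'].iloc[top_indices]

print(get_recommendations('Stranger Things'))   # pass any title that exists in the dataset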
11
POST YOUR REVIEW
Your review will help me improve this book and future content, and it will also help other readers find this book!
Thanks again for purchasing this book and your continued support!
12
WEBSITE AND FREE GIFT (CODE TO DOWNLOAD)!
Don't be a stranger and please check out my website:
https://daneyalauthor.com/datascience
You will be able to download the code and datasets used in this book by using the above link.
13
REFERENCES
1. One of the websites richest in machine learning content:
Machine Learning, Medium. https://medium.com/topic/machine-learning
2. Detailed tutorials and articles:
Learn machine learning, artificial intelligence, business analytics, data science, big data, data visualization tools and techniques, Analytics Vidhya. https://www.analyticsvidhya.com/
3. Machine learning blog & code:
Machine Learning Mastery. https://machinelearningmastery.com/
4. Start coding with Jupyter Notebook:
Running the Jupyter Notebook – Jupyter/IPython Notebook Quick Start Guide 0.1 documentation. https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html
5. Check Python documentation when coding:
Latest Python documentation. https://www.python.org/doc/
6. When stuck, google your question, and always check Stack Overflow:
Where Developers Learn, Share, & Build Careers, Stack Overflow. https://stackoverflow.com/
7. A machine learning community with coding challenges and public datasets:
Your Machine Learning and Data Science Community, Kaggle. https://www.kaggle.com/
Public datasets used in this book are available under the Creative Commons license: