
Columbiax – BAMM 101 - Python for Analytics

Contents
Columbiax – BAMM 101 - Python for Analytics
Python Crash Course
   Syntax
   Types
   On the correct usage of Lists, Tuples and Dictionaries
   Tests
   Loops
      For loop
      While loop
      Exception management
   Function definition
   Useful Stuff
   New Concepts
      Mutability
Useful Python Libraries
   DateTime
Acquiring data
   Process for extracting and parsing the data
   Placing HTTP requests
   Placing HTTP requests and getting a response
   Placing a HTTP request and getting and decoding/parsing a JSON response
   Placing a HTTP request and getting and decoding/parsing an XML response
   Parsing Web Pages
Gathering Data from SQL Databases
Visualizing data
   Numpy
   Pandas
      Creating a dataframe
      Referencing data in a dataframe
      Creating dataframes from Internet data: Panda datareaders
   Cleaning Data with Pandas
      Visualizing / Summarizing the data
      Cleaning and Transforming the Data
Machine Learning with Python
Python Crash Course
Syntax
Blocks are delimited by indentation alone; there are no {} braces or ; statement terminators (!!!)

The Python “namespace” (~ an address space) contains all values handled by a program as well as
variable and function names. Variables can be seen as “names” referring to locations in the Python
namespace, not pointers!!! Simple values such as integers are said to be “immutable” (see Mutability below).

 For example, typing a = 1 then b = a will result in a new variable b with value 1; but a and b are 2
different variables for the program (not aliases!)
 Variables and function outputs are interchangeable in Python

Operators: = for assignment, == for comparison/test; the <, >, >=, <= comparison operators also work

+,-,*,/ for mathematical operations, % for the remainder (modulo); for example 5%2 will return 1

Strings can be concatenated with the + operator, not “&”

Types
Types do not need to be explicitly defined! For example, writing a = 1 will automatically define variable
a as an integer with value 1

 Floats; x = 1.3
 Integers; x = 1
 Booleans (not, and, or); x = True; y = False
 Lists – ordered/sequential collection (of any of the types above; types can be mixed!); lists are mutable
(i.e. contrary to other variables, their content can be edited, partially or in totality); x = [0,1,2,3,
…];
o Index: x[1] will return the second element in the list (indexing starts at 0!!); x[1:3] will
return the 2nd and 3rd elements; x[0:5] will return the first 5 elements; x[2:] will return all
elements from the 3rd onwards
o Add one element to a list (after the last): x.append(y)
o Add multiple elements to a list: x.extend([y,z]) will add elements y and z at the end of
the list
o Insert one element into a list: x.insert(0,y) will add element y in 1st position of the list “x”
o Get the length of a list (number of elements contained): len(x)
o Display the last element of a list: x[-1]
o Reverse a list: x[::-1]
o Remove the last element from a list: x.pop()
o Remove the element in position 2 from a list: x.pop(1)
o Remove (the first occurrence of) value y from a list: x.remove(y)

 Strings: x = “HelloWorld”; behaves as a sequence of characters with the same indexing properties as a list (but strings are immutable)


 Tuples – same as lists but immutable; x = (1,3,4.5); all the lists’ indexing operations can be used
except the editing ones…
 Dictionaries: can be defined as an unordered list of pairs of values – 1 key and 1 value; keys are
immutable and values are mutable; x = {“A”:”Marc”, “B”:”Delphine”, “C”:”Constantin”}
o Access is possible through the key: x[“A”] will return “Marc”
o All keys are returned in a list with x.keys()
o A sorted list of keys (alphabetical or numeric) is returned with: sorted(x.keys())
o Deleting an entry from a dictionary (key and value) is done through: del(x[key])
 Sets: unordered collection of unique values (no keys); x = {“Marc”,”Delphine”}
o Add an object through: x.add(“Constantin”)
o Multiple operators allow comparing and manipulating several sets:
 set1 <= set2: is set1 a subset of set2?
 set1 > set2: is set2 a (strict) subset of set1?
 set1&set2: return the intersection of set1 and set2
 set1|set2: return the union of set1 and set2
 set1-set2: return the difference of set1 and set2

On the correct usage of Lists, Tuples and Dictionaries

[Lists] are mutable: content can be changed => use for storing lists of variables (of the same type or not)
that can/will change with time (for example: the names of your dogs)

(Tuples) are immutable: content cannot be changed => use for storing lists of constants/variables
(of the same type or not) that are not expected to change with time (for example the months of the year)

{Key:Value dictionaries} are mutable: content can be changed => use for storing lists of pairs of
variables (of the same type or not) that can/will change with time and need to be identified by a textual
index (for example a phonebook with pairs of names and telephone numbers)

{{Dictionaries of dictionaries}} can be used for more complex structures requiring more than pairs of data…

Tests
if(condition):
    Do this or that…
elif(condition2):   # “else if”
    Do that or this
else:
    Do this

The if/elif/else structure also replaces the “case/switch” structure found in other languages


Loops
For loop
for x in range(5):
    Do this

 Will iterate 5 times

fruits = [“apple”,”banana”,”orange”]

for x in fruits:
    print(x)

 will print each element of the list “fruits”

for x in “banana”:
    print(x)

 will print each letter of the string “banana”

The functions break and continue can be inserted (with if/else conditions) to define breakout and skip
conditions:

for x in “banana”:
    if x == “n”:
        break
    print(x)

 will print each letter of the string “banana” until an “n” character is met

for x in “banana”:
    if x == “n”:
        continue
    print(x)

 will print each letter of the string “banana” except the “n” characters met

While loop
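A minimal sketch of a while loop (the counter variable i is illustrative): the loop repeats as long as its condition holds, so the loop variable must be updated inside the body to avoid an infinite loop.

i = 0
while i < 5:
    print(i)      # will print 0, 1, 2, 3, 4
    i = i + 1     # without this increment the loop would never end

break and continue can be used inside while loops exactly as in for loops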

Exception management
Whenever something has a chance of going wrong (for example when trying to gather / parse some data
from the internet), the Python code should be embedded in an exception management block, of the form:

try:
    Do something
except:
    Do something if an error has occurred (give notice to the program or end user…)
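For example, a minimal sketch converting a string to an integer with a fallback (the variable names and fallback value are illustrative):

raw_value = "abc"
try:
    number = int(raw_value)    # raises ValueError if the string is not a number
except ValueError:
    number = 0                 # fallback value; could also log the error or notify the user
    print("Could not convert", raw_value, "- defaulting to 0")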

Function definition
def funcName(arg1, arg2):
    blabla operation 1
    blabla operation 2
    return whatever

Note: the function operations (body) are indented under the def statement…

Another function can be passed as an argument

A function can return more than 1 variable: return x, y

Arguments can have a default value defined (parameters with default values must come after those without):

def funcName(arg2, arg1 = 1):

 In this case, arg1 does not need to be passed as an argument

A function without a return statement will return None
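A minimal sketch combining default values and multiple return values (the function and variable names are illustrative):

def describe(value, label = "value"):
    doubled = value * 2
    return label, doubled          # returns a tuple of 2 results

name, result = describe(3)         # label falls back to its default "value"
print(name, result)                # displays: value 6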

Classes and Objects


To create a new class with parameters and methods:

class MyClass:
    var1 = True
    var2 = 1

    def mymethod(self):
        print(“This object var2 parameter value is ” + str(self.var2))

Note: the keyword self refers to the invoked object of the class. It must be passed as the first argument of
any class method defined, and the parameters of the class used by the method must be referred to in the
method as self.parameter!!!

Overriding the class __init__() function: the __init__() function is automatically called whenever an object
of the class is created (i.e. the class constructor). It can be defined as a class method overriding the
default __init__() function:

class MyClass:

    def __init__(self, var1, var2):
        self.var1 = var1
        self.var2 = var2

To initiate an object of the class, access a parameter or invoke a method:

myobject = MyClass(True, 3)    # here, arguments “True” and “3” are passed to the constructor as “var1” and “var2”

myobject.var1                  # will display “True”

myobject.mymethod()            # will display “This object var2 parameter value is 3”

Class Inheritance
To create a child class inheriting the properties and methods of the parent (aka “super”) “MyClass” class,
give the child class the parent/super class as an argument:

class MyChildClass(MyClass):
    pass

The child class “MyChildClass” inherits the parameters, methods (including defined __init__ function) of
the parent/super class. The keyword pass can be used if no parameters or method need to be further
defined.

The function super() can be used from within the child class to refer to the properties and methods
inherited from the parent/super class:

class MyChildClass(MyClass):

    def __init__(self, var1, var2, var3):
        super().__init__(var1, var2)    # here the child object executes the parent class constructor, defining the parameters var1 and var2
        self.var3 = var3                # then the child class var3 parameter (not defined for the parent class) is defined

To override an existing method or parameter of the parent class, simply redefine the function within
the child class:

class MyChildClass(MyClass):

    def mymethod(self):
        print(“I am the child class”)


More details on the super() function: https://appdividend.com/2019/01/22/python-super-function-example-super-method-tutorial/

Useful Stuff
print(“HelloWorld”,y,z) will display the content (works with both values and variables, with one or several
elements)

id(x) is the memory address block recording the value of variable x (good to illustrate the mutability
concept below)

Importing a library is done with the import statement, either for the whole library (imports all its functions)
or for a specific part of it: import datetime or from datetime import date

Type conversion can be done explicitly: a = int(x), b = float(y)…

New Concepts
Mutability
Every variable and object in Python is either mutable or immutable:

 Immutable means the value cannot be changed (e.g. Integer, Float and Boolean): a variable is
allocated a “fixed” value; changing the value of the variable means pointing the variable to
another value that happens to reside in the Python namespace (i.e. an immutable variable is not
referring to a fixed address)
 Mutable means the value can be changed / edited (e.g. list): changing the value of a mutable
objects means actually replacing the value stored in the corresponding memory address

Consequences:

 Mutable objects are persistent in memory (for example, if edited multiple times, they will
remain and continue to exist after being created – the address block does not change)
 Immutable objects are non-persistent: if edited multiple times, the value will simply be
“replaced” (the variable will be reallocated to the new value, which happens to sit in another
address block)

Differentiated behaviors are exemplified below:
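A minimal sketch illustrating the difference with id() (the exact id() values will differ from run to run):

a = 1
print(id(a))       # address of the int object 1
a = a + 1
print(id(a))       # different address: the name a now points to another (immutable) int object

x = [1, 2, 3]
print(id(x))       # address of the list object
x.append(4)
print(id(x))       # same address: the (mutable) list was edited in place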

Reference: https://en.wikipedia.org/wiki/Immutable_object
Useful Python Libraries
DateTime
Useful for manipulation of time

 Creating a date: d = datetime.date(yyyy,mm,dd)


 Creating a time: t = datetime.time(hh,mm,ss)
 Creating a date & time (timestamp): dt = datetime.datetime(yyyy,mm,dd,hh,mm,ss)
 Reporting today’s date: datetime.date.today()
 Reporting now’s datetime (date & time): datetime.datetime.now()
 Reporting now’s time (not date): datetime.datetime.now().time()
 Reporting a time delta between 2 dates or times or datetimes: subtracting them returns a
datetime.timedelta (expressed in days, seconds and microseconds)
 Extracting days and seconds from a time delta: timedelta.days, timedelta.seconds (and
timedelta.total_seconds() for the whole delta expressed in seconds)
 Adding/subtracting days to a date: NewDate = datetime.date.today()
+ datetime.timedelta(days=5)
 Adding/subtracting time to a datetime: NewDatetime = datetime.datetime.now()
+ datetime.timedelta(days=5, hours=2, minutes=10, seconds=30)
 Converting a string to a datetime or date: datetime.datetime.strptime(date, ‘string_format’)
 In reverse, converting a date or datetime to a string: datetime.datetime.strftime(datetime,
‘string_format’)

Note: timedelta has no month or year unit; adding whole months to a date requires manual logic or a third-party library (e.g. dateutil)

Leap years are managed
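A minimal sketch putting these functions together (the dates and format string are illustrative):

import datetime

d1 = datetime.datetime(2020, 1, 1, 9, 30, 0)
d2 = datetime.datetime.now()
delta = d2 - d1                                    # a timedelta object
print(delta.days, delta.seconds)                   # days and leftover seconds of the delta

deadline = d1 + datetime.timedelta(days=5, hours=2)
print(deadline.strftime('%m/%d/%Y %I:%M:%S %p'))   # datetime -> string
parsed = datetime.datetime.strptime('01/06/2020 11:30:00 AM', '%m/%d/%Y %I:%M:%S %p')   # string -> datetime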

Acquiring data
 Flat files (CSV, pdf, xls…)
 Web files (HTML, XML, JSON…)
 Databases (mySQL, postgres, NoSQL such as mongoDB etc)

Technologies/standard:

 RESTful services = web services conforming to the REST standard (almost all servers nowadays) ==
web servers delivering data using standardized functions integrated in the URL, e.g.
http://www.epicurious.com/search/thai%20chili to serve a request on the website search
engine using the GET command with input “thai chili”, which allows data to be requested
without actually navigating to the website!
 API requests are non-human data requests, sent over HTTP; the reply is usually an XML or
JSON file (can be multiple GB big!) – example: www.googleapis.com
 JSON is the most common format used and is human-readable, similar to XML; it is made of
types, each of which has a direct Python equivalent:
Example of JSON format:

{
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": true,
    "age": 27,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "office",
            "number": "646 555-4567"
        },
        {
            "type": "mobile",
            "number": "123 456-7890"
        }
    ],
    "children": [],
    "spouse": null
}

Process for extracting and parsing the data


1. Place an HTTP request
2. Import the response in a request class object
3. Verify that the back and forth went correctly by checking the response code (should be 200 or
201; any response code starting with 4 indicates an error)
4. Decode the content (including specifying the encoding of the received page! – usually utf-8); the
output is returned in a single string
5. Parse the content (from inside the string)

Placing HTTP requests


Libraries to send requests:

 Requests (documentation: http://docs.python-requests.org/en/master )


 urllib.request (Python 3) / urllib2 (Python 2)

Libraries to parse and process the resulting data:

 json (JSON APIs)


 lxml (XML APIs)
 BeautifulSoup, Selenium (for unstructured HTML data – “web scraping”)

Placing HTTP requests and getting a response

Use the requests.get method to place a GET HTTP request:

req = requests.get(url_withsearchword)

To see if the request went well, use the status_code attribute:

req.status_code should give code “200” or “201” (“OK”) – 4xx codes mean an error

To know the encoding format of the response (usually utf-8 or utf-16), use the encoding attribute:

req.encoding should return the correct response page encoding

To collect the response correctly decoded, call .decode(format) on the content, indicating the correct format:

req.content.decode(req.encoding)
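A minimal end-to-end sketch (the URL is illustrative):

import requests

url = "http://www.epicurious.com/search/thai%20chili"        # hypothetical search URL
req = requests.get(url)

if req.status_code in (200, 201):
    text = req.content.decode(req.encoding or "utf-8")       # decode the raw bytes with the reported encoding
    print(text[:200])                                        # first 200 characters of the response
else:
    print("Request failed with status", req.status_code)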

Placing a HTTP request and getting and decoding/parsing a JSON response


The returned string using a requests request can easily be converted and restructured into Python
objects using the json library.

Useful functions (json library):

json.loads(<string>) will convert a json string to an assembly of python objects

Once converted with this method, a Python dictionary should be obtained (itself possibly containing
multiple dictionaries, lists, tuples etc., as per the type conversion table below)

json.dumps(<python_object>) will convert a python object into a json string
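A minimal sketch chaining a request and json.loads() (the API URL is illustrative):

import requests
import json

resp = requests.get("https://www.googleapis.com/books/v1/volumes?q=python")   # hypothetical API call
data = json.loads(resp.content.decode("utf-8"))     # now a Python dict / list structure
print(type(data), list(data.keys()))                # inspect the top-level keys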

Json library documentation: https://docs.python.org/3/library/json.html

Placing a HTTP request and getting and decoding/parsing an XML response


Example of an XML file:
A whole XML file gathered through an API is returned as one Python string. The returned string using a
requests request can then easily be converted and restructured into Python objects using the lxml
library and in particular the object etree.

 The whole XML tree is imported into an etree object, as defined in lxml library

from lxml import etree

root = etree.XML(xml_as_string_text)

To display the entire XML tree in a “nice format” while indicating the correct formatting:

print(etree.tostring(root, pretty_print=True).decode("utf-8"))

In a XML tree such as:

<Bookstore>
<Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">
<Title>New York Deco</Title>
<Authors> => element called “Authors” without text
<Author Residence="New York City"> => attribute “Residence” with
value “New York City”
<First_Name>Richard</First_Name> => element called
“First_Name” with text value “Richard”
<Last_Name>Berenholtz</Last_Name>
</Author>
</Authors>
</Book>
<Book ISBN="ISBN-13:978-1579128562" Price="15.80">
<Remark>
Five Hundred Buildings of New York and over one million other books
are available for Amazon Kindle.
</Remark>
<Title>Five Hundred Buildings of New York</Title>
<Authors>
<Author Residence="Beijing">
<First_Name>Bill</First_Name>
<Last_Name>Harris</Last_Name>
</Author>
<Author Residence="New York City">
<First_Name>Jorg</First_Name>
<Last_Name>Brockmann</Last_Name>
</Author>
</Authors>
</Book>
</Bookstore>

To retrieve data from an XML tree, use the iter() function:

for element in root.iter():
    print("%s - %s" % (element.tag, element.text))

To retrieve data from 1 or several specific XML tag(s), the XML tag(s) name can be passed as an
argument to iter():

for element in root.iter("Title", "Author"):
    print("%s - %s" % (element.tag, element.text))
Alternatively, an XPath expression can be used with the findall() function to locate a specific XML branch in the tree:

for element in root.findall("Book/Title"):
    print("%s - %s" % (element.tag, element.text))

 Here, the XPath “Book/Title” selects the “Title” elements whose parent element is called “Book”;
this code will iteratively retrieve all matching “Title” elements

The element.find() function, with another tag name as argument, can also be used:

for element in root.iter("Author"):
    print(element.find('First_Name').text, element.find('Last_Name').text)

The element class also exposes the tag attributes in a dictionary called “attrib”:

for element in root.findall("Book/Authors/Author"):
    print(element.find("First_Name").text, element.find("Last_Name").text, "living in: ", element.attrib['Residence'])

Documentation for lxml etree: https://lxml.de/tutorial.html


Parsing Web Pages
Several options such as:

 Beautiful Soup: library to convert and explore HTML as nested dictionaries (documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/) – can also be used to scrape XML
 Selenium: library emulating a browser and able to interpret local scripts output (javascript,
JQuery)

Beautiful Soup – Simple parsing


The approach is to rely on CSS classes to (i) identify where the information of interest is located in a web
page and (ii) automate the retrieval of information in the corresponding tags, putting the content of
the web page into hierarchical dictionaries.

It means the parsing function must be tailored to each specific website…

Typical HTML tags of interest:

 <p>: paragraph, should contain text


 <li>: list, should contain text
 <a>: url links, to capture in order to repeat the parsing on the next pages…
 <form>: to submit data (login/passwords, keywords…) and continue parsing on another page

Full documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Converting a full web page into dictionaries with Beautiful Soup:

import requests
from bs4 import BeautifulSoup

url = "http://www.epicurious.com/search/" + keywords

response = requests.get(url)

results_page = BeautifulSoup(response.content,'lxml')   # 2nd argument is the parser used – can be lxml (faster) or html5lib

print(results_page.prettify())

Gathering the data in the page:
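A minimal sketch of typical retrieval calls on the results_page object (the tag and CSS class names are illustrative and must be tailored to the target site):

all_links = results_page.find_all('a')                        # every <a> tag in the page
first_paragraph = results_page.find('p')                      # the first <p> tag
recipes = results_page.find_all('article', class_='recipe')   # tags filtered by a (hypothetical) CSS class

for link in all_links:
    print(link.get('href'), link.get_text())                  # URL and visible text of each link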


Gathering Data from SQL Databases

Library for MySQL connections:

import pymysql

Create a database connection (multiple connections can be created in parallel):

db = pymysql.connect(“MySQLserverAddress”, ”logon”, ”pwd”, “DatabaseName”)

Create a cursor (buffer / command object between MySQL DB and Python objects):

cursor = db.cursor()

Executing a SQL command:

cursor.execute(“SQL COMMAND PASSED AS A STRING”)

Collecting the results from the cursor / buffer:

cursor.fetchall() -> returns the whole result table

cursor.fetchone() -> returns the first row of the result table

Results are returned as a tuple (1 row) or a sequence of tuples/lists (multiple rows)

When finished working with the database, close the cursor (and the connection):

cursor.close()
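A minimal sketch putting the steps together (server address, credentials, table and query are illustrative):

import pymysql

db = pymysql.connect(host="localhost", user="logon", password="pwd", database="DatabaseName")   # hypothetical connection details
cursor = db.cursor()
cursor.execute("SELECT name, price FROM products WHERE price > 10")   # hypothetical table and query
rows = cursor.fetchall()

for row in rows:
    print(row)

cursor.close()
db.close()   # also close the connection when done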

Visualizing data
Among the huge number of Python libraries, the following are very useful for visualization:
 Numpy: executes numerical and array operations, very fast
 Pandas: provides data structuring, useful for most data analysis
 Matplotlib, seaborn, bokeh, plotly (drawing graphs), gmplot (geo representation on google
maps…) etc. for the visualization itself

Numpy
 Implements a multi-dimensional array type (in essence lists whose elements all share one consistent
type – not mixed) which is much faster to handle than lists and more memory efficient (because the
content is restricted to a single type of data); arrays are mutable
 Supports linear algebra, Fourier transformation, random number generation; in general very
useful for mathematical calculations

Documentation: https://docs.scipy.org/doc/numpy-1.13.0/reference/index.html

Creating an array using the array() function:

x = np.array([1,2,3,4,5])

Creating a single-dimensional array and specifying the type of the data that will get into it:

xf = np.array(x,'float') Could be a float, int, string… can also convert an existing array of a given type into
another type (such as in the example before)

Basic statistical functions callable as methods to the array object:

x.sum(), x.mean(), x.std(), …

Creating a 2-dimensional array (row per row):

x = np.array([[0,1,2,3,4,5],[10,11,12,13,14,15],[20,21,22,23,24,25]]) => Here: a 3-row by 6-column array

Note: once the array object is created, it is indexed with [] just like a list

Indexing a cell of an existing 1D array: x[1]

Indexing a cell of an existing 2D array: x[0,1]

Remember that indexing starts at 0!

Transformation methods on arrays:


Slicing an existing 2D array (taking a portion):

x[row_start:row_stop, col_start:col_stop] (the stop indices are excluded)

For example: x[0:3,1:2] will take rows 0, 1 and 2 (excluding row 3) and column 1 (only!)

x[2:,3:] will take all rows from row 2 onward and all columns from column 3 onward
Reshaping an array: changing the number of rows and column, must respect the number of elements it
contains; for example, a 4x3 array can be reshaped as a 6x2 array (12 elements each):

x.reshape(6,2)

Random numbers generation:


Generating a 1D array of random numbers (normally distributed):

np.random.normal(size=10)

Same but in a 2D array:

np.random.normal(size=(10,10))

Generating a 2D array of integers in the interval (-20, 20):

np.random.randint(-20, 20, size=(10,10))

Other distributions are supported (exponential, Poisson…)

Note: for vector multiplications (machine learning…), multiplying numpy arrays is dramatically faster than
looping over plain lists, and the gap grows as the number of rows/columns increases.
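A minimal sketch comparing the two approaches (the vector size is illustrative):

import numpy as np

a = np.random.normal(size=1000)
b = np.random.normal(size=1000)

fast = np.dot(a, b)                                    # vectorized dot product
slow = sum(x * y for x, y in zip(list(a), list(b)))    # equivalent pure-Python loop, much slower on large vectors
print(fast, slow)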

Looking at correlation between independent variables


Quick look at correlations between one independent variable in relation to the others
Variables must be numerical and normalized with values between -1 and 1

import pandas as pd

import matplotlib.pyplot as plot

df.corr()[‘ColumnNameOfVariableLookedAt’].plot()

Will show a plot of the correlation (from -1 to 1) of the chosen variable in relation to the others
(with the correlation on the Y axis and the other variables on the X axis)

Quick look at correlations between multiple independent variables


Variables must be numerical and normalized with values between -1 and 1

import pandas as pd

import matplotlib.pyplot as plot

plot.pcolor(df.corr())

plot.show()

Will display a heatmap image where correlated variables show up as yellow

Deeper look at correlations between multiple independent variables


Requires some computation time!

from pandas.plotting import scatter_matrix

p=scatter_matrix(dataFrame, alpha=0.2, figsize=(12, 12), diagonal='kde')


Each square is a scatterplot of the relationship between 2 variables; highly correlated variables show a
distinct pattern (an upward-sloping cloud of points for a positive correlation, a downward-sloping cloud
for a negative correlation).

The diagonal indicates the shape of each variable’s value distribution: a single bell-shaped curve indicates
a roughly normal distribution, whereas a curve with several local modes indicates a variable that will not
work well with a regression analysis if treated as a dependent variable…

Pandas
 Default / recommended structure for storing data for analytics ~ kind of a programmatic
(smarter) version of an Excel spreadsheet in terms of functionality
 Implements the dataframe type, a 2-dimensional array where columns can be named
and indexed by their name (rather than by index number)
 Relies on the numpy and matplotlib libraries (so these should be imported along with pandas)
 Support time series
 Comes with libraries for data collection / retrieval (both formats and API): xls, csv, html, google,
world bank…

Documentation: https://pandas.pydata.org/pandas-docs/stable/

2 main objects:

 Series: 1D array object


 Dataframe: 2D array object; each column is a named Series; each row can be index by a variable
of any type (sequential integer, timestamp…)

Libraries to import:

import pandas as pd

import pandas_datareader (connector libraries to data sources)

Creating a dataframe
Creating a 2D dataframe with named columns:

pd.DataFrame([[1,2,3],[1,2,3]],columns=['A','B','C']) will create:

A B C

0 1 2 3

1 1 2 3
df = pd.DataFrame([['r1','00','01','02'],['r2','10','11','12'],
['r3','20','21','22']],columns=['row_label','A','B','C'])

df.set_index('row_label',inplace=True)

The second line indicates that the row index column is the one called ‘row_label’ (the option
inplace=True modifies the dataframe directly instead of returning a new copy); accordingly, a row
can be called by its index:

Slicing/Referencing data in a dataframe

Using the indexers loc[] and iloc[]:

df[‘A’] => will return the column called ‘A’; rows are accessed through loc/iloc:

df.loc[‘r1’] or df.iloc[0] => will return the first row

df[[‘A’,’B’]] => will return columns A and B (warning: the list of columns must be passed in a list []!)

df.loc[‘r1’,‘A’] => will return the cell value at the intersection of the row and column called

df.loc['r1':'r2'] => will return the “slice” made up of row 1 to row 2 (all columns)

df.loc['r1':'r2','B':'C'] => will return the “slice” made up of row 1 to row 2 and column B to C

Creating dataframes from Internet data: Panda datareaders

Format Type | Data Description     | Reader         | Writer
text        | CSV                  | read_csv       | to_csv
text        | JSON                 | read_json      | to_json
text        | HTML                 | read_html      | to_html
text        | Local clipboard      | read_clipboard | to_clipboard
binary      | MS Excel             | read_excel     | to_excel
binary      | HDF5 Format          | read_hdf       | to_hdf
binary      | Feather Format       | read_feather   | to_feather
binary      | Parquet Format       | read_parquet   | to_parquet
binary      | Msgpack              | read_msgpack   | to_msgpack
binary      | Stata                | read_stata     | to_stata
binary      | SAS                  | read_sas       | –
binary      | Python Pickle Format | read_pickle    | to_pickle
SQL         | SQL                  | read_sql       | to_sql
SQL         | Google Big Query     | read_gbq       | to_gbq

Example: creating a dataframe from a HTML table

Note: hit or miss… read_html returns 403 errors on many tables; a possible workaround to avoid the bot
detection returning 403 is to pass a header disguising the request as a browser (does not work on sites with
elaborate anti-bot protection…):

header = {'User-Agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'}

df_list = pd.read_html(url) => will return a list of dataframes (1 dataframe per html table found in the
given url) that can then be displayed:

df_list[0]
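A minimal sketch of the workaround (the URL is illustrative; the header is passed to requests, and the downloaded HTML is then handed to pandas):

import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/World_population"        # hypothetical page containing HTML tables
headers = {"User-Agent": "Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1"}

response = requests.get(url, headers=headers)   # disguise the request as a browser
df_list = pd.read_html(response.text)           # one dataframe per <table> found in the page
print(len(df_list))
print(df_list[0].head())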

Cleaning Data with Pandas

Visualizing / Summarizing the data

Documentation: https://towardsdatascience.com/data-visualization-exploration-using-pandas-only-beginner-a0a52eb723d5

Overview of the columns of a panda dataframe:

dataFrame.info() method : returns the list of columns and their types as well as the number of rows and
memory used by the dataframe

dataFrame.iloc[0:10] : returns the first 10 rows of a dataframe

dataFrame[dataFrame[‘Column1’] == ‘someSpecificValue’] : displaying the content of rows for which a
given column has a specific value

dataFrame[dataFrame[‘Column1’] != ‘notThisValue’] : displaying the content of rows for which a given
column does not have a specific value

dataFrame[dataFrame[‘Column1’] == ‘someSpecificValue’][[‘Column2’, ‘Column3’]] : displaying the
content of “Column2” and “Column3” for the rows where “Column1” has a specific value

dataFrame[‘Column Name’].unique() : return the list of the unique values of a column => good to spot
the outliers in the case of a (preferably finite) list of categories…

dataFrame[dataFrame['Column1']=='Unspecified'].groupby('Column2').count() : counts the number of
rows whose Column1 value is “Unspecified”, broken down by possible “Column2” value (~ equivalent to a
pivot table with a count() function)

dataFrame[‘Column1’].value_counts() : lists the number of occurrences of each possible value of
‘Column1’, sorted in descending order

Documentation on the groupby() function: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

dataFrame[‘NumericalColumn’].describe() : return count and distribution characteristics (mean, std,


min, max and quartile information) about a column numerical value; for example:
count 805742
mean 5 days 00:05:11.538976
std 12 days 06:08:17.201098
min -134 days +00:00:00
25% 0 days 02:34:46
50% 0 days 21:10:44.500000
75% 4 days 14:29:59.750000
max 148 days 13:10:54

Note: describe() can also be applied to the entire dataframe, in which case the statistical information will
be returned for all numerical columns.

Cleaning and Transforming the Data

Series.astype(str): convert a dataframe column to strings

Series.astype(int): convert a dataframe column to integers

DataFrame.applymap(clean_function): apply an ad hoc cleaning function (defined separately) to each
cell, for example:

def clean_function(item):
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

String matching comparison: REGEX: https://docs.python.org/3.6/howto/regex.html

Implicitly applying an equivalent of a if() function following the syntax:

data[ ConditionDescription ]

For example:

data = data[ data[‘ColumnName'].notnull() ] => keep only rows where the value of ‘columnName’ is not
null

data = data[ (data[‘ColumnName'].notnull()) & (data[‘OtherColumnName'].notnull()) &


(data[‘LastColumnName'].notnull()) ] => idem with condition applying on multiple columns

Converting a date/time from a string to a proper datetime format: use datetime.strptime():

data[‘DateColumn’] = data[‘DateColumn’].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p')) :
converts a string in the specified format (here '%m/%d/%Y %I:%M:%S %p' means month followed by “/”
followed by day etc.) to the standard datetime format.

With data in proper datetime format, additions and substractions can be made; for example, creating a
new column with the period of time between 2 other dates:

Data[‘ElapsedTime’] = Data[‘Date2’] – Data[‘Date1’]

Note: the python keyword “lambda” allows to create a short function defined with syntax:

lambda arguments : result

For example:
x = lambda a, b, c : a * 10 + b * c
x(1,2,3)

is the equivalent of writing and calling:


def myfunc(a, b, c):
  return a * 10 + b * c

myfunc(1, 2, 3)

Machine Learning with Python

Algorithms guide:
Source and information by algorithms:

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Concepts / Vocabulary
Features aka (independent) variables aka attributes

Curse of Dimensionality: when the dimensionality increases (to hundreds or even thousands of
independent variables), the volume of the space increases so fast that the available data become sparse.
This sparsity is problematic for any method that requires statistical significance. In order to obtain a
statistically sound and reliable result, the amount of data needed to support the result often grows
exponentially with the dimensionality. Also, organizing and searching data often relies on detecting
areas where objects form groups with similar properties; in high-dimensional data, however, all objects
tend to appear sparse and dissimilar, which hinders such grouping.

Features selection is identifying a subset of the original independent variables in order to build a
predictive model (while avoiding the curse of dimensionality…)

Features projection aka Features extraction is the transformation of the source data from the high-
dimensional space to a space of fewer dimensions (i.e. diminishing the number of independent variables).
This can be done using multiple methodologies such as Principal Component Analysis (PCA), Linear
Discriminant Analysis (LDA) etc…

Multinomial Classification is a classification problem on one dependent variable y that can take 3 or
more possible values (as opposed to binary, Yes/No possible values). Not to be confused with multi-
label classification.

Multi-label Classification is a classification problem on 2 or more dependent variables yn.

Transformer: refers to a function performing some data pre-processing work on the input data

In practice, dimension reduction is often performed when the number of features is over 10

General Approach / Steps


Whatever the application domain (text classification, image recognition etc), the following steps will
usually be applied:

1. Data pre-processing: cleaning, scaling, normalization etc


2. Features selection
3. Features extraction
4. Separating data into training and testing sets
5. Machine learning algorithm(s) application

When a sequence of steps is identified for a particular use case, a pipeline can be defined to perform the
same steps (calling specific transformers, machine learning algorithms etc)

Transformation of Categorical data / features


Categorical data is typically a (discrete) list of strings / text, which is not interpretable as input by most
machine learning algorithms and needs to be encoded numerically.
Multiple approaches can be used.

Discussion on multiple encoding approaches and benefits/drawbacks:


https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

Excerpt: For nominal columns try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot
for high cardinality columns and decision tree-based algorithms.

For ordinal columns try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. Helmert, Sum,
BackwardDifference and Polynomial are less likely to be helpful, but if you have time or theoretic reason
you might want to try them.

Simple Label-Encoding
I.e. turning a categorical variable (e.g. a list of cities with n possible values) into a single column of n
possible integer codes (0, 1, 2 … n-1)

Using Pandas dataframe columns .cat.codes attributes:

Dataframe[‘NewColumn’] = Dataframe[‘ExistingCategoricalDataColumn’].astype(‘category’).cat.codes

Note: the categorical data column must be of Categorical type (not Object) – conversion can be done
from an Object column by calling method .astype(‘category’):

Advantage of label encoding: it encodes 1 categorical variable in only 1 numerical column; Disadvantage:
depending on the ML algorithm used, it can attribute different weights to different values…

One-Hot-Encoding
I.e. turning a categorical variable (e.g. a list of cities with n possible values) into n binary columns (0/1)

Advantage of One-Hot-Encoding: avoids giving additional weight to certain values rather than others;
Disadvantage: can produce a huge number of binary columns / features (cf. curse of dimensionality)

Use pandas.get_dummies(dataFrame, columns=[‘ColumnToEncode’], prefix=‘EncodedColumnPrefix’) to
easily add the binary columns to an existing dataframe.

Alternatively, use the scikit-learn LabelBinarizer object’s fit_transform() method (results must be
turned into a dataframe and appended as new columns to the original dataframe containing the
categorical data):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

lb_results = lb.fit_transform(Originaldataframe[‘CategoricalColumn’])

lb_results_dataframe = pandas.DataFrame(lb_results, columns=lb.classes_)

results_dataframe = pandas.concat([Originaldataframe, lb_results_dataframe], axis=1)


Binary-Encoding
I.e. turning a categorical variable (e.g. a list of cities with n possible values) first into a base-10 integer
(0, 1, 2, 3 … n), then converting the base-10 numbers to base 2 (000, 001, 010, 011…), then creating
columns to transcribe the base-2 numbers, which typically reduces the number of output columns
compared to One-Hot-Encoding.

Can be done by importing the BinaryEncoder object from the category_encoders library, creating a
BinaryEncoder object and applying its fit_transform() function to the original dataframe:

import category_encoders

be = category_encoders.BinaryEncoder(cols=[‘CategoricalDataColumn’])

ResultDataFrame = be.fit_transform(OriginalDataFrame)

Others encoding methods


The category_encoders library support multiple categorical data encoding methodologies.

Documentation: https://contrib.scikit-learn.org/categorical-encoding/

Separating training and testing datasets


Sklearn library has a useful object to randomly split a dataframe into a training dataframe and a testing
dataframe:

from sklearn.model_selection import train_test_split

train, test = train_test_split(InputDataFrame, test_size = 0.3)

where test_size is the fraction of the data reserved for testing (here the testing set will be 30% of the
original data) and the returned dataframes are train (training data dataframe) and test (testing data dataframe)

Feeding the model


1. Indicating what the independent variables (aka features / predictors) are in the training set:

x_train = train.iloc[rows, IndependentVariablesColumns] : pass all the cells – i.e. all rows of the training
dataframe and all independent variables columns; by convention, noted as “x”

2. Indicating what the dependent variables are:

y_train = train[dependentVariablecolumns] : pass the columns of the training dataframes containing the
dependent variables; by convention, noted as “y”

3. Selecting and fitting a model:

from sklearn import linear_model

model = linear_model.LinearRegression()

model.fit(x_train,y_train)
Here, selecting a linear regression model and training it with the training data (the parameters of the fit()
function are the independent variables followed by the dependent variable(s))

4. Indicating the dependent and independent variables in the testing set and running the model:

x_test = test.iloc[rows, IndependentVariablesColumns]

y_test = test[dependentVariablecolumns]
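A minimal end-to-end sketch on a hypothetical dataframe df whose ‘target’ column is the dependent variable and whose other columns are numerical features:

from sklearn.model_selection import train_test_split
from sklearn import linear_model

train, test = train_test_split(df, test_size=0.3)

feature_columns = [c for c in df.columns if c != 'target']
x_train, y_train = train[feature_columns], train['target']
x_test, y_test = test[feature_columns], test['target']

model = linear_model.LinearRegression()
model.fit(x_train, y_train)             # features first, then the dependent variable
predictions = model.predict(x_test)     # predicted values for the testing set
print(model.score(x_test, y_test))      # R² of the fitted model on unseen data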

Approaches for multinomial classification problems


Several approaches possible:

 Decompose a classification to n values as n binary classification problems (one for each possible
value)

Concept exploration for complex classification: https://medium.com/@manish54.thapliyal/hierarchical-classification-in-data-science-45521a3430c7

Interpreting/benchmarking a model results


Indicators for results quality after testing the data:

 Precision: measured in %, indicates the proportion of true positives out of total positives (or
how reliable the model is)
 Recall: measured in %, indicates the proportions of positives correctly identified by the model
(or how comprehensive the model is)
 F-score (aka F1 score): measured in %, gives a composite score of both Precision and Recall (their harmonic mean); not to be confused with Accuracy, the proportion of all predictions that are correct
 Confusion matrix: provide more details by giving the proportion of True Positives, False
Positives, False Negatives and True Negatives

 Threshold value: in the case of predicting a result for a categorical dependent variable, the value
we fix to decide whether a predicted value (given as a probability between 0 and 1) is
considered as a 1 or a 0 (normally 0.5, but an alternative threshold may be selected)
 ROC curve: a visual representation of a classifier performance compared to a random classifier
(that would randomly pick up a result based on the expected value distribution in the training
set – i.e. if a training set had 80% of the dependent variable as “1”, the random classifier would
allocate 80% of “1” result to the predicted dependent variable result); the fitted classifier should
perform better than the random classifier for all possible threshold values if the fitted classifier
is robust
 AUROC: area under the ROC curve (the bigger it is, the better the fitted classifier is)
 AUPRC: area under the Precision-Recall curve

A confusion matrix is an improvement over the accuracy rate because, for some studies, False Positives
matter more than False Negatives or vice-versa…
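A minimal sketch of how these indicators can be computed with scikit-learn, assuming y_test holds the true classes and y_pred the model's predictions:

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix

print(precision_score(y_test, y_pred))    # proportion of predicted positives that are true positives
print(recall_score(y_test, y_pred))       # proportion of actual positives correctly identified
print(f1_score(y_test, y_pred))           # composite of precision and recall
print(accuracy_score(y_test, y_pred))     # proportion of all predictions that are correct
print(confusion_matrix(y_test, y_pred))   # counts of true/false positives and negatives

# roc_auc_score needs the predicted probabilities of the positive class rather than the hard labels:
# from sklearn.metrics import roc_auc_score
# print(roc_auc_score(y_test, y_scores))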

Pipeline
A pipeline is a Python Scikit object that allows to chain the various pre-processing and transformation
steps. The benefits of using pipelines are:

 They make your workflow much easier to read and understand.


 They enforce the implementation and order of steps in your project.
 These in turn make your work much more reproducible.

Detailed explanation and full application example: https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf

Steps:

Import pipeline from scikit library:

from sklearn.pipeline import Pipeline

Build a pipeline by passing as arguments the pre-processing and transformation functions (custom
designed or imported from scikit-learn or other libraries…).

For example, building a first pipeline doing imputing and scaling on input numerical data and another
one for imputing and transforming categorical data into integers (OneHotEncoder):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

Then building a pipeline encompassing both previous pipelines over each column of the dataset,
depending on whether the column contains numerical or categorical data (numeric_features and
categorical_features are the corresponding lists of column names):

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Then building another pipeline applying pre-processing and random forest classification:

from sklearn.ensemble import RandomForestClassifier

rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier())])

Finally, training the training data set and fitting/predicting the testing data set is done referring to the
pipeline:

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

Domain: text mining


Usually done based on the comparison of input data to corpus of words (list of words classified as
“Positive” or “Negative” with a numerical rating, e.g. “excellent” > “good”)

Features used: word frequency

End-to-end implementation example using the scikit-learn library and comparing different ML
algorithms: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

Pre-processing: Methodology and possible steps


Documentation: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
Common techniques include:

a. Tokenization of text: splitting the text into individual tokens (words, punctuation marks etc.)
b. Stemming of tokens: stemming means identifying the root of a word (for example – “to wait”,
“waits” and “waiting” all have the root or stem “wait”; “cries” will have stem “cri”), i.e. removal of
the prefix or suffix of a word
c. Lemmatization of tokens: lemmatization means identifying the lemma – i.e. the base grammatical
form of a word (e.g. “cries” will have lemma “cry” – the noun in singular form). Lemmatization is
usually better than stemming.
d. Removal of stop words aka conjunction and link words (e.g. and, or, to…) in order to clear the data
of non-significant words

And many others such as Part of Speech Tagging (identifying the grammatical function of each token),
Chunking (organizing the tokens of a sentence in a hierarchical fashion), named entities recognition,
relationships extraction etc.

Tokenization
Use nltk library and nltk.word_tokenize() function

Stemming
Use the nltk.stem library:

from nltk.stem import PorterStemmer

Create a PorterStemmer object:

ps = PorterStemmer()

Use the associated stem() function to stem words:

ps.stem(word)

Lemmatization
Different lemmatization algorithms can be used. One is the WordNet algorithm from the nltk library:

from nltk.stem import WordNetLemmatizer

Create a WordNetLemmatizer object:

wl = WordNetLemmatizer()

Use the associated lemmatize() function:

wl.lemmatize(word)

Removal of stop words


Use the nltk stop words corpus:

from nltk.corpus import stopwords

Load English stop words

stop_words = stopwords.words('english')
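A minimal sketch chaining these steps on one sentence (the nltk resources 'punkt', 'stopwords' and 'wordnet' must have been downloaded beforehand with nltk.download()):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The children were waiting and crying near the stations"
tokens = nltk.word_tokenize(text.lower())                       # tokenization
stop_words = stopwords.words('english')
tokens = [t for t in tokens if t not in stop_words]             # removal of stop words
stems = [PorterStemmer().stem(t) for t in tokens]               # stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]     # lemmatization
print(stems)
print(lemmas)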
Features extraction
In a text analysis domain, features extraction is the transformation of words (or their frequency of
appearance) into integers that machine learning algorithms can work with.

Full documentation of features extraction with the scikit-learn library: https://scikit-learn.org/stable/modules/feature_extraction.html
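A minimal sketch of word-frequency feature extraction with scikit-learn (the two documents are illustrative; older scikit-learn versions use get_feature_names() instead of get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was excellent", "the movie was not good"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # sparse matrix of word counts, one row per document
print(vectorizer.get_feature_names_out())     # the vocabulary learned from the documents
print(X.toarray())                            # dense word-count matrix usable as ML features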

Classification ML algorithms

Naïve Bayes algorithm

Fast supervised learning algorithm, often used as baseline for algorithm performance measurement and
comparison for classification problems. Can work surprisingly well.

Relies on Bayes’ theorem: for a population of data instances described by n features, the values of each
individual feature k across the population follow a certain distribution (e.g. a normal distribution), so the
mean and standard deviation of each feature can be calculated for each possible class. Consequently, each
feature value of a new data instance can be compared to these, and the resulting probability that this data
instance belongs to each possible class can be computed. The probability of belonging to a class is
calculated as the product of the probabilities of each feature given this class (all features being assumed
independent from each other…)

Approach summary:

1. Represent a document X as a set of (w for word, a frequency of w) pairs.


2. For each label / class y, build a probabilistic model P(X| Y = y) of documents in class y.
3. To classify, select the label / class y which is most likely to generate X:

2 assumptions are made:

a) The order of the words in document X makes no difference but repetitions of words do.
b) Words appear independently of each other, given the document class.

Based on these assumptions, we have the following equations to estimate P(X|y):
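The equations (images in the original notes) in one standard reconstructed form, with $w_1, \ldots, w_n$ the words of document $X$ and $\mathrm{count}(w, y)$ the number of times word $w$ appears in training documents of class $y$:

$$P(X \mid Y = y) = \prod_{i=1}^{n} P(w_i \mid Y = y) \qquad (1)$$

$$P(W = w \mid Y = y) = \frac{\mathrm{count}(w, y)}{\sum_{w'} \mathrm{count}(w', y)} \qquad (2)$$

$$\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(w_i \mid Y = y)$$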


For equation (1), if our document has more than 100 words, P(w₁,…, w_n|Y = y) will be a product of very
small word probabilities ( < 0.1), leading to the UNDERFLOW problem => converting the terms with
logarithms is desirable to maintain numerical stability:

For equation (2), if we have a new word w in the new text that we need to classify, P(W = w | Y = y) = 0
as w has never appeared in our training data. => one solution is to smooth the probabilities. Assume we
have m examples with P(w|y) = p. This use of m and p is a Dirichlet prior for the multinomial
distribution. Please note that there are many smoothing methods.

The pseudo-algorithm is then:
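A minimal Python sketch of the idea (multinomial Naïve Bayes with Laplace smoothing and log probabilities; the training documents, labels and test words are illustrative):

import math
from collections import Counter, defaultdict

def train_nb(documents, labels):
    class_counts = Counter(labels)                 # number of documents per class, used for P(Y = y)
    word_counts = defaultdict(Counter)             # count(w, y): word counts per class
    vocabulary = set()
    for words, y in zip(documents, labels):
        word_counts[y].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify_nb(words, class_counts, word_counts, vocabulary):
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for y, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)      # log P(Y = y)
        total_words = sum(word_counts[y].values())
        for w in words:                            # add log P(w | y) with +1 (Laplace) smoothing
            score += math.log((word_counts[y][w] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = y, score
    return best_label

docs = [["good", "excellent", "movie"], ["bad", "awful", "movie"]]
labels = ["positive", "negative"]
model = train_nb(docs, labels)
print(classify_nb(["excellent", "film"], *model))  # expected output: positive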

Explanation of Bayes theorem and step-by-step implementation (without and with scikit library):
https://dzone.com/articles/naive-bayes-tutorial-naive-bayes-classifier-in-pyt

And: https://towardsdatascience.com/algorithms-for-text-classification-part-1-naive-bayes-3ff1d116fdd8
Example of implementation of a spam filter based on word frequency appearing in spam vs non-spam
emails: https://towardsdatascience.com/spam-filtering-using-naive-bayes-98a341224038

A more advanced example of multinomial classification with Naïve Bayes + other algorithms:

https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

SVM and Kernel SVM

SVM are supervised learning algorithms used for classification problems.

SVM Concept: an algorithm that looks for the optimal separating hyperplane between classes or
populations; here, for a population of data points in a 2D space, the full line would be the one that
optimally separates the green and red data points, i.e. the one that maximizes the distance (margin)
between the separating line and the nearest data points of each sub-population.

Note: plain SVM algorithms will only work well for data points which are linearly separable

By extension, Kernel SVM implements a Kernel function that projects non-linearly separable data into a
higher-dimensional space, in order to then find a linear hyperplane separating the data sub-populations /
classes:

There are several Kernel functions to chose from, e.g. Gaussian Kernel and Polynomial Kernel functions
https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/
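A minimal sketch using scikit-learn’s SVC classifier with a Gaussian (RBF) kernel, reusing the x_train / y_train / x_test / y_test split convention from above:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm_model = SVC(kernel='rbf')            # 'linear', 'poly' or 'rbf' (Gaussian) kernels can be chosen
svm_model.fit(x_train, y_train)
y_pred = svm_model.predict(x_test)
print(accuracy_score(y_test, y_pred))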
