ColumbiaX - BAMM 101 - Python for Analytics
Contents
ColumbiaX - BAMM 101 - Python for Analytics
Python Crash Course
    Syntax
    Types
    On the correct usage of Lists, Tuples and Dictionaries
    Tests
    Loops
        For loop
        While loop
        Exception management
    Function definition
    Useful Stuff
    New Concepts
        Mutability
Useful Python Libraries
    DateTime
Acquiring data
    Process for extracting and parsing the data
    Placing HTTP requests
    Placing HTTP requests and getting a response
    Placing a HTTP request and getting and decoding/parsing a JSON response
    Placing a HTTP request and getting and decoding/parsing an XML response
    Parsing Web Pages
Gathering Data from SQL Databases
Visualizing data
    Numpy
    Pandas
        Creating a dataframe
        Referencing data in a dataframe
        Creating dataframes from Internet data: Pandas datareaders
    Cleaning Data with Pandas
        Visualizing / Summarizing the data
        Cleaning and Transforming the Data
Machine Learning with Python
Python Crash Course
Syntax
Blocks are delimited by indentation alone; there are no {} or ; (!!!)
The Python "namespace" (~ addresses) contains all possible values returnable by a program, as well as variable and function names. Variables can be seen as a "name" referring to a location in the Python "namespace", not pointers!!! Most basic values are said to be "immutable".
For example, typing a = 1 then b = a will result in a new variable b with value 1; but a and b are 2 different variables for the program (not aliases of one another!)
Variables and function outputs are interchangeable in Python.
Operators: = for value assignment, == for comparison/test; the <, >, >=, <= comparison operators also work.
+, -, *, / for mathematical operations, % for the remainder/modulo (for example 5 % 2 will return 1).
Types
Types do not need to be explicitly declared! For example, writing a = 1 will automatically define variable a as an integer with value 1.
Floats; x = 1.3
Integers; x = 1
Booleans (not, and, or); x = True; y = False
Lists – ordered/sequential entries (of any of the types above; types can be mixed!); lists are mutable (i.e. contrary to other variables, their content can be edited, partially or in totality); x = [0,1,2,3,…];
o Index: x[1] will return the second element in the list (indexing starts at 0!!); x[1:3] will return the 2nd and 3rd elements; x[0:5] will return the first 5 elements (indices 0 to 4); x[2:] will return all elements from the 3rd onwards
o Add one element to a list (after last): x.append(y)
o Add multiple elements to a list: x.extend([y,z]) will add elements y and z at the end of
the list
o Insert one element to a list: x.insert(0,y) will add element y in 1st position of the list “x”
o Get length of list (number of elements contained): len(x)
o Display last element of a list: x[-1]
o Reverse a list: x[::-1]
o Remove last element from a list: x.pop()
o Remove element in position 2 from a list: x.pop(1)
o Remove (the first) value y of a list: x.remove(y)
On the correct usage of Lists, Tuples and Dictionaries
[Lists] are mutable: content can be changed => use for storing sequences of variables (of the same type or not) that can/will change over time (for example: the names of your dogs)
(Tuples) are immutable: content cannot be changed => use for storing sequences of constants/variables (of the same type or not) that are not expected to change over time (for example the months of the year)
{Key:Value} dictionaries are mutable: content can be changed => use for storing collections of pairs of variables (of the same type or not) that can/will change over time and need to be identified by a textual index/key (for example a phonebook with pairs of names and telephone numbers)
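A minimal side-by-side sketch of the three structures (all names and values here are illustrative):
dog_names = ["Rex", "Fido"]  # a list: mutable, can grow and change
dog_names.append("Buddy")
months = ("January", "February", "March")  # a tuple: immutable
phonebook = {"Alice": "212 555-1234"}  # a dictionary: mutable key/value pairs
phonebook["Bob"] = "646 555-4567"
print(dog_names, months, phonebook["Alice"])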
Tests
if (condition):
    Do this or that…
    Do that or this
else:
    Do this
    Do that
Loops
For loop
fruits = ["apple", "banana", "orange"]
for x in fruits:
    print(x)
A for loop can also iterate over the characters of a string:
for x in "banana":
    print(x)
The keywords break and continue can be inserted (with if/else conditions) to define breakout and skip conditions:
for x in "banana":
    if x == "n":
        break
    print(x)
will print each letter of the string "banana" until an "n" character is met (break exits the loop)
for x in "banana":
    if x == "n":
        continue
    print(x)
will print each letter of the string "banana" except the "n" characters (continue skips to the next iteration)
While loop
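A while loop repeats its block for as long as its condition remains True; a minimal example:
i = 0
while i < 5:
    print(i)
    i = i + 1  # without this increment, the condition would never become False
The break and continue keywords shown above work in while loops too.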
Exception management
Whenever something has a chance of going wrong (for example when trying to gather / parse some data from the internet), the Python code should be wrapped in an exception-handling block, of the form:
try:
    Do something
except:
    Do something if an error has occurred (give notice to the program or end user…)
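For example, a minimal sketch guarding a risky conversion (the input string is illustrative):
user_input = "not a number"
try:
    value = int(user_input)  # raises ValueError if the string is not numeric
except:
    print("Could not convert the input")  # notify instead of crashing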
Function definition
def funcName(arg1, arg2):
    blabla operation 1
    blabla operation 2
    return whatever
class MyClass:
    var1 = True
    var2 = 1
    def mymethod(self):
        print(self.var1, self.var2)
Note: the keyword self refers to the invoked object of the class. It must be passed as the first argument of any class method defined, and parameters of the class used by the method must be referred to in the function as self.parameter!!!
Overriding the class __init__() function: the __init__() function is automatically called whenever an object of the class is created (i.e. it is the class constructor). It can be defined as a class method overriding the default __init__() function:
class MyClass:
    def __init__(self, var1, var2):
        self.var1 = var1
        self.var2 = var2
myobject = MyClass(True, 3)  # here, arguments "True" and "3" are passed to the constructor as "var1" and "var2"
Class Inheritance
To create a child class inheriting the properties and methods of the parent (aka “super”) “MyClass” class,
give the child class the parent/super class as an argument:
class MyChildClass(MyClass):
    pass
The child class "MyChildClass" inherits the parameters and methods (including any defined __init__ function) of the parent/super class. The keyword pass can be used if no parameters or methods need to be further defined.
The function super() can be used from within the child class to refer to the properties and methods
inherited from the parent/super class:
class MyChildClass(MyClass):
    def __init__(self, var1, var2, var3):
        super().__init__(var1, var2)  # the child object executes the parent class constructor, defining the parameters var1 and var2
        self.var3 = var3  # then the child class var3 parameter (not defined for the parent class) is defined
To override an existing method or parameter of the parent class, simply redefine it within the child class:
class MyChildClass(MyClass):
    def mymethod(self):
        print("child version")  # replaces the parent's implementation
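Putting the pieces together, a short usage sketch (assuming the version of MyChildClass above whose __init__ calls super() with var1 and var2 and adds var3):
parent = MyClass(True, 3)
child = MyChildClass(False, 2, "extra")  # runs the child __init__, which calls super().__init__
child.mymethod()  # inherited from the parent, or the overridden version if redefined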
Useful Stuff
print("HelloWorld", y, z) will display the content (works with both values and variables, with one or several elements)
id(x) is the memory address block recording the value of variable x (good to illustrate the mutability concept below)
Importing a library is done using the import statement, with either the whole library (imports all functions) or a function/module belonging to the library: import datetime or from datetime import date
New Concepts
Mutability
Every variable and object in Python is either mutable or immutable:
Immutable means the value cannot be changed (e.g. Integer, Float and Boolean): a variable is allocated a "fixed" value; changing the value of the variable means pointing the variable to another value that happens to reside in the Python namespace (i.e. an immutable variable is not tied to a fixed address)
Mutable means the value can be changed / edited (e.g. list): changing the value of a mutable object means actually replacing the value stored in the corresponding memory address
Consequences:
Mutable objects are persistent in memory (for example, if edited multiple times, they will remain and continue to exist after being created - the address block does not change)
Immutable objects are non-persistent: if edited multiple times, the value will simply be "replaced" (the variable will be reallocated to the new value, which happens to sit in another address block)
Reference: https://en.wikipedia.org/wiki/Immutable_object
Useful Python Libraries
DateTime
Useful for the manipulation of dates and times.
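A minimal sketch of common operations (the chosen date is arbitrary):
import datetime
now = datetime.datetime.now()  # current date and time
deadline = datetime.date(2019, 12, 31)  # a specific date
remaining = deadline - now.date()  # subtracting dates gives a timedelta
print(now.strftime("%Y-%m-%d %H:%M"), remaining.days)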
Acquiring data
Flat files (CSV, pdf, xls…)
Web files (HTML, XML, JSON…)
Databases (mySQL, postgres, NoSQL such as mongoDB etc)
Technologies/standards:
RESTful services = web services conforming to the REST standard (almost all servers nowadays) == web servers delivering data using standardized functions integrated in the URL, e.g. http://www.epicurious.com/search/thai%20chili to serve a request on the website search engine using the GET command with input "thai chili", which allows data to be requested without actually navigating to the website!
API requests are non-human data requests, sent over HTTP; the reply is usually an XML or JSON file (can be multiple GB big!) - example: www.googleapis.com
JSON is the most common format used and is human-readable, similar to XML; it is made of a few types, each of which has a direct Python equivalent:
JSON object -> dict; array -> list; string -> str; number -> int/float; true/false -> True/False; null -> None
Example of JSON format:
{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 27,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
},
{
"type": "mobile",
"number": "123 456-7890"
}
],
"children": [],
"spouse": null
}
Placing a HTTP request and getting and decoding/parsing a JSON response
import requests
req = requests.get(url_withsearchword)
req.status_code should give code "200" or "201" ("OK"/"Created") - 4xx codes mean an error
To know the encoding format of the response (usually utf-8 or utf-16), use the response attribute req.encoding
To collect the response in the correctly decoded format, use the method req.content.decode(format), indicating the correct format:
req.content.decode(req.encoding)
Once the decoded text is parsed (e.g. with json.loads()), a Python dictionary should be obtained (itself possibly containing multiple dictionaries, lists etc. as per the JSON-to-Python type mapping above)
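In practice, the requests library can decode and parse a JSON response in one step; a minimal sketch (url_withsearchword as above):
import requests
req = requests.get(url_withsearchword)
if req.status_code == 200:
    data = req.json()  # equivalent to json.loads(req.content.decode(req.encoding))
    print(type(data))  # typically dict or list, as per the mapping above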
Placing a HTTP request and getting and decoding/parsing an XML response
The whole XML tree is imported into an etree object, as defined in the lxml library:
from lxml import etree
root = etree.XML(xml_as_string_text)
To display the entire XML tree in a "nice format" while indicating the correct encoding:
print(etree.tostring(root, pretty_print=True).decode("utf-8"))
<Bookstore>
<Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">
<Title>New York Deco</Title>
<Authors>  => element called "Authors" without text
<Author Residence="New York City">  => attribute "Residence" with value "New York City"
<First_Name>Richard</First_Name>  => element called "First_Name" with text value "Richard"
<Last_Name>Berenholtz</Last_Name>
</Author>
</Authors>
</Book>
<Book ISBN="ISBN-13:978-1579128562" Price="15.80">
<Remark>
Five Hundred Buildings of New York and over one million other books
are available for Amazon Kindle.
</Remark>
<Title>Five Hundred Buildings of New York</Title>
<Authors>
<Author Residence="Beijing">
<First_Name>Bill</First_Name>
<Last_Name>Harris</Last_Name>
</Author>
<Author Residence="New York City">
<First_Name>Jorg</First_Name>
<Last_Name>Brockmann</Last_Name>
</Author>
</Authors>
</Book>
</Bookstore>
The function element.find() with a tag name as argument can also be used:
print(element.find('First_Name').text, element.find('Last_Name').text)
The element class also exposes its tag attributes in a dictionary called "attrib", used to read / access them:
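For instance, with the Bookstore tree above:
for book in root.findall('Book'):
    print(book.attrib['ISBN'], book.attrib.get('Price'))  # attrib behaves like a dictionary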
Parsing Web Pages
Beautiful Soup: library to convert HTML into a navigable tree of Python objects and explore it (documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - can also be used to scrape XML
Selenium: library emulating a browser and able to interpret local script output (JavaScript, jQuery)
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
results_page = BeautifulSoup(response.content, 'lxml')
print(results_page.prettify())
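From there, tags can be located and their attributes read; a minimal sketch (the choice of the anchor tag is illustrative):
for link in results_page.find_all('a'):  # all anchor tags in the page
    print(link.get('href'))  # the link target, or None if absent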
Gathering Data from SQL Databases
import pymysql
Create a cursor (buffer / command object between the MySQL DB and Python objects):
cursor = db.cursor()
When done, close the cursor:
cursor.close()
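End to end, a hedged sketch (connection parameters and table name are placeholders):
import pymysql
db = pymysql.connect(host='localhost', user='root', password='secret', database='mydb')
cursor = db.cursor()
cursor.execute("SELECT * FROM mytable")  # any SQL statement
for row in cursor.fetchall():  # rows come back as tuples
    print(row)
cursor.close()
db.close()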
Visualizing data
Amongst the huge number of Python libraries, the following are very useful for visualization:
Numpy: executes numerical and array operations, very fast
Pandas: provides data structuring, useful for most data analysis
Matplotlib, seaborn, bokeh, plotly (drawing graphs), gmplot (geo representation on Google Maps…) etc. for the visualization itself
Numpy
Implements a multi-dimensional array type (in essence, lists with a consistent object type - not mixed) which is much faster to handle than lists and more memory efficient (because the content is restricted to a single type of data); arrays are mutable
Supports linear algebra, Fourier transformation, random number generation; in general very useful for mathematical calculations
Documentation: https://docs.scipy.org/doc/numpy-1.13.0/reference/index.html
import numpy as np
x = np.array([1,2,3,4,5])
Creating a single-dimensional array and specifying the type of the data that will go into it:
xf = np.array(x, 'float')  # could be float, int, string…; this syntax can also convert an existing array of one type into another (as in this example)
Note: once the array object is created, its elements are accessed with [] like a list's
For example, for a 2-dimensional array x: x[0:3,1:2] will take rows 0, 1 and 2 (excluding row 3) and column 1 (only!)
x[2:,3:] will take all rows from row 2 onward and all columns from column 3 onward
Reshaping an array: changing its number of rows and columns; the total number of elements must be preserved. For example, a 4x3 array can be reshaped as a 6x2 array (12 elements each):
x.reshape(6,2)
np.random.normal(size=10)
np.random.normal(size=(10,10))
Note: for vector multiplications (machine learning…), multiplying numpy arrays is exponentially faster (as the number of rows/columns increases) than multiplying plain lists element by element.
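A quick way to see this difference (a sketch; exact timings depend on the machine):
import time
import numpy as np
a = list(range(1000000))
b = np.array(a)
t0 = time.time()
s1 = sum(v * v for v in a)  # pure-Python element-wise multiply and sum
t1 = time.time()
s2 = (b * b).sum()  # vectorized numpy equivalent
t2 = time.time()
print(t1 - t0, t2 - t1)  # the numpy version is typically much faster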
import pandas as pd
df.corr()['ColumnNameOfVariableLookedAt'].plot()
will show a plot of the correlation (from -1 to 1) of the chosen variable with each of the others (correlation on the Y axis, the other variables on the X axis)
import matplotlib.pyplot as plt
plt.pcolor(df.corr())
plt.show()
will show the whole correlation matrix as a colour map
Note: a variable whose distribution shows several (e.g. 3) local minimums will not work well in a regression analysis if treated as a dependent variable…
Pandas
Default / recommended structure for storing data for analytics ~ kind of similar to a programmatic (smarter) version of an Excel spreadsheet in terms of functionality
Implements the dataframe type: 2-dimensional arrays where columns can be named and indexed by their name (rather than an index number)
Relies on the numpy and matplotlib libraries (so they should be imported along with pandas)
Supports time series
Comes with libraries for data collection / retrieval (both file formats and APIs): xls, csv, html, google, world bank…
Documentation: https://pandas.pydata.org/pandas-docs/stable/
2 main objects: Series (1-dimensional) and DataFrame (2-dimensional)
Libraries to import:
import pandas as pd
Creating a dataframe
Creating a 2D dataframe with named columns, e.g. producing:
   A  B  C
0  1  2  3
1  1  2  3
pd.DataFrame([[1,2,3],[1,2,3]], columns=['A','B','C'])
A column of row labels can also be supplied and promoted to the row index:
df = pd.DataFrame([['r1','00','01','02'],['r2','10','11','12'],
['r3','20','21','22']], columns=['row_label','A','B','C'])
df.set_index('row_label', inplace=True)
The second line indicates that the row index is the existing column called 'row_label' (the option inplace=True means the dataframe is modified in place rather than a new one being returned); accordingly, a row can be called by its index:
df.loc['r1'] => will return the first row; similarly for columns:
df[['A','B']] => will return columns A and B (warning: the column names must be passed in a list[]!)
df.loc['r1','A'] => will return the cell value at the intersection of the named row and column
df.loc['r1':'r2'] => will return the "slice" made up of row 1 to row 2 (all columns)
df.loc['r1':'r2','B':'C'] => will return the "slice" made up of row 1 to row 2 and columns B to C
[Table from the pandas documentation listing the supported I/O formats, with columns: Format Type, Data Description, Reader, Writer]
Creating dataframes from Internet data: Pandas datareaders
df_list = pd.read_html(url) => will return a list of dataframes (1 dataframe per html table found at the given url), which can then be displayed:
df_list[0]
Note: hit or miss… pd.read_html() returns 403 errors on many sites; a possible workaround for the bot detection causing the 403 is to pass a browser-like header (disguise as a browser - does not work on sites with elaborate anti-bot protection…):
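A hedged sketch of that workaround (the User-Agent string is only an example):
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a regular browser
response = requests.get(url, headers=headers)  # url defined elsewhere
df_list = pd.read_html(response.text)  # parse the tables out of the raw HTML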
Cleaning Data with Pandas
Visualizing / Summarizing the data
Documentation: https://towardsdatascience.com/data-visualization-exploration-using-pandas-only-beginner-a0a52eb723d5
dataFrame.info() method : returns the list of columns and their types, as well as the number of rows and the memory used by the dataframe
dataFrame['Column Name'].unique() : returns the list of the unique values of a column => good to spot outliers in the case of a (preferably finite) list of categories…
dataFrame['Column Name'].describe() : returns summary statistics for the column (count, mean, standard deviation, quartiles… for numerical data)
Note: describe() can also be applied to the entire dataframe, in which case the statistical information will be returned for all numerical columns.
Cleaning and Transforming the Data
Rows can be filtered with a boolean condition: data[ ConditionDescription ]
For example:
data = data[ data['ColumnName'].notnull() ] => keep only the rows where the value of 'ColumnName' is not null
With data in a proper datetime format, additions and subtractions can be made; for example, creating a new column with the period of time between 2 other date columns:
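A minimal sketch (all column names are placeholders):
data['start'] = pd.to_datetime(data['start'])  # ensure a proper datetime dtype
data['end'] = pd.to_datetime(data['end'])
data['duration'] = data['end'] - data['start']  # a new column of timedeltas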
Note: the Python keyword "lambda" allows the creation of a short anonymous function, with the syntax lambda arguments : expression
For example:
x = lambda a, b, c : a * 10 + b * c
x(1,2,3)
is equivalent to defining def myfunc(a, b, c): return a * 10 + b * c and calling myfunc(1, 2, 3)
Machine Learning with Python
Algorithms guide - source and information by algorithm:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Concepts / Vocabulary
Features aka (independent) variables aka attributes
Curse of Dimensionality: when the dimensionality increases (to hundreds or even thousands of independent variables), the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance: in order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects tend to appear sparse and dissimilar, which prevents such strategies from being efficient.
Features selection is identifying a subset of the original independent variables in order to build a
predictive model (while avoiding the curse of dimensionality…)
Features projection aka Features extraction is the transformation of the source data from the high-dimensional space to a space of fewer dimensions (i.e. diminishing the number of independent variables).
This can be done using multiple methodologies such as Principal Component Analysis (PCA), Linear
Discriminant Analysis (LDA) etc…
Multinomial Classification is a classification problem where the dependent variable y can take 3 or more possible values (as opposed to binary, Yes/No values). Not to be confused with multi-label classification.
Transformer: refers to a function performing some data pre-processing work on the input data
In practice, dimension reduction is often performed when the number of features is over 10
When a sequence of steps is identified for a particular use case, a pipeline can be defined to perform the
same steps (calling specific transformers, machine learning algorithms etc)
Excerpt: For nominal columns try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot
for high cardinality columns and decision tree-based algorithms.
For ordinal columns try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. Helmert, Sum,
BackwardDifference and Polynomial are less likely to be helpful, but if you have time or theoretic reason
you might want to try them.
Simple Labels-Encoding
I.e. turning categorical data (e.g. a list of cities with n possible values) into a single column of n possible integer codes (0, 1, 2 … n-1):
Dataframe['NewColumn'] = Dataframe['ExistingCategoricalDataColumn'].astype('category').cat.codes
Note: the categorical data column must be of Categorical type (not Object) - the .astype('category') call in the line above performs that conversion from an Object column.
One-Hot-Encoding
I.e. turning categorical data (e.g. a list of cities with n possible values) into n binary columns (0/1); in pandas this can be done with pd.get_dummies() applied to the categorical column.
Advantage of One-Hot-Encoding: avoids giving additional weight to certain values rather than others.
Disadvantage: can produce a huge number of binary columns / features (cf. curse of dimensionality).
Alternatively, use the scikit-learn LabelBinarizer object's fit_transform() method (the results must be turned into a dataframe and appended as new columns to the original dataframe containing the categorical data):
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb_results = lb.fit_transform(Originaldataframe['CategoricalColumn'])
Binary Encoding
Can be done by importing the BinaryEncoder object from the category_encoders library, instantiating BinaryEncoder() and applying the fit_transform() function to the original dataframe:
import category_encoders
be = category_encoders.BinaryEncoder(cols=['CategoricalDataColumn'])
ResultDataFrame = be.fit_transform(OriginalDataFrame)
Documentation: https://contrib.scikit-learn.org/categorical-encoding/
1. Splitting the data into training and testing sets, using scikit-learn's train_test_split function:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.3)
where test_size is the ratio of the test data (here the testing set is 30% of the original data) and the returned dataframes are train (the training data) and test (the testing data)
2. Indicating the dependent and independent variables in the training set:
x_train = train.iloc[rows, IndependentVariablesColumns] : pass all the cells - i.e. all rows of the training dataframe and all independent variable columns; by convention, noted as "x"
y_train = train[dependentVariableColumns] : pass the column(s) of the training dataframe containing the dependent variable; by convention, noted as "y"
3. Selecting and training the model:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(x_train, y_train)
Here, a linear regression model is selected and trained with the training data (the parameters of the fit() function are the independent variables, then the dependent variable)
4. Indicating the dependent and independent variables in the testing set and running the model:
x_test = test.iloc[rows, IndependentVariablesColumns]
y_test = test[dependentVariableColumns]
y_pred = model.predict(x_test)
Multinomial problems can be decomposed: a classification into n values is treated as n binary classification problems (one for each possible value)
Precision: measured in %, indicates the proportion of true positives out of all predicted positives (i.e. how reliable the model's positive predictions are)
Recall: measured in %, indicates the proportion of actual positives correctly identified by the model (i.e. how comprehensive the model is)
F-score: measured in %, gives a composite score (the harmonic mean) of both Precision and Recall - not to be confused with Accuracy, the simple proportion of correct predictions
Confusion matrix: provides more detail by giving the counts of True Positives, False Positives, False Negatives and True Negatives
Threshold value: in the case of predicting a categorical dependent variable, the value we fix to decide whether a predicted value (given as a probability between 0 and 1) is considered a 1 or a 0 (normally 0.5, but an alternative threshold may be selected)
ROC curve: a visual representation of a classifier's performance compared to a random classifier (one that would randomly pick a result based on the expected value distribution in the training set - i.e. if a training set had 80% of the dependent variable as "1", the random classifier would allocate 80% of "1" results to the predicted dependent variable); a robust fitted classifier should perform better than the random classifier for all possible threshold values
AUROC: area under the ROC curve (the bigger it is, the better the fitted classifier)
AUPRC: area under the Precision-Recall curve
A confusion matrix is an improvement over the accuracy rate since, for some studies, False Positives matter more than False Negatives or vice-versa…
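All of these measures are available in scikit-learn; a minimal sketch for a binary classifier (y_test and y_pred as computed above):
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]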
Pipeline
A pipeline is a scikit-learn Python object that allows chaining the various pre-processing and transformation steps; the benefit is that the whole sequence can then be re-applied consistently (e.g. to both the training and the testing data) in a single call.
Steps:
Build a pipeline by passing as arguments the pre-processing and transformation functions (custom designed or imported from scikit-learn or other libraries…).
For example, building a first pipeline doing imputing and scaling on input numerical data, and another one one-hot encoding categorical data (OneHotEncoder):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
Then building a pipeline encompassing both previous pipelines, applied to each column of the dataset (depending on whether the column contains numerical or categorical data):
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),  # numeric_features = list of numerical column names
        ('cat', categorical_transformer, categorical_features)])  # categorical_features = list of categorical column names
Then building another pipeline applying the pre-processing and a random forest classification:
from sklearn.ensemble import RandomForestClassifier
rf = Pipeline(steps=[('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())])
Finally, training on the training data set and predicting on the testing data set are both done by referring to the pipeline:
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
Example of an end-to-end implementation using the scikit-learn library and comparing different ML algorithms: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
Natural Language Processing
And many other tasks exist, such as Part of Speech Tagging (identifying the grammatical function of each token), Chunking (organizing the tokens of a sentence in a hierarchical fashion), named entity recognition, relationship extraction etc.
Tokenization
Use the nltk library and its nltk.word_tokenize() function
Stemming
Use the nltk.stem library:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem(word)
Lemmatization
Different lemmatization algorithms can be used. One is the WordNet algorithm from the nltk library:
from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()
wl.lemmatize(word)
Stop words removal
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
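Putting the pre-processing steps together on a sample sentence (a sketch; the nltk data packages punkt and stopwords must have been downloaded first with nltk.download()):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
tokens = nltk.word_tokenize("The cats were running in the gardens")
cleaned = [ps.stem(t) for t in tokens if t.lower() not in stop_words]
print(cleaned)  # ['cat', 'run', 'garden'] after stopword removal and stemming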
Features extraction
In the text analysis domain, features extraction is the transformation of words (or their frequency of appearance) into numbers that machine learning algorithms can work with.
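scikit-learn's CountVectorizer is one common way to do this; a minimal sketch (the corpus is made up):
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["the cat sat", "the dog sat", "the dog barked"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of word counts
print(vectorizer.vocabulary_)  # mapping word -> column index
print(X.toarray())  # one row per document, one column per word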
Classification ML algorithms
Naive Bayes
A fast supervised learning algorithm, often used as a baseline for algorithm performance measurement and comparison on classification problems. It can work surprisingly well.
It relies on Bayes' theorem: for a population of data instances described by n features, each individual feature k's values across the population follow a certain distribution (e.g. a normal distribution), so the mean and standard deviation of each feature can be calculated for each possible class. The feature values of a new data instance can then be compared to these, and the probability of this data instance belonging to each possible class can be computed: the probability of belonging to a class is calculated as the product of the probabilities of each feature value given that class (all features being assumed independent from each other…).
Approach summary:
a) The order of the words in document X makes no difference but repetitions of words do.
b) Words appear independently of each other, given the document class.
For equation (2), if we have a new word w in the text that we need to classify, then P(W = w | Y = y) = 0, as w never appeared in our training data => one solution is to smooth the probabilities: assume we have m examples with P(w|y) = p. This use of m and p is a Dirichlet prior for the multinomial distribution. Note that there are many smoothing methods.
Explanation of Bayes theorem and step-by-step implementation (without and with scikit library):
https://dzone.com/articles/naive-bayes-tutorial-naive-bayes-classifier-in-pyt
And: https://towardsdatascience.com/algorithms-for-text-classification-part-1-naive-bayes-
3ff1d116fdd8
Example of implementation of a spam filter based on word frequency appearing in spam vs non-spam
emails: https://towardsdatascience.com/spam-filtering-using-naive-bayes-98a341224038
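With scikit-learn, a toy version of such a classifier fits in a few lines (the corpus and labels here are made up):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
corpus = ["win money now", "meeting at noon", "win a free prize", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
vec = CountVectorizer()
X = vec.fit_transform(corpus)  # word-count features, as in the section above
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free money"])))  # -> [1], classified as spam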
SVM Concept: an algorithm that looks for the optimal separation hyperplane between classes or populations. For a population of data points in a 2D space, this is the line that optimally separates e.g. the green and red data points, i.e. the one that maximizes the summed distance from the nearest data points of each sub-population to the separation line (the margin).
Note: plain SVM algorithms will only work for data points which are linearly separable.
By extension, Kernel SVM implements a kernel function that projects non-linearly separable data into more dimensions, in order to then find a linear hyperplane separating the data sub-populations / classes.
There are several kernel functions to choose from, e.g. the Gaussian and Polynomial kernel functions:
https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/
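A minimal scikit-learn sketch, with the kernel passed as a parameter (the data here is synthetic):
from sklearn.svm import SVC
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = SVC(kernel='rbf')  # 'linear', 'poly', 'rbf' (Gaussian)…
model.fit(X, y)
print(model.score(X, y))  # mean accuracy on the data it was fitted on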