
PYTHON DATA SCIENCE COOKBOOK

AI Sciences Publishing

How to contact us
Please address comments and questions concerning this book
to our customer service by email at:
[email protected]

Our goal is to provide high-quality books for your technical learning in Data Science and Artificial Intelligence subjects.

Thank you so much for buying this book.

If you notice any problems, please let us know by sending us an email at [email protected] before writing any review online. This will be very helpful for us in improving the quality of our books.

Table of Contents

Table of Contents ........................................................... iv


From AI Sciences Publisher ........................................................... 3
Preface ............................................................................. 1
Introduction ....................................................................10
History ......................................................................................... 10

Data Science Illuminated ...............................................12


Definition .................................................................................... 12
Importance of Data Science ........................................................ 12
Advantages of Data Science ........................................................ 14
Role of a Data Scientist ............................................................... 15

A Basic Course in Python ...............................................18


Getting Started ............................................................................ 18

Getting Python ............................................................................. 18


PEP and the Zen of Python....................................................... 19
Whitespace Formatting ............................................................... 19
Modules ......................................................................................... 20
Scope of a variable ....................................................................... 22
Arithmetic Operators .................................................................. 23
Functions ....................................................................................... 24
Strings ............................................................................................ 27
Exceptions..................................................................................... 30
Lists ................................................................................................ 31

Tuples ............................................................................................ 35
Dictionaries ................................................................................... 35
Defaultdict:.................................................................................... 38
Sets ................................................................................................. 40
Control Flow................................................................................. 40
Truthiness...................................................................................... 43
Moving ahead ............................................................................. 44
Sorting............................................................................................ 45
List Comprehensions .................................................................. 46
Randomness .................................................................................. 47
Regular Expressions .................................................................... 48
Object-Oriented Programming .................................................. 49
Class ............................................................................................... 49
Object ............................................................................................ 50
Method .......................................................................................... 50
Polymorphism .............................................................................. 51
Encapsulation: .............................................................................. 51
Enumerate ..................................................................................... 53
Zip .................................................................................................. 54
Args ................................................................................................ 54
Visualizing Data ............................................................ 56
Matplotlib ................................................................................... 56
Bar Charts................................................................................... 58
Line Charts ................................................................................. 59
Scatterplots .................................................................................. 61

Linear Algebra ............................................................... 63
Vectors ........................................................................................ 63
Matrices ...................................................................................... 65

Statistics ......................................................................... 66
Data in Statistics......................................................................... 66
Measures of central tendencies .................................................. 66
Dispersion .................................................................................. 67
Covariance .................................................................................. 68
Correlation .................................................................................. 69
Probability .................................................................................. 70
Dependence and Independence ................................................. 70
Conditional Probability ............................................................... 71
Bayes’ Theorem .......................................................................... 72
Random Variables ...................................................................... 73
Continuous Distributions ........................................................... 74
Distribution plot of the above graph .......................................... 75
The Normal Distribution ........................................................... 75

Hypothesis and Inference ............................................. 78


Statistical Hypothesis Testing ................................................... 78
Gradient Descent........................................................................ 82
Stochastic Gradient Descent ...................................................... 86

Getting Data .................................................................. 89


Reading Files ............................................................................. 90
Using APIs .................................................................................. 91

JSON (and XML)......................................................................... 92

Finding APIs ................................................................................. 93
Getting Credentials .................................................................... 93

Working around Data .................................................... 95


Exploring Data ........................................................................... 95

Exploring One-Dimensional Data ............................................ 95


Two Dimensions .......................................................................... 95
Many Dimensions ........................................................................ 96
Cleaning and Munging .............................................................. 97
Manipulating Data ..................................................................... 98
Rescaling .................................................................................... 99

Machine Learning ........................................................ 102


Modeling ................................................................................... 102
What Is Machine Learning? ...................................................... 103
Supervised Learning ................................................................... 107
Unsupervised Learning ............................................................. 108
Semi-supervised Learning ......................................................... 108
Reinforcement Learning ........................................................... 109
Overfitting and Underfitting ..................................................... 109
Correctness ................................................................................ 112
The Bias-Variance Trade-off ..................................................... 113
Feature Extraction and Selection .............................................. 114

K-Nearest Neighbors ................................................... 116


Handling Data........................................................................... 117
Calculating Similarity ................................................................ 119
Locating Neighbors .................................................................. 120

Generating Response ................................................................ 121
Evaluating Accuracy ................................................................. 122
Main Elements .......................................................................... 123
The Curse of Dimensionality .................................................... 124

Naive Bayes .................................................................. 128


Applications of Naive Bayes Algorithms ................................... 129
How to build a basic model using Naive Bayes in Python? ..... 129
Python Code .............................................................................. 130

Simple Linear Regression............................................. 131


Logistic Regression ...................................................... 135
Applying logistic regression ...................................................... 135

Decision Trees .............................................................. 138


The Entropy of a Partition ........................................................ 138
Creating a Decision Tree .......................................................... 139

Random Forests............................................................ 144


Neural Networks .......................................................... 146
Perceptrons................................................................................ 146
Backpropagation ....................................................................... 147
1st Part: The Derivative ............................................................. 156
2nd Part: Entire Statement: The Error Weighted Derivative ...... 157

Clustering ..................................................................... 160


The Idea .................................................................................... 160
The Model ................................................................................. 160
Step Number 1 ........................................................................... 161
Step Number 2 ........................................................................... 161

Step Number 3 ........................................................................... 161
Step Number 4 ........................................................................... 162
Implementation using Python................................................... 162
Bottom-up Hierarchical Clustering........................................... 168

Natural Language Processing ...................................... 169


Word Clouds .............................................................................. 169
N-gram Models ......................................................................... 171
Grammars .................................................................................. 172
Topic Modeling ......................................................................... 176
Parameters of LDA ................................................................... 177
Running LDA Model ................................................................ 179
Network Analysis.......................................................... 180
Betweenness Centrality ............................................................. 180
Eigenvector Centrality ............................................................... 183

Recommender Systems ................................................ 189


Manual Curation ....................................................................... 190
Recommending What’s Popular ............................................... 190
User-Based Collaborative Filtering ........................................... 193
Item-Based Collaborative Filtering ........................................... 197

Databases and SQL ..................................................... 202


Step 1: Installing MySQL ..........................................................203
Step 2: Setting up the database..................................................204
Step 3: Getting the data from Python ........................................205
INSERT ...................................................................................... 207
UPDATE .................................................................................... 209

DELETE..................................................................................... 209
SELECT ...................................................................................... 209
GROUP BY ................................................................................ 210
ORDER BY ................................................................................ 210
Indexes......................................................................................... 211
Query Optimization .................................................................. 213
NoSQL ....................................................................................... 214
MapReduce ............................................................................... 215
Why MapReduce? ...................................................................... 216
MapReduce More Generally .................................................... 216
Python MapReduce Code ......................................................... 218
Reduce step: reducer.py ............................................................ 219
Go Forth and Do Data Science ................................... 223
IPython ...................................................................................... 223
Mathematics ..............................................................................224
Not from Scratch .......................................................................225
NumPy ......................................................................................... 225
Pandas .......................................................................................... 226
Scikit-learn ................................................................................... 226
Visualization................................................................................ 226
R.................................................................................................... 227
Find Data..................................................................................... 228
Practicing Data Science ............................................................229

Thank you ! .................................................................. 232

• Do you want to discover, learn and understand the methods
and techniques of artificial intelligence, data science,
computer science, machine learning, deep learning or
statistics?
• Would you like to have books that you can read very fast and
understand very easily?
• Would you like to practice AI techniques?
If the answers are yes, you are in the right place. The AI
Sciences book series is perfectly suited to your expectations!
Our books are the best on the market for beginners,
newcomers, students and anyone who wants to learn more
about these subjects without going into too much theoretical
and mathematical detail. Our books are among the best sellers
on Amazon in the field.

About Us

We are a group of experts, PhD students and young


practitioners of Artificial Intelligence, Computer Science,
Machine Learning and Statistics. Some of us work in big
companies like Google, Facebook, Microsoft, KPMG, BCG
and Mazars.
We decided to produce a series of books mainly dedicated to
beginners and newcomers on the techniques and methods of
Machine Learning, Statistics, Artificial Intelligence and Data
Science. Initially, our objective was to help only those who
wish to understand these techniques more easily and to be able
to start without too much theory and without a long reading.
Today we also publish more complete books on some topics
for a wider audience.

About our Books

Our books have had phenomenal success, and they are today among the best sellers on Amazon. Our books have helped many people to progress and, especially, to understand these techniques, which are sometimes considered, rightly or wrongly, to be complicated.
The books we produce are short and very pleasant to read. They focus on the essentials so that beginners can quickly understand and practice effectively. You will never regret having chosen one of our books.
We also offer completely free books on our website: visit our site and subscribe to our email list at www.aisciences.net
By subscribing to our mailing list, we will also offer you all our new books for free, on an ongoing basis.

To Contact Us:

• Website: www.aisciences.net
• Email: [email protected]
Follow us on social media and share our publications
• Facebook: @aisciencesllc
• LinkedIn: AI Sciences

From AI Sciences Publishing

WWW.AISCIENCES.NET
EBooks, free eBook offers and online learning courses.
Did you know that AI Sciences offers free eBook versions of every book published? Please subscribe to our email list to be informed about our free eBook promotions. Get in touch with us at [email protected] for more details.

At www.aisciences.net, you can also read a collection of free books and receive exclusive free ebooks.

WWW.AISCIENCES.NET
Did you know that AI Sciences also offers online courses?
We want to help you in your career and take control of your future with powerful and easy-to-follow courses in Data Science, Machine Learning, Deep Learning, Statistics and all Artificial Intelligence subjects.

Most courses in Data Science and Artificial Intelligence simply bombard you with dense theory. Our courses don't throw complex math at you; they focus on building up your intuition for infinitely better results down the line.

Please visit our website and subscribe to our email list to be informed about our free courses and promotions. Get in touch with us at [email protected] for more details.

Preface

“In God we trust. All others must bring data.”


~ W. Edwards Deming, statistician

If you are looking for a practical book to help you understand Data Science step by step using Python, then this is a good book for you.

To sum up Data Science in a simple sentence: it is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the formulation of business and IT strategies.

In the past ten years, Data Science has quietly grown to include
businesses and organizations world-wide. It is now being used
by governments, geneticists, engineers, and even astronomers.
Technically, this includes machine translation, robotics, speech
recognition, the digital economy, and search engines. In terms
of research areas, Data Science has expanded to include the
biological sciences, health care, medical informatics, the
humanities, and social sciences. Data Science now influences
economics, governments, and business and finance.

Book Objectives

This book will help you:

• Have an appreciation for data science and an understanding of its fundamental principles.
• Have an elementary grasp of data science concepts and algorithms.
• Achieve a technical background in data science.

Target Users

The book is designed for a variety of target audiences. The


most suitable users would include:
• Beginners who want to approach data science, but are too afraid of complex math to start
• Newbies in computer science techniques and data science
• Professionals in data science and social sciences
• Professors, lecturers or tutors who are looking for better ways to explain the content to their students in the simplest and easiest way
• Students and academicians, especially those focusing on data science

Is this book for me?

If you want to learn Data Science from scratch, this book is for you. Little programming experience is required. If you have already written a few lines of code and recognize basic programming statements, you'll be OK.

© Copyright 2017 by AI Sciences
All rights reserved.
First Printing, 2016

Edited by Davies Company


Ebook Converted and Cover by Pixel Studio
Published by AI Sciences LLC

ISBN-13: 978-1986318471
ISBN-10: 1986318478

The contents of this book may not be reproduced, duplicated or


transmitted without the direct written permission of the author.

Under no circumstances will any legal responsibility or blame be held


against the publisher for any reparation, damages, or monetary loss
due to the information herein, either directly or indirectly.

Legal Notice:

You cannot amend, distribute, sell, use, quote or paraphrase any part
or the content within this book without the consent of the author.

Disclaimer Notice:

Please note the information contained within this document is for


educational and entertainment purposes only. No warranties of any
kind are expressed or implied. Readers acknowledge that the author
is not engaging in the rendering of legal, financial, medical or
professional advice. Please consult a licensed professional before
attempting any techniques outlined in this book.

By reading this document, the reader agrees that under no


circumstances is the author responsible for any losses, direct or
indirect, which are incurred as a result of the use of information
contained within this document, including, but not limited to, errors,
omissions, or inaccuracies.

Introduction
To sum up Data Science in a simple sentence: it is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the formulation of business and IT strategies.

In the last decade, Data Science has quietly grown to encompass businesses and organizations worldwide. It is now being used by governments, geneticists, astronomers, engineers and entrepreneurs for research and analytics.

Technically, this includes robotics, machine translation, scientific research, speech recognition, the digital economy, search engines and many more. In research areas, Data Science has expanded to include the biological and social sciences, health care, medical informatics and the humanities. Data Science now influences economics, stock markets, governments, business and finance.

History

There is a wide range of dates and timelines that can be used to trace the gradual development of Data Science and its present impact on the data management industry; some of the more significant ones are outlined below.

Although data science isn't a new profession, it has evolved considerably over the last 50 years. The history of data science reveals a long and winding path that began as early as 1962, when mathematician John W. Tukey predicted the effect of modern-day electronic computing on data analysis as an empirical science.

Yet the data science of today is very different from the one that Tukey imagined. Tukey did not foresee big data or the ability to perform complex, large-scale analyses. It wasn't until 1964 that the first desktop computer, the Programma 101, was presented to the public at the New York World's Fair. Any analyses that took place were far more elementary than the ones that are possible today.

By 1981, IBM had released its first personal computer. Apple wasn't far behind, releasing the first personal computer with a graphical user interface in 1983. Throughout that decade, computing evolved at a much faster pace and gave organizations the ability to collect data far more easily.

Data Science Illuminated
Definition

Data science is a field composed of computer science, math and statistics, and domain expertise, and it seeks to derive insight from data. Data science can be viewed as the intersection of these three disciplines.

Large amounts of data are very useful in helping an organization cut costs. The data can be structured or unstructured, but identifying patterns in it can help with advertising, recognizing new business opportunities, campaigning, increasing efficiency and reducing costs. This gives the company an added competitive advantage.

The data science field uses mathematics, statistics and


computer science disciplines effectively, and incorporates
techniques like machine learning, cluster analysis, data mining
and visualization.

Importance of Data Science

We need to look at a series of emerging trends that give


importance to data science.

• First, data analytics: the trend of applying data science's practices and tools in the business world.
• Second, the Internet of Things (IoT): the trend of connecting devices and sensors via the cloud, which is generating massive streams of data to be analyzed.
• Third, big data: the trend of creating tools and systems able to store and process these enormous data sets at scale.
• Fourth, machine learning: a trend in artificial intelligence of teaching machines to solve problems without being explicitly programmed to do so. Such machines are able to make decisions and predictions by identifying statistical patterns in these massive data sets.
• All four of these trends are converging to create fully autonomous, intelligent systems: machines capable of acting rationally within their environment and learning how to optimize their performance over time without any human intervention.
• As a result, data science has now become a cost-effective strategy for answering questions, making decisions, and predicting outcomes in a wide variety of scenarios in our world. Given this trend, it's unlikely that the demand for data science will decrease any time in the near future.

Advantages of Data Science

Data Science has innumerable advantages. It can be used as a powerful tool in decision-making.
Data science gives data-based evidence for accepting or rejecting any business decision. It can increase profits and identify potential risks before they actually occur. It can identify potential customers and also provide solutions to otherwise error-prone and difficult tasks.

The specific benefits of data science vary depending on the company's goal and industry.
Applications of data science include banks mining data to detect fraud and potential defaulters. Streaming services like Netflix can use it to create personalized recommendations for viewers. Telecom companies can use data science to make targeted offers to users based on their interests. Shipping companies can use it to find the best delivery routes.

Since identifying and analyzing large amounts of unstructured data can prove complex, expensive and time-consuming for companies, data science is still an evolving field in this respect.

Role of a Data Scientist

In simple words, a data scientist is someone who performs data science for a living. The goal of a data scientist is to transform data into knowledge that can be used to make rational decisions. Data scientists possess three sets of skills: machine learning expertise, mathematics and statistics knowledge, and subject matter expertise. Individuals with all three sets of skills and proper credentials are currently very rare. We now explain the three skills in detail:

Machine Learning:
Machine learning is a subfield of artificial intelligence based on statistics. It involves machines learning how to complete tasks without being explicitly programmed to do so. This topic is explained in detail later.

Mathematics and Statistics:
While everyone knows what mathematics is, statistics is the study of data: how to collect, summarize and present it. The statistics part will be covered in detail later on.

Subject Matter Expertise

In general, a subject-matter expert (SME) or domain expert is a person who is a specialist in a particular area or topic. An SME should also have basic knowledge of other technical subjects. In data science, an SME provides industry- or process-specific context for what the patterns identified by the algorithms and models mean.

Individuals who master all three skill sets are also called unicorn data scientists. Despite how rare unicorn data scientists are, they are rapidly growing in demand, and there doesn't appear to be any end in sight for the growth of this demand. As a result, in the very near future this specific set of skills will be in high demand, whether you're a data scientist or someone applying data science practices to your current job role. The rarity of data scientists combined with their high demand leads to much higher salaries for data scientists and IT professionals with similar skills.

A Basic Course in Python

Getting Started

Getting Python

For starters, the official website of Python is https://www.python.org/. Python can be downloaded from this link.
In the "Downloads" tab you can select "All releases", which will redirect you to a page containing all versions of Python 2 and Python 3 available for download.
You can download the appropriate version for your operating system.

I would personally recommend installing the Anaconda distribution instead, which already includes most of the libraries that you need to do data science. It also comes with Jupyter Notebook (previously known as IPython Notebook), which is a very good interactive interface for executing Python scripts.

If you don’t get Anaconda, make sure to install pip, which is a


Python package manager that allows you to easily install third-
party packages (some of which we’ll need).

PEP and the Zen of Python

PEP stands for Python Enhancement Proposal. A PEP is a design document providing information to the Python community, or describing a new feature for Python or its processes or environment.
The Zen of Python, also known as PEP 20, is a list of guiding principles for how Python code should be written.
As this is not a book dedicated to Python, I am not listing the principles here, but you can always view all of them at https://www.python.org/dev/peps/pep-0020/#the-zen-of-python.

Whitespace Formatting

Python uses indentation to delimit blocks of code, whereas many languages use curly braces for the same purpose. This makes Python code very easy to read, but it also means that you have to be very careful with your formatting. For example:

for x in [1, 2, 3, 4]:
    print(x)               # first line in "for x" block
    for y in [1, 2, 3, 4]:
        print(y)           # first line in "for y" block
        print(x + y)       # last line in "for y" block
    print(x)               # last line in "for x" block
print("done looping")

Whitespace is ignored inside parentheses and brackets, which can be helpful for long-winded computations.

long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7
+ 8 + 9 + 10 + 11 + 12 +
13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)

A backslash can be used to indicate that a statement continues onto the next line, although this is rarely done:
two_plus_three = 2 + \
3

Modules

Certain Python features are not loaded by default; they live in modules. These can be third-party packages or modules that are part of the standard library. We need to import the modules which contain these features in order to use them.

A module can be simply defined as a file containing Python


definitions and statements. The file name in Python is the
module name with the suffix .py following it. Within a module,
the module’s name (as a string) is available as the value of the
global variable __name__.
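For instance, here is a minimal sketch of how __name__ behaves (the file name mymodule.py is hypothetical, chosen only for illustration):

# contents of a hypothetical file named mymodule.py
def greet():
    print("Hello from", __name__)

if __name__ == "__main__":
    # this block runs only when the file is executed directly,
    # not when it is imported; when imported, __name__ equals "mymodule"
    greet()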

One approach is to simply import the module itself:

import re

my_regex = re.compile("[0-9]+", re.I)

Here re is the module containing functions and constants for


working with regular expressions. After this type of import you
can only access those functions by prefixing them with re.

If you already had a different re in your code you could use an


alias:

import re as regex
my_regex = regex.compile("[0-9]+", regex.I)

You might also do this if your module has an unwieldy name


or if you are going to be typing it a lot. For example, when
visualizing data with matplotlib, a standard convention is:

import matplotlib.pyplot as plt

It is not always necessary to import the complete module. If


you need a few specific values from a module then it is possible
to import them explicitly and use them without qualification:

from collections import defaultdict, Counter


lookup = defaultdict(int)
my_counter = Counter()

You can also import the entire contents of a module into your namespace, which might inadvertently overwrite variables you've already defined and import functions you don't need:

match = 10
from re import *    # uh oh, re has a match function
print(match)        # prints "<function re.match>" instead of 10

However, this is not a standard practice.

Scope of a variable

Like other programming languages, variables in Python have a scope; not all variables are accessible throughout the program. The scope of a variable depends on where it has been declared.
The two basic scopes of variables in Python are local and global.
Local variables are those which have been defined inside a function body, and their scope is limited to that function body only.
Global variables are defined outside functions and are accessible throughout the program.

This implies that local variables are accessible only inside the function where they are declared, whereas global variables can be accessed from anywhere in the program body by all the functions.

When you call a function, a local variable declared inside it takes precedence over (shadows) any global variable with the same name.
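A minimal sketch of the difference (the variable and function names are made up for illustration):

x = "global value"          # global variable

def show_scope():
    x = "local value"       # local variable; it shadows the global x inside the function
    print(x)                # prints "local value"

show_scope()
print(x)                    # prints "global value"; the global x is unchanged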

Arithmetic Operators

Assume variable x holds 10 and variable y holds 20; then:

+    Addition of both operands:  x + y = 30

-    Subtracts the right-hand operand from the left-hand operand:  x - y = -10

*    Multiplication of both operands:  x * y = 200

/    Division of the left operand by the right operand:  y / x = 2

%    Divides the left-hand operand by the right-hand operand and returns the remainder:  y % x = 0

**   Exponential (power) calculation:  x ** y = 10 to the power 20

//   Floor division: like ordinary division except that the digits after the decimal point of the quotient are removed. If one of the operands is negative, the result is floored, i.e., rounded away from zero:  9 // 2 = 4, 9.0 // 2.0 = 4.0, -11 // 3 = -4, -11.0 // 3 = -4.0
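A quick sketch you can run to check the table above (it assumes x = 10 and y = 20, as in the examples):

x = 10
y = 20
print(x + y)              # 30
print(x - y)              # -10
print(x * y)              # 200
print(y / x)              # 2.0 (in Python 3, / always returns a float)
print(y % x)              # 0
print(x ** y)             # 100000000000000000000, i.e. 10 to the power 20
print(9 // 2, -11 // 3)   # 4 -4 (floor division)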

Functions

A function is used mainly for code reusing. In python there are


built in functions like print(). But we can create our own
functions also. These will be categorized as user defined
functions.
A function is a block of organized, reusable code that is used
to perform a single, related action.
Normally in Python functions are defined using " def "
followed by the function name and then parenthesis containing
function arguments. The code block of every function starts
with a colon (:) and is indented. A function can also have a
return statement used to send back some variable.

Syntax of a simple function:

def double(x):
    """this is where you put an optional docstring that explains what
    the function does. For example, this function multiplies its input
    by 2 and returns the result."""
    return x * 2

Python functions are first-class, which means that we can


assign them to variables and pass them into functions just like
any other arguments:

def apply_to_one(f):
    """calls the function f with 1 as its argument"""
    return f(1)

my_double = double             # refers to the previously defined function double
x = apply_to_one(my_double)    # equals 2

It is also easy to create short anonymous functions, or lambdas. The general syntax of a lambda function is quite simple:

lambda argument_list: expression

The argument list is a comma-separated list of parameters. The expression can be an arithmetic expression that uses these arguments. It is also possible to assign the lambda to a variable in order to give it a name.

The following example defines an add function, as a lambda, that returns the sum of its two arguments:

add = lambda x, y: x + y    # assign the lambda to a name
add(1, 1)                   # equals 2

Function parameters can also be given default arguments,


which only need to be specified when you want a value other
than the default:

def my_print(message="my default message"):
    print(message)

my_print("hello")    # prints 'hello'
my_print()           # prints 'my default message'

It is sometimes useful to specify arguments by name; these are called keyword arguments:

def subtract(a=0, b=0):
    return a - b

subtract(10, 5)    # returns 5
subtract(0, 5)     # returns -5
subtract(b=5)      # same as previous

Strings

We can create strings simply by enclosing characters in quotes.


Python treats single quotes the same as double quotes.

var1 = 'My String!'


var2 = "Python Programming"

In python strings are immutable. This means that elements of


a string cannot be changed once it has been assigned. We can
simply reassign different strings to the same name.

my_string = 'programiz' # Permitted


my_string[5] = 'a' # Not permitted
my_string = 'Python' # Permitted

We cannot delete or remove characters from a string. But


deleting the string entirely is possible using the keyword del.

del my_string[1] # Not permitted


del my_string # Permitted

Joining of two or more strings into a single one is called


concatenation.
The + operator does this in Python.
The * operator can be used to repeat the string for a given
number of times.

my_string = 'Python'
my_string + my_string    # 'PythonPython'
my_string * 3            # 'PythonPythonPython'

We can test if a sub string exists within a string or not, using


the keyword in.

'a' in 'program' # True


'at' not in 'battle' # False

If we want to print a text like He said, "What's there?", we can use neither single quotes nor double quotes alone. That would result in a SyntaxError, because the text itself contains both single and double quotes. Instead, we can use escape sequences.

An escape sequence starts with a backslash and is interpreted


differently. If we use single quote to represent a string, all the
single quotes inside the string must be escaped. Similar is the
case with double quotes. Here is how it can be done to
represent the above text.

# escaping single quotes


print('He said, "What\'s there?"')

# escaping double quotes

print("He said, \"What's there?\"")

Escape sequences are used to represent special characters:
\t represents the tab character,
\n represents the newline character, and so on.
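For instance, a minimal sketch of these two escape sequences in action:

print("Name:\tJohn")           # \t inserts a tab between the two parts
print("line one\nline two")    # \n starts a new line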

At times we can use a raw string to ignore escape sequences. This is done by placing an r or R before the string. You can see the difference below.

print('C:\\nowhere')     # prints C:\nowhere
print(r'C:\\nowhere')    # prints C:\\nowhere

Python's triple quotes allow strings to span multiple lines.
The syntax for triple quotes consists of three consecutive single or double quotes.

para_str = """this is a long string
that is made up of several lines
that will show up that way when displayed."""

print(para_str)

The result is:

this is a long string


that is made up of several lines
that will show up that way when displayed.

Exceptions

An exception is an error that happens during the execution of a program. When such an error occurs, Python generates an exception that can be handled, which prevents your program from crashing. If the exception is not handled, the program may be interrupted abruptly.

The words "try" and "except" are Python keywords that are used to handle exceptions. The code which may cause an exception is placed in the try block, and the code to handle the exception is placed in the except block.

try:
    print("Starting of try")
    print(1 / 0)
    print("Ending of try")
except ZeroDivisionError:
    print("Division by zero is not possible")
finally:
    print("End of program")

Output

Starting of try
Division by zero is not possible
End of program

The code within the try clause will be executed statement by


statement.
If an exception occurs, the rest of the try block will be skipped
and the except clause will be executed.
In addition to using an except block after the try block, you
can also use the finally block. The code in the finally block will
be executed regardless of whether an exception occurs.

Lists

The list is a mutable data structure in Python, and the one used most often compared to all the other data structures. A list is nothing but a collection of ordered elements. It is similar to what in other languages might be called an array, but with some added functionality.

int_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ int_list, heterogeneous_list, [] ]
list_length = len(int_list) # equals 3

list_sum = sum(int_list) # equals 6

You can use the range function to initialize a list with a sequence of numbers:

z = list(range(15))    # z is the list [0, 1, 2, ..., 13, 14]

You can get or set the nth element of a list with square
brackets:

z = list(range(10))   # the list [0, 1, ..., 9]
zeroth = z[0]         # equals 0, lists are 0-indexed
one = z[1]            # equals 1

You can also access the list elements in reverse order using
negative index:

nine = z[-1]     # equals 9
eight = z[-2]    # equals 8

You can update list elements as follows:

z[0] = -1 # now z is [-1, 1, 2, 3, ..., 9]

You can also delete individual list elements by index and the
entire list using the del keyword:

z = ['a', 'b', 'c', 'd', 'e', 'f']
del z[0]    # deletes the element at index zero, i.e. 'a'
del z       # deletes the entire list

You can also use square brackets with the slicing operator " : "
to “slice” lists:

first_three_elements = z[:3]               # [-1, 1, 2]
three_to_end_elements = z[3:]              # [3, 4, ..., 9]
one_to_four_elements = z[1:5]              # [1, 2, 3, 4]
last_three_elements = z[-3:]               # [7, 8, 9]
without_first_and_last_elements = z[1:-1]  # [1, 2, ..., 8]
copy_of_z_elements = z[:]                  # [-1, 1, 2, ..., 9]

Python’s in operator can also be used to check for list


membership:

1 in [1, 2, 3, 4, 5] # True
0 in [1, 2, 3, 4, 5] # False

This check involves examining the elements of the list one at a
time, which means that you probably shouldn’t use it unless
you know your list is pretty small or unless you don’t care how
long the check takes.

It is easy to concatenate lists together using the extend


function:

z = [1, 2, 3]
z.extend([4, 5, 6]) # z is now [1,2,3,4,5,6]

If you don't want to modify z you can use list addition:

z = [1, 2, 3]
y = z + [4, 5, 6]    # y is [1, 2, 3, 4, 5, 6]; z is unchanged

More often we will append to lists one item at a time; this is done using the append function:

z = [1, 2, 3]
z.append(0)      # z is now [1, 2, 3, 0]
y = z[-1]        # equals 0
l = len(z)       # equals 4

Tuples

Tuples are similar to lists except that they are immutable which
means that they cannot be modified once declared. To change
a tuple you have to replace it entirely or create it once again
after deleting it. You specify a tuple by using parentheses
instead of square brackets:

my_first_tuple = (1, 2)
my_first_tuple[1] = 3    # raises an error stating that you cannot modify a tuple

Tuples are a convenient way to return multiple values from


functions:

def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)       # equals (5, 6)
s, p = sum_and_product(5, 10)    # s is 15, p is 50

Dictionaries

A dictionary is another important data structure, which stores data as pairs of keys and values. It allows you to quickly retrieve the value corresponding to a given key. We use curly braces to create a dictionary and ':' to separate a key from its corresponding value:

empty_dict = {}                         # creates an empty dictionary
grades = { "James" : 80, "Tim" : 95 }   # dictionary literal

You can look up the value for a key using square brackets:

james_grade = grades["James"]    # equals 80

But you will get a KeyError if you ask for a key that is not in
the dictionary:

try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")

You can check for the existence of a key using the in operator:

"James" in grades # True


"Kate" in grades # False

Dictionaries have a get method that returns a default value


(instead of raising an exception) when you look up a key that
is not in the dictionary. The default value is passed as the
second parameter to the get function as shown below:

james_grade = grades.get("James", 0)    # equals 80
kates_grade = grades.get("Kate", 0)     # equals 0
no_ones_grade = grades.get("No One")    # equals None, since the default value is None

You can assign new key-value pairs or update the existing ones
using the square brackets:

grades["Tim"] = 99            # replaces the old value
grades["Kate"] = 100          # adds a third entry
num_students = len(grades)    # equals 3

Dictionaries are more often used as a simple way to represent


structured data as shown below:

tweet = {
"user" : "John",
"text" : "Data Science is Awesome",
"retweet_count" : 100,
"hashtags" : ["#data", "#science", "#datascience",
"#awesome", "#yolo"]
}

Besides looking for specific keys we can look at all of them:

tweet_keys = tweet.keys()        # an iterable of the keys
tweet_values = tweet.values()    # an iterable of the values
tweet_items = tweet.items()      # an iterable of (key, value) pairs

"user" in tweet_keys             # equals True
"user" in tweet                  # equals True, and is the idiomatic way to check for a key
"John" in tweet_values           # equals True

Dictionary keys must be immutable; in particular, you cannot


use lists as keys. If you need a multipart key, you should use a
tuple or figure out a way to turn the key into a string.
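As a quick sketch of such a multipart key (the dictionary contents below are made up for illustration):

# a tuple works as a dictionary key because it is immutable
sales = { ("2018", "Q1"): 1000, ("2018", "Q2"): 1500 }
q1_sales = sales[("2018", "Q1")]    # equals 1000
# sales[ ["2018", "Q1"] ]           # would raise TypeError, because a list is not hashable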

Defaultdict:

One of my favorite datatypes in Python is the defaultdict. A


defaultdict is like a regular dictionary, except that when you try
to look up a key it doesn’t contain, it first adds a value for it
using a zero-argument function you provided when you
created it. In order to use defaultdicts, you have to import them
from collections:

from collections import defaultdict

word_counts = defaultdict(int)    # int() produces 0 as the default value
for word in document:
    word_counts[word] += 1

These will be useful when we are using dictionaries to get


results by some key and don’t want to have to check every time
to see if the key exists yet.

Counter:
A Counter is a container that keeps track of how many times equivalent values are added. It turns a sequence of values into a defaultdict(int)-like object that maps each distinct value to its count, with a single entry per value.

from collections import Counter

c = Counter([0, 1, 2, 0])    # c is (basically) { 0 : 2, 1 : 1, 2 : 1 }

Here c is a Counter object giving the count of occurrences of each element: 0 occurs two times, 1 occurs once, and so on.
This gives us a very simple way to solve our word_counts problem:

word_counts = Counter(document)

A Counter instance has a most_common method that is very


useful to get the elements occurring most frequently:

# print the 10 most frequently occurring words and their respective counts
for word, count in word_counts.most_common(10):
    print(word, count)

Sets

Python also includes a data type for sets. A set is an unordered


collection with no duplicate elements. Basic uses include
membership testing and eliminating duplicate entries. Set
objects also support mathematical operations like union,
intersection, difference and symmetric difference.
Curly braces or the set() function can be used to create sets. To
create an empty set you have to use set().

my_set = set()
my_set.add(1)        # my_set is now { 1 }
my_set.add(2)        # my_set is now { 1, 2 }
my_set.add(2)        # my_set is still { 1, 2 }
x = len(my_set)      # equals 2
y = 2 in my_set      # equals True
z = 3 in my_set      # equals False

Control Flow

A program's control flow is the order in which the program's


code executes. The statements inside your programs are
generally executed from top to bottom, in the order that they
appear. Control flow statements, however, break up the flow
of execution by employing decision making, looping, and
branching, enabling your program to conditionally execute
particular blocks of code.
Python knows the usual control flow statements, with some
twists, listed below.
Perhaps the most well-known statement type is the if
statement. For example:

x = 90
if x < 0:
    print('Number is Negative')
elif x == 0:
    print('Number is Zero')
else:
    print('Number is Positive')

Output: Number is Positive

There can be zero or more elif parts, and the else part is
optional. The keyword ‘elif’ is short for ‘else if’, and is useful
to avoid excessive indentation.
You can also write a ternary if-then-else on one line, which is
quite useful occasionally:

result = "even" if x % 2 == 0 else "odd"

The while statement is used for repeated execution as long as


an expression is true:
x = 0
while x < 10:
    print(x)
    x += 1

Output: 0 1 2 3 4 5 6 7 8 9 (each number on its own line)

The for statement is used to iterate over the elements of a


sequence (such as a string, tuple or list) or other iterable object:

for i in range(5):
    print(i)

Output: 0 1 2 3 4 (each number on its own line)

If you need more complex logic, you can use continue and break. The continue statement is used to skip the rest of the current iteration, while break is used to stop iterating and exit the loop entirely.

for x in range(10):
    if x == 2:
        continue    # go immediately to the next iteration
    if x == 5:
        break       # quit the loop entirely
    print(x)

This will print 0, 1, 3, and 4.

The pass statement does nothing. It can be used when a


statement is required syntactically but the program requires no
action. For example:

while True:
    pass    # busy-wait for a keyboard interrupt (Ctrl+C)

Truthiness

In Python, Booleans work the same as in most other programming languages, except that they are capitalized:

one_is_less_than_two = 1 < 2         # equals True
true_equals_false = True == False    # equals False

Python uses the value None to indicate a nonexistent value. It


is similar to other languages’ null:

z = None
print(z == None)    # prints True
print(z is None)    # prints True

Basically, any value that evaluates to True is referred to as truthy and any value that evaluates to False is referred to as falsy. Python lets you use any value where it expects a Boolean. The following are all falsy:

• False
• None
• [] (an empty list)
• {} (an empty dict)
• "" (an empty string)
• set() (an empty set)
• 0
• 0.0

Pretty much anything else gets treated as True. This allows you
to easily use if statements to test for empty lists or empty
strings or empty dictionaries or so on.
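For example, a minimal sketch of this idiom (the list name is made up for illustration):

names = []
if names:
    print("the list has elements")
else:
    print("the list is empty")    # this branch runs, because [] is falsy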

Python has an all function, which takes a list and returns True
precisely when every element is truthy, and an any function,
which returns True when at least one element is truthy:

all([True, 1, { 3 }])    # True
all([True, 1, {}])       # False, {} is falsy
any([True, 1, {}])       # True, True is truthy
all([])                  # True, no falsy elements in the list
any([])                  # False, no truthy elements in the list

Moving ahead

Here we will look at some more-advanced Python features that


we will find useful for working with data.

Sorting

Every Python list has a sort method that sorts it in place. If you don't want to modify your list, you can use the sorted function, which returns a new sorted list:

y = [4, 1, 2, 3]
z = sorted(y)    # z is [1, 2, 3, 4], y is unchanged
y.sort()         # now y is [1, 2, 3, 4]

By default, sort (and sorted) sort a list from smallest to largest


based on naively comparing the elements to one another.
If you want elements sorted from largest to smallest, you can
specify a reverse=True parameter. Instead of comparing the
elements themselves, you can compare the results of a function
that you specify with key:

# sort the list by absolute value from largest to smallest
y = sorted([-4, 1, -2, 3], key=abs, reverse=True)    # is [-4, 3, -2, 1]

# sort the words and counts from highest count to lowest
wc = sorted(word_counts.items(),
            key=lambda word_and_count: word_and_count[1],
            reverse=True)

List Comprehensions

Python supports a concept called "list comprehensions". It can be used to construct lists in a very natural and easy way, much as a mathematician would write them.

L = []
for y in range(10):
    L.append(y**2)

The above three lines of code can be written in a single line


using list comprehension:

L = [y**2 for y in range(10)]

print(L)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Similarly:

M = [x for x in L if x % 2 == 0]
print(M)
[0, 4, 16, 36, 64]

You can similarly turn lists into dictionaries or sets:

square_dict = { y : y * y for y in range(4) }    # { 0:0, 1:1, 2:4, 3:9 }

Randomness

As we learn data science, we will frequently need to generate


random numbers, which we can do with the random module:

import random

four_uniform_randoms = [random.random() for _ in range(4)]
# random.random() produces numbers uniformly between 0 and 1

The random module actually produces pseudorandom (that is,


deterministic) numbers based on an internal state that you can
set with random.seed if you want to get reproducible results:

random.seed(20)           # set the seed to 20
print(random.random())    # 0.57140259469
random.seed(20)           # reset the seed to 20
print(random.random())    # 0.57140259469 again

The random.randrange takes either 1 or 2 arguments and


returns an element chosen randomly from the corresponding
range():

random.randrange(20)      # choose randomly from range(20) = [0, 1, ..., 19]
random.randrange(3, 6)    # choose randomly from range(3, 6) = [3, 4, 5]

There are a few more methods that we will sometimes find
convenient. random.shuffle randomly reorders the elements of
a list:

up_to_ten = list(range(10))
random.shuffle(up_to_ten)
print(up_to_ten)
# [2, 5, 1, 9, 7, 3, 8, 6, 4, 0] (your results will probably be different)

If you need to randomly pick one element from a list you can
use random.choice:
my_friend = random.choice(["Alice", "Bob",
"Charlie"]) # "Bob" for me

Regular Expressions

A regular expression is a special sequence of characters that


helps you match or find other strings or sets of strings, using a
specialized syntax held in a pattern.
The match function is generally used to match a regular expression pattern against a string, with optional flags:

re.match(pattern_to_match, search_string, flags=0)

pattern_to_match - the regular expression to be matched.
search_string - the string that will be searched for the pattern.
flags - you can specify different flags using bitwise OR (|).

import re

print(all([                                 # all of these are true, because
    not re.match("a", "bat"),               # 'bat' doesn't start with 'a'
    re.search("a", "bat"),                  # 'bat' has an 'a' in it
    not re.search("c", "bat"),              # 'bat' doesn't have a 'c' in it
    3 == len(re.split("[ab]", "carbs")),    # split on a or b to ['c', 'r', 's']
    "R-D-" == re.sub("[0-9]", "-", "R2D2")  # replace digits with dashes
]))                                         # prints True

For more details on regular expressions you can go to https://docs.python.org/3/howto/regex.html

Object-Oriented Programming

Object-oriented programming languages follow the concepts described below.
Class

A class can be thought of as a blueprint for creating objects: a collection of data members (attributes) and related methods. It is a logical entity that has some specific attributes and methods. For example, if you have a Teacher class, it could contain attributes and methods such as qualification, email id, name, age, salary and so on.
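A minimal sketch of such a class (the attribute and method names simply follow the example above and are illustrative only):

class Teacher:
    def __init__(self, name, age, qualification, email, salary):
        # attributes (data members) of the Teacher class
        self.name = name
        self.age = age
        self.qualification = qualification
        self.email = email
        self.salary = salary

    def introduce(self):
        # a method operating on the data members
        print("I am", self.name, "and I hold a", self.qualification)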

Object

An object is an instance of a class; in other words, it can be defined as a real-world entity that has state and behavior. It may be anything, logical or physical: a printer, keyboard, chair, table, paper, etc. An object is created from the blueprint provided by its class.
Everything in Python is an object, and almost everything has attributes and methods. For example, all functions have a built-in attribute __doc__, which returns the docstring defined in the function's source code.

Method

A function associated with an object is called a method. In Python, methods are not unique to class instances; any object type can have methods.
Inheritance:
It is the process by which one object acquires the properties of another object, or one class inherits the attributes and actions of another class. The class whose properties are inherited is known as the parent class or base class, while the class that inherits the properties is known as the child class or subclass. Inheritance allows code reusability and makes it much easier to create a new class that has the same properties or behaviors as existing ones. Inheritance also eliminates code repetition.
Polymorphism

Polymorphism is made up of two words: "poly", meaning many, and "morphs", meaning form or shape. It is the property that allows the same interface to be implemented in different ways by different classes, as per their requirements. It means that one task can be performed in different ways. For example, you may have a Figure class, and all figures have a shape, but each figure's shape is different. Here the "shape" behavior is polymorphic: it depends entirely on the figure. The abstract "figure" concept does not itself have a concrete "shape", but specific figures (like rectangles and squares) provide a concrete implementation of it.

Encapsulation:

It is the mechanism that binds data and functions together and protects them from outside misuse or tampering. It hides the internal mechanism from the outside world, allowing the system to be used as a whole without worrying about its internal workings.
Data Abstraction:
It is the process of identifying the necessary attributes and actions of an entity as relevant to the application. Data abstraction and encapsulation are often used as synonyms, because abstraction is achieved through encapsulation. Abstraction is used to hide internal details and show only functionality. Abstracting something means giving names to things, so that the name captures the core of what a function or a whole program does.
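To tie these ideas together, here is a minimal sketch (an added illustration using a hypothetical Figure/Rectangle/Circle hierarchy, echoing the shape example above) of inheritance and polymorphism in Python:

import math

class Figure:
    """base class: the abstract 'figure' concept"""
    def area(self):
        raise NotImplementedError("subclasses provide a concrete area()")

class Rectangle(Figure):              # Rectangle inherits from Figure
    def __init__(self, width, height):
        self.width = width
        self.height = height
    def area(self):                   # one concrete implementation
        return self.width * self.height

class Circle(Figure):                 # Circle also inherits from Figure
    def __init__(self, radius):
        self.radius = radius
    def area(self):                   # a different concrete implementation
        return math.pi * self.radius ** 2

# polymorphism: the same call works on every Figure subclass
for figure in [Rectangle(3, 4), Circle(1)]:
    print(figure.area())              # 12, then 3.14159...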
# by convention, we give classes PascalCase names
class MySet:
    # these are the member functions
    # every one takes a first parameter "self" (another convention)
    # that refers to the particular MySet object being used
    def __init__(self, values=None):
        """This is the constructor.
        It gets called when you create a new MySet.
        You would use it like
            s1 = MySet()               # empty set
            s2 = MySet([1,2,2,3,4,4])  # initialize with values"""
        self.dict = {}  # each instance of MySet has its own dict property
                        # which is what we'll use to track memberships
        if values is not None:
            for value in values:
                self.add(value)

    def __repr__(self):
        """this is the string representation of a MySet object
        if you type it at the Python prompt or pass it to str()"""
        return "MySet: " + str(list(self.dict.keys()))

    # we'll represent membership by being a key in self.dict with value True
    def add(self, value):
        self.dict[value] = True

    # value is in the MySet if it's a key in the dictionary
    def contains(self, value):
        return value in self.dict

    def remove(self, value):
        del self.dict[value]

Which we could then use like:

s = MySet([1,2,3,4])
s.add(5)
print(s.contains(5))  # True
s.remove(1)
print(s.contains(1))  # False

Enumerate

Sometimes you will want to iterate over a list and use both its elements and their indexes. The Pythonic solution is enumerate, which produces tuples of (index, element):

for j, document in enumerate(documents):
    do_something(j, document)

Similarly, if we just want the indexes:

for j, _ in enumerate(documents): do_something(j)
Remember that the 'single underscore' _ is conventionally used wherever we want to ignore a specific value.

Zip

Often we will need to zip two or more lists together. Zip transforms multiple lists into a single list of tuples of corresponding elements:

list_alphabets = ['w', 'x', 'y', 'z']
list_numbers = [11, 22, 33, 44]
list(zip(list_alphabets, list_numbers))
# [('w', 11), ('x', 22), ('y', 33), ('z', 44)]

If the lists are different lengths, zip stops as soon as the first list
ends.
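As a small added illustration (not in the original text), zipping lists of unequal length and "unzipping" a list of pairs both follow directly from this behavior:

letters = ['a', 'b', 'c']
numbers = [1, 2]
print(list(zip(letters, numbers)))  # [('a', 1), ('b', 2)] -- stops at the shorter list

# the * operator "unzips" a list of pairs back into two tuples
pairs = [('w', 11), ('x', 22), ('y', 33)]
first, second = zip(*pairs)
print(first)   # ('w', 'x', 'y')
print(second)  # (11, 22, 33)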

Args

The special syntax *args in Python function definitions is used to pass a variable number of arguments to a function; it collects a non-keyworded, variable-length argument list.

def test(arg_first, *argv):
    print("first argument :", arg_first)
    for argument in argv:
        print("Next argument through *argv :", argument)

test('Hello', 'This', 'is', 'a', 'DataScience', 'Program')
Output:

first argument : Hello


Next argument through *argv : This
Next argument through *argv : is
Next argument through *argv : a
Next argument through *argv : DataScience
Next argument through *argv : Program

All this gives us an elementary idea of the basic operations in Python. For detailed documentation of Python, please visit https://docs.python.org/3/tutorial/index.html
Visualizing Data

Any data scientist will need data visualization as a necessity. It is quite simple to create visualizations but very difficult to produce useful ones.
There are two uses for data visualization:
• To explore the data
• To communicate the data

Data visualization can be viewed as a branch of visual communication. It involves creating and studying visual representations of data, such as graphs or other schematic forms, that encode the attributes or variables of the units of information.

The main goal of data visualization is to communicate information clearly and efficiently via statistical graphics, plots and information graphics. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable.

Matplotlib
Python has a library known as Matplotlib, which produces a variety of graphs and other visual representations across platforms. It is a 2D plotting library that can be used in any Python script. With just a little code it is possible to generate bar charts, histograms, error charts and even power spectra and scatterplots.
The following example uses the matplotlib.pyplot module. In its
simplest use, pyplot maintains an internal state in which you build
up a visualization gradually and for simple bar charts, line charts, and
scatterplots, it works pretty well. After the generation of the
graphics, you can either save it or display it.

The following code generates the graph shown next, called a line chart:

from matplotlib import pyplot as grph

year = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp_year = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart, years on x-axis, gdp_year on y-axis
grph.plot(year, gdp_year, color='green', marker='o', linestyle='solid')

# add a title
grph.title("Nominal GDP")

# add a label to the y-axis
grph.ylabel("Billions of $")
grph.show()
Bar Charts

A bar chart is a graph with rectangular bars. The graph usually shows
a comparison between different categories. In other words, the
length or height of the bar is equal to the quantity within that
category.
For instance, figure below shows how many Academy Awards were
won by each of a variety of movies:

from matplotlib import pyplot as plt

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]

# bars are by default width 0.8, so we'll add 0.1 to the left coordinates
# so that each bar is centered
xs = [i + 0.1 for i, _ in enumerate(movies)]

# plot bars with left x-coordinates [xs], heights [num_oscars]
plt.bar(xs, num_oscars)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")

# label x-axis with movie names at bar centers
plt.xticks([i + 0.5 for i, _ in enumerate(movies)], movies)
plt.show()

Line Charts

A line chart or line graph is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. A line chart is often used to visualize a trend in data over intervals of time (a time series), thus the line is often drawn chronologically.
These are a good choice for showing trends, as the following example illustrates:

variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]

# we can make multiple calls to plt.plot
# to show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance')        # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error')  # blue dotted line

# because we've assigned labels to each series
# we can get a legend for free
# loc=9 means "top center"
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.title("The Bias-Variance Tradeoff")
plt.show()
Scatterplots

A scatterplot is the right choice for visualizing the relationship between two paired sets of data. Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose: they show how much one variable is affected by another. The relationship between two variables is called their correlation. For example, the chart below illustrates the relationship between the number of friends your users have and the number of minutes they spend on the site every day:

friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)

# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label with its point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
Linear Algebra

Linear algebra is the branch of mathematics concerning linear equations and their representations through matrices and vector spaces.

Vectors

Vectors are points in some finite-dimensional space. They are a good way to represent numeric data. There are multiple ways to create vector instances using the third-party vectors module.

We can first initialize some vectors by calling its class constructors as follows:

from vectors import Point, Vector

v1 = Vector(1, 2, 3)  # Vector(1, 2, 3)
v2 = Vector(2, 4, 6)  # Vector(2, 4, 6)

We can also create a Point instance or a Vector instance from a list using the class method from_list().

components = [1.2, 2.4, 3.8]
v = Vector.from_list(components)  # Vector(1.2, 2.4, 3.8)
We can also get access to the vector array to use it with other
libraries.

v1.vector # [1, 2, 3]

Magnitude: We can get the magnitude of the vector easily.

v1.magnitude() # 3.7416573867739413

Addition: We can add a real number to a vector or compute the vector sum of two vectors as follows.

v1.add(2)   # Vector(3.0, 4.0, 5.0)
v1.sum(v2)  # Vector(3.0, 6.0, 9.0)

Both methods return a Vector instance.

Multiplication: We can multiply a vector by a real number.

v1.multiply(4)  # Vector(4.0, 8.0, 12.0)

The above returns a Vector instance.

Dot Product: We can find the dot product of two vectors.

v1.dot(v2)  # 28 (that is, 1*2 + 2*4 + 3*6)

Dot product returns a real number.
Cross/Scalar Product: We can find the cross product of two
vectors.
v1.cross(v2) # Vector(0, 0, 0)

Cross product returns a Vector instance, which is perpendicular to both input vectors (for the parallel vectors above, the result is the zero vector).

Matrices

A matrix is a two-dimensional structure. In Python, a matrix can be represented as a nested list, or list of lists. Note that each inner list must have the same size and represents a row of the matrix. If A is a matrix, then A[i][j] is the element in the ith row and the jth column.

For example:

B = [[80, 75, 85, 90, 95],
     [75, 80, 75, 85, 100],
     [80, 80, 80, 90, 95]]

If a matrix has n rows and k columns, we will refer to it as an n × k matrix. The matrix above is a 3 × 5 matrix.
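Later sections use small helper functions such as shape and get_column on matrices like this. Here is one possible minimal sketch of them (an added illustration, not taken verbatim from the text):

def shape(A):
    """returns (number of rows, number of columns) of the matrix A"""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0  # number of elements in the first row
    return num_rows, num_cols

def get_row(A, i):
    """A[i] is already the ith row"""
    return A[i]

def get_column(A, j):
    """jth element of every row"""
    return [A_i[j] for A_i in A]

B = [[80, 75, 85, 90, 95],
     [75, 80, 75, 85, 100],
     [80, 80, 80, 90, 95]]

print(shape(B))          # (3, 5)
print(get_column(B, 0))  # [80, 75, 80]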
Statistics

Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set of experimental data or real-life studies. Statistics studies methodologies to gather, review, analyze and draw conclusions from data. Statistical analysis involves gathering and evaluating data and then summarizing it into a mathematical form. Statistical methods analyze large volumes of data and their properties.
If you want to be a data scientist, you should have a basic understanding of statistics.

Data in Statistics

Before one can present and interpret information, there has to be a process of gathering and sorting data. Just as trees are the raw material from which paper is produced, so too can data be viewed as the raw material from which information is obtained. In fact, a good definition of data is "facts or figures from which conclusions can be drawn".
Data can be qualitative or quantitative:
Qualitative data is descriptive information (it describes something).
Quantitative data is numerical information (numbers).

Measures of central tendencies
Our measures of central tendency are the mean, median and mode.
The mean is the average, computed as the sum of all the observed outcomes from the sample divided by the total number of events.
The median is the middle value when the data values have been sorted (or the average of the 2 middle values if there is an even number of data values).
The mode is the data value(s) that occur with the greatest frequency.
The numpy library, which will be explained later, gives us pre-defined functions for the first two, and the standard statistics module provides the mode.

import numpy as np
from statistics import mode

num_list = [1, 2, 3, 3, 5, 3]
mean_list = np.mean(num_list)      # 2.833...
median_list = np.median(num_list)  # 3.0
mode_list = mode(num_list)         # 3

Dispersion

In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range.
Dispersion is contrasted with location or central tendency, and together they are the most used properties of distributions.
For the study of dispersion, we need some measures which show
whether the dispersion is small or large. Measures like standard
deviation and variance give us an idea about the amount of
dispersion in a set of observations.

Variance measures how far a set of (random) numbers is spread out from its mean value. Mathematically, variance is the average of the squared differences from the mean.
Standard deviation is just the square root of the variance. The difference between the two is that, unlike the variance, it is expressed in the same units as the data.
The program below shows how to find the variance and standard deviation:

import math

def mean(x):
    return sum(x) / len(x)

def sum_of_squares(s):
    """computes the sum of squared elements in s"""
    return sum(s_i ** 2 for s_i in s)

z = [1, 2, 3, 4, 5]

# translate z by subtracting its mean (so the result has mean 0)
z_bar = mean(z)
deviations = [z_i - z_bar for z_i in z]

# assumes z has at least two elements
n = len(z)

variance1 = sum_of_squares(deviations) / (n - 1)
standard_deviation1 = math.sqrt(variance1)

Covariance
Covariance is a measure of how much two random variables vary together from their means. It is similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
The lines below show how to calculate the covariance of two lists z and y of equal length:

n = len(z)
z_deviations = [z_i - mean(z) for z_i in z]
y_deviations = [y_i - mean(y) for y_i in y]
covariance = dot(z_deviations, y_deviations) / (n - 1)

Here dot just sums up the products of corresponding pairs of elements.

Correlation

Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related.
The correlation is unitless and always lies between -1 and 1, where -1 represents perfect anti-correlation and 1 represents perfect correlation. A number like 0.25 represents a relatively weak positive correlation.

Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• The amount of time you study and your results.

An example of data that have a low correlation (or none at all):
• A dog's name and the type of dog biscuit it prefers.
The lines below show how to calculate correlation:

stdev_a = standard_deviation(a)
stdev_b = standard_deviation(b)
correlation = covariance(a, b) / stdev_a / stdev_b

Probability

Probability is simply how likely something is to happen. Whenever we're unsure about the outcome of an event, we can talk about the probabilities of certain outcomes: how likely they are.
Probability is quantified as a number between 0 and 1 where, loosely speaking, 0 indicates impossibility and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur. A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes ("heads" and "tails") are both equally probable; the probability of "heads" equals the probability of "tails"; and since no other outcomes are possible, the probability of either "heads" or "tails" is 1/2 (which could also be written as 0.5 or 50%).

In general:
Probability of an event happening = Number of ways it can happen
/ Total number of outcomes

Dependence and Independence

An independent event is an event that has no connection to another event's chances of happening (or not happening). In other words, the event has no effect on the probability of another event occurring. Independent events in probability are no different from independent events in real life.
For Example: Where you work has no effect on what color car you
drive.
When two events are independent, one event does not influence the
probability of another event.

Simple examples of independent events:
• Owning a cat and growing your own herb garden.
• Winning the lottery and running out of food.

When two events are dependent events, one event influences the
probability of another event. In other words two events are
dependent if the outcome or occurrence of the first affects the
outcome or occurrence of the second so that the probability is
changed.

Simple examples of dependent events:
• Robbing a bank and going to jail.
• Not paying your power bill on time and having your power cut off.

Conditional Probability

In probability theory, conditional probability is a measure of the probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred.
Assume that P(A) means "Probability of Event A"; likewise P(B) means "Probability of Event B".
The conditional probability of an event B is the probability that the
event will occur given the knowledge that an event A has already
occurred. This probability is written P(B|A), notation for the
probability of B given A.
In the case where events A and B are independent, the conditional
probability of event B given event A is simply the probability of
event B, that is P(B).

P(B|A) = P(B)

If events A and B are dependent, then the probability of the intersection of A and B (the probability that both events occur) is defined by P(A,B) = P(A)P(B|A).
From this definition, the conditional probability P(B|A) is easily obtained as shown below:

P(B|A) = P(A,B) / P(A), given that P(A) > 0
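As a small numeric sketch (an added illustration using a single fair die, not an example from the text), we can check this formula by simulation:

import random

# event A: the roll is even; event B: the roll is greater than 3
trials = 100_000
count_a = count_a_and_b = 0
for _ in range(trials):
    roll = random.randrange(1, 7)  # a fair six-sided die
    if roll % 2 == 0:
        count_a += 1
        if roll > 3:
            count_a_and_b += 1

# P(B|A) = P(A,B) / P(A); exactly 2/3, since of {2, 4, 6} only 4 and 6 exceed 3
print(count_a_and_b / count_a)     # approximately 0.667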

Bayes’ Theorem

In simple words, Bayes' Theorem is a way of finding a probability when we know certain other probabilities.

The formula is:

P(A|B) = P(A) P(B|A) / P(B)
This tells us how often A happens given that B happens, written P(A|B), when we know how often B happens given that A happens, written P(B|A), and how likely A is on its own, written P(A), and how likely B is on its own, written P(B).

For example, let us say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke. Then:

P(Fire|Smoke) means how often there is fire when we can see smoke.
P(Smoke|Fire) means how often we can see smoke when there is fire.

So the formula tells us the "forwards" probability P(Fire|Smoke) when we know the "backwards" probability P(Smoke|Fire).
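A tiny numeric sketch of the same formula (the probabilities below are invented purely for illustration):

# assumed, made-up numbers for the fire/smoke example
p_fire = 0.01             # P(Fire): fires are rare
p_smoke = 0.10            # P(Smoke): smoke is fairly common
p_smoke_given_fire = 0.9  # P(Smoke|Fire): most fires produce visible smoke

# Bayes' Theorem: P(Fire|Smoke) = P(Fire) * P(Smoke|Fire) / P(Smoke)
p_fire_given_smoke = p_fire * p_smoke_given_fire / p_smoke
print(p_fire_given_smoke)  # 0.09, i.e. a 9% chance of fire when we see smoke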

Random Variables

Random variables in statistics are used to quantify the outcomes of a random occurrence. Since the occurrence is random, the variable can take on many values. Random variables should be measurable and are generally supposed to be real numbers. For example, rolling a die is a random event, but you can quantify (i.e. give a number to) the outcome: the letter X may be designated to represent the sum of the resulting numbers after three dice are rolled. In this case, X could be (1 + 1 + 1) = 3, (6 + 6 + 6) = 18, or anywhere between 3 and 18, since the highest number on a die is 6 and the lowest is 1.
Further, there are two types of random variables: a discrete random variable represents numbers found by counting, for example the number of chocolates in a jar; a continuous random variable represents an infinite number of values on the number line, for example the distance traveled while delivering mail.
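To make the dice example concrete, here is a minimal simulation sketch (an added illustration) that samples the random variable X, the sum of three dice:

import random
from collections import Counter

def roll_three_dice():
    """one observation of the random variable X = sum of three fair dice"""
    return sum(random.randrange(1, 7) for _ in range(3))

samples = [roll_three_dice() for _ in range(10_000)]
print(min(samples), max(samples))       # always between 3 and 18
print(sum(samples) / len(samples))      # close to the expected value 10.5
print(Counter(samples).most_common(3))  # sums near 10 and 11 are most frequent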

Continuous Distributions

A continuous distribution describes the probabilities of the possible values of a continuous random variable. A continuous random variable is a random variable with a set of possible values (known as the range) that is infinite and uncountable.
Probabilities of a continuous random variable (X) are defined as the area under the curve of its PDF. The equation used to describe a continuous probability distribution is called a probability density function (PDF). Thus, only ranges of values can have a nonzero probability; the probability that a continuous random variable equals some exact value is always zero.

Example of the distribution of weights:
The continuous normal distribution can describe the distribution of weight of adult males. For example, you can calculate the probability that a man weighs between 160 and 170 pounds.

(Figure: distribution plot of adult male weights, with the region between 160 and 170 pounds shaded.)

The shaded region under the curve in this example represents the range from 160 to 170 pounds. The area of this range is 0.136; therefore, the probability that a randomly selected man weighs between 160 and 170 pounds is 13.6%. The entire area under the curve equals 1.0.

The Normal Distribution

A normal distribution, sometimes called the bell curve, is a distribution that occurs naturally in many situations. For example, the bell curve is seen in tests like the SAT and GRE. The bulk of students will score around the average (C), while smaller numbers of students will score a B or a D. An even smaller percentage of students will score an F or an A. This creates a distribution that resembles a bell, which is why it is known as the bell-shaped curve. This curve is symmetrical: half of the data falls to the left of the mean and half falls to the right.

Many groups follow this type of pattern. That’s why it’s widely
used in business, statistics and in government bodies like
the FDA:
The empirical rule tells you what percentage of your data falls
within a certain number of standard deviations from the mean:
• 68% of the data falls within one standard deviation of
the mean.
• 95% of the data falls within two standard deviations of
the mean.
• 99.7% of the data falls within three standard deviations of
the mean.
The standard deviation controls the spread of the distribution.
A smaller standard deviation indicates that the data is tightly
clustered around the mean; the normal distribution will be
taller. A larger standard deviation indicates that the data is
spread out around the mean; the normal distribution will be
flatter and wider.
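As a quick sketch (an added illustration, not from the text), the empirical rule above can be checked numerically with the normal CDF, which can be written in terms of math.erf:

import math

def normal_cdf(x, mu=0, sigma=1):
    """probability that a normal(mu, sigma) variable is <= x"""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

for k in [1, 2, 3]:
    # probability of falling within k standard deviations of the mean
    p = normal_cdf(k) - normal_cdf(-k)
    print(k, round(p, 4))  # 0.6827, 0.9545, 0.9973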
Hypothesis and Inference

Statistical Hypothesis Testing

A data scientist will often want to test whether a specific hypothesis is likely to be true. So data scientists set up hypotheses like "is Java more powerful than C" or "is it likely to rain" that can be translated into statistics about data.
Under various assumptions, those statistics can be thought of as observations of random variables from known distributions, which allows us to make statements about how likely those assumptions are to hold.
In the traditional setup, the data scientist defines two hypotheses: H0, the null (default) hypothesis, and H1, the alternative hypothesis we would like to compare it with. We use statistics to decide whether we can reject H0 as false or not. This will probably make more sense with an example.

Example: Flipping a Coin

Imagine we have a coin and we want to test whether it's fair. We assume that the coin has some probability p of landing heads, and so our null hypothesis is that the coin is fair, that is, p = 0.5. We'll test this against the alternative hypothesis p ≠ 0.5.
In particular, our test will involve flipping the coin some
number n times and counting the number of heads X. Each
coin flip is a Bernoulli trial, which means that X is a
Binomial(n,p) random variable, which we can approximate
using the normal distribution:

def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)

We can also do the reverse: find either the nontail region or the (symmetric) interval around the mean that accounts for a certain level of likelihood. For example, if we want to find an interval centered at the mean and containing 60% probability, then we find the cutoffs where the upper and lower tails each contain 20% of the probability (leaving 60%):

def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds
    that contain the specified probability"""
    tail_probability = (1 - probability) / 2
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound

In particular, let's say that we choose to flip the coin n = 1000 times. If our hypothesis of fairness is true, X should be distributed approximately normally with mean 500 and standard deviation 15.8:

mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)

We need to make a decision about significance: how willing we are to make a type 1 error ("false positive"), in which we reject H0 even though it's true. This willingness is often set at 5% or 1%. Let's choose 5%.
Consider the test that rejects H0 if X falls outside the bounds given by:

normal_two_sided_bounds(0.95, mu_0, sigma_0)  # (469, 531)
Assuming p really equals 0.5 (i.e., H0 is true), there is just a 5%
chance we observe an
X that lies outside this interval, which is the exact significance
we wanted. Said differently,
if H0 is true, then, approximately 19 times out of 20, this test
will give the correct result.
We are also often interested in the power of a test, which is the
probability of not making a type 2 error, in which we fail to reject
H0 even though it’s false. In order to measure this, we have to
specify what exactly H0 being false means. (Knowing merely
that p is not 0.5 doesn’t give you a ton of information about the
distribution of X.) In particular, let’s check what happens if p
is really 0.55, so that the coin is slightly biased toward heads.
For our two-sided test of whether the coin is fair, we compute:

def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        # if x is greater than the mean, the tail is what's greater than x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # if x is less than the mean, the tail is what's less than x
        return 2 * normal_probability_below(x, mu, sigma)

If we were to see 530 heads, we would compute:

two_sided_p_value(529.5, mu_0, sigma_0) # 0.062
One way to convince yourself that this is a sensible estimate is
with a simulation:

extreme_value_count = 0
for _ in range(100000):
    num_heads = sum(1 if random.random() < 0.5 else 0  # count # of heads
                    for _ in range(1000))               # in 1000 flips
    if num_heads >= 530 or num_heads <= 470:            # and count how often
        extreme_value_count += 1                        # the # is 'extreme'

print(extreme_value_count / 100000)  # 0.062

Since the p-value is greater than our 5% significance level, we don't reject the null hypothesis.
Note: make sure your data is roughly normally distributed before using normal_probability_above to compute p-values.

Gradient Descent

Vanilla gradient descent is the simplest form of the technique; here, "vanilla" means pure, i.e. without any modification. Its main feature is that we take small steps toward the minimum by following the gradient of the cost function.
The pseudocode is given below:

update = learning_rate * gradient_of_parameters
parameters = parameters - update
Here, we make an update to the parameters by taking the gradient of the cost function with respect to the parameters and multiplying it by a learning rate, which is essentially a constant number controlling how fast we want to move toward the minimum. The learning rate is a hyper-parameter and should be chosen with care.
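Before the deep-learning example below, here is a minimal numeric sketch of this update rule (an added illustration) on the assumed one-dimensional cost function J(θ) = (θ - 3)², whose gradient is 2(θ - 3):

def gradient(theta):
    """gradient of the assumed cost function J(theta) = (theta - 3)**2"""
    return 2 * (theta - 3)

theta = 0.0          # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    update = learning_rate * gradient(theta)
    theta = theta - update

print(theta)         # very close to 3, the minimum of J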

We will now look at a basic implementation of gradient descent using Python. Here we use gradient descent optimization to find the best parameters for a deep learning model applied to an image recognition problem: identifying digits from a given 28 x 28 image. We have a subset of images for training and the rest for testing our model. In this section, we will look at how we define gradient descent and see how our algorithm performs.
Here is the main code for defining vanilla gradient descent,

params = [weights_hidden, weights_output, bias_hidden, bias_output]

def sgd(cost, params, lr=0.05):
    grads = T.grad(cost=cost, wrt=params)
    updates = []

    for param, grad in zip(params, grads):
        updates.append([param, param - grad * lr])

    return updates

updates = sgd(cost, params)

Now we break it down to understand it better.
We defined a function sgd with arguments cost, params and lr. These represent J(θ) as seen previously, θ (i.e. the parameters of our deep learning algorithm), and our learning rate. We set the default learning rate to 0.05, but this can easily be changed.

def sgd(cost, params, lr=0.05):

We then defined the gradients of our parameters with respect to the cost function. Here we used the theano library to find the gradients, having imported theano.tensor as T.
grads = T.grad(cost=cost, wrt=params)

Finally, we iterated through all the parameters to find the update for each of them. You can see that we use the vanilla gradient descent update here.

for param, grad in zip(params, grads):
    updates.append([param, param - grad * lr])

This function can now be used to find the optimal parameters for the neural network. When we use this function, we find that our neural network does a good enough job of identifying the digits in our images.
Stochastic Gradient Descent

As we mentioned before, often we'll be using gradient descent to choose the parameters of a model in a way that minimizes some notion of error. Using the previous batch approach, each gradient step requires us to make a prediction and compute the gradient for the whole data set, which makes each step take a long time.
Now, usually these error functions are additive, which means that the predictive error on the whole data set is simply the sum of the predictive errors for each data point.
When this is the case, we can instead apply a technique called
stochastic gradient descent, which computes the gradient (and
takes a step) for only one point at a time. It cycles over our data
repeatedly until it reaches a stopping point. During each cycle,
we’ll want to iterate through our data in a random order:

def in_random_order(data):
    """generator that returns the elements of data in random order"""
    indexes = [i for i, _ in enumerate(data)]  # create a list of indexes
    random.shuffle(indexes)                    # shuffle them
    for i in indexes:                          # return the data in that order
        yield data[i]

In addition, we'll want to take a gradient step for each data point. This approach leaves the possibility that we might circle around near a minimum forever, so whenever we stop getting improvements we'll decrease the step size and eventually quit:

def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    data = list(zip(x, y))
    theta = theta_0                            # initial guess
    alpha = alpha_0                            # initial step size
    min_theta, min_value = None, float("inf")  # the minimum so far
    iterations_with_no_improvement = 0

    # if we ever go 100 iterations with no improvement, stop
    while iterations_with_no_improvement < 100:
        value = sum(target_fn(x_i, y_i, theta) for x_i, y_i in data)

        if value < min_value:
            # if we've found a new minimum, remember it
            # and go back to the original step size
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # otherwise we're not improving, so try shrinking the step size
            iterations_with_no_improvement += 1
            alpha *= 0.9

        # and take a gradient step for each of the data points
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))

    return min_theta

The stochastic version will typically be a lot faster than the batch version. Of course, we'll want a version that maximizes as well:

def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    return minimize_stochastic(negate(target_fn),
                               negate_all(gradient_fn),
                               x, y, theta_0, alpha_0)
Getting Data
Without data, it is impossible to be a data scientist. In fact, as a data scientist you will spend most of your time acquiring, cleaning, and transforming data. You could always type the data in yourself, but usually that is not a good use of your time. So let us have a look at how you can read data into Python.
stdin and stdout

If you run your Python scripts at the command line, you can
pipe data through them using sys.stdin and sys.stdout. For
example, here is a script that reads in lines of text and spits
back out the ones that match a regular expression:

# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)

In addition, here is one that counts the lines it receives and then writes out the count:

# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)

We can then use these to count how many lines of a file contain numbers. In Windows, we would use:

type SomeFile.txt | python egrep.py "[0-9]" | python line_count.py

whereas in a Unix system we would use:

cat SomeFile.txt | python egrep.py "[0-9]" | python line_count.py

The | is the pipe character, which means "use the output of the left command as the input of the right command." You can build pretty elaborate data-processing pipelines this way.

Reading Files

The Basics of Text Files

Opening a file in Python:

file = open("test_file.txt", "r")  # opens the file named "test_file.txt"

Reading a file in Python:
• file.read(n) - reads n characters from the file, or, if n is omitted, reads the entire file.
• file.readline() - reads an entire line from the text file.
Closing a file:

file.close()
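Putting these together, here is a short sketch (an added illustration using a placeholder file name) that reads a file line by line; the with statement closes the file automatically:

# "some_data.txt" is a placeholder for any text file you have locally
with open("some_data.txt", "r") as f:
    for line in f:           # iterate over the file one line at a time
        line = line.strip()  # drop the trailing newline
        if line:             # skip blank lines
            print(line)
# no explicit f.close() needed: the with block closes the file for us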
Delimited Files
For example, if we had a tab-delimited file of stock prices:
6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5
6/19/2014 AAPL 91.86
6/19/2014 MSFT 41.51
6/19/2014 FB 64.34

We could process them with:

import csv

with open('tab_delimited_stock_prices_file.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date, symbol, closing_price)

Using APIs
Many websites and web services provide application
programming interfaces (APIs), which allow you to explicitly
request data in a structured format.

JSON (and XML)

Because HTTP is a protocol for transferring text, the data you request through a web API needs to be serialized into a string format. Often this serialization uses JavaScript Object Notation (JSON). JavaScript objects look quite similar to Python dicts, which makes their string representations easy to interpret:

{ "title" : "Data Science Book",
  "author" : "James",
  "publicationYear" : 2014,
  "topics" : [ "data", "science", "python", "data science" ] }

We can parse JSON using Python's json module. In particular, we will use its loads function, which deserializes a string representing a JSON object into a Python object:

import json

serialized = """{ "title" : "Data Science Book",
                  "author" : "James",
                  "publicationYear" : 2014,
                  "topics" : [ "data", "science", "python", "data science" ] }"""

# parse the JSON to create a Python dict
deserialized = json.loads(serialized)
if "data science" in deserialized["topics"]:
    print(deserialized)

Finding APIs

There are two directories, Python API and Python for Beginners, which can be useful if you are looking for lists of APIs that have Python wrappers. If you want a directory of web APIs more broadly (without Python wrappers necessarily), a good resource is Programmable Web, which has a huge directory of categorized APIs.

Example: Using the Twitter APIs

To interact with the Twitter APIs we will be using the Twython library (pip install twython). There are quite a few Python Twitter libraries out there.

Getting Credentials

In order to use Twitter's APIs, you need to get some credentials. Provided the process has not changed, the following steps should help you out.
1. Go to the website https://apps.twitter.com/.
2. If you are not signed in, click Sign in and enter your Twitter username and password on the Sign In page.
3. Click on Create New App.
4. Give it a name and a description, and put any URL as the website.
5. Agree to the Terms of Service and click Create.
6. Take note of the consumer key and consumer secret.
7. Click on "Create my access token."
8. Take note of the access token and access token secret (you may have to refresh the page).
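With those four credentials in hand, a minimal usage sketch might look like the following (an added illustration: the credential strings are placeholders, and the exact fields returned depend on the Twitter API version, so treat this only as an outline of the Twython interface):

from twython import Twython

# placeholders: paste in the values you noted down above
CONSUMER_KEY = "your consumer key"
CONSUMER_SECRET = "your consumer secret"
ACCESS_TOKEN = "your access token"
ACCESS_TOKEN_SECRET = "your access token secret"

twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET,
                  ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# search for tweets containing the phrase "data science"
for status in twitter.search(q='"data science"')["statuses"]:
    print(status["user"]["screen_name"], ":", status["text"])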
Working around Data
Exploring Data

Exploring One-Dimensional Data

The simplest example of a one-dimensional data set is just a collection of numbers. For example, these could be the number of pages of each of the Python books in your library.

import math
from collections import Counter
from matplotlib import pyplot as plt

def bucketize_data(point, bucket_size):
    """floor the point to the next lower multiple of bucket_size"""
    return bucket_size * math.floor(point / bucket_size)

def make_histograms(points, bucket_size):
    """buckets the points and counts how many in each bucket"""
    return Counter(bucketize_data(point, bucket_size) for point in points)

def plot_histograms(points, bucket_size, title=""):
    histogram = make_histograms(points, bucket_size)
    plt.bar(list(histogram.keys()), list(histogram.values()), width=bucket_size)
    plt.title(title)
    plt.show()

Two Dimensions

Now imagine you have a data set with two dimensions. Maybe in addition to daily minutes you have years of data science experience. Of course, you'd want to understand each dimension individually, but you probably also want to scatter the data.
For example, consider another fake data set:

def random_normal():
    """returns a random draw from a standard normal distribution"""
    return inverse_normal_cdf(random.random())

xs = [random_normal() for _ in range(1000)]
ys11 = [ x + random_normal() / 2 for x in xs]
ys22 = [-x + random_normal() / 2 for x in xs]

Many Dimensions

With many dimensions, you would like to know how all the
dimensions relate to one another. A simple approach is to look
at the correlation matrix, in which the entry in row i and
column j is the correlation between the ith dimension and the
jth dimension of the data:

def correlation_matrix(data):
    """returns the num_columns x num_columns matrix whose (i, j)th entry
    is the correlation between columns i and j of data"""
    _, num_columns = shape(data)

    def matrix_entry(i, j):
        return correlation(get_column(data, i), get_column(data, j))

    return make_matrix(num_columns, num_columns, matrix_entry)
Cleaning and Munging

During our exploration of the data, we often find problems in the dataset that need to be solved before the data is ready for a good model. This exercise is typically referred to as "data munging".

def parse_rows(input_row, parsers):
    """given a list of parsers (some of which may be None)
    apply the appropriate one to each element of the input_row"""
    return [parser(value) if parser is not None else value
            for value, parser in zip(input_row, parsers)]

def parse_rows_with(reader, parsers):
    """wrap a reader to apply the parsers to each of its rows"""
    for row in reader:
        yield parse_rows(row, parsers)

What if there's bad data? A "float" value that doesn't actually represent a number? We'd usually rather get a None than crash our program. We can do this with a helper function:

def try_or_none(f):
    """wraps f to return None if f raises an exception
    assumes f takes only one input"""
    def f_or_none(x):
        try: return f(x)
        except: return None
    return f_or_none

after which we can rewrite parse_rows to use it:

def parse_rows(input_row, parsers):
    return [try_or_none(parser)(value) if parser is not None else value
            for value, parser in zip(input_row, parsers)]

Manipulating Data

One of the most important skills of a data scientist is manipulating data. It's more of a general approach than a specific technique, so we'll just work through a handful of examples to give you the flavor of it.
Imagine we're working with dicts of stock prices that look like:

data = [
    {'closing_price': 102.06,
     'date': datetime.datetime(2014, 8, 29, 0, 0),
     'symbol': 'AAPL'},
    # ...
]

Conceptually we'll think of them as rows (as in a spreadsheet). Suppose we want the highest closing price for each stock symbol. We can:
1. Group together all the rows with the same symbol.
2. Within each group, take the maximum closing price:

# group rows by symbol
by_symbol = defaultdict(list)
for row in data:
    by_symbol[row["symbol"]].append(row)

# use a dict comprehension to find the max for each symbol
max_price_by_symbol = { symbol : max(row["closing_price"]
                                     for row in grouped_rows)
                        for symbol, grouped_rows in by_symbol.items() }

Rescaling

Many techniques are sensitive to the scale of your data. For example, imagine that you have a data set consisting of the heights and weights of hundreds of data scientists, and that you are trying to identify clusters of body sizes.
Intuitively, we'd like clusters to represent points near each other, which means that we need some notion of distance between points. We already have a Euclidean distance function, so a natural approach might be to treat (height, weight) pairs as points in two-dimensional space. Consider the people listed in the table below.
Person   Height (inches)   Height (centimeters)   Weight (pounds)
A        63                160                    150
B        67                170.2                  160
C        70                177.8                  171
If we measure height in inches, then B’s nearest neighbor is A:

a_to_b = distance([63, 150], [67, 160])  # 10.77
a_to_c = distance([63, 150], [70, 171])  # 22.14
b_to_c = distance([67, 160], [70, 171])  # 11.40

However, if we measure height in centimeters, then B's nearest neighbor is instead C:

a_to_b = distance([160, 150], [170.2, 160])    # 14.28
a_to_c = distance([160, 150], [177.8, 171])    # 27.53
b_to_c = distance([170.2, 160], [177.8, 171])  # 13.37

Obviously it's problematic if changing units can change results like this. For this reason, when dimensions aren't comparable with one another, we will sometimes rescale our data so that each dimension has mean 0 and standard deviation 1. This effectively gets rid of the units, converting each dimension to "standard deviations from the mean."
To start with, we'll need to compute the mean and the standard_deviation for each column:

def scale(data_matrix):
    """returns the means and standard deviations of each column"""
    num_rows, num_cols = shape(data_matrix)
    means = [mean(get_column(data_matrix, j))
             for j in range(num_cols)]
    stdevs = [standard_deviation(get_column(data_matrix, j))
              for j in range(num_cols)]
    return means, stdevs

And then use them to create a new data matrix:
def rescale(data_matrix):
    """rescales the input data so that each column
    has mean 0 and standard deviation 1
    leaves alone columns with no deviation"""
    means, stdevs = scale(data_matrix)

    def rescaled(i, j):
        if stdevs[j] > 0:
            return (data_matrix[i][j] - means[j]) / stdevs[j]
        else:
            return data_matrix[i][j]

    num_rows, num_cols = shape(data_matrix)
    return make_matrix(num_rows, num_cols, rescaled)

As always, you need to use your judgment. If you were to take a huge data set of heights and weights and filter it down to only the people with heights between 69.5 inches and 70.5 inches, it's quite likely (depending on the question you're trying to answer) that the variation remaining is simply noise, and you might not want to put its standard deviation on equal footing with other dimensions' deviations.
Machine Learning

Modeling

A mathematical model is simply a description of a system using mathematical concepts and language. A model is useful to explain a system: this can be done by studying its different components and making predictions about its behavior.
Since prehistoric times, simple models such as maps and diagrams have been used. Usually, when engineers analyze a system to be controlled or optimized, they prefer a mathematical model. In analysis, engineers can build a descriptive model of the system as a hypothesis of how the system could work, or try to estimate how an unforeseeable event could affect the system. Similarly, in control of a system, engineers can try out different control approaches in simulations.
A mathematical model usually describes a system by a set of variables and a set of equations that establish relationships between the variables. Variables may be of many types: real or integer numbers, boolean values or strings, for example. The variables represent some properties of the system, for example measured system outputs, often in the form of signals, timing data, counters, and event occurrence (yes/no). The actual model is the set of functions that describe the relations between the different variables.
What Is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.

Essentially, with machine learning we use existing data to learn a function that can make a prediction given new data. For example, imagine that we want to create a function to determine whether a photo contains an image of a cat or not. First, we would need to create a data set that contains images with cats and images without cats. We would then have humans label each photo, indicating whether it contains a cat or not. Next, we would apply a machine learning algorithm to that dataset of images with and without cats. The algorithm would learn a function that predicts whether an image contains a cat or not. Finally, if we've done everything correctly, we should be able to provide the function with a new image and it will tell us whether it contains a cat or not. While this is a vastly oversimplified explanation of machine learning, it captures the essence of what we're attempting to accomplish.
Some examples of tasks that machine learning algorithms can perform are: classification, where we make a decision or a prediction involving two or more categories or outcomes, for example deciding whether to accept or reject a loan based on data from a customer's financial history; regression, where we attempt to predict a numeric outcome based on one or more input variables, for example how much a house will sell for based on its features compared to the sale prices of similar houses; and clustering, where we group similar objects together based on similarities in their data, for example grouping customers into marketing segments based on their income, age, gender, number of children, etc.
To understand machine learning better, we look at its workflow.

The general machine learning workflow looks like this:
First, we find a question that we want to answer. This can be a
hypothesis we want to test, a decision we want to make, or
something we want to attempt to predict.
Second, we collect data for our analysis. Sometimes this means
designing an experiment to create new data, other times the
data already exist and we just need to find them.
Third, we prepare the data for analysis, a process often referred
to as data munging or data wrangling. We need to clean and
transform these data to get them into a form suitable for
analysis.
Fourth, we create a model for our data. In the most generic
sense, this can be a numerical model, a visual model, a
statistical model, or a machine learning model. We use this
model to provide evidence for or against our hypothesis, to
help us make a decision, or to predict an outcome.
Fifth, we evaluate the model. We need to determine if our
model answers our question, helps us make a decision, or
creates an accurate prediction. In addition, we need to make
sure that our model is appropriate given our data and the
context.
Finally, if everything looks good, we deploy our model. This
could mean communicating the results of our analysis to
others, making a decision and acting upon our decision, or
deploying an application into production.
We then repeat this process for each question we would like to
answer using feedback from our previous results to help guide
our process.
Data science is typically an iterative process. We typically go through the complete cycle multiple times, learning and improving with each iteration. In addition, this process is often non-sequential; we often have to bounce back and forth between steps as we discover problems and learn better ways of solving them. There are also times when we don't need to complete the process. Often, we learn that what we're doing isn't working or doesn't make sense given our data or context, so we terminate the process and shift our focus to the next most important question in our to-do list instead.
There are some well-established practices for the data science process available, like the CRISP-DM process, which stands for Cross Industry Standard Process for Data Mining. These established processes are useful to help you get started with your data-science process.

The common categories of machine learning algorithms are:
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning

Supervised Learning

• Here the human expert acts as the teacher: we feed the computer training data containing the inputs/predictors, we show it the correct answers (outputs), and from the data the computer should be able to learn the patterns.
• Supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features, such that we can predict the output values for new data based on the relationships learned from the previous data sets.
• Supervised learning can be thought of as function approximation: we train an algorithm, and at the end of the process we pick the function that best describes the input data, the one that for a given X makes the best estimation of y (X -> y). Most of the time we are not able to figure out the true function that always makes the correct predictions; another reason is that the algorithm relies upon assumptions made by humans about how the computer should learn, and these assumptions introduce a bias (a topic we return to in the bias-variance trade-off). Examples are: Nearest Neighbors, Naive Bayes, Decision Trees, Linear Regression, Support Vector Machines (SVM), Neural Networks.
Unsupervised Learning

• The computer is trained with unlabeled data.
• Here there is no teacher at all; in fact, the computer might be able to teach you new things after it learns patterns in the data. These algorithms are particularly useful in cases where the human expert doesn't know what to look for in the data.
• These are the family of machine learning algorithms mainly used in pattern detection and descriptive modeling. There are no output categories or labels here on which the algorithm can try to model relationships. These algorithms try to use techniques on the input data to mine for rules, detect patterns, and summarize and group the data points, which helps in deriving meaningful insights and describing the data better to the users. Examples: k-means clustering, Association Rules.

Semi-supervised Learning

In the previous two types, either there are no labels for all the
observation in the dataset or labels are present for all the
observations. Semi-supervised learning falls in between these
two. In many practical situations, the cost to label is quite high,
since it requires skilled human experts to do that. So, in the
absence of labels in the majority of the observations but
present in few, semi-supervised algorithms are the best
candidates for the model building. These methods exploit the
idea that even though the group memberships of the unlabeled
data are unknown, this data carries important information
about the group parameters.
Reinforcement Learning

This method aims at using observations gathered from the interaction with the environment to take actions that would maximize the reward or minimize the risk. A reinforcement learning algorithm (called the agent) continuously learns from the environment in an iterative fashion. In the process, the agent learns from its experiences of the environment until it explores the full range of possible states.

Overfitting and Underfitting

A common danger in machine learning is overfitting: producing a model that performs well on the data you train it on but that generalizes poorly to any new data. This could involve learning noise in the data, or it could involve learning to identify specific inputs rather than whatever factors are actually predictive of the desired output.
The other side of this is underfitting: producing a model that doesn't perform well even on the training data, although typically when this happens you decide your model isn't good enough and keep looking for a better one.

(Figure: polynomial fits of degree 0, 1, and 9 to the same training data, illustrating underfitting and overfitting.)

The horizontal line shows the best fit degree 0 (i.e. constant) polynomial. It severely underfits the training data. The best fit degree 9 (i.e. 10-parameter) polynomial goes through every training data point exactly, but it very severely overfits: if we were to pick a few more data points, it would quite likely miss them by a lot. And the degree 1 line strikes a nice balance: it's pretty close to every point, and (if these data are representative) the line will likely be close to new data points as well.
Clearly, models that are too complex lead to overfitting and don't generalize well beyond the data they were trained on. So how do we make sure our models aren't too complex? The most fundamental approach involves using different data to train the model and to test the model.
The simplest way to do this is to split your data set, so that (for
example) two-thirds of it is used to train the model, after which
we measure the model’s performance on the remaining third:

import random

def split_data(data, prob):
    """split data into fractions [prob, 1 - prob]"""
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

If the model was overfit to the training data, then it will hopefully perform really poorly on the (completely separate) test data. Said differently, if it performs well on the test data, then you can be more confident that it's fitting rather than overfitting.
However, there are a couple of ways this can go wrong.
The first is if there are common patterns in the test and train data that wouldn't generalize to a larger data set.
A bigger problem is if you use the test/train split not just to judge a model but also to choose from among many models. In that case, although each individual model may not be overfit, choosing the one that performs best on the test set makes the test set function as a second training set. (Of course the model that performed best on the test set is going to perform well on the test set.)
In such a situation, you should split the data into three parts: a
training set for building models, a validation set for choosing
among trained models, and a test set for judging the final model.
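A minimal sketch of such a three-way split (the 60/20/20 fractions are just an illustrative choice):

import random

def split_three_ways(data, train_frac=0.6, val_frac=0.2):
    """shuffle the data and cut it into train / validation / test pieces"""
    data = data[:]                  # copy so we don't reorder the caller's list
    random.shuffle(data)
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    train = data[:n_train]
    validation = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, validation, test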
Correctness

• TP – True Positive – the number of observations correctly assigned to the positive class.
Example: the model's predictions are correct and resigning customers have been assigned to the class of "disloyal" customers.
• TN – True Negative – the number of observations correctly assigned to the negative class.
Example: the model's predictions are correct and customers who continue using the service have been assigned to the class of "loyal" customers.
• FP – False Positive – the number of observations assigned by the model to the positive class, which in reality belong to the negative class.
Example: unfortunately the model is not perfect and made a mistake: some customers who continue using the service have been assigned to the class of "disloyal" customers.
• FN – False Negative – the number of observations assigned by the model to the negative class, which in reality belong to the positive class.
Example: unfortunately the model is not perfect and made a mistake: some churning customers have been assigned to the class of "loyal" customers.
• ACC (Total Accuracy) – reflects the classifier's overall prediction correctness, i.e. the probability of making the correct prediction, equal to the ratio of the number of correct decisions to the total number of decisions (see the sketch below):

ACC = (TP + TN) / (TP + TN + FP + FN)
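A small sketch of the formula (the counts used here are hypothetical):

def total_accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)"""
    return (tp + tn) / float(tp + tn + fp + fn)

# hypothetical confusion-matrix counts for the churn example above
print(total_accuracy(tp=70, tn=900, fp=10, fn=20))   # 0.97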
The Bias-Variance Trade-off

Another way of thinking about the overfitting problem is as a trade-off between bias and variance.
Both are measures of what would happen if you were to retrain your model many times on different sets of training data (from the same larger population).
A model like the degree 0 polynomial above will make a lot of mistakes for essentially any training set, which means it has a high bias. However, any two randomly chosen training sets should give similar models (since any two randomly chosen training sets should have similar average values). Therefore, we say that it has a low variance. High bias and low variance typically correspond to underfitting.
On the other hand, the degree 9 model fits the training set perfectly. It has very low bias but very high variance (since any two training sets would likely give rise to very different models). This corresponds to overfitting.
Thinking about model problems this way can help you figure out what to do when your model doesn't work so well.
If your model has high bias (which means it performs poorly even on your training data), then one thing to try is adding more features. Going from the degree 0 model in "Overfitting and Underfitting" to the degree 1 model can be a big improvement.
If your model has high variance, you can similarly remove features. Another solution is to obtain more data.

Figure: Reducing variance with more data
Feature Extraction and Selection

In machine learning and pattern recognition, a feature is a measurable property of a phenomenon being observed. Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification. For instance, if our goal is to detect a cow in an image, any pixel of the image can be seen as a feature, but not all of them are informative and useful for our goal.
In real applications, tens of thousands of features are often measured while only a very small percentage of them carry useful information towards our learning goal. Therefore, we usually need an algorithm that compresses our feature vector and reduces its dimension. Two groups of methods that can be used for dimensionality reduction are: 1) feature extraction methods, which apply a transformation to the original feature vector to reduce its dimension from d to m, and 2) feature selection methods, which select a small subset of the original features. As an example, one can compare linear discriminant analysis (LDA), a traditional feature extraction method, with a forward-selection-based method (an instance of the feature selection algorithms) and see under which conditions each of them works better; a sketch of both is given below.
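A minimal sketch of both approaches with scikit-learn (it assumes a reasonably recent scikit-learn version, which provides SequentialFeatureSelector, and uses the small iris data set purely as an illustration):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# feature extraction: project the 4 original features onto 2 new axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_extracted = lda.fit_transform(X, y)
print(X_extracted.shape)          # (150, 2)

# feature selection: keep 2 of the original features (forward selection)
selector = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=2)
selector.fit(X, y)
print(selector.get_support())     # mask of the selected original features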
K-Nearest Neighbors
Imagine that you’re trying to predict how I’m going to vote in
the next presidential election. If you know nothing else about
me (and if you have the data), one sensible approach is to look
at how my neighbors are planning to vote. Living in downtown
Seattle, as I do, my neighbors are invariably planning to vote
for the Democratic candidate, which suggests that
“Democratic candidate” is a good guess for me as well.
Now imagine you know more about me than just geography—
perhaps you know my age, my income, how many kids I have,
and so on. To the extent my behavior is influenced (or
characterized) by those things, looking just at my neighbors
who are close to me among all those dimensions seems likely
to be an even better predictor than looking at all my neighbors.
This is the idea behind nearest neighbors classification.
The Model

Example:
This example is broken down into the following steps:

1. Handling Data: Open the dataset from CSV and split it into test/train datasets.
2. Calculating Similarity: Calculate the distance between two data instances.
3. Locating Neighbors: Locate the k most similar data instances.
4. Generating Response: Generate a response from a set of data instances.
5. Evaluating Accuracy: Summarize the accuracy of predictions.
6. Main: Tie it all together.

Handling Data

The first thing we need to do is load our data file. The data is
in CSV format without a header line or any quotes. We can
open the file with the open function and read the data lines
using the reader function in the csv module.

import csv

with open('iris.data', 'rb') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print ', '.join(row)

Next we need to split the data into a training dataset that kNN can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model.
We first need to convert the flower measures that were loaded as strings into numbers that we can work with. Next we need to split the data set randomly into train and test datasets. A ratio of 67/33 for train/test is a standard ratio to use.
Pulling it all together, we can define a function called loadDataset that loads a CSV with the provided filename and splits it randomly into train and test datasets using the provided split ratio.

import csv
import random

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

Download the iris flowers dataset CSV file to the local directory. We can test this function out with our iris dataset, as follows:

trainingSet = []
testSet = []
loadDataset('iris.data', 0.66, trainingSet, testSet)
print 'Train: ' + repr(len(trainingSet))
print 'Test: ' + repr(len(testSet))

Calculating Similarity

Calculating the similarity between any two given data instances is necessary in order to make predictions. Having a similarity measure lets us find the k nearest neighbors of a data item and use them to predict its value.
The Euclidean distance is defined as the square root of the sum of the squared differences between the two arrays of numbers. Additionally, we want to control which fields to include in the distance calculation. Specifically, we only want to include the first 4 attributes. One approach is to limit the Euclidean distance to a fixed number of leading fields, ignoring the final dimension (the class label).

Putting all of this together, we can define the euclideanDistance function as follows:

import math

def euclideanDistance(instance_1, instance_2, length):
    distance = 0
    for i in range(length):
        distance += pow((instance_1[i] - instance_2[i]), 2)
    return math.sqrt(distance)
We can test this function with some sample data, as follows:

data_1 = [3, 3, 3, 'x']
data_2 = [5, 5, 5, 'y']
distance = euclideanDistance(data_1, data_2, 3)
print 'Distance: ' + repr(distance)

Locating Neighbors

Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance.
This is a straightforward process of calculating the distance for all instances and selecting a subset with the smallest distance values.
Below is the getNeighbors function that returns the k most similar neighbors from the training set for a given test instance (using the already defined euclideanDistance function):

import operator

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
We can test out this function as follows:
trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
testInstance = [5, 5, 5]
k = 1
neighbors = getNeighbors(trainSet, testInstance, 1)
print(neighbors)

Generating Response

Once we have located the most similar neighbors for a test instance, the next task is to devise a predicted response based on those neighbors.
We can do this by allowing the neighbors to vote for their class attribute, and taking the majority vote as the prediction.
Below is a function for getting the majority-voted response from a number of neighbors. It assumes the class is the last attribute for each neighbor.

import operator

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(),
                         key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

We can test out this function with some test neighbors, as follows:

neighbors = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
response = getResponse(neighbors)
print(response)

This approach returns one response in the case of a draw, but you could handle such cases in a specific way, such as returning no response or selecting an unbiased random response.

Evaluating Accuracy

We have all of the pieces of the kNN algorithm in place. An important remaining concern is how to evaluate the accuracy of predictions.
An easy way to evaluate the accuracy of the model is to calculate the ratio of correct predictions to all predictions made, called the classification accuracy.
Below is the getAccuracy function that sums the total correct predictions and returns the accuracy as a percentage of correct classifications.

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
We can test this function with a test dataset and
predictions, as follows:
testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)

Main Elements

We now have all the elements of the algorithm and we can tie them together with a main function; one possible version is sketched below. Running the example, you will see the results of each prediction compared to the actual class value in the test set. At the end of the run, you will see the accuracy of the model; in this case, a little over 98%.
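One possible main function, assuming the helper functions defined above (the choice of k = 3 and the 0.66 split are just example settings); running it prints output like the listing shown next:

def main():
    # prepare the data
    trainingSet = []
    testSet = []
    loadDataset('iris.data', 0.66, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()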
...
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
Accuracy: 98.0392156862745%
The Curse of Dimensionality

K nearest neighbours works fine as long as there is a limited number of dimensions in the data. As soon as the curse of dimensionality comes into the picture, k nearest neighbours starts failing. This is due to the fact that high-dimensional spaces are vast: points in high-dimensional spaces tend not to be close to one another at all.
We can check this by generating random pairs of points in the d-dimensional "unit cube" for various values of d and then calculating the distances between them, as in the sketch below.
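A small sketch of this check (pure Python, written just for illustration):

import random
import math

def random_point(dim):
    """a random point in the dim-dimensional unit cube"""
    return [random.random() for _ in range(dim)]

def distance(p, q):
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

def random_distances(dim, num_pairs):
    """distances between num_pairs random pairs of points"""
    return [distance(random_point(dim), random_point(dim))
            for _ in range(num_pairs)]

# as the dimension grows, the minimum distance between random points
# gets closer to the average distance: points stop being "close"
for dim in [1, 10, 100]:
    dists = random_distances(dim, 1000)
    print(dim, min(dists) / (sum(dists) / len(dists)))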
The curse of dimensionality occurs when machine learning algorithms are applied to highly dimensional data. The term was first introduced by Richard Bellman.

Let's take a simple example as an illustration of the issue. Say we have a data set of observations X = {x_1, …, x_N} where N = 20, and each x ∈ [0, 20] is a scalar (D = 1). We also have a set of target classes T = {t_1, …, t_N}, where t is a class variable expressing association with one of two possible classes, either "green" or "red" (t ∈ {g, r}).
We want to train a classifier that can output the probability of the nth observation being of class "green":

p(t_n = g | x_n)
Let's use a quite naïve approach and partition the space of X into four equally sized regions: Region1, Region2, Region3, and Region4, as shown in the left-most sub-plot of Figure 1.
The curse of dimensionality says that as the number of dimensions increases, the number of regions grows exponentially.
In the first region, that is the x ∈ [0, 5] interval, we have 3 observations of class red and one observation of class green, so

p(t_n = g | Region1) = 1/4 = 0.25

We can apply the same approach to the remaining three regions and get

p(t_n = g | Region2) = 2/3 ≈ 0.67
p(t_n = g | Region3) = 4/7 ≈ 0.57
p(t_n = g | Region4) = 3/6 = 0.5
We now have a working, although quite simplistic, classifier. If we are presented with an unseen observation, all we have to do is figure out its region in order to make a prediction for its class.
Now let's increase the dimensionality of the data set X by making x a two-dimensional vector (D = 2):

X = {(x_11, x_12), …, (x_N1, x_N2)}

Look at the middle sub-plot of Figure 1. We have the same number of observations, but the number of regions has increased to 16.
In the case of a one-dimensional x we had 20 observations across 4 regions, resulting in an average of 20/4 = 5 observations per region. Using the same number of observations in a two-dimensional space reduces this to 20/16 = 1.25 observations per region.
We now have regions that only contain green classes (e.g. Region2, x_1 ∈ [5, 10], x_2 ∈ [0, 5]) where, according to our classifier, the probability of observing "green" will be 1 and the probability of observing "red" will be 0.

Another problem that we can immediately spot is that regions with zero observations emerge (e.g. Region1, x_1 ∈ [0, 5], x_2 ∈ [0, 5]). Our classifier will be undefined in such regions and we won't be able to use it to make any predictions.
Increasing the dimensionality further worsens the situation: adding a third dimension results in 64 regions and a density of 20/64 ≈ 0.31 observations per region.
Let's see how many observations we need if we want to keep the one-dimensional density from our example in a three-dimensional space:

20 / 20^1 = x / 20^3  =>  x = 8000

This means we have to use 8,000 observations in a three-dimensional space to get the same density as we would get from 20 observations in a one-dimensional space.
This illustrates one of the key effects of the curse of dimensionality: as dimensionality increases, the data becomes sparse. This means we need to gather more observations in order to present the classification algorithm with good coverage of the space. If we keep increasing the number of dimensions, the number of required observations quickly goes beyond what we can hope to gather.
Naive Bayes

It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an orange if it is orange in colour, round in shape, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an orange, and that is why it is known as 'Naive'.
A Naive Bayes model is easy to build and can be very useful for large data sets. It is simple, yet it can outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

In the above formula,
• P(c|x) (read as "probability of c given x") is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) (read as "probability of c") is the prior probability of the class.
• P(x|c) (read as "probability of x given c") is the likelihood, which is the probability of the predictor given the class.
• P(x) (read as "probability of x") is the prior probability of the predictor.

Applications of Naive Bayes Algorithms

• Real-time predictions
• Multi-class predictions
• Text classification / spam filtering and sentiment analysis
• Recommendation systems

How to build a basic model using Naive Bayes in Python?

The three types of Naive Bayes model under the scikit-learn library are:

• Gaussian models
• Multinomial models
• Bernoulli models

Based on your data set, you can choose any of the above models. Below is an example with the Gaussian model.

Python Code

#Import Library of Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
import numpy as np

#assigning predictor and target variables
a = np.array([[-3,7],[1,5], [1,2], [-2,0], [2,3], [-4,0], [-1,1], [1,1], [-2,2], [2,7], [-4,1], [-2,7]])
b = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])

#Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(a, b)

#Predict Output
predicted = model.predict([[1,2],[3,4]])
print predicted

Output: [3 4]
Simple Linear Regression
The question which arises here is: how does regression relate to machine learning?
Given data, we can try to find the best fit line. After we discover the best fit line, we can use it to make predictions.
Suppose we have data about clothing: price, size, and so on. The data can be anything saved from Excel into CSV format. To load the data, we will use Python Pandas.
The required modules are:
1. scikit-learn
2. scipy
3. pandas (plus matplotlib for plotting)
You can use pip install to install the above.
Load dataset and plot

You can choose the graphical toolkit; this line is optional:

matplotlib.use('GTKAgg')

We start by loading the modules and the dataset. Without data we can't make good predictions.
The first step is to load the dataset. The data will be loaded using Python Pandas, a data analysis module. It will be loaded into a structure known as a Pandas DataFrame, which allows for easy manipulation of rows and columns.
We create two arrays: S (size) and P (price). Intuitively we'd expect to find some correlation between price and size.
The data will be split into a training and a test set. Once we have the test data, we can find a best fit line and make predictions.
import matplotlib
matplotlib.use('GTKAgg')

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
import pandas as pd

# Load CSV and columns
df = pd.read_csv("Clothing.csv")

P = df['price'].values
S = df['lotsize'].values

S = S.reshape(len(S), 1)
P = P.reshape(len(P), 1)

# Split the predictor (size) into training/testing sets
S_train = S[:-250]
S_test = S[-250:]

# Split the target (price) into training/testing sets
P_train = P[:-250]
P_test = P[-250:]

# Plot outputs
plt.scatter(S_test, P_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())

plt.show()

Finally we plot the test data.
We have created the two datasets and have the test data on the screen. We can continue by creating the best fit line:

# Create linear regression object
regression = linear_model.LinearRegression()

# Train the model using the training sets
regression.fit(S_train, P_train)

# Plot outputs
plt.plot(S_test, regression.predict(S_test), color='red', linewidth=3)

This will output the best fit line for the given test data.
Logistic Regression

Logistic regression in Python can be used for data science. Linear regression, by contrast, is not the best tool for predicting the class of an observation; it is meant for estimating numeric values, so if we try to classify a binary class coded as 1s and 0s with it, the results are bound to be disappointing. The fact is that linear regression works on a continuum of numeric estimates. In order to classify correctly, you need a more suitable measure, such as the probability of class ownership. Thanks to the following formula, you can transform a linear regression numeric estimate into a probability that is more apt to describe how a class fits an observation:

probability of a class = exp(res) / (1 + exp(res))

res is the regression result (the sum of the variables weighted by the coefficients) and exp is the exponential function; exp(res) corresponds to Euler's number e raised to the power of res. A linear regression using such a formula (also called a link function) for transforming its results into probabilities is a logistic regression.
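A quick numeric illustration of the link function:

import math

def logistic(res):
    """turn a regression result into a probability of the class"""
    return math.exp(res) / (1 + math.exp(res))

print(logistic(0))    # 0.5: a score of zero means maximum uncertainty
print(logistic(3))    # about 0.95: a large positive score means high probability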
Applying logistic regression
Logistic regression is similar to linear regression, with the only
difference being the y data, which should contain integer values
indicating the class relative to the observation. Using the Iris
dataset from the Scikit-learn datasets module, you can use the
values 0, 1, and 2 to denote three classes that correspond to
three species:

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:-1,:], iris.target[:-1]

To make the example easier to work with, leave a single value out so that later you can use this value to test the efficacy of the logistic regression model on it.

from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()
logistic.fit(X, y)
print 'Predicted class %s, real class %s' % (
    logistic.predict(iris.data[-1,:]), iris.target[-1])
print 'Probabilities for each class from 0 to 2: %s' % \
    logistic.predict_proba(iris.data[-1,:])

Predicted class [2], real class 2
Probabilities for each class from 0 to 2:
[[ 0.00168787 0.28720074 0.71111138]]

Contrary to linear regression, logistic regression doesn't just output the resulting class (in this case, class 2); it also estimates the probability of the observation being part of each of the three classes. Based on the observation used for prediction, logistic regression estimates a probability of 71 percent of its being from class 2: a high probability, but not a perfect score, therefore leaving a margin of uncertainty.
Using probabilities lets you guess the most probable class, but you can also order the predictions with respect to being part of that class. This is especially useful for medical purposes: ranking a prediction in terms of likelihood with respect to others can reveal which patients are at most risk of getting or already having a disease.
The two multiclass wrappers OneVsRestClassifier and OneVsOneClassifier operate by incorporating an estimator (in this case, LogisticRegression). After incorporation, they usually work just like any other learning algorithm in Scikit-learn. Interestingly, the one-versus-one strategy often obtains the best accuracy thanks to its high number of models in competition; a sketch of both wrappers is given below.
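A minimal sketch of both wrappers (the iris data and the cross-validation settings here are arbitrary example choices, and the relative accuracies will depend on the problem):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

for Strategy in (OneVsRestClassifier, OneVsOneClassifier):
    clf = Strategy(LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, X, y, cv=5)
    print(Strategy.__name__, scores.mean())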
Decision Trees
A decision tree simply uses a tree structure to represent a number of possible decision paths and an outcome for each specified path.
To decide which questions to ask, the tree-building algorithm relies on entropy. Entropy measures the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content. If a labeled data set S has class proportions p_1, ..., p_n, its entropy is

H(S) = -p_1 log2(p_1) - ... - p_n log2(p_n)

which we can implement as shown in the sketch below.
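A minimal sketch of these helpers, assuming (as in the rest of this section) that each data point is a pair (attribute_dict, label); data_entropy is the function used by entropy_of_partition below:

import math
from collections import Counter

def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2)
               for p in class_probabilities
               if p)                       # ignore zero probabilities

def class_probabilities(labels):
    total_count = len(labels)
    return [count / float(total_count)
            for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    return entropy(class_probabilities(labels))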
The Entropy of a Partition

Mathematically, if we partition our data S into subsets S_1, ..., S_m containing proportions q_1, ..., q_m of the data, then we compute the entropy of the partition as a weighted sum:

H = q_1 H(S_1) + ... + q_m H(S_m)

def entropy_of_partition(subsets):
    """find the entropy from this partition of data into subsets;
    subsets is a list of lists of labeled data"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)

Creating a Decision Tree

The VP provides you with the interviewee data, consisting of (per your specification) pairs (input, label), where each input is a dict of candidate attributes, and each label is either True (the candidate interviewed well) or False (the candidate interviewed poorly). In particular, you are provided with each candidate's level, her preferred language, whether she is active on Twitter, and whether she has a PhD:

inputs = [
({'level':'Senior', 'lang':'Java', 'tweets':'no',
'phd':'no'}, False),
({'level':'Senior', 'lang':'Java', 'tweets':'no',
'phd':'yes'}, False),
({'level':'Mid', 'lang':'Python', 'tweets':'no',
'phd':'no'}, True),
({'level':'Junior', 'lang':'Python', 'tweets':'no',
'phd':'no'}, True),
({'level':'Junior', 'lang':'R', 'tweets':'yes',
'phd':'no'}, True),
({'level':'Junior', 'lang':'R', 'tweets':'yes',
'phd':'yes'}, False),
({'level':'Mid', 'lang':'R', 'tweets':'yes',
'phd':'yes'}, True),
({'level':'Senior', 'lang':'Python', 'tweets':'no',
'phd':'no'}, False),
({'level':'Senior', 'lang':'R', 'tweets':'yes',
'phd':'no'}, True),

139
({'level':'Junior', 'lang':'Python', 'tweets':'yes',
'phd':'no'}, True),
({'level':'Senior', 'lang':'Python', 'tweets':'yes',
'phd':'yes'}, True),
({'level':'Mid', 'lang':'Python', 'tweets':'no',
'phd':'yes'}, True),
({'level':'Mid', 'lang':'Java', 'tweets':'yes',
'phd':'no'}, True),
({'level':'Junior', 'lang':'Python', 'tweets':'no',
'phd':'yes'}, False)
]

Our tree will consist of decision nodes (which ask a question and
direct us differently depending on the answer) and leaf nodes
(which give us a prediction). We will build it using the relatively
simple ID3 algorithm, which operates in the following manner.
Let’s assume we have some labeled data, and a list of attributes
to consider branching on.
• If the data all have the same label, then we create a leaf node
that predicts that label and then stops.
• If the list of attributes is empty (i.e., there are no more
possible questions to ask), then we create a leaf node that
predicts the most common label and then stops.
• Otherwise, we try partitioning the data by each of the
attributes
• Then we choose the partition which has the lowest partition
entropy
• Then we add a decision node based on the chosen attribute
• Recur on each partitioned subset using the remaining
attributes

140
This is known as the “greedy” algorithm because, at each step,
it chooses the most immediately best option. Given a data set,
there may be a better tree with a worse-looking first move. If
so, this algorithm won’t find it. Nevertheless, it is relatively easy
to understand and implement, which makes it a good place to
begin exploring decision trees.
Let’s manually go through these steps on the interviewee data
set. The data set has both True and False labels, and we have
four attributes we can split on. So our first step will be to find
the partition with the least entropy. We’ll start by writing a
function that does the partitioning:

from collections import defaultdict

def partition_by(inputs, attribute):
    """each input is a pair (attribute_dict, label).
    returns a dict : attribute_value -> inputs"""
    groups = defaultdict(list)
    for input in inputs:
        key = input[0][attribute]   # get the value of the specified attribute
        groups[key].append(input)   # then add this input to the correct list
    return groups

and one that uses it to compute the entropy:

def partition_entropy_by(inputs, attribute):
    """computes the entropy corresponding to the given partition"""
    partitions = partition_by(inputs, attribute)
    return entropy_of_partition(partitions.values())
Then we just need to find the minimum-entropy partition for the whole data set:

for key in ['level', 'lang', 'tweets', 'phd']:
    print key, partition_entropy_by(inputs, key)
# level 0.693536138896
# lang 0.860131712855
# tweets 0.788450457308
# phd 0.892158928262

The lowest entropy comes from splitting on level, so we'll need to make a subtree for each possible level value. Every Mid candidate is labeled True, which means that the Mid subtree is simply a leaf node predicting True. For Senior candidates, we have a mix of Trues and Falses, so we need to split again:

senior_inputs = [(input, label)
                 for input, label in inputs
                 if input["level"] == "Senior"]

for key in ['lang', 'tweets', 'phd']:
    print key, partition_entropy_by(senior_inputs, key)
# lang 0.4
# tweets 0.0
# phd 0.950977500433

This shows us that our next split should be on tweets, which results in a zero-entropy partition. For these Senior-level candidates, "yes" tweets always result in True while "no" tweets always result in False.

142
Finally, if we do the same thing for the Junior candidates, we end up splitting on phd, after which we find that no PhD always results in True and PhD always results in False. The complete decision tree is sketched below.
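Written out in a simple nested representation, where each decision node is an (attribute, subtree_dict) pair and each leaf is True or False, the tree described above looks like this:

tree = ('level',
        {'Junior': ('phd', {'no': True, 'yes': False}),
         'Mid': True,
         'Senior': ('tweets', {'no': False, 'yes': True})})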
Random Forests
Given how closely decision trees can fit themselves to their training data, it's not surprising that they have a tendency to overfit. One way of avoiding this is a technique called random forests, in which we build multiple decision trees and let them vote on how to classify inputs:

from collections import Counter

def forest_classify(trees, input):
    votes = [classify(tree, input) for tree in trees]   # classify() walks a single tree
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]

Our tree-building process was deterministic, so how do we get random trees?
One piece involves bootstrapping data. Rather than training each tree on all the inputs in the training set, we train each tree on the result of bootstrap_sample(inputs). Since each tree is built using different data, each tree will be different from every other tree. (A side benefit is that it's totally fair to use the non-sampled data to test each tree, which means you can get away with using all of your data as the training set if you are clever in how you measure performance.) This technique is known as bootstrap aggregating or bagging.
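A minimal sketch of such a bootstrap sample:

import random

def bootstrap_sample(data):
    """randomly sample len(data) elements with replacement"""
    return [random.choice(data) for _ in data]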
A second source of randomness involves changing the way we choose the best attribute to split on. Rather than looking at all the remaining attributes, we first choose a random subset of them and then split on whichever of those is best:

# if there are already few enough split candidates, look at all of them
if len(split_candidates) <= self.num_split_candidates:
    sampled_split_candidates = split_candidates
# otherwise pick a random sample
else:
    sampled_split_candidates = random.sample(split_candidates,
                                             self.num_split_candidates)

# now choose the best attribute only from those candidates
# (partial comes from functools)
best_attribute = min(sampled_split_candidates,
                     key=partial(partition_entropy_by, inputs))

partitions = partition_by(inputs, best_attribute)
This is an example of a broader technique called ensemble learning, in which we combine several weak learners (typically high-bias, low-variance models) in order to produce an overall strong model.
Random forests are among the most popular and most versatile models around.
Neural Networks

Finding bugs in a program is a trial-and-error, or more properly a hit-and-try, process: we test different scenarios to find the bugs and then keep them in mind for the next test case. Neural networks work in a very similar way. A network takes several inputs, processes them through multiple neurons in multiple hidden layers, and returns the result using an output layer. This result estimation process is technically known as "Forward Propagation".
Next, we compare the result with the actual output. The task is to make the output of the neural network as close to the actual (desired) output as possible. Each of these neurons contributes some error to the final output. How do you reduce the error?
We try to minimize the value/weight of the neurons that contribute more to the error, and this happens while traveling back through the neurons of the network and finding where the error lies. This process is known as "Backward Propagation".

Perceptrons

A perceptron can be viewed as anything that takes multiple inputs and produces one output. It computes a weighted sum of its inputs and "fires" if that weighted sum is zero or greater:

def my_step_function(x):
    return 1 if x >= 0 else 0

def my_perceptron_output(weights, bias, x):
    """returns 1 if the perceptron 'fires', 0 if not"""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return my_step_function(weighted_sum)
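For example, with hand-picked weights and bias the perceptron above behaves like an AND gate (the specific numbers are just one choice that works):

and_weights = [2, 2]
and_bias = -3

print(my_perceptron_output(and_weights, and_bias, [1, 1]))   # 1: 2 + 2 - 3 >= 0
print(my_perceptron_output(and_weights, and_bias, [0, 1]))   # 0: 0 + 2 - 3 < 0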
Such a structure takes its inputs and produces a single output. The next logical question is: what is the relationship between input and output? Let us start with basic building blocks and work up to more complex ones.

Backpropagation
Usually we don't build neural networks by hand.
Instead (as usual) we use data to train neural networks. One popular approach is an algorithm called backpropagation, which has similarities to the gradient descent algorithm we looked at earlier.
Imagine we have a training set that consists of input vectors and corresponding target output vectors, and that our network has some set of weights. We then adjust the weights using the following algorithm:

1. Run feed_forward on an input vector and produce the outputs of all the neurons in the network.
2. This results in an error for each output neuron: the difference between its output and its target.
3. Compute the gradient of this error as a function of the neuron's weights, and then adjust its weights in the direction that most decreases the error.
4. "Propagate" these output errors backward to infer errors for the hidden layer.
5. Compute the gradients of these errors and adjust the hidden layer's weights in the same manner.

Typically we run this algorithm many times for our entire training set until the network converges:

def backpropagate(network, input_vector, targets):
    # feed_forward and dot are helpers defined elsewhere in the chapter
    hidden_outputs, outputs = feed_forward(network, input_vector)

    # the output * (1 - output) is from the derivative of sigmoid
    output_deltas = [output * (1 - output) * (output - target)
                     for output, target in zip(outputs, targets)]

    # adjust weights for output layer, one neuron at a time
    for i, output_neuron in enumerate(network[-1]):
        # focus on the ith output layer neuron
        for j, hidden_output in enumerate(hidden_outputs + [1]):
            # adjust the jth weight based on both
            # this neuron's delta and its jth input
            output_neuron[j] -= output_deltas[i] * hidden_output

    # back-propagate errors to hidden layer
    hidden_deltas = [hidden_output * (1 - hidden_output) *
                     dot(output_deltas, [n[i] for n in network[-1]])
                     for i, hidden_output in enumerate(hidden_outputs)]

    # adjust weights for hidden layer, one neuron at a time
    for i, hidden_neuron in enumerate(network[0]):
        for j, input in enumerate(input_vector + [1]):
            hidden_neuron[j] -= hidden_deltas[i] * input
Example: a neural network trained with backpropagation attempts to use the input to predict the output.
01. import numpy as np
02.
03. # sigmoid function
04. def nonlin(x,deriv=False):
05.     if(deriv==True):
06.         return x*(1-x)
07.     return 1/(1+np.exp(-x))
08.
09. # input dataset is as below
10. X = np.array([ [0,0,1],
11.                [0,1,1],
12.                [1,0,1],
13.                [1,1,1] ])
14.
15. # output dataset is as below
16. y = np.array([[0,0,1,1]]).T
17.
18. # seed random numbers to make calculations
19. # deterministic in nature
20. np.random.seed(1)
21.
22. # initializing weights randomly with mean as 0
23. syn0 = 2*np.random.random((3,1)) - 1
24.
25. for iter in xrange(10000):
26.
27.     # forward propagation
28.     l0 = X
29.     l1 = nonlin(np.dot(l0,syn0))
30.
31.     # calculating error
32.     l1_error = y - l1
33.
34.     # multiplying error by the
35.     # slope of the sigmoid at the values in l1
36.     l1_delta = l1_error * nonlin(l1,True)
37.
38.     # update weights
39.     syn0 += np.dot(l0.T,l1_delta)
40.
41. print "The Output After Training of the Data is:"
42. print l1

The Output After Training of the Data is:
[[ 0.00966449]
 [ 0.00786506]
 [ 0.99358898]
 [ 0.99211957]]

151
Variable Definitions

X: Input dataset matrix where each row is a training example.
y: Output dataset matrix where each row is a training example.
l0: First layer of the network, specified by the input data.
l1: Second layer of the network, otherwise known as the hidden layer.
syn0: First layer of weights, Synapse 0, connecting l0 to l1.
*: Element-wise multiplication, so two vectors of equal size multiply corresponding values 1-to-1 to generate a final vector of identical size.
-: Element-wise subtraction, so two vectors of equal size subtract corresponding values 1-to-1 to generate a final vector of identical size.
x.dot(y): If x and y are vectors, this is a dot product. If both are matrices, it is a matrix-matrix multiplication. If only one is a matrix, then it is a vector-matrix multiplication.

As you can see in "The Output After Training of the Data is:", it was successful.
Line Number 01: This imports our only dependency, numpy, a linear algebra library.
Line Number 04: This is our "nonlinearity". While it can be one of several kinds of functions, here the nonlinearity is a function called a "sigmoid". A sigmoid function maps any value to a value between 0 and 1. We use it to convert numbers to probabilities. It also has several other desirable properties for training neural networks.

Line Number 05: Notice that this function can also generate the derivative of the sigmoid (when deriv=True). One of the desirable properties of a sigmoid function is that its output can be used to compute its derivative: if the sigmoid's output is a variable "out", then the derivative is simply out * (1 - out). This is very efficient.
If you're unfamiliar with derivatives, just think of the derivative as the slope of the sigmoid function at a given point (different points have different slopes). For more on derivatives, check out the derivatives tutorial from Khan Academy.

Line Number 10: This initializes our input dataset as a numpy matrix. Each row is a single "training example". Each column corresponds to one of our input nodes. Thus, we have 3 input nodes to the network and 4 training examples.
Line Number 16: This initializes our output dataset. In this case, I generated the dataset horizontally (with a single row and 4 columns) for space. ".T" is the transpose function. After the transpose, this y matrix has 4 rows with one column. Just like our input, each row is a training example, and each column (only one) is an output node. So, our network has 3 inputs and 1 output.

Line Number 20: It's good practice to seed your random numbers. Your numbers will still be randomly distributed, but they'll be randomly distributed in exactly the same way each time you train. This makes it easier to see how your changes affect the network.
Line Number 23: This is our weight matrix for this neural network. It's called "syn0" to imply "synapse zero". Since we only have 2 layers (input and output), we only need one matrix of weights to connect them. Its dimension is (3,1) because we have 3 inputs and 1 output. Another way of looking at it is that l0 is of size 3 and l1 is of size 1. Thus, we want to connect every node in l0 to every node in l1, which requires a matrix of dimensionality (3,1).

Line Number 25: This begins our actual network training code. This for loop "iterates" multiple times over the training code to optimize our network to the dataset.
Line Number 28: Since our first layer, l0, is simply our data, we explicitly describe it as such at this point. Remember that X contains 4 training examples (rows). We're going to process all of them at the same time in this implementation. This is known as "full batch" training. Thus, we have 4 different l0 rows, but you can think of it as a single training example if you want; it makes no difference at this point. (We could load in 1,000 or 10,000 examples without changing any of the code.)
Line Number 29: This is our prediction step. We first let the network "try" to predict the output given the input. We will then study how it performs so that we can adjust it to do a bit better for each iteration.

This line contains 2 steps. The first matrix-multiplies l0 by syn0. The second passes our output through the sigmoid function. Consider the dimensions of each:

(4 x 3) dot (3 x 1) = (4 x 1)

Matrix multiplication is ordered, such that the dimensions in the middle of the equation must be the same. The final matrix generated thus has the number of rows of the first matrix and the number of columns of the second matrix.
Since we loaded in 4 training examples, we ended up with 4 guesses for the correct answer, a (4 x 1) matrix. Each output corresponds with the network's guess for a given input. Perhaps it becomes intuitive why we could have "loaded in" an arbitrary number of training examples: the matrix multiplication would still work out.

Line Number 32: So, given that l1 has a "guess" for each input, we can now compare how well it did by subtracting the guess (l1) from the true answer (y). l1_error is just a vector of positive and negative numbers reflecting how much the network missed.
Line Number 36: Let's break this line into two parts.

1st Part: The Derivative

nonlin(l1, True)

If l1 represents three example points, the code above generates the slopes of the sigmoid at those points. Notice that very high values such as x=2.0 (green dot) and very low values such as x=-1.0 (purple dot) have rather shallow slopes. The highest slope you can have is at x=0 (blue dot). This plays an important role. Also notice that all derivatives are between 0 and 1.
2nd Part: The Entire Statement: The Error-Weighted Derivative

l1_delta = l1_error * nonlin(l1, True)

There are more "mathematically precise" names than "the error-weighted derivative", but I think this captures the intuition. l1_error is a (4,1) matrix. nonlin(l1, True) returns a (4,1) matrix. What we're doing is multiplying them "element-wise". This returns a (4,1) matrix, l1_delta, with the multiplied values.
When we multiply the "slopes" by the error, we are reducing the error of high-confidence predictions. Look at the sigmoid picture again: if the slope was really shallow (close to 0), then the network either had a very high value or a very low value, which means the network was quite confident one way or the other. However, if the network guessed something close to (x=0, y=0.5), then it isn't very confident. We update these "wishy-washy" predictions most heavily, and we tend to leave the confident ones alone by multiplying them by a number close to 0.
Line Number 39: We are now ready to update our network! Let's take a look at a single training example. In this training example, we're all set up to update our weights. Let's update the far left weight (9.5):

weight_update = input_value * l1_delta

For the far left weight, this would multiply 1.0 * the l1_delta. Presumably, this would increment 9.5 ever so slightly. Why only a small amount? Well, the prediction was already very confident, and the prediction was largely correct. A small error and a small slope means a VERY small update. Considering all the weights, it would ever so slightly increase all three.
So, what does line 39 do? It computes the weight updates for each weight for each training example, sums them, and updates the weights, all in a single line.
Clustering

The Idea

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way".
A cluster is therefore a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters.

The Model

As it is very hard to do optimal clustering, we will instead use an iterative algorithm that usually gives us a good clustering:
1. Start with a set of k means, which are points in d-dimensional space.
2. Assign each point to the mean to which it is closest.
3. If no point's assignment has changed, stop and keep the clusters as they are.
4. If some point's assignment has changed, recompute the means and return to step 2.
Using a vector_mean helper, it's pretty simple to create a class that does this; the class is shown after the step-by-step walkthrough and the NumPy implementation below.
Step Number 1

We randomly pick K cluster centers (centroids). Let's assume these are c_1, c_2, …, c_K, and write

C = {c_1, c_2, …, c_K}

where C is the set of all centroids.

Step Number 2

In this step we assign each input value to the closest center. This is done by calculating the Euclidean (L2) distance between the point and each centroid:

argmin_{c_i ∈ C} dist(c_i, x)^2

where dist(.) is the Euclidean distance.

Step Number 3

In this step, we find the new centroid by taking the average of all the points assigned to that cluster:

c_i = (1 / |S_i|) * Σ_{x_j ∈ S_i} x_j

where S_i is the set of all points assigned to the i-th cluster.

Step Number 4

In this step, we repeat steps 2 and 3 until none of the cluster assignments change; that is, we repeat the algorithm until our clusters remain stable.
We often know the value of K. In that case we simply use it. Otherwise we use the Elbow Method.

We run the algorithm for different values of K (say K = 1 to 10), plot the K values against the SSE (Sum of Squared Errors), and select the value of K at the elbow point, as shown in the sketch below.
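A minimal sketch of the Elbow Method using scikit-learn's KMeans on synthetic data (the data set and the range of K values are arbitrary example choices; inertia_ is scikit-learn's name for the SSE):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with 3 true clusters
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# fit k-means for K = 1..10 and record the SSE for each K
k_values = range(1, 11)
sse = [KMeans(n_clusters=k, random_state=0).fit(X_demo).inertia_ for k in k_values]

plt.plot(list(k_values), sse, 'o-')
plt.xlabel('K')
plt.ylabel('SSE (sum of squared errors)')
plt.show()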
Implementation using Python

The dataset we are going to use has 3000 entries with 3 clusters.
So we already know the value of K.

162
We will start by importing the dataset.
%matplotlib inline
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Import of the dataset
data = pd.read_csv('xclara.csv')
print(data.shape)
data.head()
(3000, 2)

# Getting the values and plotting them
fill1 = data['V1'].values
fill2 = data['V2'].values
X = np.array(list(zip(fill1, fill2)))
plt.scatter(fill1, fill2, c='black', s=7)
# To calculate Euclidean Distance
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Number of clusters
k = 3
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print(C)
[[ 11. 26.]
 [ 79. 56.]
 [ 79. 21.]]

# Plotting on the graph along with the centroids
plt.scatter(fill1, fill2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')
# Storing the values of the centroids so we can compare when they are updated
C_old = np.zeros(C.shape)
# Cluster labels (0, 1, 2)
clusters = np.zeros(len(X))
# Error func. - the distance between the new centroids and the old centroids
error = dist(C, C_old, None)
# Run the following loop until the error becomes zero
while error != 0:
    # Assign each value to its closest cluster
    for i in range(len(X)):
        distances = dist(X[i], C)
        cluster = np.argmin(distances)
        clusters[i] = cluster
    # Store the old centroid values
    C_old = deepcopy(C)
    # Find the new centroids by taking the average value
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)

colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')

From this visualization it is clear that there are 3 clusters, and the black stars are their respective centroids.
class KMeans:
    """performs k-means clustering"""

    def __init__(self, k):
        self.k = k          # number of clusters
        self.means = None   # means of clusters

    def classify(self, input):
        """return the index of the cluster closest to the input"""
        return min(range(self.k),
                   key=lambda i: squared_distance(input, self.means[i]))

    def train(self, inputs):
        # choose k random points as the initial means
        self.means = random.sample(inputs, self.k)
        assignments = None

        while True:
            # find new assignments
            new_assignments = map(self.classify, inputs)
            # if no assignments have changed, we can stop
            if assignments == new_assignments:
                return
            # else keep the new assignments,
            assignments = new_assignments
            # and then compute new means based on the new assignments
            for i in range(self.k):
                # find all the points assigned to cluster i
                i_points = [p for p, a in zip(inputs, assignments) if a == i]
                # make sure i_points is not empty so we don't divide by 0
                if i_points:
                    self.means[i] = vector_mean(i_points)

Bottom-up Hierarchical Clustering

An alternative approach to clustering is to "grow" clusters from the bottom up. We can do this in the following way:
1. Make each input its own cluster of one.
2. As long as there are multiple clusters remaining, find the two closest clusters and merge them.
At the end, we'll have one giant cluster containing all the inputs. If we keep track of the merge order, we can recreate any number of clusters by unmerging. For example, if we want three clusters, we can just undo the last two merges.
We'll use a really simple representation of clusters. Our values will live in leaf clusters, which we will represent as 1-tuples:

leaf1 = ([10, 20],)   # to make a 1-tuple you need the trailing comma
leaf2 = ([30, -15],)  # otherwise Python treats the parentheses as parentheses

We'll use these to grow merged clusters, which we will represent as 2-tuples (merge order, children):

merged = (1, [leaf1, leaf2])
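With this representation, a couple of small helpers (a sketch, matching the 1-tuple/2-tuple convention above) are enough to tell leaves and merged clusters apart and to read a merged cluster's children:

def is_leaf(cluster):
    """a cluster is a leaf if it has length 1 (a 1-tuple)"""
    return len(cluster) == 1

def get_children(cluster):
    """returns the two children of this cluster if it's a merged cluster"""
    if is_leaf(cluster):
        raise TypeError("a leaf cluster has no children")
    return cluster[1]

print(is_leaf(leaf1))         # True
print(get_children(merged))   # [([10, 20],), ([30, -15],)]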
Natural Language Processing

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken.

Word Clouds

A word cloud (or tag cloud) is a visual representation of text data. It displays a list of words, the importance of each being shown with font size or color. This format is useful for quickly perceiving the most prominent terms. Python is well suited to drawing this kind of representation, thanks to the wordcloud library developed by Andreas Mueller.
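A minimal sketch with the wordcloud library (the text here is a made-up stand-in for real job-posting data):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "data science python machine learning data python statistics regression"
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()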
This looks neat but doesn't really tell us anything. A more interesting approach might be to scatter the words so that horizontal position indicates posting popularity and vertical position indicates resume popularity, which produces a visualization that conveys a few insights:

def text_size(total):
    """equals 8 if total is 0, 28 if total is 200"""
    return 8 + total / 200.0 * 20

# data is assumed to be a list of (word, job_popularity, resume_popularity) triples
for word, job_popularity, resume_popularity in data:
    plt.text(job_popularity, resume_popularity, word,
             ha='center', va='center',
             size=text_size(job_popularity + resume_popularity))
plt.xlabel("Popularity on Job Postings")
plt.ylabel("Popularity on Resumes")
plt.axis([0, 100, 0, 100])
plt.xticks([])
plt.yticks([])
plt.show()

170
A more meaningful (if less attractive) word cloud

N-gram Models

Suppose the marketing department wants to create thousands of web pages about data science so that your site will rank higher in search results for data science-related terms. Of course, you don't want to write thousands of web pages by hand, nor do you want to pay a horde of "content strategists" to do so. Instead, you are asked whether you can somehow programmatically generate these web pages. To do this, we'll need some way of modeling language.

>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'
>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]
>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]
>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest', 'languages']]

Grammars

A different approach to modeling language is with grammars, rules for generating acceptable sentences. In elementary school, you probably learned about parts of speech and how to combine them. For example, if you had a really bad English teacher, you might say that a sentence necessarily consists of a noun followed by a verb. If you then have a list of nouns and verbs, you can generate sentences according to the rule.
We'll define a slightly more complicated grammar:
grammar = {
"_S" : ["_NP _VP"],
"_NP" : ["_N",
"_A _NP _P _A _N"],
"_VP" : ["_V",
"_V _NP"],
"_N" : ["data science", "Python", "regression"],
"_A" : ["big", "linear", "logistic"],
"_P" : ["about", "near"],
"_V" : ["learns", "trains", "tests", "is"]
}

I made up the convention that names starting with underscores refer to rules that need further expanding, and that other names are terminals that don't need further processing.
So, for example, "_S" is the "sentence" rule, which produces a "_NP" ("noun phrase") rule followed by a "_VP" ("verb phrase") rule.
The verb phrase rule can produce either the "_V" ("verb") rule, or the verb rule followed by the noun phrase rule.
Notice that the "_NP" rule contains itself in one of its productions. Grammars can be recursive, which allows even finite grammars like this to generate infinitely many different sentences.
How do we generate sentences from this grammar? We'll start with a list containing the sentence rule ["_S"]. And then we'll repeatedly expand each rule by replacing it with a randomly chosen one of its productions. We stop when we have a list consisting solely of terminals.
For example, one such progression might look like:

['_S']
['_NP','_VP']
['_N','_VP']
['Python','_VP']
['Python','_V','_NP']
['Python','trains','_NP']
['Python','trains','_A','_NP','_P','_A','_N']
['Python','trains','logistic','_NP','_P','_A','_N']
['Python','trains','logistic','_N','_P','_A','_N']
['Python','trains','logistic','data
science','_P','_A','_N']
['Python','trains','logistic','data
science','about','_A', '_N']
['Python','trains','logistic','data
science','about','logistic','_N']
['Python','trains','logistic','data
science','about','logistic','Python']

How do we implement this? Well, to start, we'll create a simple helper function to identify terminals:

def is_terminal(token):
    return token[0] != "_"

Next we need to write a function to turn a list of tokens into a sentence. We will look for the first nonterminal token. If we can't find one, that means we have a completed sentence and we're done.
If we do find a nonterminal, then we randomly choose one of its productions. If that production is a terminal (i.e., a word), we simply replace the token with it. Otherwise it's a sequence of space-separated nonterminal tokens that we need to split and then splice into the current tokens. Either way, we repeat the process on the new set of tokens.
Putting it all together we get:

import random

def expand(grammar, tokens):
    for i, token in enumerate(tokens):
        # skip over terminals
        if is_terminal(token): continue

        # if you get here, you found a nonterminal token,
        # so choose one of its productions at random
        replacement = random.choice(grammar[token])

        if is_terminal(replacement):
            tokens[i] = replacement
        else:
            tokens = tokens[:i] + replacement.split() + tokens[(i+1):]

        # now call expand on the new list of tokens
        return expand(grammar, tokens)

    # if you get here, all the tokens were terminals, so we're done
    return tokens

And now we can start generating sentences:


def generate_sentence(grammar):
    return expand(grammar, ["_S"])
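Since expand chooses productions with random.choice, each call to generate_sentence yields a different grammatical sentence; one possible (illustrative) result is:

>>> ' '.join(generate_sentence(grammar))
'Python trains logistic regression near big data science'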

Try changing the grammar—add more words, add more rules,
add your own parts of speech—until you’re ready to generate
as many web pages as your company needs.
Grammars are actually more interesting when they are used in
the other direction.
Given a sentence we can use a grammar to parse the sentence.
This then allows us to identify subjects and verbs and helps us
make sense of the sentence.
Using data science to generate text is a neat trick; using it to
understand text is more magical. (See “For Further
Investigation” on page 200 for libraries that you could use for
this.)

Topic Modeling

There are many approaches for obtaining topics from a text, such as Term Frequency-Inverse Document Frequency (TF-IDF) and Non-negative Matrix Factorization. Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique, and it is the one we discuss in this section.
LDA assumes documents are produced from a mixture of
topics. Those topics then generate words based on their
probability distribution. Given a dataset of documents, LDA
backtracks and tries to figure out what topics would create
those documents in the first place.
LDA can also be viewed as a form of matrix factorization.

Parameters of LDA

LDA has two main hyperparameters: alpha (which controls document-topic density) and beta (which controls topic-word density). The higher the value of alpha, the more topics each document is composed of; the lower the value of alpha, the fewer topics per document. Likewise, a higher beta means topics are composed of a larger number of words from the corpus, while a lower beta means they are composed of fewer words.
Number of Topics – the number of topics to be extracted from the corpus. Researchers have developed approaches for choosing an optimal number of topics, for example by using the Kullback-Leibler divergence score. We will not discuss this in detail, as it is quite mathematical; the original paper on the use of KL divergence is a good reference.
Number of Topic Terms – the number of terms shown per topic. It is generally decided according to the requirement: if the problem statement is about extracting themes or concepts, a higher number is recommended; if it is about extracting features or terms, a lower number is recommended.
Number of Iterations / Passes – the maximum number of iterations the LDA algorithm is allowed for convergence.
Preparing Documents
Here are the sample documents combining together to form a
corpus.
doc1 = "Sugar is bad to consume. My sister likes to have sugar,
but not my father."

doc2 = "My father spends a lot of time driving my sister
around to dance practice."
doc3 = "Doctors suggest that driving may cause increased
stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school,
but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your
lifestyle."
# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

Cleaning and Preprocessing


Cleaning is an important step before any text mining task. In this step, we will remove the punctuation and stopwords and normalize the corpus.
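The later snippets expect the cleaned corpus in a variable called doc_clean, a list of token lists. One simple way to produce it, assuming NLTK is installed and its stopwords and wordnet corpora have been downloaded, is the following sketch:

import string
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # drop stopwords, strip punctuation, and lemmatize what remains
    stop_free = " ".join(word for word in doc.lower().split() if word not in stop)
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    return " ".join(lemma.lemmatize(word) for word in punc_free.split())

doc_clean = [clean(doc).split() for doc in doc_complete]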
Preparing Document-Term Matrix
All the text documents combined are known as the corpus. To run any mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns in the entire document-term (DT) matrix. Python provides many great libraries for text mining; "gensim" is one such clean and beautiful library for handling text data. It is scalable, robust and efficient. The following code shows how to convert a corpus into a document-term matrix.

# First import gensim
import gensim
from gensim import corpora

# Create the term dictionary of our corpus, where
# every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Convert the list of documents (corpus) into a Document-Term
# Matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

Running LDA Model

The next step is to create an object for the LDA model and train it on the document-term matrix. Training also requires a few parameters as input, which were explained in the section above. The gensim module allows both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents.

```
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)
```

Results
```
print(ldamodel.print_topics(num_topics=3, num_words=3))

['0.168*health + 0.083*sugar + 0.072*bad',
 '0.061*consume + 0.050*drive + 0.050*sister',
 '0.049*pressur + 0.049*father + 0.049*sister']
```

Network Analysis
Betweenness Centrality

In network analysis the identification of important nodes is a common task. There are various centrality measures we can use, and in this section we will focus on betweenness centrality. We will see how this measure is computed and how to use the networkx library to create a visualization of the network in which the nodes with the highest betweenness are highlighted. Betweenness focuses on the number of visits through shortest paths: if a walker moves from one node to another via the shortest path, then the nodes that are crossed by many such walks have a higher centrality. The betweenness centrality of a node v is defined as

    g(v) = sum over all pairs s, t (with s != v != t) of  sigma_st(v) / sigma_st

where sigma_st is the total number of shortest paths from node s to node t and sigma_st(v) is the number of those paths that pass through v.
Let's see how to compute the betweenness with networkx. As a first step we have to load a sample network:

import networkx as nx

# read the graph (gml format)
G = nx.read_gml('lesmiserables.gml', relabel=True)

Now we have a representation G of our network and we can use the function betweenness_centrality() to compute the centrality of each node. This function returns a dictionary mapping each node label to its centrality value. We can use this information to trim the original network and keep only the most important nodes:

def most_important(G):
    """ returns a copy of G that keeps only the most important
        nodes according to betweenness centrality """
    ranking = nx.betweenness_centrality(G).items()
    print(ranking)
    r = [x[1] for x in ranking]
    m = sum(r) / len(r)   # mean centrality
    t = m * 3             # threshold: keep only nodes with 3 times the mean
    Gt = G.copy()
    for k, v in ranking:
        if v < t:
            Gt.remove_node(k)
    return Gt

Gt = most_important(G) # trimming

And we can use the original network and the trimmed one to
visualize the network as follows:

181
from pylab import show
# create the layout
pos = nx.spring_layout(G)
# draw all the nodes and the edges
nx.draw_networkx_nodes(G, pos, node_color='b', alpha=0.2, node_size=8)
nx.draw_networkx_edges(G, pos, alpha=0.1)

# draw the most important nodes with a different style
nx.draw_networkx_nodes(Gt, pos, node_color='r', alpha=0.4, node_size=254)
# also draw the labels this time
nx.draw_networkx_labels(Gt, pos, font_size=12, font_color='b')
show()

The resulting graph is pretty interesting: it highlights the nodes which are very influential on the way information spreads over the network.

Eigenvector Centrality

In graph theory, eigenvector centrality (also called


eigencentrality) is a measure of the influence of a node in a
network. It assigns relative scores to all nodes in the network
based on the concept that connections to high-scoring nodes
contribute more to the score of the node in question than equal
connections to low-scoring nodes.
Google’s PageRank and the Katz centrality are variants of the
eigenvector centrality.
Centrality

To start with, we’ll need to represent the connections in our
network as an adjacency_matrix, whose (i,j)th entry is either 1
(if user i and user j are friends) or 0 (if they’re not):

def entry_fn(i, j):
    return 1 if (i, j) in friendships or (j, i) in friendships else 0

n = len(users)
adjacency_matrix = make_matrix(n, n, entry_fn)

The eigenvector centrality for each user is then the entry corresponding to that user in the eigenvector returned by find_eigenvector.

The DataSciencester network sized by eigenvector centrality

Users with high eigenvector centrality should be those who
have a lot of connections and connections to people who
themselves have high centrality.
Here users 1 and 2 are the most central, as they both have three
connections to people who are themselves highly central. As
we move away from them, people’s centralities steadily drop
off.
On a network this small, eigenvector centrality behaves
somewhat erratically. If you try adding or subtracting links,
you’ll find that small changes in the network can dramatically
change the centrality numbers. In a much larger network this
would not particularly be the case.
We still haven't motivated why an eigenvector might lead to a reasonable notion of centrality. Eigenvector centralities are numbers, one per user, such that each user's value is a constant multiple of the sum of his neighbors' values. In this case centrality means being connected to people who themselves are central: the more centrality you are directly connected to, the more central you are. This is of course a circular definition; eigenvectors are the way of breaking out of the circularity.
Another way of understanding this is by thinking about what find_eigenvector is doing here. It starts by assigning each node a random centrality and then repeats the following two steps until the process converges (a short sketch of the procedure follows the steps):
1. Give each node a new centrality score that equals the sum
of its neighbors’ (old) centrality scores.
2. Rescale the vector of centralities to have magnitude 1.
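Here is a minimal sketch of that power-iteration procedure, using NumPy and a small made-up adjacency matrix (not the DataSciencester data):

import numpy as np

# hypothetical 4-node undirected network as an adjacency matrix
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

centrality = np.random.rand(A.shape[0])                    # random starting values
for _ in range(100):                                       # repeat until it settles down
    centrality = A @ centrality                            # step 1: sum neighbours' scores
    centrality = centrality / np.linalg.norm(centrality)   # step 2: rescale to magnitude 1

print(centrality)   # the better-connected nodes 1 and 2 end up with the highest scores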

Directed Graphs and PageRank
In this new model, we'll track endorsements (source, target) that no longer represent a reciprocal relationship, but rather that source endorses target as an awesome data scientist. We'll need to account for this asymmetry:

endorsements = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2),
                (2, 1), (1, 3), (2, 3), (3, 4), (5, 4),
                (5, 6), (7, 5), (6, 8), (8, 7), (8, 9)]

However, “number of endorsements” is an easy metric to


game. All you need to do is create phony accounts and have
them endorse you. Or arrange with your friends to endorse
each other. (As users 0, 1, and 2 seem to have done.)
A better metric would take into account who endorses you.
Endorsements from people who have a lot of endorsements
should somehow count more than endorsements from people
with few endorsements. This is the essence of the PageRank
algorithm, used by Google to rank websites based on which
other websites link to them, which other websites link to those,
and so on.
A simplified version looks like this:
1. There is a total of 1.0 (or 100%) PageRank in the network.
2. Initially this PageRank is equally distributed among nodes.
3. At each step, a large fraction of each node's PageRank is distributed evenly among its outgoing links.
4. At each step, the remainder of each node's PageRank is distributed evenly among all nodes.

def page_rank(users, damping=0.85, num_iters=100):
    # initially distribute PageRank evenly
    num_users = len(users)
    pr = { user["id"] : 1 / num_users for user in users }

    # this is the small fraction of PageRank
    # that each node gets in each iteration
    base_pr = (1 - damping) / num_users

    for __ in range(num_iters):
        next_pr = { user["id"] : base_pr for user in users }
        for user in users:
            # distribute PageRank to outgoing links
            links_pr = pr[user["id"]] * damping
            for endorsee in user["endorses"]:
                next_pr[endorsee["id"]] += links_pr / len(user["endorses"])
        pr = next_pr

    return pr
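The function assumes each user is a dict with an "id" and an "endorses" list of user dicts (a structure built in an earlier chapter). A quick, hypothetical way to wire it up from the endorsement pairs above and run it:

users = [{"id": i, "endorses": []} for i in range(10)]
for source, target in endorsements:
    users[source]["endorses"].append(users[target])

pr = page_rank(users)
print(max(pr.items(), key=lambda item: item[1]))   # user 4 comes out on top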

PageRank identifies user 4 (Thor) as the highest-ranked data scientist.

The DataSciencester network sized by PageRank

Even though he has fewer endorsements (2) than users 0, 1,


and 2, his endorsements carry with them rank from their
endorsements. Additionally, both of his endorsers endorsed
only him, which means that he doesn’t have to divide their rank
with anyone else.

Recommender Systems

In today's world data is a valuable resource. Using this data, it is possible to predict the "rating" or "preference" that a potential user will give to a particular product. Most of the tech giants today cash in on this very property of data science to gain a competitive advantage. Systems that predict these preferences are known as recommender systems. Amazon and Flipkart use them to suggest products to customers, YouTube uses them to build your playlist, and Facebook uses them to recommend people to follow. In addition, since all of this results in a personalized experience, the user is more inclined to buy the suggested products. All these advantages make recommender systems very popular.

users_interest = [
["Hadoop", "Big Data", "HBase", "Java", "Spark",
"Storm", "Cassandra"],
["NoSQL", "MongoDB", "Cassandra", "HBase",
"Postgres"],
["Python", "scikit-learn", "scipy", "numpy",
"statsmodels", "pandas"],
["R", "Python", "statistics", "regression",
"probability"],
["machine learning", "regression", "decision trees",
"libsvm"],
["Python", "R", "Java", "C++", "Haskell",
"programming languages"],
["statistics", "probability", "mathematics",
"theory"],
["machine learning", "scikit-learn", "Mahout",
"neural networks"],

["neural networks", "deep learning", "Big Data",
"artificial intelligence"],
["Hadoop", "Java", "MapReduce", "Big Data"],
["statistics", "R", "statsmodels"],
["C++", "deep learning", "artificial intelligence",
"probability"],
["pandas", "R", "Python"],
["databases", "HBase", "Postgres", "MySQL",
"MongoDB"],
["libsvm", "regression", "support vector machines"]
]

We'll think about the problem of recommending new interests to a user based on her currently specified interests.

Manual Curation

In one narrow sense, the most "accurate" recommender system would recommend the same items (whether those items are movies, restaurants, or options available to a hardware end user) over and over again, focused on a narrow topic area and ignorant of context. Before the Internet, when we needed book recommendations we would go to the library, where a librarian was available to suggest books that were relevant to our interests or similar to books we liked.

Recommending What’s Popular

One easy approach is to simply recommend what's popular:

from collections import Counter

popular_interest = Counter(interest
                           for user_interest in users_interest
                           for interest in user_interest).most_common()

Which looks like:


[('Python', 4),
('R', 4),
('Java', 3),
('regression', 3),
('statistics', 3),
('probability', 3),
# ...
]

Having computed this, we can just suggest to a user the most popular interests that he's not already interested in:

def most_popular_new_interests(user_interest, max_results=5):
    suggestions = [(interest, frequency)
                   for interest, frequency in popular_interest
                   if interest not in user_interest]
    return suggestions[:max_results]

So, if you are user 1, with interests:


["NoSQL", "MongoDB", "Cassandra", "HBase",
"Postgres"]

Then we'd recommend you:

most_popular_new_interests(users_interest[1], 5)
# [('Python', 4), ('R', 4), ('Java', 3),
#  ('regression', 3), ('statistics', 3)]

If you were user 3, who’s already interested in many of those


things, you’d instead get:
[('Java', 3),
('HBase', 3),
('Big Data', 3),
('neural networks', 2),
('Hadoop', 2)]

Of course, “lots of people are interested in Python so maybe


you should be too” is not the most compelling sales pitch. If
someone is brand new to our site and we don’t know anything
about them, that’s possibly the best we can do. Let’s see how
we can do better by basing each user’s recommendations on
her interests.
Collaborative filtering
Let's start with what we would do for non-personalized collaborative filtering: to predict the rating of an item i for a user u, we could simply take the average rating of item i, adding up all the rating values for i and dividing by the total number of users U who rated it.
Let's move forward from this intuition and incorporate the behaviour of other users, giving more weight to the ratings of those users who are most like me. But how do we check how much a user is similar to me?
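As a tiny sketch with made-up numbers (ratings_for_item and similarities are hypothetical):

# ratings that four users gave item i, and how similar each of them is to me
ratings_for_item = [4, 5, 3, 4]
similarities = [0.9, 0.1, 0.4, 0.8]

# non-personalized prediction: the plain average rating
prediction = sum(ratings_for_item) / len(ratings_for_item)

# personalized prediction: weight each rating by that user's similarity to me
prediction = (sum(s * r for s, r in zip(similarities, ratings_for_item))
              / sum(similarities))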

User-Based Collaborative Filtering

One way of taking a user's interests into account is to look for users who are somehow similar to him, and then suggest the things that those users are interested in. In order to do that, we will need a way to measure how similar two users are. Here we will use a metric called cosine similarity. Given two vectors, v and w, it's defined as:
def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))
It measures the "angle" between v and w. If v and w point in the same direction, then the numerator and denominator are equal, and their cosine similarity equals 1. If v and w point in opposite directions, their cosine similarity equals -1. And if v is 0 whenever w is not (and vice versa), then dot(v, w) is 0 and so the cosine similarity will be 0.
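cosine_similarity relies on math and on the dot function defined in the book's linear algebra chapter; for completeness, a minimal version of both dependencies:

import math

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))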

We will apply this to vectors of 0s and 1s, each vector v representing one user's interests. The ith element of v will be 1 if the user specified the ith interest, and 0 otherwise. Accordingly, "similar users" will mean "users whose interest vectors most nearly point in the same direction."
Users with identical interests will have similarity 1. Users with
no identical interests will have similarity 0. Otherwise the
similarity will fall in between, with numbers closer to 1
indicating “very similar” and numbers closer to 0 indicating
“not very similar.”
A good place to start is collecting the known interests and
(implicitly) assigning indices to them. We can do this by using
a set comprehension to find the unique interests, putting them
in a list, and then sorting them. The first interest in the resulting
list will be interest 0, and so on:

unique_interests = sorted(list({ interest
                                 for user_interest in users_interest
                                 for interest in user_interest }))

This gives us a list that starts:

['Big Data',
 'C++',
 'Cassandra',
 'HBase',
 'Hadoop',
 'Haskell',
 # ...
]

Next we want to produce an "interest" vector of 0s and 1s for each user. We just need to iterate over the unique_interests list, substituting a 1 if the user has that interest and a 0 if not:

def make_user_interest_vector(user_interest):
    """given a list of interests, produce a vector whose ith element
    is 1 if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interest else 0
            for interest in unique_interests]

After which, we can create a matrix of user interests simply by mapping this function over the list of lists of interests:

user_interest_matrix = list(map(make_user_interest_vector, users_interest))

Now user_interest_matrix[i][j] equals 1 if user i specified


interest j, 0 otherwise.
Because we have a small data set, it’s no problem to compute
the pairwise similarities
between all of our users:

user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_matrix]
                     for interest_vector_i in user_interest_matrix]

After which, user_similarities[i][j] gives us the similarity


between users i and j. For instance, user_similarities[0][9] is
0.57, as those two users share interests in Hadoop, Java, and
Big Data. On the other hand, user_similarities[0][8] is only
0.19, as users 0 and 8 share only one interest, Big Data. How
do we use this to suggest new interests to a user? For each
interest, we can just add up the user-similarities of the other
users interested in it:

from collections import defaultdict

def user_based_suggestions(user_id, include_current_interests=False):
    # sum up the similarities
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interest[other_user_id]:
            suggestions[interest] += similarity

    # convert them to a sorted list
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[-1],
                         reverse=True)

    # and (maybe) exclude existing interests
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interest[user_id]]
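user_based_suggestions relies on a helper most_similar_users_to, defined earlier in the book; a version consistent with the structures above looks roughly like this:

def most_similar_users_to(user_id):
    pairs = [(other_user_id, similarity)             # find other users with
             for other_user_id, similarity in        # nonzero similarity
             enumerate(user_similarities[user_id])
             if user_id != other_user_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[-1],         # most similar first
                  reverse=True)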

If we call user_based_suggestions(0), the first several suggested
interests are:

[('MapReduce', 0.5669467095138409),
('MongoDB', 0.50709255283711),
('Postgres', 0.50709255283711),
('NoSQL', 0.3380617018914066),
('neural networks', 0.1889822365046136),
('deep learning', 0.1889822365046136),
('artificial intelligence', 0.1889822365046136),
#...
]

These seem like pretty decent suggestions for someone whose


stated interests are “Big Data” and database-related. (The
weights aren’t intrinsically meaningful; we just use them for
ordering.)
This approach doesn’t work as well when the number of items
gets very large.

Item-Based Collaborative Filtering

An alternative approach is to compute similarities between


interests directly. We can then generate suggestions for each
user by aggregating interests that are similar to her current
interests.

To start with, we will want to transpose our user-interest matrix so that rows correspond to interests and columns correspond to users:

interest_user_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_matrix]
                        for j, _ in enumerate(unique_interests)]

What does this look like? Row j of interest_user_matrix is column j of user_interest_matrix. That is, it has 1 for each user with that interest and 0 for each user without that interest.
For example, unique_interests[0] is Big Data, and so interest_user_matrix[0] is:

[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

Because users 0, 8, and 9 indicated interest in Big Data.


We can now use cosine similarity again. If precisely the same users are interested in two topics, their similarity will be 1. If no two users are interested in both topics, their similarity will be 0:

interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                          for user_vector_j in interest_user_matrix]
                         for user_vector_i in interest_user_matrix]

For example, we can find the interests most similar to Big Data (interest 0) using:
def most_similar_interests_to(interest_id):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[-1],
                  reverse=True)

Which suggests the following similar interests:


[('Hadoop', 0.8164965809277261),
('Java', 0.6666666666666666),
('MapReduce', 0.5773502691896258),
('Spark', 0.5773502691896258),
('Storm', 0.5773502691896258),
('Cassandra', 0.4082482904638631),
('artificial intelligence', 0.4082482904638631),
('deep learning', 0.4082482904638631),
('neural networks', 0.4082482904638631),
('HBase', 0.3333333333333333)]

Now we can create recommendations for a user by summing


up the similarities of the interests similar to his:

def item_based_suggestions(user_id, include_current_interests=False):
    # add up the similar interests
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity

    # then sort them by weight
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[-1],
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interest[user_id]]

For user 0, this generates the following (seemingly reasonable)


recommendations:
[('MapReduce', 1.861807319565799),
('Postgres', 1.3164965809277263),
('MongoDB', 1.3164965809277263),
('NoSQL', 1.2844570503761732),
('programming languages', 0.5773502691896258),
('MySQL', 0.5773502691896258),
('Haskell', 0.5773502691896258),
('databases', 0.5773502691896258),
('neural networks', 0.4082482904638631),
('deep learning', 0.4082482904638631),

('C++', 0.4082482904638631),
('artificial intelligence', 0.4082482904638631),
('Python', 0.2886751345948129),
('R', 0.2886751345948129)]

Databases and SQL
The data you need will often live in databases, systems designed for efficiently storing and querying data. The bulk of these are relational databases, such as Oracle, MySQL, and SQL Server, which store data in tables and are typically queried using Structured Query Language (SQL), a declarative language for manipulating data. You can get a list of available Python database interface APIs at
https://wiki.python.org/moin/DatabaseInterfaces

The API to be used depends on the database you wish to access; a separate module is needed for each database you want to use. For example, if you wish to access both Oracle and PostgreSQL, then modules for both will need to be downloaded.
Note: An API (Application Programming Interface) is a set of functions and procedures that allow the creation of applications which access the features or data of an operating system, application, or other service.

The Python standard for database interfaces is the Python DB-


API. Most Python database interfaces adhere to this standard.

You can choose the right database for your application. Python
Database API supports a wide range of database servers such
as −

• GadFly
• mSQL
• MySQL
• PostgreSQL
• Microsoft SQL Server 2000
• Informix
• Interbase
• Oracle
• Sybase

The DB API provides a minimal standard for working with


databases using Python structures and syntax wherever
possible. This API includes the following −

1. Importing the API module.


2. Acquiring a connection with the database.
3. Issuing SQL statements and stored procedures.
4. Closing the connection

The following are the steps to be followed to execute Python


with MySQL.

Step 1: Installing MySQL

First you must install a MySQL driver; use the installation method for your platform:

• On Windows: install the MySQLdb (MySQL-python) package.
• On Debian/Ubuntu Linux: sudo apt-get install python-mysqldb
• On RedHat/CentOS Linux: yum install mysql-python

The MySQL server has to be running before going to the next step.

Step 2: Setting up the database

Make sure you have database access, from the command line
type:
mysql -u USERNAME -p

MySQL will then ask for your password. Type these commands:

mysql> CREATE DATABASE pythonspot;
mysql> USE pythonspot;

We go on to create the table:

CREATE TABLE IF NOT EXISTS examples (
  id int(11) NOT NULL AUTO_INCREMENT,
  description varchar(45),
  PRIMARY KEY (id)
);

Then we can insert data into the table (these are SQL queries):
INSERT INTO examples(description) VALUES ("Hello ");
INSERT INTO examples(description) VALUES ("MySQL ");
INSERT INTO examples(description) VALUES ("Python
Example");

You can now grab all records from the table using a SQL
query:

mysql> SELECT * FROM examples;
+----+----------------+
| id | description    |
+----+----------------+
|  1 | Hello          |
|  2 | MySQL          |
|  3 | Python Example |
+----+----------------+
3 rows in set (0.01 sec)

Step 3: Getting the data from Python

You can access the database directly from Python using the
MySQLdb module.

#!/usr/bin/python
import MySQLdb

db = MySQLdb.connect(host="localhost",   # your host
                     user="root",        # username
                     passwd="root",      # password
                     db="pythonspot")    # name of the database

# Create a Cursor object to execute queries.
cur = db.cursor()

# Select data from table using SQL query.
cur.execute("SELECT * FROM examples")

# print the first and second columns
for row in cur.fetchall():
    print row[0], " ", row[1]

db.close()

Output:
1 Hello
2 MySQL
3 Python Example

A relational database is a collection of tables (and of


relationships among them). A table is simply a collection of
rows, not unlike the matrices we’ve been working with.
However, a table also has associated with it a fixed schema
consisting of column names and column types.
Once a database connection is established, we are ready to create tables or insert records into the database tables using the execute method of the cursor we created.

Example
Let us create Database table EMPLOYEES −

#!/usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "testuser", "test123", "TESTDB")

# prepare a cursor object using cursor() method
cursor = db.cursor()

# Drop the table if it already exists using the execute() method.
cursor.execute("DROP TABLE IF EXISTS EMPLOYEES")

# Create table as per requirement
sql = """CREATE TABLE EMPLOYEES (
         FIRST_NAME CHAR(20) NOT NULL,
         LAST_NAME CHAR(20),
         AGE INT,
         SEX CHAR(1),
         INCOME FLOAT )"""

cursor.execute(sql)

# disconnect from server
db.close()

INSERT

It is required when you want to create records in a database table.

Example
The following example prepares SQL INSERT statements to create a record in the EMPLOYEES table −
# Prepare SQL query to INSERT a record into the database,
# with the values written directly into the statement.
sql = """INSERT INTO EMPLOYEES(FIRST_NAME, LAST_NAME, AGE, SEX, INCOME)
         VALUES ('Mac', 'Mohan', 20, 'M', 2000)"""

# The same query built with Python string formatting.
sql = "INSERT INTO EMPLOYEES(FIRST_NAME, LAST_NAME, AGE, SEX, INCOME) \
       VALUES ('%s', '%s', %d, '%c', %d)" % \
      ('Mac', 'Mohan', 20, 'M', 2000)
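Building the SQL string is only half the job; to actually run it, you pass it to the cursor and commit the transaction (reusing the db and cursor objects from the table-creation example above):

try:
    # execute the SQL command and make the change permanent
    cursor.execute(sql)
    db.commit()
except Exception:
    # roll back in case there is any error
    db.rollback()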

Example
The following code segment is another form of execution, where you let the database driver substitute the parameters for you (which also guards against SQL injection):

..................................
user_id = "test123"
password = "password"

con.execute('INSERT INTO Login VALUES (%s, %s)',
            (user_id, password))
.................................

UPDATE

UPDATE Operation on any database means to update one or


more records, which are already available in the database.
The following procedure updates all the records having SEX
as 'M'. Here, we increase AGE of all the males by one year.
Example

# Prepare SQL query to UPDATE required records
sql = "UPDATE EMPLOYEES SET AGE = AGE + 1 WHERE SEX = '%c'" % ('M')

DELETE

DELETE operation is required when you want to delete some


records from your database. Following is the procedure to
delete all the records from EMPLOYEE where AGE is more
than 20 −
Example

# Prepare SQL query to DELETE required records
sql = "DELETE FROM EMPLOYEES WHERE AGE > %d" % (20)

SELECT

The SELECT statement is a simple operation that retrieves rows from a table:

# Select data from table using SQL query.
cur.execute("SELECT * FROM EMPLOYEES")

GROUP BY

The GROUP BY clause can be used to aggregate rows that share a value, for example to count the number of employees having the same income:

# Select data from table using SQL query to count the
# number of employees having the same income
cur.execute("SELECT INCOME, COUNT(*) FROM EMPLOYEES GROUP BY INCOME")

ORDER BY
# Select data from table using SQL query to arrange employees in ascending order
cur.execute("SELECT FIRST_NAME FROM EMPLOYEES ORDER BY FIRST_NAME")

JOIN

This example assumes an EMPLOYEES table in which each row has an employee_id (the primary key) and a manager_id referring to another employee's employee_id, so the table can be joined with itself to find each employee's manager:

# Select data from table using SQL query to find the manager of each employee
db.query("""SELECT e.employee_id 'Emp_Id', e.last_name 'Employee',
                   m.employee_id 'Mgr_Id', m.last_name 'Manager'
            FROM employees e JOIN employees m
            ON (e.manager_id = m.employee_id)""")

Indexes

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows (this is essentially what NotQuiteABase does). The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.
Most MySQL indexes (PRIMARY KEY, UNIQUE, INDEX, and FULLTEXT) are stored in B-trees. Exceptions: indexes on spatial data types use R-trees; MEMORY tables also support hash indexes; InnoDB uses inverted lists for FULLTEXT indexes.
MySQL uses indexes for these operations:

• To find the rows matching a WHERE clause quickly.
• To eliminate rows from consideration. If there is a choice between multiple indexes, MySQL normally uses the index that finds the smallest number of rows (the most selective index).
• If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).
• To retrieve rows from other tables when performing joins. MySQL can use indexes on columns more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR are considered the same if they are declared as the same size. For example, VARCHAR(10) and CHAR(10) are the same size, but VARCHAR(10) and CHAR(15) are not.
For comparisons between nonbinary string columns, both columns should use the same character set. For example, comparing a utf8 column with a latin1 column precludes use of an index.
Comparison of dissimilar columns (comparing a string column to a temporal or numeric column, for example) may prevent use of indexes if values cannot be compared directly without conversion. For a given value such as 1 in the numeric column, it might compare equal to any number of values in the string column such as '1', ' 1', '00001', or '01.e1'. This rules out use of any indexes for the string column.

• To find the MIN() or MAX() value for a specific indexed column key_col. This is optimized by a preprocessor that checks whether you are using WHERE key_part_N = constant on all key parts that occur before key_col in the index. In this case, MySQL does a single key lookup for each MIN() or MAX() expression and replaces it with a constant. If all expressions are replaced with constants, the query returns at once. For example:

  SELECT MIN(key_part2), MAX(key_part2)
  FROM tbl_name WHERE key_part1=10;

Indexes are less important for queries on small tables, or big


tables where report queries process most or all of the rows.
When a query needs to access most of the rows, reading
sequentially is faster than working through an index. Sequential
reads minimize disk seeks, even if not all the rows are needed
for the query.

Query Optimization

In SQL, you generally wouldn’t worry about this. You


“declare” the results you want and leave it up to the query
engine to execute them (and use indexes efficiently). Recall the
query to find all users who are interested in SQL:

SELECT users.name
FROM users
JOIN user_interests
ON users.user_id = user_interests.user_id
WHERE user_interests.interest = 'SQL'

In NotQuiteABase there are (at least) two different ways to


write this query. You
could filter the user_interests table before performing the join:

user_interests \
    .where(lambda row: row["interest"] == "SQL") \
    .join(users) \
    .select(["name"])

Or you could filter the results of the join:

user_interests \
    .join(users) \
    .where(lambda row: row["interest"] == "SQL") \
    .select(["name"])

You’ll end up with the same results either way, but filter-
before-join is almost certainly more efficient, since in that case
join has many fewer rows to operate on.

NoSQL

A recent trend in databases is toward nonrelational “NoSQL”


databases, which don’t represent data in tables. For instance,
MongoDB is a popular schema-less database whose elements
are arbitrarily complex JSON documents rather than rows.

There are column-oriented databases that store data in columns instead of rows (good when data has many columns but queries need only a few of them), key-value stores that are optimized for retrieving single (complex) values by their keys, databases for storing and traversing graphs, databases that are optimized to run across multiple datacenters, databases that are designed to run in memory, databases for storing time-series data, and hundreds more.

MapReduce

MapReduce is a framework for processing parallelizable


problems across large datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all
nodes are on the same local network and use similar hardware)
or a grid (if the nodes are shared across geographically and
administratively distributed systems, and use more
heterogeneous hardware). Processing can occur on data stored
either in a filesystem (unstructured) or in a database
(structured). MapReduce can take advantage of the locality of
data, processing it near the place it is stored in order to
minimize communication overhead.
"Map" step: Each worker node applies the "map()" function
to the local data, and writes the output to a temporary storage.
A master node ensures that only one copy of redundant input
data is processed.

"Shuffle" step: Worker nodes redistribute data based on the
output keys (produced by the "map()" function), such that all
data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of
output data, per key, in parallel.

Why MapReduce?

MapReduce deals with distributed processing for both steps,


but remember, the processing is designed to be embarrassingly
parallel. This is where MapReduce gets performance,
performing operations in parallel. To get the most
performance means that there is no communication between
worker nodes, so no data is really shared between them
Our original (non-MapReduce) approach requires the machine
doing the processing to have access to every document. This
means that the documents all need to either live on that
machine or else be transferred to it during processing. More
important, it means that the machine can only process one
document at a time.
MapReduce More Generally

As a slightly more complicated example, imagine we need to


find out for each user the most common word that she puts in
her status updates. Three possible approaches spring to mind
for the mapper:
• Put the username in the key; put the words and counts in the
values.

• Put the word in the key; put the usernames and counts in the values.
• Put the username and word in the key; put the counts in the
values.
If you think about it a bit more, we definitely want to group by
username, because we want to consider each person’s words
separately. In addition, we don’t want to group by word, since
our reducer will need to see all the words for each person to
find out which is the most popular. This means that the first
option is the right choice:

def words_per_user_mapper(status_update):
    user = status_update["username"]
    for word in tokenize(status_update["text"]):
        yield (user, (word, 1))

def most_popular_word_reducer(user, words_and_counts):
    """given a sequence of (word, count) pairs,
    return the word with the highest total count"""
    word_counts = Counter()
    for word, count in words_and_counts:
        word_counts[word] += count
    word, count = word_counts.most_common(1)[0]
    yield (user, (word, count))

user_words = map_reduce(status_updates,
                        words_per_user_mapper,
                        most_popular_word_reducer)

Or we could find out the number of distinct status-likers for


each user:

def liker_mapper(status_update):
    user = status_update["username"]
    for liker in status_update["liked_by"]:
        yield (user, liker)

distinct_likers_per_user = map_reduce(status_updates,
                                      liker_mapper,
                                      count_distinct_reducer)
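Both examples rely on the generic map_reduce driver (and, in the second case, on count_distinct_reducer), which were defined earlier in the book; minimal versions look roughly like this:

from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """run MapReduce on the inputs using mapper and reducer"""
    collector = defaultdict(list)
    for input in inputs:
        for key, value in mapper(input):
            collector[key].append(value)
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

def count_distinct_reducer(key, values):
    yield (key, len(set(values)))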

Python MapReduce Code

The "trick" behind the following Python code is that we will use the Hadoop Streaming API (see also the corresponding wiki entry) to help us pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python's sys.stdin to read input data and print our own output to sys.stdout. That's all we need to do, because Hadoop Streaming will take care of everything else!

Map step: mapper.py

Save the following code in the file /home/hduser/mapper.py.


It will read data from STDIN, split it into words and output a
list of lines mapping words to their (intermediate) counts to
STDOUT. The Map script will not compute an (intermediate)
sum of a word’s occurrences though. Instead, it will output
<word> 1 tuples immediately – even though a specific word
might occur multiple times in the input. In our case we let the
subsequent Reduce step do the final sum count.

Make sure the file has execution permission (chmod +x
/home/hduser/mapper.py)

mapper.py

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increment counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will become the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

Reduce step: reducer.py

Save the following code in the file /home/hduser/reducer.py.


It will read the results of mapper.py from STDIN (so the output format of mapper.py and the expected input format of reducer.py must match), sum the occurrences of each word into a final count, and then output its results to STDOUT.

Make sure the file has execution permission (chmod +x


/home/hduser/reducer.py).

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts the map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# print the last word to STDOUT if required
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Test your code (cat data | map | sort | reduce)


It is recommended to test the mapper.py and reducer.py scripts locally before using them in a MapReduce job; otherwise a job might complete successfully but produce no result data at all, or not the results you expected.

Testing the script locally first

# very first basic test

hduser@ubuntu:~$ echo "foo foo quux labs foo bar
quux" | /home/hduser/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

hduser@ubuntu:~$ echo "foo foo quux labs foo bar


quux" | /home/hduser/mapper.py | sort -k1,1 |
/home/hduser/reducer.py
bar 1
foo 3
labs 1
quux 2

# using one of the ebooks as example input


# (see below on where to get the ebooks)
hduser@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt |
/home/hduser/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
[...]

You can later run this on Hadoop as well.

Go Forth and Do Data Science

IPython

IPython Notebook is a web-based interactive computational


environment for creating IPython notebooks. An IPython
notebook is a JSON document containing an ordered list of
input/output cells which can contain code, text, mathematics,
plots and rich media.

It provides a shell with far more functionality than the standard


Python shell, and it adds “magic functions” that allow you to
(among other things) easily copy and paste code (which is
normally complicated by the combination of blank lines and whitespace formatting) and run scripts from within the shell.
Mastering IPython will make your life far easier. (Even learning
just a little bit of IPython will make your life a lot easier.)
Additionally, it allows you to create “notebooks” combining
text, live Python code, and visualizations that you can share
with other people, or just keep around as a journal of what you
did.

Mathematics

There are many reasons why the mathematics in Data Science


is important and I will highlight some of them below:

1. Selecting the right algorithm which includes giving


considerations to accuracy, training time, model
complexity, number of parameters and number of
features.

2. Choosing parameter settings and validation strategies.
3. Identifying underfitting and overfitting by understanding the bias-variance tradeoff.
4. Estimating the right confidence interval and uncertainty.

So, in order to be a good data scientist, a grounding in linear algebra, statistics, probability and the various aspects of machine learning will be a real advantage.

Not from Scratch

Implementing things “from scratch” is great for understanding


how they work. But it’s generally not great for performance
(unless you’re implementing them specifically with
performance in mind), ease of use, rapid prototyping, or error
handling.
In practice, you’ll want to use well-designed libraries that
solidly implement the fundamentals.
NumPy

NumPy (for “Numeric Python”) provides facilities for doing


“real” scientific computing.
It features arrays that perform better than our list-vectors,
matrices that perform better than our list-of-list-matrices, and
lots of numeric functions for working with them.
NumPy is a building block for many other libraries, which
makes it especially valuable to know.
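For a flavour of what this looks like (a tiny, illustrative example):

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([1.0, -1.0])

print(A @ v)           # matrix-vector product: [-1. -1.]
print(A.mean(axis=0))  # column means: [2. 3.]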

Pandas

Pandas provides additional data structures for working with data sets in Python. Its primary abstraction is the DataFrame, which has much more functionality and better performance than the table class we built in NotQuiteABase.
If you're going to use Python to munge, slice, group, and manipulate data sets, pandas is an invaluable tool.
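A tiny, illustrative taste of the DataFrame abstraction:

import pandas as pd

df = pd.DataFrame({"user": ["a", "a", "b"],
                   "minutes": [10, 20, 5]})

# total minutes per user
print(df.groupby("user")["minutes"].sum())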

Scikit-learn

Scikit-learn is probably the most popular library for doing machine learning in Python. It contains all the models we've implemented and many more that we haven't. On a real problem, you'd never build a decision tree from scratch; you'd let scikit-learn do the heavy lifting. On a real problem, you'd never write an optimization algorithm by hand; you'd count on scikit-learn to already be using a really good one. Its documentation contains many, many examples of what it can do (and, more generally, what machine learning can do).
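As a small, illustrative example (made-up data, default settings):

from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]   # one feature per example
y = [0, 0, 1, 1]                   # binary labels

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[2.5]]))      # -> [1]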

Visualization

The matplotlib charts we’ve been creating have been clean and
functional but not particularly stylish (and not at all
interactive). If you want to get deeper into data visualization,
you have several options. The first is to further explore
matplotlib, only a handful of whose features we’ve actually
covered. Its website contains many examples of its
functionality and a Gallery of some of the more interesting ones. If you want to create static visualizations (say, for
printing in a book), this is probably your best next step. You
should also check out seaborn, which is a library that (among
other things) makes matplotlib more attractive.

R is very important in data science because of its versatility in


the field of statistics. R is usually used in the field of data
science when the task requires special analysis of data for
standalone or distributed computing.

R is also perfect for exploration. It can be used in any kind of


analysis work, as it has many tools and is also very extensible.
Additionally, it is a perfect fit for big data solutions.

Following are some of the highlights which show why R is important for data science:

• Data analysis software: R is a data analysis software. It is used by data scientists for statistical analysis, predictive modeling and visualization.
• Statistical analysis environment: R provides a complete environment for statistical analysis. It is easy to implement statistical methods in R. Most new research in statistical analysis and modeling is done using R, so new techniques are often available first in R.
• Open source: R is open source technology, so it is very easy to integrate with other applications.
• Community support: R has the community support of leading statisticians and data scientists from different parts of the world, and it is growing rapidly.

So, most of the development of the R language is done with data science and statistics in mind. As a result, R has become a default choice for data science applications and data science professionals.

Find Data

If you’re doing data science as part of your job, you’ll most


likely get the data as part of your job (although not necessarily).
What if you’re doing data science for fun? Data is everywhere,
but here are some starting points:

Data sets for data visualization projects:
1. FiveThirtyEight (makes data available on GitHub)
2. BuzzFeed (makes data available on GitHub)
3. Socrata Open Data

For larger data sets:
4. AWS public data sets
5. Google Public Data Sets
6. Wikipedia data sets

For machine learning projects:
7. Kaggle Data Sets
8. UCI Machine Learning Repository
9. Quandl

Data sets for data cleaning projects:
10. data.world
11. Data.gov
12. The World Bank Data Sets
13. The reddit /r/datasets
14. Academic Torrents

For streaming data:
15. Twitter
16. GitHub
17. Quantopian
18. Wunderground

Practicing Data Science

Python is a very easy language to get started with and is among the most renowned programming languages. To get a hold of actual data science, it is always a good idea to work on real, interesting projects. A typical data science project flows from defining the problem and collecting data, through cleaning and exploring it, to building and evaluating models and communicating the results.
Some of the ideas you can work through are:
1. Learning to mine Twitter
This is a simple project for beginners and is useful for understanding the importance of data science. When doing this project you will come to know what is trending: it could be a viral news story being discussed, politics, or some new movie. It will also teach you to integrate an API into scripts for accessing information on social media, and it exposes the challenges faced in mining social media.
2. Identify the Digits data set
This is about training your program to recognize different digits, a problem known as digit recognition. It is similar to a camera detecting faces. The data set has as many as 7,000 images of size 28 x 28, making it about 31 MB.
3. Loan Prediction data set
The biggest user of data science among industries is insurance; it puts lots of analytics and data science methods to use. In this problem we are provided with enough information to work on the data sets of insurance companies: the challenges to be faced, the strategies to be used, the variables that influence the outcome, and so on. It is a classification problem with 615 rows and 13 columns.
4. Credit Card Fraud Detection
This is a classification problem where we are supposed to classify whether the transactions taking place on a card are legal or illegal. It does not have a huge data set, since banks do not reveal their customer data due to privacy constraints.
5. Titanic data set from Kaggle
This data set provides a good understanding of what a typical data science project will involve. Starters can work on the data set in Excel, while professionals can use advanced tools to extract hidden information and algorithms to substitute some of the missing values in the data set.

Thank you!
If you enjoyed this book and felt that it added value to your life, we ask that you please take the time to review it on Amazon.
Your honest feedback would be greatly appreciated. It really does make a difference.
If you noticed any problem, please let us know by
sending us an email at [email protected] before
writing any review online. It will be very helpful for us
to improve the quality of our books.

We are a very small publishing company and our


survival depends on your reviews.
Please, take a minute to write us an honest review.

If you want to help us produce more material like this, then please leave an honest review on Amazon. It really does make a difference.

https://www.amazon.com/dp/B07FTPKJMM

