AISCIENCES - Data Science Cookbook - V0
PYTHON DATA SCIENCE COOKBOOK
AI Sciences Publishing
How to contact us
Please address comments and questions concerning this book
to our customer service by email at:
[email protected]
Table of Contents
Tuples ............................................................................................ 35
Dictionaries ................................................................................... 35
Defaultdict:.................................................................................... 38
Sets ................................................................................................. 40
Control Flow................................................................................. 40
Truthiness...................................................................................... 43
Moving ahead ............................................................................. 44
Sorting............................................................................................ 45
List Comprehensions .................................................................. 46
Randomness .................................................................................. 47
Regular Expressions .................................................................... 48
Object-Oriented Programming .................................................. 49
Class ............................................................................................... 49
Object ............................................................................................ 50
Method .......................................................................................... 50
Polymorphism .............................................................................. 51
Encapsulation: .............................................................................. 51
Enumerate ..................................................................................... 53
Zip .................................................................................................. 54
Args ................................................................................................ 54
Visualizing Data ............................................................ 56
Matplotlib ................................................................................... 56
Bar Charts................................................................................... 58
Line Charts ................................................................................. 59
Scatterplots .................................................................................. 61
Linear Algebra ............................................................... 63
Vectors 63
Matrices ...................................................................................... 65
Statistics ......................................................................... 66
Data in Statistics......................................................................... 66
Measures of central tendencies .................................................. 66
Dispersion .................................................................................. 67
Covariance .................................................................................. 68
Correlation .................................................................................. 69
Probability .................................................................................. 70
Dependence and Independence ................................................. 70
Conditional Probability ............................................................... 71
Bayes’ Theorem .......................................................................... 72
Random Variables ...................................................................... 73
Continuous Distributions ........................................................... 74
Distribution plot of the above graph .......................................... 75
The Normal Distribution ........................................................... 75
Finding APIs ................................................................................. 93
Getting Credentials .................................................................... 93
Generating Response ................................................................ 121
Evaluating Accuracy ................................................................. 122
Main Elements .......................................................................... 123
The Curse of Dimensionality .................................................... 124
Step Number 3 ........................................................................... 161
Step Number 4 ........................................................................... 162
Implementation using Python................................................... 162
Bottom-up Hierarchical Clustering........................................... 168
DELETE..................................................................................... 209
SELECT ...................................................................................... 209
GROUP BY ................................................................................ 210
ORDER BY ................................................................................ 210
Indexes......................................................................................... 211
Query Optimization .................................................................. 213
NoSQL 214
MapReduce ............................................................................... 215
Why MapReduce? ...................................................................... 216
MapReduce More Generally .................................................... 216
Python MapReduce Code ......................................................... 218
Reduce step: reducer.py ............................................................ 219
Go Forth and Do Data Science ................................... 223
IPython 223
Mathematics ..............................................................................224
Not from Scratch .......................................................................225
NumPy ......................................................................................... 225
Pandas .......................................................................................... 226
Scikit-learn ................................................................................... 226
Visualization................................................................................ 226
R.................................................................................................... 227
Find Data..................................................................................... 228
Practicing Data Science ............................................................229
Do you want to discover, learn and understand the methods
and techniques of artificial intelligence, data science,
computer science, machine learning, deep learning or
statistics?
Would you like to have books that you can read very fast and
understand very easily?
Would you like to practice AI techniques?
If the answers are yes, you are in the right place. The AI
Sciences book series is perfectly suited to your expectations!
Our books are the best on the market for beginners,
newcomers, students and anyone who wants to learn more
about these subjects without going into too much theoretical
and mathematical detail. Our books are among the best sellers
on Amazon in the field.
About Us
Our books have had phenomenal success and they are today
among the best sellers on Amazon. Our books have helped
many people to progress and, above all, to understand these
techniques, which are sometimes, rightly or wrongly, considered complicated.
The books we produce are short and very pleasant to read. These
books focus on the essentials so that beginners can quickly
understand and practice effectively. You will never regret
having chosen one of our books.
We also offer completely free books on our website: visit our
site and subscribe to our email list at www.aisciences.net.
By subscribing to our mailing list, you will also receive all our new
books for free, on an ongoing basis.
To Contact Us:
Website: www.aisciences.net
Email: [email protected]
Follow us on social media and share our publications
Facebook: @aisciencesllc
LinkedIn: AI Sciences
From AI Sciences Publishing
WWW.AISCIENCES.NET
EBooks, free offers of eBooks and online learning courses.
Did you know that AI Sciences offers free eBook versions of
every book published? Please subscribe to our email list to be
informed about our free eBook promotions. Get in touch with us
at [email protected] for more details.
WWW.AISCIENCES.NET
Did you know that AI Sciences also offers online courses?
We want to help you in your career and take control of your
future with powerful and easy-to-follow courses in Data
Science, Machine Learning, Deep Learning, Statistics and all
Artificial Intelligence subjects.
Preface
In the past ten years, Data Science has quietly grown to include
businesses and organizations world-wide. It is now being used
by governments, geneticists, engineers, and even astronomers.
Technically, this includes machine translation, robotics, speech
recognition, the digital economy, and search engines. In terms
of research areas, Data Science has expanded to include the
biological sciences, health care, medical informatics, the
humanities, and social sciences. Data Science now influences
economics, governments, and business and finance.
Book Objectives
Have an appreciation for data science and an
understanding of its fundamental principles.
Have an elementary grasp of data science concepts and
algorithms.
Achieve a technical background in data science.
Target Users
© Copyright 2017 by AI Sciences
All rights reserved.
First Printing, 2016
ISBN-13: 978-1986318471
ISBN-10: 1986318478
Legal Notice:
You cannot amend, distribute, sell, use, quote or paraphrase any part
of the content within this book without the consent of the author.
Disclaimer Notice:
Introduction
To describe Data Science in a single sentence: it is the study
of where information comes from, what it represents and how
it can be turned into a valuable resource when shaping
business and IT strategies.
History
There is a wide range of dates and timelines that can be used
to trace the gradual development of Data Science and its
present impact on the data management industry; some of the
more significant ones are outlined below.
Although data science isn’t a new profession, it has evolved
considerably over the last 50 years. If we look into the history
of data science it reveals a long and winding path that began as
early as 1962 when mathematician John W. Tukey predicted
the effect of modern-day electronic computing on data analysis
as an empirical science.
Yet, the data science of today is very different from the one
that Tukey imagined. Tukey did not foresee big data or the
ability to perform complex, large-scale analyses. It wasn't
until 1964 that the first desktop computer—
Programma 101—was launched to the public at the New York
World’s Fair. Any analyses that took place were far more
elementary than the ones that are possible today.
Data Science Illuminated
Definition
First, Data analytics, the trend of applying data science's
practices and tools in the business world.
Second, Internet of things (IOT), the trend of connecting
devices and sensors via the cloud, which is generating
massive streams of data to be analyzed.
Third, Big data, a trend of creating tools and systems able
to store and process these enormous data sets at scale.
Fourth, Machine learning, a trend in artificial intelligence of
teaching machines to solve problems without explicitly
being programmed to do so. Machines are able to make decisions
and predictions by identifying statistical patterns in
these massive data sets.
All four of these trends are converging to create fully
autonomous, intelligent systems: machines capable of
acting rationally within their environment and learning how
to optimize their performance over time without any
human intervention.
As a result, data science has now become a cost-effective
strategy for answering questions, making decisions, and
predicting outcomes in a wide variety of scenarios in our
world. Given this trend, it's unlikely that the demand for
data science will decrease any time in the near future.
Since the identification and analysis of large amounts of
unstructured data can prove complex, expensive and
time-consuming for companies, data science is still an evolving
field in this area.
Machine Learning:
Machine learning is a subfield of artificial intelligence based on
statistics. It involves machines learning how to complete tasks
without being explicitly programmed to do so. This topic is
explained in more detail later.
In general, a subject-matter expert (SME) or domain expert is
a person who is a specialist in a particular area or topic. An
SME should also have basic knowledge of other technical
subjects. In Data Science, an SME provides industry- or
process-specific context for what the patterns identified by
the algorithms and models mean.
Individuals who master all three skill sets are also called
unicorn data scientists. Despite how rare unicorn data
scientists are, they are rapidly growing in demand. In addition,
there doesn't appear to be any end in sight for the growth of
this demand. As a result, in the very near future this specific
set of skills will be in high demand, whether you're a data
scientist or applying data science practices to your current job
role. The rarity of data scientists combined with their high
demand leads to much higher salaries for data scientists and
IT professionals with similar skills.
A Basic Course in Python
Getting Started
Getting Python
PEP and the Zen of Python
Whitespace Formatting
Whitespace is ignored inside parentheses and brackets, which
can be helpful for long-winded computations.
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7
+ 8 + 9 + 10 + 11 + 12 +
13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)
Modules
import re
my_regex = re.compile("[0-9]+", re.I)
import re as regex
my_regex = regex.compile("[0-9]+", regex.I)
You can also import the entire contents of a module into your
script, which might silently overwrite variables you have
already defined and pull in functions you do not need:
match = 10
from re import *   # re has a match function
print(match)       # "<function re.match>" instead of 10
Scope of a variable
This implies that local variables are accessible only inside the
function in which they are declared, whereas global variables
can be accessed from anywhere in the program body by all
functions.
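As a brief illustration (a minimal sketch; the names are chosen just for this example):
x = 10            # global variable

def show_scope():
    y = 5         # local variable, only visible inside this function
    print(x + y)  # the global x can be read here; prints 15

show_scope()
print(x)          # prints 10
# print(y)        # would raise a NameError, since y only exists inside show_scope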
Arithmetic Operators
Arithmetic operators combine a left-hand operand and a right-hand operand. For example, ** performs an exponential (power) calculation on its operands: x**y means x raised to the power y, so 10**20 is 10 to the power 20.
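The other arithmetic operators behave as you would expect; a quick sketch:
x, y = 10, 3
print(x + y)    # 13          addition
print(x - y)    # 7           subtraction
print(x * y)    # 30          multiplication
print(x / y)    # 3.3333...   true division
print(x // y)   # 3           integer (floor) division
print(x % y)    # 1           remainder (modulus)
print(x ** y)   # 1000        exponentiation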
Functions
def double(x):
    """this is where you put an optional docstring that explains what the
    function does. for example, this function multiplies its input by 2
    and returns the result"""
    return x * 2

def apply_to_one(f):
    """calls the function f with 1 as its argument"""
    return f(1)

my_double = double            # refers to the previously defined function
x = apply_to_one(my_double)   # equals 2
The following example defines an add function as a lambda and
returns the sum of its two arguments:
f = lambda x, y: x + y
f(1, 1)   # equals 2
Strings
my_string = 'Python'
my_string + my_string   # 'PythonPython'
my_string * 3           # 'PythonPythonPython'
print("He said, \"What's there?\"")
print(para_str)
Exceptions
The words "try" and "except" are Python keywords that are
used to handle exceptions. The code which may cause
exception is placed under the try block and the code to handle
the exception is placed under the except block.
try:
    print("Starting of try")
    print(1/0)
    print("Ending of try")
except ZeroDivisionError:
    print("Division by zero is not possible")
finally:
    print("End of program")
Output
Starting of try
Division by zero is not possible
End of program
Lists
int_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ int_list, heterogeneous_list, [] ]
list_length = len(int_list) # equals 3
list_sum = sum(int_list) # equals 6
You can get or set the nth element of a list with square
brackets:
You can also access the list elements in reverse order using
negative index:
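For example:
x = [10, 20, 30, 40, 50]
x[0]        # 10, lists are zero-indexed
x[2] = 99   # x is now [10, 20, 99, 40, 50]
x[-1]       # 50, the last element
x[-2]       # 40, the second-to-last element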
You can also delete individual list elements by index and the
entire list using the del keyword:
You can also use square brackets with the slicing operator " : "
to “slice” lists:
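For example:
x = [0, 1, 2, 3, 4, 5]
del x[0]   # x is now [1, 2, 3, 4, 5]
x[1:3]     # [2, 3]     elements at indexes 1 and 2
x[:2]      # [1, 2]     the first two elements
x[2:]      # [3, 4, 5]  everything from index 2 onward
del x      # the whole list is gone; using x afterwards raises a NameError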
1 in [1, 2, 3, 4, 5] # True
0 in [1, 2, 3, 4, 5] # False
This check involves examining the elements of the list one at a
time, which means that you probably shouldn’t use it unless
you know your list is pretty small or unless you don’t care how
long the check takes.
z = [1, 2, 3]
z.extend([4, 5, 6])   # z is now [1, 2, 3, 4, 5, 6]

z = [1, 2, 3]
y = z + [4, 5, 6]     # y is [1, 2, 3, 4, 5, 6]; z is unchanged

z = [1, 2, 3]
z.append(0)           # z is now [1, 2, 3, 0]
y = z[-1]             # equals 0
l = len(z)            # equals 4
Tuples
Tuples are similar to lists except that they are immutable which
means that they cannot be modified once declared. To change
a tuple you have to replace it entirely or create it once again
after deleting it. You specify a tuple by using parentheses
instead of square brackets:
my_first_tuple = (1, 2)
my_first_tuple[1] = 3   # raises a TypeError, because you cannot modify a tuple
Dictionaries
empty_dict = {}                         # creates an empty dictionary
grades = { "James" : 80, "Tim" : 95 }   # dictionary literal
You can look up the value for a key using square brackets:
But you will get a KeyError if you ask for a key that is not in
the dictionary:
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")
You can check for the existence of a key using the in operator:
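For instance, using the grades dictionary defined above:
james_grade = grades["James"]   # equals 80
"James" in grades               # True
"Kate" in grades                # False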
kates_grade = grades.get("Kate", 0)    # equals 0
no_ones_grade = grades.get("No One")   # equals None, since the default value is None
You can assign new key-value pairs or update the existing ones
using the square brackets:
tweet = {
"user" : "John",
"text" : "Data Science is Awesome",
"retweet_count" : 100,
"hashtags" : ["#data", "#science", "#datascience",
"#awesome", "#yolo"]
}
tweet_keys = tweet.keys()       # list of keys
tweet_values = tweet.values()   # list of values
tweet_items = tweet.items()     # list of (key, value) tuples

"user" in tweet_keys            # True, but uses the slow list in
"user" in tweet                 # True, uses the faster dict in
"John" in tweet_values          # True
Defaultdict:
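A defaultdict is like a regular dictionary, except that when you look up a key it does not contain, it first adds a value for that key using a zero-argument function you supplied when you created it. A short sketch:
from collections import defaultdict

word_counts = defaultdict(int)          # int() produces 0
document = ["data", "science", "data"]
for word in document:
    word_counts[word] += 1              # missing keys start at 0
# word_counts is now {'data': 2, 'science': 1}

dd_list = defaultdict(list)             # list() produces an empty list
dd_list[2].append(1)                    # dd_list now contains {2: [1]}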
Counter:
A Counter is a container that keeps track of how many times
equivalent values are added. It turns a sequence of values into
a defaultdict(int)-like object mapping keys to counts, with a
single entry per distinct value.
from collections import Counter

c = Counter([0, 1, 2, 0])         # c is (basically) { 0 : 2, 1 : 1, 2 : 1 }
word_counts = Counter(document)   # counts the words in the document list above
Sets
my_set = set()
my_set.add(1)       # my_set is now { 1 }
my_set.add(2)       # my_set is now { 1, 2 }
my_set.add(2)       # my_set is still { 1, 2 }
x = len(my_set)     # equals 2
y = 2 in my_set     # equals True
z = 3 in my_set     # equals False
Control Flow
x = 90
if x < 0:
    print('Number is Negative')
elif x == 0:
    print('Number is Zero')
else:
    print('Number is Positive')
Output: Number is Positive
There can be zero or more elif parts, and the else part is
optional. The keyword ‘elif’ is short for ‘else if’, and is useful
to avoid excessive indentation.
You can also write a ternary if-then-else on one line, which is
quite useful occasionally:
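For example, using the x defined above:
parity = "even" if x % 2 == 0 else "odd"   # "even", since x is 90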
for i in range(10):
    print(i)
Output: 0 1 2 3 4 5 6 7 8 9
for i in range(5):
    print(i)
Output: 0 1 2 3 4
for x in range(10):
    if x == 2:
        continue   # go immediately to the next iteration
    if x == 5:
        break      # quit the loop entirely
    print(x)
while True:
    pass   # busy-wait for a keyboard interrupt (Ctrl+C)
Truthiness
z = None
print(z == None)   # prints True
print(z is None)   # prints True
• False
• None
• [] (an empty list)
• {} (an empty dict)
• "" (an empty string)
• set()(an empty set)
• 0
• 0.0
Pretty much anything else gets treated as True. This allows you
to easily use if statements to test for empty lists or empty
strings or empty dictionaries or so on.
Python has an all function, which takes a list and returns True
precisely when every element is truthy, and an any function,
which returns True when at least one element is truthy:
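For example:
all([True, 1, {3}])   # True
all([True, 1, {}])    # False, {} is falsy
any([True, 1, {}])    # True
all([])               # True, no falsy elements in the list
any([])               # False, no truthy elements in the list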
Moving ahead
Sorting
Every Python list has a sort method that sorts it in place. If you
do not want to modify the original list, you can use the sorted
function, which returns a new list:
y = [4, 1, 2, 3]
z = sorted(y)   # z is [1, 2, 3, 4], y is unchanged
y.sort()        # now y is [1, 2, 3, 4]
List Comprehensions
L = [x ** 2 for x in range(10)]
print(L)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Similarly:
M = [x for x in L if x % 2 == 0]
print(M)
[0, 4, 16, 36, 64]
Randomness
import random

four_uniform_randoms = [random.random() for _ in range(4)]
# random.random() produces numbers uniformly between 0 and 1
There are a few more methods that we will sometimes find
convenient. random.shuffle randomly reorders the elements of
a list:
up_to_ten = list(range(10))
random.shuffle(up_to_ten)
print(up_to_ten)
# [2, 5, 1, 9, 7, 3, 8, 6, 4, 0] (your results will probably be different)
If you need to randomly pick one element from a list you can
use random.choice:
my_friend = random.choice(["Alice", "Bob", "Charlie"])   # "Bob" for me
Regular Expressions
import re
print(all([                                  # all of these are true, because
    not re.match("a", "bat"),                # 'bat' doesn't start with 'a'
    re.search("a", "bat"),                   # 'bat' has an 'a' in it
    not re.search("c", "bat"),               # 'bat' doesn't have a 'c' in it
    3 == len(re.split("[ab]", "carbs")),     # split on a or b to ['c','r','s']
    "R-D-" == re.sub("[0-9]", "-", "R2D2")   # replace digits with dashes
]))                                          # prints True
Object-Oriented Programming
Object
Method
Polymorphism
Encapsulation:
# by convention, we give classes PascalCase names
class Set:
    # these are the member functions
    # every one takes a first parameter "self" (another convention)
    # that refers to the particular Set object being used
    def __init__(self, values=None):
        """This is the constructor.
        It gets called when you create a new Set.
        You would use it like
        s1 = Set()               # empty set
        s2 = Set([1,2,2,3,4,4])  # initialize with values"""
        self.dict = {}   # each instance of Set has its own dict property
                         # which is what we'll use to track memberships
        if values is not None:
            for value in values:
                self.add_element(value)

    def __repr__(self):
        """this is the string representation of a Set object
        if you type it at the Python prompt or pass it to str()"""
        return "Set: " + str(self.dict.keys())

    # we'll represent membership by being a key in self.dict with value True
    def add_element(self, value):
        self.dict[value] = True

    # value is in the Set if it's a key in the dictionary
    def contains_element(self, value):
        return value in self.dict

    def remove_element(self, value):
        del self.dict[value]
Enumerate
Sometimes you will want to iterate over a list and use both its
elements and their indexes:
The Pythonic solution is enumerate, which produces tuples
(index, element):
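For example:
names = ["Alice", "Bob", "Charlie"]
for i, name in enumerate(names):
    print(i, name)    # prints 0 Alice, then 1 Bob, then 2 Charlie

# if we only want the indexes:
for i, _ in enumerate(names):
    print(i)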
Remember that a single underscore (_) is conventionally used as a
variable name whenever we intend to ignore the value.
Zip
If the lists are different lengths, zip stops as soon as the first list
ends.
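zip transforms multiple lists into a single list of tuples of corresponding elements. For example (wrapping the result in list() so it can be printed):
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3, 4]
list(zip(list1, list2))   # [('a', 1), ('b', 2), ('c', 3)] -- the extra 4 is dropped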
Args
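Functions can accept an arbitrary number of positional and keyword arguments via *args and **kwargs; a minimal sketch (the argument values are chosen only for illustration):
def magic(*args, **kwargs):
    print("unnamed args:", args)     # a tuple of the positional arguments
    print("keyword args:", kwargs)   # a dict of the named arguments

magic(1, 2, key="word", key2="word2")
# unnamed args: (1, 2)
# keyword args: {'key': 'word', 'key2': 'word2'}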
Visualizing Data
Matplotlib
Python has a library known as Matplotlib, which produces a variety
of graphs and other visual representations across platforms. It is a
2D plotting library and can be used in any Python script. With just a
small amount of code it is possible to generate bar charts, histograms,
error charts and even power spectra and scatterplots.
The following examples use the matplotlib.pyplot module. In its
simplest use, pyplot maintains an internal state in which you build
up a visualization gradually, and for simple bar charts, line charts, and
scatterplots it works pretty well. After the graphic has been
generated, you can either save it or display it.
The following code generates the kind of graph shown next, called a line chart:
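A minimal sketch of a line chart (the GDP figures below are illustrative values used only to have something to plot):
import matplotlib.pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart: years on the x-axis, gdp on the y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.show()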
Bar Charts
A bar chart is a graph with rectangular bars. The graph usually shows
a comparison between different categories. In other words, the
length or height of the bar is equal to the quantity within that
category.
For instance, the chart produced by the code below shows how many
Academy Awards were won by each of a variety of movies:
import matplotlib.pyplot as plt

# example data: a few movies and the number of Academy Awards each won
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]

# bars are 0.8 wide by default, so adding 0.1 to the left x-coordinates centers them
xs = [i + 0.1 for i, _ in enumerate(movies)]

# plot bars with left x-coordinates [xs], heights [num_oscars]
plt.bar(xs, num_oscars)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")

# label x-axis with movie names at bar centers
plt.xticks([i + 0.5 for i, _ in enumerate(movies)], movies)
plt.show()
Line Charts
These are a good choice for showing trends.
Scatterplots
import matplotlib.pyplot as plt

friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),   # put the label with its point
                 xytext=(5, -5),                    # but slightly offset
                 textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
Linear Algebra
Vectors
We can also get access to the vector array to use it with other
libraries.
v1.vector # [1, 2, 3]
v1.magnitude() # 3.7416573867739413
Cross Product: We can find the cross product of two vectors.
v1.cross(v2) # Vector(0, 0, 0)
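The same operations are available out of the box in NumPy; a quick sketch with two example vectors:
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

v1 + v2              # array([5, 7, 9])     element-wise addition
np.dot(v1, v2)       # 32                    dot (scalar) product
np.linalg.norm(v1)   # 3.7416573867739413    magnitude of v1
np.cross(v1, v2)     # array([-3,  6, -3])   cross product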
Matrices
For example, a matrix can be represented as a list of rows:
B = [[80, 75, 85, 90, 95],
     [75, 80, 75, 85, 100],
     [80, 80, 80, 90, 95]]
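With NumPy, the same matrix becomes a 2-D array whose shape, rows and columns are easy to work with (a small sketch):
import numpy as np

B = np.array([[80, 75, 85, 90, 95],
              [75, 80, 75, 85, 100],
              [80, 80, 80, 90, 95]])

B.shape     # (3, 5): 3 rows, 5 columns
B[0]        # first row:     array([80, 75, 85, 90, 95])
B[:, 1]     # second column: array([75, 80, 80])
B.T.shape   # (5, 3): the transpose swaps rows and columns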
Statistics
Data in Statistics
Our measures of central tendency are the mean, median and mode.
The mean is the average, computed as the sum of all the observed
outcomes from the sample divided by the total number of observations.
Median is the middle value when the data values have been sorted
(or the average of the 2 middle values if there are an even number of
data values).
Mode is the data value(s) that occur with the greatest frequency.
The numpy library, which will be explained later, gives us pre-defined
functions to calculate all three:
import numpy as np
from statistics import mode

num_list = [1, 2, 3, 3, 5, 3]
mean_list = np.mean(num_list)       # 2.833
median_list = np.median(num_list)   # 3.0
mode_list = mode(num_list)          # 3
Dispersion
For the study of dispersion, we need some measures which show
whether the dispersion is small or large. Measures like standard
deviation and variance give us an idea about the amount of
dispersion in a set of observations.
import math
import numpy as np

def sum_of_squares(s):
    """computes the sum of squared elements in s"""
    return sum(s_i ** 2 for s_i in s)

z = [1, 2, 3, 4, 5]
n = len(z)

# translate z by subtracting its mean (so the result has mean 0)
z_bar = np.mean(z)
deviations = [z_i - z_bar for z_i in z]

variance1 = sum_of_squares(deviations) / (n - 1)
standard_deviation1 = math.sqrt(variance1)
Covariance
Covariance is a measure of how much two random variables vary
together from their mean. It’s similar to variance, but where variance
tells you how a single variable varies, covariance tells you how two
variables vary together.
The lines below show how to calculate the covariance of two lists z and y,
using de-meaned copies of each list (built the same way as deviations above):
n = len(z)
covariance = np.dot(deviations_z, deviations_y) / (n - 1)   # deviations_* = values minus their list's mean
Correlation
The lines below show how to calculate correlation:
stdev_a = standard_deviation(a)
stdev_b = standard_deviation(b)
correlation = covariance(a, b) / stdev_a / stdev_b
Probability
In general:
Probability of an event happening = Number of ways it can happen
/ Total number of outcomes
For Example: Where you work has no effect on what color car you
drive.
When two events are independent, one event does not influence the
probability of another event.
When two events are dependent events, one event influences the
probability of another event. In other words two events are
dependent if the outcome or occurrence of the first affects the
outcome or occurrence of the second so that the probability is
changed.
Conditional Probability
The conditional probability of an event B is the probability that the
event will occur given the knowledge that an event A has already
occurred. This probability is written P(B|A), notation for the
probability of B given A.
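Formally, the conditional probability is computed as
P(B|A) = P(A and B) / P(A), provided P(A) > 0.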
In the case where events A and B are independent, the conditional
probability of event B given event A is simply the probability of
event B, that is P(B).
P(B|A) = P(B)
Bayes’ Theorem
Bayes' Theorem tells us how often A happens given that B happens,
written P(A|B), when we know how often B happens given that A
happens, written P(B|A), how likely A is on its own, written P(A),
and how likely B is on its own, written P(B).
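Putting these together, Bayes' Theorem states:
P(A|B) = P(B|A) × P(A) / P(B)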
For example, let us say P(Fire) means how often there is fire and
P(Smoke) means how often we see smoke. Then the probability of fire
given that we see smoke is P(Fire|Smoke) = P(Smoke|Fire) × P(Fire) / P(Smoke).
Random Variables
For example, the sum of three rolled dice is a random variable that can
take values between 3 and 18, since the highest number on a die is 6
and the lowest is 1.
Further, there are two types of random variables: a discrete random
variable represents numbers found by counting, for example the
number of chocolates in a jar, while a continuous random variable
represents an infinite number of values on the number line, for
example the distance traveled while delivering mail.
Continuous Distributions
Distribution plot of the above graph
smaller percentage of students score an F or an A. This creates
a distribution that resembles a bell which is why this is known
as the bell shaped curve. This curve is symmetrical in nature.
Half of the data will fall to the left of the mean; half will fall to
the right.
Many groups follow this type of pattern. That's why the normal
distribution is widely used in business, statistics and by government
bodies like the FDA.
The empirical rule tells you what percentage of your data falls
within a certain number of standard deviations from the mean:
• 68% of the data falls within one standard deviation of
the mean.
• 95% of the data falls within two standard deviations of
the mean.
• 99.7% of the data falls within three standard deviations of
the mean.
The standard deviation controls the spread of the distribution.
A smaller standard deviation indicates that the data is tightly
clustered around the mean; the normal distribution will be
taller. A larger standard deviation indicates that the data is
spread out around the mean; the normal distribution will be
flatter and wider.
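As a quick check of the empirical rule (a sketch using scipy.stats; the percentages are properties of the normal distribution itself, not of any particular data set):
from scipy.stats import norm

# fraction of a normal distribution within k standard deviations of the mean
for k in [1, 2, 3]:
    fraction = norm.cdf(k) - norm.cdf(-k)
    print(k, round(fraction, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973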
Hypothesis and Inference
In particular, our test will involve flipping the coin some
number n times and counting the number of heads X. Each
coin flip is a Bernoulli trial, which means that X is a
Binomial(n,p) random variable, which we can approximate
using the normal distribution:
def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds
    that contain the specified probability"""
    tail_probability = (1 - probability) / 2

    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)

    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)

    return lower_bound, upper_bound
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
Assuming p really equals 0.5 (i.e., H0 is true), there is just a 5%
chance we observe an X that lies outside this interval, which is the
exact significance we wanted. Said differently, if H0 is true, then
approximately 19 times out of 20 this test will give the correct result.
We are also often interested in the power of a test, which is the
probability of not making a type 2 error, in which we fail to reject
H0 even though it’s false. In order to measure this, we have to
specify what exactly H0 being false means. (Knowing merely
that p is not 0.5 doesn’t give you a ton of information about the
distribution of X.) In particular, let’s check what happens if p
is really 0.55, so that the coin is slightly biased toward heads.
For our two-sided test of whether the coin is fair, we compute:
One way to convince yourself that this is a sensible estimate is
with a simulation:
import random

extreme_value_count = 0
for _ in range(100000):
    num_heads = sum(1 if random.random() < 0.5 else 0   # count # of heads
                    for _ in range(1000))                # in 1000 flips
    if num_heads >= 530 or num_heads <= 470:             # and count how often
        extreme_value_count += 1                         # the # is 'extreme'

print(extreme_value_count / 100000)   # 0.062
Gradient Descent
Here, we see that we update the parameters by taking the gradient
with respect to the parameters and multiplying it by a learning rate,
which is essentially a constant number controlling how fast we move
toward the minimum. The learning rate is a hyper-parameter and its
value should be chosen with care.
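In symbols, the vanilla update for a parameter vector θ with learning rate α and cost function J is:
θ ← θ − α ∇θ J(θ)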
Here is the main code for defining vanilla gradient descent, as a
minimal Theano-style sketch of the update step (assuming cost, params
and learning_rate are already defined, and theano.tensor is imported as T):
def gradient_descent_updates(cost, params, learning_rate):
    grads = T.grad(cost=cost, wrt=params)    # gradient of the cost w.r.t. each parameter
    updates = [(p, p - learning_rate * g)    # move each parameter a step against its gradient
               for p, g in zip(params, grads)]
    return updates
Stochastic Gradient Descent
When this is the case, we can instead apply a technique called
stochastic gradient descent, which computes the gradient (and
takes a step) for only one point at a time. It cycles over our data
repeatedly until it reaches a stopping point. During each cycle,
we’ll want to iterate through our data in a random order:
def in_random_order(data):
    """generator that returns the elements of data in random order"""
    indexes = [i for i, _ in enumerate(data)]   # create a list of indexes
    random.shuffle(indexes)                     # shuffle them
    for i in indexes:                           # return the data in that order
        yield data[i]
def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    data = list(zip(x, y))
    theta = theta_0                              # initial guess
    alpha = alpha_0                              # initial step size
    min_theta, min_value = None, float("inf")    # the minimum so far
    iterations_with_no_improvement = 0

    # stop if we ever go 100 iterations with no improvement
    while iterations_with_no_improvement < 100:
        value = sum(target_fn(x_i, y_i, theta) for x_i, y_i in data)

        if value < min_value:
            # if we've found a new minimum, remember it
            # and go back to the original step size
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # otherwise we're not improving, so try shrinking the step size
            iterations_with_no_improvement += 1
            alpha *= 0.9

        # and take a gradient step for each of the data points
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))

    return min_theta

# called as: minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0)
Getting Data
Without data, it is impossible to be a data scientist. In fact, as
a data scientist you will spend most of your time acquiring,
cleaning, and transforming data. You could always type the data
in yourself, but usually this is not a good use of your time. So
let us have a look at how you can read data into Python.
stdin and stdout
If you run your Python scripts at the command line, you can
pipe data through them using sys.stdin and sys.stdout. For
example, here is a script that reads in lines of text and spits
back out the ones that match a regular expression:
# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)
We can then use these to count how many lines of a file contain
numbers. In Windows, we would use type SomeFile.txt |
python egrep.py "[0-9]" | python line_count.py
Whereas in a UNIX system we would use:
cat SomeFile.txt | python egrep.py "[0-9]" | python
line_count.py
The | is the pipe character, which means “use the output of
the left command as the input of the right command.” You can
build pretty elaborate data-processing pipelines this way.
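line_count.py is assumed here to simply count the lines piped into it; a minimal version could look like:
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)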
Reading Files
file.readline() – This method reads a single line from the text file
(an optional size argument n limits how many characters are read).
Closing a file
file.close()
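A short sketch of the usual pattern (the filename is just a placeholder); a with block closes the file automatically, so an explicit file.close() is only needed when you manage the file object yourself:
# explicit open/close
f = open('some_file.txt', 'r')
first_line = f.readline()   # read one line
rest = f.read()             # read the remainder of the file
f.close()

# the preferred pattern: 'with' closes the file for you
with open('some_file.txt', 'r') as f:
    for line in f:          # iterate over the file line by line
        print(line.strip())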
Delimited Files
For example, if we had a tab-delimited file of stock prices:
6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5
6/19/2014 AAPL 91.86
6/19/2014 MSFT 41.51
6/19/2014 FB 64.34
import csv

with open('tab_delimited_stock_prices_file.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date, symbol, closing_price)
Using APIs
Many websites and web services provide application
programming interfaces (APIs), which allow you to explicitly
request data in a structured format.
import json

serialized = """{ "title" : "Data Science Book",
                  "author" : "James",
                  "publicationYear" : 2014,
                  "topics" : [ "data", "science", "python", "data science"] }"""

# parse the JSON to create a Python dict
deserialized = json.loads(serialized)
if "data science" in deserialized["topics"]:
    print(deserialized)
Finding APIs
There are two directories namely Python API and Python for
Beginners. They can be useful if you are looking for lists of
APIs that have Python wrappers. If you want a directory of
web APIs more broadly (without Python wrappers
necessarily), a good resource is Programmable Web, which has
a huge directory of categorized APIs.
Example: Using the Twitter APIs
To interact with the Twitter APIs we will be using the Twython
library (pip install twython). There are quite a few Python
Twitter libraries out there.
Getting Credentials
5. Agree to the Terms of Service and click Create.
6. Take note of the consumer key and consumer secret.
7. Click on “Create my access token.”
8. Take note of the access token and access token secret (you
may have to refresh the page).
Working around Data
Exploring Data
Two Dimensions
Now imagine you have a data set with two dimensions. Maybe
in addition to daily minutes you have years of data science
experience. Of course, you’d want to understand each
dimension individually. However, you probably also want to
scatter the data.
For example, consider another fake data set:
def random_normal():
    """returns a random draw from a standard normal distribution"""
    return inverse_normal_cdf(random.random())

xs = [random_normal() for _ in range(1000)]
ys11 = [ x + random_normal() / 2 for x in xs]
ys22 = [-x + random_normal() / 2 for x in xs]
Many Dimensions
With many dimensions, you would like to know how all the
dimensions relate to one another. A simple approach is to look
at the correlation matrix, in which the entry in row i and
column j is the correlation between the ith dimension and the
jth dimension of the data:
def correlation_matrix(data):
    """returns the num_columns x num_columns matrix whose (i, j)th entry
    is the correlation between columns i and j of data"""
    _, num_columns = shape(data)

    def matrix_entry(i, j):
        return correlation(get_column(data, i), get_column(data, j))

    return make_matrix(num_columns, num_columns, matrix_entry)
Cleaning and Munging
return f_or_none
after which we can rewrite parse_row to use it:
def parse_row(input_row, parsers):
    return [try_or_none(parser)(value) if parser is not None else value
            for value, parser in zip(input_row, parsers)]
Manipulating Data
data = [
{'closing_price': 102.06,
'date': datetime.datetime(2014, 8, 29, 0, 0),
'symbol': 'AAPL'},
# ...
]
from collections import defaultdict

by_symbol = defaultdict(list)   # group the rows by stock symbol
for row in data:
    by_symbol[row["symbol"]].append(row)
# use a dict comprehension to find the max for each symbol
max_price_by_symbol = { symbol : max(row["closing_price"]
                                     for row in grouped_rows)
                        for symbol, grouped_rows in by_symbol.items() }
Rescaling
b_to_c = distance([67, 160], [70, 171]) # 11.40
def scale(data_matrix):
    """returns the means and standard deviations of each column"""
    num_rows, num_cols = shape(data_matrix)
    means = [mean(get_column(data_matrix, j))
             for j in range(num_cols)]
    stdevs = [standard_deviation(get_column(data_matrix, j))
              for j in range(num_cols)]
    return means, stdevs
def rescale(data_matrix):
    """rescales the input data so that each column has mean 0 and
    standard deviation 1; leaves alone columns with no deviation"""
    means, stdevs = scale(data_matrix)

    def rescaled(i, j):
        if stdevs[j] > 0:
            return (data_matrix[i][j] - means[j]) / stdevs[j]
        else:
            return data_matrix[i][j]

    num_rows, num_cols = shape(data_matrix)
    return make_matrix(num_rows, num_cols, rescaled)
Machine Learning
Modeling
What Is Machine Learning?
essence of what we're attempting to accomplish. Some examples of
tasks that machine learning algorithms can perform are:
• Classification, where we make a decision or a prediction involving
two or more categories or outcomes, for example, deciding whether
to accept or reject a loan based on data from a customer's financial
history.
• Regression, where we attempt to predict a numeric outcome based
on one or more input variables, for example, how much a house will
sell for based on its features compared to the sale prices of similar
houses.
• Clustering, where we group similar objects together based on
similarities in their data, for example, grouping customers into
marketing segments based on their income, age, gender, number of
children, etc.
To understand machine learning in a better way, we look at its workflow:
First, we find a question that we want to answer. This can be a
hypothesis we want to test, a decision we want to make, or
something we want to attempt to predict.
Second, we collect data for our analysis. Sometimes this means
designing an experiment to create new data, other times the
data already exist and we just need to find them.
Third, we prepare the data for analysis, a process often referred
to as data munging or data wrangling. We need to clean and
transform these data to get them into a form suitable for
analysis.
Fourth, we create a model for our data. In the most generic
sense, this can be a numerical model, a visual model, a
statistical model, or a machine learning model. We use this
model to provide evidence for or against our hypothesis, to
help us make a decision, or to predict an outcome.
Fifth, we evaluate the model. We need to determine if our
model answers our question, helps us make a decision, or
creates an accurate prediction. In addition, we need to make
sure that our model is appropriate given our data and the
context.
Finally, if everything looks good, we deploy our model. This
could mean communicating the results of our analysis to
others, making a decision and acting upon our decision, or
deploying an application into production.
We then repeat this process for each question we would like to
answer using feedback from our previous results to help guide
our process.
Data science is typically an iterative process. We typically go
through the complete cycle multiple times learning and
improving with each iteration. In addition, this process is often
non-sequential; we often have to bounce back and forth
between steps as we discover problems and learn better ways
of solving these problems. In addition, there are times when
we don't need to complete the process. Often, we learn that
what we're doing isn't working or doesn't make sense given our
data or context, so we terminate the process and shift our focus
to the next most important question on our to-do list instead.
There are some well-established practices for the data science
process available, like the CRISP-DM process, which stands
for Cross Industry Standard Process for Data Mining. These
established processes are useful to help you get started with
your data-science process.
Supervised learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
Unsupervised Learning
Semi-supervised Learning
In the previous two types, either there are no labels for all the
observation in the dataset or labels are present for all the
observations. Semi-supervised learning falls in between these
two. In many practical situations, the cost to label is quite high,
since it requires skilled human experts to do that. So, in the
absence of labels in the majority of the observations but
present in few, semi-supervised algorithms are the best
candidates for the model building. These methods exploit the
idea that even though the group memberships of the unlabeled
data are unknown, this data carries important information
about the group parameters.
Reinforcement Learning
Overfitting and Underfitting
The horizontal line shows the best fit degree 0 (i.e., constant)
polynomial. It severely underfits the training data. The best fit
degree 9 (i.e., 10-parameter) polynomial goes through every
training data point exactly, but it very severely overfits—if we
were to pick a few more data points it would quite likely miss
them by a lot. And the degree 1 line strikes a nice balance—it's
pretty close to every point, and (if these data are representative)
the line will likely be close to new data points as well.
Clearly, models that are too complex lead to overfitting and
don’t generalize well beyond the data they were trained on. So
how do we make sure our models aren’t too complex? The
most fundamental approach involves using different data to
train the model and to test the model.
The simplest way to do this is to split your data set, so that (for
example) two-thirds of it is used to train the model, after which
we measure the model’s performance on the remaining third:
Correctness
features. Going from the degree 0 model in“Overfitting and
Underfitting” to the degree 1 model can be a big improvement.
If your model has high variance, you can similarly remove features;
another solution is to obtain more data.
In real applications, usually tens of thousands of features are
measured while only a very small percentage of them carry
useful information towards our learning goal. Therefore, we
usually need an algorithm that compresses our feature vector and
reduces its dimension. Two groups of methods that can be used for
dimensionality reduction are: 1) feature extraction methods, which
apply a transformation to the original feature vector to reduce its
dimension from d to m; and 2) feature selection methods, which
select a small subset of the original features. One can, for example,
compare linear discriminant analysis (LDA), a traditional feature
extraction method, with a forward-selection-based method (an instance
of the feature selection algorithms) and determine under which
conditions each algorithm works better.
K-Nearest Neighbors
Imagine that you’re trying to predict how I’m going to vote in
the next presidential election. If you know nothing else about
me (and if you have the data), one sensible approach is to look
at how my neighbors are planning to vote. Living in downtown
Seattle, as I do, my neighbors are invariably planning to vote
for the Democratic candidate, which suggests that
“Democratic candidate” is a good guess for me as well.
Now imagine you know more about me than just geography—
perhaps you know my age, my income, how many kids I have,
and so on. To the extent my behavior is influenced (or
characterized) by those things, looking just at my neighbors
who are close to me among all those dimensions seems likely
to be an even better predictor than looking at all my neighbors.
This is the idea behind nearest neighbors classification.
The Model
Example:
This example is broken down into the following steps:
1. Handling Data: Load the dataset from CSV and split it into train and test datasets.
2. Calculating Similarity: Compute the distance between two data instances.
3. Locating Neighbors: Find the k most similar data instances.
4. Generating Response: Produce a prediction from a set of neighbors.
5. Evaluating Accuracy: Summarize the accuracy of predictions.
6. Main: Tie it all together.
Handling Data
The first thing we need to do is load our data file. The data is
in CSV format without a header line or any quotes. We can
open the file with the open function and read the data lines
using the reader function in the csv module.
import csv

with open('iris.data', 'r') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print(', '.join(row))
Next we need to split the data into a training dataset that kNN
can use to make predictions and a test dataset that we can use
to evaluate the accuracy of the model.
The loadDataset function below loads a CSV file and splits it randomly
into train and test datasets using the provided split ratio.
import csv
import random

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

training_set = []
test_set = []
loadDataset('iris.data', 0.66, training_set, test_set)
print('Train: ' + repr(len(training_set)))
print('Test: ' + repr(len(test_set)))
Calculating Similarity
import math

def Euclidean_Distance(instance_1, instance_2, length):
    distance = 0
    for i in range(length):
        distance += pow((instance_1[i] - instance_2[i]), 2)
    return math.sqrt(distance)
We can test this function with some sample data, as
follows:
data_1 = [3, 3, 3, 'x']
data_2 = [5, 5, 5, 'y']
distance = Euclidean_Distance(data_1, data_2, 3)
print('Distance: ' + repr(distance))
Locating Neighbors
import operator

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = Euclidean_Distance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
We can test out this function as follows:
trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
testInstance = [5, 5, 5]
k = 1
neighbors = getNeighbors(trainSet, testInstance, 1)
print(neighbors)
Generating Response
import operator

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(),
                         key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
Evaluating Accuracy
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
We can test this function with a test dataset and
predictions, as follows:
testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)
Main Elements
We now have all the elements of the algorithm and we can tie
them together with a main function. Running the example, you
will see the results of each prediction compared to the actual
class value in the test set. At the end of the run, you will see
the accuracy of the model. In this case, a little over 98%.
...
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
Accuracy: 98.0392156862745%
The Curse of Dimensionality
We now have a working, although quite simplistic, classifier. If
we are presented with an unseen observation, all we have to do is
figure out its region in order to make a prediction for its class.
Now let's increase the dimensionality of the data set X by making
x a two-dimensional vector (D = 2):
X^T = {(x_11, x_12), ..., (x_N1, x_N2)}
Let's see how many observations we need if we want to keep the
density of the one-dimensional example when we move to a
three-dimensional space. If 20 observations were enough along one
dimension, we now need x observations with x^(1/3) = 20, that is
x = 20^3 = 8000.
Naive Bayes
P(c|x) (read as probability of c given x) is the
posterior probability of class (c, target) given predictor
(x, attributes).
P(c) (read as probability of c ) is the prior probability
of class.
P(x|c) (read as probability of x given c) is the likelihood,
which is the probability of the predictor given the class.
P(x) (read as probability of x) is the prior probability
of predictor.
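These quantities are combined by Bayes' theorem as:
P(c|x) = P(x|c) × P(c) / P(x)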
The three types of Naive Bayes model under scikit learn library
are:
Gaussian models
Multinomial models
Bernoulli models
Based on your data set, you can choose any of the models discussed
above. Below is an example using the Gaussian model.
Python Code
# Import Library of Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
import numpy as np

# x (training features) and y (class labels) are assumed to have been
# defined earlier; the training arrays are not reproduced in this listing
model = GaussianNB()
model.fit(x, y)

# Predict Output
predicted = model.predict([[1, 2], [3, 4]])
print(predicted)
Output: [3 4]
Simple Linear Regression
The question which arises here is: how does regression relate to
machine learning?
Given data, we can try to find the best fit line. After we
discover the best fit line, we can use it to make predictions.
Data can be any data saved from Excel into a csv format. To
load the data, we will use Python Pandas.
Required modules are:
1. scikit-learn (imported as sklearn)
2. scipy
3. pandas
4. matplotlib
You can use sudo pip install to install any of the above.
The data will be split into a training and test set. Once we have
the test data, we can find a best fit line and make predictions.
import matplotlib
matplotlib.use('GTKAgg')   # or any other backend available on your system
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# load the data; the filename is a placeholder for a CSV with 'price' and 'lotsize' columns
df = pd.read_csv('Housing.csv')

P = df['price'].values.reshape(-1, 1)     # target: house price
S = df['lotsize'].values.reshape(-1, 1)   # feature: lot size

# split into training and test sets
S_train, S_test, P_train, P_test = train_test_split(S, P, test_size=0.3)

# Plot outputs
plt.scatter(S_test, P_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())
plt.show()
We have created the two datasets and have the test data on the
screen. We can continue to create the best fit line:
# Train the model using the training sets
from sklearn import linear_model
regression = linear_model.LinearRegression()
regression.fit(S_train, P_train)

# Plot outputs
plt.plot(S_test, regression.predict(S_test), color='red', linewidth=3)
This will output the best fit line for the given test data.
Logistic Regression
Logistic regression is similar to linear regression, with the only
difference being the y data, which should contain integer values
indicating the class relative to the observation. Using the Iris
dataset from the Scikit-learn datasets module, you can use the
values 0, 1, and 2 to denote three classes that correspond to
three species:
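A short sketch of this setup with scikit-learn, following the standard API (the split and estimator settings here are illustrative):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target   # y contains the classes 0, 1 and 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logistic = LogisticRegression()
logistic.fit(X_train, y_train)

print(logistic.predict(X_test[:1]))        # most probable class for one observation
print(logistic.predict_proba(X_test[:1]))  # probability of each of the three classes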
three classes. Based on the observation used for prediction,
logistic regression estimates a probability of 71 percent of its
being from class 2 — a high probability, but not a perfect
score, therefore leaving a margin of uncertainty.
Using probabilities lets you guess the most probable class, but
you can also order the predictions with respect to being part of
that class. This is especially useful for medical purposes:
Ranking a prediction in terms of likelihood with respect to
others can reveal what patients are at most risk of getting or
already having a disease.
The two multiclass classes OneVsRestClassifier and
OneVsOneClassifier operate by incorporating the estimator
(in this case, LogisticRegression). After incorporation, they
usually work just like any other learning algorithm in Scikit-
learn. Interestingly, the one-versus-one strategy obtained the
best accuracy thanks to its high number of models in
competition.
Decision Trees
A decision tree simply uses a tree structure to represent the
number of possible decision paths and an outcome for each
specified path.
def entropy_of_partition(subsets):
    """find the entropy from this partition of data into subsets;
    subsets is a list of lists of labeled data"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)
inputs = [
({'level':'Senior', 'lang':'Java', 'tweets':'no',
'phd':'no'}, False),
({'level':'Senior', 'lang':'Java', 'tweets':'no',
'phd':'yes'}, False),
({'level':'Mid', 'lang':'Python', 'tweets':'no',
'phd':'no'}, True),
({'level':'Junior', 'lang':'Python', 'tweets':'no',
'phd':'no'}, True),
({'level':'Junior', 'lang':'R', 'tweets':'yes',
'phd':'no'}, True),
({'level':'Junior', 'lang':'R', 'tweets':'yes',
'phd':'yes'}, False),
({'level':'Mid', 'lang':'R', 'tweets':'yes',
'phd':'yes'}, True),
({'level':'Senior', 'lang':'Python', 'tweets':'no',
'phd':'no'}, False),
({'level':'Senior', 'lang':'R', 'tweets':'yes',
'phd':'no'}, True),
({'level':'Junior', 'lang':'Python', 'tweets':'yes',
'phd':'no'}, True),
({'level':'Senior', 'lang':'Python', 'tweets':'yes',
'phd':'yes'}, True),
({'level':'Mid', 'lang':'Python', 'tweets':'no',
'phd':'yes'}, True),
({'level':'Mid', 'lang':'Java', 'tweets':'yes',
'phd':'no'}, True),
({'level':'Junior', 'lang':'Python', 'tweets':'no',
'phd':'yes'}, False)
]
Our tree will consist of decision nodes (which ask a question and
direct us differently depending on the answer) and leaf nodes
(which give us a prediction). We will build it using the relatively
simple ID3 algorithm, which operates in the following manner.
Let’s assume we have some labeled data, and a list of attributes
to consider branching on.
• If the data all have the same label, then we create a leaf node
that predicts that label and then stop.
• If the list of attributes is empty (i.e., there are no more
possible questions to ask), then we create a leaf node that
predicts the most common label and then stop.
• Otherwise, we try partitioning the data by each of the
attributes.
• Then we choose the partition with the lowest partition
entropy.
• Then we add a decision node based on the chosen attribute.
• Finally, we recur on each partitioned subset using the
remaining attributes.
This is known as a "greedy" algorithm because, at each step,
it chooses the most immediately best option. Given a data set,
there may be a better tree with a worse-looking first move; if
so, this algorithm won't find it. Nevertheless, it is relatively easy
to understand and implement, which makes it a good place to
begin exploring decision trees.
Let’s manually go through these steps on the interviewee data
set. The data set has both True and False labels, and we have
four attributes we can split on. So our first step will be to find
the partition with the least entropy. We’ll start by writing a
function that does the partitioning:
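The partitioning code is not reproduced in full here; a sketch consistent with entropy_of_partition above (data_entropy is assumed from earlier in the chapter):

from collections import defaultdict

def partition_by(inputs, attribute):
    """group the (attribute_dict, label) pairs by the value of the given attribute"""
    groups = defaultdict(list)
    for input in inputs:
        key = input[0][attribute]        # the value of the specified attribute
        groups[key].append(input)        # add this input to the matching group
    return groups

def partition_entropy_by(inputs, attribute):
    """compute the entropy of the partition induced by the given attribute"""
    partitions = partition_by(inputs, attribute)
    return entropy_of_partition(partitions.values())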
Then we just need to find the minimum-entropy partition for
the whole data set:
for key in ['level', 'lang', 'tweets', 'phd']:
    print(key, partition_entropy_by(inputs, key))
# level 0.693536138896
# lang 0.860131712855
# tweets 0.788450457308
# phd 0.892158928262
Finally, if we do the same thing for the Junior candidates, we
end up splitting on phd, after which we find that no PhD
always results in True and a PhD always results in False.
The figure below shows the complete decision tree.
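A compact sketch of the full ID3 construction described above, reusing the partition helpers sketched earlier (a tree here is either True, False, or an (attribute, subtree_dict) pair; the None entry is a default for unseen attribute values):

def build_tree_id3(inputs, split_candidates=None):
    if split_candidates is None:
        split_candidates = list(inputs[0][0].keys())   # first pass: every attribute
    num_true = len([label for _, label in inputs if label])
    num_false = len(inputs) - num_true
    if num_true == 0:
        return False                                   # every label is False
    if num_false == 0:
        return True                                    # every label is True
    if not split_candidates:                           # no questions left to ask
        return num_true >= num_false                   # predict the majority label
    # choose the attribute whose partition has the lowest entropy
    best_attribute = min(split_candidates,
                         key=lambda a: partition_entropy_by(inputs, a))
    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates if a != best_attribute]
    # recursively build a subtree for each value of the chosen attribute
    subtrees = {value: build_tree_id3(subset, new_candidates)
                for value, subset in partitions.items()}
    subtrees[None] = num_true > num_false              # default for unseen values
    return (best_attribute, subtrees)

tree = build_tree_id3(inputs)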
Random Forests
Given how closely decision trees can fit themselves to their
training data, it’s not surprising that they have a tendency to
overfit. One way of avoiding this is a technique called random
forests, in which we build multiple decision trees and let them
vote on how to classify inputs:
    # inside the tree-building routine; requires `import random` and
    # `from functools import partial`

    # if there are already few enough split candidates, look at all of them
    if len(split_candidates) <= self.num_split_candidates:
        sampled_split_candidates = split_candidates
    # otherwise pick a random sample
    else:
        sampled_split_candidates = random.sample(split_candidates,
                                                 self.num_split_candidates)

    # now choose the best attribute only from those candidates
    best_attribute = min(sampled_split_candidates,
                         key=partial(partition_entropy_by, inputs))
    partitions = partition_by(inputs, best_attribute)
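The voting itself can be a few lines; a sketch, assuming a classify(tree, input) function for a single decision tree:

from collections import Counter

def forest_classify(trees, input):
    """let every tree in the forest vote and return the most common prediction"""
    votes = [classify(tree, input) for tree in trees]
    return Counter(votes).most_common(1)[0][0]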
Neural Networks
Perceptrons
Example of a perceptron function: it computes a weighted
sum of its inputs and "fires" if that weighted sum is zero or
greater:
def my_step_function(x):
    return 1 if x >= 0 else 0

def my_perceptron_output(weights, bias, x):
    """returns 1 if the perceptron 'fires', 0 if not"""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return my_step_function(weighted_sum)
Backpropagation
Usually we don’t build neural networks by hand.
Instead (as usual) we use data to train neural networks. One
popular approach is an algorithm called backpropagation that has
similarities to the gradient descent algorithm we looked at
earlier.
Imagine we have a training set that consists of input vectors
and corresponding target output vectors. Imagine that our
network has some set of weights. We then adjust the weights
using the following algorithm:
    # the output * (1 - output) is from the derivative of sigmoid
    output_deltas = [output * (1 - output) * (output - target)
                     for output, target in zip(outputs, targets)]

    # adjust weights for output layer, one neuron at a time
    for i, output_neuron in enumerate(network[-1]):
        # focus on the ith output layer neuron
        for j, hidden_output in enumerate(hidden_outputs + [1]):
            # adjust the jth weight based on both
            # this neuron's delta and its jth input
            output_neuron[j] -= output_deltas[i] * hidden_output

    # back-propagate errors to hidden layer
    hidden_deltas = [hidden_output * (1 - hidden_output) *
                     dot(output_deltas, [n[i] for n in output_layer])
                     for i, hidden_output in enumerate(hidden_outputs)]

    # adjust weights for hidden layer, one neuron at a time
    for i, hidden_neuron in enumerate(network[0]):
        for j, input in enumerate(input_vector + [1]):
            hidden_neuron[j] -= hidden_deltas[i] * input
01. import numpy as np
02.
03. # sigmoid function
04. def nonlin(x, deriv=False):
05.     if deriv == True:
06.         return x * (1 - x)
07.     return 1 / (1 + np.exp(-x))
08.
09. # input dataset is as below
10. X = np.array([[0, 0, 1],
11.               [0, 1, 1],
12.               [1, 0, 1],
13.               [1, 1, 1]])
14.
15. # output dataset is as below
16. y = np.array([[0, 0, 1, 1]]).T
17.
18. # seed random numbers to make calculations
19. # deterministic in nature
20. np.random.seed(1)
21.
22. # initializing weights randomly with mean as 0
23. syn0 = 2 * np.random.random((3, 1)) - 1
24.
25. for iter in range(10000):
26.
27.     # forward propagation
28.     l0 = X
29.     l1 = nonlin(np.dot(l0, syn0))
30.
31.     # calculating error
32.     l1_error = y - l1
33.
34.     # multiplying error by the
35.     # slope of the sigmoid at the values in l1
36.     l1_delta = l1_error * nonlin(l1, True)
37.
38.     # update weights
39.     syn0 += np.dot(l0.T, l1_delta)
40.
41. print("The Output After Training of the Data is:")
42. print(l1)
The Output After Training of the Data is:
[[ 0.00966449]
[ 0.00786506]
[ 0.99358898]
[ 0.99211957]]
Variable Definition
Line Number 04: This is our nonlinearity, a function
called a "sigmoid". A sigmoid function maps any value to a
value between 0 and 1. We use it to convert numbers to
probabilities. It also has several other desirable properties for
training neural networks.
Line Number 05: Notice that this function can also generate
the derivative of a sigmoid (when deriv=True). One of the
desirable properties of a sigmoid function is that its output can
be used to create its derivative. If the sigmoid's output is a
variable "out", then the derivative is simply out * (1-out). This
is very efficient.
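A quick numerical check of this property (not from the original text) compares the out * (1 - out) shortcut with a finite-difference estimate of the derivative:

import numpy as np

x = 0.5
out = 1 / (1 + np.exp(-x))                                # sigmoid(x)
shortcut = out * (1 - out)                                # derivative from the output
numeric = (1 / (1 + np.exp(-(x + 1e-6))) - out) / 1e-6    # finite-difference estimate
print(shortcut, numeric)                                  # both approximately 0.235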
If you are not fully comfortable with derivatives, check out the
derivatives tutorial from Khan Academy.
Line Number 23: This is our weight matrix for this neural
network. It's called "syn0" to imply "synapse zero". Since we
only have 2 layers (input and output), we only need one matrix
of weights to connect them. Its dimension is (3,1) because we
have 3 inputs and 1 output. Another way of looking at it is that
l0 is of size 3 and l1 is of size 1. Thus, we want to connect every
node in l0 to every node in l1, which requires a matrix of
dimensionality (3,1).
Line Number 28: Our first layer, l0, is simply our data, so we
explicitly describe it as such at this point. Remember that
X contains 4 training examples (rows). We're going to process
all of them at the same time in this implementation; this is
known as "full batch" training. Thus, we have 4 different l0
rows, but you can think of it as a single training example if you
want; it makes no difference at this point. (We could load in
1,000 or 10,000 examples without changing any of the code.)
Line Number 29: This is our prediction step. We first let the
network "try" to predict the output given the input. We then
study how it performs so that we can adjust it to do a bit
better on each iteration.
The result of a matrix multiplication has the number of rows of
the first matrix and the number of columns of the second matrix.
Since we loaded in 4 training examples, we ended up with 4
guesses for the correct answer, a (4 x 1) matrix. Each output
corresponds with the network's guess for a given input.
Perhaps it becomes intuitive why we could have "loaded in" an
arbitrary number of training examples: the matrix
multiplication would still work out.
Line Number 32: Given that l1 has a "guess" for each input,
we can now compare how well it did by subtracting the
true answer (y) from the guess (l1). l1_error is just a vector of
positive and negative numbers reflecting how much the
network missed.
Line Number 36: Let's break this statement into two parts.
1. nonlin(l1, True)
If l1 represents these three dots, the code above generates the
slopes of the lines below. Notice that very high values such as
x=2.0 (green dot) and very low values such as x=-1.0 (purple
dot) have rather shallow slopes. The highest slope you can have
is at x=0 (blue dot). This plays an important role. Also notice
that all derivatives are between 0 and 1.
2. The Entire Statement: The Error-Weighted Derivative
If the slope was very shallow, the output was either very high or
very low, which means that the network was quite confident one
way or the other. However, if the network guessed something
close to (x=0, y=0.5), then it isn't very confident. We update
these "wishy-washy" predictions most heavily, and we tend to
leave the confident ones alone by multiplying them by a number
close to 0.
Line Number 39: We are now ready to update our network!
Let's take a look at a single training example.
For the far left weight, this would multiply 1.0 * l1_delta.
Presumably, this would increment the weight of 9.5 ever so
slightly. Why only a small amount? Well, the prediction was
already very confident, and the prediction was largely correct.
A small error and a small slope mean a VERY small update.
Considering all the weights, it would ever so slightly increase
all three.
So, what does line 39 do? It computes the weight updates for
each weight for each training example, sums them, and updates
the weights, all in a single line.
Clustering
The Idea
The Model
Step Number 1
Step Number 2
Step Number 3
Step Number 4
The dataset we are going to use has 3000 entries with 3 clusters.
So we already know the value of K.
We will start by importing the dataset.
%matplotlib inline
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Import of the dataset
data = pd.read_csv('xclara.csv')
print(data.shape)
data.head()
(3000, 2)
# To calculate Euclidean Distance
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Getting the points as a plain array (this assumes the two feature
# columns of xclara.csv are named 'V1' and 'V2')
f1 = data['V1'].values
f2 = data['V2'].values
X = np.array(list(zip(f1, f2)))

# Number of clusters
k = 3
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X) - 20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X) - 20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print(C)
[[ 11. 26.]
 [ 79. 56.]
 [ 79. 21.]]

# Plotting on graph along with the Centroids
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')
# To store the values of the centroids when they get updated
C_old = np.zeros(C.shape)
# Cluster labels (0, 1, 2)
clusters = np.zeros(len(X))
# Error function: the distance between the new centroids and the old centroids
error = dist(C, C_old, None)
# Run the following loop till the error becomes zero
while error != 0:
    # Assign each value to its closest cluster
    for i in range(len(X)):
        distances = dist(X[i], C)
        cluster = np.argmin(distances)
        clusters[i] = cluster
    # Store the old centroid values
    C_old = deepcopy(C)
    # Finding the new centroids by taking the average value
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)

colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')
import random

class KMeans:
    """To perform k-means clustering"""
    def __init__(self, k):
        self.k = k          # number of clusters
        self.means = None   # means of clusters

    def classify(self, input):
        """return the index of the cluster closest to the input"""
        return min(range(self.k),
                   key=lambda i: squared_distance(input, self.means[i]))

    def train(self, inputs):
        # choose k random points as the initial means
        self.means = random.sample(inputs, self.k)
        assignments = None
        while True:
            # To find new assignments
            new_assignments = [self.classify(input) for input in inputs]
            # If no assignments have changed, we can stop
            if assignments == new_assignments:
                return
            # Else we keep the new assignments,
            assignments = new_assignments
            # and then compute new means based on the new assignments
            for i in range(self.k):
                # find all the points which are assigned to cluster i
                i_points = [p for p, a in zip(inputs, assignments) if a == i]
                # make sure i_points is not empty so we don't divide by 0
                if i_points:
                    self.means[i] = vector_mean(i_points)
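The class assumes squared_distance and vector_mean helpers from the linear algebra chapter; a possible version of those helpers, plus a hedged usage sketch on the same xclara points loaded above:

def squared_distance(v, w):
    """squared Euclidean distance between two equal-length vectors"""
    return sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w))

def vector_mean(vectors):
    """component-wise mean of a list of vectors"""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

random.seed(0)                        # so the random initial means are repeatable
points = [list(p) for p in X]         # reuse the xclara points from above
clusterer = KMeans(3)
clusterer.train(points)
print(clusterer.means)                # the three cluster centers found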
Natural Language Processing
Word Clouds
This looks neat but doesn't really tell us anything. A more
interesting approach might be to scatter the words so that
horizontal position indicates posting popularity and vertical
position indicates resume popularity, which produces a
visualization that conveys a few insights:
def text_size(total):
    """equals 8 if total is 0 and 28 if total is 200"""
    return 8 + total / 200 * 20

for word, job_popularity, resume_popularity in data:
    plt.text(job_popularity, resume_popularity, word,
             ha='center', va='center',
             size=text_size(job_popularity + resume_popularity))
plt.xlabel("Popularity on Job Postings")
plt.ylabel("Popularity on Resumes")
plt.axis([0, 100, 0, 100])
plt.xticks([])
plt.yticks([])
plt.show()
A more meaningful (if less attractive) word cloud
N-gram Models
>>> txt = 'Python is one of the awesomest languages'
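As a small illustration (not the book's full n-gram code), the bigrams of that sentence can be produced with zip:

>>> words = txt.split()
>>> list(zip(words, words[1:]))
[('Python', 'is'), ('is', 'one'), ('one', 'of'), ('of', 'the'),
 ('the', 'awesomest'), ('awesomest', 'languages')]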
Grammars
grammar = {
    "_S"  : ["_NP _VP"],
    "_NP" : ["_N",
             "_A _NP _P _A _N"],
    "_VP" : ["_V",
             "_V _NP"],
    "_N"  : ["data science", "Python", "regression"],
    "_A"  : ["big", "linear", "logistic"],
    "_P"  : ["about", "near"],
    "_V"  : ["learns", "trains", "tests", "is"]
}
For example, one such progression might look like:
['_S']
['_NP','_VP']
['_N','_VP']
['Python','_VP']
['Python','_V','_NP']
['Python','trains','_NP']
['Python','trains','_A','_NP','_P','_A','_N']
['Python','trains','logistic','_NP','_P','_A','_N']
['Python','trains','logistic','_N','_P','_A','_N']
['Python','trains','logistic','data science','_P','_A','_N']
['Python','trains','logistic','data science','about','_A','_N']
['Python','trains','logistic','data science','about','logistic','_N']
['Python','trains','logistic','data science','about','logistic','Python']
If we do find a nonterminal, then we randomly choose one of
its productions. If that production is a terminal (i.e., a word),
we simply replace the token with it. Otherwise it’s a sequence
of space-separated nonterminal tokens that we need to split
and then splice into the current tokens. Either way, we repeat
the process on the new set of tokens.
Putting it all together we get:
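A sketch consistent with that description (token expansion driven by the grammar dictionary above):

import random

def is_terminal(token):
    """a token is a terminal unless it starts with an underscore"""
    return token[0] != "_"

def expand(grammar, tokens):
    for i, token in enumerate(tokens):
        if is_terminal(token):
            continue                                   # skip over terminals
        replacement = random.choice(grammar[token])    # pick a production at random
        if is_terminal(replacement):
            tokens[i] = replacement                    # a single word: swap it in
        else:
            # a sequence of nonterminals: split and splice them in
            tokens = tokens[:i] + replacement.split() + tokens[(i + 1):]
        return expand(grammar, tokens)                 # repeat on the new token list
    return tokens                                      # every token is terminal: done

def generate_sentence(grammar):
    return " ".join(expand(grammar, ["_S"]))

print(generate_sentence(grammar))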
Try changing the grammar—add more words, add more rules,
add your own parts of speech—until you’re ready to generate
as many web pages as your company needs.
Grammars are actually more interesting when they are used in
the other direction.
Given a sentence we can use a grammar to parse the sentence.
This then allows us to identify subjects and verbs and helps us
make sense of the sentence.
Using data science to generate text is a neat trick; using it to
understand text is more magical. (See “For Further
Investigation” on page 200 for libraries that you could use for
this.)
Topic Modeling
Parameters of LDA
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]
Next we create the term dictionary of our corpus, in which every unique term is assigned an index.
```
dictionary = corpora.Dictionary(doc_clean)
```
```
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
```
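The training step itself is not shown above; a sketch, assuming doc_clean is the list of tokenized, cleaned documents and that gensim and gensim.corpora have been imported:
```
# Convert the cleaned documents into a document-term (bag-of-words) matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Train the LDA model on the document-term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)
```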
Results
```
print(ldamodel.print_topics(num_topics=3, num_words=3))
['0.168*health + 0.083*sugar + 0.072*bad',
 '0.061*consume + 0.050*drive + 0.050*sister',
 '0.049*pressur + 0.049*father + 0.049*sister']
```
Network Analysis
Betweenness Centrality
import networkx as nx
G = nx.read_gml('lesmiserables.gml', relabel=True)

def most_important(G):
    """returns a copy of G with only the most important nodes
    according to betweenness centrality"""
    ranking = nx.betweenness_centrality(G).items()
    print(ranking)
    r = [x[1] for x in ranking]
    m = sum(r) / len(r)   # mean centrality
    t = m * 3             # threshold: keep only nodes with 3 times the mean
    Gt = G.copy()
    for k, v in ranking:
        if v < t:
            Gt.remove_node(k)
    return Gt

Gt = most_important(G)    # trimming
And we can use the original network and the trimmed one to
visualize the network as follows:
from pylab import show
# create the layout
pos = nx.spring_layout(G)
# draw the nodes and the edges (all)
nx.draw_networkx_nodes(G, pos, node_color='b', alpha=0.2, node_size=8)
nx.draw_networkx_edges(G, pos, alpha=0.1)
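The trimmed graph Gt can then be drawn on top of the faint full network; a sketch along the lines of the original example:

# draw the most influential nodes (the trimmed graph) with labels on top
nx.draw_networkx_nodes(Gt, pos, node_color='r', alpha=0.4, node_size=254)
nx.draw_networkx_edges(Gt, pos, alpha=0.4)
nx.draw_networkx_labels(Gt, pos, font_size=12, font_color='b')
show()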
This graph is pretty interesting: it highlights the nodes that are
most influential in the way information spreads over the
network.
Eigenvector Centrality
To start with, we’ll need to represent the connections in our
network as an adjacency_matrix, whose (i,j)th entry is either 1
(if user i and user j are friends) or 0 (if they’re not):
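The original listing is not reproduced here; a sketch, using a small hypothetical network consistent with the discussion below (ten users, with users 1 and 2 each having three friends):

# hypothetical friendships as (i, j) pairs over users 0..9
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
n = 10   # number of users

def entry_fn(i, j):
    """1 if users i and j are friends, 0 otherwise"""
    return 1 if (i, j) in friendships or (j, i) in friendships else 0

adjacency_matrix = [[entry_fn(i, j) for j in range(n)] for i in range(n)]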
Users with high eigenvector centrality should be those who
have a lot of connections and connections to people who
themselves have high centrality.
Here users 1 and 2 are the most central, as they both have three
connections to people who are themselves highly central. As
we move away from them, people’s centralities steadily drop
off.
On a network this small, eigenvector centrality behaves
somewhat erratically: if you try adding or subtracting links,
you'll find that small changes in the network can dramatically
change the centrality numbers. In a much larger network this
would presumably not be the case.
We still haven't motivated why an eigenvector might lead to a
reasonable notion of centrality. Eigenvector centralities are
numbers, one per user, such that each user's value is a constant
multiple of the sum of his neighbors' values. In this case
centrality means being connected to people who themselves are
central: the more centrality you are directly connected to, the
more central you are. This is of course a circular definition, and
eigenvectors are the way of breaking out of the circularity.
Another way of understanding this is by thinking about what
find_eigenvector is doing here. It starts by assigning each node
a random centrality. It then repeats the following two steps
until the process converges:
1. Give each node a new centrality score that equals the sum
of its neighbors’ (old) centrality scores.
2. Rescale the vector of centralities to have magnitude 1.
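A numpy-based sketch of that iteration (the book builds it from its own matrix helpers; the tolerance here is arbitrary):

import numpy as np

def find_eigenvector(A, tolerance=0.00001):
    A = np.array(A, dtype=float)
    guess = np.random.random(A.shape[0])        # random starting centralities
    while True:
        result = A.dot(guess)                   # step 1: sum the neighbors' scores
        length = np.linalg.norm(result)
        next_guess = result / length            # step 2: rescale to magnitude 1
        if np.linalg.norm(guess - next_guess) < tolerance:
            return next_guess, length           # converged: eigenvector and eigenvalue
        guess = next_guess

# eigenvector centralities for the adjacency matrix sketched above
centralities, _ = find_eigenvector(adjacency_matrix)
print(centralities)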
Directed Graphs and PageRank
In this new model, we'll track endorsements as (source, target)
pairs that no longer represent a reciprocal relationship; rather,
source endorses target as an awesome data scientist. We'll need
to account for this asymmetry:
endorsements = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2),
                (2, 1), (1, 3), (2, 3), (3, 4), (5, 4),
                (5, 6), (7, 5), (6, 8), (8, 7), (8, 9)]
4. At each step, the remainder of each node's PageRank is
distributed evenly among all nodes.
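A sketch of the full procedure described by these steps, applied to the endorsements list above (damping factor 0.85, user ids 0 through 9):

from collections import defaultdict

def page_rank(users, endorsements, damping=0.85, num_iters=100):
    # who each user endorses
    outgoing = defaultdict(list)
    for source, target in endorsements:
        outgoing[source].append(target)

    num_users = len(users)
    pr = {user: 1 / num_users for user in users}       # start out evenly distributed

    for _ in range(num_iters):
        next_pr = {user: (1 - damping) / num_users for user in users}
        for source in users:
            if outgoing[source]:
                # distribute the damped PageRank among endorsed users
                share = damping * pr[source] / len(outgoing[source])
                for target in outgoing[source]:
                    next_pr[target] += share
            else:
                # no outgoing endorsements: spread the damped mass evenly
                for user in users:
                    next_pr[user] += damping * pr[source] / num_users
        pr = next_pr
    return pr

users = list(range(10))
print(page_rank(users, endorsements))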
The DataSciencester network sized by PageRank
Recommender Systems
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]
Manual Curation
from collections import Counter

popular_interest = Counter(interest
                           for user_interest in users_interests
                           for interest in user_interest).most_common()

def most_popular_new_interests(user_interest, max_results=5):
    suggestions = [(interest, frequency)
                   for interest, frequency in popular_interest
                   if interest not in user_interest]
    return suggestions[:max_results]
Then we'd recommend you:
most_popular_new_interests(users_interests[1], 5)
# [('Python', 4), ('R', 4), ('Java', 3), ('regression', 3), ('statistics', 3)]
One approach is to base recommendations on the ratings of
those users who are like me. But how do we check how much a
user is similar to me?
Conversely, if v is 0 whenever w is not (and vice versa), then
dot(v, w) is 0, and so the cosine similarity will be 0.
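For reference, a minimal version of the cosine similarity used below (the dot helper mirrors the one from the linear algebra chapter):

import math

def dot(v, w):
    """the sum of the element-wise products of two equal-length vectors"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    """dot(v, w) divided by the lengths of v and w; it ranges from -1 to 1"""
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))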
unique_interests = sorted({interest
                           for user_interest in users_interests
                           for interest in user_interest})
# a sorted list of every distinct interest, e.g. [..., 'Haskell', ...]
def make_user_interest_vector(user_interest):
    """given a list of interests, produce a vector whose ith element is 1
    if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interest else 0
            for interest in unique_interests]
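The user-interest matrix itself is built by applying this function to every user's interest list (a one-liner implied by how user_interest_matrix is used below):

user_interest_matrix = [make_user_interest_vector(user_interests_i)
                        for user_interests_i in users_interests]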
user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_matrix]
                     for interest_vector_i in user_interest_matrix]
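user_based_suggestions below also relies on a most_similar_users_to helper that is not shown above; a sketch consistent with how it is used:

def most_similar_users_to(user_id):
    """return (other_user_id, similarity) pairs with nonzero similarity,
    most similar first"""
    pairs = [(other_user_id, similarity)
             for other_user_id, similarity in enumerate(user_similarities[user_id])
             if user_id != other_user_id and similarity > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)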
from collections import defaultdict

def user_based_suggestions(user_id, include_current_interests=False):
    # sum up the similarities
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity

    # convert them to a sorted list
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],   # sort by weight
                         reverse=True)

    # and (maybe) exclude already-interests
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]
If we call user_based_suggestions(0), the first several suggested
interests are:
[('MapReduce', 0.5669467095138409),
('MongoDB', 0.50709255283711),
('Postgres', 0.50709255283711),
('NoSQL', 0.3380617018914066),
('neural networks', 0.1889822365046136),
('deep learning', 0.1889822365046136),
('artificial intelligence', 0.1889822365046136),
#...
]
To start with, we will want to transpose our user-interest
matrix so that rows correspond to interests and columns
correspond to users:
interest_user_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_matrix]
                        for j, _ in enumerate(unique_interests)]
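The interest-to-interest similarities used below can then be computed from this matrix exactly like the user similarities were:

interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                          for user_vector_j in interest_user_matrix]
                         for user_vector_i in interest_user_matrix]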
For example, we can find the interests most similar to Big Data
(interest 0) using:
def most_similar_interests_to(interest_id):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[1],   # sort by similarity
                  reverse=True)
def item_based_suggestions(user_id, include_current_interests=False):
    # To add up similar interests
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity

    # Then sorting them by weight
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],   # sort by similarity
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]
('C++', 0.4082482904638631),
('artificial intelligence', 0.4082482904638631),
('Python', 0.2886751345948129),
('R', 0.2886751345948129)]
Databases and SQL
The data you need will often live in databases, systems designed
for efficiently storing and querying data. The bulk of these are
relational databases, such as Oracle, MySQL, and SQL Server,
which store data in tables and are typically queried using
Structured Query Language (SQL), a declarative language for
manipulating data. You can find a list of available Python
Database Interface APIs at
https://wiki.python.org/moin/DatabaseInterfaces
You can choose the right database for your application. The
Python Database API supports a wide range of database servers,
such as:
GadFly
mSQL
MySQL
PostgreSQL
Microsoft SQL Server 2000
Informix
Interbase
Oracle
Sybase
Step 2: Setting up the database
Make sure you have database access; from the command line,
type:
mysql -u USERNAME -p
Then we can insert data into the table (these are SQL queries):
INSERT INTO examples(description) VALUES ("Hello ");
INSERT INTO examples(description) VALUES ("MySQL ");
INSERT INTO examples(description) VALUES ("Python Example");
You can now grab all records from the table using a SQL
query:
mysql> SELECT * FROM examples;
+----+----------------+
| id | description    |
+----+----------------+
|  1 | Hello          |
|  2 | MySQL          |
|  3 | Python Example |
+----+----------------+
3 rows in set (0.01 sec)
You can access the database directly from Python using the
MySQLdb module.
#!/usr/bin/python
import MySQLdb

# open a connection and a cursor (this assumes the examples table lives in
# the same TESTDB database and credentials used in the later examples)
db = MySQLdb.connect("localhost", "testuser", "test123", "TESTDB")
cur = db.cursor()

cur.execute("SELECT * FROM examples")
for row in cur.fetchall():
    print(row[0], row[1])
db.close()

Output:
1 Hello
2 MySQL
3 Python Example
Example
Let us create the database table EMPLOYEES:
#!/usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "testuser", "test123", "TESTDB")

# Prepare a cursor object and create the table (the column types below
# are a minimal sketch consistent with the INSERT examples that follow)
cursor = db.cursor()
sql = """CREATE TABLE EMPLOYEES (
           FIRST_NAME CHAR(20) NOT NULL,
           LAST_NAME CHAR(20),
           AGE INT,
           SEX CHAR(1),
           INCOME FLOAT)"""
cursor.execute(sql)
INSERT
Example
The following example executes an SQL INSERT statement to
create a record in the EMPLOYEES table:
# Prepare SQL query to INSERT a record into the database.
sql = """INSERT INTO EMPLOYEES(FIRST_NAME, LAST_NAME, AGE, SEX, INCOME)
         VALUES ('Mac', 'Mohan', 20, 'M', 2000)"""

# Prepare SQL query to INSERT a record into the database.
sql = "INSERT INTO EMPLOYEES(FIRST_NAME, \
       LAST_NAME, AGE, SEX, INCOME) \
       VALUES ('%s', '%s', '%d', '%c', '%d' )" % \
      ('Mac', 'Mohan', 20, 'M', 2000)
Example
The following code segment is another form of execution, where
you can pass parameters directly:
..................................
user_id = "test123"
password = "password"
UPDATE
DELETE
SELECT
cur.execute("SELECT * FROM EMPLOYEES")
GROUP BY
ORDER BY
# Select data from table using SQL query to arrange employees in ascending order
cur.execute("SELECT FIRST_NAME FROM EMPLOYEES ORDER BY FIRST_NAME")
JOIN
# Select data from table using SQL query to find the manager of an employee.
# This assumes an employees table in which employee_id is the primary key
# and manager_id is a foreign key referencing another employee.
db.query("""SELECT e.employee_id 'Emp_Id', e.last_name 'Employee',
                   m.employee_id 'Mgr_Id', m.last_name 'Manager'
            FROM employees e JOIN employees m
            ON (e.manager_id = m.employee_id)""")
Indexes
MySQL uses indexes for operations such as the following:
• To find the rows matching a WHERE clause quickly.
• To eliminate rows from consideration. If there is a choice
between multiple indexes, MySQL normally uses the index that
finds the smallest number of rows (the most selective index).
• If the table has a multiple-column index, any leftmost prefix of
the index can be used by the optimizer to look up rows. For
example, if you have a three-column index on (col1, col2, col3),
you have indexed search capabilities on (col1), (col1, col2), and
(col1, col2, col3).
• To retrieve rows from other tables when performing joins.
MySQL can use indexes on columns more efficiently if they are
declared as the same type and size. In this context, VARCHAR
and CHAR are considered the same if they are declared as the
same size. For example, VARCHAR(10) and CHAR(10) are the
same size, but VARCHAR(10) and CHAR(15) are not.
• To find the MIN() or MAX() value for a specific indexed
column key_col. This is optimized by a preprocessor that checks
whether you are using WHERE key_part_N = constant on all
key parts that occur before key_col in the index. In this case,
MySQL does a single key lookup for each MIN() or MAX()
expression and replaces it with a constant. If all expressions are
replaced with constants, the query returns at once. For example:
SELECT MIN(key_part2), MAX(key_part2)
Query Optimization
SELECT users.name
FROM users
JOIN user_interests
ON users.user_id = user_interests.user_id
WHERE user_interests.interest = 'SQL'
You’ll end up with the same results either way, but filter-
before-join is almost certainly more efficient, since in that case
join has many fewer rows to operate on.
NoSQL
There are column databases that store data in columns instead
of rows (good when data has many columns but queries need
only a few of them), key-value stores that are optimized for
retrieving single (complex) values by their keys, databases for
storing and traversing graphs, databases that are optimized to
run across multiple datacenters, databases that are designed to
run in memory, databases for storing time-series data, and
hundreds more.
MapReduce
"Shuffle" step: Worker nodes redistribute data based on the
output keys (produced by the "map()" function), such that all
data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of
output data, per key, in parallel.
Why MapReduce?
• Put the word in the key; put the usernames and counts in the
values.
• Put the username and word in the key; put the counts in the
values.
If you think about it a bit more, we definitely want to group by
username, because we want to consider each person’s words
separately. In addition, we don’t want to group by word, since
our reducer will need to see all the words for each person to
find out which is the most popular. This means that the first
option is the right choice:
from collections import Counter

def words_per_user_mapper(status_update):
    user = status_update["username"]
    for word in tokenize(status_update["text"]):
        yield (user, (word, 1))

def most_popular_word_reducer(user, words_and_counts):
    """given a sequence of (word, count) pairs,
    return the word with the highest total count"""
    word_counts = Counter()
    for word, count in words_and_counts:
        word_counts[word] += count
    word, count = word_counts.most_common(1)[0]
    yield (user, (word, count))

user_words = map_reduce(status_updates,
                        words_per_user_mapper,
                        most_popular_word_reducer)
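These examples assume a map_reduce helper; a small, purely in-memory sketch (not a distributed implementation) might look like:

from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """run the MapReduce pattern on inputs using mapper and reducer"""
    collector = defaultdict(list)
    for input in inputs:
        for key, value in mapper(input):
            collector[key].append(value)          # "shuffle": group values by key
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]   # reduce each group of values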
def liker_mapper(status_update):
    user = status_update["username"]
    for liker in status_update["liked_by"]:
        yield (user, liker)

distinct_likers_per_user = map_reduce(status_updates,
                                      liker_mapper,
                                      count_distinct_reducer)
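count_distinct_reducer is not shown above; a version consistent with its name and use:

def count_distinct_reducer(user, likers):
    """yield the user together with the number of distinct likers"""
    yield (user, len(set(likers)))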
Make sure the file has execution permission (chmod +x
/home/hduser/mapper.py)
mapper.py
#!/usr/bin/env python
import sys

# read words from standard input and emit each word with a count of 1,
# matching the output shown in the test run below
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))
reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None
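The rest of reducer.py is not reproduced in the text; a sketch consistent with the variables above (note the continue that skips any line whose count is not an integer):

# aggregate the counts for each word; Hadoop streaming sorts the mapper
# output by key, so all counts for a given word arrive consecutively
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue  # the count was not a number, so skip the line
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_word, current_count = word, count

# do not forget the last word
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))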
hduser@ubuntu:~$ echo "foo foo quux labs foo bar
quux" | /home/hduser/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
Go Forth and Do Data Science
IPython
IPython provides "magic" functions that, among other things,
let you paste in blocks of code (preserving their whitespace
formatting) and run scripts from within the shell.
Mastering IPython will make your life far easier. (Even learning
just a little bit of IPython will make your life a lot easier.)
Additionally, it allows you to create "notebooks" combining
text, live Python code, and visualizations that you can share
with other people, or just keep around as a journal of what you
did.
Mathematics
2. Choosing parameter settings and validation strategies.
3. Identifying underfitting and overfitting by
understanding the bias-variance tradeoff.
4. Estimating the right confidence interval and
uncertainty.
Pandas
Scikit-learn
Visualization
The matplotlib charts we’ve been creating have been clean and
functional but not particularly stylish (and not at all
interactive). If you want to get deeper into data visualization,
you have several options. The first is to further explore
matplotlib, only a handful of whose features we’ve actually
covered. Its website contains many examples of its
functionality and a Gallery of some of the more interesting
ones. If you want to create static visualizations (say, for
printing in a book), this is probably your best next step. You
should also check out seaborn, which is a library that (among
other things) makes matplotlib more attractive.
Community support: R has the support of a community of
leading statisticians and data scientists from different parts of
the world, and that community is growing rapidly.
Find Data
12.The World Bank Data Sets
13.The reddit /r/datasets
14.Academic torrents
For Streaming Data
15.Twitter
16.GitHub
17.Quantopian
18.Wunderground
Some of the ideas you can work through are:
1. Learning to mine Twitter
This is a simple project for beginners and is useful for
understanding the importance of data science. While doing
this project you will learn what is trending, whether it is a
piece of viral news, politics, or a new movie. It will also teach
you to integrate an API into scripts for accessing information
on social media, and it exposes the challenges faced in mining
social media.
2. Identify the Digits data set
Here you train your program to recognize different digits, a
task known as the digit recognition problem; it is similar to a
camera detecting faces. The data set has as many as 7,000
images of size 28 x 28, amounting to about 31 MB.
3. Loan Prediction data set
The insurance industry is one of the biggest users of data
science and analytics. In this problem we are provided with
enough information to work on the data sets of insurance
companies: the challenges to be faced, the strategies to be
used, the variables that influence the outcome, and so on. It is
a classification problem with 615 rows and 13 columns.
4. Credit Card Fraud Detection
This is a classification problem in which we classify whether
the transactions taking place on a card are legitimate or
fraudulent. The data set is not huge, since banks do not reveal
customer data due to privacy constraints.
5. Titanic data set from Kaggle
This data set provides a good sense of what a typical data
science project involves. Beginners can work on the data set in
Excel, while professionals can use advanced tools to extract
hidden information and algorithms to impute some of the
missing values in the data set.
Thank you !
If you enjoyed this book and felt that it added value to your
life, we ask that you please take the time to review our books
on Amazon.
Your honest feedback would be greatly appreciated. It
really does make a difference.
If you notice any problems, please let us know by
sending us an email at [email protected] before
writing any review online. It will be very helpful for us
to improve the quality of our books.
https://www.amazon.com/dp/B07FTPKJMM