The parameter k provides a way to trade off between the largest and the total dimensional difference. In other words, larger values of k place more emphasis on large differences between feature values than smaller values do. Selecting the right k can significantly impact the meaningfulness of your distance function. The most popular values are 1 and 2.
• Manhattan (k=1): city block distance, or the sum of the absolute differences between two points
• Euclidean (k=2): straight line distance
Weighted Minkowski: $d_k(a, b) = \sqrt[k]{\sum_{i=1}^{d} w_i |a_i - b_i|^k}$. In some scenarios, not all dimensions are equal; we can convey this idea using the weights $w_i$. Generally not a good idea: you should normalize the data by Z-scores before computing distances instead.
Cosine Similarity: $\cos(a, b) = \frac{a \cdot b}{|a||b|}$ calculates the similarity between two non-zero vectors, where $a \cdot b$ is the dot product. The result is normalized between -1 and 1 (0 and 1 for non-negative vectors); higher values imply more similar vectors.
Kullback-Leibler Divergence: $KL(A\|B) = \sum_{i=1}^{d} a_i \log_2 \frac{a_i}{b_i}$. KL divergence measures the distance between probability distributions by measuring the uncertainty gained or lost when replacing distribution A with distribution B. However, it is not a metric, but it forms the basis for the Jensen-Shannon divergence metric.
Jensen-Shannon: $JS(A, B) = \frac{1}{2} KL(A\|M) + \frac{1}{2} KL(B\|M)$, where M is the average of A and B. The JS function is the right metric for calculating distances between probability distributions.
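To make these definitions concrete, here is a minimal NumPy sketch of the measures above; the function names and toy vectors are illustrative, not from any particular library.

import numpy as np

def minkowski(a, b, k=2, w=None):
    # Weighted Minkowski distance; w defaults to equal weights
    a, b = np.asarray(a, float), np.asarray(b, float)
    w = np.ones_like(a) if w is None else np.asarray(w, float)
    return (w * np.abs(a - b) ** k).sum() ** (1.0 / k)

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def kl_divergence(p, q):
    # Assumes p and q are discrete distributions with strictly positive entries
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(minkowski([0, 0], [3, 4], k=2))        # 5.0  (Euclidean)
print(minkowski([0, 0], [3, 4], k=1))        # 7.0  (Manhattan)
print(cosine_similarity([1, 0], [1, 1]))     # ~0.707
print(js_divergence([0.5, 0.5], [0.9, 0.1])) # symmetric, unlike KL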
KNN Algorithm
1. Compute the distance D(a, b) from point b to all points
2. Select the k closest points and their labels
3. Output the class with the most frequent label among the k points
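A brute-force sketch of these three steps (an illustrative helper, not a library call), assuming Euclidean distance:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. Compute the distance from the query point to all training points
    dists = np.linalg.norm(X_train - query, axis=1)
    # 2. Select the k closest points and their labels
    nearest = np.argsort(dists)[:k]
    # 3. Output the most frequent label among those k points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))  # "b"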
Optimizing KNN
Comparing a query point a in d dimensions against n training examples runs in O(nd) time, which can cause lag once the number of points reaches the millions or billions. Popular choices to speed up KNN include:
• Voronoi Diagrams: partition the plane into regions based on distance to the points in a specific subset of the plane
• Grid Indexes: carve up space into d-dimensional boxes or grids and calculate the NN in the same cell as the query point
• Locality Sensitive Hashing (LSH): abandons the idea of finding the exact nearest neighbors. Instead, nearby points are batched up so we can quickly find the most appropriate bucket B for our query point. LSH is defined by a hash function h(p) that takes a point/vector as input and produces a number/code as output, such that it is likely that h(a) = h(b) if a and b are close to each other, and h(a) != h(b) if they are far apart.
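As one concrete (if simplified) example of such an h(p), random-hyperplane hashing assigns each point a bit signature from the signs of a few random projections. The sketch below is an assumed illustration of how a bucket function could look, not a scheme spelled out in the text above.

import numpy as np

rng = np.random.default_rng(0)

def make_lsh(d, n_bits=8):
    # Random hyperplanes; points on the same side of every plane share a bucket
    planes = rng.normal(size=(n_bits, d))
    def h(p):
        return tuple((planes @ p) > 0)  # bit signature acts as the bucket id
    return h

h = make_lsh(d=2)
a, b, c = np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([-5.0, 2.0])
print(h(a) == h(b))  # likely True: a and b are close
print(h(a) == h(c))  # likely False: a and c are far apart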
K-Means Clustering
1. Choose a K. Randomly assign a number between 1 and K to each observation. These serve as the initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using the distance metric).
Since the results of the algorithm depend on the initial random assignments, it is a good idea to repeat the algorithm from different random initializations and keep the best overall result. MSE can be used to determine which cluster assignment is better.
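A compact NumPy sketch of the loop above, assuming Euclidean distance and toy data; a real implementation would also guard against empty clusters and rerun from several seeds as just noted.

import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial cluster assignments (0..K-1 here)
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # Step 2(a): each cluster centroid is the mean of its observations
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2(b): reassign each observation to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = k_means(X, K=2)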
Hierarchical Clustering
An alternative clustering algorithm that does not require us to commit to a particular K. Another advantage is that it results in a nice visualization called a dendrogram. Observations that fuse at the bottom are similar, whereas those that fuse near the top are quite different; we draw conclusions based on the location on the vertical axis rather than the horizontal axis.
1. Begin with n observations and a measure of all the n(n-1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n-1, ..., 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the remaining i - 1 clusters.
Linkage: Complete (max dissimilarity), Single (min), Average, Centroid (between the centroids of clusters A and B)
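In practice this loop is usually delegated to a library; a short sketch with SciPy and matplotlib (both assumed available), using complete linkage as one of the options above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 4])
Z = linkage(X, method="complete")                # also: "single", "average", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
dendrogram(Z)                                    # fusion heights encode dissimilarity
plt.show()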
Machine Learning Part I
Comparing ML Algorithms
Power and Expressibility: ML methods differ in terms of complexity. Linear regression fits linear functions, while NNs define piecewise-linear separation boundaries. More complex models can be more accurate, but at the risk of overfitting.
Interpretability: some models are more transparent and understandable than others (white box vs. black box models).
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs).
Training Speed: models differ in how fast they fit the necessary parameters.
Prediction Speed: models differ in how fast they make predictions given a query.
Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.
Problem: Suppose we need to classify a vector X = x_1 ... x_n into m classes, C_1 ... C_m. We need to compute the probability of each possible class given X, so we can assign X the label of the class with the highest probability. We can calculate this probability using Bayes' theorem:
$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$
Where:
1. $P(C_i)$: the prior probability of belonging to class i
2. $P(X)$: normalizing constant, or the probability of seeing the given input vector over all possible input vectors
3. $P(X|C_i)$: the conditional probability of seeing input vector X given we know the class is $C_i$
The prediction model will formally look like:
$C(X) = \operatorname{argmax}_{i \in \text{classes}(t)} \frac{P(X|C_i)\,P(C_i)}{P(X)}$
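A short sketch using scikit-learn's Gaussian variant, which is one of several ways to model P(X|Ci); the toy data here is made up for illustration.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 0.4], [4.1, 0.6]])
y = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X, y)            # estimates P(Ci) and per-feature P(xj|Ci)
print(model.predict([[1.1, 2.0]]))        # argmax over classes of P(X|Ci)P(Ci)
print(model.predict_proba([[1.1, 2.0]]))  # posteriors, normalized by P(X)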
Machine Learning Part II
Decision Trees
Binary branching structure used to classify an arbitrary input vector X. Each node in the tree contains a simple feature comparison against some field (e.g., x_i > 42?). The result of each comparison is either true or false, which determines whether we proceed to the left or the right child of the given node. Also sometimes called classification and regression trees (CART).
Advantages: non-linearity, support for categorical variables, easy to interpret, applicable to regression.
Disadvantages: prone to overfitting, unstable (not robust to noise), high variance, low bias.
Note: rarely do models use just one decision tree. Instead, we aggregate many decision trees using methods like ensembling, bagging, and boosting.
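A minimal scikit-learn sketch; the iris dataset and the depth limit are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(data.data, data.target)  # depth limit curbs overfitting
print(export_text(tree, feature_names=list(data.feature_names)))  # the tree as nested feature comparisons
print(tree.predict(data.data[:2]))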
Ensembles, Bagging, Random Forests, Boosting
Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called "wisdom of crowds" approach. The most-predicted class will be the final prediction.
Bagging: ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree trained on a distinct subsample.
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but when we build the trees, every time we consider a split, a random sample of m of the p predictors is chosen as split candidates rather than the full set (typically m ≈ √p). When m = p, we are just doing bagging.
Boosting: the main idea is to improve the model where it is not performing well by using information from previously constructed classifiers. A slow learner. Has three tuning parameters: the number of classifiers B, the learning parameter λ, and the interaction depth d (controls the interaction order of the model).
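A sketch of both ideas with scikit-learn (the dataset and hyperparameters are illustrative); max_features maps onto the random forest's m ≈ √p, while learning_rate and max_depth map onto the boosting parameters λ and d described above.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Random forest: B trees, each split chosen from a random subset of ~sqrt(p) features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt")
# Boosting: B classifiers, learning parameter (lambda), interaction depth d
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=2)

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(gb, X, y, cv=5).mean())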
Machine Learning Part III
Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane at the maximum distance from the training observations. This distance is called the margin. Points that fall on one side of the hyperplane are classified as -1 and those on the other side as +1.
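A minimal scikit-learn sketch of a linear SVM in this spirit; the blobs and the C value are illustrative.

import numpy as np
from sklearn.svm import SVC

# Two roughly separable blobs labeled -1 and +1
X = np.vstack([np.random.randn(20, 2) - 2, np.random.randn(20, 2) + 2])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)  # fits the separating hyperplane
print(clf.coef_, clf.intercept_)             # hyperplane parameters w and b
print(clf.predict([[3, 3], [-3, -3]]))       # almost surely [ 1 -1] for this toy data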
Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. Essentially, we are "dropping" the least important feature variables.
Principal Component Analysis is the process by which principal components are calculated and then used to analyze and understand the data. PCA is an unsupervised approach and is used for dimensionality reduction, feature extraction, and data visualization. The variables produced by PCA are uncorrelated. Scaling the variables is also important when performing PCA.
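A short scikit-learn sketch (iris is an illustrative dataset); note the explicit scaling step, which the text flags as important.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scale variables before PCA
pca = PCA(n_components=2).fit(X_scaled)       # keep the two strongest components
X_reduced = pca.transform(X_scaled)           # 4 correlated features -> 2 uncorrelated ones
print(pca.explained_variance_ratio_)          # share of variance each component explains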