
FAST NUCES

Spring Semester 2024

Teacher: Abdurrehman, Eesha, Javeria


Roll No.: 21L-5691, 21L-6269, 21L-7694

6th SEMESTER
Data Mining (DS3002)
Notes don't collapse headings !
PRAY FOR US, BROTHERS AND SISTERS!!!

Basic Classifiers
★ Performance metrics:
○ Accuracy = (TP + TN) / (TP + FP + TN + FN)
○ Error rate = (FP + FN) / (TP + FP + TN + FN)

★ Base classifiers
○ Decision Tree based
○ Rule-based (hand-written)
○ KNN
○ Naive Bayes + Bayesian Belief Networks
○ SVM
○ NN (and its children)
★ Ensemble classifiers – results of 2 or more classifiers are joined
○ Boosting
○ Bagging
○ Random Forests

Decision Trees

★ Split the records based on a value of an attribute


★ For numerical, set a range to define split

★ Algorithms that are used to build decision trees:
○ Hunt’s
○ CART
○ ID3, C4.5 – use info gain using entropy
○ SLIQ
○ SPRINT

Advantages

★ Relatively inexpensive
★ Fast on unknown records
★ Robust to noise
★ Easily handles redundant attributes
★ Easily handles irrelevant attributes
★ The predictions remain unaffected by normalization
★ Can approximate any boolean function with sufficient depth

Disadvantages

★ Interacting attributes may be passed over
★ Each decision boundary involves a single attribute

Hunt’s Algorithm

★ Greedy
★ If Dt is the training set associated with node t:
○ Case 1: all the records in Dt have the same class yt
■ t becomes a leaf node labeled yt
○ Case 2: records in Dt belong to different classes
■ Recursively split the data until you reach nodes whose records are pure – each belongs to one class
★ At each split, (a,b) represents the number of records that belong to each class based on
the current test condition

★ If a child node is empty, it is labeled with the majority class of its parent node
★ If all the records of a node have identical values besides the class, the node is declared a
leaf node and labeled with the majority class of training records associated with it
★ Stopping criteria:
○ All the records belong to the same class
○ All the records have identical attribute values
○ Early termination to avoid overfitting the model
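
A minimal, self-contained Python sketch of Hunt's algorithm as summarized above. The greedy attribute choice here uses weighted Gini, which the algorithm itself leaves open; the data and helper names are illustrative, not from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attrs, parent_majority=None):
    """Recursive sketch of Hunt's algorithm on categorical attributes."""
    if not records:                                   # empty child -> parent's majority class
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                         # Case 1: node is pure
        return labels[0]
    if not attrs or all(r == records[0] for r in records):   # identical attribute values
        return majority
    # Case 2: greedily pick the attribute whose split gives the lowest weighted Gini
    def weighted_gini(a):
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[a], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attrs, key=weighted_gini)
    node = {"test": best, "children": {}}
    for v in set(r[best] for r in records):
        sub = [(r, y) for r, y in zip(records, labels) if r[best] == v]
        node["children"][v] = hunt([r for r, _ in sub], [y for _, y in sub],
                                   [a for a in attrs if a != best], majority)
    return node

# Tiny usage example with made-up records
data = [{"refund": "yes", "status": "single"},
        {"refund": "no",  "status": "married"},
        {"refund": "no",  "status": "single"}]
print(hunt(data, ["no", "no", "yes"], ["refund", "status"]))
```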

Evaluating the goodness of a test condition

★ Greedy: purer class distribution preferred


★ Worst case: equal distribution
★ Using measures of node impurity
★ Info gain is the difference between the impurity before and after the split.
★ Info gain is the mutual information between the class and the splitting variable
★ In each of the following equations:
○ pi(t) is the (frequency of class i at node t) / n
○ c is the total number of classes
★ Misclassification error
○ Error(t) = 1 − max[ pᵢ(t) ]
○ Maximum value of 1 − 1/c
○ Minimum value of 0 for the purest split
○ Not sensitive to changes in the class probabilities, so entropy (and Gini) are preferred over misclassification error
★ Gini index
○ Gini(t) = 1 − Σ_{i=0..c−1} pᵢ(t)²
○ Max is 1 − 1/c when equal distribution


○ Information Gain with Gini
○ If using brute force to calculate GINI for continuous variables, the overall complexity of the task is O(N²)
■ Approach? Sort the continuous values in ascending order, then linearly scan them
■ Choose the split position (i) with the least GINI index value
■ Δ = I(parent) − Σ_{j=1..k} [N(vⱼ)/N] · I(vⱼ)
■ The second term is the weighted Gini: multiply each child node's fraction of the records by its Gini and sum over all the child nodes


★ Entropy
○ Entropy(t) = − Σ_{i=0..c−1} pᵢ(t) log₂ pᵢ(t)
○ Max = log₂(c) when equal distribution


○ Info gain / gain split
■ Gain = entropy(p) − Σ_{i=1..k} (Nᵢ/N) · entropy(i)

■ Entropy (p) is the entropy of the parent node


■ Entropy (i) is the entropy of the children
■ Ni is the number of records in child i
■ k is the no. of children
■ Used in ID3, C4.5
★ Problem with info gain is that it prefers splits that result in a larger number of partitions
even if they don’t cause a significant (or any) amount of learning
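
A small plain-Python sketch of the impurity measures above and the resulting information gain of a split; the toy class counts are illustrative.

```python
import math
from collections import Counter

def class_probs(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def misclassification_error(labels):
    return 1.0 - max(class_probs(labels))                   # 1 - max p_i(t)

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probs(labels))   # 1 - sum p_i(t)^2

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels) if p > 0)

def info_gain(parent, children, impurity=entropy):
    """Gain = I(parent) - sum_i (N_i / N) * I(child_i)."""
    n = len(parent)
    return impurity(parent) - sum(len(c) / n * impurity(c) for c in children)

parent = ["+"] * 7 + ["-"] * 5
children = [["+"] * 6 + ["-"], ["+"] + ["-"] * 4]             # a two-way split
print(round(info_gain(parent, children), 3))                  # gain using entropy
print(round(info_gain(parent, children, impurity=gini), 3))   # gain using Gini
```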

Gain Ratio

★ Gain Ratio = Gain_split / Split Info
★ Split Info = − Σ_{i=1..k} (Nᵢ/N) log₂(Nᵢ/N) – essentially the entropy of how the records are distributed across the child nodes
★ Penalizes large number of small partitions
★ Used in C4.5


★ The problem with large number of partitions: Node impurity measures tend to prefer
splits that result in large number of partitions, each being small but pure
○ Eg. instead of two split partitions of [9,2] and [3,7] the algorithm prefers [0,1]
[0,1] [0,2] [0,1] … [0,1]
○ This reduces the actual information gained from this
★ GPT EXAMPLE LINK

Classification errors
★ Training, testing, generalization (expected error over random selection from the same
distribution)
★ Underfitting: large training and test errors
★ Overfitting: small training error but large testing error
★ More training data → similar training and testing error values

Model Selection

★ To estimate generalization error, prevent overfitting


★ Gen. Error = Train Error + Ω × (k / N_train)

○ Ω : trade-off hyper-parameter – the relative cost of adding a leaf node. Could also be written ɑ
○ k : number of leaf nodes in the tree
○ N_train : total number of training records

Pre-Pruning

★ Stopping before the model becomes a fully-grown tree


○ All instances belong to the same class
○ Identical attribute values
○ Gen. error falls below a threshold

Post-Pruning

★ Grow tree to its entirety


★ Test trims (bottom-up) and replace if improvement in gen. error
★ Class label of leaf node → majority in the sub-tree
Model Evaluation

★ Performance on test set


★ Hold-out
○ Reserve k% data for testing
○ Random subsampling: repeated holdout
★ Cross validation
○ K disjoint subsets
○ K-fold: train on k-1, test on 1
○ Leave-one-out: k=n
★ Repeated cross validation
○ Cross validation multiple times
○ Estimates the variance of gen. Error
★ Stratified cross validation
○ Guarantees the same percentage of class labels in training and test
○ Especially important when there is class imbalance and/or a small sample
★ Nested cross validation for model selection + evaluation
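
A minimal sketch of k-fold and stratified cross-validation, assuming scikit-learn is available; the dataset and hyper-parameters are illustrative only.

```python
# 5-fold and stratified 5-fold cross-validation of a small decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)

# Plain 5-fold CV: train on k-1 folds, test on the remaining one
scores = cross_val_score(clf, X, y, cv=5)

# Stratified folds keep the class proportions the same in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_scores = cross_val_score(clf, X, y, cv=skf)

print(scores.mean(), strat_scores.mean())
```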

Bayes Classifier
★ Law of total probability: the probabilities over a full partition of outcomes sum to 1, so P(X) = Σᵢ P(X | Yᵢ) P(Yᵢ)
★ Posterior probability = P(y|x)
★ Learn all possible P(y|x)
★ Conditional probability: P(Y|X) = P(X, Y) / P(X)

Bayes theorem

★ P(Y|X) = P(X|Y) P(Y) / P(X)

Conditional Independence

★ X, Y conditionally independent given Z if


○ P(X|YZ) = P(X|Z)
○ X and Y are independent if P(X) x P(Y) =P(X,Y)

Naive Bayes

★ Assume independence between attributes


★ For continuous data, assume a normal distribution
★ Normal distribution: P(xᵢ | y) = (1 / √(2πσ²)) · exp(−(xᵢ − μ)² / (2σ²)), with μ and σ² estimated per class
○ (personally, I doubt this is important but it was in the slides so here you go)

★ Issue: if any one conditional probability is 0, the entire product becomes 0


★ Robust to noise, irrelevant attributes
★ Redundant and correlated attributes violate the class-conditional independence assumption – use Bayesian belief networks
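
A small sketch of a categorical naive Bayes classifier. It adds Laplace (add-one) smoothing, which is the common fix for the zero-probability issue noted above; the smoothing choice and toy data are assumptions, not from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    priors = Counter(y)
    counts = defaultdict(Counter)          # counts[(class, feature_index)][value]
    domains = [set() for _ in X[0]]        # distinct values per attribute (for smoothing)
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[(yi, j)][v] += 1
            domains[j].add(v)
    return priors, counts, domains, len(y)

def predict_nb(model, x, alpha=1.0):
    priors, counts, domains, n = model
    best, best_score = None, float("-inf")
    for c, nc in priors.items():
        score = math.log(nc / n)                                   # log prior
        for j, v in enumerate(x):
            # Laplace smoothing avoids a zero probability for unseen values
            p = (counts[(c, j)][v] + alpha) / (nc + alpha * len(domains[j]))
            score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best

X = [["sunny", "hot"], ["rainy", "cool"], ["sunny", "cool"], ["rainy", "hot"]]
y = ["no", "yes", "yes", "no"]
print(predict_nb(train_nb(X, y), ["sunny", "cool"]))   # -> "yes"
```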

KNN
★ Rank the feature vectors according to Euclidean distance
○ d(x, xᵢ)² = (1/n) Σⱼ (xⱼ − xᵢⱼ)²  – normalized Euclidean distance
○ Select the k closest to x
★ Increasing k makes the boundary smoother
★ As n increases, the optimal value of k tends to decrease
★ Weighted distances – to filter out unimportant features
○ d(x, x′)² = Σᵢ wᵢ (xᵢ − x′ᵢ)²
★ Solved example
★ GPT EXAMPLE LINK
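
A minimal KNN sketch using the weighted / normalized squared Euclidean distance from above; the toy data and the value of k are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, weights=None):
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    w = np.ones(X_train.shape[1]) if weights is None else np.asarray(weights)
    # weighted squared distance, normalized by the number of features
    d2 = ((X_train - x) ** 2 * w).sum(axis=1) / X_train.shape[1]
    nearest = np.argsort(d2)[:k]                       # indices of the k closest points
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

X = [[1, 1], [1, 2], [5, 5], [6, 5]]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, [1.5, 1.5], k=3))              # -> "A"
```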

SVM
★ Decision boundary will hereby be referred to as hyperplane
★ Margin is the distance between the hyperplane and the closest data points on each side
○ Margin = 2 / ||w||

★ Try to maximize the value of margin to handle overfitting


★ Points are classified by the sign of f(x) = w·x + b
○ If f(x) = 0 then the point lies exactly on the hyperplane (inside the margin), which means you did something stupid
★ For linearly separable data, use a linear SVM
★ For non-linearly separable data, use a non-linear (kernel) SVM
★ Linear SVM example (yt)
★ Non-linear SVM example (again, yt)
★ The learning problem is formulated as a convex optimization problem
○ Easypeasy global minima
○ High computational complexity for building the model
★ Robust to noise
★ Can handle irrelevant and redundant attributes quite well
★ User provides kernel and cost functions
★ Missing values are a conundrum
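
A short sketch contrasting a linear and a kernel (non-linear) SVM, assuming scikit-learn is available; the user-supplied kernel and cost C values are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)            # linear hyperplane
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)   # non-linear (kernel) SVM

print(linear_svm.score(X_te, y_te), rbf_svm.score(X_te, y_te))
```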

Evaluation Metrics
★ Basic confusion matrix


★ Accuracy = (TP + TN) / (TP + FP + TN + FN)
★ Precision = TP / (TP + FP)

○ The positively classified are indeed positive


○ Measure of relevant data points
★ Recall / Sensitivity / True Positive Rate = TP / (TP + FN)

○ Get all the positives


★ F1 Measure = 2rp / (r + p)

○ r→recall
○ p→precision
○ The harmonic mean of the two. (closer to the smaller)
★ Specificity / True Negative Rate = TN / (TN + FP)
★ False Positive Rate = FP / (TN + FP)

○ Also, 1 - specificity
★ False Negative Rate = FN / (TP + FN)
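
A plain-Python sketch computing all of the metrics above from the four confusion-matrix counts; the example counts are made up.

```python
def metrics(tp, fp, tn, fn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)                # sensitivity / true positive rate
    f1        = 2 * recall * precision / (recall + precision)
    specificity = tn / (tn + fp)              # true negative rate
    fpr = fp / (tn + fp)                      # = 1 - specificity
    fnr = fn / (tp + fn)
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                f1=f1, specificity=specificity, fpr=fpr, fnr=fnr)

print(metrics(tp=40, fp=10, tn=45, fn=5))
```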

Bias and Variance


★ Bias: the amount by which a model's predictions differ from the target values on the training data
○ The assumptions are too basic – the model does not learn important patterns
○ Creates consistent errors
○ Causes underfitting
★ Variance: high variance leads to overfitting

Ensemble methods
★ Combination of predictors
○ similar/different kinds
○ Weighted combinations–upweight the better predictors
★ Stacked ensembles: predictor of predictors
○ Multi-layer perceptron
○ Weighted predictions with learned weights
○ Training on validation data avoids giving high weight to overfit models

Bagging –bootstrap aggregation


★ Train each classifier on a portion of the data
★ (Boosting, in contrast, focuses new learners on previous errors)
★ Combine through model averaging
★ Cross-validation
○ Check for the best performing classifier combination
★ Sample a random subset of m samples
★ Create a training set of m` <m samples
★ Train on the random training set
★ Test with each trained classifier
○ Take the majority result
○ For regression, average the results
★ The data points left out of each bootstrap sample suppress memorization
★ Doesn’t work for linear models as averaging their function also results in a linear
function
★ Linear + threshold = non-linear
○ Hence perceptrons are good to go
★ Bagging reduces variance and slightly increases bias
★ Decision trees:
○ Lowers complexity
○ Reducing variance and keeps bias low
○ Improves prediction accuracy but worsens interpretability
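
A minimal bootstrap-aggregation sketch: each tree is trained on a random sample drawn with replacement and predictions are combined by majority vote (scikit-learn assumed available; the ensemble size and depth are illustrative).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap sample (drawn with replacement)
    trees.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))

# Majority vote across the ensemble (binary labels, so a mean >= 0.5 is the majority)
votes = np.stack([t.predict(X) for t in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the bagged ensemble:", (bagged_pred == y).mean())
```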

Random Forests

★ Introduce variance to actually learn:


○ At each split, allow only a random subset of the features
○ Enforces diversity
○ Low bias
○ Average over these learners

Gradient Boosting
There is gradient descent in the linear regression section below

★ The ensemble version of gradient descent: iteratively adds new models to minimize the cost
★ Learn a sequence of predictors
★ The sum of predictions is increasingly accurate – and increasingly complex
★ If J is the cost, adjust the predictions ŷ[i] to reduce the error:
○ ŷ[i] = ŷ[i] + ɑ · f[i]
■ where f[i] is fit to the (negative) gradient of J(y, ŷ) – for squared error these are the residuals
★ Start with a simpler model → subsequent models predict the error residuals of the
previous predictions → overall prediction given by a weighted sum of the collection
★ Visualized
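
A small sketch of boosting on error residuals for regression, in the spirit of the steps above: start with a very simple model, fit each new shallow tree to the residuals, and add its prediction scaled by ɑ (scikit-learn trees assumed available; data and hyper-parameters are illustrative).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

alpha, models = 0.1, []
pred = np.full_like(y, y.mean())              # start with a very simple model
for _ in range(100):
    residuals = y - pred                      # negative gradient of the squared error
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    models.append(stump)
    pred += alpha * stump.predict(X)          # y_hat <- y_hat + alpha * f_i

print("MSE after boosting:", ((y - pred) ** 2).mean())
```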

Minimizing Weighted error

★ Weighted error: J(θ) = Σᵢ wᵢ Jᵢ(θ, xᵢ)

Linear Regression

★ Prediction: ŷ (x) = θxT


○ xT is the transpose of vector x
○ θ = [θo, θ1,.....,θn]
○ x = [1, x1, ….., xn]

L2: Mean Squared Error

★ MSE: J(θ) = (1/m) Σⱼ ( y⁽ʲ⁾ − θ · x⁽ʲ⁾ᵀ )²
★ Computationally convenient
★ Measures the variance of the residuals
★ Corresponds to likelihood under Gaussian model of noise

MSE Cost Function

★ J(θ) = (1/m) (yᵀ − θXᵀ) · (yᵀ − θXᵀ)ᵀ

Minimizing cost using gradient descent


★ Choose the direction in which the gradient is negative:
○ ∇J(θ) = ∂J(θ)/∂θ < 0
○ ∇J(θ) = −(2/m) (yᵀ − θXᵀ) · X < 0
○ X is the feature matrix
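
A minimal batch gradient descent sketch for the MSE cost, using the gradient above with θ as a row vector and X holding one sample per row (synthetic data and the learning rate are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])   # features [1, x1]
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=m)

theta = np.zeros(X.shape[1])
lr = 0.01
for _ in range(2000):
    grad = -(2.0 / m) * (y - X @ theta) @ X    # gradient of the MSE w.r.t. theta
    theta -= lr * grad                         # step in the descent direction

print(theta)   # should approach [3, 2]
```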

Newton’s Method

Find the root of cost function

★ Initialize at a point z
★ Compute the tangent at z and where it crosses the z-axis
○ z′ = z − f(z) / ∇f(z)
★ To minimize the cost J(θ), find the roots of its gradient ∇J(θ)
○ θ′ = θ − ∇J(θ) / ∇∇J(θ)
★ Does not always converge
○ ∇f(z) may be 0, which leads to an undefined, geometrically bonkers situation
★ If it is going to converge, it will do so fast
★ Works well for smooth, non-pathological (well-behaved), locally quadratic functions
★ High computation for large n:
○ Storage = O(n²)
○ Time = O(n³)
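
A tiny sketch of Newton's method as the update θ′ = θ − ∇J(θ)/∇∇J(θ) in one dimension, including the guard for a zero second derivative; the example cost is illustrative.

```python
def newton(dJ, ddJ, theta, iters=10):
    """1-D Newton's method on the gradient dJ with second derivative ddJ."""
    for _ in range(iters):
        second = ddJ(theta)
        if second == 0:                 # tangent never crosses the axis -> undefined step
            break
        theta = theta - dJ(theta) / second
    return theta

# Example: quadratic cost J(theta) = (theta - 3)^2 -> J' = 2(theta - 3), J'' = 2.
# For an exactly quadratic cost, Newton converges in a single step.
print(newton(lambda t: 2 * (t - 3), lambda t: 2.0, theta=0.0))   # -> 3.0
```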

Stochastic Gradient Descent

★ The zig-zag makes it cheaper


★ StatQuest <3
★ Benefits:
○ More data = more updates per pass
○ Computationally fast
★ Drawbacks:
○ Not strictly descending
○ Stopping conditions may be harder to evaluate (max iterations, threshold for
value, threshold for learning)
★ Advanced SGD adds a momentum term and preconditioning
★ MSE Minimum
○ Used when there is no linear function that hits the data exactly
★ Linear regression example:
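
A minimal stochastic gradient descent sketch for linear regression – one randomly chosen training example per update, so each step is cheap but the path zig-zags (synthetic data, learning rate and epoch count are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=m)

theta = np.zeros(2)
lr = 0.005
for epoch in range(20):
    for i in rng.permutation(m):                 # shuffle the examples each pass
        err = y[i] - X[i] @ theta
        theta += lr * 2 * err * X[i]             # update from a single example

print(theta)   # roughly [3, 2], but the path there is noisy
```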

L1: Mean Absolute Error

★ L1(θ) = Σⱼ | y⁽ʲ⁾ − θ · x⁽ʲ⁾ᵀ |

Non-Linear Regression

★ Polynomial regression in low dimensions ↔ Linear regression in high dimensions


★ To decide the number of features, plot MSE against polynomial order


★ Higher complexity leads to more chances of overfitting
★ Inductive bias is introduced for learning: the assumptions about the data encoded in the learning algorithm.
Leave-one-out cross-validation examples

★ Ex 1: First-order (linear predictor)


○ Use the linear equation ŷ = ax + b
★ Ex 2: Zero-order:
○ ŷ = mean of the other two y values

⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂ ⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂ ⠂⠄⠄⠂☆

Association Rule Mining

Market Basket Transaction

● Huge amounts of customer purchase data


● Retailers are interested in analyzing the data
○ Understand customer purchasing behavior
● Each transaction is represented in binary form (item present / absent)

Issues In Market Basket Analysis

● Computationally expensive
● Discovered patterns can be false

Association Mining Terminologies

● Itemset
○ A collection of one or more items
○ Example: {Milk, Bread, Diaper}
● Support count (σ)
○ Frequency of occurrence of an itemset
○ E.g. σ({Milk, Bread,Diaper}) = 2
● Support
○ Fraction of transactions that contain an itemset
○ E.g. s({Milk, Bread, Diaper}) = 2/5
● Frequent Itemset
○ An itemset whose support is greater than or equal to a minsup threshold
■ Minsup is given in question

Rule Evaluation Metrics

● Support ( S )
○ Fraction of transactions that contain both X and Y
○ Formula for rule X → Y: Support(X → Y) = Frequency(X, Y) / N
■ N = total transactions
■ Frequency(X, Y) = number of transactions containing both X and Y
○ Has the anti-monotone property
■ for every itemset X that is a proper subset of itemset Y, i.e. X ⊂ Y, we have f(Y) ≤ f(X)
● The function f is support in this case
● e.g. ABC (in the role of Y) will have support lower than or equal to AB (in the role of X)
○ Support is often used to eliminate uninteresting rules.
● Confidence ( C )
○ Measures how often items in Y appear in transactions that contain X
○ Formula for rule X → Y: Confidence(X → Y) = Frequency(X, Y) / Frequency(X)
■ Frequency(X) = number of transactions with X
■ Frequency(X, Y) = number of transactions with both X and Y
○ Confidence does not have an anti-monotone property
○ Measures the reliability of the inference made by a rule
■ The higher the confidence, the better

● Lift ( L )
○ Statistically based measure
○ Formula for rule X → Y: Lift(X → Y) = Confidence(X → Y) / Support(Y)
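
A small sketch computing support, confidence and lift for a rule X → Y directly from a list of transactions; the toy market-basket data is illustrative.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

def lift(X, Y):
    return confidence(X, Y) / support(Y)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y), confidence(X, Y), lift(X, Y))
```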

Brute Force Method

● (1) List all possible association rules


● (2) Compute the support and confidence for each rule
● (3) Prune rules that fail the minsup and minconf thresholds
● Issues?
○ Computationally expensive
● Total number of rules generated by brute force?
○ With 6 total items, we choose one more item at every level
■ First level: 1-itemsets
■ Second level and so on: itemsets with one more item each time
■ Here there should be 15, as 6C2 = 15

Apriori Principle

● If an itemset is frequent, then all of its subsets must also be frequent


○ DE is Subset of CDE
● if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.

○ ABC is superset of AB

Apriori Algorithm

● Characteristics
○ It is a level wise algorithm
■ from frequent 1-itemsets to the maximum size of frequent itemsets.
○ Employs a generate-and-test strategy for finding frequent itemsets.
■ Compare the itemset support to minsup and prune if it is lower
● Requirements
○ Only make candidates from frequent itemsets
○ No frequent itemset should be left out by the candidate generation process
○ It should not generate the same candidate itemset more than once.
● Candidate Generation
○ Merge two frequent (k-1)-itemsets if their first (k-2) items are identical
■ Merge(ABC, ABD) = ABCD
■ Merge(ABC, ABE) = ABCE
○ We can also do this alternatively
■ Merge(ABC, BCD) = ABCD
■ Merge(ABD, BDE) = ABDE
● Candidate Pruning


■ Check each candidate's (k−1)-subsets against the frequent itemsets
■ e.g. if ADE is not in F3 (not frequent)
■ then prune every candidate that contains ADE
● Apriori principle


● To learn more
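
A minimal sketch of Apriori's candidate generation (merge two frequent (k−1)-itemsets whose first k−2 items match) followed by candidate pruning via the Apriori principle; F3 below is an illustrative set of frequent 3-itemsets.

```python
from itertools import combinations

def generate_candidates(frequent_k_minus_1):
    """frequent_k_minus_1: list of sorted tuples, all of length k-1."""
    freq = set(frequent_k_minus_1)
    candidates = set()
    for a in frequent_k_minus_1:
        for b in frequent_k_minus_1:
            if a < b and a[:-1] == b[:-1]:            # first k-2 items identical
                candidates.add(a + (b[-1],))
    # Candidate pruning: every (k-1)-subset must itself be frequent
    return [c for c in candidates
            if all(sub in freq for sub in combinations(c, len(c) - 1))]

F3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("A", "C", "E"), ("B", "C", "D")]
print(generate_candidates(F3))   # ABCD survives; ACDE is pruned because ADE is not frequent
```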

Factors Affecting Complexity of Apriori

● Choice of minimum support threshold


○ A low minimum support increases the number of frequent itemsets
● Dimensionality (number of items) of the data set
○ Higher dimensionality increases the number of frequent itemsets
● Size of database
○ The run time of the algorithm increases with the number of transactions
● Average transaction width
○ Wider transactions increase the maximum length of frequent itemsets

Support Counting (Hash Tree)

● Requirements
○ Hash function
○ Max leaf size
■ If the number of itemsets stored in a leaf node exceeds the max leaf size, split the node
● Numerical YT video
○ 1
○ 2

Maximal Frequent Itemset

● An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent
● All frequent itemsets can be derived from the maximal frequent itemsets
● Maximal frequent itemsets do not contain the support information of their subsets.

● All maximal frequent itemsets are closed
● Numerical YT video
○ 1

Closed Itemset

● An itemset X is closed if none of its immediate supersets has the same support as the
itemset X.
○ This has support information also

Closed Frequent Itemset

● An itemset is a closed frequent itemset if it is closed and its support is greater than or
equal to minsup.
○ A closed frequent itemset is simply a closed itemset that is also frequent
● Useful in removing redundant rules
● Honestly there is no difference in actually solving, so in the exam, whether they say closed or closed frequent, it's the same thing

Frequent Pattern Growth (FP)

● Allows frequent pattern mining without generating candidate sets


● Generate FP TREE
○ Scan DB once, find frequent 1-itemset
○ Sort frequent items in frequency descending order
○ Scan DB again, construct FP-tree


● Construct Conditional Pattern Base
○ Traverse the FP-tree by following the link of each frequent item
○ Accumulate all of transformed prefix paths


■ Another better example
● Construct Frequent Patterns

Advantages and Disadvantages of FP Tree

● Advantages of FP-Growth
○ only 2 passes over data-set
○ “compresses” data-set
○ no candidate generation
○ much faster than Apriori
● Disadvantages of FP-Growth
○ FP-Tree may not fit in memory!!
○ FP-Tree is expensive to build

Partitioning Algorithm

● Partition database and find local frequent patterns


● Consolidate and combine the local frequent patterns
● Pigeon hole principle
○ Any itemset that is frequent in DB must be frequent in at least one of the n
partitions

ECLAT Algorithm

● Ok so in apriori we represent transactions as rows with items as columns


○ See above picture in apriori
● But in ECLAT we represent items as rows with transaction as columns


● Advantages:
○ Efficient for dense datasets with many frequent itemsets.
○ ECLAT avoids generating a large number of candidate itemsets like Apriori.
○ Depth-First Search: The recursive approach allows for effective traversal of the
search space.

Statistical Based Measures

Yt numerical

● Lift


○ P(Y|X) is calculated as
■ Prob of (X, Y) / Prob of X
● 15 / 20 = 0.75
■ P(Y|X) is also called confidence
○ Lift = P(Y|X) / P(Y) = 0.75 / 0.9 = 0.833
■ < 1, therefore X and Y are negatively associated
● Interest
○ P(X, Y) / (P(X) × P(Y)) = 0.15 / (0.20 × 0.90) = 0.833
■ Using the probabilities from the matrix above
● PS (Piatetsky-Shapiro)
○ 0.15 − 0.20 × 0.90 = −0.03
● Coefficient (φ)
○ Numerator = 0.15 − 0.20 × 0.90 = −0.03
○ Denominator = √(0.20 × (1 − 0.20) × 0.90 × (1 − 0.90)) = √0.0144 = 0.12
○ −0.03 / 0.12 = −0.25
● To learn more
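
A small sketch computing the measures above from the joint probabilities of a 2×2 contingency table, reusing the illustrative numbers P(X,Y)=0.15, P(X)=0.20, P(Y)=0.90.

```python
import math

def measures(p_xy, p_x, p_y):
    confidence = p_xy / p_x                   # P(Y|X)
    lift = confidence / p_y
    interest = p_xy / (p_x * p_y)             # equal to lift
    ps = p_xy - p_x * p_y                     # Piatetsky-Shapiro
    phi = ps / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, interest, ps, phi

print(measures(0.15, 0.20, 0.90))   # lift ~0.83 < 1 -> negatively associated
```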

Simpson’s Paradox

● Hidden variables may influence the observed relationship



○ Overall, more people buy exercise machines when they buy HDTVs
○ But within each subgroup of the hidden variable, customers who do not buy high-definition televisions are more likely to buy exercise machines
○ The reversal in the direction of association is known as Simpson's paradox
● How to choose min support ?
○ If minsup is too high, we could miss itemsets involving interesting rare items
■ (e.g., {caviar, vodka})
○ If minsup is too low, it is computationally expensive and the number of itemsets
is very large
● Use the cross-support ratio
○ Given an itemset X with items (x₁, x₂, …, x_d): r(X) = min{s(x₁), …, s(x_d)} / max{s(x₁), …, s(x_d)}
○ A low ratio indicates an itemset that mixes very frequent and very rare items

Contingency Table

● Disjoint events
○ Additive rule for disjoint events: P(A ∪ B) = P(A) + P(B)
● General additive rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
● Independence
○ When two events are independent, no information is gained from the knowledge of the other event: P(A ∩ B) = P(A) × P(B)
○ Example: rolling dice
■ Chances of getting a 5 the first time? 1/6
■ Chances of getting a 5 the second time? Still 1/6 – the second roll is independent of the first
○ Events are not independent when knowing one outcome changes the probability of the other
● Conditional probability: P(A | B) = P(A ∩ B) / P(B)


Clustering

● Unsupervised learning
● Types of clustering?
○ Partitional
■ Non overlapping
○ Hierarchical
■ Set of nested clusters organized as a hierarchical tree
● Distinctions bw clusterings
○ Exclusive
■ Points belong to only one cluster only
○ Non-Exclusive
■ Points may belong to multiple clusters
○ Partial clustering
■ Cluster only some of the data

K Means clustering

● Cost function: J = Σ over clusters j, Σ over points x in cluster j, of ||x − μⱼ||²
○ The distance of each point (x) from its assigned centroid (μ) must be minimized
● Initialization methods
○ Random
■ Issue: may choose nearby points
○ Distance based
■ Start with one random data point
■ Find the point farthest from the clusters chosen so far
■ Issue: may choose outliers
○ Random + Distance
■ Choose next points “far but randomly”
● Choosing the Number of Clusters (K)
○ The cost always decreases as k increases!
■ Add a penalty: Total = Error + Complexity
■ Elbow method: stop increasing k once the cost no longer decreases rapidly
● Numericals YT videos
○ 1
○ 2
○ Will add later
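
A minimal k-means (Lloyd's algorithm) sketch with random initialization, alternating between assigning points to the nearest centroid and recomputing centroids until they stop moving (numpy assumed available; data and k are illustrative).

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # final assignment and cost (sum of squared distances to centroids)
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    cost = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, cost

X = np.vstack([np.random.default_rng(1).normal(loc, 0.3, size=(30, 2)) for loc in (0, 3, 6)])
labels, centroids, cost = kmeans(X, k=3)
print(centroids, cost)
```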

Hierarchical Agglomerative Clustering

● Complexity
○ O(m² log m)
● Cluster distance scoring methods
○ Single linkage
■ min
○ Complete linkage
■ max
○ Average linkage
● Numericals YT videos
○ 1
○ 2

DBSCAN

● Core points
○ Have at least the specified number of points (MinPts) within their epsilon-neighborhood
○ The count includes the point itself
● Border points
○ In the neighborhood of a core point, but not core points themselves
● Noise points
○ Neither border nor core points

● Advantages
○ Can handle clusters of different shapes and sizes
○ Resistant to noise
● When it does not work
○ Varying densities
○ High-dimensional data

Evaluation metrics

● Cluster cohesion
○ How closely related the points within a cluster are
■ SSE: sum of squared errors (squared distances of points to their cluster centroid)
● Cluster separation
○ How distinct / well-separated the clusters are
○ SSB: between-cluster sum of squares, SSB = Σⱼ nⱼ ||μⱼ − μ||², where μ is the overall mean

● Numerical YT videos
○ 1
○ 2
⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂ ⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂ ⠂⠄⠄⠂☆

Optimizing Models

★ Different orders of polynomials create various predictor patterns


★ Overfitting can be tackled by ‘holding out’ data
○ Separate data into train & test
○ Learn only on the training set
○ Use the test/validation set to quantify the quality of the model
★ Underfitting can be tackled by adding more features – making the model more ‘complex’; overfitting, conversely, can be tackled by reducing features using various means

REGULARIZATION

★ Decreasing complexity of a model:


○ Regularization: trades a marginal decrease in training accuracy for an increase in
generalizability
■ Add a certain amount of penalty to the cost function

■ This shrinks the parameters (theta) towards zero


■ Alpha is the regularization strength
■ Types of regularization functions

● L1 tends to shrink coefficients to exactly zero, whereas L2 tends to shrink coefficients evenly
● Lasso tends to generate sparser solutions than the quadratic (L2) regularizer
○ Sparser meaning more of the coefficients end up exactly zero
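
A short sketch comparing no regularization, L2 (ridge) and L1 (lasso) on data where only two features matter, to show the shrink-evenly vs. shrink-to-zero behaviour described above (scikit-learn assumed available; alpha values are illustrative).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)   # only 2 informative features

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)      # shrinks all coefficients evenly towards zero
lasso = Lasso(alpha=0.1).fit(X, y)       # drives irrelevant coefficients exactly to zero

print(np.round(plain.coef_, 2))
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
```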

MODEL SELECTION

★ Model selection – which model fits data the best?


○ Hold-out method – train test split. But what is the best t/t split?
■ Cross-validation: split data multiple times and assign the ‘train’ area at
different positions, and ‘test’ area at different positions
● K-fold cross-validation: divide data into K disjoint sets
● Compare the quality of each of the sets
■ Leave-one-out cross-validation: train on all the data except one example (held out / masked), repeated for every example
■ Disadvantages:
● Cross-validation requires more work and computation, and it
doesn’t have a specific predictor
● Estimates performance with m’ < m data
● Needs enough data to actually work
○ Learning curves – plotting performance for ‘enough’ data

Supervised Learning

★ Supervised learning – X (features), Y (target), Ŷ (predictions) and Ө (parameters)


○ Older, historical data available to learn from
★ Linear regression – predicting a continuous value based on older patterns of
independent variables

★ Perceptron classifier – simple neural element

○ Parameters Ө are sometimes regarded as Weights (W)


○ Additional constant element Bias (1)
○ Decision boundary

★ Separability – a data set is separable by a learner if there is some instance of that learner
that correctly predicts all data points
○ Linearly separable
○ Non-linearly separable

★ Class overlap – data is not separable


○ Can fix by adding features (which can lead to changing the model from, say,
linear to quadratic)
★ Logical gates (AND/XOR) using perceptrons

★ Dimensionality – degree of features


○ The good: separation is easier in higher dimensions
○ The bad: more features lead to overfitting (high variance)
★ Perceptron learning algorithm
○ The simple delta-rule weight update we do in AI (see the sketch after this list)
★ Surrogate loss functions
○ Normally we use simple MSE in perceptron learning
○ But we can also use loss functions like
■ Squared Error Loss: Commonly used in regression tasks.
■ Cross-Entropy Loss: Used in multi-class classification problems
○ Why?
■ Surrogate loss functions provide a smooth approximation to the target
loss function
■ Some surrogate loss functions work better for different tasks
● For multi-class classification, cross-entropy is better than MSE
★ Misc Topics:
○ Sigmoid activation
○ Maximizing decision boundaries
○ Minimizing loss function
○ Best parameters using gradient descent
○ Coordinate descent, stochastic gradient descent, Adam/AdaGrad/RMSprop
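
The sketch referenced above: the perceptron delta-rule update on linearly separable toy data, with the constant bias element added to the features (learning rate, epochs and data are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
Xb = np.column_stack([np.ones(len(X)), X])     # add the constant bias element

w = np.zeros(Xb.shape[1])
lr = 0.1
for _ in range(50):                            # epochs
    for xi, yi in zip(Xb, y):
        pred = 1 if xi @ w > 0 else 0          # threshold (step) activation
        w += lr * (yi - pred) * xi             # delta-rule update

accuracy = ((Xb @ w > 0).astype(int) == y).mean()
print(w, accuracy)
```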

Optimizing a Neural Network

★ Optimizing NN – Avoid Overfitting - High Variance


○ Shallow/narrow networks: using fewer neurons per layer of neural networks
○ Dropouts: drop neurons at a certain percentage at each stage/layer of the neural
network (cannot use on output layer)
○ Regularization: refer above (slight modifications to learning algorithm so the
generalization or predictions are better)
○ Early Stopping: stopping the training process before it starts memorizing the
noise in the dataset (starts overfitting)
★ Optimizing NN – Avoid Underfitting - High Bias
○ increasing layers
○ Increasing number of neurons
○ Increasing the amount of features
★ Convergence of Cost Function
○ Maxima and minima
○ Saddle point

★ Batch size – a hyperparameter; number of samples used in one forward and backward
pass through the network
★ Gradient Descents
○ Batch gradient descent – all the training examples are taken into consideration to take a single step
■ Can be really slow
■ Very smooth
○ Stochastic gradient Descent – Only one training example is taken into
consideration to take a single step
■ This algorithm is faster
■ Not smooth at all
■ Used when dataset is very large
○ Mini-batch Gradient Descent – best of both worlds (A batch of training example
is taken into consideration to take a single step)

★ Vanishing gradient problem – In case of sigmoid and tanh activation functions, if your
weights are large, then the gradient will be very (vanishingly) small
○ Using ReLU/Leaky ReLU solves this problem
★ Exploding gradient problem – opposite
★ Weight initialization – the technique you choose for your neural network can determine
how quickly the network converges or whether it converges at all.
○ Zero initialization - sometimes, using constant values like 0’s and 1’s
■ If we set all the weights to zero, then all the neurons of all the layers perform the same calculation, giving the same output and thereby making the whole deep net useless
● Fails to break symmetry
○ sometimes with values sampled from some distribution (typically a uniform
distribution or normal distribution)
○ Random initialization
■ symmetry-breaking and gives much better accuracy
■ the weights are initialized very close to zero, but randomly
■ can potentially lead to vanishing gradients or exploding gradients.
○ Xavier Initialization - sophisticated scheme
■ If the weights in a network start too small, then the signal shrinks as it
passes through each layer until it’s too tiny to be useful.
■ If the weights in a network start too large, then the signal grows as it
passes through each layer until it’s too massive to be useful.
○ He init initialization – weights are initialized keeping in mind the size of the
previous layer
■ Weights are random but in the rand function the ranges are different
★ Symmetry breaking – all the neurons shouldn’t learn the exact same thing
○ Weights should be small (not too small, medium small) – large weights cause
exploding gradients, especially while using sigmoid activation function.
○ Weights should not be the same – Same weights will prevent neural networks
from learning new features.
○ Weights should have good variance – This will help each neuron to learn new
features.

Multi Layer Perceptron

★ Need to be fully connected in order to work


★ No spatial information
★ Many parameters to be handled

Convolutional Neural Network

★ Understanding images
○ Pixel – smallest unit of an image
○ Greyscale – white to black
○ RGB – combination of red, green and blue
○ An image can have up to 56 channels
○ Images are usually represented as Width x Height x No. of Channels (250, 100, 3)
★ Computer vision – reproducing the capability of human vision
○ Biomedical, facial recognition, self-driving cars, landmark detection, object
detection, etc
★ Spatial/Translation Invariance
○ Object detection regardless of the position of the object in the image
○ Solves the MLP drawback

Regular Neural Networks
★ Transforms the input by putting it through a series of hidden layers
★ Every layer is made up of neurons fully connected to the layer before
★ Output layer is also fully connected

Convolutional Neural Networks
★ Layers are organized in height, width and depth
★ The neurons in one layer do not connect to all the neurons in the next layer
★ Variable sizes

★ CNN Mathematics: https://youtu.be/Y3jF5kAOLvI?si=VRt0h4r87y2zzBn-


★ If you don’t know padding, please just drop the degree
★ What do CNNs learn?
○ Each CNN layer learns filters of increasing complexity.
○ The first layers learn basic feature detection filters: edges, corners, etc
○ The middle layers learn filters that detect parts of objects. For faces, they might
learn to respond to eyes, noses, etc
○ The last layers have higher representations: they learn to recognize full objects,
in different shapes and positions.
★ Limited data hinders learning in deep learning
○ Transfer learning – using pre-trained models on your data
○ Data augmentation – creating new data based on existing data
○ Batch normalization

Neural Networks

★ Perceptrons
○ Features [1, x]
■ Decision rule: T(ax + b) = ax + b >/< 0
■ Boundary: ax + b = 0 => a point
○ Features [1, x, x²]
■ Decision rule: T(ax² + bx + c)
■ Boundary: ax² + bx + c = 0 => up to two points
★ Combining step functions
★ MLP Model

○ Each element in an MLP is a perceptron


○ Uses stacked activations
○ Has flexible function approximation
★ Neural Networks
○ Based on the biological neuron model
○ Activation functions
○ Feed forward neural networks – information flows from left-to-right
■ Used in regression
■ Classification
● Binary
● Multiclass
● MSE with saturating activation
○ Back propagation – training MLPs, optimization of parameters

■ Gradient descent/chain rule


★ Advanced Neural Networks
○ CNNs
○ RNNs
○ LSTMs
○ GNNs

Class Imbalance

★ Skewed datasets (more records for one class than the other(s))
○ Accuracy fails to be a well-suited evaluation criterion

○ We use precision, recall and f-score for this

FP Rate is also called Type 1 error

FN Rate is also called Type 2 error


★ ROC Curve
○ TPR (also called recall) plotted against FPR


■ Trade off between detection and false alarms (eg. patients with suspected
covid vs. patients who just have regular flu)
■ To draw ROC curve, classifier must produce continuous-valued output
● Tests likeliness of positive class
■ Many classifiers produce discrete outputs
■ Use decision trees, rule-based classifiers, neural networks, bayesian
classification, knn, svm
★ Class imbalance can be (coding) treated with
○ Undersampling
○ Oversampling
○ SMOTE
○ SMOTEN
○ BorderlineSMOTE

RNN & LSTM

★ Feed-forward neural networks do not account for sequences and context

RNN (RECURRENT NEURAL NETWORK)

★ RNN allows information to cycle through a loop within parameters


○ Allows reusing parameters
○ recurrent neural networks produce predictive results in sequential data
○ Memory keeping of the initial input
★ RNN has 2 inputs
○ The present data
○ Recent past
★ Types of RNN
○ One-to-one
■ Vanilla Neural Network
○ One-to-many
■ Captioning images
○ Many-to-one
■ Sentiment analysis
○ Many-to-many
■ Machine translation
★ Issues with a standard RNN – vanishing/exploding gradients
★ Diagram

LSTM (LONG SHORT TERM MEMORY)

★ Modified version of RNNs


★ Standard RNNs find it difficult to learn long-term dependencies
○ LSTM solves this by ‘learning’ only the important inputs and ‘forgetting’ the rest
★ LSTM working:
○ Gates
■ Input
■ Output
■ Forget
○ Step 1: Receive input and the previous hidden state
○ Step 2: Forget from forget gate
■ The forget gate uses the previous hidden state (and current input) to generate a value between 0 (forget) and 1 (retain)
○ Step 3: Input gate allows new information to be added
○ Step 4: Update the older cell state using the forget gate we calculated
○ Step 5: Get the output (sigmoid & tanh)
★ Diagram

OTHER NLPs

★ GRU also exists


○ Solves vanishing gradient problem
★ So does BiLSTM
○ Process the input in both forward and backward fashion and combine the output
★ And transformers
○ Encoder transform the input in N dimensions and feed it to the decoder which
outputs a sequence
○ Attention: looks at a sequence and decides which parts are important
★ And GAN
○ Generator: generate real looking images
○ Discriminator: identify which one is a fake.
★ Deep Fakes
○ Creating convincing image, audio and video hoaxes.
★ Autoencoders
○ Learns how to efficiently compress and encode data
○ Learns how to reconstruct the compressed data also

GRAPH THEORY

★ Networks – multiple things that are connected


○ A system/collection of interconnected entities
○ In order to understand complex systems, one must understand the networks
behind them
★ Graphs – a set of vertices or nodes that are connected by a set of edges; the mathematical representation of a network
○ G(V, E), where each edge (u, v) ∈ E connects vertices u, v ∈ V

CONNECTIVITY DEFINITIONS

★ Multi-graph – A graph with self-loops/multi-edges


★ Simple graph – a graph with no self-loops/multi-edges
★ Directed graphs – (u, v) ∈ E is distinct from (v, u) ∈ E
○ Directed edges are called ‘arcs’
○ If both {(u, v), (v, u)} ⊆ E, the arcs are said to be mutual
★ Subgraph/induced graphs – given G(V,E), G’(V’,E’) is an induced subgraph if V’ ⊆ V and
E’ ⊆ E
★ Weighted graph – edges with numerical values/magnitude/measurements
★ Adjacency – representation
○ Vertices u, v ∈ V are said adjacent if joined by an edge in E
○ Edges e1, e2 ∈ E are adjacent if they share an endpoint in V
★ Degree – the number of edges incident to a vertex
○ Degree sum = 2 × |E|
○ For directed graphs we have in-degree and out-degree


■ (In the figure: blue = out-degree, red = in-degree)
★ Strongly connected – for every pair (u, v), u is reachable from v and vice versa

MOVEMENT IN A GRAPH

★ Walk – a walk of length l from v0 to vl is an alternating sequence of vertices and edges


★ Trail – walk without repeated edges
○ Closed trail is a circuit
★ Path – walk without repeated nodes
○ Closed path (v0 repeated) is a cycle
★ Graph is connected if every vertex is reachable from every other node
★ A component is maximal if adding a vertex ruins connectivity

★ Giant connected components have large amount of nodes, vertices, and connections
★ The complete graph has all possible edges on its n vertices
○ n(n−1)/2 pairs, hence n(n−1)/2 edges

★ A d-regular graph has vertices with equal degree


○ Naturally, the complete graph Kn is (n − 1)-regular
■ Cycles are 2-regular (sub) graphs
★ Tree – connected acyclic graph
○ Root: a designated vertex from which all other nodes can be reached

★ Bipartite graph – when V can be partitioned into two disjoint sets and each edge in E has
one endpoint in partition one, and one in partition 2
★ Planar graphs – no overlapping/crossing edges

GRAPH ALGEBRA

★ Adjacency matrix A ∈ {0,1}^(Nv × Nv), such that A[u][v] = 1 if (u, v) ∈ E and 0 otherwise
○ A_u is the undirected version
○ A_d is the directed version
★ Properties
○ Row-wise sums give the vertex degrees

★ Degrees of a graph – the row-wise sums of A
○ For digraphs, A is not symmetric, so row and column sums differ
■ In-degree is the column sum, out-degree is the row sum
○ The degree matrix D stores the degrees on its diagonal
○ Directed graphs have separate matrices for in-degree and out-degree

★ Other notations
○ Walk: the (i, j) entry of Aᵏ counts the walks of length k from i to j
○ Corollary
○ Spectrum: the set of eigenvalues of A
★ Incidence matrix, B, is not a square matrix
○ unless Nv = Ne
○ Edges are the columns and vertices are the rows
○ For undirected graphs, B[v][e] = 1 if vertex v is an endpoint of edge e
○ For digraphs, the entries are signed: one endpoint of each edge gets +1 and the other −1
○ e.g. 4 vertices and 5 edges give a 4 × 5 matrix

★ Laplacian graph L:= D - A


○ Where D is the degree matrix
○ And A is the Adjacency matrix

★ Properties of the Laplacian Matrix
○ It measures smoothness: for a signal x on the graph, xᵀLx = Σ over edges (i, j) of (xᵢ − xⱼ)²
○ It is positive semi-definite
○ Rank deficient
○ Spectrum and connectivity
■ The smallest eigenvalue of L is zero
● If the second smallest eigenvalue is not zero, then G is connected
● If L has n zero eigenvalues, G has n connected components
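
A small numpy sketch tying the pieces above together: build A, the degree matrix D, the Laplacian L = D − A, and read the number of connected components off the number of zero eigenvalues (the example graph is illustrative).

```python
import numpy as np

# Undirected graph: a triangle {0,1,2} plus an isolated edge {3,4} -> 2 components
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

degrees = A.sum(axis=1)           # row-wise sums give the vertex degrees
D = np.diag(degrees)
L = D - A                         # graph Laplacian

eigenvalues = np.linalg.eigvalsh(L)            # L is symmetric positive semi-definite
n_components = int(np.sum(np.isclose(eigenvalues, 0.0)))
print(degrees, np.round(eigenvalues, 3), n_components)   # 2 zero eigenvalues -> 2 components
```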
GRAPH IN DATA STRUCTURES & ALGORITHMS

★ Store graph in a computer using adjacency matrix/list


○ DSA uses graphs for efficient storage and manipulation
○ Algo uses it to scale computational methods
★ Most real-world networks are sparse, i.e. Ne is far smaller than the maximum possible number of edges
★ Graph density: ρ = Ne / (Nv(Nv − 1)/2) for an undirected graph


★ Edge list – stores the vertex pairs of each edge

★ Memory requirement of an adjacency list is O(Nv + Ne)


★ Visited nodes are ‘marked’ while graph exploration
★ BFS vs DFS vs Dijkstra
○ BFS & DFS: https://youtu.be/pcKY4hjDrxk?si=FqbUhNlluYqwue31
○ Dijkstra: https://youtu.be/XB4MIexjvY0?si=_C7vntH54D0avtBG
★ Distances in a graph – geodesic distance (length of the shortest u-v path)

Degree Distribution

★ The degree sequence is not unique to a graph
★ The fraction of vertices with degree d is P(d) = N_d / Nv, where N_d is the number of vertices with degree d
★ The degree distribution of G is the histogram formed from the degree sequence (bins of size one)

Graph Neural Networks

★ Network ML tasks
○ Types:
■ Node classification
● Predict the type of a given node
● Eg: classifying the function of proteins
■ Link prediction
● Recommendation systems
● Eg: pinterest posts
■ Community detection
■ Network similarity
○ Lifecycle:
■ Doesn’t include manual feature engineering
■ Goal: efficient, task-independent feature learning for machine learning on networks
○ Why is it difficult?
■ Modern deep learning relies on sequences or grids
■ Networks are far more complex, i.e., no fixed ordering or reference point
● Often dynamic and multimodal features
★ Node embeddings – mapping nodes to low-dimensional embeddings
○ Goal is to encode nodes so that similarity in the embedding space (e.g., dot
product) approximates similarity in the original network.

○ Steps:
■ Define an encoder
■ Define a node similarity function
■ Optimize the parameters such that
similarity(u,v) ≈ zvT zu

○ Encoder maps each node to a low dim vector


○ Similarity function specifies how relationships in vector space map to those in
original network
★ Shallow encoding – embedding lookup
○ Each node is assigned to a unique embedding vector
○ Nodes have similar embeddings if they
■ Have adjacency based similarity
● Dot products between node embeddings approximate edge
existence
○ Use stochastic gradient descent
○ Solve matrix decomposition solvers
● Drawbacks – O(|V|^2) runtime, O(|V|) parameters, only considers
direct connections
■ Multi-hop similarity
● Train embeddings to predict k-hop neighbors

● Measure overlap between node neighborhoods

○ Jaccard similarity
○ Adamic-adar score

○ S(u,v) is the neighborhood overlap between u and v
■ Random walk approaches
● Probability that u and v co-occur on a random walk over a network
≈ zvT zu
○ Estimate the probability of visiting node v starting from u using some random walk strategy
○ Optimize the embeddings to encode these random walk statistics
● Random walks are expressive – a flexible stochastic definition of node similarity that incorporates local and higher-order neighborhood information
● Efficient – do not need to consider all node pairs while training, only the co-occurring ones


○ Doing this is too expensive
■ O(|V|^2)
○ Solve???
■ Negative sampling
● instead of normalizing w.r.t. all nodes, just
normalize against k random “negative
samples”
● Node2vec biased walks
● Interpolate DFS and BFS
● Biased random walks
■ No method is the best but Random walk approaches are generally more
efficient (i.e., O(|E|) vs. O(|V|^2))
★ Multilayer networks
