Data Mining NOTES
6th SEMESTER
Data Mining (DS3002)
Basic Classifiers
★ Performance metrics:
○ Accuracy = (TP + TN) / (TP + FP + TN + FN)
○ Error rate = (FP + FN) / (TP + FP + TN + FN)
★ Base classifiers
○ Decision Tree based
○ Rule-based (hand-written)
○ KNN
○ Naive Bayes + Bayesian Belief Networks
○ SVM
○ NN (and its children)
★ Ensemble classifiers – combine the predictions of two or more classifiers
○ Boosting
○ Bagging
○ Random Forests
Decision Trees
Advantages vs. Disadvantages
Hunt’s Algorithm
★ Greedy
★ If Dt is the set of training records associated with node t:
○ Case 1: all the records in Dt have the same class yt
■ t is a leaf node labeled yt
○ Case 2: records in Dt belong to more than one class
■ Select an attribute test condition to split the records, and recursively apply the procedure until each resulting node is pure (all of its records belong to one class)
★ At each split, (a,b) represents the number of records that belong to each class based on
the current test condition
★ If a child node is empty, it is labeled with the majority class of its parent node
★ If all the records of a node have identical values besides the class, the node is declared a
leaf node and labeled with the majority class of training records associated with it
★ Stopping criteria:
○ All the records belong to the same class
○ All the records have identical attribute values
○ Early termination to avoid overfitting of the model
○ Information Gain – Gini
○ If using brute force to calculate GINI for continuous variables, the overall
complexity of a GINI task is O(N^2)
■ Approach? Sort the continuous values in ascending order (O(N log N)), then linearly scan the candidate split positions
■ Choose the split position (i) with the least GINI index value
■ Δ = I(parent) − Σ_{j=1..k} [ N(v_j) / N ] · I(v_j)
■ The second term is the weighted impurity of the children: multiply each child's weight N(v_j)/N by its Gini and sum over all child nodes (see the sketch below)
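A minimal sketch (not from the notes) of the sort-then-scan approach for a continuous attribute, assuming a binary class label; the toy values and labels below are made up:

```python
# Hedged sketch: best Gini split on a continuous attribute by sorting
# once and scanning candidate split positions (vs. O(N^2) brute force).
import numpy as np

def gini(labels):
    """Gini index of a set of class labels: 1 - sum_i p_i^2."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_gini_split(values, labels):
    """Return (split_value, weighted_gini) minimizing the weighted Gini."""
    order = np.argsort(values)                        # sort once: O(N log N)
    values = np.asarray(values)[order]
    labels = np.asarray(labels)[order]
    n = len(values)
    best = (None, float("inf"))
    for i in range(1, n):                             # linear scan of positions
        if values[i] == values[i - 1]:
            continue                                  # no boundary between equal values
        split = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if weighted < best[1]:
            best = (split, weighted)
    return best

# Made-up example: a continuous attribute with a binary class
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ['N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N']
print(best_gini_split(values, labels))
```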
★ Entropy
○ Entropy(t) = − Σ_{i=0..c−1} p_i(t) · log2 p_i(t)
○ Info gain / gain split
■ Gain_split = entropy(parent) − Σ_{i=1..k} (N_i / N) · entropy(i)
Gain Ratio
★ Gain ratio = Gain_split / SplitInfo, where SplitInfo = − Σ_{i=1..k} (N_i / N) · log2(N_i / N)
★ The problem with large number of partitions: Node impurity measures tend to prefer
splits that result in large number of partitions, each being small but pure
○ Eg. instead of two split partitions of [9,2] and [3,7] the algorithm prefers [0,1]
[0,1] [0,2] [0,1] … [0,1]
○ Such splits carry little genuinely useful information, so gain ratio divides the gain by SplitInfo to penalize a large number of small partitions (see the sketch below)
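A small hedged sketch of entropy, information gain and gain ratio using the formulas above; the [9,2] / [3,7] split mirrors the notes' example, everything else is made up:

```python
# Hedged sketch: entropy, information gain and gain ratio for one split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, children):
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

def gain_ratio(parent, children):
    n = len(parent)
    split_info = -sum((len(c) / n) * np.log2(len(c) / n) for c in children if len(c))
    return info_gain(parent, children) / split_info

# A [9,2] / [3,7] split of a parent with 12 '+' and 9 '-' records
parent = ['+'] * 12 + ['-'] * 9
children = [['+'] * 9 + ['-'] * 2, ['+'] * 3 + ['-'] * 7]
print(round(info_gain(parent, children), 4), round(gain_ratio(parent, children), 4))
```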
Classification errors
★ Training error, test error, and generalization error (the expected error on a random sample drawn from the same distribution)
★ Underfitting: large training and test errors
★ Overfitting: small training error but large testing error
★ More training data → similar training and testing error values
Model Selection
Pre-Pruning
Post-Pruning
Bayes Classifier
★ Law of total probability: P(X) = Σ_y P(X | Y = y) · P(Y = y), i.e. the joint distribution summed over all values of Y (and the full distribution sums to 1)
★ posterior probability = p(y|x)
★ Learn all possible p(y|x)
★ conditional probability: P(Y|X) = P(X, Y) / P(X) (and likewise P(X|Y) = P(X, Y) / P(Y))
Bayes theorem
★ P(Y|X) = P(X|Y) · P(Y) / P(X)
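A tiny numeric sketch of Bayes' theorem combined with the law of total probability; the spam-filter numbers are made up:

```python
# Hedged sketch: Bayes' theorem with made-up numbers.
p_y = 0.10                 # prior P(Y = spam)
p_x_given_y = 0.80         # likelihood P(word | spam)
p_x_given_not_y = 0.05     # likelihood P(word | not spam)

p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)   # law of total probability
posterior = p_x_given_y * p_y / p_x                      # Bayes' theorem
print(round(posterior, 3))   # 0.64
```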
Conditional Independence
Naive Bayes
KNN
★ Rank the feature vectors according to euclidean distance
○ d(x, x_i)² = (1/n) Σ_j (x_j − x_ij)²  — the normalized Euclidean distance
○ Select the k closest to x
★ Increasing k makes the boundary more smooth
★ As n increases, the optimal value of k tends to decrease
★ Weighted distances – to downweight or filter out unimportant features (see the KNN sketch below)
○ d(x, x′)² = Σ_i w_i (x_i − x_i′)²
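A hedged KNN sketch using the (optionally weighted) Euclidean distance above; the data, weights and k are made up:

```python
# Hedged sketch: k-nearest neighbors with an optional feature-weighted distance.
import numpy as np

def knn_predict(X_train, y_train, x, k=3, w=None):
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    w = np.ones(X_train.shape[1]) if w is None else np.asarray(w, dtype=float)
    d2 = ((X_train - x) ** 2 * w).sum(axis=1)        # weighted squared distances
    nearest = np.argsort(d2)[:k]                     # rank and take the k closest
    labels, counts = np.unique(np.asarray(y_train)[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority vote

X = [[1, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
y = ['A', 'A', 'B', 'B', 'B']
print(knn_predict(X, y, [2, 1], k=3))   # -> 'A'
```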
SVM
★ Decision boundary will hereby be referred to as hyperplane
★ Margin is the distance between the hyperplane and the closest data points on each side
○ Margin = 2 / ||w||
○ If f(x) = 0 the point lies exactly on the hyperplane; a training point falling inside the margin violates the (hard) margin constraint
★ For linearly separable data, use a linear SVM
★ For data that is not linearly separable, use a non-linear (kernel) SVM
★ The learning problem is formulated as a convex optimization problem
○ Easypeasy global minima
○ High computational complexity for building the model
★ Robust to noise
★ Can handle irrelevant and redundant attributes quite well
★ User provides kernel and cost functions
★ Missing values are a conundrum
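A hedged sketch using scikit-learn's SVC (assuming scikit-learn is available): a linear kernel versus an RBF kernel on a made-up XOR-style dataset that is not linearly separable:

```python
# Hedged sketch: linear vs. non-linear (kernel) SVM on XOR-style data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)                    # XOR: not linearly separable

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear accuracy:", linear_svm.score(X, y))  # poor on XOR
print("rbf accuracy:", rbf_svm.score(X, y))        # fits XOR
```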
Evaluation Metrics
★ Basic confusion matrix
○ Rows = actual class, columns = predicted class: TP (actual +, predicted +), FN (actual +, predicted −), FP (actual −, predicted +), TN (actual −, predicted −)
★ Accuracy = (TP + TN) / (TP + FP + TN + FN)
★ Precision (p) = TP / (TP + FP)
★ Recall (r) = TP / (TP + FN)
★ F-measure = 2rp / (r + p)
○ The harmonic mean of recall (r) and precision (p); it is closer to the smaller of the two
★ Specificity / True Negative Rate = TN / (TN + FP)
★ False Positive Rate = FP / (TN + FP)
○ Also equal to 1 − specificity
★ False Negative Rate = FN / (TP + FN)
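A minimal sketch computing all of the metrics above from made-up confusion-matrix counts:

```python
# Hedged sketch: evaluation metrics from a confusion matrix (counts are made up).
TP, FP, TN, FN = 40, 10, 45, 5

accuracy    = (TP + TN) / (TP + FP + TN + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)            # true positive rate
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)            # true negative rate
fpr         = FP / (TN + FP)            # 1 - specificity
fnr         = FN / (TP + FN)

print(accuracy, precision, recall, round(f1, 3), specificity, fpr, fnr)
```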
Ensemble methods
★ Combination of predictors
○ similar/different kinds
○ Weighted combinations–upweight the better predictors
★ Stacked ensembles: predictor of predictors
○ Multi-layer perceptron
○ Weighted predictions with learned weights
○ Training on validation data avoids giving high weight to overfit models
Random Forests
Gradient Boosting
There is gradient descent in the linear regression section below
★ The ensemble version of gradient descent: iteratively add new models to minimize the cost
★ Learn a sequence of predictors
★ Sum of predictions is increasingly accurate–increasingly complex
★ If J is the cost, adjust the predictions ŷ[i] to reduce the error:
○ ŷ[i] ← ŷ[i] + α · f[i]
■ where f[i] is fit to the negative gradient −∇_ŷ J(y, ŷ) (for squared error, the residuals)
★ Start with a simpler model → subsequent models predict the error residuals of the
previous predictions → overall prediction given by a weighted sum of the collection
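A hedged sketch of gradient boosting with squared-error loss, using scikit-learn regression trees as base models on made-up data: each new model fits the residuals (the negative gradient) of the current prediction.

```python
# Hedged sketch: gradient boosting with squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

alpha, n_rounds = 0.1, 100
pred = np.full_like(y, y.mean())         # start with a simple (constant) model
models = []
for _ in range(n_rounds):
    residuals = y - pred                 # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    models.append(tree)
    pred += alpha * tree.predict(X)      # y_hat <- y_hat + alpha * f

print("final training MSE:", round(np.mean((y - pred) ** 2), 4))
```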
Linear Regression
★ MSE: J(θ) = (1/m) Σ_j ( y^(j) − θ · x^(j)ᵀ )²
★ Computationally convenient
★ Measures the variance of the residuals
★ Corresponds to likelihood under Gaussian model of noise
★ In matrix form: J(θ) = (1/m) (yᵀ − θXᵀ) · (yᵀ − θXᵀ)ᵀ
★ Choose a step direction along which J decreases, i.e. move opposite to the gradient:
○ ∇J(θ) = ∂J(θ)/∂θ
○ ∇J(θ) = −(2/m) (yᵀ − θXᵀ) · X
○ X is the feature matrix
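A hedged sketch of batch gradient descent on the MSE cost using the gradient above; the data, true parameters and learning rate are made up:

```python
# Hedged sketch: batch gradient descent for MSE linear regression.
import numpy as np

rng = np.random.default_rng(1)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])   # [1, x] features
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.05 * rng.normal(size=m)

theta, lr = np.zeros(2), 0.1
for _ in range(500):
    grad = -(2 / m) * (y - X @ theta) @ X    # gradient of J(theta)
    theta -= lr * grad                       # step opposite to the gradient
print(theta)                                 # approximately [2, -3]
```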
Newton’s Method
★ Initialize at a point z
★ Compute the tangent at z and find where it crosses the horizontal axis (f = 0):
○ z′ = z − f(z) / f′(z)
★ L1 (absolute error) cost: L1(θ) = Σ_j | y^(j) − θ · x^(j)ᵀ |
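A tiny sketch of Newton's method for root finding, z′ = z − f(z)/f′(z); the function f(z) = z² − 2 is a made-up example (the iteration converges to √2):

```python
# Hedged sketch: Newton's method for root finding.
def newton(f, f_prime, z, steps=10):
    for _ in range(steps):
        z = z - f(z) / f_prime(z)    # move to where the tangent crosses zero
    return z

print(newton(lambda z: z**2 - 2, lambda z: 2 * z, z=1.0))   # ~1.41421
```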
Non-Linear Regression
★ Higher complexity leads to more chances of overfitting
★ Inductive bias is introduced for learning: the assumptions about the data encoded in the learning algorithm.
Leave-one-out cross-validation examples
○ Ex 1: First-order (linear): fit ŷ = ax + b on the remaining points
○ Ex 2: Zero-order (constant):
■ ŷ = mean of the other two y values
○ (see the sketch below)
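A hedged sketch of leave-one-out cross-validation on three made-up (x, y) points, comparing the zero-order (mean) model with the linear model ŷ = ax + b:

```python
# Hedged sketch: leave-one-out cross-validation for two simple models.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 2.0])

def loo_mse(predict):
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                    # hold out point i
        errs.append((y[i] - predict(x[keep], y[keep], x[i])) ** 2)
    return np.mean(errs)

zero_order = lambda xs, ys, x0: ys.mean()                          # constant model
linear = lambda xs, ys, x0: np.polyval(np.polyfit(xs, ys, 1), x0)  # y_hat = a*x + b

print("zero-order LOOCV MSE:", round(loo_mse(zero_order), 4))
print("linear LOOCV MSE:", round(loo_mse(linear), 4))
```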
Association Rule Mining
● Computationally expensive
● Discovered patterns can be spurious (false)
● Itemset
○ A collection of one or more items
○ Example: {Milk, Bread, Diaper}
● Support count (σ)
○ Frequency of occurrence of an itemset
○ E.g. σ({Milk, Bread,Diaper}) = 2
● Support
○ Fraction of transactions that contain an itemset
○ E.g. s({Milk, Bread, Diaper}) = 2/5
● Frequent Itemset
○ An itemset whose support is greater than or equal to a minsup threshold
■ Minsup is given in question
● Support ( S )
○ Fraction of transactions that contain both X and Y
○ Formula for rule X → Y:
■ s(X → Y) = Frequency(X, Y) / N
■ N = total number of transactions
■ Frequency(X, Y) = number of transactions containing both X and Y
○ Has anti-monotone property
■ for every itemset X that is a proper subset of itemset Y , i.e. X ⊂ Y , we
have f(Y ) ≤ f(X)
● The function is support in this case
● e.g. ABC (playing the role of Y) has support less than or equal to AB (playing the role of X)
○ Support is often used to eliminate uninteresting rules.
● Confidence ( C )
○ Measures how often items in Y appear in transactions that contain X
○ Formula for rule X → Y:
■ c(X → Y) = Frequency(X, Y) / Frequency(X)
■ Frequency(X) = number of transactions containing X
■ Frequency(X, Y) = number of transactions containing both X and Y
○ Confidence does not have an anti-monotone property
○ Measures the reliability of the inference made by a rule
■ The higher the confidence, the better
● Lift ( L )
○ A statistics-based measure
○ Formula for rule X → Y:
■ Lift = c(X → Y) / s(Y) = s(X, Y) / ( s(X) · s(Y) ) (worked sketch below)
○ Level-wise candidate itemset generation:
■ First level: 1-itemsets
■ Second level and so on: itemsets grown by one item per level
■ e.g. with 6 items there are 6C2 = 15 candidate 2-itemsets at the second level
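A hedged sketch computing support, confidence and lift for the rule {Milk, Diaper} → {Bread}, over a made-up five-transaction set consistent with the σ({Milk, Bread, Diaper}) = 2 example above:

```python
# Hedged sketch: support, confidence and lift over a small transaction set.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]
N = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / N   # fraction containing itemset

X, Y = {"Milk", "Diaper"}, {"Bread"}
s = support(X | Y)                        # support of the rule
c = support(X | Y) / support(X)           # confidence
lift = c / support(Y)                     # lift; < 1 => negative association

print(round(s, 2), round(c, 2), round(lift, 2))
```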
Apriori Principle
○ If an itemset is frequent, then all of its subsets must also be frequent
■ e.g. DE is a subset of CDE, so if CDE is frequent then DE is frequent
● Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too
■ e.g. ABC is a superset of AB, so if AB is infrequent then ABC is infrequent
Apriori Algorithm
● Characteristics
○ It is a level wise algorithm
■ from frequent 1-itemsets to the maximum size of frequent itemsets.
○ Employs a generate-and-test strategy for finding frequent itemsets.
■ Compare the itemset support to minsup and prune if it is lower
● Requirements
○ Only make candidates from frequent itemsets
○ No frequent itemset should be left out by the candidate generation process
○ It should not generate the same candidate itemset more than once.
● Candidate Generation
○ Merge two frequent (k-1)-itemsets if their first (k-2) items are identical
■ Merge(ABC, ABD) = ABCD
■ Merge(ABC, ABE) = ABCE
○ We can also do this alternatively
■ Merge(ABC, BCD) = ABCD
■ Merge(ABD, BDE) = ABDE
● Candidate Pruning
○ For each candidate k-itemset, check all of its (k−1)-subsets; if every subset is in F_{k−1}, the candidate is kept as a possibly frequent itemset
○ If any subset is missing, prune the candidate: e.g. ADE is not in F3, so all supersets of ADE are pruned
○ This is a direct application of the Apriori principle (see the sketch below)
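A hedged sketch of the F_{k−1} × F_{k−1} candidate generation and pruning described above; the frequent 3-itemsets F3 below are made up (chosen so that ADE is missing, matching the pruning example):

```python
# Hedged sketch: Apriori candidate generation (merge on the first k-2 items)
# followed by candidate pruning (every (k-1)-subset must be frequent).
from itertools import combinations

def generate_candidates(frequent_prev):
    """frequent_prev: list of sorted tuples of size k-1."""
    frequent_set = set(frequent_prev)
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            if a < b and a[:-1] == b[:-1]:            # first k-2 items identical
                cand = a + (b[-1],)
                # prune: every (k-1)-subset must itself be frequent
                if all(sub in frequent_set
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return sorted(candidates)

F3 = [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'),
      ('B', 'C', 'D'), ('A', 'C', 'E')]
print(generate_candidates(F3))   # [('A', 'B', 'C', 'D')]; ACDE pruned (ADE not in F3)
```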
● Support counting with a hash tree – requirements:
○ Hash function
○ Max leaf size
■ If the number of itemsets stored in a leaf node exceeds the max leaf size, split the node
Closed Itemset
● An itemset X is closed if none of its immediate supersets has the same support as the
itemset X.
○ Closed itemsets preserve the support information of all frequent itemsets
● An itemset is a closed frequent itemset if it is closed and its support is greater than or
equal to minsup.
○ A closed frequent itemset is simply a closed itemset that is also frequent
● Useful in removing redundant rules
● Honestly there is no difference in how the question is actually solved, so in the exam "closed" and "closed frequent" amount to the same thing
FP-Growth Algorithm
● Construct Conditional Pattern Base
○ Traverse the FP-tree by following the link of each frequent item
○ Accumulate all of transformed prefix paths
● Construct Frequent Patterns
● Advantages of FP-Growth
○ only 2 passes over data-set
○ “compresses” data-set
○ no candidate generation
○ much faster than Apriori
● Disadvantages of FP-Growth
○ FP-Tree may not fit in memory!!
○ FP-Tree is expensive to build
Partitioning Algorithm
ECLAT Algorithm
● Advantages:
○ Efficient for dense datasets with many frequent itemsets.
○ ECLAT avoids generating a large number of candidate itemsets like Apriori.
○ Depth-First Search: The recursive approach allows for effective traversal of the
search space.
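A hedged sketch of ECLAT's vertical (tid-list) representation: the support count of an itemset is the size of the intersection of its items' tid-lists; the tid-lists below are made up:

```python
# Hedged sketch: support counting via tid-list intersection (ECLAT style).
tidlists = {
    'A': {1, 4, 5, 7},
    'B': {1, 2, 3, 4, 8},
    'C': {2, 4, 5, 8},
}

def support_count(itemset):
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids)

print(support_count({'A', 'B'}))       # |{1,4,5,7} ∩ {1,2,3,4,8}| = 2
print(support_count({'A', 'B', 'C'}))  # only tid 4 -> 1
```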
● Lift
○ Lift(X → Y) = P(Y|X) / P(Y)
○ P(Y|X) is calculated as
■ P(X, Y) / P(X)
● 15 / 20 = 0.75 (i.e. s(X, Y) / s(X) = 0.15 / 0.20 from the contingency table)
■ P(Y|X) is also called confidence
○ Lift = 0.75 / 0.9 = 0.833
■ < 1, therefore X and Y are negatively associated
● Interest
○ Interest(X, Y) = s(X, Y) / ( s(X) · s(Y) )
○ 0.15 / (0.20 × 0.90) = 0.833
■ Using the values from the contingency table above
● PS (Piatetsky-Shapiro)
○ PS = s(X, Y) − s(X) · s(Y)
○ 0.15 − 0.20 × 0.90 = −0.03
● φ-Coefficient
○ φ = ( s(X, Y) − s(X) · s(Y) ) / sqrt( s(X)(1 − s(X)) · s(Y)(1 − s(Y)) )
○ Numerator = 0.15 − 0.20 × 0.90 = −0.03
○ Denominator = sqrt( 0.20 × 0.80 × 0.90 × 0.10 ) = sqrt(0.0144) = 0.12
○ φ = −0.03 / 0.12 = −0.25
Simpson’s Paradox
● Example (HDTV vs. exercise machines): the combined data suggests HDTV buyers are more likely to buy exercise machines, but within each customer group, customers who do not buy high-definition televisions are more likely to buy exercise machines
○ The reversal in the direction of association is known as Simpson's paradox
● How to choose min support ?
○ If minsup is too high, we could miss itemsets involving interesting rare items
■ (e.g., {caviar, vodka})
○ If minsup is too low, it is computationally expensive and the number of itemsets
is very large
● Use the cross-support ratio
○ Given an itemset X with items {x1, x2, …, xd}:
○ r(X) = min{ s(x1), …, s(xd) } / max{ s(x1), …, s(xd) }; itemsets with a very low ratio mix very frequent and very rare items and are usually uninteresting
Contingency Table
● Disjoint events
○ Events that cannot occur together: P(A ∩ B) = 0
○ Additive rule for disjoint events:
■ P(A ∪ B) = P(A) + P(B)
● Additive rule (general)
○ P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
● Independence
○ When two events are independent, no information is gained from the knowledge
of the other event.
○ P(A ∩ B) = P(A) · P(B)
○ Example rolling dice
■ Chances of getting 5 the first time?
● 1/6
■ Chances of getting 5 the second time
● 1/6
○ Not independent
■ P(A ∩ B) = P(A | B) · P(B) (the multiplication rule must use the conditional probability)
● Conditional probability
○ P(A | B) = P(A ∩ B) / P(B)
Clustering
● Unsupervised learning
● Types of clustering?
○ Partitional
■ Non overlapping
○ Hierarchical
■ Set of nested clusters organized as a hierarchical tree
● Distinctions between clusterings
○ Exclusive
■ Each point belongs to exactly one cluster
○ Non-Exclusive
■ Points may belong to multiple clusters
○ Partial clustering
■ Cluster only some of the data
K Means clustering
● Cost function
○ J = Σ_k Σ_{x ∈ cluster k} || x − μ_k ||²
○ The distance of each point (x) from its assigned centroid (μ) must be minimized
● Initialization methods
○ Random
■ Issue: may choose nearby points
○ Distance based
■ Start with one random data point
■ Find the point farthest from the clusters chosen so far
■ Issue: may choose outliers
○ Random + Distance
■ Choose next points “far but randomly”
● Choosing Number of Clusters (K)
○ Cost always decreases as k grows!
■ Add a penalty: Total = Error + Complexity
■ Pick the k where the cost stops decreasing rapidly (the "elbow"; see the sketch below)
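A hedged sketch (assuming scikit-learn) showing that the k-means cost (inertia) always decreases as k grows, which is why an elbow / penalty criterion is used to choose k; the blob data is made up:

```python
# Hedged sketch: k-means inertia vs. k (look for the "elbow").
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia keeps dropping; elbow around k = 3
```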
Hierarchical (Agglomerative) Clustering
● Complexity
○ O(m² log m)
● Cluster distance scoring methods
○ Single linkage
■ min
○ Complete linkage
■ max
○ Average linkage
DBSCAN
● Core points
○ Have at least the specified number of points (MinPts) within a radius of epsilon (the ε-neighborhood)
○ The point itself counts toward MinPts
● Border points
○ In the ε-neighborhood of a core point, but not core points themselves
● Noise points
○ Not a border or core point
● Advantages
○ Can handle clusters of different shapes and sizes
○ Resistant to noise
● When it does not work
○ Varying densities
○ High-dimensional data
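A hedged DBSCAN sketch (assuming scikit-learn); eps, MinPts and the points are made up, and the isolated point comes out as noise (label −1):

```python
# Hedged sketch: DBSCAN labels core/border clusters and flags noise as -1.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1.0], [0.9, 1.1],     # dense blob -> one cluster
              [5, 5], [5.1, 4.9], [4.9, 5.1],     # second dense blob
              [9, 0]])                            # isolated point -> noise
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)    # e.g. [0 0 0 1 1 1 -1]
```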
Evaluation metrics
● Cluster cohesion
○ How closely related are points within the clusters
■ SSE: Sum of squared errors
● Cluster separation
○ How distinct or well-separated a cluster is from the other clusters
○ SSB: between-cluster sum of squares (squared distance of each cluster mean from the overall mean, weighted by cluster size; see the sketch below)
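A minimal sketch computing cohesion (SSE) and separation (SSB) for a made-up 1-D clustering; note that SSE + SSB equals the total sum of squares:

```python
# Hedged sketch: SSE (cohesion) and SSB (separation) for two clusters.
import numpy as np

clusters = [np.array([1.0, 2.0, 3.0]), np.array([8.0, 9.0, 10.0])]
all_points = np.concatenate(clusters)
overall_mean = all_points.mean()

sse = sum(((c - c.mean()) ** 2).sum() for c in clusters)               # cohesion
ssb = sum(len(c) * (c.mean() - overall_mean) ** 2 for c in clusters)   # separation
tss = ((all_points - overall_mean) ** 2).sum()

print(sse, ssb, tss)    # SSE + SSB == total sum of squares
```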
⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂ ⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂⠁⠁⠂⠄⠄⠂ ⠂⠄⠄⠂☆
Optimizing Models
REGULARIZATION
MODEL SELECTION
Supervised Learning
★ Separability – a data set is separable by a learner if there is some instance of that learner
that correctly predicts all data points
○ Linearly separable
○ Non-linearly separable
★ Batch size – a hyperparameter; number of samples used in one forward and backward
pass through the network
★ Gradient Descents
○ Batch gradient descent – all the training examples are used to take a single step
■ Can be really slow
■ Very smooth
○ Stochastic gradient Descent – Only one training example is taken into
consideration to take a single step
■ This algorithm is faster
■ Not smooth at all
■ Used when dataset is very large
○ Mini-batch Gradient Descent – best of both worlds (a small batch of training examples is used to take a single step)
★ Vanishing gradient problem – with sigmoid and tanh activation functions, large weights push the units into saturation, so the gradients become very (vanishingly) small as they are back-propagated
○ Using ReLU/Leaky ReLU solves this problem
★ Exploding gradient problem – opposite
★ Weight initialization – the technique you choose for your neural network can determine
how quickly the network converges or whether it converges at all.
○ Zero initialization - sometimes, using constant values like 0’s and 1’s
■ If we set all the weights to zero, then all the neurons of all the layers perform the same calculation, giving the same output and thereby making the whole deep net useless
● Fails to break symmetry
○ sometimes with values sampled from some distribution (typically a uniform
distribution or normal distribution)
○ Random initialization
■ symmetry-breaking and gives much better accuracy
■ the weights are initialized very close to zero, but randomly
■ can potentially lead to vanishing gradients or exploding gradients.
○ Xavier Initialization - sophisticated scheme
■ If the weights in a network start too small, then the signal shrinks as it
passes through each layer until it’s too tiny to be useful.
■ If the weights in a network start too large, then the signal grows as it
passes through each layer until it’s too massive to be useful.
○ He init initialization – weights are initialized keeping in mind the size of the
previous layer
■ Weights are random but in the rand function the ranges are different
★ Symmetry breaking – all the neurons shouldn’t learn the exact same thing
○ Weights should be small (not too small, medium small) – large weights cause
exploding gradients, especially while using sigmoid activation function.
○ Weights should not be the same – Same weights will prevent neural networks
from learning new features.
○ Weights should have good variance – This will help each neuron to learn new
features.
★ Understanding images
○ Pixel – smallest unit of an image
○ Greyscale – white to black
○ RGB – combination of red, green and blue
○ An image can have up to 56 channels
○ Images are usually represented as Width x Height x No. of Channels (250, 100, 3)
★ Computer vision – reproducing the capability of human vision
○ Biomedical, facial recognition, self-driving cars, landmark detection, object
detection, etc
★ Spatial/Translation Invariance
○ Object detection regardless of the position of the object in the image
○ Solves the MLP drawback
★ Regular NN (MLP): transforms the input by passing it through a series of hidden layers; every layer is made up of neurons fully connected to the layer before; the output layer is also fully connected
★ CNN: layers are organized in height, width and depth; the neurons in one layer do not connect to all the neurons in the next layer; layer sizes can vary
Neural Networks
★ Perceptrons
○ Features [1,x]
■ Decision rule: T(ax + b), i.e. classify by whether ax + b > 0 or < 0
■ Boundary: ax + b = 0 ⇒ a single point
○ Features [1, x, x²]
■ Decision rule: T(ax² + bx + c)
■ Boundary: ax² + bx + c = 0 ⇒ up to two points
★ Combining step functions
★ MLP Model
Class Imbalance
★ Skewed datasets (more records for one class than the other(s))
○ Accuracy fails to be a well-suited evaluation criterion
■ True Positive Rate (TPR) = TP / (TP + FN)
● Also called recall
■ False Positive Rate (FPR) = FP / (FP + TN)
■ Trade-off between detection (TPR) and false alarms (FPR) – e.g. patients with suspected covid vs. patients who just have the regular flu
■ To draw ROC curve, classifier must produce continuous-valued output
● Tests likeliness of positive class
■ Many classifiers produce only discrete outputs
■ Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-NN and SVM can all be adapted to produce such continuous scores
★ Class imbalance can be treated in code with the following resampling techniques (sketch after this list):
○ Undersampling
○ Oversampling
○ SMOTE
○ SMOTEN
○ BorderlineSMOTE
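A hedged sketch comparing these resampling techniques, assuming the imbalanced-learn (imblearn) package is installed; the skewed dataset is made up:

```python
# Hedged sketch: undersampling, oversampling and SMOTE with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

for name, sampler in [("undersample", RandomUnderSampler(random_state=0)),
                      ("oversample", RandomOverSampler(random_state=0)),
                      ("SMOTE", SMOTE(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```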
★ Feed-forward neural networks do not account for sequences and context
OTHER NLPs
GRAPH THEORY
CONNECTIVITY DEFINITIONS
★ Simple graph – a graph with no self-loops/multi-edges
★ Directed graphs – (u, v) ∈ E is distinct from (v, u) ∈ E
○ Directed E are called ‘arcs’
○ If both {(u, v),(v, u)} ⊆ E, the arcs are said to be mutual
★ Subgraph / induced subgraph – given G(V, E), G′(V′, E′) is a subgraph if V′ ⊆ V and E′ ⊆ E; it is an induced subgraph if E′ consists of exactly those edges of E with both endpoints in V′
★ Weighted graph – edges with numerical values/magnitude/measurements
★ Adjacency – representation
○ Vertices u, v ∈ V are said adjacent if joined by an edge in E
○ Edges e1, e2 ∈ E are adjacent if they share an endpoint in V
★ Degree – the number of edges incident to a vertex
○ Degree sum = 2 · |E|
○ For directed graphs we have degree in and degree out
■ Out-degree = number of outgoing arcs; in-degree = number of incoming arcs (see the sketch below)
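A minimal sketch computing vertex degrees from the adjacency matrix of a small made-up graph, checking that the degree sum equals 2|E|:

```python
# Hedged sketch: degrees from an adjacency matrix (undirected graph).
# For a directed graph, row sums give out-degree and column sums in-degree.
import numpy as np

A = np.array([[0, 1, 1, 0],      # symmetric adjacency matrix, 4 edges total
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])
degree = A.sum(axis=1)
num_edges = A.sum() // 2
print(degree, degree.sum(), "== 2 * |E| =", 2 * num_edges)
```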
★ Strongly connected – for every pair (u, v), u is reachable from v and v is reachable from u
MOVEMENT IN A GRAPH
★ Giant connected components contain a large fraction of the graph's vertices and edges
★ A complete graph has all possible edges for its n vertices
○ n(n − 1)/2 edges (one per pair of vertices)
★ Bipartite graph – when V can be partitioned into two disjoint sets and each edge in E has
one endpoint in partition one, and one in partition 2
★ Planar graphs – no overlapping/crossing edges
GRAPH ALGEBRA
★ Other notations
○ Walk
○ Corollary
○ Spectrum
★ Incidence matrix, B, is not a square matrix
○ unless Nv = Ne
○ Edges are the columns and vertices are the rows
○ For undirected graphs: B[v, e] = 1 if v is an endpoint of e, 0 otherwise
○ For digraphs: the two endpoints of arc e get entries of opposite sign (+1 / −1), 0 elsewhere
○ e.g. 4 vertices and 5 edges give a 4 × 5 matrix
★ Graph Laplacian: L = D − A (degree matrix minus adjacency matrix)
○ It is positive semi-definite
○ Rank deficient
○ Spectrum and connectivity
■ The smallest eigenvalue of L is zero
● If the second smallest eigenvalue is not zero, then G is connected
● If L has n zero eigenvalues, G has n connected components
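A hedged sketch of the Laplacian spectrum for a made-up graph with two connected components; the number of (near-)zero eigenvalues equals the number of components:

```python
# Hedged sketch: Laplacian L = D - A and its spectrum for a 2-component graph.
import numpy as np

A = np.array([[0, 1, 0, 0],      # component 1: nodes 0-1
              [1, 0, 0, 0],
              [0, 0, 0, 1],      # component 2: nodes 2-3
              [0, 0, 1, 0]])
L = np.diag(A.sum(axis=1)) - A
eigenvalues = np.sort(np.linalg.eigvalsh(L))
print(np.round(eigenvalues, 6))   # two (near-)zero eigenvalues -> 2 components
```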
GRAPH IN DATA STRUCTURES & ALGORITHMS
Degree Distribution
★ Network ML tasks
○ Types:
■ Node classification
● Predict the type of a given node
● Eg: classifying the function of proteins
■ Link prediction
● Recommendation systems
● Eg: pinterest posts
■ Community detection
■ Network similarity
○ Lifecycle:
■ Representation learning removes the need for manual feature engineering
■ Goal: efficient, task-independent features for machine learning in networks
○ Why is it difficult?
■ Modern deep learning relies on sequences or grids
■ Networks are far more complex, i.e., no fixed ordering or reference point
● Often dynamic and multimodal features
★ Node embeddings – mapping nodes to low-dimensional embeddings
○ Goal is to encode nodes so that similarity in the embedding space (e.g., dot
product) approximates similarity in the original network.
○ Steps:
■ Define an encoder
■ Define a node similarity function
■ Optimize the parameters such that
similarity(u,v) ≈ zvT zu
○ Simple similarity choices: Jaccard similarity, Adamic-Adar score
○ Optimize so that z_vᵀ z_u ≈ S(u, v), where S(u, v) is the neighborhood overlap between u and v
■ Random walk approaches
● Probability that u and v co-occur on a random walk over a network
≈ zvT zu
○ Estimate the probability of visiting node v starting from u using some random walk strategy
○ Optimize the embeddings to encode these random walk
statistics
● Random walks are expressive – flexible stochastic definition of
node similarity that incorporated local and higher order
neighborhood information
● Efficient – do not need to consider all node pairs while training, only the pairs that co-occur on walks
● The co-occurrence probability is defined with a softmax over all nodes, which is too expensive to normalize
○ O(|V|²)
○ Solution:
■ Negative sampling
● Instead of normalizing w.r.t. all nodes, just normalize against k random "negative samples"
● Node2vec biased walks
● Interpolate DFS and BFS
● Biased random walks
■ No method is the best but Random walk approaches are generally more
efficient (i.e., O(|E|) vs. O(|V|^2))
★ Multilayer networks