Data Mining Cheat Sheet

1. Data mining involves cleaning, integrating, selecting, transforming, mining, evaluating, and presenting data. 2. Data attributes come in several types: nominal, ordinal, interval, and ratio. Distance measures are used to calculate similarity between data points. 3. Popular data mining algorithms include decision trees, naive Bayes classification, rule-based classification, association rule mining with Apriori, and clustering techniques such as k-means and hierarchical clustering.


Data Mining Steps

1. Data Cleaning: removal of noise and inconsistent records
2. Data Integration: combining multiple sources
3. Data Selection: only data relevant for the task are retrieved from the database
4. Data Transformation: converting data into a form more appropriate for mining
5. Data Mining: application of intelligent methods to extract data patterns
6. Model Evaluation: identification of truly interesting patterns representing knowledge
7. Knowledge Presentation: visualization or other knowledge presentation techniques

Data mining could also be called Knowledge Discovery in Databases (see kdnuggets.com)

Types of Attributes

Nominal: e.g., ID numbers, eye color, zip codes
Ordinal: e.g., rankings, grades, height in {tall, medium, short}
Interval: e.g., calendar dates, temperatures
Ratio: e.g., length, time, counts

Distance Measures

Manhattan = City Block

The Jaccard coefficient, Hamming distance, and Cosine similarity are similarity / dissimilarity measures
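A minimal Python sketch of these measures, assuming equal-length numeric vectors (and 0/1 vectors for Jaccard and Hamming); Euclidean is included as the usual baseline:

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):  # a.k.a. City Block distance
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

def hamming(x, y):  # number of positions where the vectors differ
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):  # binary vectors: matching 1s / positions with any 1
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    any_ = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / any_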

Measures of Node Impurity
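The standard node impurity measures are Gini, entropy, and classification error; a minimal Python sketch, assuming p is the list of class proportions at a node:

import math

def gini(p):                  # Gini(t) = 1 - sum_i p_i^2
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):               # Entropy(t) = -sum_i p_i log2 p_i
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def classification_error(p):  # Error(t) = 1 - max_i p_i
    return 1 - max(p)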

Model Evaluation

Kappa = (observed agreement - chance agreement) / (1 - chance agreement)

Kappa = (Dreal - Drandom) / (Dperfect - Drandom), where D indicates the sum of the values in the diagonal of the confusion matrix
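A small sketch computing Kappa in the second form, assuming cm is a square confusion matrix (rows = actual, columns = predicted) given as a list of lists:

def kappa(cm):
    n = sum(sum(row) for row in cm)                      # total number of records
    d_real = sum(cm[i][i] for i in range(len(cm)))       # observed diagonal sum
    d_perfect = n                                        # all records on the diagonal
    d_random = sum(sum(cm[i]) * sum(row[i] for row in cm) / n
                   for i in range(len(cm)))              # diagonal expected by chance
    return (d_real - d_random) / (d_perfect - d_random)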

K-Nearest Neighbor

* Compute the distance between the test point and the training records
* Determine the class from the nearest neighbor list
    * Take the majority vote of class labels among the k nearest neighbors
    * Weigh the vote according to distance
        * Weight factor: w = 1 / d^2
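A compact Python sketch of the procedure, assuming training records are (point, label) pairs, Euclidean distance, and the w = 1/d^2 weighting above:

import math
from collections import defaultdict

def knn_predict(train, x, k=3):
    # train: list of (point, label) pairs; x: query point; points are numeric tuples
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = dist(point, x)
        votes[label] += 1.0 / d ** 2 if d > 0 else float("inf")  # w = 1 / d^2
    return max(votes, key=votes.get)  # label with the largest weighted vote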
Bayesian Classification
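For reference, Bayesian classification rests on Bayes' theorem: a record X is assigned the class C that maximizes the posterior probability

P(C|X) = P(X|C) P(C) / P(X)

A naive Bayes classifier additionally assumes the attributes are conditionally independent given the class, so P(X|C) = P(x1|C) * P(x2|C) * ... * P(xn|C).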

Rule-based Classification

Classify records by using a collection of "if...then..." rules

Rule: (Condition) --> y
where:
* Condition is a conjunction of attribute tests
* y is the class label
LHS: rule antecedent or condition
RHS: rule consequent

Examples of classification rules:
(Blood Type=Warm) ^ (Lay Eggs=Yes) --> Birds
(Taxable Income < 50K) ^ (Refund=Yes) --> Evade=No

Sequential covering is an algorithm for learning a rule-based classifier.
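A minimal sketch of rule-based classification in Python: rules are tried in order and a default class is returned when none fires (the record fields and default are illustrative, not from the original):

def classify(record, rules, default="No"):
    # rules: list of (condition, label); a condition is a predicate over the record
    for condition, label in rules:
        if condition(record):  # first matching rule wins
            return label
    return default             # no rule fires

rules = [
    (lambda r: r["blood_type"] == "warm" and r["lays_eggs"], "Birds"),
    (lambda r: r["taxable_income"] < 50_000 and r["refund"], "Evade=No"),
]
rec = {"blood_type": "warm", "lays_eggs": True, "taxable_income": 60_000, "refund": False}
print(classify(rec, rules))  # Birds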

Rule Evaluation

p(a,b) is the probability that both a and b happen.

p(a|b) is the probability that a happens, knowing that b has already happened.
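In these terms, the standard metrics for a rule A --> B can be written as: support = p(A,B); confidence = p(B|A) = p(A,B) / p(A); and lift = p(B|A) / p(B), where lift > 1 indicates that A and B occur together more often than independence would predict.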

Terms

Association Analysis: Min-Apriori, LIFT, Simpson's Paradox, Anti-monotone property

Ensemble Methods: Stacking, Random Forest

Decision Trees: C4.5, Pessimistic estimate, Occam's Razor, Hunt's Algorithm

Model Evaluation: Cross-validation, Bootstrap, Leave-one-out (C-V), Misclassification error rate, Repeated holdout, Stratification

Bayes: Probabilistic classifier

Data Visualization: Chernoff faces, Data cube, Percentile plots, Parallel coordinates

Nonlinear Dimensionality Reduction: Principal components, ISOMAP, Multidimensional scaling

Rules Analysis (see Rule Evaluation above)

Ensemble Techniques

Manipulate training data: bagging and boosting (an ensemble of "experts", each specializing on different portions of the instance space)

Manipulate output values: error-correcting output coding (an ensemble of "experts", each predicting 1 bit of the multibit full class label)

Methods: Bagging, Boosting, AdaBoost

Apriori Algorithm

Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
    Generate length (k+1) candidate itemsets from length k frequent itemsets
    Prune candidate itemsets containing subsets of length k that are infrequent
    Count the support of each candidate by scanning the DB
    Eliminate candidates that are infrequent, leaving only those that are frequent
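A compact Python sketch of the same loop, assuming transactions are sets of items and support is an absolute count:

from itertools import combinations

def apriori(transactions, minsup):
    support = lambda itemset: sum(1 for t in transactions if itemset <= t)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s for s in items if support(s) >= minsup}   # frequent 1-itemsets
    result, k = set(frequent), 1
    while frequent:
        # generate length k+1 candidates by joining frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune candidates that contain an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # count support by scanning the DB; keep only the frequent candidates
        frequent = {c for c in candidates if support(c) >= minsup}
        result |= frequent
        k += 1
    return result

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], minsup=2))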



K-means Clustering

Select K points as the initial centroids
repeat
    Form K clusters by assigning all points to the closest centroid
    Recompute the centroid of each cluster
until the centroids don't change

Closeness is measured by distance (e.g., Euclidean), similarity (e.g., Cosine), or correlation.

Centroid is typically the mean of the points in the cluster.
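A minimal Python sketch of the loop above, assuming Euclidean closeness and points given as numeric tuples:

import random

def kmeans(points, k, iters=100):
    # points: list of numeric tuples; returns the final centroids
    centroids = random.sample(points, k)
    for _ in range(iters):
        # assignment step: attach each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: recompute each centroid as the mean of its cluster's points
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:   # stop when the centroids don't change
            break
        centroids = new
    return centroids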

Hierarchical Clustering

Dendrogram example dataset: {7, 10, 20, 28, 35}

Single-Link or MIN: similarity of two clusters is based on the two most similar (closest / minimum) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.

Complete-Link or MAX: similarity of two clusters is based on the two least similar (most distant / maximum) points in the different clusters. Determined by all pairs of points in the two clusters.

Group Average: proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Agglomerative clustering starts with points as individual clusters and merges the closest clusters until only one cluster is left.

Divisive clustering starts with one, all-inclusive cluster and splits a cluster until each cluster has only one point.
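A short SciPy sketch reproducing the dendrogram for the example dataset above with single linkage (assumes scipy, and matplotlib for the plot):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

data = np.array([[7], [10], [20], [28], [35]])  # the example dataset above
Z = linkage(data, method="single")              # single-link / MIN merges
# each row of Z: [cluster_i, cluster_j, merge_distance, new_cluster_size]
print(Z)
dendrogram(Z)                                   # draws the dendrogram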
Density-Based Clustering

current_cluster_label <-- 1
for all core points do
    if the core point has no cluster label then
        current_cluster_label <-- current_cluster_label + 1
        Label the current core point with the cluster label
    end if
    for all points in the Eps-neighborhood, except the point itself do
        if the point does not have a cluster label then
            Label the point with the cluster label
        end if
    end for
end for

DBSCAN is a popular algorithm.

Density = number of points within a specified radius (Eps)

A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points at the interior of a cluster.

A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.

A noise point is any point that is not a core point or a border point.
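A compact, unoptimized Python sketch of the same idea, treating MinPts as the required number of Eps-neighbors (the point itself excluded):

def dbscan(points, eps, min_pts):
    # points: list of numeric tuples; returns one label per point
    # label conventions: None = unvisited, -1 = noise, 0, 1, 2, ... = cluster ids
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    def neighbors(i):
        return [j for j in range(len(points))
                if j != i and dist(points[i], points[j]) <= eps]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:         # not a core point; mark as noise for now
            labels[i] = -1
            continue
        cluster += 1                    # start a new cluster from this core point
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                    # grow the cluster through density-reachable points
            j = queue.pop()
            if labels[j] == -1:         # border point that was provisionally noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(j_nbrs)
    return labels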
Other Clustering Methods

Fuzzy is a partitional clustering method. Fuzzy clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to more than one cluster.

Graph-based methods: Jarvis-Patrick, Shared-Near Neighbor (SNN, Density), Chameleon

Model-based methods: Expectation-Maximization

Regression Analysis

* Linear Regression
    | Least squares
* Subset selection
* Stepwise selection
* Regularized regression
    | Ridge
    | Lasso
    | Elastic Net

Anomaly Detection

An anomaly is a pattern in the data that does not conform to the expected behavior (e.g., outliers, exceptions, peculiarities, surprises).

Types of Anomaly
* Point: an individual data instance is anomalous w.r.t. the rest of the data
* Contextual: an individual data instance is anomalous within a context
* Collective: a collection of related data instances is anomalous

Approaches
* Graphical (e.g., boxplots, scatter plots)
* Statistical (e.g., normal distribution, likelihood)
    | Parametric techniques
    | Non-parametric techniques
* Distance (e.g., nearest-neighbor, density, clustering)

Local outlier factor (LOF) is a density-based distance approach.

Mahalanobis Distance is a clustering-based distance approach.
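A small NumPy sketch of Mahalanobis-based outlier scoring, assuming the rows of X are data points; points with a large D_M lie far from the data's center relative to its covariance:

import numpy as np

def mahalanobis_scores(X):
    # D_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))
    mu = X.mean(axis=0)
    sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, sigma_inv, diff))

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [2.0, 2.5], [10.0, -5.0]])
print(mahalanobis_scores(X).argmax())  # index of the most anomalous row (here, 4)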