Data Mining Cheat Sheet

1. Data mining involves cleaning, integrating, selecting, transforming, mining, evaluating, and presenting data. 2. Data attributes come in several types: nominal, ordinal, interval, and ratio. Distance measures are used to calculate similarity between data points. 3. Popular data mining algorithms include decision trees, naive Bayes classification, rule-based classification, association rule mining with Apriori, and clustering techniques such as k-means and hierarchical clustering.


Data Mining Steps

1. Data Cleaning: removal of noise and inconsistent records
2. Data Integration: combining multiple sources
3. Data Selection: only data relevant for the task are retrieved from the database
4. Data Transformation: converting data into a form more appropriate for mining
5. Data Mining: application of intelligent methods to extract data patterns
6. Model Evaluation: identification of truly interesting patterns representing knowledge
7. Knowledge Presentation: visualization or other knowledge presentation techniques

Data mining could also be called Knowledge Discovery in Databases (see kdnuggets.com)

Types of Attributes

Nominal: e.g., ID numbers, eye color, zip codes
Ordinal: e.g., rankings, grades, height in {tall, medium, short}
Interval: e.g., calendar dates, temperatures
Ratio: e.g., length, time, counts

Distance Measures

Manhattan = City Block

The Jaccard coefficient, Hamming distance, and Cosine similarity are similarity / dissimilarity measures
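A minimal Python sketch of these measures, assuming equal-length numeric vectors (and 0/1 vectors for Jaccard and Hamming); Euclidean is included as the usual baseline:

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):  # a.k.a. City Block distance
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

def hamming(x, y):  # number of positions where the vectors differ
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):  # binary vectors: matching 1s / positions with any 1
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    any_ = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / any_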

Measures of Node Impurity
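The standard node impurity measures are Gini, entropy, and classification error; a minimal Python sketch, assuming p is the list of class proportions at a node:

import math

def gini(p):                  # Gini(t) = 1 - sum_i p_i^2
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):               # Entropy(t) = -sum_i p_i log2 p_i
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def classification_error(p):  # Error(t) = 1 - max_i p_i
    return 1 - max(p)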

Model Evaluation

Kappa = (observed agreement - chance agreement) / (1 - chance agreement)

Kappa = (Dreal - Drandom) / (Dperfect - Drandom), where D indicates the sum of the values in the diagonal of the confusion matrix
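A small sketch computing Kappa in the second form, assuming cm is a square confusion matrix (rows = actual, columns = predicted) given as a list of lists:

def kappa(cm):
    n = sum(sum(row) for row in cm)                      # total number of records
    d_real = sum(cm[i][i] for i in range(len(cm)))       # observed diagonal sum
    d_perfect = n                                        # all records on the diagonal
    d_random = sum(sum(cm[i]) * sum(row[i] for row in cm) / n
                   for i in range(len(cm)))              # diagonal expected by chance
    return (d_real - d_random) / (d_perfect - d_random)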

K-Nearest Neighbor

* Compute the distance between the test point and the training records
* Determine the class from the nearest neighbor list
    * Take the majority vote of class labels among the k nearest neighbors
    * Weigh the vote according to distance
        * Weight factor: w = 1 / d^2
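A compact Python sketch of the procedure, assuming training records are (point, label) pairs, Euclidean distance, and the w = 1/d^2 weighting above:

import math
from collections import defaultdict

def knn_predict(train, x, k=3):
    # train: list of (point, label) pairs; x: query point; points are numeric tuples
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = dist(point, x)
        votes[label] += 1.0 / d ** 2 if d > 0 else float("inf")  # w = 1 / d^2
    return max(votes, key=votes.get)  # label with the largest weighted vote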
Bayesian Classification
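For reference, Bayesian classification rests on Bayes' theorem: a record X is assigned the class C that maximizes the posterior probability

P(C|X) = P(X|C) P(C) / P(X)

A naive Bayes classifier additionally assumes the attributes are conditionally independent given the class, so P(X|C) = P(x1|C) * P(x2|C) * ... * P(xn|C).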

Rule-based Classification

Classify records by using a collection of "if...then..." rules

Rule: (Condition) --> y
where:
* Condition is a conjunction of attribute tests
* y is the class label
LHS: rule antecedent or condition
RHS: rule consequent

Examples of classification rules:
(Blood Type=Warm) ^ (Lay Eggs=Yes) --> Birds
(Taxable Income < 50K) ^ (Refund=Yes) --> Evade=No

Sequential covering is an algorithm for learning a rule-based classifier.
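A minimal sketch of rule-based classification in Python: rules are tried in order and a default class is returned when none fires (the record fields and default are illustrative, not from the original):

def classify(record, rules, default="No"):
    # rules: list of (condition, label); a condition is a predicate over the record
    for condition, label in rules:
        if condition(record):  # first matching rule wins
            return label
    return default             # no rule fires

rules = [
    (lambda r: r["blood_type"] == "warm" and r["lays_eggs"], "Birds"),
    (lambda r: r["taxable_income"] < 50_000 and r["refund"], "Evade=No"),
]
rec = {"blood_type": "warm", "lays_eggs": True, "taxable_income": 60_000, "refund": False}
print(classify(rec, rules))  # Birds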

Rule Evaluation

p(a,b) is the probability that both a and b happen.

p(a|b) is the probability that a happens, knowing that b has already happened.
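In these terms, the standard metrics for a rule A --> B can be written as: support = p(A,B); confidence = p(B|A) = p(A,B) / p(A); and lift = p(B|A) / p(B), where lift > 1 indicates that A and B occur together more often than independence would predict.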

Terms

Association Analysis: Min-Apriori, LIFT, Simpson's Paradox, Anti-monotone property

Ensemble Methods: Stacking, Random Forest

Decision Trees: C4.5, Pessimistic estimate, Occam's Razor, Hunt's Algorithm

Model Evaluation: Cross-validation, Bootstrap, Leave-one-out (C-V), Misclassification error rate, Repeated holdout, Stratification

Bayes: Probabilistic classifier

Data Visualization: Chernoff faces, Data cube, Percentile plots, Parallel coordinates

Nonlinear Dimensionality Reduction: Principal components, ISOMAP, Multidimensional scaling

Rules Analysis (see Rule Evaluation above)

Ensemble Techniques

Manipulate training data: bagging and boosting (an ensemble of "experts", each specializing on different portions of the instance space)

Manipulate output values: error-correcting output coding (an ensemble of "experts", each predicting 1 bit of the multibit full class label)

Methods: Bagging, Boosting, AdaBoost

Apriori Algorithm

Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
    Generate length (k+1) candidate itemsets from length k frequent itemsets
    Prune candidate itemsets containing subsets of length k that are infrequent
    Count the support of each candidate by scanning the DB
    Eliminate candidates that are infrequent, leaving only those that are frequent
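A compact Python sketch of the same loop, assuming transactions are sets of items and support is an absolute count:

from itertools import combinations

def apriori(transactions, minsup):
    support = lambda itemset: sum(1 for t in transactions if itemset <= t)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s for s in items if support(s) >= minsup}   # frequent 1-itemsets
    result, k = set(frequent), 1
    while frequent:
        # generate length k+1 candidates by joining frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune candidates that contain an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # count support by scanning the DB; keep only the frequent candidates
        frequent = {c for c in candidates if support(c) >= minsup}
        result |= frequent
        k += 1
    return result

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], minsup=2))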



K-means Clustering

Select K points as the initial centroids
repeat
    Form K clusters by assigning all points to the closest centroid
    Recompute the centroid of each cluster
until the centroids don't change

Closeness is measured by distance (e.g., Euclidean), similarity (e.g., Cosine), or correlation.

Centroid is typically the mean of the points in the cluster.
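A minimal Python sketch of the loop above, assuming Euclidean closeness and points given as numeric tuples:

import random

def kmeans(points, k, iters=100):
    # points: list of numeric tuples; returns the final centroids
    centroids = random.sample(points, k)
    for _ in range(iters):
        # assignment step: attach each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: recompute each centroid as the mean of its cluster's points
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:   # stop when the centroids don't change
            break
        centroids = new
    return centroids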

Hierarchical Clustering

Dendrogram example dataset: {7, 10, 20, 28, 35}

Single-Link or MIN: similarity of two clusters is based on the two most similar (closest / minimum) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.

Complete-Link or MAX: similarity of two clusters is based on the two least similar (most distant / maximum) points in the different clusters. Determined by all pairs of points in the two clusters.

Group Average: proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Agglomerative clustering starts with points as individual clusters and merges the closest clusters until only one cluster is left.

Divisive clustering starts with one, all-inclusive cluster and splits a cluster until each cluster has only one point.
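A short SciPy sketch reproducing the dendrogram for the example dataset above with single linkage (assumes scipy, and matplotlib for the plot):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

data = np.array([[7], [10], [20], [28], [35]])  # the example dataset above
Z = linkage(data, method="single")              # single-link / MIN merges
# each row of Z: [cluster_i, cluster_j, merge_distance, new_cluster_size]
print(Z)
dendrogram(Z)                                   # draws the dendrogram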
Density-Based Clustering

current_cluster_label <-- 1
for all core points do
    if the core point has no cluster label then
        current_cluster_label <-- current_cluster_label + 1
        Label the current core point with the cluster label
    end if
    for all points in the Eps-neighborhood, except the point itself do
        if the point does not have a cluster label then
            Label the point with the cluster label
        end if
    end for
end for

DBSCAN is a popular algorithm.

Density = number of points within a specified radius (Eps)

A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points at the interior of a cluster.

A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.

A noise point is any point that is not a core point or a border point.
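A compact, unoptimized Python sketch of the same idea, treating MinPts as the required number of Eps-neighbors (the point itself excluded):

def dbscan(points, eps, min_pts):
    # points: list of numeric tuples; returns one label per point
    # label conventions: None = unvisited, -1 = noise, 0, 1, 2, ... = cluster ids
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    def neighbors(i):
        return [j for j in range(len(points))
                if j != i and dist(points[i], points[j]) <= eps]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:         # not a core point; mark as noise for now
            labels[i] = -1
            continue
        cluster += 1                    # start a new cluster from this core point
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                    # grow the cluster through density-reachable points
            j = queue.pop()
            if labels[j] == -1:         # border point that was provisionally noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(j_nbrs)
    return labels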
Other Clustering Methods

Fuzzy is a partitional clustering method. Fuzzy clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to more than one cluster.

Graph-based methods: Jarvis-Patrick, Shared-Near Neighbor (SNN, Density), Chameleon

Model-based methods: Expectation-Maximization

Regression Analysis

* Linear Regression
    | Least squares
* Subset selection
* Stepwise selection
* Regularized regression
    | Ridge
    | Lasso
    | Elastic Net

Anomaly Detection

An anomaly is a pattern in the data that does not conform to the expected behavior (e.g., outliers, exceptions, peculiarities, surprises).

Types of Anomaly
* Point: an individual data instance is anomalous w.r.t. the rest of the data
* Contextual: an individual data instance is anomalous within a context
* Collective: a collection of related data instances is anomalous

Approaches
* Graphical (e.g., boxplots, scatter plots)
* Statistical (e.g., normal distribution, likelihood)
    | Parametric techniques
    | Non-parametric techniques
* Distance (e.g., nearest-neighbor, density, clustering)

Local outlier factor (LOF) is a density-based distance approach.

Mahalanobis Distance is a clustering-based distance approach.
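A small NumPy sketch of Mahalanobis-based outlier scoring, assuming the rows of X are data points; points with a large D_M lie far from the data's center relative to its covariance:

import numpy as np

def mahalanobis_scores(X):
    # D_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))
    mu = X.mean(axis=0)
    sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, sigma_inv, diff))

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [2.0, 2.5], [10.0, -5.0]])
print(mahalanobis_scores(X).argmax())  # index of the most anomalous row (here, 4)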