Data Mining Outline
Course Contents:
Topics to be covered include:
Basic statistical ideas - populations, distributions, samples and random samples
Classification models and methods - including: linear discriminant analysis; trees;
random forests; neural nets; boosting and bagging approaches; support vector machines.
Linear regression approaches to classification, compared with linear discriminant
analysis.
The training/test approach to assessing accuracy, and cross-validation.
Strategies in the (common) situation where source and target population differ,
typically in time but in other respects also.
Unsupervised models - k-means, association rules, hierarchical clustering, model-based
clustering.
Low-dimensional views of classification results - distance methods and ordination.
Strategies for working with large data sets.
Practical approaches to classification with real life data sets, using different methods to
gain different insights into presentation.
Privacy and security.
Use of the R system for handling the calculations.
Course Objective:
The main focus of the course will be supervised learning, primarily for classification. The
emphasis will be on practical applications of the methodologies that are described, with the R
system used for the computations. Attention will be given to
1) Generalizability and predictive accuracy, in the practical contexts in which methods are
applied.
MID TERM
9 M9: Classification: more methods
Rules
Regression
Instance-based (Nearest neighbor)
13 M13: Clustering
Introduction
K-means
Hierarchical
14 M14: Associations
Transactions
Frequent itemsets
Association rules
Applications
15 M15: Visualization
Graphical excellence and lie factor
Representing data in 1, 2, and 3-D
Representing data in 4+ dimensions
o Parallel coordinates
o Scatterplots
o Stick figures
Final Exam