WS - Data Analytics Fundamental-R
Education:
• Sarjana Teknik at ITB (Engineering Physics), Indonesia
• M.Sc. at TU Kaiserslautern (Industrial Math), Germany
• Ph.D at TU Delft (Applied Math), The Netherlands
Experience:
• Researcher at Fraunhofer ITWM, Germany
• Lecturer at Mechatronics Engineering, Swiss German University
• Head of Master of Information Technology, Swiss German University
BACKGROUND
• Data has become the commodity of the future
• What is needed is actually information
DATA ANALYTICS FLOW
(Figure: data analytics flow diagram, organised around PURPOSE)
CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING (CRISP-DM)
SUPERVISED VS UNSUPERVISED LEARNING

| Parameter             | Supervised                                      | Unsupervised                                           |
|-----------------------|-------------------------------------------------|--------------------------------------------------------|
| Goal                  | To predict the outcome of unseen data           | To get the underlying pattern or structure of the data |
| Training data         | Needed                                          | Not needed                                             |
| Learning time         | Learning is done offline                        | Learning is done in real time                          |
| Number of classes     | Known before the result is obtained             | Not known before the result is obtained                |
| Unknown pattern/class | Cannot be obtained                              | Can be obtained                                        |
| Examples              | Classification, decision tree, regression, etc. | Clustering, association, k-means, etc.                 |
GENERAL APPROACH OF CLASSIFICATION

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No

A learning algorithm induces a model from the training set (induction); the model is then applied to the test set to predict the missing class labels.

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
DECISION TREE
ANOTHER EXAMPLE OF DECISION TREE
APPLY DECISION TREE TO TEST DATA
HUNT’S ALGORITHM
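The slide's figure is not reproduced here, but the recursive idea of Hunt's algorithm (make a leaf when the records are pure, otherwise split and recurse) can be sketched in a few lines of Python, reusing the training set from the general-approach slide. The attribute order, the skipping of the numeric Attrib3, and the fallback label for unseen values are simplifying assumptions:

```python
from collections import Counter

# Training set from the slide; Attrib3 (income) is numeric and skipped for simplicity.
train = [
    {"Attrib1": "Yes", "Attrib2": "Large",  "Class": "No"},
    {"Attrib1": "No",  "Attrib2": "Medium", "Class": "No"},
    {"Attrib1": "No",  "Attrib2": "Small",  "Class": "No"},
    {"Attrib1": "Yes", "Attrib2": "Medium", "Class": "No"},
    {"Attrib1": "No",  "Attrib2": "Large",  "Class": "Yes"},
    {"Attrib1": "No",  "Attrib2": "Medium", "Class": "No"},
]

def induce(records, attrs):
    """Hunt-style induction: stop when pure (or out of attributes), else split."""
    labels = [r["Class"] for r in records]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    a = attrs[0]                                      # naive attribute choice
    return {"attr": a,
            "branches": {v: induce([r for r in records if r[a] == v], attrs[1:])
                         for v in set(r[a] for r in records)}}

def apply_model(tree, record):
    """Walk the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        # unseen attribute values fall back to "No" (an assumed majority default)
        tree = tree["branches"].get(record[tree["attr"]], "No")
    return tree

model = induce(train, ["Attrib1", "Attrib2"])
test = [{"Attrib1": "No", "Attrib2": "Small"},    # Tid 11
        {"Attrib1": "No", "Attrib2": "Large"}]    # Tid 15
print([apply_model(model, r) for r in test])      # -> ['No', 'Yes']
```

On this data the tree first splits on Attrib1 (the Yes branch is already pure), then on Attrib2, reproducing the induction/apply-model flow of the previous slide.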
DESIGN ISSUES WITH DECISION TREE
• How should training records be split? (e.g. by a test such as Annual Income > 80K?)
  § Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
    – Customer ID has the highest information gain because the entropy of all its children is zero
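The Customer ID problem is easy to verify numerically. A minimal Python check, assuming the 6-record class distribution from the training-set slide (5 "No", 1 "Yes"):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a class distribution: -sum p_i log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ["No", "No", "No", "No", "Yes", "No"]   # classes from the training set
parent = entropy(labels)

# Splitting on a unique Customer ID puts each record in its own partition:
# every child is pure, so every child entropy is 0 and the gain is maximal.
children = [[y] for y in labels]
gain_id = parent - sum(len(c) / len(labels) * entropy(c) for c in children)
print(round(parent, 3), round(gain_id, 3))  # the gain equals the full parent entropy
```

No other split can beat this gain, even though an ID split is useless for prediction; this is the motivation for the gain ratio below.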
GAIN RATIO
§ Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO

    SplitINFO = - Σ_{i=1}^{k} (n_i / n) log(n_i / n)

where the parent node p is split into k partitions and n_i is the number of records in partition i.
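A quick numeric check of the SplitINFO penalty, in pure Python; the 6-record example continues from the training-set slide and the gain value is illustrative:

```python
from math import log2

def split_info(sizes):
    """SplitINFO = -sum_i (n_i/n) log2(n_i/n) for partitions of the given sizes."""
    n = sum(sizes)
    return -sum((ni / n) * log2(ni / n) for ni in sizes)

# Same 6 records: a split into 2 partitions vs. a Customer-ID split into 6.
print(round(split_info([4, 2]), 3))   # modest penalty for k = 2
print(round(split_info([1] * 6), 3))  # log2(6) ~ 2.585: large penalty for many partitions

gain = 0.65                                  # illustrative gain value
print(round(gain / split_info([1] * 6), 3))  # gain ratio shrinks the ID split's advantage
```

Because SplitINFO grows with the number of partitions, dividing by it penalizes exactly the high-cardinality splits (like Customer ID) that raw information gain prefers.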
CLASSIFICATION ERROR
§ Test errors
– Errors committed on the test set
§ Generalization errors
– Expected error of a model over random selection of records from same
distribution
EXAMPLE

Two-class problem:
• + : 5200 instances
  – 5000 instances generated from a Gaussian centered at (10,10)
  – 200 noisy instances added
• o : 5200 instances
  – Generated from a uniform distribution
• Underfitting: when model is too simple, both training and test errors are large
• Overfitting: when model is too complex, training error is small but test error is large
MODEL EVALUATION
• Overfitting results in decision trees that are more complex than necessary
• Training error alone does not always provide a good estimate of how well the tree will perform on previously unseen records
• Need ways of estimating generalization errors → can use validation sets or add a model-complexity penalty to the training error
• One way: cross validation
o Partition data into k disjoint subsets
o k-fold: train on k-1 partitions, test on the
remaining one
o Leave-one-out: k=n
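The cross-validation steps above can be sketched in pure Python; the "model" here is a majority-class baseline, a stand-in assumption so the partitioning logic stays visible:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(labels, k):
    """Train on k-1 folds, test on the remaining one; average the test errors."""
    folds = k_fold_indices(len(labels), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, f in enumerate(folds) if m != i for j in f]
        majority = max(set(labels[j] for j in train_idx),
                       key=lambda c: sum(labels[j] == c for j in train_idx))
        errors.append(sum(labels[j] != majority for j in test_idx) / len(test_idx))
    return sum(errors) / k   # estimate of the generalization error

labels = ["No"] * 8 + ["Yes"] * 2
print(cross_validate(labels, k=5))   # leave-one-out would be k=len(labels)
```

Every record is used for testing exactly once, which is why the averaged error is a better estimate of generalization error than the training error alone.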
CLUSTER ANALYSIS
The problem is changed into an optimization problem → minimize a certain objective function
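For k-means, the objective being minimized is the sum of squared distances from each point to its assigned centre. A minimal 1-D sketch in pure Python (the data, k, and initial centres are made-up illustrations) shows the objective decreasing with each iteration:

```python
# Objective: J = sum over points of the squared distance to the nearest centre.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1]
centers = [0.0, 4.0, 10.0]                    # arbitrary initial centres, k = 3

def objective(centers, points):
    return sum(min((p - c) ** 2 for c in centers) for p in points)

history = [objective(centers, points)]
for _ in range(5):
    clusters = [[] for _ in centers]
    for p in points:                          # assignment step
        clusters[min(range(len(centers)),
                     key=lambda i: (p - centers[i]) ** 2)].append(p)
    centers = [sum(c) / len(c) if c else centers[i]     # update step
               for i, c in enumerate(clusters)]         # (keep old centre if empty)
    history.append(objective(centers, points))

print([round(h, 3) for h in history])  # the objective never increases
```

Both k-means steps (reassigning points, then moving each centre to its cluster mean) can only lower J, which is what makes the clustering problem an optimization problem.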
SUPPORT VECTOR MACHINE
NON-LINEAR SVM
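The SVM slides above carry no surviving body text; as a minimal illustration, a linear SVM can be trained by sub-gradient descent on the hinge-loss objective J(w, b) = (λ/2)‖w‖² + (1/n) Σ max(0, 1 − y_i(w·x_i + b)). Everything below (the toy data, step size, and regularization strength) is an assumption for the sketch:

```python
# Linearly separable toy data: three points per class, labels +1 / -1.
data = [((2.0, 2.5), 1), ((3.0, 3.0), 1), ((2.5, 2.0), 1),
        ((-2.0, -2.5), -1), ((-3.0, -3.0), -1), ((-2.5, -2.0), -1)]

w, b = [0.0, 0.0], 0.0
lam, eta = 0.01, 0.1          # regularization strength and step size (assumptions)
for _ in range(100):          # sub-gradient descent passes over the data
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1] + b)
        if margin < 1:        # margin violated: the hinge term is active
            w = [wi - eta * (lam * wi - y * xi) for wi, xi in zip(w, x)]
            b += eta * y
        else:                 # only the regularizer shrinks w
            w = [wi * (1 - eta * lam) for wi in w]

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

print([predict(x) for x, _ in data])  # the separable toy data are classified correctly
```

A non-linear SVM replaces the dot product w·x with a kernel evaluation, letting the same margin-maximization idea separate data that is not linearly separable in the original space.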