WS - Data Analytics Fundamental-R

Summary:
1. The document discusses fundamentals of data analytics, including the data analytics flow, supervised vs unsupervised learning, decision trees, and cluster analysis.
2. Key aspects of supervised vs unsupervised learning are compared, including goals, the need for training data, and the ability to obtain unknown patterns.
3. Decision tree algorithms are explained, and issues with large numbers of partitions are discussed. Model evaluation techniques like cross-validation are also introduced.
4. Different types of clustering algorithms are covered, including partitional vs hierarchical clustering and the concept of centroids.

DATA ANALYTICS FUNDAMENTAL

DR. EKA BUDIARTO, S.T., M.SC.
DR. ENG. BAGUS MAHAWAN, B.ENG., M.ENG.
EKA BUDIARTO

Education:
• Sarjana Teknik at ITB (Engineering Physics), Indonesia
• M.Sc. at TU Kaiserslautern (Industrial Math), Germany
• Ph.D. at TU Delft (Applied Math), The Netherlands

Experience:
• Researcher at Fraunhofer ITWM, Germany
• Lecturer at Mechatronics Engineering, Swiss German University
• Head of Master of Information Technology, Swiss German University
BACKGROUND
• Data has become the commodity of the future.
• What is needed is actually information.
DATA ANALYTICS FLOW
(Figure: data analytics flow diagram, annotated with PURPOSE)
CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING (CRISP-DM)
SUPERVISED VS UNSUPERVISED LEARNING
| Parameter | Supervised | Unsupervised |
|---|---|---|
| Goal | To predict the outcome of unseen data | To get the underlying pattern or structure of the data |
| Training data | Needed | Not needed |
| Learning time | Learning is done offline | Learning is done in real time |
| Number of classes | Known before the result is obtained | Not known before the result is obtained |
| Unknown pattern/class | Cannot be obtained | Can be obtained |
| Examples | Classification, decision tree, regression, etc. | Clustering, association, k-means, etc. |
GENERAL APPROACH OF CLASSIFICATION

Training Set:

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|---|---|---|---|---|
| 1 | Yes | Large | 125K | No |
| 2 | No | Medium | 100K | No |
| 3 | No | Small | 70K | No |
| 4 | Yes | Medium | 120K | No |
| 5 | No | Large | 95K | Yes |
| 6 | No | Medium | 60K | No |
| 7 | Yes | Large | 220K | No |
| 8 | No | Small | 85K | Yes |
| 9 | No | Medium | 75K | No |
| 10 | No | Small | 90K | Yes |

A learning algorithm induces a model from the training set (induction). The learned model is then applied to records whose class labels are unknown (deduction):

Test Set:

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|---|---|---|---|---|
| 11 | No | Small | 55K | ? |
| 12 | Yes | Medium | 80K | ? |
| 13 | Yes | Large | 110K | ? |
| 14 | No | Small | 95K | ? |
| 15 | No | Large | 67K | ? |
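A minimal sketch of this induce-then-apply flow in Python, assuming scikit-learn (an assumption; the slides themselves demonstrate the concepts in Orange, not code). The hand-encoding of the categorical attributes is purely illustrative:

```python
# Sketch of the general classification approach: induction on the training
# set, deduction (prediction) on the test set.
from sklearn.tree import DecisionTreeClassifier

# Training set from the slide, hand-encoded: Yes/No -> 1/0,
# Small/Medium/Large -> 0/1/2, Attrib3 in thousands.
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Induction: learn a model from the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Deduction: apply the model to the test set with unknown ("?") labels.
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]
print(model.predict(X_test))
```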
DECISION TREE
ANOTHER EXAMPLE OF DECISION TREE
APPLY DECISION TREE TO TEST DATA
HUNT’S ALGORITHM
DESIGN ISSUES WITH DECISION TREE

How should training records be split?
• The method for specifying the test condition depends on the attribute types → sometimes discretization is needed
• A measure is needed for evaluating the goodness of a test condition

How should the splitting procedure stop?
• Stop splitting if all the records belong to the same class or have identical attribute values
• Or use early termination

(Figure: (i) a binary split on "Annual Income > 80K?" with Yes/No branches; (ii) a multi-way split on "Annual Income?" with branches < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K)
HOW TO DETERMINE THE BEST SPLIT?

• Which test condition is the best?
• We need a measure of node impurity.

(Figure: three candidate splits of a node with 10 C0 and 10 C1 records — by Gender (C0: 6, C1: 4 vs C0: 4, C1: 6), by Car Type (Family: C0: 1, C1: 3; Sports: C0: 8, C1: 0; Luxury: C0: 1, C1: 7), and by Customer ID (c1 … c20, each child holding a single record))
FINDING BEST SPLIT
ONE MEASURE: GINI INDEX

• Gini index for a given node t (a code sketch follows below):

Gini(t) = 1 - \sum_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 - 1/n_c, with n_c the number of classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples (six records, two classes):
• C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - 0^2 - 1^2 = 0
• C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
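A small sketch of this computation (plain Python; the function name is illustrative, not from the slides):

```python
# Gini(t) = 1 - sum_j p(j|t)^2, given the class counts at a node.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.0   -> pure node (most interesting information)
print(gini([2, 4]))  # 0.444 -> mixed node
print(gini([3, 3]))  # 0.5   -> maximum for two classes: 1 - 1/2
```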
PROBLEM WITH LARGE NUMBER OF PARTITIONS

• Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure.

(Figure: the same Gender / Car Type / Customer ID splits as above)

– Customer ID has the highest information gain because the entropy of all its children is zero.
GAIN RATIO
• Gain ratio:

GainRatio_split = Gain_split / SplitInfo

SplitInfo = -\sum_{i=1}^{k} (n_i / n) \log(n_i / n)

where the parent node p is split into k partitions and n_i is the number of records in partition i. (A code sketch follows below.)
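A sketch of the penalty at work (plain Python; the base-2 logarithm and the entropy-based gain are assumptions, since the slide does not fix either):

```python
# Gain ratio = information gain / split info; the split info term
# penalizes splits with many partitions, such as Customer ID.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def split_info(sizes):
    n = sum(sizes)
    return -sum((ni / n) * log2(ni / n) for ni in sizes)

def gain_ratio(parent, children):
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    return gain / split_info([sum(ch) for ch in children])

parent = [10, 10]                            # 10 C0 and 10 C1 records
gender = [[6, 4], [4, 6]]                    # two large, impure partitions
customer_id = [[1, 0]] * 10 + [[0, 1]] * 10  # 20 pure singleton partitions
print(gain_ratio(parent, gender))       # small gain, small penalty
print(gain_ratio(parent, customer_id))  # maximal gain, heavily penalized
```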
CLASSIFICATION ERROR

• Training errors (apparent errors)
  – Errors committed on the training set
• Test errors
  – Errors committed on the test set
• Generalization errors
  – The expected error of a model over a random selection of records from the same distribution
EXAMPLE

Two-class problem:
• + : 5200 instances
  – 5000 instances generated from a Gaussian centered at (10, 10)
  – 200 noisy instances added
• o : 5200 instances
  – Generated from a uniform distribution
• 10% of the data is used for training and 90% for testing
WHICH ONE IS BETTER?
MODEL OVERFITTING

• Underfitting: when the model is too simple, both training and test errors are large
• Overfitting: when the model is too complex, the training error is small but the test error is large
• Overfitting results in decision trees that are more complex than necessary
• Training error alone does not always provide a good estimate of how well the tree will perform on previously unseen records

MODEL EVALUATION

• We need ways of estimating generalization errors → we can use validation sets or add a model-complexity term to the training error
• One way: cross-validation (see the sketch below)
  o Partition the data into k disjoint subsets
  o k-fold: train on k-1 partitions, test on the remaining one
  o Leave-one-out: k = n
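A minimal cross-validation sketch, assuming scikit-learn and its bundled Iris data (the slides demonstrate the concept in Orange instead):

```python
# k-fold cross-validation: partition the data into k disjoint subsets,
# train on k-1 of them, test on the remaining one, and average the scores.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())              # 5-fold
print(cross_val_score(tree, X, y, cv=LeaveOneOut()).mean())  # k = n
```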
CLUSTER ANALYSIS

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
TYPES OF CLUSTERING

• A clustering is a set of clusters
• There is an important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
PARTITIONAL CLUSTERING

HIERARCHICAL CLUSTERING
CENTROID
• Clusters can be center-based → a cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its own cluster than to the center of any other cluster
• The concept of distance is very important → the easiest one is the Euclidean distance (see the sketch below)
• The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster
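A tiny sketch of these two notions, assuming NumPy (the data points are illustrative):

```python
# The centroid is the average of all points in the cluster; Euclidean
# distance measures how close a point is to that center.
import numpy as np

cluster = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 4.0]])
centroid = cluster.mean(axis=0)           # average of all the points

point = np.array([2.5, 3.0])
dist = np.linalg.norm(point - centroid)   # Euclidean distance to the center
print(centroid, dist)
```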
K-MEANS CLUSTERING
• Partitional clustering approach
• Number of clusters, K, must be specified
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The basic algorithm is very simple (a usage sketch follows below)
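A minimal k-means sketch, assuming scikit-learn and synthetic two-blob data (both assumptions; the hands-on session uses Orange rather than code):

```python
# k-means: choose K, then iteratively assign each point to the closest
# centroid and recompute each centroid from its assigned points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # blob around (0, 0)
               rng.normal(5, 1, (100, 2))])  # blob around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.labels_[:10])      # cluster assignment of the first 10 points
```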
DIFFERENT CLUSTERING
K-MEANS CLUSTERING

• The most common evaluation measure is the Sum of Squared Errors (SSE)
• For each point, the error is its distance to the nearest cluster centroid
• To get the SSE, we square these errors and sum them (a worked sketch follows below):

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)

where m_i is the centroid of cluster C_i.
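A worked sketch of the SSE formula, assuming NumPy and scikit-learn (whose `inertia_` attribute stores exactly this quantity):

```python
# SSE = sum over clusters i of the squared distances from each point
# x in C_i to its centroid m_i.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

sse = sum(np.sum((X[kmeans.labels_ == i] - m) ** 2)
          for i, m in enumerate(kmeans.cluster_centers_))
print(sse, kmeans.inertia_)  # the two values agree
```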

• K-means has problems when clusters are of differing
  – sizes
  – densities
  – non-globular shapes
• K-means also has problems when the data contains outliers
K-MEANS LIMITATION: DIFFERENT SIZE
K-MEANS LIMITATION: DIFFERENT DENSITY
K-MEANS LIMITATION: NON-GLOBULAR SHAPE
OVERCOMING K-MEANS LIMITATION
• One way: use a lot of centroids → the resulting small clusters need to be merged at the end
IMPORTANCE OF CHOOSING INITIAL CENTROIDS
SOME SOLUTIONS TO INITIAL CENTROID PROBLEM

• Sample the data and use hierarchical clustering to determine the initial centroids
• Select more than k initial centroids and then choose among them
  – Select the most widely separated ones
• Postprocessing
• Generate a larger number of clusters and then perform a hierarchical clustering (a related sketch follows below)
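A sketch of two commonly used remedies in scikit-learn: k-means++ initialization, which spreads the initial centroids apart, and multiple restarts. Both are assumptions in the sense that the slide lists strategies without naming a library:

```python
# init="k-means++" picks initial centroids that are far apart;
# n_init=10 keeps the best (lowest-SSE) run of 10 initializations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10,
                random_state=0).fit(X)
print(kmeans.inertia_)
```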
ANOMALY OR OUTLIER

• What are anomalies/outliers?
  – The set of data points that are considerably different from the remainder of the data
• The natural implication is that anomalies are relatively rare
  – "One in a thousand" occurs often if you have lots of data
  – Context is important, e.g., heavy rain in the dry season
• They can be important or a nuisance
  – Malware intrusion
  – Fraud in credit card transactions
  – Unusually high blood pressure
NOISE VS ANOMALY

• Noise is erroneous, perhaps random, values or contaminating objects
  – Weight recorded incorrectly
  – Grapefruit mixed in with the oranges
• Noise doesn't necessarily produce unusual values or objects
• Noise is not interesting
• Anomalies may be interesting if they are not a result of noise
• Noise and anomalies are related but distinct concepts
ANOMALY DETECTION

• Visual approach: scatter plots → subjective, not autonomous
• Statistical approach: an outlier is an object that has a low probability with respect to a probability distribution model of the data → for example, removing samples that create too much covariance
• Model-based approach: for example, a one-class SVM (support vector machine); see the sketch below
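A minimal model-based sketch with a one-class SVM, assuming scikit-learn; the data mirrors the earlier example (a Gaussian cloud around (10, 10) plus uniform noise), but the exact parameters are illustrative:

```python
# A one-class SVM learns the boundary of the "normal" region and flags
# points outside it as outliers.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(10, 1, (500, 2))    # dense cloud centered at (10, 10)
outliers = rng.uniform(0, 20, (10, 2))  # sparse uniform noise

clf = OneClassSVM(nu=0.05, kernel="rbf").fit(normal)
print(clf.predict(outliers[:5]))  # predict() returns +1 inlier, -1 outlier
```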
NORMAL DISTRIBUTION
SUPPORT VECTOR MACHINE (SVM)
SUPPORT VECTOR MACHINE

The problem is changed into an optimization problem → minimize a certain objective function
NON-LINEAR SVM

• A kernel is needed to transform the problem into a high-dimensional space
• Example: see the sketch below
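A sketch of a non-linear SVM with an RBF kernel, assuming scikit-learn and a synthetic circular class boundary (the particular kernel shown on the original slide is not preserved, so the RBF kernel here is an assumption):

```python
# The RBF kernel implicitly maps points into a high-dimensional space in
# which a circular class boundary becomes linearly separable.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)  # circle: not linearly separable

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.score(X, y))  # training accuracy of the non-linear classifier
```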
HANDS-ON

• Use the Orange software (https://orange.biolab.si/)
• Advantage: an interactive data visualization and visual programming approach → should help the participants understand the concepts more easily
