Data Splitting and Bias Variance Tradeoff
Aniket Kesari
UC Berkeley
October 1, 2020
Supervised v. Unsupervised Learning
• Supervised Learning
• Training data contains labels for the outcome(s)
• Machine Learning algorithm infers a function describing the
relationship between the inputs and the output
• Algorithm can be used on a new set of input data to infer the
output
• Examples: Linear Regression, Decision Trees, Support Vector
Machines, etc.
• Unsupervised Learning
• Training data does not contain any labels
• Algorithm instead trains to uncover underlying patterns in the
data
• Used for clustering, dimensionality reduction, etc.
• Examples: k-means, Principal Components Analysis, Singular
Value Decomposition, Expectation-Maximization
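The contrast between the two paradigms can be sketched with scikit-learn on a small synthetic dataset (the dataset and variable names here are illustrative, not from the slides):

```python
# A minimal sketch contrasting supervised and unsupervised learning.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Supervised: training data includes a label y; the algorithm infers
# a function from the inputs to the output.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y)
preds = reg.predict(rng.normal(size=(5, 2)))  # infer output on new inputs

# Unsupervised: no labels; the algorithm uncovers structure in the
# inputs alone (here, cluster assignments).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(reg.coef_.round(1), km.labels_[:5])
```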
Regression and Classification
• Two goals:
• Minimize test bias: This means using as much data as we can
in the training phase, which necessarily means reducing the
amount of test data available
• Minimize test variance: But, we also want a decent number of
points in the test set, otherwise the estimates will have large
variances
• Fewer folds lead to higher test bias, but more folds lead to
higher test variance
Cross-Validation
• Advantages:
• Tends to avoid overfitting problems
• Usable with relatively small datasets (compared to
train/test/validation split)
• Does not make the background assumptions required in
information criteria approach
• Disadvantages:
• Assumes that the out-of-sample data was drawn from the same
population as the training data
• Computationally VERY expensive
• k=5 or 10 is conventionally used, but it is by no means
perfectly suited to every context
• In general, this problem lessens the more data you have
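A minimal k-fold cross-validation sketch with scikit-learn, using the conventional k=5 (the model and data are illustrative). Each fold is held out once and the model is refit k times, which is where the computational expense comes from:

```python
# 5-fold cross-validation on a synthetic regression problem.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One held-out score per fold; the mean is the CV error estimate.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(len(scores), scores.mean())
```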
K-Fold CV Illustration
• K-Fold
• Divide into k-folds and rotate
• Leave-one-out (LOO)
• Leave out one observation, train the model on the rest, and
evaluate the error on the left-out observation; repeat for every
observation
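The LOO procedure above can be sketched with scikit-learn's `LeaveOneOut` splitter (dataset is illustrative). With n observations the model is refit n times, one per held-out point:

```python
# Leave-one-out cross-validation: one fit and one score per observation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(len(scores))  # one score per left-out observation
```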
Comparison