My Notes
Deep learning: subset of machine learning where artificial neural networks adapt and learn from vast
amounts of data
Performance: measure of the accuracy/results of a task, e.g. accuracy of weather predictions, classification
accuracy
Types of learning
- Supervised learning
  - Classification: predicts a qualitative value
  - Regression: predicts a quantitative value
- Unsupervised learning
  - Clustering
  - Dimensionality reduction
Types of data
- Categorical/Qualitative: ordinal, categorical (nominal)
- Numerical/Quantitative: continuous
- Missing, censored
Interval: No natural zero, can have values below 0 eg. Temperature in Celsius, pH
Ratio: Includes natural zero eg. Temperature in Kelvin, height, weight, duration
Data wrangling
- Transforming and mapping data into a format more appropriate for downstream analytics
- Normalisation/scaling ensures that one feature does not dominate over the others
Feature clipping: cap feature values at chosen maximum/minimum thresholds to limit the effect of extreme outliers
Ordinal data: the numerical values have no significance beyond the ranking order; can be normalised into
standardised distance values
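A minimal numpy sketch of min-max scaling (so no feature dominates purely because of its range) and feature clipping; the feature values and clipping thresholds are assumptions for illustration only:

```python
import numpy as np

# Hypothetical feature column with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling to [0, 1], so no feature dominates purely because of its range.
x_scaled = (x - x.min()) / (x.max() - x.min())

# Feature clipping: cap values at chosen lower/upper thresholds (here 1 and 10)
# to limit the influence of extreme outliers before scaling.
x_clipped = np.clip(x, 1.0, 10.0)

print(x_scaled)
print(x_clipped)
```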
Data cleaning
Imputation: replacing missing data with substituted values
Data integrity
Ensuring accuracy and consistency of data over its entire life cycle
Data visualisation
Uneven distribution of training data gives a biased AI system; an even (balanced) training distribution is preferred
Boxplot
- Box spans the interquartile range (IQR) from Q1 (25th percentile) to Q3 (75th percentile), with the median inside the box
- Whiskers: "minimum" = Q1 - 1.5*IQR, "maximum" = Q3 + 1.5*IQR (these exclude outliers)
- Points beyond the whiskers are outliers
Log scaling – for exponential/skewed data distributions; makes skewed data more evenly spread
Linear independence: c1x1 + c2x2 + … + ckxk = 0 only when all ci = 0; if it holds for some ci ≠ 0, the vectors are linearly dependent
Probability mass function: Discrete function, each point on the graph = probability
[Figure: example PMF with P(X=1) = 0.5, P(X=3) = 0.2, P(X=7) = 0.3]
Probability density function: Continuous function, integrate within range to find probability
∫ from -∞ to ∞ of f(x) dx = 1;  P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx (area under f(x) between a and b)
Probability
- Combine specific experimental designs such as randomised studies with standard statistical
analysis techniques
Xw = y
For w^T X = y^T:
Use the left inverse of X for underdetermined systems (fewer equations than unknowns) and the right inverse for overdetermined systems (more equations than unknowns):
Left inverse (underdetermined): w^T = y^T (X^T X)^-1 X^T
Right inverse (overdetermined): w^T = y^T X^T (X X^T)^-1
OR convert to Xw = y and solve.
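A minimal numpy sketch of both cases, written in the Xw = y form (so the left/right labels mirror the w^T X = y^T form above: the left inverse now handles the overdetermined case and the right inverse the underdetermined one); the matrices are assumed example values:

```python
import numpy as np

# Overdetermined example in Xw = y form: 4 equations, 2 unknowns.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])

# Left inverse of X (overdetermined in this form): w = (X^T X)^-1 X^T y, least-squares solution.
w_over = np.linalg.inv(X.T @ X) @ X.T @ y

# Underdetermined example: 2 equations, 3 unknowns.
X_u = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
y_u = np.array([2.0, 3.0])

# Right inverse of X (underdetermined in this form): w = X^T (X X^T)^-1 y, minimum-norm solution.
w_under = X_u.T @ np.linalg.inv(X_u @ X_u.T) @ y_u

# np.linalg.lstsq covers both cases and is numerically more stable.
w_check, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_over, w_under, w_check)
```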
Set
Functions
Linear function: f(x) = Ax (satisfies superposition)
Affine function: f(x) = Ax + b (linear function plus a constant offset)
Max and arg max: max f(a) gives the maximum value of f(a); arg max f(a) gives the value of a at which f(a) is maximum
Linear regression
Use system of linear equations with bias/offset (add a column of 1s) to find w
Polynomial regression
w = (P^T P)^-1 P^T y (same as the left inverse; P includes the bias column and powers of x, e.g. x, x^2, depending on the order)
w = P^T (P P^T)^-1 y (right inverse)
Ridge regression
- Shrinks the regression coeff w by imposing a penalty on their size (λ controls amount of
shrinkage, larger λ, larger shrinkage)
- Makes a singular matrix non-singular
w = (P^T P + λI)^-1 P^T y, where λ is a small number, e.g. 0.0000001, and I is the identity matrix
w = P^T (P P^T + λI)^-1 y, where λ is a small number, e.g. 0.0000001, and I is the identity matrix
- Solutions from ridge regression are close approximations to the unregularised solution (e.g. 0.099998 ≈ 0.1)
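A minimal numpy sketch of second-order polynomial regression and its ridge version using the closed forms above; the data points and λ are assumptions for illustration:

```python
import numpy as np

# Hypothetical 1-D data for a second-order polynomial fit.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.1, 5.2, 9.8, 17.1])

# Design matrix P with a bias column, x and x^2 (order 2).
P = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares (left inverse): w = (P^T P)^-1 P^T y
w_ls = np.linalg.inv(P.T @ P) @ P.T @ y

# Ridge regression: w = (P^T P + lambda*I)^-1 P^T y
lam = 1e-7
w_ridge = np.linalg.inv(P.T @ P + lam * np.eye(P.shape[1])) @ P.T @ y

# Predictions for new inputs use the same design-matrix construction.
x_new = np.array([5.0])
P_new = np.column_stack([np.ones_like(x_new), x_new, x_new**2])
print(P_new @ w_ridge)
```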
Discrete functions:
Signum discrimination
sgn(y_t) = sgn([0.4444, -0.1111]) = [+1, -1]
After performing linear or polynomial regression, find the largest value in each row to find class.
Example: Y_t = [[0.542977, 0.179245], [0.218029, 0.226415]] with columns Class 1 and Class 2 → row 1 is assigned Class 1, row 2 is assigned Class 2
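A minimal numpy sketch of both decision rules (the regression outputs below reuse the example values above; they are illustrative only):

```python
import numpy as np

# Regression outputs for two test points and two classes (one column per class).
Yt = np.array([[0.542977, 0.179245],
               [0.218029, 0.226415]])

# Multi-class: the predicted class is the column with the largest value in each row.
pred_class = np.argmax(Yt, axis=1) + 1  # +1 so classes are numbered 1 and 2
print(pred_class)  # [1 2]

# Binary (+1/-1 targets): the sign of the regression output gives the class.
yt_binary = np.array([0.4444, -0.1111])
print(np.sign(yt_binary))  # [ 1. -1.]
```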
Underdetermined systems are either inconsistent or have infinitely many solutions (never a unique one).
[Figure: fits with different orders of polynomial regression, illustrating underfitting vs overfitting]
Feature selection and regularisation
Solutions:
Feature selection
Only do feature selection on the training set and evaluate the model using the test set
Feature selection with test set/ full dataset leads to inflated performance (No good!!)
Pearson’s coefficient, r
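A minimal numpy sketch of feature ranking by Pearson's r (r = cov(x, y) / (std(x) std(y))); the feature matrix and target values are assumptions, and in practice this would be computed on the training set only:

```python
import numpy as np

# Hypothetical feature matrix (rows = samples, columns = features) and target.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0], [5.0, 7.0]])
y = np.array([1.2, 1.9, 3.2, 3.9, 5.1])

def pearson_r(x, y):
    # Centre both variables, then normalise the covariance by the product of norms.
    x = x - x.mean()
    y = y - y.mean()
    return np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

scores = [pearson_r(X[:, j], y) for j in range(X.shape[1])]
# Keep the features with the largest |r|.
ranked = np.argsort(-np.abs(scores))
print(scores, ranked)
```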
[Figure: bias-variance trade-off against model complexity or number of features – more complex models have high variance and low bias]
General machine learning algorithm
Hinge loss: max(0, 1 - y_i f(x_i; w))
Exponential loss: exp(-y_i f(x_i; w))
[Figure: binary, hinge and exponential loss plotted against the margin f(x_i, w) y_i]
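A minimal numpy sketch of the three losses as functions of the margin f(x_i, w) y_i; the margin values are arbitrary example inputs:

```python
import numpy as np

# Hypothetical margins m = f(x_i, w) * y_i for a few training points.
m = np.array([-1.0, 0.0, 0.5, 2.0])

binary_loss = (m <= 0).astype(float)   # 1 if the margin is not positive (misclassified), else 0
hinge_loss = np.maximum(0.0, 1.0 - m)  # penalises any margin below 1
exp_loss = np.exp(-m)                  # grows quickly for negative margins

print(binary_loss, hinge_loss, exp_loss)
```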
Decision Trees
Use conditions to classify into nodes
Advantages
- Easy to visualise and understand tree
- Can work with a mix of continuous and discrete data
- Less data cleaning required
- Makes fewer assumptions about the relationship between features and target
Disadvantages
- Trees can be overly complex (overfitting)
- Trees can be unstable (small changes in training data result in very different trees)
Depth 1: at most 2^1 = 2 terminal nodes (leaves)
Depth 2: at most 2^2 = 4 terminal nodes (leaves)
[Figure: example tree in which B and C are terminal nodes, or leaves]
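A minimal sketch of fitting a small decision tree; scikit-learn is not mentioned in these notes, so treat the library choice and the toy data as assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: two features, binary labels.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0], [5.0, 1.5], [0.5, 4.5]])
y = np.array([0, 0, 1, 1, 1, 0])

# Limiting max_depth keeps the tree small (at most 2^depth leaves) and reduces overfitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(tree.predict([[2.5, 3.0]]))
print(tree.get_depth(), tree.get_n_leaves())
```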
LOOCV, leave-one-out cross validation: repeat training and testing, each time holding out a different single observation as the test set and training on the rest; recommended for small datasets, not needed for large ones (enough training/test data)
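A minimal numpy sketch of LOOCV for a simple linear regression model; the data and the use of squared error are assumptions for illustration:

```python
import numpy as np

# Hypothetical small dataset: bias column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

errors = []
for i in range(len(y)):
    train = np.arange(len(y)) != i                        # hold out point i, train on the rest
    w = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    y_hat = X[i] @ w                                      # test on the single held-out point
    errors.append((y[i] - y_hat) ** 2)

print(np.mean(errors))  # LOOCV estimate of the mean squared error
```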
Evaluation metrics
Regression
- Mean square error – sum of squared errors divided by the number of datapoints:
  MSE = (1/n) * Σ_{i=1}^{n} (y_i - ŷ_i)^2
- Mean absolute error – sum of absolute errors (made positive) divided by the number of datapoints:
  MAE = (1/n) * Σ_{i=1}^{n} |y_i - ŷ_i|
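A minimal numpy sketch of both regression metrics; the true values and predictions are assumed example numbers:

```python
import numpy as np

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 7.5])

mse = np.mean((y_true - y_pred) ** 2)   # mean square error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
print(mse, mae)
```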
Classification
Confusion Matrix for Binary Classification
              P (predicted)   N (predicted)
P (actual)    TP              FN
N (actual)    FP              TN
False positive: Type I error; false negative: Type II error
- Recall = true positives / actual positives = TP / (TP + FN)
- Precision = true positives / predicted positives = TP / (TP + FP)
- Accuracy = classified accurately / total = (TP + TN) / (TP + TN + FP + FN)
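A minimal numpy sketch computing the confusion-matrix counts and the three metrics above; the labels and predictions are assumed example values:

```python
import numpy as np

# Hypothetical binary labels (1 = positive, 0 = negative) and predictions.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(recall, precision, accuracy)
```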
Area under curve (AUC of ROC)
[Figure: ROC curve, TPR (%) against FPR (%); AUC = A + ½ × area of plot, with A + B = ½]
AUC ranges from 0 (worst) to 100 (best)
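A minimal numpy sketch estimating AUC from ROC points with the trapezoidal rule; the (FPR, TPR) points and the rescaling to a 0-100 range (matching the percentage axes above) are assumptions:

```python
import numpy as np

# Hypothetical ROC points (FPR %, TPR %), ordered by increasing FPR.
fpr = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])
tpr = np.array([0.0, 50.0, 70.0, 85.0, 95.0, 100.0])

# Trapezoidal rule for the area under the curve, then rescale so a perfect
# classifier scores 100 on these percentage axes.
auc = np.sum(0.5 * np.diff(fpr) * (tpr[:-1] + tpr[1:])) / 100.0
print(auc)
```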
Unsupervised learning
- Clustering – groups a set of objects so that objects that are more similar to each other end up in the same
group
- Density estimation – models pdf of unknown probability distribution
- Component analysis
- Neural network – Autoencoder, self-organising map
K-means clustering (No label)
1. Randomly pick K centroids.
2. Compute distance from each datapoint to each centroid.
3. Each point will belong to the group whose centroid it is closest to
4. Recompute centroids of each group and repeat until it stabilises.
Initialisation
- Random partition
Randomly assign a cluster to each observation
- Forgy method
Randomly choose k observations and use these as initial means
Can have different outcomes for different initialisations
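A minimal numpy sketch of the naive K-means steps above, using the Forgy method for initialisation; the 2-D data, the number of iterations, and the omission of empty-cluster handling are assumptions for illustration:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Forgy initialisation: pick k observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: assign each point to the group whose centroid is closest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 4: recompute centroids and repeat until they stabilise.
        # (Empty clusters are not handled in this sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```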
Hard vs soft clustering
Hard clustering – Each data point belongs to one cluster (Naïve K means)
Soft clustering (fuzzy clustering/ fuzzy C means) – Each data point can belong to more than one
cluster.
Parameters affecting clustering results
1. Algorithm type (eg. K-means)
2. Number of clusters
3. Starting Centroids/points
4. Number of iterations
5. Metrics used (eg. Distance between points)
Neural networks
- Nested function (Function within a function)
- Weight on each line
Multilayer perceptrons (MLPs)
More dense connections
[Figure: MLP with an input layer, hidden layer 1, hidden layer 2 and an output layer]
Convolutional neural networks (CNNs)
- Each output value does not need to be mapped directly to the input image; each output is the sum of elementwise products between the kernel and a local patch of the image
- Worked example from the figure: Output[0][0] = (9*0) + (4*2) + (1*4) + (1*1) + (1*0) + (1*1) + (2*0) + (1*1)
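A minimal numpy sketch of this output computation; the image and kernel values are assumptions (not the ones from the original figure) chosen only to show how each output element is formed:

```python
import numpy as np

# Hypothetical 4x4 input image and 3x3 kernel.
image = np.array([[9, 4, 1, 2],
                  [1, 1, 1, 0],
                  [2, 1, 3, 1],
                  [0, 2, 1, 4]], dtype=float)
kernel = np.array([[0, 2, 4],
                   [1, 0, 1],
                   [0, 1, 0]], dtype=float)

# "Valid" convolution (no padding, as used in deep learning, i.e. cross-correlation):
# each output value is the sum of elementwise products between the kernel and one patch.
h, w = kernel.shape
out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)

print(out[0, 0])  # computed the same way as the worked Output[0][0] sum above
```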
Activation functions
Sigmoid: σ(x) = 1 / (1 + e^-x)
- Output between 0 and 1; used when a probability needs to be predicted, preferably in the output layer
- May cause the neural network to get stuck during training
- When input values are very large/small, the gradient is small and slows down gradient descent
Tanh
- Output range between -1 and 1, mean closer to 0
- Centred value of 0 makes learning of the next layer easier and faster
- When input values are very large/small, the gradient is small and slows down gradient descent
ReLU
- Negative values output as 0
- A bit faster to compute than other activation functions, with less saturation effect for large input values
- If inputs are negative, the derivative is zero (not useful for inputs < 0!)
[Figure: plots of sigmoid, tanh and ReLU for x from -10 to 10]
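A minimal numpy sketch of the three activation functions compared above; the input values are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # output in (-1, 1), centred at 0

def relu(x):
    return np.maximum(0.0, x)         # negative inputs map to 0

x = np.linspace(-10, 10, 5)
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```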