My Notes


Artificial intelligence: Ability to learn and reason like humans

Machine learning: Ability to learn without being explicitly programmed

Deep learning: subset of machine learning where artificial neural networks adapt and learn from vast
amounts of data

Experience, Task, Performance

Experience: Data eg. Historical weather data

Task: Type of prediction/ inference eg. Weather prediction/ digit recognition

Performance: Measure of accuracy/ results of task eg. Actual weather accuracy, classification
accuracy

Types of learning

Supervised learning: Use of labelled data

- Expensive in terms of time and resources


- Identify known elements

Unsupervised learning: Use of unlabelled data

- Identify unknown elements, “Discover”, “Identify”, “Try to learn”

Semi-supervised learning: Use of labelled + unlabelled data

Reinforcement learning: Rewarding desirable behaviours and punishing undesirable ones

- Agent does an action in the environment and gains a reward


- Eg. Environment: maze, Agent: robot, Action: Moving around, Reward: Battery at the exit
- Action is optimal if it maximises expected average rewards

Supervised learning: Classification (qualitative value), Regression (quantitative value)

Unsupervised learning: Clustering, Dimensionality Reduction

Inductive reasoning

- Probable conclusions
- Missing information/data
- Uncertainty present

Deductive reasoning

- Logical conclusions
- All information needed is available
- Deterministic

Types of data

Continuous

- Measured on quantitative scale that could be any fractional number


- Eg. Mass in kg

Ordinal

- Ranked data (Cannot change order)


- Eg. Survey responses (Excellent, Good, Average, Fair, Poor)

Categorical (Nominal)

- Multiple categories but not ordered (not ranked)


- Eg. Gender, blood type, name of fruits

Missing

- Missing and the missing mechanism is unknown; use NA to denote

Censored

- Missing, but the missing mechanism is known on some level


- Below detection limit/ lost to follow-up

Categorical/Qualitative: Nominal (eg. gender, religion), Ordinal

Numerical/Quantitative: Discrete (can count), Continuous (cannot be counted)

Interval: No natural zero, can have values below 0 eg. Temperature in Celsius, pH

Ratio: Includes natural zero eg. Temperature in Kelvin, height, weight, duration

Data wrangling

- Transforming and mapping data into a format more appropriate for downstream analytics
- Ensures that one feature does not dominate over the others

Eg. Scaling, clipping, z-score

Scaling to a range (min-max normalisation)

- Use the bounds (range) of each dimension/feature to normalise the data to a fixed range
- Used because data are on different scales, so some features dominate others
- Advantage: Ensures a standard scale eg. 0 to 1 for min-max normalisation

Feature clipping

- For data with extreme outliers


- Clip extreme outliers to a maximum value eg. Clipping all temperatures above 40°C to be
40°C
Z-score standardisation

- Standardises each feature to mean 0 and standard deviation 1 (z = (x − mean) / standard deviation)


- Advantages: Handles outliers well
- Disadvantage: Does not have the same exact scale for different features

Ordinal data: The assigned numerical values have no significance beyond the ranking order; they can be normalised into standardised distance values

Categorical: Arbitrary numbers to represent attributes

- Higher values may have greater influence


- Use binary coding eg. One-hot encoding: red = [1,0,0], yellow = [0,1,0], green = [0,0,1]
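
A minimal NumPy sketch of the three encodings above (min-max scaling, z-score standardisation and one-hot encoding); the array names and values are made-up examples, not from the notes.

import numpy as np

# Toy feature matrix: rows = samples, columns = features (made-up values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

# Min-max normalisation: rescale each feature to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardisation: mean 0, standard deviation 1 per feature
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

# One-hot encoding for a categorical feature, eg. colours
colours = np.array(["red", "yellow", "green", "red"])
categories = np.array(["red", "yellow", "green"])
one_hot = (colours[:, None] == categories[None, :]).astype(int)
# red -> [1,0,0], yellow -> [0,1,0], green -> [0,0,1]
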

Data cleaning

- Detecting and correcting (or removing) corrupt or inaccurate records


- Methods: Removing examples with missing features, use a learning algorithm that can deal
with missing feature values, use data imputation technique

Imputation

- Replace missing value with an average value


- Replace missing value with a value outside normal range of values

Data integrity

Ensuring accuracy and consistency of data over its entire life cycle

- Physical integrity (error-correction codes, checksum, redundancy)


- Logical integrity (Drop down list, Product price is positive)

Data visualisation

Uneven distribution of training data leads to a biased AI system; an even distribution of training data is preferred

Eg. Many samples of 1 race and few of the others

Boxplot

- Not good for visualisation of distribution of data

Boxplot components: Interquartile range (IQR) from Q1 (25th percentile) to Q3 (75th percentile), with the median inside the box. "Minimum" = Q1 − 1.5×IQR and "maximum" = Q3 + 1.5×IQR exclude outliers; points beyond these whiskers are outliers.

Scatter plot – Good for distribution of data


Pie chart – Good for showing proportions

Log scaling – for exponentially distributed data; makes skewed data more evenly spread out

Probability and Statistics, Linear algebra

Linear independence: c1x1 + c2x2 + … + ckxk = 0 holds only when all ci = 0 (if it holds for some ci ≠ 0, the vectors are linearly dependent)

Probability mass function: Discrete function, each point on the graph = probability

Used for discrete random variable

Eg. Pr(X = 1) = 0.5, Pr(X = 3) = 0.2, Pr(X = 7) = 0.3 (probabilities sum to 1)

Probability density function: Continuous function, integrate within range to find probability

Used for continuous random variable

∫ f(x) dx over (−∞, ∞) = 1, and P(a ≤ X ≤ b) = ∫ f(x) dx from a to b (area under the curve between a and b)

Probability

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

Bayes' rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)

Causality: One thing leads to the other

- Combine specific experimental designs such as randomised studies with standard statistical
analysis techniques

Correlation: Statistical relationship between two random variables

Linear correlation coefficient, r

Correlation does not imply causation.

But, a causal relationship means they are correlated.

Operations on vectors and matrices

Dot product: x.y = xTy


A⁻¹ = adjugate(A) / det(A)

Adjoint = Adjugate. Matlab: adjoint(A)


System of linear equations

Xw=y

1. Even determined system:


Equal number of equations and unknowns, one unique solution

2. Over determined system


More equations than unknowns
X is tall
Use left inverse: w = (XTX)-1XTy

3. Under determined system


More unknowns than equations
Use right inverse: w = XT(XXT)-1y

For wTX=yT,
Use left inverse for under determined and right inverse for over determined in forms:
Left inverse (Underdetermined): wT = yT(XTX)-1XT
Right inverse (Overdetermined): wT = yTXT(XXT)-1
OR
Convert to Xw=y and solve.
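
A small NumPy sketch of solving Xw = y with the left and right inverse; the matrices are made-up examples chosen only to show the tall and wide cases.

import numpy as np

# Over-determined: more equations than unknowns (X is tall)
X_tall = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
w_left = np.linalg.inv(X_tall.T @ X_tall) @ X_tall.T @ y    # w = (X^T X)^-1 X^T y

# Under-determined: more unknowns than equations (X is wide)
X_wide = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
y2 = np.array([1.0, 2.0])
w_right = X_wide.T @ np.linalg.inv(X_wide @ X_wide.T) @ y2  # w = X^T (X X^T)^-1 y
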

Set

Notations: curly brackets eg. {1,2,3} = set of numbers 1, 2 and 3


Range of values eg. (1,2] = set of values from 1 to 2, including 2, excluding 1
* ( and ) indicates exclude, [ and ] indicates include

Set operations: Intersection and union

Functions

y=f(x), x = argument/ input, y= value of function or output

inner product function

if a is a d-dimension vector and x is a d-dimension vector,

f(x) = aTx = a1x1 + a2x2 + … +adxd

Linear function

Must satisfy homogeneity and additivity (no offset)

- Homogeneity: for any d-vector x and scalar α, f(αx) = αf(x)


- Additivity: for any d-vector x and y, f(x+y) = f(x) + f(y)

Affine functions

Linear function plus a constant

f(x) = aTx + b where b is a scalar known as bias/offset and a is a d-vector

Max and Arg Max: max f(a) gives maximum value of f(a) and arg max f(a) = a when f(a) is maximum
Linear regression

Use system of linear equations with bias/offset (add a column of 1s) to find w

Multiply prediction/ testing set with w to get prediction
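
A minimal sketch of linear regression with a bias column, assuming hypothetical training and test arrays (X_train, y_train, X_test):

import numpy as np

# Hypothetical 1-D training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1])

# Add a column of 1s for the bias/offset
P_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])

# Over-determined system: left inverse, w = (P^T P)^-1 P^T y
w = np.linalg.inv(P_train.T @ P_train) @ P_train.T @ y_train

# Prediction: multiply the test set (with bias column) by w
X_test = np.array([[5.0], [6.0]])
P_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
y_pred = P_test @ w
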

Polynomial regression

- Motivation: nonlinear decision surface


- Number of variables in polynomial regression: use variable explorer for P

Primal form: Over-determined system

w = (PTP)-1PTy (same as left inverse; P includes the bias column and polynomial terms such as x1², depending on the order)

Dual form: Under-determined system

w = PT(PPT)-1y

Ridge regression

- Shrinks the regression coeff w by imposing a penalty on their size (λ controls amount of
shrinkage, larger λ, larger shrinkage)
- Makes a singular matrix non-singular

Primal form (swopped, under-determined):

w = (PTP + λI)-1PTy where λ is a small number eg. 0.0000001 and I is identity matrix

Dual form (swopped, over-determined):

w = PT(PPT + λI)-1y where λ is a small number eg. 0.0000001 and I is identity matrix

- Solutions from ridge regression are close approximations of the exact (unregularised) solution; the coefficients differ only very slightly
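
A sketch of the primal and dual ridge solutions as given above; P is the feature matrix (with bias column), y the targets, and the default λ is just an illustrative small value.

import numpy as np

def ridge_primal(P, y, lam=1e-7):
    # w = (P^T P + lambda*I)^-1 P^T y
    d = P.shape[1]
    return np.linalg.inv(P.T @ P + lam * np.eye(d)) @ P.T @ y

def ridge_dual(P, y, lam=1e-7):
    # w = P^T (P P^T + lambda*I)^-1 y
    n = P.shape[0]
    return P.T @ np.linalg.inv(P @ P.T + lam * np.eye(n)) @ y
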

Discrete functions:

Signum discrimination

Returns 1 if positive, -1 if negative and 0 if 0

sgn(yt) = sgn([0.4444, −0.1111]) = [+1, −1]

One Hot encoding


[1 0 0]  Class 1
[1 0 0]  Class 1
[0 1 0]  Class 2
[0 0 1]  Class 3
[0 1 0]  Class 2

After performing linear or polynomial regression, find the largest value in each row to find class.

Eg. (columns = Class 1, Class 2)
Yt = [0.542977 0.179245; 0.218029 0.226415]
Row 1 → Class 1 (0.542977 is the largest), Row 2 → Class 2 (0.226415 is the largest)

Under-determined systems are either inconsistent or have infinitely many solutions (different orders of polynomial regression give different systems).
Feature selection and regularisation

Training and test set cannot overlap - Prediction of unseen data

Mean square error: average of errors squared

- Make signs positive


- Penalise larger errors by squaring

              Training set fit   Test set fit
Overfitting        Good              Bad
Underfitting       Bad               Bad
Just nice          Good              Good

Overfitting

- Model too complex for data


- Too many features but training samples small

Solutions:

- Use simpler models (eg. Lower order polynomial)


- Use regularisation

Underfitting

- Model too simple for data (Try more complex models)


- Features not informative enough (develop more informative features)

Feature selection

Fewer features might reduce overfitting

- Discard useless features and keep good features

Only do feature selection in training set and evaluate model using test set

Feature selection with test set/ full dataset leads to inflated performance (No good!!)

Pearson’s coefficient, r

- Linear relationship between two variables


- Two options:
1. Pick K features with largest absolute correlations (largest r values)
2. Pick all features with absolute correlations > C (pick all r > C)
- Reflects strength and direction of linear relationship
- Prone to be significantly affected by outliers
- Does not reflect slope of line/ not affected by slope
- Undefined when the variance of y is 0 (all points on a horizontal line, slope 0)
r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² Σ(yi − ȳ)²), where x̄ and ȳ are the means, (1/N)Σ(xi − x̄)(yi − ȳ) is the covariance, and √((1/N)Σ(xi − x̄)²), √((1/N)Σ(yi − ȳ)²) are the standard deviations of x and y
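
A sketch of Pearson's r and the "pick the K features with the largest absolute correlation" rule; X (samples × features) and y are assumed training arrays.

import numpy as np

def pearson_r(x, y):
    # r = covariance(x, y) / (std(x) * std(y))
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

def select_top_k_features(X, y, k):
    # Absolute correlation of each feature (column of X) with the target y
    r = np.array([abs(pearson_r(X[:, j], y)) for j in range(X.shape[1])])
    return np.argsort(r)[::-1][:k]   # indices of the k most correlated features
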
Regularization
Forcing learning algorithm to build less complex models
- Solve an ill-posed problem (eg. Estimating 10th order polynomial with just 5 datapoints)
- Reduce overfitting (eg. Ridge regression)
Optimisation problem: argmin Data-Loss(w) + λRegularization(w)
argmin over w: (Pw − y)ᵀ(Pw − y) + λwᵀw, where (Pw − y)ᵀ(Pw − y) is Data-Loss(w) and λwᵀw is Reg(w)

Data-Loss(w) – fitting error to training set given parameters, w


Regularization(w) penalises more complex models
Purpose of λ:
1. Make a singular matrix non-singular to allow inverse
2. Penalise overcomplex polynomial models
Bias vs Variance
Bias: how well an average prediction model will perform
Variance: Variability of prediction models across different training sets
Test error = Bias Squared + Variance + Irreducible Noise (Tradeoff between bias and variance)
Plot of test error, bias squared and variance against model complexity (or number of features): low complexity gives high bias and low variance, high complexity gives low bias and high variance, and the test error is lowest in between.

General machine learning algorithm
argmin over w: C(w) = Σ (i = 1 to m) L(f(xi, w), yi) + λR(w), summing over the m training samples
Learning model, f – belief about relationship between features xi and target yi


Loss function, L – penalty for predicting f(xi,w) when the true value is yi
Regularization, R encourages less complex models
Cost function, C – final optimisation criterion we want to minimise
Optimisation routine to find solution to cost function
Gradient descent
Optimise the cost function, C(w) iteratively
1. Find gradient, g(w) of C(w) at initialisation
2. At each iteration, wk+1 = wk – n × g(wk) where n = learning rate and g = gradient
Possible convergence criteria:
1. Maximum number of iterations
2. Percentage or absolute change in C below threshold
3. Percentage or absolute change of w below threshold
- Gradient descent can only find a local minimum (gradient = 0 at a local minimum)
Learning rate, n
n too small
- slow convergence (require many iterations before convergence)
- Might be stuck at local minimum
n too big
- Overshoot local minimum - slow convergence/ no convergence (“Ding Dong”)
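
A generic gradient descent sketch following the update rule above; the example cost function, learning rate and convergence threshold are made-up placeholders.

import numpy as np

def gradient_descent(grad, w0, lr=0.1, max_iters=1000, tol=1e-6):
    # w_{k+1} = w_k - lr * g(w_k); stop on max iterations or small change in w
    w = w0
    for _ in range(max_iters):
        w_new = w - lr * grad(w)
        if np.linalg.norm(w_new - w) < tol:   # absolute change in w below threshold
            return w_new
        w = w_new
    return w

# Example: minimise C(w) = (w - 3)^2, whose gradient is g(w) = 2(w - 3)
w_min = gradient_descent(lambda w: 2 * (w - 3.0), w0=np.array([0.0]))
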
Different learning models

General learning model: f(xi, w) = σ(piᵀw)


Sigmoid function: maps piᵀw to between 0 and 1 (binary classification)
σ(a) = 1 / (1 + e⁻ᵃ)

Rectified linear unit (ReLU):
σ(a) = max(0, a)

Exponential:
σ(a) = exp(−a)

Different loss functions


General loss function: L(f(xi, w), yi)

Square error loss: L(f(xi, w), yi) = (f(xi, w) − yi)²

Binary loss: L(f(xi, w), yi) = 0 if f(xi, w) = yi, 1 if f(xi, w) ≠ yi
Alternatively, L(f(xi, w), yi) = 0 if f(xi, w)yi > 0, 1 if f(xi, w)yi < 0

Hinge loss: L(f(xi, w), yi) = max(0, 1 − f(xi, w)yi)

Exponential loss: L(f(xi, w), yi) = exp(−f(xi, w)yi)

Plot: binary loss, hinge loss and exponential loss against the margin f(xi, w)yi.
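
All three losses above can be written as functions of the margin f(xi, w)·yi (assuming labels yi in {−1, +1}); a minimal sketch:

import numpy as np

def binary_loss(margin):
    # 0 if f(x,w)*y > 0 (correct sign), 1 otherwise
    return np.where(margin > 0, 0.0, 1.0)

def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)

def exp_loss(margin):
    return np.exp(-margin)

margins = np.linspace(-1, 3, 9)
print(binary_loss(margins), hinge_loss(margins), exp_loss(margins))
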
Decision Trees
Use conditions to classify into nodes
Advantages
- Easy to visualise and understand tree
- Can work with a mix of continuous and discrete data
- Less data cleaning required
- Makes less assumptions about relationship between features and target
Disadvantages
- Trees can be overly complex (overfitting)
- Trees can be unstable (small changes in training data can result in very different trees)

Maximum number of nodes at each depth: depth 0 (root node) has 2^0 = 1 node, depth 1 has at most 2^1 = 2 nodes, depth 2 has at most 2^2 = 4 nodes.

A node with no children is a terminal node (leaf). Eg. if node A splits into B and C, then B and C are terminal nodes or leaves; A is a parent node of B and C while B and C are children nodes of A.


Prefer less complex trees – least number of nodes (ie. Less depth to reduce overfitting)
Node impurity
Pure node – only contains data from 1 class
Node impurity measures – Gini, entropy and misclassification rate
Gini impurity (good for computational speed)
Gini impurity, Qm = 1 − Σ (i = 1 to K) pi², where pi = fraction of data samples belonging to class i in node m and there are K classes (1 to K)
For 2 classes, Qm = 1 − p1² − p2²
Overall Gini for each depth = Σ (fraction of data samples in each node × Qnode)
Entropy (good for imbalanced datasets, better performance)
Entropy, Qm = −Σ (i = 1 to K) pi log2(pi), where pi = fraction of data samples belonging to class i in node m and there are K classes (1 to K)
For 2 classes, Qm = −p1 log2(p1) − p2 log2(p2)
Overall entropy for each depth = Σ (fraction of data samples in each node × Qnode)
Misclassification rate
Misclassification rate, Qm = 1 − max(pi), where pi = fraction of data samples belonging to class i in node m and there are K classes
For 2 classes, Qm = 1 − max(p1, p2)
Overall misclassification rate for each depth = Σ (fraction of data samples in each node × Qnode)
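
A sketch of the three impurity measures for one node, given the class fractions pi in that node (the example p is hypothetical):

import numpy as np

def gini(p):
    # Q_m = 1 - sum(p_i^2)
    return 1.0 - np.sum(p**2)

def entropy(p):
    # Q_m = -sum(p_i * log2(p_i)), skipping classes with p_i = 0
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def misclassification(p):
    # Q_m = 1 - max(p_i)
    return 1.0 - np.max(p)

p = np.array([0.5, 0.5])   # eg. a node with half its samples from each of 2 classes
print(gini(p), entropy(p), misclassification(p))   # 0.5, 1.0, 0.5
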
Classification trees
- Predict discrete variables
- Minimise impurity (Gini, entropy or misclassification rate)
Regression trees
- Predict continuous variables
- Minimise mean square error
Reducing overfitting in decision trees
- Set max depth for the tree
- Set minimum number of samples for splitting leaf nodes
- Set minimum decrease in impurity
- Randomly look at a subset of features (feature selection)
Reducing instability in decision trees
Perturb training set to generate M perturbed training sets and train one tree for each perturbed
training set.
Take average predictions across the M trees.
"Bootstrapping" perturbs the training set by sampling data with replacement (data points may be repeated). Each bootstrapped dataset is the same size as the original dataset.
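
A sketch of generating M bootstrapped training sets (sampling with replacement, same size as the original); training one tree per set and averaging/voting is only indicated in comments.

import numpy as np

def bootstrap_sets(X, y, M, seed=0):
    # Each bootstrapped set has the same size as the original; rows may repeat
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sets = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # sample indices with replacement
        sets.append((X[idx], y[idx]))
    return sets

# Train one tree on each bootstrapped set, then average the predictions
# (regression) or take a majority vote (classification) across the M trees.
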
Calculating MSE for regression trees
Find the mean ȳ of all samples in a node and compute the mean squared error:

MSE = Σ(y − ȳ)² / (number of points)

Threshold for depth 1 = w1: split into two nodes (< w1 and > w1) and compute the MSE of each node.

MSE at depth 1 = Σ (fraction of data samples in each node × MSE of each node)

Training, validation and test
- No overlap between training, validation and test sets (use unseen data)
- Validation set – measure performance of model
Common partitioning of Training, Validation and Test
Type Training set Validation set Test set
2-fold CV 40% 10% 50%
4-fold CV 50% 25% 25%
5-fold CV 60% 20% 20%
10-fold CV 80% 10% 10%

LOOCV (leave-one-out cross-validation): repeat training and testing with each single sample used once as the test set and the rest as the training set; recommended for small datasets, not for large ones (which already have enough training/test data).
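
A sketch of k-fold cross-validation index splitting (no overlap between training and test folds); the data size, k and the model step are placeholders.

import numpy as np

def kfold_indices(n_samples, k, seed=0):
    # Shuffle the indices, then split them into k non-overlapping folds
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# for train_idx, test_idx in kfold_indices(100, k=5):
#     train on X[train_idx], evaluate on X[test_idx], then average the k scores
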
Evaluation metrics
Regression
- Mean square error – sum of squared errors divided by the number of datapoints

MSE = (1/n) Σ (i = 1 to n) (yi − ŷi)²

- Mean absolute error – sum of absolute errors (made positive) divided by the number of datapoints

MAE = (1/n) Σ (i = 1 to n) |yi − ŷi|

Classification
Confusion Matrix for Binary Classification

              P (predicted)   N (predicted)
P (actual)         TP              FN
N (actual)         FP              TN

False positive: Type I error
False negative: Type II error

- Recall = True positives / actual positives = TP / (TP + FN)
- Precision = True positives / predicted positives = TP / (TP + FP)
- Accuracy = Correctly classified / total = (TP + TN) / (TP + TN + FP + FN)

(True Positive Rate) TPR = TP/(TP+FN)
(False Negative Rate) FNR = FN/(TP+FN)   → TPR + FNR = 1
(True Negative Rate) TNR = TN/(FP+TN)
(False Positive Rate) FPR = FP/(FP+TN)   → TNR + FPR = 1
Cost sensitive accuracy: Different costs for TP, FP, TN and FN, higher cost for more important factor
Threshold – point/line that separates the classes (use validation data to set the threshold)
Equal error rate, EER: error rate at which FPR = FNR
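
A sketch computing the confusion-matrix counts and the metrics above from binary 0/1 label arrays (y_true and y_pred are made-up examples):

import numpy as np

def binary_metrics(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "recall":    tp / (tp + fn),              # TPR
        "precision": tp / (tp + fp),
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
        "fpr":       fp / (fp + tn),
        "tnr":       tn / (fp + tn),
    }

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print(binary_metrics(y_true, y_pred))
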
Gini coefficient
ROC curve: plot of TPR (%) against FPR (%). A = area between the ROC curve and the diagonal, B = area above the ROC curve, so A + B = 1/2 (the whole region above the diagonal).

Gini coefficient = A / (A + B)

High Gini: able to tell the 2 classes apart easily (good)

Area under the curve (AUC of the ROC) = A + ½ of the plot area

AUC ranges from 0 (worst) to 1 (best)

Unsupervised learning
- Clustering – groups a set of objects so that objects that are more similar get grouped together
- Density estimation – models pdf of unknown probability distribution
- Component analysis
- Neural network – Autoencoder, self-organising map
K-means clustering (No label)
1. Randomly pick K centroids.
2. Compute distance from each datapoint to each centroid.
3. Each point will belong to the group whose centroid it is closest to
4. Recompute centroids of each group and repeat until it stabilises.
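
A minimal K-means sketch following the four steps above, using the Forgy initialisation described below; X and K are placeholders, and empty clusters are not handled.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Forgy initialisation: pick K observations as the initial centroids
    centroids = X[rng.choice(X.shape[0], K, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid; assign each point to the closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute the centroid of each group, repeat until stable
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
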
Initialisation
- Random partition
Randomly assign a cluster to each observation
- Forgy method
Randomly choose k observations and use these as initial means
Can have different outcomes for different initialisations
Hard vs soft clustering
Hard clustering – Each data point belongs to one cluster (Naïve K means)
Soft clustering (fuzzy clustering/ fuzzy C means) – Each data point can belong to more than one
cluster.
Parameters affecting clustering results
1. Algorithm type (eg. K-means)
2. Number of clusters
3. Starting Centroids/points
4. Number of iterations
5. Metrics used (eg. Distance between points)
Neural networks
- Nested function (Function within a function)
- Weight on each line
Multilayer perceptrons (MLPs)

- More dense connections
- Each neuron is connected to all neurons in the next layer
- Layers: input layer, hidden layers (eg. hidden layer 1, hidden layer 2), output layer
Convolution neural networks (CNNs)

- Each output value does not need to be mapped directly to the input image
- Less dense connections
- Good for image and text processing
- Take advantage of hierarchical patterns in data and assemble patterns of increasing complexity using smaller and simpler patterns

Eg. each output value is the sum of elementwise products between the filter and a patch of the input image (in the worked example, Output[0][0] = 0 + 8 + 1 + 4 + 1 + 0 + 1 + 0 + 1 = 16)

Sigmoid activation function: g(x) = 1 / (1 + e^(−βx))
- Used when a probability needs to be predicted, as the output range is between 0 and 1; preferably used in the output layer
- May cause the neural network to get stuck during training
- When input values are very large/small, the gradient is small and slows down gradient descent

Tanh: g(x) = tanh(x)
- Output range between −1 and 1, mean closer to 0
- Centered value of 0 makes learning of the next layer easier and faster
- When input values are very large/small, the gradient is small and slows down gradient descent

ReLU: g(x) = max(0, x)
- Negative values output as 0
- A bit faster to compute than other activation functions, with less saturation effect for large input values
- If inputs are negative, the derivative will be zero (not useful for x < 0!)

Key ideas of deep neural networks


1. End to end learning: no distinction between feature extractor and classifier
- Combine feature extraction and classification to optimise for downstream task
2. Deep architectures: Cascade of simpler non-linear modules
Complex functions are a composition of simpler functions eg. log(cos(exp(sin3(x))))
3. Learn features from data
- Split data into many layers and get low level parts, mid level parts and high level parts
- Each layer assembled to build next layer (hierarchical structure)
4. Use differentiable functions that produce features efficiently (run gradient descent)
Forward propagation
Computing the output given its input
Assumes we have w, weights (can be randomly assigned initially)
Backward propagation
*idea 4 of neural networks (See above)
Use backward propagation to update previous w (w3 then w2 then w1) using gradient descent
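
A tiny sketch of one forward and one backward pass for a single hidden layer with sigmoid activations and square-error loss; the layer sizes, inputs and learning rate are made-up.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))          # input (3 features)
y = np.array([[1.0]])                # target
w1 = rng.normal(size=(4, 3))         # hidden layer weights (4 neurons)
w2 = rng.normal(size=(1, 4))         # output layer weights

# Forward propagation: compute the output given the input and current weights
h = sigmoid(w1 @ x)                  # hidden activations
y_hat = sigmoid(w2 @ h)              # network output

# Backward propagation: chain rule, update the later weights first (w2 then w1)
lr = 0.1
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # gradient at the output pre-activation
grad_w2 = delta2 @ h.T
delta1 = (w2.T @ delta2) * h * (1 - h)       # backpropagated to the hidden layer
grad_w1 = delta1 @ x.T
w2 -= lr * grad_w2
w1 -= lr * grad_w1
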
Convolutional neural networks

- Sliding window operations (Box filter)


Take average of each smaller box
- Pooling (Max pooling)
Shrink the size of feature map by taking the maximum of each patch
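
A sketch of the two sliding-window operations above, an averaging box filter and 2×2 max pooling, written with plain NumPy loops; the input image is a made-up array.

import numpy as np

def box_filter(img, k=3):
    # Slide a k x k window over the image and take the average of each patch
    h, w = img.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = img[i:i + k, j:j + k].mean()
    return out

def max_pool(feature_map, p=2):
    # Shrink the feature map by taking the maximum of each p x p patch
    h, w = feature_map.shape
    out = np.zeros((h // p, w // p))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * p:(i + 1) * p, j * p:(j + 1) * p].max()
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(box_filter(img)))
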
