My Notes


Artificial intelligence: Ability to learn and reason like humans

Machine learning: Ability to learn without being explicitly programmed

Deep learning: subset of machine learning where artificial neural networks adapt and learn from vast
amounts of data

Experience, Task, Performance

Experience: Data eg. Historical weather data

Task: Type of prediction/ inference eg. Weather prediction/ digit recognition

Performance: Measure of accuracy/ results of task eg. Actual weather accuracy, classification
accuracy

Types of learning

Supervised learning: Use of labelled data

- Expensive in terms of time and resources


- Identify known elements

Unsupervised learning: Use of unlabelled data

- Identify unknown elements, “Discover”, “Identify”, “Try to learn”

Semi-supervised learning: Use of labelled + unlabelled data

Reinforcement learning: Rewarding desirable behaviours and punishing undesirable ones

- Agent does an action in the environment and gains a reward


- Eg. Environment: maze, Agent: robot, Action: Moving around, Reward: Battery at the exit
- Action is optimal if it maximises expected average rewards

Supervised learning: Classification (qualitative value), Regression (quantitative value)

Unsupervised learning: Clustering, Dimensionality Reduction

Inductive reasoning

- Probable conclusions
- Missing information/data
- Uncertainty present

Deductive reasoning

- Logical conclusions
- All information needed is available
- Deterministic

Types of data

Continuous

- Measured on quantitative scale that could be any fractional number


- Eg. Mass in kg

Ordinal

- Ranked data (Cannot change order)


- Eg. Survey responses (Excellent, Good, Average, Fair, Poor)

Categorical (Nominal)

- Multiple categories but not ordered (not ranked)


- Eg. Gender, blood type, name of fruits

Missing

- Missing and the missing mechanism is unknown; use NA to denote

Censored

- Missing, but the missing mechanism is known on some level


- Below detection limit/ lost to follow-up

Categorical/Qualitative: Nominal (eg. gender, religion), Ordinal

Numerical/Quantitative: Discrete (can count), Continuous (cannot be counted)

Interval: No natural zero, can have values below 0 eg. Temperature in Celsius, pH

Ratio: Includes natural zero eg. Temperature in Kelvin, height, weight, duration

Data wrangling

- Transforming and mapping data into a format more appropriate for downstream analytics
- Ensures that one feature does not dominate over the others

Eg. Scaling, clipping, z-score

Scaling to a range (min-max normalisation)

- Use the bounds (range) of each dimension/feature to normalise the data to a fixed range
- Used because data are on different scales, so some features dominate others
- Advantage: Ensures a standard scale eg. 0 to 1 for min-max normalisation

Feature clipping

- For data with extreme outliers


- Clip extreme outliers to a maximum value eg. Clipping all temperatures above 40°C to be
40°C
Z-score standardisation

- Standardises each feature to mean 0 and standard deviation 1 (z = (x − mean) / standard deviation)


- Advantages: Handles outliers well
- Disadvantage: Does not have the same exact scale for different features

Ordinal data: The assigned numerical values have no significance beyond the ranking order; they can be normalised into standardised distance values

Categorical: Arbitrary numbers to represent attributes

- Higher values may have greater influence


- Use binary coding eg. One-hot encoding: red = [1,0,0], yellow = [0,1,0], green = [0,0,1]
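
A minimal NumPy sketch of the three encodings above (min-max scaling, z-score standardisation and one-hot encoding); the array names and values are made-up examples, not from the notes.

import numpy as np

# Toy feature matrix: rows = samples, columns = features (made-up values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

# Min-max normalisation: rescale each feature to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardisation: mean 0, standard deviation 1 per feature
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

# One-hot encoding for a categorical feature, eg. colours
colours = np.array(["red", "yellow", "green", "red"])
categories = np.array(["red", "yellow", "green"])
one_hot = (colours[:, None] == categories[None, :]).astype(int)
# red -> [1,0,0], yellow -> [0,1,0], green -> [0,0,1]
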

Data cleaning

- Detecting and correcting (or removing) corrupt or inaccurate records


- Methods: Removing examples with missing features, use a learning algorithm that can deal
with missing feature values, use data imputation technique

Imputation

- Replace missing value with an average value


- Replace missing value with a value outside normal range of values

Data integrity

Ensuring accuracy and consistency of data over its entire life cycle

- Physical integrity (error-correction codes, checksum, redundancy)


- Logical integrity (Drop down list, Product price is positive)

Data visualisation

Uneven distribution of training data leads to a biased AI system; an even distribution of training data is preferred

Eg. Many samples of 1 race and few of the others

Boxplot

- Not good for visualisation of distribution of data

Boxplot components: Interquartile range (IQR) from Q1 (25th percentile) to Q3 (75th percentile), with the median inside the box. "Minimum" = Q1 − 1.5×IQR and "maximum" = Q3 + 1.5×IQR exclude outliers; points beyond these whiskers are outliers.

Scatter plot – Good for distribution of data


Pie chart – Good for showing proportions

Log scaling – for exponentially distributed data; makes skewed data more evenly spread out

Probability and Statistics, Linear algebra

Linear independence: c1x1 + c2x2 + … + ckxk = 0 holds only when all ci = 0 (if it holds for some ci ≠ 0, the vectors are linearly dependent)

Probability mass function: Discrete function, each point on the graph = probability

Used for discrete random variable

Eg. Pr(X = 1) = 0.5, Pr(X = 3) = 0.2, Pr(X = 7) = 0.3 (probabilities sum to 1)

Probability density function: Continuous function, integrate within range to find probability

Used for continuous random variable

∫ f(x) dx over (−∞, ∞) = 1, and P(a ≤ X ≤ b) = ∫ f(x) dx from a to b (area under the curve between a and b)

Probability

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

Bayes' rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)

Causality: One thing leads to the other

- Combine specific experimental designs such as randomised studies with standard statistical
analysis techniques

Correlation: Statistical relationship between two random variables

Linear correlation coefficient, r

Correlation does not imply causation.

But, a causal relationship means they are correlated.

Operations on vectors and matrices

Dot product: x.y = xTy


A⁻¹ = adjugate(A) / det(A)

Adjoint = Adjugate. Matlab: adjoint(A)


System of linear equations

Xw=y

1. Even determined system:


Equal number of equations and unknowns, one unique solution

2. Over determined system


More equations than unknowns
X is tall
Use left inverse: w = (XTX)-1XTy

3. Under determined system


More unknowns than equations
Use right inverse: w = XT(XXT)-1y

For wTX=yT,
Use left inverse for under determined and right inverse for over determined in forms:
Left inverse (Underdetermined): wT = yT(XTX)-1XT
Right inverse (Overdetermined): wT = yTXT(XXT)-1
OR
Convert to Xw=y and solve.
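
A small NumPy sketch of solving Xw = y with the left and right inverse; the matrices are made-up examples chosen only to show the tall and wide cases.

import numpy as np

# Over-determined: more equations than unknowns (X is tall)
X_tall = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
w_left = np.linalg.inv(X_tall.T @ X_tall) @ X_tall.T @ y    # w = (X^T X)^-1 X^T y

# Under-determined: more unknowns than equations (X is wide)
X_wide = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
y2 = np.array([1.0, 2.0])
w_right = X_wide.T @ np.linalg.inv(X_wide @ X_wide.T) @ y2  # w = X^T (X X^T)^-1 y
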

Set

Notations: curly brackets eg. {1,2,3} = set of numbers 1, 2 and 3


Range of values eg. (1,2] = set of values from 1 to 2, including 2, excluding 1
* ( and ) indicates exclude, [ and ] indicates include

Set operations: Intersection and union

Functions

y=f(x), x = argument/ input, y= value of function or output

inner product function

if a is a d-dimension vector and x is a d-dimension vector,

f(x) = aTx = a1x1 + a2x2 + … +adxd

Linear function

Must satisfy homogeneity and additivity (no offset)

- Homogeneity: for any d-vector x and scalar α, f(αx) = αf(x)


- Additivity: for any d-vector x and y, f(x+y) = f(x) + f(y)

Affine functions

Linear function plus a constant

f(x) = aTx + b where b is a scalar known as bias/offset and a is a d-vector

Max and Arg Max: max f(a) gives maximum value of f(a) and arg max f(a) = a when f(a) is maximum
Linear regression

Use system of linear equations with bias/offset (add a column of 1s) to find w

Multiply prediction/ testing set with w to get prediction
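
A minimal sketch of linear regression with a bias column, assuming hypothetical training and test arrays (X_train, y_train, X_test):

import numpy as np

# Hypothetical 1-D training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1])

# Add a column of 1s for the bias/offset
P_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])

# Over-determined system: left inverse, w = (P^T P)^-1 P^T y
w = np.linalg.inv(P_train.T @ P_train) @ P_train.T @ y_train

# Prediction: multiply the test set (with bias column) by w
X_test = np.array([[5.0], [6.0]])
P_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
y_pred = P_test @ w
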

Polynomial regression

- Motivation: nonlinear decision surface


- Number of variables in polynomial regression: use variable explorer for P

Primal form: Over-determined system

w = (PTP)-1PTy (same as left inverse; P includes the bias column and polynomial terms such as x1², depending on the order)

Dual form: Under-determined system

w = PT(PPT)-1y

Ridge regression

- Shrinks the regression coeff w by imposing a penalty on their size (λ controls amount of
shrinkage, larger λ, larger shrinkage)
- Makes a singular matrix non-singular

Primal form (swopped, under-determined):

w = (PTP + λI)-1PTy where λ is a small number eg. 0.0000001 and I is identity matrix

Dual form (swopped, over-determined):

w = PT(PPT + λI)-1y where λ is a small number eg. 0.0000001 and I is identity matrix

- Solutions from ridge regression are close approximations of the exact (unregularised) solution; the coefficients differ only very slightly
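
A sketch of the primal and dual ridge solutions as given above; P is the feature matrix (with bias column), y the targets, and the default λ is just an illustrative small value.

import numpy as np

def ridge_primal(P, y, lam=1e-7):
    # w = (P^T P + lambda*I)^-1 P^T y
    d = P.shape[1]
    return np.linalg.inv(P.T @ P + lam * np.eye(d)) @ P.T @ y

def ridge_dual(P, y, lam=1e-7):
    # w = P^T (P P^T + lambda*I)^-1 y
    n = P.shape[0]
    return P.T @ np.linalg.inv(P @ P.T + lam * np.eye(n)) @ y
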

Discrete functions:

Signum discrimination

Returns 1 if positive, -1 if negative and 0 if 0

sgn(yt) = sgn([0.4444, −0.1111]) = [+1, −1]

One Hot encoding


[1 0 0]  Class 1
[1 0 0]  Class 1
[0 1 0]  Class 2
[0 0 1]  Class 3
[0 1 0]  Class 2

After performing linear or polynomial regression, find the largest value in each row to find class.

Eg. (columns = Class 1, Class 2)
Yt = [0.542977 0.179245; 0.218029 0.226415]
Row 1 → Class 1 (0.542977 is the largest), Row 2 → Class 2 (0.226415 is the largest)

Under-determined systems are either inconsistent or have infinitely many solutions (different orders of polynomial regression give different systems).
Feature selection and regularisation

Training and test set cannot overlap - Prediction of unseen data

Mean square error: average of errors squared

- Make signs positive


- Penalise larger errors by squaring

              Training set fit   Test set fit
Overfitting        Good              Bad
Underfitting       Bad               Bad
Just nice          Good              Good

Overfitting

- Model too complex for data


- Too many features but training samples small

Solutions:

- Use simpler models (eg. Lower order polynomial)


- Use regularisation

Underfitting

- Model too simple for data (Try more complex models)


- Features not informative enough (develop more informative features)

Feature selection

Fewer features might reduce overfitting

- Discard useless features and keep good features

Only do feature selection in training set and evaluate model using test set

Feature selection with test set/ full dataset leads to inflated performance (No good!!)

Pearson’s coefficient, r

- Linear relationship between two variables


- Two options:
1. Pick K features with largest absolute correlations (largest r values)
2. Pick all features with absolute correlations > C (pick all r > C)
- Reflects strength and direction of linear relationship
- Prone to be significantly affected by outliers
- Does not reflect slope of line/ not affected by slope
- Undefined when the variance of y is 0 (all points on a horizontal line, slope 0)
r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² Σ(yi − ȳ)²), where x̄ and ȳ are the means, (1/N)Σ(xi − x̄)(yi − ȳ) is the covariance, and √((1/N)Σ(xi − x̄)²), √((1/N)Σ(yi − ȳ)²) are the standard deviations of x and y
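
A sketch of Pearson's r and the "pick the K features with the largest absolute correlation" rule; X (samples × features) and y are assumed training arrays.

import numpy as np

def pearson_r(x, y):
    # r = covariance(x, y) / (std(x) * std(y))
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

def select_top_k_features(X, y, k):
    # Absolute correlation of each feature (column of X) with the target y
    r = np.array([abs(pearson_r(X[:, j], y)) for j in range(X.shape[1])])
    return np.argsort(r)[::-1][:k]   # indices of the k most correlated features
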
Regularization
Forcing learning algorithm to build less complex models
- Solve an ill-posed problem (eg. Estimating 10th order polynomial with just 5 datapoints)
- Reduce overfitting (eg. Ridge regression)
Optimisation problem: argmin Data-Loss(w) + λRegularization(w)
argmin over w: (Pw − y)ᵀ(Pw − y) + λwᵀw, where (Pw − y)ᵀ(Pw − y) is Data-Loss(w) and λwᵀw is Reg(w)

Data-Loss(w) – fitting error to training set given parameters, w


Regularization(w) penalises more complex models
Purpose of λ:
1. Make a singular matrix non-singular to allow inverse
2. Penalise overcomplex polynomial models
Bias vs Variance
Bias: how well an average prediction model will perform
Variance: Variability of prediction models across different training sets
Test error = Bias Squared + Variance + Irreducible Noise (Tradeoff between bias and variance)
Plot of test error, bias squared and variance against model complexity (or number of features): low complexity gives high bias and low variance, high complexity gives low bias and high variance, and the test error is lowest in between.

General machine learning algorithm
argmin over w: C(w) = Σ (i = 1 to m) L(f(xi, w), yi) + λR(w), summing over the m training samples
Learning model, f – belief about relationship between features xi and target yi


Loss function, L – penalty for predicting f(xi,w) when the true value is yi
Regularization, R encourages less complex models
Cost function, C – final optimisation criterion we want to minimise
Optimisation routine to find solution to cost function
Gradient descent
Optimise the cost function, C(w) iteratively
1. Find gradient, g(w) of C(w) at initialisation
2. At each iteration, wk+1 = wk – n × g(wk) where n = learning rate and g = gradient
Possible convergence criteria:
1. Maximum number of iterations
2. Percentage or absolute change in C below threshold
3. Percentage or absolute change of w below threshold
- Gradient descent can only find a local minimum (gradient = 0 at a local minimum)
Learning rate, n
n too small
- slow convergence (require many iterations before convergence)
- Might be stuck at local minimum
n too big
- Overshoot local minimum - slow convergence/ no convergence (“Ding Dong”)
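
A generic gradient descent sketch following the update rule above; the example cost function, learning rate and convergence threshold are made-up placeholders.

import numpy as np

def gradient_descent(grad, w0, lr=0.1, max_iters=1000, tol=1e-6):
    # w_{k+1} = w_k - lr * g(w_k); stop on max iterations or small change in w
    w = w0
    for _ in range(max_iters):
        w_new = w - lr * grad(w)
        if np.linalg.norm(w_new - w) < tol:   # absolute change in w below threshold
            return w_new
        w = w_new
    return w

# Example: minimise C(w) = (w - 3)^2, whose gradient is g(w) = 2(w - 3)
w_min = gradient_descent(lambda w: 2 * (w - 3.0), w0=np.array([0.0]))
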
Different learning models

General learning model: f(xi, w) = σ(piᵀw)


Sigmoid function: maps piᵀw to between 0 and 1 (binary classification)
σ(a) = 1 / (1 + e⁻ᵃ)

Rectified linear unit (ReLU):
σ(a) = max(0, a)

Exponential:
σ(a) = exp(−a)

Different loss functions


General loss function: L(f(xi, w), yi)

Square error loss: L(f(xi, w), yi) = (f(xi, w) − yi)²

Binary loss: L(f(xi, w), yi) = 0 if f(xi, w) = yi, 1 if f(xi, w) ≠ yi
Alternatively, L(f(xi, w), yi) = 0 if f(xi, w)yi > 0, 1 if f(xi, w)yi < 0

Hinge loss: L(f(xi, w), yi) = max(0, 1 − f(xi, w)yi)

Exponential loss: L(f(xi, w), yi) = exp(−f(xi, w)yi)

Plot: binary loss, hinge loss and exponential loss against the margin f(xi, w)yi.
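
All three losses above can be written as functions of the margin f(xi, w)·yi (assuming labels yi in {−1, +1}); a minimal sketch:

import numpy as np

def binary_loss(margin):
    # 0 if f(x,w)*y > 0 (correct sign), 1 otherwise
    return np.where(margin > 0, 0.0, 1.0)

def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)

def exp_loss(margin):
    return np.exp(-margin)

margins = np.linspace(-1, 3, 9)
print(binary_loss(margins), hinge_loss(margins), exp_loss(margins))
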
Decision Trees
Use conditions to classify into nodes
Advantages
- Easy to visualise and understand tree
- Can work with a mix of continuous and discrete data
- Less data cleaning required
- Makes less assumptions about relationship between features and target
Disadvantages
- Trees can be overly complex (overfitting)
- Trees can be unstable (small changes in training data can result in very different trees)

Maximum number of nodes at each depth: depth 0 (root node) has 2^0 = 1 node, depth 1 has at most 2^1 = 2 nodes, depth 2 has at most 2^2 = 4 nodes.

A node with no children is a terminal node (leaf). Eg. if node A splits into B and C, then B and C are terminal nodes or leaves; A is a parent node of B and C while B and C are children nodes of A.


Prefer less complex trees – least number of nodes (ie. Less depth to reduce overfitting)
Node impurity
Pure node – only contains data from 1 class
Node impurity measures – Gini, entropy and misclassification rate
Gini impurity (good for computational speed)
Gini impurity, Qm = 1 − Σ (i = 1 to K) pi², where pi = fraction of data samples belonging to class i in node m and there are K classes (1 to K)
For 2 classes, Qm = 1 − p1² − p2²
Overall Gini for each depth = Σ (fraction of data samples in each node × Qnode)
Entropy (good for imbalanced datasets, better performance)
Entropy, Qm = −Σ (i = 1 to K) pi log2(pi), where pi = fraction of data samples belonging to class i in node m and there are K classes (1 to K)
For 2 classes, Qm = −p1 log2(p1) − p2 log2(p2)
Overall entropy for each depth = Σ (fraction of data samples in each node × Qnode)
Misclassification rate
Misclassification rate, Qm = 1 − max(pi), where pi = fraction of data samples belonging to class i in node m and there are K classes
For 2 classes, Qm = 1 − max(p1, p2)
Overall misclassification rate for each depth = Σ (fraction of data samples in each node × Qnode)
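
A sketch of the three impurity measures for one node, given the class fractions pi in that node (the example p is hypothetical):

import numpy as np

def gini(p):
    # Q_m = 1 - sum(p_i^2)
    return 1.0 - np.sum(p**2)

def entropy(p):
    # Q_m = -sum(p_i * log2(p_i)), skipping classes with p_i = 0
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def misclassification(p):
    # Q_m = 1 - max(p_i)
    return 1.0 - np.max(p)

p = np.array([0.5, 0.5])   # eg. a node with half its samples from each of 2 classes
print(gini(p), entropy(p), misclassification(p))   # 0.5, 1.0, 0.5
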
Classification trees
- Predict discrete variables
- Minimise impurity (Gini, entropy or misclassification rate)
Regression trees
- Predict continuous variables
- Minimise mean square error
Reducing overfitting in decision trees
- Set max depth for the tree
- Set minimum number of samples for splitting leaf nodes
- Set minimum decrease in impurity
- Randomly look at a subset of features (feature selection)
Reducing instability in decision trees
Perturb training set to generate M perturbed training sets and train one tree for each perturbed
training set.
Take average predictions across the M trees.
"Bootstrapping" perturbs the training set by sampling data with replacement (data points may be repeated). Each bootstrapped dataset is the same size as the original dataset.
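
A sketch of generating M bootstrapped training sets (sampling with replacement, same size as the original); training one tree per set and averaging/voting is only indicated in comments.

import numpy as np

def bootstrap_sets(X, y, M, seed=0):
    # Each bootstrapped set has the same size as the original; rows may repeat
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sets = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # sample indices with replacement
        sets.append((X[idx], y[idx]))
    return sets

# Train one tree on each bootstrapped set, then average the predictions
# (regression) or take a majority vote (classification) across the M trees.
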
Calculating MSE for regression trees
Find the mean ȳ of all samples in a node and compute the mean squared error:

MSE = Σ(y − ȳ)² / (number of points)

Threshold for depth 1 = w1: split into two nodes (< w1 and > w1) and compute the MSE of each node.

MSE at depth 1 = Σ (fraction of data samples in each node × MSE of each node)

Training, validation and test
- No overlap between training, validation and test sets (use unseen data)
- Validation set – measure performance of model
Common partitioning of Training, Validation and Test
Type Training set Validation set Test set
2-fold CV 40% 10% 50%
4-fold CV 50% 25% 25%
5-fold CV 60% 20% 20%
10-fold CV 80% 10% 10%

LOOCV (leave-one-out cross-validation): repeat training and testing with each single sample used once as the test set and the rest as the training set; recommended for small datasets, not for large ones (which already have enough training/test data).
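
A sketch of k-fold cross-validation index splitting (no overlap between training and test folds); the data size, k and the model step are placeholders.

import numpy as np

def kfold_indices(n_samples, k, seed=0):
    # Shuffle the indices, then split them into k non-overlapping folds
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# for train_idx, test_idx in kfold_indices(100, k=5):
#     train on X[train_idx], evaluate on X[test_idx], then average the k scores
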
Evaluation metrics
Regression
- Mean square error – sum of squared errors divided by the number of datapoints

MSE = (1/n) Σ (i = 1 to n) (yi − ŷi)²

- Mean absolute error – sum of absolute errors (made positive) divided by the number of datapoints

MAE = (1/n) Σ (i = 1 to n) |yi − ŷi|

Classification
Confusion Matrix for Binary Classification

              P (predicted)   N (predicted)
P (actual)         TP              FN
N (actual)         FP              TN

False positive: Type I error
False negative: Type II error

- Recall = True positives / actual positives = TP / (TP + FN)
- Precision = True positives / predicted positives = TP / (TP + FP)
- Accuracy = Correctly classified / total = (TP + TN) / (TP + TN + FP + FN)

(True Positive Rate) TPR = TP/(TP+FN)
(False Negative Rate) FNR = FN/(TP+FN)   → TPR + FNR = 1
(True Negative Rate) TNR = TN/(FP+TN)
(False Positive Rate) FPR = FP/(FP+TN)   → TNR + FPR = 1
Cost sensitive accuracy: Different costs for TP, FP, TN and FN, higher cost for more important factor
Threshold – point/line that separates the classes (use validation data to set the threshold)
Equal error rate, EER: error rate at which FPR = FNR
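
A sketch computing the confusion-matrix counts and the metrics above from binary 0/1 label arrays (y_true and y_pred are made-up examples):

import numpy as np

def binary_metrics(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "recall":    tp / (tp + fn),              # TPR
        "precision": tp / (tp + fp),
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
        "fpr":       fp / (fp + tn),
        "tnr":       tn / (fp + tn),
    }

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print(binary_metrics(y_true, y_pred))
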
Gini coefficient
ROC curve: plot of TPR (%) against FPR (%). A = area between the ROC curve and the diagonal, B = area above the ROC curve, so A + B = 1/2 (the whole region above the diagonal).

Gini coefficient = A / (A + B)

High Gini: able to tell the 2 classes apart easily (good)

Area under the curve (AUC of the ROC) = A + ½ of the plot area

AUC ranges from 0 (worst) to 1 (best)

Unsupervised learning
- Clustering – groups a set of objects so that objects that are more similar get grouped together
- Density estimation – models pdf of unknown probability distribution
- Component analysis
- Neural network – Autoencoder, self-organising map
K-means clustering (No label)
1. Randomly pick K centroids.
2. Compute distance from each datapoint to each centroid.
3. Each point will belong to the group whose centroid it is closest to
4. Recompute centroids of each group and repeat until it stabilises.
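
A minimal K-means sketch following the four steps above, using the Forgy initialisation described below; X and K are placeholders, and empty clusters are not handled.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Forgy initialisation: pick K observations as the initial centroids
    centroids = X[rng.choice(X.shape[0], K, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid; assign each point to the closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute the centroid of each group, repeat until stable
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
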
Initialisation
- Random partition
Randomly assign a cluster to each observation
- Forgy method
Randomly choose k observations and use these as initial means
Can have different outcomes for different initialisations
Hard vs soft clustering
Hard clustering – Each data point belongs to one cluster (Naïve K means)
Soft clustering (fuzzy clustering/ fuzzy C means) – Each data point can belong to more than one
cluster.
Parameters affecting clustering results
1. Algorithm type (eg. K-means)
2. Number of clusters
3. Starting Centroids/points
4. Number of iterations
5. Metrics used (eg. Distance between points)
Neural networks
- Nested function (Function within a function)
- Weight on each line
Multilayer perceptrons (MLPs)

- More dense connections
- Each neuron is connected to all neurons in the next layer
- Layers: input layer, hidden layers (eg. hidden layer 1, hidden layer 2), output layer
Convolution neural networks (CNNs)

- Each output value does not need to be mapped directly to the input image
- Less dense connections
- Good for image and text processing
- Take advantage of hierarchical patterns in data and assemble patterns of increasing complexity using smaller and simpler patterns

Eg. each output value is the sum of elementwise products between the filter and a patch of the input image (in the worked example, Output[0][0] = 0 + 8 + 1 + 4 + 1 + 0 + 1 + 0 + 1 = 16)

Sigmoid activation function: g(x) = 1 / (1 + e^(−βx))
- Used when a probability needs to be predicted, as the output range is between 0 and 1; preferably used in the output layer
- May cause the neural network to get stuck during training
- When input values are very large/small, the gradient is small and slows down gradient descent

Tanh: g(x) = tanh(x)
- Output range between −1 and 1, mean closer to 0
- Centered value of 0 makes learning of the next layer easier and faster
- When input values are very large/small, the gradient is small and slows down gradient descent

ReLU: g(x) = max(0, x)
- Negative values output as 0
- A bit faster to compute than other activation functions, with less saturation effect for large input values
- If inputs are negative, the derivative will be zero (not useful for x < 0!)

Key ideas of deep neural networks


1. End to end learning: no distinction between feature extractor and classifier
- Combine feature extraction and classification to optimise for downstream task
2. Deep architectures: Cascade of simpler non-linear modules
Complex functions are a composition of simpler functions eg. log(cos(exp(sin3(x))))
3. Learn features from data
- Split data into many layers and get low level parts, mid level parts and high level parts
- Each layer assembled to build next layer (hierarchical structure)
4. Use differentiable functions that produce features efficiently (run gradient descent)
Forward propagation
Computing the output given its input
Assumes we have w, weights (can be randomly assigned initially)
Backward propagation
*idea 4 of neural networks (See above)
Use backward propagation to update previous w (w3 then w2 then w1) using gradient descent
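
A tiny sketch of one forward and one backward pass for a single hidden layer with sigmoid activations and square-error loss; the layer sizes, inputs and learning rate are made-up.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))          # input (3 features)
y = np.array([[1.0]])                # target
w1 = rng.normal(size=(4, 3))         # hidden layer weights (4 neurons)
w2 = rng.normal(size=(1, 4))         # output layer weights

# Forward propagation: compute the output given the input and current weights
h = sigmoid(w1 @ x)                  # hidden activations
y_hat = sigmoid(w2 @ h)              # network output

# Backward propagation: chain rule, update the later weights first (w2 then w1)
lr = 0.1
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # gradient at the output pre-activation
grad_w2 = delta2 @ h.T
delta1 = (w2.T @ delta2) * h * (1 - h)       # backpropagated to the hidden layer
grad_w1 = delta1 @ x.T
w2 -= lr * grad_w2
w1 -= lr * grad_w1
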
Convolutional neural networks

- Sliding window operations (Box filter)


Take average of each smaller box
- Pooling (Max pooling)
Shrink the size of feature map by taking the maximum of each patch
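
A sketch of the two sliding-window operations above, an averaging box filter and 2×2 max pooling, written with plain NumPy loops; the input image is a made-up array.

import numpy as np

def box_filter(img, k=3):
    # Slide a k x k window over the image and take the average of each patch
    h, w = img.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = img[i:i + k, j:j + k].mean()
    return out

def max_pool(feature_map, p=2):
    # Shrink the feature map by taking the maximum of each p x p patch
    h, w = feature_map.shape
    out = np.zeros((h // p, w // p))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * p:(i + 1) * p, j * p:(j + 1) * p].max()
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(box_filter(img)))
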
