
Chapter-3

Unsupervised Learning

By: Yeshambel A.
Unsupervised learning

• Unsupervised learning aims to find the underlying structure or distribution of the data.
• We want to explore the data to find intrinsic structures in it.
• Learn from inputs x1, . . . , xn ∈ Rd without any labels y1, . . . , yn.
• No predefined classes - Clustering
What is Cluster Analysis?

• Cluster: a collection of data objects

• Similar to one another within the same cluster

• Dissimilar to the objects in other clusters

• Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
What is Cluster Analysis?

 Finding groups of objects such that the objects in a group will be similar
(or related) to one another and different from (or unrelated to) the objects
in other groups
[Figure: two groups of points; intra-cluster distances are minimized while inter-cluster distances are maximized.]
Clustering-Applications
Example: Clustering
⚫ The example below demonstrates the clustering of padlocks of the same kind. There are a total of 10 padlocks, which vary in color, size, shape, etc.

⚫ How many possible clusters of padlocks can be identified?

⚫ There are three different kinds of padlocks, which can be grouped into three different clusters.
⚫ The padlocks of the same kind are clustered into a group as shown below.

Examples of Clustering Applications
⚫ Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

⚫ Land use: Identification of areas of similar land use in an earth observation database
⚫ Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

⚫ City-planning: Identifying groups of houses according to their house type, value, and geographical location
⚫ Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
What is Good Clustering

• A good clustering method will produce high quality clusters with
• high intra-class similarity
• low inter-class similarity
• The quality of a clustering result depends on both the similarity measure
used by the method and its implementation.
Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects.
• Distance measure greatly depends on the type of the attribute
value (binary, nominal, ordinal, etc) that each object will
assume.
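As a quick illustration (not part of the original slides), the sketch below computes the Euclidean and Manhattan distances between two numeric data points; these are the distance measures most often used for interval-scaled attributes. The sample points are taken from the k-means example later in this chapter.

```python
# Minimal sketch: two common distance measures for numeric attributes.
import numpy as np

x = np.array([2.0, 10.0])
y = np.array([8.0, 4.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt((2-8)^2 + (10-4)^2) ≈ 8.49
manhattan = np.sum(np.abs(x - y))           # |2-8| + |10-4| = 12

print(euclidean, manhattan)
```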
Procedures of Cluster Analysis

• A typical cluster analysis consists of four steps:
Overview of clustering

 Feature Selection
 identifying the most effective subset of the original features to
use in clustering
 Feature Extraction
 transformations of the input features to produce new salient features.
 Inter-pattern Similarity
 measured by a distance function defined on pairs of patterns.
 Grouping
 methods to group similar patterns in the same cluster
Outliers

• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality.

[Figure: a scatter plot with one dense cluster and a few isolated outlier points.]

• In some applications we are interested in discovering outliers, not clusters (outlier analysis).
Evaluation

⚫ Intra-cluster cohesion (compactness):
⚫ Cohesion measures how near the data points in a cluster are to the cluster centroid.
⚫ Sum of squared error (SSE) is a commonly used measure.

⚫ Inter-cluster separation (isolation):
⚫ Separation means that different cluster centroids should be far away from one another.


Measuring clustering validity

Internal Index:
• Validates the clustering without external information
• Can compare clusterings with different numbers of clusters
• Can be used to choose the number of clusters

External Index (reality):
• Validates against ground truth
• Compares two clusterings (how similar are they?)
Internal Measures: SSE

⚫ Internal Index: Used to measure the goodness of a clustering structure without reference to external information
⚫ Example: SSE
⚫ SSE is good for comparing two clusterings or two clusters (average SSE).
⚫ Can also be used to estimate the number of clusters

[Figure: left, a scatter plot of the data; right, SSE plotted against the number of clusters K, showing SSE decreasing as K increases.]
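A plot like the one above can be reproduced with a short script. The following hedged sketch uses scikit-learn's KMeans (its inertia_ attribute is the SSE) on synthetic data; the data and parameter choices are illustrative assumptions, not the lecture's.

```python
# Sketch: SSE (inertia_) versus the number of clusters K on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))   # SSE drops sharply until K = 3 (the "elbow")
```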
Internal Measures: Cohesion and Separation
⚫ Cluster Cohesion: Measures how closely related the objects in a cluster are.
⚫ Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters.
⚫ A proximity graph based approach can also be used for cohesion and
separation.
⚫ Cluster cohesion is the sum of the weight of all links within a cluster.
⚫ Cluster separation is the sum of the weights between nodes in the cluster and
nodes outside the cluster.
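A minimal sketch of how cohesion and separation could be computed, assuming the data X and the cluster labels are NumPy arrays; the function name and the between-group sum-of-squares (BSS) style separation measure are illustrative choices, not prescribed by the slides.

```python
# Sketch: cohesion as within-cluster SSE, separation as between-group sum of squares.
import numpy as np

def cohesion_separation(X, labels):
    overall_mean = X.mean(axis=0)
    sse, bss = 0.0, 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        sse += np.sum((members - centroid) ** 2)                       # cohesion
        bss += len(members) * np.sum((centroid - overall_mean) ** 2)   # separation
    return sse, bss
```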
Categories of clustering Algorithms
Partitioning method

 Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
 Various approaches have been proposed and some of them are:
 K-means Approach
 K-medoid Approach, CLARA (Clustering LARge Applications) and
CLARANS (CLustering Algorithm based on RANdomized Search)
K-means clustering

• K-means is a partitional clustering algorithm

• Let the set of data points (or instances) D be

{x1, x2, …, xn},

where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ Rr, and r is the number of attributes (dimensions) in the data.

• The k-means algorithm partitions the given data into k clusters.

• Each cluster has a cluster center, called centroid.

• k is specified by the user


K-means algorithm

 Given k, the k-means algorithm is implemented in 4 steps:

1. Select k centroids (these can be k random values, or k randomly chosen data points).

2. Partition the objects into k subsets. An object is assigned to cluster j if its distance to the mean of cluster j is smaller than its distance to the means of the other clusters.

3. Compute the new centroids of the clusters of the current partition. The centroid of the jth cluster is the mean of the data points assigned to cluster j in the previous step.

4. Go back to Step 2; stop when the process converges (a minimal Python sketch is given below).
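A minimal Python sketch of these four steps (an illustrative helper, not the lecture's own code; it uses Euclidean distance, random data points as initial centroids, and does not handle empty clusters).

```python
# Sketch of the k-means algorithm described above.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: pick k data points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# usage on the data points from the worked example below
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
print(kmeans(X, k=3))
```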


Strengths of k-means

• Strengths:
• Simple: easy to understand and to implement

• Efficient: Time complexity: O(tkn), where n is the number of data points, k is the number of
clusters, and t is the number of iterations.
• Since both k and t are usually small, k-means is considered a linear algorithm.

• K-means is the most popular clustering algorithm.


Weaknesses of k-means

• The algorithm is only applicable if the mean is defined.

• For categorical data, k-modes is used instead - the centroid is represented by the most frequent values.

• The user needs to specify k.

• The algorithm is sensitive to outliers

• Outliers are data points that are very far away from other data points.

• Outliers could be errors in the data recording or some special data points with
very different values.
Iteration 1
 First we list all points in the first column of the table below. The initial cluster centers (centroids) are (2, 10), (8, 4) and (1, 2), chosen randomly.
 The table shows the Manhattan (city-block) distance of each data point (instance) from the chosen centroids. The last column shows the cluster the instance should be assigned to, based on its distance from the centroids.
Data Points    Cluster 1, centroid (2, 10)   Cluster 2, centroid (8, 4)   Cluster 3, centroid (1, 2)   Cluster
A1 (2, 10)     0                             12                           9                            1
A2 (2, 5)      5                             7                            4                            3
A3 (8, 4)      12                            0                            9                            2
A4 (5, 8)      5                             7                            10                           1
A5 (7, 5)      10                            2                            9                            2
A6 (6, 4)      10                            2                            7                            2
A7 (1, 2)      9                             9                            0                            3
A8 (4, 9)      3                             9                            10                           1
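The table can be reproduced with a few lines of NumPy; this sketch (an assumption about tooling, not from the slides) computes the Manhattan distances to the three initial centroids and the resulting assignments.

```python
# Sketch reproducing iteration 1: Manhattan distances and cluster assignments.
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
centroids = np.array([[2, 10], [8, 4], [1, 2]])

dists = np.abs(points[:, None, :] - centroids[None, :, :]).sum(axis=2)
assignments = dists.argmin(axis=1) + 1    # 1-based cluster index, as in the table
print(dists)        # the distance columns of the table
print(assignments)  # [1 3 2 1 2 2 3 1]
```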
Iteration 1
⚫ Next, we need to re-compute the new cluster centers. We do so by taking the mean of all points in each cluster.
⚫ For Cluster 1, we have three points (A1, A4, A8) and take their average as the new centroid, i.e. ((2+5+4)/3, (10+8+9)/3) = (3.67, 9).
⚫ For Cluster 2, we have three points (A3, A5, A6). The new centroid is: ((8+7+6)/3, (4+5+4)/3) = (7, 4.33).
⚫ For Cluster 3, we have two points (A2, A7). The new centroid is: ((2+1)/2, (5+2)/2) = (1.5, 3.5).
⚫ Since the centroids changed in iteration 1, we go to the next iteration (epoch 2) using the new means we computed.
⚫ The iterations continue until the centroids do not change anymore.
Second epoch
⚫ Entering the 2nd epoch, the clusters and their centroids are:
cluster 1: {A1, A4, A8} with new centroid (3.67, 9);
cluster 2: {A3, A5, A6} with new centroid (7, 4.33);
cluster 3: {A2, A7} with new centroid (1.5, 3.5).
⚫ Using the new centroids, compute the cluster members again.
Data Points    Cluster 1, centroid (3.67, 9)   Cluster 2, centroid (7, 4.33)   Cluster 3, centroid (1.5, 3.5)   Cluster
A1 (2, 10)     2.67                            10.67                           7.00                             1
A2 (2, 5)      5.67                            5.67                            2.00                             3
A3 (8, 4)      9.33                            1.33                            7.00                             2
A4 (5, 8)      2.33                            5.67                            8.00                             1
A5 (7, 5)      7.33                            0.67                            7.00                             2
A6 (6, 4)      7.33                            1.33                            5.00                             2
A7 (1, 2)      9.67                            8.33                            2.00                             3
A8 (4, 9)      0.33                            7.67                            8.00                             1

(Manhattan distances to the new centroids, rounded to two decimals; the cluster assignments are unchanged.)
Final results
⚫ Finally, in the 2nd epoch there is no change in the cluster members or centroids, so the algorithm stops.
⚫ The result of clustering is shown in the following figure
Hierarchical Clustering
• Produce a nested sequence of clusters, a tree, also called Dendrogram.
Example of hierarchical clustering
Hierarchical clustering
Types of hierarchical clustering

• Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and
• merges the most similar (or nearest) pair of clusters

• stops when all the data points are merged into a single cluster (i.e., the root cluster).

• Divisive (top down) clustering: It starts with all data points in one cluster, the root.
• Splits the root into a set of child clusters. Each child cluster is recursively divided further

• stops when only singleton clusters of individual data points remain, i.e., each cluster
with only a single point
Types of hierarchical clustering
[Figure: five objects a, b, c, d, e. Agglomerative Nesting (AGNES) proceeds from step 0 to step 4, merging a and b into ab, d and e into de, then c, d, e into cde, and finally all objects into abcde. Divisive Analysis (DIANA) runs the same sequence in reverse, from step 4 back to step 0, splitting abcde into singletons.]
Agglomerative clustering

It is more popular than divisive methods.

• At the beginning, each data point forms a cluster (also called a node).

• Merge nodes/clusters that have the least distance.

• Go on merging

• Eventually all nodes belong to one cluster
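A hedged sketch of agglomerative clustering using SciPy; the linkage method, the cut into three clusters, and the sample points (reused from the k-means example) are illustrative assumptions.

```python
# Sketch: build the merge tree (dendrogram) and cut it into clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])

Z = linkage(X, method="average")                   # "single", "complete", "ward" also common
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)

dendrogram(Z)
plt.show()
```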


Agglomerative clustering algorithm
Measuring the distance of two clusters
Hierarchical Clustering: Problems and Limitations

 Once a decision is made to combine two clusters, it cannot be undone

 No objective function is directly minimized

 Different schemes have problems with one or more of the following:


Sensitivity to noise and outliers
Difficulty handling different sized clusters and convex shapes
Breaking large clusters
The phenomenon of reinforcement learning
▪ No knowledge of the environment
▪ Can only act in the world and observe states and rewards
[Diagram: at each time step the agent receives observation Ot and reward Rt from the environment and emits action At.]
▪ At each step t the agent:


▪ Executes action At
▪ Receives observation Ot
▪ Receives scalar reward Rt
▪ The environment:
▪ Receives action At
▪ Emits observation Ot+1
▪ Emits scalar reward Rt+1
▪ t increments at each environment step
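A toy sketch of this interaction loop in Python; both the environment and the agent below are stand-ins assumed purely for illustration.

```python
# Sketch of the agent-environment loop: act, then observe and receive a reward.
import random

def environment_step(action):
    # toy environment: observation is a random number, reward is +1 for action 1
    observation = random.random()
    reward = 1.0 if action == 1 else 0.0
    return observation, reward

def agent_act(observation):
    return random.choice([0, 1])   # a random policy, just to drive the loop

obs, total_reward = 0.0, 0.0
for t in range(10):
    action = agent_act(obs)                  # agent executes action A_t
    obs, reward = environment_step(action)   # receives O_{t+1} and R_{t+1}
    total_reward += reward
print(total_reward)
```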
The phenomenon of reinforcement learning
▪ The history Ht is the sequence of observations, actions, and rewards:

▪ Ht = O1, R1, A1, . . . , At−1, Ot, Rt

▪ i.e. all observable variables up to time t


▪ i.e. the sensorimotor stream of a robot or embodied agent
▪ What happens next depends on the history:
▪ The agent selects actions
▪ The environment selects observations/rewards

▪ State is the information used to determine what happens next

▪ Formally, state is a function of the history: St = f(Ht)

The phenomenon of reinforcement learning
▪ Many factors make RL difficult:

▪ Actions have non-deterministic effects


▪ Which are initially unknown
▪ Rewards / punishments are infrequent
▪ Often at the end of long sequences of actions
▪ How do we determine which action(s) were really responsible for the reward or punishment?
▪ World is large and complex
Reinforcement learning

▪ Reinforcement learning is the problem faced by an agent that learns behavior through
trial- and-error interactions with a dynamic environment.
▪ Reinforcement Learning is learning how to act in order to maximize a numerical reward.

▪ RL is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an orthogonal approach to machine learning.

▪ It emphasizes learning feedback that evaluates the learner's performance without providing standards of correctness in the form of behavioral targets.

[Diagram: inputs flow into the RL system, which produces outputs ("actions"); the training information consists of evaluations ("rewards" / "penalties"). Objective: get as much reward as possible.]
Reinforcement learning
▪ Elements of reinforcement learning
▪ Temporally situated
▪ Continual learning and planning
▪ Objective is to affect the environment
▪ Environment is stochastic and uncertain

[Diagram: the agent, following its policy, observes state st and reward rt, takes action at, and the environment returns the next state st+1.]
▪ Agent: intelligent program
▪ Environment: external conditions
▪ Policy:
▪ Defines the agent's behavior at a given time
▪ A mapping from states to actions
▪ Lookup tables or a simple function
Reinforcement learning
▪ Elements of reinforcement learning

[Diagram: the four elements - policy, reward, value, and model of the environment.]
• Policy: what to do
• Reward: what is good
• Value: what is good because it predicts reward
• Model: what follows what

▪ Reward function:
▪ Defines the goal in an RL problem
▪ Policy is altered to achieve this goal
▪ The task is to learn an optimal policy that maps states of the world to actions of the agent. I.e., if this patch of room is dirty, I clean it; if my battery is empty, I recharge it.
Markov Decision Process
▪ MDP is a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those, future rewards.
▪ It is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
▪ The learner and decision maker is called the agent. The thing it interacts with,
comprising everything outside the agent, is called the environment.
▪ These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.
Markov Decision Process
▪ At time step t=0, environment samples initial state s0 ~ p(s0)

▪ Then, for t=0 until done:

▪ Agent selects action at

▪ Environment samples reward rt ~ R( . | st, at)

▪ Environment samples next state st+1 ~ P( . | st, at)

▪ Agent receives reward rt and next state st+1

▪ A policy π is a function from S to A that specifies what action to take in each state.

▪ Objective: find the policy π* that maximizes the cumulative discounted reward E[ Σt≥0 γ^t rt ].
Markov Decision Process
▪ A simple MDP: Grid World
▪ States: the cells of the grid
▪ Actions = {right, left, up, down}
▪ Set a negative "reward" for each transition (e.g. r = -1)
▪ Objective: reach one of the terminal states (greyed out) in the least number of actions
▪ Example: Recycle robot MDP
▪ At each step, robot has to decide whether it should (1) actively search for a can,
(2) wait for someone to bring it a can, or (3) go to home base and recharge.

▪ Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
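A small value-iteration sketch for a grid world of the kind described above; the 4x4 layout, the terminal corners, and r = -1 per move are assumptions chosen only to match the description, not the lecture's own setup.

```python
# Sketch: value iteration on a toy grid world with reward -1 per transition.
import numpy as np

n = 4
terminals = {(0, 0), (n - 1, n - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
V = np.zeros((n, n))

for _ in range(100):
    new_V = np.zeros_like(V)
    for i in range(n):
        for j in range(n):
            if (i, j) in terminals:
                continue
            best = -np.inf
            for di, dj in actions:
                # moves off the grid leave the agent in place
                ni, nj = max(0, min(n - 1, i + di)), max(0, min(n - 1, j + dj))
                best = max(best, -1.0 + V[ni, nj])   # reward -1 per transition
            new_V[i, j] = best
    V = new_V

print(V)   # optimal value = -(number of moves to the nearest terminal state)
```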
Markov Decision Process
▪ Example: Recycle robot MDP
▪ Decisions are made on the basis of the current energy level: high, low.
▪ Reward = number of cans collected
S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
Rsearch = expected no. of cans while searching
Rwait = expected no. of cans while waiting
Dimensionality reduction techniques

• Another unsupervised learning setting, e.g. PCA (Principal Component Analysis)

• Dimensionality reduction is simply the process of reducing the dimension of your feature set.
• Your feature set could be a dataset with a hundred columns (i.e. features).
Dimensionality reduction techniques

Dimensionality reduction
• Represent the data x1 , . . . , xn ∈ Rd in a subspace of lower
dimension with as little loss of information as possible
• Advantages:
• Visualization
• Lower computation and time complexity
• Avoid overfitting and reduce noise
Dimensionality reduction techniques

• Problem setting:
• Given x1, . . . , xn ∈ Rd
• find a k-dimensional subspace
• such that the data projected onto that space
• is as close to the original data as possible

• Common techniques (please read the details of each):
• Principal Component Analysis (PCA)
• Factor Analysis
• Linear Discriminant Analysis (LDA)
• Multiple Correspondence Analysis (MCA)
• t-Distributed Stochastic Neighbour Embedding (t-SNE)
Dimensionality reduction techniques

• The most common and well-known dimensionality reduction techniques are PCA, LDA, and FA.
• Principal Component Analysis(PCA): PCA rotates and projects data along
the direction of increasing variance. The features with the maximum
variance are the principal components.
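A short PCA sketch using scikit-learn; the random data and the choice of two components are illustrative assumptions.

```python
# Sketch: project 4-dimensional data onto the two directions of largest variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```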
Dimensionality reduction techniques

• Linear Discriminant Analysis (LDA): projects data in a way that the class separability is maximized. Examples from the same class are put closely together by the projection, while examples from different classes are placed far apart.
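A short LDA sketch using scikit-learn; note that LDA uses class labels, so it is supervised dimensionality reduction. The Iris data and the choice of two components are illustrative assumptions.

```python
# Sketch: project labelled data onto two class-separating directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)   # labels y are required
print(X_reduced.shape)                # (150, 2)
```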
Dimensionality reduction techniques

• Factor Analysis: a technique used to reduce a large number of variables into a smaller number of factors.
• The values of observed data are expressed as functions of a number of possible
causes in order to find which are the most important.
• The observations are assumed to be caused by a linear transformation of lower
dimensional latent factors and added Gaussian noise.
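A short Factor Analysis sketch using scikit-learn, generating data from two latent factors plus Gaussian noise and then recovering two factors; all numbers here are illustrative assumptions.

```python
# Sketch: observed data = latent factors x loading matrix + Gaussian noise.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                # 2 hidden factors
W = rng.normal(size=(2, 6))                       # loading matrix
X = latent @ W + 0.1 * rng.normal(size=(200, 6))  # observed 6-dimensional data

fa = FactorAnalysis(n_components=2)
factors = fa.fit_transform(X)
print(factors.shape)   # (200, 2) recovered factor scores
```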
