ARTIFICIAL INTELLIGENCE BUI NGOC DUNG
CHAPTER 5: UNSUPERVISED LEARNING
K-MEANS CLUSTERING

UNSUPERVISED LEARNING
❑ Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are left to act on that data without any supervision.
❑ The aim of an unsupervised algorithm is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

CLUSTER
❑ Clustering organizes unlabeled data into similarity groups called clusters.
❑ A cluster is a collection of data items that are “similar” to one another and “dissimilar” to data items in other clusters.

CLUSTERING
Clustering is a type of unsupervised learning that automatically forms clusters of similar things.

K-MEANS CLUSTERING
K-means is an algorithm that finds k clusters for a given dataset. The number of clusters k is user-defined. Each cluster is described by a single point known as the centroid, which lies at the center (the mean) of all the points in the cluster.
❑ Pros: Easy to implement
❑ Cons: Can converge at a local minimum; slow on very large datasets
❑ Works with: Numeric values

PSEUDO-CODE
Create k points for starting centroids (often randomly)
While any point has changed cluster assignment:
    for every point in our dataset:
        for every centroid:
            calculate the distance between the centroid and the point
        assign the point to the cluster with the lowest distance
    for every cluster:
        calculate the mean of the points in that cluster
        assign the centroid to the mean

DISTANCE MEASURE
❑ The distance measure determines the similarity between two elements and influences the shape of the clusters.
❑ K-means clustering supports various distance measures; the most commonly used is the Euclidean distance between two points.

GENERAL APPROACH TO K-MEANS CLUSTERING
1. Collect: Any method.
2. Prepare: Numeric values are needed for a distance calculation, and nominal values can be mapped into binary values for distance calculations.
3. Analyze: Any method.
4. Train: Doesn’t apply to unsupervised learning.
5. Test: Apply the clustering algorithm and inspect the results. Quantitative error measurements such as the sum of squared error (introduced later) can be used.
6. Use: Anything you wish. Often, the cluster centers can be treated as representative data of the whole cluster to make decisions.

ILLUSTRATION
❑ https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
❑ http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
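EXAMPLE (SKETCH)
The k-means pseudo-code in this chapter can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation. The names euclidean, k_means and sse, and the parameters max_iter and seed, are assumptions introduced here. So are the choices to pick the starting centroids from the data points and to leave the centroid of an empty cluster unchanged. The sse function computes the sum of squared error that the Test step mentions, assuming the Euclidean distance.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster a list of numeric tuples; return (centroids, assignments)."""
    rng = random.Random(seed)
    # Starting centroids: k distinct data points chosen at random
    # (an assumption; the slides only say "often randomly").
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)

    for _ in range(max_iter):
        changed = False
        # Assign every point to the cluster with the lowest distance.
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            if nearest != assignments[i]:
                assignments[i] = nearest
                changed = True
        if not changed:
            break  # no point changed cluster, so the loop ends
        # Move every centroid to the mean of the points in its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, assignments


def sse(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        euclidean(p, centroids[a]) ** 2 for p, a in zip(points, assignments)
    )


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, labels = k_means(data, k=2)
    print(centroids, labels, sse(data, centroids, labels))
```

On this small, well-separated dataset the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random starting centroids. On less cleanly separated data, different seeds can give different results, because the algorithm can converge at a local minimum.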