UNIT III : Ensemble Techniques and Unsupervised Learning

Syllabus : Combining multiple learners : Model combination schemes, Voting, Ensemble Learning - bagging, boosting, stacking; Unsupervised learning : K-means, Instance Based Learning : KNN, Gaussian mixture models and Expectation maximization.

Contents
3.1 Combining Multiple Learners
3.2 Ensemble Learning
3.3 Clustering
3.4 Instance Based Learning : KNN
3.5 Gaussian Mixture Models
3.6 Two Marks Questions with Answers

3.1 Combining Multiple Learners

• When designing a learning machine, we generally make choices : parameters of the machine, training data, representation, etc. This implies some sort of variance in performance. For example, in a classification setting we can use a parametric classifier or a multilayer perceptron, in which case we must also decide on the number of hidden units.

• Each learning algorithm dictates a certain model that comes with a set of assumptions. This inductive bias leads to error if the assumptions do not hold for the data.

• Different learning algorithms have different accuracies. The no free lunch theorem asserts that no single learning algorithm always achieves the best performance in any domain, but learners can be combined to attain higher accuracy.

• Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent and clean representation. Fusing data to improve prediction accuracy and reliability is an important problem in machine learning.

• Combining different models is also done to improve the performance of deep learning models. Building a new model by combination requires less time, data and computational resources. The most common method to combine models is by averaging multiple models, where taking a weighted average improves the accuracy.

• There are several ways to generate diverse base-learners :
  - Different algorithms : Different learning algorithms lead to different classifiers.
  - Different hyper-parameters : We can use the same learning algorithm but with different hyper-parameters.
  - Different input representations : Different representations make different characteristics explicit, allowing better identification.
  - Different training sets : Another possibility is to train different base-learners on different subsets of the training set.

Model Combination Schemes

• Different methods used for generating the final output from multiple base-learners are multiexpert and multistage combination.

1. Multiexpert combination : Multiexpert combination methods have base-learners that work in parallel.
   a) Global approach (learner fusion) : given an input, all base-learners generate an output and all these outputs are used; examples are voting and stacking.
   b) Local approach (learner selection) : in a mixture of experts, there is a gating model which looks at the input and chooses one (or very few) of the learners as responsible for generating the output.

2. Multistage combination : Multistage combination methods use a serial approach, where the next base-learner is trained with or tested on only the instances where the previous base-learners are not accurate enough.

• Assume we want to construct a function that maps inputs to outputs from a set of N_train known input-output pairs D_train = {(x_i, y_i), i = 1, ..., N_train}, where x_i ∈ X is a D-dimensional feature input vector and y_i ∈ Y is the output.
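To make the two multiexpert schemes concrete, the short sketch below contrasts the global (fusion) approach with the local (selection) approach on hypothetical base-learner outputs; the prediction values, the learner weights and the gating rule are invented purely for illustration.

```python
import numpy as np

# Hypothetical outputs of three base-learners for four test inputs
# (invented numbers, purely for illustration).
preds = np.array([
    [2.1, 0.9, 3.8, 1.2],   # learner 1
    [1.9, 1.1, 4.2, 1.0],   # learner 2
    [2.3, 0.8, 4.0, 1.4],   # learner 3
])

# Global approach (fusion): every learner contributes, here via a weighted average.
weights = np.array([0.5, 0.3, 0.2])           # assumed learner weights
fused = weights @ preds                       # one combined prediction per input

# Local approach (selection): a gating rule picks a single learner per input.
# Here the "gate" is just a made-up index per input standing in for a gating model.
gate = np.array([0, 2, 1, 0])
selected = preds[gate, np.arange(preds.shape[1])]

print("fusion   :", fused)
print("selection:", selected)
```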
• Classification : the output takes values in a discrete set of class labels Y = {c_1, c_2, ..., c_K}, where K is the number of different classes. Regression consists in predicting continuous ordered outputs, Y = R.

Voting

• The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners. Voting is an ensemble machine learning algorithm.

• For regression, a voting ensemble involves making a prediction that is the average of multiple other regression models.

• In classification, a hard voting ensemble involves summing the votes for crisp class labels from the other models and predicting the class with the most votes. A soft voting ensemble involves summing the predicted probabilities for the class labels and predicting the class label with the largest sum probability.

• Fig. 3.1.1 shows base-learners with their outputs.

Fig. 3.1.1 : Base-learners with their outputs
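A minimal sketch of hard and soft voting using scikit-learn's VotingClassifier is given below; the choice of base estimators and the synthetic dataset are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for D_train (an assumption for the sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=3)),
        ("nb", GaussianNB())]

# Hard voting: each base-learner casts one vote for a crisp class label.
hard = VotingClassifier(estimators=base, voting="hard").fit(X_tr, y_tr)

# Soft voting: predicted class probabilities are summed and the largest wins.
soft = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)

print("hard voting accuracy:", hard.score(X_te, y_te))
print("soft voting accuracy:", soft.score(X_te, y_te))
```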
• In these methods, the first step is to create multiple classification/regression models using some training dataset. Each base model can be created using different splits of the same training dataset and the same algorithm, using the same dataset with different algorithms, or by any other method.

• Learn multiple alternative definitions of a concept using different training data or different learning algorithms, and combine the decisions of the multiple definitions using weighted voting. Fig. 3.1.2 shows the general idea of base-learners with a model combiner.

Fig. 3.1.2 : Base-learners with model combiner (training data set → learners → model combiner → final model)

• Alternatively, use a single, arbitrary learning algorithm but manipulate the training data to make it learn multiple models.

Error-Correcting Output Codes

• In error-correcting output codes, the main classification task is defined in terms of a number of subtasks that are implemented by the base-learners. The idea is that the original task of separating one class from all other classes may be a difficult problem.

• So we want to define a set of simpler classification problems, each specializing in one aspect of the task, and by combining these simpler classifiers we get the final classifier.

• Base-learners are binary classifiers having output -1/+1, and there is a code matrix W of K x L whose K rows are the binary codes of the classes in terms of the L base-learners d_j.

• The simplest code matrix uses one learner per class, L = K :

        [ +1  -1  -1  -1 ]
   W =  [ -1  +1  -1  -1 ]
        [ -1  -1  +1  -1 ]
        [ -1  -1  -1  +1 ]

• The problem here is that if there is an error with one of the base-learners, there may be a misclassification because the class code words are so similar. So the approach in error-correcting codes is to have L > K and increase the Hamming distance between the code words. One possibility is pairwise separation of classes, where there is a separate base-learner to separate C_i from C_j for i < j.

3.2 Ensemble Learning

AdaBoost

• AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the prestigious "Gödel Prize" in 2003 for their work. It can be used in conjunction with many other types of learning algorithms to improve their performance.

• It can be used to learn weak classifiers, with the final classification based on a weighted vote of the weak classifiers.

• It is a linear classifier with all its desirable properties and has good generalization properties.

• The idea is to use the weak learner to form a highly accurate prediction rule by calling the weak learner repeatedly on different distributions over the training examples. Initially, all weights are set equally, but in each round the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples that the previous classifiers predicted poorly.

Advantages of AdaBoost :
1. Very simple to implement.
2. Fairly good generalization.
3. The prior error need not be known ahead of time.

Disadvantages of AdaBoost :
1. Suboptimal solution.
2. Can overfit in the presence of noise.

Boosting Steps :
1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree, to train a third weak learner C3.
4. Combine all the weak learners via majority voting.

Advantages of Boosting :
1. Supports different loss functions.
2. Works well with interactions.

Disadvantages of Boosting :
1. Prone to over-fitting.
2. Requires careful tuning of different hyper-parameters.

Stacking

• Stacking, sometimes called stacked generalization, is an ensemble machine learning method that combines multiple heterogeneous base or component models via a meta-model.

• The base models are trained on the complete training data, and then the meta-model is trained on the predictions of the base models. The advantage of stacking is the ability to explore the solution space with different models in the same problem.

• A stacking based model can be visualized in levels and has at least two levels of models. The first level typically trains two or more base learners (which can be heterogeneous), and the second level might be a single meta-learner that utilizes the base models' predictions as input and gives the final result as output. A stacked model can have more than two such levels, but increasing the levels doesn't always guarantee better performance.

• In classification tasks, logistic regression is often used as the meta-learner, while linear regression is more suitable as the meta-learner for regression-based tasks.

• Stacking is concerned with combining multiple classifiers generated by different learning algorithms L_1, ..., L_m on a single dataset S, which is composed of feature vectors s_i = (x_i, y_i).

• The stacking process can be broken into two phases :
1. Generate a set of base-level classifiers C_1, ..., C_m, where C_k = L_k(S).
2. Train a meta-level classifier to combine the outputs of the base-level classifiers.

• Fig. 3.2.2 shows the stacking frame : the base hypotheses make predictions on the training observations, and the meta-learner H combines these predictions into the final prediction.

Fig. 3.2.2 : Stacking frame

• The training set for the meta-level classifier is generated through a leave-one-out cross-validation process : for all i = 1, ..., n and for all k = 1, ..., m,

   C_k^i = L_k(S - s_i)

• The learned classifiers are then used to generate predictions for s_i : ŷ_i^k = C_k^i(x_i).

• The meta-level dataset consists of examples of the form ((ŷ_i^1, ..., ŷ_i^m), y_i), where the features are the predictions of the base-level classifiers and the class is the correct class of the example in hand.
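A compact sketch of this two-level scheme using scikit-learn's StackingClassifier follows; the particular base learners, the logistic-regression meta-learner and the 5-fold cross-validation used to build the meta-level training set are illustrative choices, not the only possible ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Level-0: heterogeneous base-level classifiers C_1, ..., C_m.
base = [("dt", DecisionTreeClassifier(max_depth=4)),
        ("knn", KNeighborsClassifier(n_neighbors=5))]

# Level-1: a logistic-regression meta-learner trained on out-of-fold
# base predictions (cv=5 plays the role of the cross-validation step).
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_tr, y_tr)
print("stacking accuracy:", stack.score(X_te, y_te))
```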
Why do ensemble methods work ?

• Their success is based on one of two basic observations :
1. Variance reduction : If the training sets are completely independent, it always helps to average an ensemble, because this reduces variance without affecting bias (e.g. bagging) and reduces sensitivity to individual data points.
2. Bias reduction : For simple models, an average of models has much greater capacity than a single model. Averaging models can reduce bias substantially by increasing capacity, and variance can be controlled by fitting one component at a time.

Adaboost

• AdaBoost, also referred to as adaptive boosting, is a method in machine learning used as an ensemble method. The most common algorithm used with AdaBoost is decision trees with one level, i.e. decision trees with only one split. These trees are also called decision stumps.

• The working of the AdaBoost model follows the path below :
1. Creation of the base learner.
2. Calculation of the total error, i.e. the sum of the weights of the misclassified samples.
3. Calculation of the performance (amount of say) of the decision stumps.
4. Updating the weights according to the misclassified points.
5. Creation of a new dataset.

Fig. 3.2.3 : AdaBoost

AdaBoost ensemble :

• In the ensemble approach, we add the weak models sequentially and then train them using weighted training data.

• We continue to iterate the process until we have created a pre-set number of weak learners or we cannot observe further improvement on the dataset. At the end of the algorithm, we are left with a number of weak learners, each with a stage value.

Difference between Bagging and Boosting

• Bagging is a technique that builds multiple homogeneous models from different subsamples of the same training dataset to obtain more accurate predictions than its individual models; boosting refers to a group of algorithms that utilize weighted averages to make weak learning algorithms stronger learning algorithms.
• Bagging learns the models independently from each other in parallel; boosting learns them sequentially in a very adaptive way.
• Bagging helps in reducing variance; boosting helps in reducing bias and variance.
• In bagging every model receives an equal weight; in boosting models are weighted by their performance.
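Before moving on to clustering, here is a from-scratch sketch of the boosting procedure described above (AdaBoost with decision stumps on labels in {-1, +1}); the synthetic data, the number of rounds and the use of scikit-learn stumps as weak learners are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with labels mapped to {-1, +1}.
X, y = make_classification(n_samples=400, n_features=8, random_state=2)
y = np.where(y == 0, -1, 1)

n, rounds = len(X), 20
w = np.full(n, 1.0 / n)          # start with equal weights
stumps, alphas = [], []

for _ in range(rounds):
    stump = DecisionTreeClassifier(max_depth=1)   # decision stump
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    err = np.sum(w[pred != y]) / np.sum(w)        # total (weighted) error
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)         # stage value / amount of say

    # Increase the weights of misclassified points, decrease the others.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final classification: weighted vote of the weak classifiers.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```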
3.3 Clustering

• Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

• Cluster analysis can be a powerful data-mining tool for any organization that needs to identify discrete groups of customers, sales transactions, or other types of behaviours and things. For example, insurance providers use cluster analysis to detect fraudulent claims and banks use it for credit scoring.

• Cluster analysis uses mathematical models to discover groups of similar customers based on the smallest variations among customers within each group.

• A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

• Clustering is a process of partitioning a set of data into a set of meaningful subclasses. Every data point in a subclass shares a common trait. It helps a user understand the natural grouping or structure in a data set.

• Various types of clustering methods are partitioning methods, hierarchical clustering, fuzzy clustering, density-based clustering and model-based clustering.

• Cluster analysis is the process of grouping a set of data objects into clusters. Desirable properties of a clustering algorithm are as follows :
1. Scalability (in terms of both time and space)
2. Ability to deal with different data types
3. Minimal requirements for domain knowledge to determine input parameters
4. Interpretability and usability.

• Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data. Clustering can be considered the most important unsupervised learning problem. A cluster is therefore a collection of objects which are "similar" between them and "dissimilar" to the objects belonging to other clusters. Fig. 3.3.1 shows an example of clusters.

Fig. 3.3.1 : Clusters

• In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance : two or more objects belong to the same cluster if they are "close" according to a given distance (in this case geometrical distance). This is called distance-based clustering.

Fig. 3.3.2 : Raw data → clustering algorithm → clusters of data

• Clustering means grouping of data or dividing a large data set into smaller data sets of some similarity.

• A clustering algorithm attempts to find natural groups of components or data based on some similarity. The clustering algorithm also finds the centroid of each group, which is basically a statistical description of the cluster centroids with the number of components in each cluster.

• Cluster centroid : The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster. Each cluster has a well defined centroid.

• Distance : The distance between two points is taken as a common metric to assess the similarity among the components of a population. The commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p_1, p_2, ...) and q = (q_1, q_2, ...) as

   d(p, q) = sqrt( Σ_{i=1}^{k} (p_i - q_i)² )

• The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data. But how do we decide what constitutes a good clustering ? It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply the criterion, in such a way that the result of the clustering will suit their needs.

• Clustering analysis helps construct meaningful partitioning of a large set of objects. Cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing, etc.

• Clustering algorithms may be classified as listed below :
1. Exclusive clustering
2. Overlapping clustering
3. Hierarchical clustering
4. Probabilistic clustering.
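A small sketch of the Euclidean distance measure defined above, applied to a pair of made-up points, is shown below.

```python
import numpy as np

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

# Two illustrative points (invented values).
print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))   # -> 5.0
```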
• A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

• Clustering techniques types : The major clustering techniques are,
a) Partitioning methods
b) Hierarchical methods
c) Density-based methods.

K-Means Clustering

• K-means clustering is a heuristic method. Here each cluster is represented by the center of the cluster. The "K" stands for the number of clusters; it is typically a user input to the algorithm, although some criteria can be used to automatically estimate K.

• This method initially takes the number of components of the population equal to the final required number of clusters. In this step itself the final required number of clusters is chosen such that the points are mutually farthest apart. Next, it examines each component in the population and assigns it to one of the clusters depending on the minimum distance. The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters.

• Given K, the K-means algorithm consists of four steps :
1. Select initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Compute each centroid as the mean of the objects assigned to it.
4. Repeat the previous two steps until no change.

• Here x_1, ..., x_N are data points or vectors of observations. Each observation (vector x_i) will be assigned to one and only one cluster. C(i) denotes the cluster number for the i-th observation. K-means minimizes the within-cluster point scatter :

   W(C) = (1/2) Σ_{k=1}^{K} Σ_{C(i)=k} Σ_{C(j)=k} ||x_i - x_j||²  =  Σ_{k=1}^{K} N_k Σ_{C(i)=k} ||x_i - m_k||²

where m_k is the mean vector of the k-th cluster and N_k is the number of observations in the k-th cluster.

K-Means Algorithm Properties

1. There are always K clusters.
2. There is always at least one item in each cluster.
3. The clusters are non-hierarchical and they do not overlap.
4. Every member of a cluster is closer to its cluster than to any other cluster, because closeness does not always involve the "center" of clusters.

K-Means Algorithm Process

• The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points.

• For each data point : calculate the distance from the data point to each cluster. If the data point is closest to its own cluster, leave it where it is; if the data point is not closest to its own cluster, move it into the closest cluster.

• Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

• The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion.

• The K-means algorithm is iterative in nature. It converges, however only a local minimum is obtained. It works only for numerical data. This method is easy to implement.

• Advantages of K-Means Algorithm :
1. Efficient in computation.
2. Easy to implement.

• Weaknesses :
1. Applicable only when the mean is defined.
2. Need to specify K, the number of clusters, in advance.
3. Trouble with noisy data and outliers.
4. Not suitable to discover clusters with non-convex shapes.
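The four-step K-means procedure above can be sketched in a few lines of NumPy; the random 2-D blobs, the choice K = 3 and the fixed iteration cap are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three made-up 2-D blobs standing in for the data set.
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [4, 4], [0, 4])])

K = 3
centroids = X[rng.choice(len(X), K, replace=False)]   # step 1: random initial centroids

for _ in range(100):
    # Step 2: assign each object to the cluster with the nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of the objects assigned to it
    # (an empty cluster simply keeps its previous centroid).
    new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    # Step 4: stop when nothing changes.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("final centroids:\n", centroids)
```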
3.4 Instance Based Learning : KNN

• k-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning approach.

• The k-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.

• The k-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the k-NN algorithm.

• The k-NN algorithm can be used for regression as well as for classification, but mostly it is used for classification problems.

• k-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.

• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, it performs an action on the dataset.

• The k-NN algorithm at the training phase just stores the dataset, and when it gets new data it classifies that data into the category that is most similar to the new data.

• Example : Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. For this identification we can use the k-NN algorithm, because it works on a similarity measure. Our k-NN model will find the similar features of the new data set to the cat and dog images, and based on the most similar features it will place it in either the cat or the dog category.

Why Do We Need KNN ?

• Suppose there are two categories, i.e. category A and category B, and we have a new data point x1; this data point will lie in one of these categories. To solve this type of problem, we need a k-NN algorithm. With the help of k-NN, we can easily identify the category or class of a particular dataset. Fig. 3.4.1 shows the new data point before k-NN and after k-NN, where it is assigned to category A.

Fig. 3.4.1 : Why do we need KNN ?

How Does KNN Work ?

• The k-NN working can be explained on the basis of the below algorithm :
Step 1 : Select the number K of the neighbours.
Step 2 : Calculate the Euclidean distance to the K number of neighbours.
Step 3 : Take the K nearest neighbours as per the calculated Euclidean distance.
Step 4 : Among these K neighbours, count the number of data points in each category.
Step 5 : Assign the new data point to the category for which the number of neighbours is maximum.
Step 6 : Our model is ready.
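These steps translate almost directly into code; the tiny hand-made training set, the choice k = 5 and the plain majority vote below are assumptions chosen only to illustrate the procedure.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Steps 2-3: Euclidean distances to every training point, keep the k nearest.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote among the k nearest neighbours.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Made-up 2-D points for categories 'A' and 'B'.
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6], [2, 2], [7, 7]])
y_train = np.array(["A", "A", "A", "B", "B", "B", "A", "B"])

print(knn_predict(X_train, y_train, np.array([2, 1.5]), k=5))   # expected: 'A'
```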
• Suppose we have a new data point and we need to put it in the required category (see Fig. 3.4.2).

Fig. 3.4.2 : KNN example

• Firstly, we choose the number of neighbours; here we take K = 5.

• Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It may be calculated as :

   Euclidean distance between A_1(X_1, Y_1) and B_1(X_2, Y_2) = sqrt( (X_2 - X_1)² + (Y_2 - Y_1)² )

Fig. 3.4.3 : KNN example (continued)

• By calculating the Euclidean distance we get the nearest neighbours : three nearest neighbours in category A and two nearest neighbours in category B (see Fig. 3.4.4).

Fig. 3.4.4 : KNN example (continued) - Category A : 3 neighbours, Category B : 2 neighbours

• As we can see, the three nearest neighbours are from category A, hence this new data point must belong to category A.

Difference between K-means and KNN

1. K-Means is an unsupervised machine learning algorithm used for clustering; KNN is a supervised machine learning algorithm used for classification.
2. K-Means is an eager learner; KNN is a lazy learner.
3. K-Means is used for clustering; KNN is used mostly for classification, and sometimes even for regression.
4. 'K' in K-Means is the number of clusters the algorithm is trying to identify/learn from the data; 'K' in KNN is the number of nearest neighbours used to classify or predict a test sample.
5. K-Means requires unlabelled data - it gathers and groups data into K clusters; KNN requires labelled data and classifies new data points according to the K closest data points.

3.5 Gaussian Mixture Models

• A Gaussian mixture model is a "soft" clustering algorithm, where each point probabilistically "belongs" to all clusters. This is different from k-means, where each point belongs to one cluster.

• The Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions with unknown parameters.

• For example, in modeling human height data, height is typically modeled as a normal distribution for each gender, with a mean of approximately 5'10" for males and 5'5" for females. Given only the height data and not the gender assignments for each data point, the distribution of all heights would follow the sum of two scaled (different variance) and shifted (different mean) normal distributions. A model making this assumption is an example of a Gaussian mixture model.

• Gaussian mixture models do not rigidly classify each and every instance into one class or the other. The algorithm attempts to produce K Gaussian distributions that would take into account the entire training space. Every point can be associated with one or more distributions. Consequently, the deterministic factor would be the probability that each point belongs to a certain Gaussian distribution.

• A Gaussian mixture model consists of two parts : mean vectors and covariance matrices.

• A Gaussian distribution is defined as a continuous probability distribution that takes on a bell-shaped curve. Another name for the Gaussian distribution is the normal distribution.

• In a one-dimensional space, the probability density function of a Gaussian distribution is given by :

   f(x | μ, σ²) = (1 / sqrt(2πσ²)) · exp( -(x - μ)² / (2σ²) )

where μ is the mean and σ² is the variance.
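The sketch below evaluates this density and a two-component mixture of it at a few points, using made-up means, variances and mixing weights loosely inspired by the height example above.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """f(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical two-component mixture (heights in inches; numbers are illustrative).
weights = np.array([0.5, 0.5])      # mixing proportions pi_k
means   = np.array([70.0, 65.0])    # roughly 5'10" and 5'5"
vars_   = np.array([9.0, 7.0])

x = np.array([62.0, 67.5, 72.0])
mixture = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, vars_))
print(mixture)    # mixture density at each x
```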
• Gaussian mixture models can be used for a variety of use cases, including identifying customer segments, detecting fraudulent activity and clustering images.

• GMMs have a variety of real-world applications. Some of them are listed below :
a) Used for signal processing applications
b) Used for customer churn analysis
c) Used for language identification
d) Used in the video game industry
e) Genre classification of songs.

Expectation - Maximization

• In Gaussian mixture models, the expectation-maximization method is a powerful tool for estimating the parameters of a Gaussian mixture model. The expectation step is termed E and the maximization step is termed M.

• Expectation is used to find the Gaussian parameters which are used to represent each component of the Gaussian mixture model. Maximization is involved in determining whether new data points can be added or not.

• The Expectation-Maximization (EM) algorithm is used in maximum likelihood estimation where the problem involves two sets of random variables, of which one, X, is observable and the other, Z, is hidden.

• The goal of the algorithm is to find the parameter vector Φ that maximizes the likelihood of the observed values of X, L(Φ | X).

• But in cases where this is not feasible, we associate the extra hidden variables Z and express the underlying model using both, to maximize the likelihood of the joint distribution of X and Z, the complete likelihood L_c(Φ | X, Z).

EM Algorithm

• Expectation-Maximization (EM) is an iterative method used to find maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved, also called latent, variables.

• EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step.

• The parameters found in the M step are then used to start another E step, and the process is repeated until some criterion is satisfied. EM is frequently used for data clustering, for example in Gaussian mixtures.

• In the Expectation step, find the expected values of the latent variables (here you need to use the current parameter values). In the Maximization step, first plug in the expected values of the latent variables in the log-likelihood of the augmented data, then maximize this log-likelihood to re-evaluate the parameters.

• Expectation-Maximization (EM) is a technique used in point estimation. Given a set of observable variables X and unknown (latent) variables Z, we want to estimate parameters Φ in a model.

• The Expectation-Maximization (EM) algorithm is a maximum likelihood estimation procedure for statistical models when some of the variables in the model are not observed.

• The EM algorithm is an elegant and powerful method for finding the maximum likelihood of models with hidden variables. The key concept in the EM algorithm is that it iterates between the expectation step (E-step) and the maximization step (M-step) until convergence.

• In the E-step, the algorithm estimates the posterior distribution of the hidden variables Q given the observed data and the current parameter settings; and in the M-step the algorithm calculates the ML parameter settings with Q fixed.
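A minimal from-scratch sketch of these E and M steps for a two-component one-dimensional Gaussian mixture is shown below; the synthetic data, the initial parameter guesses and the fixed number of iterations are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 1-D data from two made-up Gaussians (the component labels are hidden from EM).
x = np.concatenate([rng.normal(65.0, 2.5, 300), rng.normal(70.0, 3.0, 300)])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial parameter guesses.
pi = np.array([0.5, 0.5])
mu = np.array([60.0, 75.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = np.vstack([p * gauss(x, m, v) for p, m, v in zip(pi, mu, var)])   # (2, N)
    resp = dens / dens.sum(axis=0, keepdims=True)

    # M-step: re-estimate mixing weights, means and variances with the responsibilities fixed.
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp @ x) / Nk
    var = np.array([(resp[k] * (x - mu[k]) ** 2).sum() / Nk[k] for k in range(2)])

print("weights:", pi, "means:", mu, "variances:", var)
```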
• At the end of each iteration the lower bound on the likelihood is optimized for the given parameter setting (M-step) and the likelihood is set to that bound (E-step), which guarantees an increase in the likelihood and convergence to a local maximum, or the global maximum if the likelihood function is unimodal.

• Generally, EM works best when the fraction of missing information is small and the dimensionality of the data is not too large. EM can require many iterations, and higher dimensionality can dramatically slow down the E-step.

• EM is useful for several reasons : conceptual simplicity, ease of implementation, and the fact that each iteration improves l(Φ). The rate of convergence on the first few steps is typically quite good, but can become excruciatingly slow as you approach local optima.

• Sometimes the M-step is a constrained maximization, which means that there are constraints on valid solutions not encoded in the function itself.

• Expectation maximization is an effective technique that is often used in data analysis to manage missing data. Indeed, expectation maximization overcomes some of the limitations of other techniques, such as mean substitution or regression substitution. These alternative techniques generate biased estimates and, specifically, underestimate the standard errors. Expectation maximization overcomes this problem.

3.6 Two Marks Questions with Answers

Q.1 What is unsupervised learning ?
Ans. : Unsupervised learning is learning from unlabeled data : the algorithm is given only inputs, without target outputs, and must discover structure in the data on its own, for example by clustering.

Q.2 What is semi-supervised learning ?
Ans. : Semi-supervised learning uses both labeled and unlabeled data to improve supervised learning.

Q.3 What is an ensemble method ?
Ans. : Ensemble methods are machine learning techniques that combine several base models in order to produce one optimal predictive model. They combine the insights obtained from multiple learning models to facilitate accurate and improved decisions.

Q.4 What is a cluster ?
Ans. : A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

Q.5 Explain clustering.
Ans. : Clustering is a process of partitioning a set of data into a set of meaningful subclasses. Every data point in a subclass shares a common trait. It helps a user understand the natural grouping or structure in a data set.

Q.6 What is bagging ?
Ans. : Bagging, also known as bootstrap aggregation, is an ensemble method that works by training multiple models independently and combining them later to produce a strong model.

Q.7 Define boosting.
Ans. : Boosting refers to a group of algorithms that utilize weighted averages to make weak learning algorithms stronger learning algorithms.

Q.8 What is the k-Nearest Neighbour method ?
Ans. : The k-Nearest Neighbour (KNN) method is a classical classification method that requires no training effort and critically depends on the quality of the distance measures among examples. The KNN classifier can use a distance function such as the Mahalanobis distance. A sample is classified according to the majority vote of its nearest k training samples in the feature space, where the distance of a sample to its neighbours is defined using a distance function.

Q.9 Which performance factors influence the KNN algorithm ?
Ans. : The performance of the KNN algorithm is influenced by three main factors :
1. The distance function or distance metric used to determine the nearest neighbours.
2. The decision rule used to derive a classification from the k-nearest neighbours.
3. The number of neighbours used to classify the new example.

Q.10 What is k-means clustering ?
Ans. : k-means clustering is a heuristic method in which each cluster is represented by the center of the cluster. The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low.

Q.11 List the properties of the K-Means algorithm.
Ans. :
1. There are always k clusters.
2. There is always at least one item in each cluster.
3. The clusters are non-hierarchical and they do not overlap.

Q.12 What is stacking ?
Ans. : Stacking, sometimes called stacked generalization, is an ensemble machine learning method that combines multiple heterogeneous base or component models via a meta-model.

Q.13 How do GMMs differ from K-means clustering ?
Ans. : GMMs and K-means are both clustering algorithms used for unsupervised learning tasks. However, the basic difference between them is that K-means is a distance-based clustering method while GMMs is a distribution-based clustering method.
