• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, that means that a one-unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1. And what does that mean ? I've never met anyone with any intuition for log odds.

Probabilistic Generative Models

• Generative models are a class of statistical models that generate new data instances. These models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation, modelling data points and distinguishing between classes using these probabilities.
• Generative models rely on the Bayes theorem to find the joint probability.
• Generative models describe how data is generated using probabilistic models. They predict P(y|x), the probability of y given x, by calculating P(x, y), the joint probability of x and y.

Naive Bayes

• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. They are highly scalable, requiring a number of parameters linear in the number of variables (features / predictors) in a learning problem.
• A Naive Bayes classifier is a program which predicts a class value given a set of attributes.
• For each known class value :
  1. Calculate probabilities for each attribute, conditional on the class value.
  2. Use the product rule to obtain a joint conditional probability for the attributes.
  3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
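The procedure above can be sketched in a few lines of Python. This is only a minimal illustration of the idea: the training tuples, attribute values and the use of add-one (Laplace) smoothing below are assumptions made for the example, not part of the original text.

```python
from collections import Counter

# Illustrative (hypothetical) training data: two categorical attributes and a class label.
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "yes", "no"]

n = len(y)
classes = Counter(y)                                            # class counts
priors = {c: classes[c] / n for c in classes}                   # P(class)
values = [set(row[j] for row in X) for j in range(len(X[0]))]   # value domain of each attribute

# counts[c][j][v] = number of training examples of class c whose attribute j equals v
counts = {c: [Counter() for _ in values] for c in classes}
for row, c in zip(X, y):
    for j, v in enumerate(row):
        counts[c][j][v] += 1

def predict(x):
    best, best_score = None, -1.0
    for c in classes:
        score = priors[c]                                       # start from the prior
        for j, v in enumerate(x):
            # P(attribute j = v | class c), with add-one (Laplace) smoothing
            score *= (counts[c][j][v] + 1) / (classes[c] + len(values[j]))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(("sunny", "mild")))                               # class with the highest posterior score
```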
Conditional Probability

• Let A and B be two events such that P(A) > 0. We denote by P(B|A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space, replacing the original S. From this, the definition is
  P(B|A) = P(A ∩ B) / P(A)   or   P(A ∩ B) = P(A) P(B|A)
• The notation P(B|A) is read "the probability of event B given event A". It is the probability of an event B given the occurrence of the event A.
• We say that the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs given that A has occurred. We call P(B|A) the conditional probability of B given A, i.e. the probability that B will occur given that A has occurred.
• Similarly, the conditional probability of an event A given B is
  P(A|B) = P(A ∩ B) / P(B)
• The probability P(A|B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, A ∩ B = φ and P(A|B) = 0.
• Another way to look at the conditional probability formula is :
  P(Second | First) = P(First choice and second choice) / P(First choice)
• Conditional probability is a defined quantity and cannot be proven.
• The key to solving conditional probability problems is to :
  1. Define the events.
  2. Express the given information and question in probability notation.
  3. Apply the formula.

Joint Probability

• A joint probability is a probability that measures the likelihood that two or more events will happen concurrently.
• If there are two independent events A and B, the probability that A and B will occur is found by multiplying the two probabilities. Thus, for two events A and B, the special rule of multiplication shown symbolically is :
  P(A and B) = P(A) P(B)
• The general rule of multiplication is used to find the joint probability that two events will occur. Symbolically, the general rule of multiplication is :
  P(A and B) = P(A) P(B|A)
• The probability P(A and B) is called the joint probability for two events A and B which intersect in the sample space. A Venn diagram readily shows that
  P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
  Equivalently : P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B)
• The probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram portrays outcomes that are mutually exclusive.

Bayes Theorem

• Bayes' theorem is a method to revise the probability of an event given additional information. Bayes' theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
• Bayes theorem gives a relation between P(A|B) and P(B|A).
• A prior probability is an initial probability value originally obtained before any additional information is obtained.
• A posterior probability is a probability value that has been revised by using additional information that is later obtained.
• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is another event. For any number k, with 1 ≤ k ≤ n, we have the formula :
  P(Bk|A) = P(A|Bk) P(Bk) / [ P(A|B1) P(B1) + P(A|B2) P(B2) + ... + P(A|Bn) P(Bn) ]
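A short numeric illustration of the partition form of Bayes' theorem above. The events and the probability values are hypothetical, chosen only to make the calculation concrete.

```python
# B1 and B2 partition the outcomes (say, two machines producing items) and
# A is the event that an item is defective. All numbers are illustrative.
priors = {"B1": 0.6, "B2": 0.4}            # P(Bk) : prior probabilities
likelihood = {"B1": 0.02, "B2": 0.05}      # P(A | Bk)

# Total probability of A, summing over the partition
p_a = sum(likelihood[b] * priors[b] for b in priors)

# Posterior P(Bk | A) = P(A | Bk) P(Bk) / sum_i P(A | Bi) P(Bi)
posterior = {b: likelihood[b] * priors[b] / p_a for b in priors}
print(posterior)                           # {'B1': 0.375, 'B2': 0.625}
```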
Difference between Generative and Discriminative Models

• Generative models can generate new data instances, whereas discriminative models discriminate between different kinds of data instances.
• A generative model revolves around the distribution of a dataset to return a probability for a given example, whereas a discriminative model makes predictions based on conditional probability and is used for either classification or regression.
• Generative models capture the joint probability p(X, Y), or just p(X) if there are no labels, whereas discriminative models capture the conditional probability p(Y|X).
• A generative model includes the distribution of the data itself and tells you how likely a given example is. A discriminative model ignores the question of whether a given instance is likely and just tells you how likely a label is to apply to the instance.
• Generative models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation, whereas the discriminative model is used particularly for supervised machine learning. Examples of discriminative models : logistic regression, SVMs.

Support Vector Machines

• Support Vector Machines (SVMs) are a set of supervised learning methods which learn from the dataset and are used for classification. SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
• An SVM is a kind of large-margin classifier : it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data.
• Given a set of training examples, each marked as belonging to one of two classes, an SVM algorithm builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible. New examples are then mapped into the same space and classified to belong to a class based on which side of the gap they fall on.
Fig. 2.4.1 : Two class problem
• Two class problems : Many decision boundaries can separate these two classes. Which one should we choose ? A perceptron learning rule can be used to find any decision boundary between class 1 and class 2.
• The line that maximizes the minimum margin is a good bet : among candidate boundaries, the one that maximizes the margin (B1) is better than the other (B2). The model class of "hyper-planes with a margin of m" has a low VC dimension if m is big.
• This maximum-margin separator is determined by a subset of the data points. Data points in this subset are called "support vectors". It will be useful computationally if only a small fraction of the data points are support vectors, because we use the support vectors to decide which side of the separator a test case is on.

Example of Bad Decision Boundaries

Fig. 2.4.2 : Bad decision boundary of SVM
• SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find the optimal hyperplane such that the expected generalization error is minimized. Instead of directly minimizing the empirical risk calculated from the training data, SVMs perform structural risk minimization to achieve good generalization.
• Because we do not know the probability distribution P, we minimize the empirical risk computed over a training dataset drawn from P. A low empirical risk over a small training set may not guarantee a low true risk. This general learning technique is called empirical risk minimization.
Fig. 2.4.3 : Empirical risk and confidence term versus complexity of the function set
• Properties of SVM include :
  2. They maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.
  3. Ability to handle large feature spaces.
  4. Overfitting can be controlled by the soft margin approach.
  5. When used in practice, SVM approaches frequently map the examples to a higher dimensional space and find margin maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space. The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher dimensional space to facilitate finding "good" linear decision boundaries in the modified space.

SVM Applications

• SVM has been used successfully in many real-world problems :
  1. Text (and hypertext) categorization
  2. Image classification
  3. Bioinformatics (protein classification, cancer classification)
  4. Hand-written character recognition
  5. Determination of SPAM email

Limitations of SVM

  1. It is sensitive to noise.
  2. The biggest limitation of SVM lies in the choice of the kernel.
  3. Another limitation is speed and size.
  4. The optimal design for multiclass SVM classifiers is also a research area.
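A short sketch of training a maximum-margin classifier of this kind, assuming scikit-learn is available. The synthetic two-class data, the RBF kernel and the value of C are illustrative choices for the example only.

```python
from sklearn import svm
import numpy as np

# Two synthetic, roughly separable clusters standing in for class 1 and class 2
rng = np.random.default_rng(0)
class1 = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
class2 = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X = np.vstack([class1, class2])
y = np.array([0] * 20 + [1] * 20)

clf = svm.SVC(kernel="rbf", C=1.0)    # non-linear kernel maps to a higher-dimensional space
clf.fit(X, y)

print(clf.support_vectors_.shape)     # only a subset of points become support vectors
print(clf.predict([[1.0, 1.0]]))      # side of the separator a new case falls on
```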
Soft Margin SVM

• For the very high dimensional problems common in text classification, sometimes the data are linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.
• What if the training set is not linearly separable ? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• A soft margin allows a few variables to cross into the margin or over the hyperplane, allowing misclassification. We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade-off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable :
  1. Slack variable = 0 : the point is on the margin or on the correct side of it.
  2. Slack variable > 0 : the point is inside the margin or on the wrong side of the hyperplane.
  3. C is the trade-off between the slack variable penalty and the margin.

Comparison of SVM and Neural Networks

• Support Vector Machine : the kernel maps to a very high dimensional space; the search space has a unique minimum; training is extremely efficient; very good accuracy in typical domains; kernel and cost are the two parameters to select.
• Neural network : hidden layers map to lower dimensional spaces; the search space has multiple local minima; very good accuracy in typical domains; requires the number of hidden units and layers to be chosen.

Example : For the following data points, identify the support vectors (if any), the slack variables on the correct side of the classifier (if any) and the slack variables on the wrong side of the classifier (if any). Mention which point will have maximum penalty and why.

Solution :
• Data points 1 and 5 will have maximum penalty.
• Margin (m) is the gap between the data points and the classifier boundary. The margin is the minimum distance of any sample to the decision boundary. If the hyperplane is in the canonical form, the margin can be measured by the length of the weight vector.
• Maximal margin classifier : a classifier in the family F that maximizes the margin. Maximizing the margin is good according to intuition and PAC theory. It implies that only support vectors matter; other training examples are ignorable.
• If the training set is not linearly separable, slack variables allow misclassification of difficult or noisy examples : a slack variable of 0 means the point is on the margin or on the correct side of it, a slack variable greater than 0 means the point is inside the margin or on the wrong side of the hyperplane, and C is the trade-off between the slack variable penalty and the margin.
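The role of the penalty parameter C described above can be illustrated with a small sketch, assuming scikit-learn is available. The noisy synthetic data and the particular values of C are illustrative only; the point is that a small C tolerates more margin violations (more slack), while a large C tries to classify every training point correctly.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, deliberately noisy two-class data (labels flip near the boundary)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=60) > 0).astype(int)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # a larger number of support vectors usually indicates a softer margin with more slack
    print(f"C={C}: support vectors = {len(clf.support_)}")
```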
Decision Tree

• A decision tree is a simple representation for classifying examples. Decision tree learning is one of the most successful techniques for supervised classification learning.
• In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. Learned trees can also be represented as sets of if-then rules to improve human readability.
• A decision tree has two kinds of nodes :
  1. Each leaf node has a class label, determined by majority vote of the training examples reaching that leaf.
  2. Each internal node is a question on features. It branches out according to the answers.
• Decision tree learning is a method for approximating discrete-valued target functions. The learned function is represented by a decision tree. A learned decision tree can also be re-represented as a set of if-then rules.
• Decision tree learning is one of the most widely used and practical methods for inductive inference.
• It is robust to noisy data and capable of learning disjunctive expressions.
• Decision tree learning methods search a completely expressive hypothesis space.

Decision Tree Representation

• Goal : Build a decision tree for classifying examples as positive or negative instances of a concept, using supervised learning, batch processing of training examples and a preference bias.
• A decision tree is a tree where :
  a. Each non-leaf node has associated with it an attribute (feature).
  b. Each leaf node has associated with it a classification (+ or −).
  c. Each arc has associated with it one of the possible values of the attribute at the node from which the arc is directed.
• An internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions.
• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules.

Decision Tree Algorithm

• To generate a decision tree from the training tuples of data partition D.
Input :
  1. Data partition (D)
  2. Attribute list
  3. Attribute selection method
Algorithm :
  1. Create a node N;
  2. If tuples in D are all of the same class C then
  3.   Return N as a leaf node labeled with the class C;
  4. If attribute list is empty then return N as a leaf node labeled with the majority class in D;
  5. Apply attribute selection method(D, attribute list) to find the "best" splitting criterion;
  6. Label node N with the splitting criterion;
  7. If the splitting attribute is discrete-valued and multiway splits are allowed
  8.   Then attribute list ← attribute list − splitting attribute;
  9. For each outcome j of the splitting criterion
  10.   Let Dj be the set of data tuples in D satisfying outcome j;
  11.   If Dj is empty then attach a leaf labeled with the majority class in D to node N;
  12.   Else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
  13. End of for loop
  14. Return N;
• Decision tree generation consists of two phases : tree construction and pruning.
• In the tree construction phase, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
• In the tree pruning phase, branches that reflect noise or outliers are identified and removed.
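The recursive procedure above can be sketched compactly in Python for categorical attributes. This is only an illustration under stated assumptions: the attribute selection method here counts misclassifications of a single-attribute split rather than using a specific criterion such as information gain, and the training tuples and attribute names are hypothetical.

```python
from collections import Counter

def majority_class(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]

def generate_tree(rows, attributes):
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:                      # all tuples of the same class -> leaf
        return labels[0]
    if not attributes:                             # attribute list empty -> majority-class leaf
        return majority_class(rows)
    # "best" splitting attribute: the one whose single-attribute split misclassifies least
    def split_error(a):
        err = 0
        for v in set(x[a] for x, _ in rows):
            subset = [(x, c) for x, c in rows if x[a] == v]
            err += sum(1 for _, c in subset if c != majority_class(subset))
        return err
    best = min(attributes, key=split_error)
    tree = {"attribute": best, "branches": {}}
    for v in set(x[best] for x, _ in rows):        # one branch per outcome of the split
        subset = [(x, c) for x, c in rows if x[best] == v]
        tree["branches"][v] = generate_tree(subset, [a for a in attributes if a != best])
    return tree

# Hypothetical training tuples: (attribute dictionary, class label)
data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "sunny", "windy": "yes"}, "stay"),
        ({"outlook": "rainy", "windy": "no"}, "stay"),
        ({"outlook": "rainy", "windy": "yes"}, "stay")]
print(generate_tree(data, ["outlook", "windy"]))
```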
• There are various paradigms that are used for learning binary classifiers, which include :
  1. Decision trees
  2. Neural networks
  3. Bayesian classification
  4. Support vector machines
Fig. 2.5.1 : Decision tree

Example : Using the following feature tree, write decision rules for the majority class.

Solution :
• Left side : A feature tree combining two Boolean features. Each internal node or split is labelled with a feature, and each edge emanating from a split is labelled with a feature value. Each leaf therefore corresponds to a unique combination of feature values. Also indicated in each leaf is the class distribution derived from the training set.
• Right side : A feature tree partitions the instance space into rectangular regions, one for each leaf.
Fig. 2.5.3 : Left side - a feature tree with training set class distribution in the leaves; right side - a decision tree obtained using the majority class decision rule
• The leaves of the tree in the above figure could be labelled, from left to right, as ham - spam - spam, employing a simple decision rule called majority class.

Appropriate Problems for Decision Tree Learning

• Decision tree learning is generally best suited to problems with the following characteristics :
  1. Instances are represented by attribute-value pairs : a fixed set of attributes, where each attribute takes a small number of disjoint possible values.
  2. The target function has discrete output values. Decision tree learning is appropriate for a Boolean classification, but it easily extends to learning functions with more than two possible output values.
  3. Disjunctive descriptions may be required. Decision trees naturally represent disjunctive expressions.
  4. The training data may contain errors. Decision tree learning methods are robust to errors, both errors in the classifications of the training examples and errors in the attribute values that describe these examples.
  5. The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values.
  6. Decision tree learning has been applied to problems such as learning to classify.

Advantages and Disadvantages of Decision Tree

Advantages :
  1. Rules are simple and easy to understand.
  2. Decision trees can handle both nominal and numerical attributes.
  3. Decision trees are capable of handling datasets that may have errors.
  4. Decision trees are capable of handling datasets that may have missing values.
  5. Decision trees are considered to be a nonparametric method.
  6. Decision trees are self-explanatory.

Disadvantages :
  1. Most of the algorithms require that the target attribute has only discrete values.
  2. Some problems are difficult to solve, like XOR.
  3. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
  4. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
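As a brief end-to-end illustration of decision tree learning, the sketch below fits a depth-limited tree and prints it as if-then style rules, assuming scikit-learn is available. The Iris dataset, the entropy criterion and max_depth=3 are illustrative choices standing in for any attribute-value dataset, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth acts as a simple pre-pruning control against overfitting
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))   # accuracy on held-out tuples
print(export_text(clf))            # the learned tree printed as if-then style rules
```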
Random Forests

• Random forest is a famous supervised machine learning algorithm. It may be used for both classification and regression problems in ML. It is primarily based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to enhance the overall performance of the model.
• "Random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." As the name indicates, instead of relying on one decision tree, the random forest takes the prediction from each tree and, primarily based on the majority votes of predictions, it predicts the final output.
• The greater number of trees in the forest results in better accuracy and prevents the problem of overfitting.

How Does the Random Forest Algorithm Work ?

• Random forest works in two phases : the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase. The working procedure may be explained in the steps and diagram below :
  Step 1 : Select random K data points from the training set.
  Step 2 : Build the decision trees associated with the selected data points (subsets).
  Step 3 : Choose the number N of decision trees which we want to build.
  Step 4 : Repeat steps 1 and 2.
  Step 5 : For new data points, find the predictions of each decision tree and assign the new data points to the category that wins the majority of votes.
• The working of the algorithm may be better understood by the example below.
• Example : Suppose there is a dataset that contains multiple fruit images. This dataset is given to the random forest classifier. The dataset is divided into subsets and given to every decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of results, the random forest classifier predicts the final decision. Consider the picture below :
Fig. 2.6.1 : Example of random forest
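The two-phase procedure above can be sketched with scikit-learn's RandomForestClassifier, assuming scikit-learn is available. The synthetic dataset generated below stands in for the fruit-image example, and n_estimators plays the role of the number N of decision trees; all parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the fruit-image dataset in the example
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: build N trees, each grown on a bootstrap subset of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Phase 2: a new record is assigned to the class that wins the majority of the trees' votes
print(forest.predict(X_test[:5]))
print(forest.score(X_test, y_test))
```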
A "square" is between a data point and the regression line or ‘Ans. : Least squares is a statis! minimizing the sum of square determined by squaring the distance mean value of the data set. Q.2 What is linear Discriminant function ? ‘Ans. : LDA is a supervised learning algorithm, which means that it requires a labelled training set of data points in order to learn the Linear Discriminant function. Q.3 What Is a support vector in SVM ? ‘Ans. : Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Q.4 What is Support Vector Machines ? ‘Ans. : A Support Vector Machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems. After giving an SVM model sets of labeled training data for each category, they're able to categorize new text. Q5 Define logistic regression. Ans. + Logistic regression is supervised learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables. Q6 List out types of machine learning. Ans. : Types of machine learning are su ° ipervised, semi-supervised, fised and reinforcement learning. P ee Q.7 What is Random forest 7 oa + Random forest is an ensemble learning technique that combines multiple cision trees, implementing the bagging method and results in a robust model with low variance. Q.8 What are the five popular algorithms of machine learning 7 Ans. : Popular algorithms are Decision Trees, Ne i Pop , Neural Networks (bi ation), Probabilistic networks, Nearest Neighbor and Support vector ae ee TECHNICAL PUBLICATIONS® - an up-thrust for knowledge machine Learning 2-31 Supervised Learning qa What is the function of ‘Supervised Learning’ 7 ans.: Functions of ‘Supervised Learning’ are Classifications, Speech recognition, regression, Predict time series and Annotate strings. q10 What are the advantages of Naive Bayes 7 ans. : In Naive Bayes classifier will converge quicker than discriminative models like Iogistic regression, so you need less training data. The main advantage is that it can't eam interactions between features. a1 What is regression ? Ans, : Regression is a method to determine the statistical relationship between a dependent variable and one or more independent variables. iz Explain linear and non-linear regression model. Ans. : In linear regression models, the dependence of the response on the regressors is defined by a linear function, which makes their statistical analysis mathematically tractable. On the other hand, in nonlinear regression models, this dependence is defined by a nonlinear function, hence the mathematical difficulty in their analysis. 13 What is regression analysis used for 7 ans.: Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) and independent variable (s) (predictor). This technique is used for forecasting, time series modelling and finding the causal effect relationship between the variables. Q14 List two properties of logistic regression. Ans. = and 1. The dependent variable in logistic regression follows Bernoulli Distribution. 2. Estimation is done through maximum likelihood. Q15 What is the goal of logistic regression ? Ans. The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model. 
To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable.

Q.16 Define supervised learning.
Ans. : Supervised learning is learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs are usually provided by an external teacher.
