MACHINE LEARNING
CSE -- B.Tech. IV-Year II-Sem -- JNTU-Kakinada

CONTENTS
Introduction to the Subject
Syllabus as per R16 Curriculum
Unit-wise Important Questions
MID-I & II (Objective Type & Essay Questions with Key)
Model Question Papers with Solutions (As per the New External Exam Pattern): Model Paper-I, Model Paper-II, Model Paper-III, Model Paper-IV

UNIT-WISE SHORT & ESSAY QUESTIONS WITH SOLUTIONS
UNIT-I: The Ingredients of Machine Learning, Tasks, Binary Classification and Related Tasks (Q1-Q30)
  Part-A: Short Questions with Solutions (Q1-Q10); Part-B: Essay Questions with Solutions (Q11-Q30)
  1.1 The Ingredients of Machine Learning, Tasks
      1.1.1 The Problems that can be Solved with Machine Learning
      1.1.2 Models: The Output of Machine Learning
      1.1.3 Features: The Workhorses of Machine Learning
  1.2 Binary Classification and Related Tasks
      1.2.1 Classification; 1.2.2 Scoring and Ranking; 1.2.3 Class Probability Estimation
UNIT-II: Beyond Binary Classification, Concept Learning
  Part-A: Short Questions with Solutions; Part-B: Essay Questions with Solutions
  2.1 Beyond Binary Classification: 2.1.1 Handling More Than Two Classes; 2.1.2 Regression; 2.1.3 Unsupervised and Descriptive Learning
  2.2 Concept Learning: 2.2.1 The Hypothesis Space; 2.2.2 Paths Through the Hypothesis Space; 2.2.3 Beyond Conjunctive Concepts
UNIT-III: Tree Models, Rule Models
  3.1 Tree Models: 3.1.1 Decision Trees; 3.1.2 Ranking and Probability Estimation Trees; 3.1.3 Tree Learning as Variance Reduction
UNIT-IV: Linear Models, Distance Based Models
  4.2 Distance Based Models: 4.2.1 Introduction; 4.2.2 Neighbours and Exemplars; 4.2.3 Nearest-Neighbour Classification; 4.2.4 Distance Based Clustering; 4.2.5 Hierarchical Clustering
UNIT-V: Probabilistic Models, Features and Model Ensembles
  5.1 Probabilistic Models: 5.1.1 The Normal Distribution and its Geometric Interpretations; 5.1.2 Probabilistic Models for Categorical Data; 5.1.3 Discriminative Learning by Optimising Conditional Likelihood; 5.1.4 Probabilistic Models with Hidden Variables
  5.2 Features: 5.2.1 Kinds of Features; 5.2.2 Feature Transformations; 5.2.3 Feature Construction and Selection

INTRODUCTION TO THE SUBJECT
Machine Learning, offered in B.Tech IV-Year II-Sem (CSE) of JNTU-Kakinada, is a core subject in the ever-expanding field of Computer Science Engineering. AI systems possess some sort of mechanical learning capability, which is referred to as 'machine learning'. Various methods of machine learning are available; some of them are inductive learning, Artificial Neural Networks (ANN) and genetic algorithms. Typically, machine learning is about driving computers to program themselves: if programming is considered automation, then machine learning is automating the procedure of automation. Writing software code can be a tedious task, and due to the lack of good software developers it becomes even more difficult, so a universal solution is to make the data do the work instead of people, thereby making programs scalable through machine learning. The material below summarises the subject and will be helpful to plan and score good marks in the end examinations.

SYLLABUS (as per R16 Curriculum)
UNIT-I: The ingredients of machine learning -- Tasks: the problems that can be solved with machine learning; Models: the output of machine learning; Features: the workhorses of machine learning. Binary classification and related tasks: Classification, Scoring and ranking, Class probability estimation.
UNIT-II: Beyond binary classification: Handling more than two classes, Regression, Unsupervised and descriptive learning. Concept learning: The hypothesis space, Paths through the hypothesis space, Beyond conjunctive concepts.
UNIT-III: Tree models: Decision trees, Ranking and probability estimation trees, Tree learning as variance reduction. Rule models: Learning ordered rule lists, Learning unordered rule sets, Descriptive rule learning, First-order rule learning.
UNIT-IV: Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for linear classifiers, Support vector machines, Obtaining probabilities from linear classifiers, Going beyond linearity with kernel methods. Distance-based models: Introduction, Neighbours and exemplars, Nearest-neighbour classification, Distance-based clustering, Hierarchical clustering.
UNIT-V: Probabilistic models: The normal distribution and its geometric interpretations, Probabilistic models for categorical data, Discriminative learning by optimising conditional likelihood, Probabilistic models with hidden variables. Features: Kinds of features, Feature transformations, Feature construction and selection.

UNIT-WISE IMPORTANT QUESTIONS

Unit-I
Q1. What is machine learning? Explain in brief. (Refer Unit-I, Q11)
Q2. Discuss the various ingredients of machine learning. (Refer Unit-I, Q12)
Q3. Write about looking for structure. (Refer Unit-I, Q14)
Q4. Define model. Explain various types of models. (Refer Unit-I, Q16)
Q5. Write about features. Illustrate the uses of features. (Refer Unit-I, Q21)

Unit-II
Q1. Write about handling more than two classes. (Refer Unit-II, Q9)

Unit-III
Q1. Give an introduction about tree models. (Refer Unit-III, Q9)
Q2. Explain in detail about decision trees. (Refer Unit-III, Q10)
Q3. Write about ranking and probability estimation trees. (Refer Unit-III, Q11)
Q4. Discuss about regression trees. (Refer Unit-III, Q14)
Q5. Explain the process of learning clustering trees. (Refer Unit-III, Q15)
Q6. Give a brief introduction about rule models. (Refer Unit-III, Q16)
Q7. Explain about learning ordered rule lists. (Refer Unit-III, Q17)

Unit-IV
Q1. Give a brief introduction on linear models and the least-squares method. (Refer Unit-IV, Q11)
Q2. Discuss about multivariate linear regression. (Refer Unit-IV, Q12)
Q3. Explain about the perceptron. (Refer Unit-IV, Q14)
Q4. Write in detail about support vector machines. (Refer Unit-IV, Q15)
Q5. Give a brief introduction on distance-based models. (Refer Unit-IV, Q19)
Q6. Explain about neighbours and exemplars. (Refer Unit-IV, Q20)
Q7. Discuss about nearest-neighbour classification. (Refer Unit-IV, Q21)
Q8. Write about distance-based clustering. (Refer Unit-IV, Q22)
Q9. Explain about the K-means algorithm. (Refer Unit-IV, Q23)
Q10. Discuss about hierarchical clustering. (Refer Unit-IV)

Unit-V
Q1. Explain about probabilistic models. (Refer Unit-V, Q11)
Q2. Write about the probabilistic models for categorical data. (Refer Unit-V, Q13)
Q3. Write the usage of the naive Bayes model for classification. (Refer Unit-V, Q14)
Q4. Discuss about training a naive Bayes model. (Refer Unit-V, Q15)
Q5. Explain about kinds of features. (Refer Unit-V, Q19)

Unit-VI
Q1. Discuss in brief about dimensionality reduction. (Refer Unit-VI, Q11)
Q2. Write short notes on reducing data dimensionality. (Refer Unit-VI, Q12)
Q3. Write about artificial neural networks. (Refer Unit-VI, Q14)
Q4. Define neural network and discuss its architecture. (Refer Unit-VI, Q16)

MID-I (Units 1, 2 & 3) -- Objective Type & Essay Questions with Key

Objective Type Questions
1. The AI systems possess some sort of mechanical learning capability, which is referred to as ____: (a) Machine learning (b) Artificial intelligence (c) Neural networks (d) Cloud computing
2. ____ define the language in which relevant objects in the domain are described: (a) Features (b) Task (c) Models (d) None of the above
3. ____ is an abstract representation of the problem that is to be solved: (a) Features (b) Task (c) Models (d) None of the above
4. ____ is the output of a machine learning algorithm applied to training data: (a) Features (b) Task (c) Models (d) None of the above
5. The ____ is developed directly in instance space by using geometric concepts like lines, planes and distances: (a) Geometric model (b) Probabilistic model (c) Logical model (d) None of the above
6. ____: (a) Binary classification (b) Nearest-neighbour classifier (c) Linear classifier (d) Naive Bayes classifier
7. ____ is defined as a total order on a set of instances: (a) Scoring (b) Classifying (c) Deriving (d) Ranking
8. The performance of a classifier for k classes can be assessed through a ____ contingency table: (a) k-by-k (b) m-by-n (d) x-by-x
9. To construct a multi-class classifier, two-class models called ____ can be trained and combined: (a) Bayes classifiers (b) predictive classifiers (c) linear classifiers (d) hypothesis classifiers
10. The concepts in binary classification are completely ____: (a) unary (b) binary (c) linear (d) none of the above
11. ____ can be defined as a mapping f: X -> R: (a) clustering (b) classification (c) regression (d) concept learning
12. In ____ learning, the desired or target outputs are not distributed across the network.
13. A concept which adds all the implicitly understood conditions is called a ____: (a) open concept (b) closed concept (c) internal concept (d) external concept
14. The ____ is a conjunction of clauses: (a) LGG (b) CNF (c) EBF (d) DBF
15. The ____ of a set of instances is the least upper bound of the set of generalisations that can be learned from the data: (a) LGG (b) CNF (c) EBF (d) DBF
16. A ____ tree is a feature tree that has both positive and negative nodes: (a) Model (b) Clustering (c) Decision (d) Regression
17. ____ is called inductive logic programming: (a) Tree-based learning (b) Association rule learning (c) First-order rule learning (d) Rule learning
18. A ____ tree can be turned into a ranker by ordering its leaves on non-increasing empirical probabilities: (a) Decision (b) Classification (c) Feature (d) Rule

Essay Questions
1. What is machine learning? Explain in brief. (Refer Unit-I, Q11)
2. Discuss various ingredients of machine learning. (Refer Unit-I, Q12)
3. Discuss the problems that can be solved with machine learning. (Refer Unit-I, Q13)
4. Define model. Explain various types of models. (Refer Unit-I, Q16)
5. Write about geometric models.
6. Write about features. Illustrate the uses of features.
7. Explain in detail about classification.
8. How is ranking performance assessed and visualised?
9. How are rankers turned into classifiers?

MID-II (Units 4, 5 & 6) -- Objective Type & Essay Questions with Key

Objective Type Questions
1. ____ models are inherently non-numerical and handle numeric features only through thresholds that convert them into multiple intervals: (a) Distance-based models (b) Linear models
2. The ____ are the points in proximity, with respect to the distance measure, represented through exemplars: (a) Neighbours (b) Exemplars (c) Centroids (d) Medoids
3. The ____ is the arithmetic mean that minimises the squared Euclidean distance to the other points: (a) Neighbours (b) Exemplars (c) Centroids (d) Medoids
4. ____ are either centroids or medoids: (a) Neighbours (b) Exemplars (c) Centroids (d) Medoids
5. ____ associate random variables and probability distributions to model an event or phenomenon: (a) Probabilistic model (b) Naive Bayes model (c) Gaussian mixture model (d) Logistic model
6. The ____ correction is a smoothing operation in the Bernoulli distribution: (a) Error (b) Laplace (c) Divisive (d) Scale
7. ____ refers to supervised methods that take the class labels into account: (a) Thresholding (b) Normalisation (c) Calibration (d) Discretisation
8. ____ is a technique similar to bagging; it uses a different practical technique than bootstrap sampling for creating diverse training sets: (a) Learning (b) Upgrading (c) Boosting (d) Univariate
9. The concept called ____ is modelled as a binary random variable whose probability of success is fixed over every independent trial: (a) Bernoulli trial (b) Maximum likelihood (c) Recalibrated likelihood (d) Naive Bayes
10. The purpose of ____ is to improve the utility of a feature by adding, changing or removing information: (a) Feature validation (b) Feature adaption
11. The combinations of models together are called ____: (a) Model building (b) Model ensembles (c) Dimensionality reduction (d) Maximation
12. A ____ network consists of an input layer, an output layer and one or more hidden layers: (a) Layered (b) Single-layer feed-forward (c) Multi-layer feed-forward (d) Recurrent network
13. The activation function employed in these networks is a ____ function: (a) Activation (b) Simulation (c) Common (d) Sigmoid
14. The ____ network is the most well-known and widely used technique for training artificial neural networks: (a) Single layer (b) Multi layer (c) Back propagation (d) None of the above
15. The attribute whose values are real numbers is called a ____ attribute: (a) Discrete (b) Continuous (c) Wholesome (d) Recursive
16. ____ attributes possess a countably infinite or finite set of values: (a) Discrete (b) Continuous (c) Wholesome (d) Recursive
17. The difference between the true value and the recorded value is called ____: (a) Error (b) Measurement error (c) Collection error (d) None of the above

Essay Questions
1. Give a brief introduction on linear models and the least-squares method. (Refer Unit-IV, Q11)
2. Discuss about multivariate linear regression. (Refer Unit-IV, Q12)
3. Explain in brief about regularised regression and the usage of least-squares regression for classification. (Refer Unit-IV, Q13)
4. Explain about the perceptron. (Refer Unit-IV, Q14)
5. Write in detail about support vector machines. (Refer Unit-IV, Q15)
6. Write short notes on soft margin SVM. (Refer Unit-IV, Q16)
7. Give a brief introduction on distance-based models. (Refer Unit-IV, Q19)
8. Explain about neighbours and exemplars. (Refer Unit-IV, Q20)
9. Discuss about nearest-neighbour classification. (Refer Unit-IV, Q21)
10. Explain about the K-means algorithm. (Refer Unit-IV, Q23)
11. Explain about probabilistic models. (Refer Unit-V, Q11)
12. Write about the probabilistic models for categorical data. (Refer Unit-V, Q13)
13. Write the usage of the naive Bayes model for classification. (Refer Unit-V, Q14)
14. Write about Gaussian mixture models. (Refer Unit-V, Q18)
15. Explain about kinds of features. (Refer Unit-V, Q19)
16. Discuss about feature transformations. (Refer Unit-V, Q21)
17. Write about thresholding and discretisation. (Refer Unit-V, Q22)
18. Explain in brief about model ensembles. (Refer Unit-V, Q27)
19. What is boosting? Explain. (Refer Unit-V, Q29)
20. Write short notes on boosted rule learning. (Refer Unit-V, Q30)
21. Discuss in brief about dimensionality reduction. (Refer Unit-VI, Q11)
22. Write short notes on reducing data dimensionality. (Refer Unit-VI, Q12)
23. Explain in detail about feature subset selection. (Refer Unit-VI, Q13)
24. Write about artificial neural networks. (Refer Unit-VI, Q14)
25. Write short notes on artificial neural network architectures. (Refer Unit-VI, Q18)
26. What are the applications of neural networks? (Refer Unit-VI, Q18)
27. Write about multilayer networks. (Refer Unit-VI, Q22)
28. Draw the network architecture of a multilayer feed-forward neural network and explain its functionality. (Refer Unit-VI, Q23)
29. Explain how the travelling salesman problem is solved by neural networks.

MODEL QUESTION PAPERS WITH SOLUTIONS

MODEL PAPER-I (R16)
Jawaharlal Nehru Technological University Kakinada
B.Tech. IV Year II Semester Examination -- MACHINE LEARNING (Computer Science Engineering)
Time: 3 Hours  Max. Marks: 70
Note: 1. Question paper consists of two parts (Part-A and Part-B). 2. Answering Part-A is compulsory. 3. Answer any four questions from Part-B.
Solutions
PART-A
1. (a) What is machine learning? (Unit-I, Q1)
   (b) Define binary classification. (Unit-II, Q1)
   (c) What is a tree model? (Unit-III, Q1)
   (d) Define linear models. (Unit-IV, Q1)
   (e) Define probabilistic model. (Unit-V, Q1)
   (f) Define dimensionality reduction. (Unit-VI, Q1)
PART-B
2. (a) Discuss various ingredients of machine learning. (Unit-I, Q12)
   (b) Write about feature construction and transformation. (Unit-I, Q22)
3. (a) Write about handling more than two classes. (Unit-II, Q9)
   (b) Define concept learning. Write about the hypothesis space. (Unit-II, Q14)
4. (a) Explain in detail about decision trees. (Unit-III, Q10)
   (b) Discuss about regression trees. (Unit-III, Q14)
5. (a) Discuss about multivariate linear regression. (Unit-IV, Q12)
   (b) Explain about neighbours and exemplars. (Unit-IV, Q20)
6. (a) Write the usage of the naive Bayes model for classification. (Unit-V, Q14)
   (b) Explain about kinds of features. (Unit-V, Q19)
7. (a) Write about artificial neural networks. (Unit-VI, Q14)
   (b) Explain how the travelling salesman problem is solved by neural networks.

MODEL PAPER-II (R16)
Jawaharlal Nehru Technological University Kakinada
B.Tech. IV Year II Semester Examination -- MACHINE LEARNING (Computer Science Engineering)
Time: 3 Hours  Max. Marks: 70
Note: 1. Question paper consists of two parts (Part-A and Part-B). 2. Answering Part-A is compulsory. 3. Answer any four questions from Part-B.
PART-A
1. (a) List out the ingredients of machine learning. (Unit-I, Q2)
   (b) Define regression. (Unit-II, Q2)
   (c) Define decision tree. (Unit-III, Q2)
   (d) Write in brief about multivariate linear regression.
   (e) Write in short about Gaussian mixture models.

MODEL PAPER-III (R16)
Jawaharlal Nehru Technological University Kakinada
B.Tech. IV Year II Semester Examination -- MACHINE LEARNING (Computer Science Engineering)
Time: 3 Hours  Max. Marks: 70
Note: 1. Question paper consists of two parts (Part-A and Part-B). 2. Answering Part-A is compulsory. 3. Answer any four questions from Part-B.
Solutions
PART-A
1. (a) Define probabilistic model. (Unit-I, Q6)
   (b) What is unsupervised learning? (Unit-II, Q3)
   (c) Write in brief about regression trees. (Unit-III, Q3)
   (d) Define perceptron. (Unit-IV, Q5)
   (e) What are features? (Unit-V, Q4)
   (f) Define artificial neural networks. (Unit-VI, Q3)
PART-B
2. (a) Write about features. Illustrate the uses of features. (Unit-I, Q21)
   (b) Discuss about scoring and ranking. (Unit-I, Q26)
3. (a) (Unit-II, Q10)  (b) (Unit-II, Q17)
4. (a) (Unit-III, Q15)  (b) (Unit-III, Q17)

MODEL PAPER-IV (R16)
Jawaharlal Nehru Technological University Kakinada
B.Tech. IV Year II Semester Examination -- MACHINE LEARNING (Computer Science Engineering)
Time: 3 Hours  Max. Marks: 70
Note: 1. Question paper consists of two parts (Part-A and Part-B). 2. Answering Part-A is compulsory. 3. Answer any four questions from Part-B.
Solutions
PART-A
1. (a) Define classification. (Unit-I, Q10)
   (b) What is descriptive learning? (Unit-II, Q8)
   (c) Write in short about clustering trees.
   (d) Illustrate the k-means algorithm.
   (e) Write about normalisation.

UNIT-1: THE INGREDIENTS OF MACHINE LEARNING, TASKS, BINARY CLASSIFICATION AND RELATED TASKS

PART-A: SHORT QUESTIONS WITH SOLUTIONS

Q1. What is machine learning?
Ans: (Model Paper-I, Q1(a))
Artificial intelligent systems have a learning capability, as humans do, but it is not the same as the human learning capability: humans learn at a far higher level than AI systems. AI systems possess some sort of mechanical learning capability, which is referred to as 'machine learning'. Various methods of machine learning are available; some of them are inductive learning, Artificial Neural Networks (ANN) and genetic algorithms.
Typically, machine learning is about driving computers to program themselves. If programming is considered automation, then machine learning is automating the procedure of automation. Writing software code can be a tedious task, and due to the lack of good software developers it becomes even more difficult. A universal solution is to make the data do the work instead of people, thereby making programs scalable through machine learning.

Q2. List out the ingredients of machine learning.
Ans: (Model Paper-II, Q1(a))
The three ingredients of machine learning are as follows.
1. Features: Features define the language in which relevant objects in the domain are described, whether they are e-mails or complex organic molecules. They play an important role in machine learning because, once a suitable feature representation is obtained, there is no need to go back to the domain objects themselves.
2. Task: A task is an abstract representation of the problem to be solved concerning those domain objects. The most common form is to classify them into two or more classes. Tasks can be represented as mappings from data points to outputs.
3. Model: A model is the output of a machine learning algorithm applied to training data; it is what is learnt from the data in order to solve a given task.
Other type of domains are set of integers, booleans and arbitrary finite sets. > Q9. List the uses of features. Ans: 1, A-common use of features is to focus on particular area of instance space. Let f be a feature that counts number of ccurrence of word ‘writing’ in an e-mail. Let' x be arbitary e-mail then condition f [x] = 0 will select e-mails that does not have word ‘viagara’, f(x) #0 or f{x>0) selects the e-mails that have the word and f (1x) >? seleets the e-mails ‘that contain word twice or more. These type of conditions are called binary splits since they spit the instance space ino. ‘two groups. The non binary splits are even possible. a rc UNIT-1 (The Ingredients of Machine Leaming, Tasks, Binary Classification and Related Tasks ant ei Model Paper-tV, Q2(a) Machine Learning The artificial intelligent systems have the learning capability as humans have, But, the leaming capability of Al systems is not the same as that of human learning capability ie., the human capability of learning is higher than the AI systems. The Al systems possess some sort of mechanical learning capabilities, which are referred to as ‘machine learning”. Various methods of ‘machine learning are available. Some of them are, inductive learning, Artificial Neural Networks (ANN) and genetic algorithms, Typically, the machine leaming is driving computers to program themselves. Thus, when programming is considered to te an automation, then it could be said that machine leaning is automating the procedure to automation. Writing a software code could be a tedious task and due to lack of good software developers it.becomes even more difficult. So, an universal solution could be making the data work instead of people, Thereby, making the program scalable through machine learning. Example Machine learning is a process of making computers to learn about solving a specific task. Asimple model of machine learning is as follows, 1. Consider a set of examples of a particular task. 2. Assign a teacher to label each example with a label. 3 Develop a program that predicts the label of each example. Learning is better if can be predicted the labels of examples can be predicted easily. For instance, if one wants to learn and identify a particular animal from set of animals ic. to learn whether an animal is cat or dog, Them the learning model is as follows, (2) Teacher is provided with set of animal. (©) Teacher labels each animal, (Each Q12. Discuss various | if machine learning. An: iy 7 Model Paper, G22) Ingredients of Machine Learning. j ‘The three ingredients of machine learning are as follows. Gis. Discuss the problems that can be solved with machine learning, Ans: Binary classification in spam e-mail is a common task ifference would be to consider the cl an two classes for example variation between different types of ham e-mails such as work related e-mails and private massages. This can be considered fication tasks, i.e., to as combination of two binary classi differentiate between ham and spam and to differentiate between work related and private ones among ham e-mails. But with this approach some potential information may be lost because some of the spain e-mails tend to look like the private messages rather tian work related. So, itis beneficial to view multi class classitication as machine learning task in its own way. This is not a big deal because the model to connect the class to feature is yet t0 be learned. 
It is found to be natural to abandon the notion of discrete classes together and rather predict a real number. {t might be useful to hold an assessment f iheoming ‘e-mails urgency on sliding scale. This task is called as regression and it contains learning of real valued function from training example that are labelled with true function values. For example this type of training set can be developed by selecting number of e-mails from inbox randomly and labelting them through urgency score oa scale of 0 to 10. This involves selection of class of function and development of a function to decrease the difference between prediction and true function values, In the task of regression the notion of decision boundary does not have any meaning. There fore, other ways must be determined for expressing models “dence in its real-valued predictions. ‘The classification and regression consider the availability, of training set of examples that are labelled with true classes or function. values. By providing the true labels for data set proves to be expensive and labour intensive. It is possible to lear to differentiate beween spam and ham as well as wor clustering. | called unsupervised leaning where as learning from labelled data is called supervised learning. A clustering algorithm works bby assessing the similarity between instances and then placing the instances in one cluster and dissimilar instances in MACHINE LEARNING [JNTU-KAKINADA} Other than these, 26 suggestions are also being shown. Such type of associations are shown in data mining algorithms which focus on frequently occuring items together. They work by considering the items that occur minimum number of items, Other type of associations are found by considering multiple items in shopping cart. There are even other type of associations that can be leamed and exploited such as correlations in between real valued variables. G14. Write about looking for structure. Ans: Looking For Structure “The pattems are manifestation ofthe underlying structure in data like other machines learning models. This structure with ‘sometimes take the from of single hidden or latent variable lke unobservable but neverthless explanatory quantities in physics like energy. Consider the below matrix, 1010 0222 ooot 1232 1011 ¥ 0223 Let these ratings be six different people on scale of to 3 of four different films such as The Shawshank Redemption, ‘The Godfather, The Big Lebouski and The Usual Suspects. The Godfather is the popular among four with average rating 1.5 and The Shamshank Redemption is the least appreciated with an average rating of 0.5. If there is no structure in this matrix, observe columns or rows that are combinations of other rows and columns. For example, the third column will be sum of first and second columns. Similarly the fourth row would be sum of first and second rows. This means that the fourth person will combine the'ratings of first and second person. Similarly, ratings of Godfather are sum of ratings of the first and second films. This can be made explicit by writing matrix as follows, : 1010) {100 0222] |o10 0001 ba 001 1232] [110 toi | jio1 (100) (1010) 020|x/oull 001) (ooor “1. (The Ingredi UNIT ihe nec © Machine Learning, Tasks, Binary Classification axed Rotated Tesi) aS soa along with 4 walle ot Of original matric: For example, Whllc @iiag Soa SCS OMEN eed films along with 4 million people voting but only 27 fee calegories. 
It would be naive to assume that film ratings are spit by genres where the boundaries of them diffuse, Sorce People might only like comedies that are made by coen brothers, Such type of inarix decomposition exhibits useful hidden Sttuenar al thi iam & i gsoolbo oe eels da eels terween supervised learing fiom labelled data and ungupeevised feat” od data ore rere ee co ee ae Varitble oF not can be similarly determined. This can «called as predictive if t bas, other wise its called as descriptive model This generates four different m, = ‘machine learning settings. Predictive Model Descriptive Model Supervised learning Classification, Regression ‘Subgroup discovery Unsupervised learning | Predictive clustering Descriptive clustering, association " rule discovery 1. Acoinmon setting is supervised learning of predictive models. The typical tasks are classification and regression {tis possible to use labelled training to develop a escriptive model tha is nt intended 16 predic tanget variable but ideotiies of data that are different with respect to tenget variable. Such example of supervised learning of descriptive mode! is called subgroup discovery. Descriptive models can be learned in “Uunsupervisc¢’ setting and it is an implied setting. 4 Atypical example of unsupervised leaming of precictive model occurs when the data is clustered by using the clusters to assign class labels to new data. This predictive clustering is used to differentiate it from descriptive form of clustering. ‘Q15. How the performance on a task is evaluated? 5 - i " s does not In machine learning performance on a task is evaluated ina different way for example a perfect spam e-mail filer cxist. Fit exists the spamers will reverse engineer to find ways to trick spam fillers spam e-toail is @ctwally ham, Tn various cases the data is noise such that. be mislabled, or features can contain errors. In such cases it is detrimental to determine 1.6 MACHINE LEARNING [JNTU-KAKINADA) Q17. Write about geometric models. An Geometric Models ‘The instance space contains all possible or describable instances available or not available in the data set. Such type of se has geometric structure. For example all the features are numerical then every feature can be used as a coordinate in cartesing ‘coordinate system. The geometric model is developed directly in instance space by using the geometric concepts like lines, and distances. The geometric models that are applied potentially to high dimensional spaces are prefixed with “hyper- decision boundary that separates two classes, the data is said to be linearly separable. The linear decision boundary is def “a by wrer Ifp is set of n positive examples, then p= +5, <, X can be defined. The decision threshold can be set, the line can be intersected from n to p half way. This is called basic linear classifier. Figure: Basic Linear Classifier ___ Since data is noisy, linear separability will not occur frequently under data is sparse as itis in text classification. Let the instance space has 10,000 dimensions but a document has small percentage of features as non zero. This will create empty space between instances that increases the possibility of linear separability. Because it does not define decision boundary, it is problematic to select a infinite decision boundary. One solution would be to prefer largin margin classifiers. __ The geometric concepts such as linear transformations are helpful in understanding the differences and similarities between ‘machine learning methods. 
An important geometric concept in machine learning is notion of distance. If it is small between instances then they are said to be similar in terms of their feature values. So the nearby instance are expected to receive samt classification. The distance in cartesian coordinate system is measured by euclidean distance. A simple distance based classifie™ ‘mean of set of points are related with each other. The mean of set of nearby points can be used as represeniative initial calculated | points. To cluster the data into K clusters than there is initial guess of clustering the data, The means of eve") ‘and reassigned to nearest cluster mean. These two steps are repeated until there no 1 algorithm is called K-means, Rather than euclidean distance even other distances can be used like Manhattan distance that sums UNIT-1 (The Ingredients of Mach; ow a and not Consider Xand ¥ variables are known, ray example to classify a new forenample | }eW e-mail the words "vi probability P (Y= spamviagaa, loness re ifrecipe to predict value of Y'p, Pe without knowing all they line Learning, 1, Binary Classification and Related T this is called posterior probability since itis used after observing f en the post 10 various questions of interest. terior distribution helps to provide solutions ra ad ltery are searched for eecurrence check forthe corresponding and the prediet spam when the probability is more than 0.5 and otherwise ham. ‘This “Sed on values of and posterior distribution P (Y]X) is ealled decision rule. This is possible alice aa nd posterior distribution P (YX) The statisticians work frequently with differ example, ifa particular e-mail that ig spam e-mail five times more than itis Now the better one must be determin using Baye’s rule. It states that, Ht conditional probabilities that are given by likelihood funetion P (X1Y). For described by Xhas P (X¥= spam) - 3:51104 and P (XI¥= ham) ~ 7.4 10%, Then X occurs in in ham e-mail. Then prediet spam if likelihood ratio i larger than 1 otherwise predict ham. 4, either posterior probabilities or likelihoods. Its easy to transform one into another by The P(X) is the probability that is not dependent. ‘on Yand it can be ignored in most of the cases. The first decision rule depicts that theclss is predicted with maximum posterior pr obability that can be written by using Baye’s rule in terms of likelihood function, argmax P(X|¥) = argmax P(XIY)P() ¥ i oy > ame POND POD This is called maximum a posterior (MAP) decision rule. A uniform peor distribution reduces to maximum likelihood (ML) decision rule, ia sma PCH) JM Iti found to be convenient while working with ratios of posterior probabilities when there two classes. The posterior odds can be calculated to know favour of data for one of two classes P(¥=spam|X) _ P(x|¥ spam) P(Y = spam) P(Y=ham|x) ~ P(X|¥=ham) ° P(’=ham) The posterior odds are the result of the product of likelihood ratio and prior odds. The likelihood function plays an ‘mporiant role in statistical machine learning. It establishes generative mode] which is the probabilistic model from which values {or variables can be sampled. Learning of probability model involves estimation of model parameters from data. It is possible ‘hough straight forward counting, For example, in coin toss model of spam, recognition, there are two coins for every word w, in ‘vocabulary. 
One will be called when spam e-mail is generated and other will be called when ham e-mails is generated, The spam coin shows heads with probability 0° and ham coin shows with probability 0°. These parameters will characterise all the likelihood, P(,=1|¥ = spam) = 9° 2 P(™=01¥ = spam) «1-98 P(w,=1|¥ =ham)= 92 P(w,=O1¥ = ham) <1 9° To estimate the parameter Of that training set of e-mails that are labelled'spam or ham must be trained. Consider the ‘pam e-mails and count w, by dividing the total number of spam e-mails. This generates an estimate of O° for spam and 4 for ham. This is visualised in the below figure for a variant for naive Bayes classifier MACHINE LEARNING [JNTU-KAKINADA} 1.8 Tn this variant count of word occurrences in an e-mail is counted. Therefore a parameter Fy, for every likelihood P(w,=j/Y=4). For example, two spam e-mails in which ‘lottery’ occurs twice is observed and a ham e-mail in which ‘pete ‘occurs five times. Combination of two sets of marginal likelihoods generates a tartan-like pattern shown in below figure, vm XSoam ° 2 ‘ For this reason the naive Bayes model is called ‘Scottish classifier’. Q19. Discuss about logical moi Logical Models ‘The logical models are algorithmic in nature and they are inspired from computer: science and engineering. These type of models can be translated easily into rules understandable by humans like if viagara = I then class = Y'= spam. These type of ules Can be organised easily ina tec structure called feature tee. The features are used to split the instance space iteratively. The leaves tance space called inspace space segments. The leaves can be labeled with probability, real value threes whose leaves are labelled with classes are called decision trees. The feature trees ae Versatile and even the models are built on them. Consider the naive Bayes classifier that contains marginal likelihoods. It divides the instance spaces inthe regions equal to combinations of feature values. This generates a complete feature tree containing all feaures one at every level of tree. The decision tree learners use pruning techniques that delete such type of splits. feature lit isa binary feature tree that branches in same direction either left or right consider the below tree, SO UNIT 1 (The Ingredients ot Machine t ing, Tasks, Binary ‘ules in original decision list by means of diqjunction and selects a single non rectangular Viagara = 0 “Jottery * 0 then class = Y= hy senate clunstive condition for opposite class and declares every thing else as spam. The same model can be Viagara = 1 then class = Y= spam | Viagara = 0° tottery = 1 then class = Y= spam FViagara = 0 * lottery = 0 then class = Y'= ham tach path from root to Lea is translated into rule, Every pair of rules have atleast some mutually exclusive conditions akhough in ‘oF same sub tree share conditions. Rules can sometimes have overlap. Such type of rules are incomplete and ie toe Ieaming algorithms work in top down manner, The first task would be to find the good feature to divide on at top of ee The Purpose of if isto find the splits that generate improved purity of nodes on next fevel. One such type of feature is fount, the taining set is divided into subsets one for every node obtained from split. 
Again good feature is found to split on for cach of the subsets such type of algorithms are called divide and conquer algorithm, ‘Logical models can provide explanations to their predictions, For example, a prediction that is assigned by decision tree it ‘or example, a prediction that is will be explained by reading the conditions that led to prediction from root to leaf, The model can be inspected by humans and forthis reason they are called declarative, The declarative models does not need to be restricted to simple rules, G0. Write about grouping and grading. Grouping models split the instance space into segments. In every ségment a simple mode! is being leamed. They have fixed and finite resolution and they can differentiate between individual instances beyond this resolution. The grouping model assign the majority class to instances that come into segment. The purpose behind training the grouping model is to determine the right segments to avoid this labelling at local segment level. The grading models learn a global model over the instance space. They can differentiate among aibitrary instances even though they are similar. Their resolution is in the form of theory, infinite specifically while working within Cartesian instance space. An example grouping models are tree based models. They repeatedly divide the instance space into smaller subsets. The subsets at leaves divide the instance space using some finite resolution because trees are of limited depth and does not have available features. The instances that are filtered into same leaf of tree are considered as same. Examples of grading vectors are support vector machines and other geometric elassifiers. They can represent and exploit even small difference among instances because they work in Cartesian instance space. It is possible to come with new test instance that receives a score. The difference between ‘10uping and grading models is relative and certain models combine both the features. For example, all though linear classifiers are an example of grading model, itis to assume that instance which a linear mode! cannot differentiate such as instances on line |_orplane parallel to decision boundary, Another difference between these two models isthe way they handle the instances, The ‘regression trees combine grouping and grading features. Q21. Write about features. Illustrate the uses of features. ‘Ans: Model Papers, 2(a) Features ‘A feature can be a type of measurement performed on any instance. It determines much of success of machine learning ‘pplication, they are functions that map from instance space toa set of feature values called domain of feature. ‘foammon domain a ‘real numbers because measurements are mostly numerical. Other type of domains are set of integers, leans and arbitrary finite sets. Uses of features 7, id L ‘A common features is to focus on particular area of instance space. Let f be a feature that counts number (Sere ofr wing inane Ut x beuriny email hn contin /{s] Owl set eat de not have word ‘viagara’,/(x) # 0 or (x > 0)seleets the e-mails that have the word and f (x) 22 selects the e-mails that contain word twice or more. These type are called binary splits since they split the instance space into a 1.10 MACHINE LEARNING [JNTU-KAKINADA) 2. Features are also used in supervised leaming. The linear clasifir consists of decision rule ofthe form 5". 
w,3, >, The linearity of decision rule is that, every feature participates in instance score, This is dependent upon weight w, ig ifit is lange and positive then positive x, increases the score. If w,-€ 0 then positive x, decreases the score and if w= then 1, '& influence is negligible. Therefore, feature makes precise and measurable contribution to final prediction, The individual features are not theresholded and their full resolution is using for calculating an instance’s score, The two uses, ‘features as splits’ and ‘features as predictors can be combined into single model. Q22. Write about feature construction and transformation. Ans: Mode! Papers, a) Feature Construction and Transformation Features have alot of scope in machine learning. Sometimes the developer of machine learning application need to construc the features, This process is important for the success of the application. Indexing e-mail by words occuring in it isan egineereg representation that amplifies signal and attenuates noise in spam e-mail filtering and related classification tasks. The problems can be easily conceived. For example the classifier can be trained to differentiate between grammatical and ungrammatical sentences ‘The word order is signal other than noise. Developing a model is a natural process in terms of given features. The features can be modified or new features can be create. For example real valued features consist of unwanted details which can be removed by discretisation, To analyse the body weight of small group of such as 100 people by drawing a histogram, the weight of everyone {s measured in kilograms with a position after decimal point. But the histogram will be then sparse and spiky. General conclusion to be drawn from histogram is a complex process. It would be useful to discretise the body, weight measurements into intervals of 10 kilograms. In case of classification context the body weight can be related to diabetes, then every histogram bar can be associated with proportion of people who have diabetes among the people whose weight isin that interval. The intervals can be selected such that this proportion is monotonicall 14 25; 12 a 10) é 1s s 10} 4 4 5 o '30-40:50 60 70 80 90 1001110 120130 O35 $575 90 100130 Figure: Histogram of Body Weight Measurement of Paople with and without Diabetes HEE With diabetes TB Without diabetes For a particular task like classification itis possible to improve the singnal-to-noise ratio of feature, The more extreme cases ‘of feature construction the complete instance space can be transformed. Inthe below figure the data is linearly not separable. UNIT-1 (The Ingredients of Machine Leaming, Tasks, Binary Classification and Related Tasks) 1a But it can be made linearly separable by mapping the 1. Tthas carbon in an aromatic ring with six members. instance space into new “feature space’ containing squares of | 2. jt has carbon with charge of -0.13 partially original features 3. Ithas carbon in aromatic ring with six members with a 5 change of -0.13 partially 45} The third feature is found to be specific than others af because it is true. When first and second features are true then Ray the third might be false. These type of relationships can be exploited while searching for features to add to the logical model. For example, if the third feature is found to be true of specific negative example that is to be excluded, then itis of no use in considering the first and second features since they does not help in excluding the negative. 
When first feature is false of a specific positive that is to be included, then it is of no use in considering the third feature. Such type of relationships help to structure the search for predictive features. 2} 15} if 0s 9 Ba ae 0051 15 2 253 354 455 Q23. Discuss about the interaction between features. Ans: sific: The feature can interact with each other in various | Q24. Explain in detail about classification. ways. The communication among them sometimes poses | “pode Papert, Cat achallenge or can be ignored or exploited. For example, if'a term ‘professional’ is observed in an e-mail then there is a chance | Classification even for other phrase such as “real” in the e-mail. Ignoring such Cassis te temattteentine tisk in ‘machine type of interaction means overestimating the amount of data ing. Aclassifieris a mapping ¢:%-> G where T= (C,Cj- that is exposed by notising both the phrases in same e-mail. se ; Considering this; couapletety ieesate epson the task: Cnnaisea’ | ps ts mm meanaennte emcees oes = mctimncs another feature interaction example, where features are ‘grad’ | US to represent the st of examples of that class, For example and ‘real’ that assess the extent to which the models are of “hat” can be used to represent that (xx) isan estimate of true grading kind and ability to handle the real valued feature. The | but unknown function ¢ (x). ‘The examples of classifier are in values of them differ by atleast 1 for all but one. The features | the form (x,0(x)) where x €9Cis an instance and c (x) is 2 adobe positively comelated as per statisticians. Other | ie clas of instance. Leming proces of casi consis positively corelated features are logic’ and ‘dise'that indicate | oF constructing {he logical models and the ability to handle the discrete feature. | a5 possible. There are only two classes that are referred as ‘Thereare even negatively corelated features where value of one | Focitive and negative, © and © or * land 1 The true class ‘eereases and value of other decreases. Ths applies to ‘spit’ | ciassijcation is called binary classification. The spam e-mail Es | stonatecatennrmeess ‘models. This is applicable to even ‘logic’ ele oeme ‘considered as positive class and ham is considered as negative Inclassification, the features can be correlated differently based on the class. For example it is conceivable that fora person. with last name Hilton and employee of Paris at council. The emails containing the word ‘Paris’ and ‘Hilton’ indicate ham ‘and emails containing both the words indicate spam. ‘features are positively correlated in spam class and ney in ham class. So, ignoring such type of ‘sdetrimental for classification performance. In other cases, the | Teature correlations might obscure the true model. The feature {Orelation also helps to focus on specific part of instance space. Features can be related even in other ways. Consider the | b ‘atures that are either tree or false of molecular MACHINE LEARNING [JNTU-KAKINADA) n the above figure can be converted to classifier by labelling every leaf with class A sunple wether wag bbe to assign the majority class in every leaf resulting in decision tree ic If the e-mail contains word viagara then if is classified as spam, vi whether it is labelled as spam or ham. From the numbers in the above figure, & obtained. The left leaf will predict 40 ham emails and 20 spam | ‘The middle leaf will classify 10 spam emails correctly and mis and 50 ham emails. 
This means that 30 out of 50 spam emails are classified correctly and 40 out of $0 ham uN 1 (The Ingredients of Machine Lear and Related Tasks) The true position rate is an estimate of probability such that an arbitrary positive s classified correctly, res an estimate of P, (&x)=@|c(x)=@). The true negative rate is proportion of negatives that are correctly classified and estimates Py (ax)=ele(x)-0) These rates are called sensitivity and specificity. Then they, are visible based on the per-class accuracies. The true positive and negative rates can be computed by dividing the number on descending diagonal by row total. The per class error rate is false negative rate for positives and false positive rate for negatives. They are obtained by dividing the number on ascending diagonal by the row total. For example, a classifiers predictions on a set are shown in below table The true positive rate is tpr = 60/75 = 0.80 and true negative rate is tnt = 15/25 = 0.60. The accuracy is ace = (60415)|100 = 0.75 which is not true positive and negative rates. The proportion of positives pos * 0.75 and negative neg = 1 ~ pos = 0.25 generates ace = pos. tpr + neg.tnr Here, ifthe number of positives and negatives are equal then unweighted average can be obtained. The good performance on the class contributes to good classification accuracy. To achieve good accuracy the classifier need to concentrate on majority class specifically when class distribution is unbalanced. The majority class is also 4 least interesting one, If minority class is class of interest and small then the accuracy and performance on majority class are not right quantities to optimise, In such cases, an alternative to true negative rate called precision is considered. Precision is again a counterpart to true positive rate such that when true positive rate is proportion of predicted positives among actual positives then precision is proportion of actual positives among MACHINE LEARNING [JNTU-KAKINAD,) 1.14 true negative rate, specificity, TNINeg P(E(x)= Ole (x) negative recall false positive rate, false EPINeg=1-tr | P(é (x)= ®lc (x) alarm rate false negative rate FNIPos=1-tpr | P(é (x)= Ole (x) | Levees) precision, confidence wwe rel [6(x) = o(x) =] TPA(TP + FP) P(c(x)= @é (x)= Pree Sxerll€)=8] Various models are visualised on same data Set by several points and fact is used that accuracy is constant along the line ‘segments with slope 1 to rank these classifiers on accuracy 9. The rectangle can be normalised o be a unit with true and false posing Tate on the axes. This is called as ROC 5; Pace that has line segments with slope | and the points are connected with space average | Ranking — a 26. Discuss about scoring and ranking. Anz Model Paper, a2) the class predictions depend upon. Formally, the scoring classifiers Most of the classifiers calculate scores on which ‘o k-vector of real numbers. The bold face notation depicts that scoring | ‘mapping §:%->R¥, i., mapping from instance space Slassifier generates vector s(x)=(4,(x),. §,(x)) (x) is score that is assigned to class Chor instance x. Such score represents the probability of applying class label C, In case of two classes, it suffices to consider score for only one class. Then §(x) is used to Tepresent the score of positive class for instance x. The below figure depicts the Process of turning the feature tree into scoring tree. 
instead of single.number §, UNIT-1 (The Ingredients of Machine Learning, Tasks, Binary Classification and Related Tasks) 1.15 sat a Oe ee Figure: Loss Functions ‘The loss functions in the above figure are as follows, (@) 0-1 loss LO1@)= 1 ifz <0 and<,(2) = 0itz>0 (i) high loss L,(2) = (1-2) if < 1 and L,@)=Oifz>1 (Gi) logistic loss Z,, (2) = log,(1 + exp(—2)) (iv) exponential oss Z,.(2) = exp (~2) (9) square loss L,,(2) = (1-2? ‘The average loss over a test set Te is Fyre (x)) The simple loss function is O— 1 loss that is defined as L,,(2) = {We goa) -Oifs>0.Themenge 0 1 oss is the ._ Proportion of misclassified test examples, pape MACHINE LEARNING [JNTU-KAKINADA) rank-err = ‘The ranking accuracy would be Lierorctenof i(s)> s(x!) +44 a(2)=4(2')] rep Pos Neg = 1 -rank-err Visualisation of classifier performance is depicted in below figure. ‘are ranked in the order p1- p2- ‘The score can be derived from this linear classifier the distance of an example re sc is linear ¢! by taking i arias Pi pi-nl- p4-n2- n3- pS- n4- nS. The below figure visualises the four ranking errors in top left corner. The. 2s = 0.84, a — Lo vw 1.18 MACHINE LEARNING (JNTU-KAKINADA) “__ Tie curve is dierent om coverage curves Tor Scoring Ws The Fason behind this is abunce often. The icp cae can be generated from ranking as follows, A curve three steps up, one step to right, one step up, two steps right, One Step up ang finally two steps right. The same process can be applied to grouping models when ties are handled. If there is a tie between p Positive examples and n negative examples, it goes p steps up and n steps to right. The grouping model ROC curve have as line segments as instance space segments in model. The grading models have one line segment for every example in dataset The | better performance can be achieved by decreasing the models refinement training a model is not about amplifying the significa distinctions but also about diminishing the effect of misleading distinctions. Q28. How rankers are turned into classifiers? Ans: ‘The difference between rankers and scoring classifiers is that the ranker the high score as strongest evidence for positive class. Otherwise it does not make any assumptions about scale using which the score are expressed or on the value that is to be good score threshold used for separating the positives from negatives. Consider the problem of obtaining threshold from coverage curve or ROC curve. One key concept is of accuracy isometric. In the example of coverage plot, the points of equal accuracy are connected through lines with slope 1. Now draw a line with slope 1 through top Jet pont and then slide down wl the coverage curve is touched at some point. Every point will therefore achieve the possible accuracy with the model. Inthe below figure, this method will identify points 4 and B as points with highest accuracy. This cab be achieved in different ways such as model 4 is conservative on positives. : in detail about class probability estimation. Model Paper.4V, @2{b) Anst class Probability estimation The class probability estimator is a scoring classifier that generates probability vectors over classes i.c., mapping x>{0,1]* It can be written as P( x)= (A (+)... (x), where 2 (x) is probability that is assigned to class C, for instance x nl 3. P(x) =1- In case of two classes, the probability that is associated with 1 class is 1 minus probability of other clas, In uch cases P(.*) can be used to represent the estimated probability of positive class for instance x . 
The probabilities P,(x) can not be accused directly through scoring classifiers. The probabilities P(x) are estimates of probability P,(e(x')=Gj|4 ~x) where x! ~ x stands for‘ x" is similar to ‘The frequency of similar instances of this class among instances to, x need to be determined. The x belongs to that class hasod on the percentage of frequency. Similarity here depends upon the model used. Consider a situation where two instances arsimilar. Than P, (( =x" ~ x) =P; (e(x!)=@) is estimated by proportion pos of positives in data set, The P(x) is predicted regardless of the knowledge of x's true class. Consider another situation where there are no similar instances unless theyaresameie, 2! ~x if.x!=x,and sx otherwise. Then P(e(x!)=@)x!~ x) = P(e(x)=©) because = is fixed—is Life(x) = @ and 0 otherwise. Then P(x) = 1 is predicted for known positives and P(x) is predicted for known negatives. But this cannot be generalised to unseen instances. Zs Assessing Class Probability Estimates ‘The performance of class probability estimators with classifiers if the problem to be solved. The true probabilities are not accesible, so the binary vector (ife( x) =¢,}..[e( x) =c,]) is classified. Ithas # bit set to 1 if x' s true class is C, and other bits azeseto 0, They are used as true’ 5 The squared ero of predicted probability vector #(x)=(A(x).-A,(x)) con be defined as eee a “ ‘ = 1.20 MACHINE LEARNING [JNTU-KAKINADA} 8 curaey isometric that generates thay Points that are contained in concavities. The, 2 hl of ROC curve is convex curve through outermost points op = orignal ROC curve. I is more AUC than original cuve singe S ‘replaces ranking erors of original curve with half errs, Thy, ge actual ranking incurs 6 out of 24 ranking erors when convex hu Ee ‘urns them into half errors. Ones the convex hulls determing ae {he empirical probabilities ean be used in every segment of Re ‘convex hull as calibrated probabilities, a 0.7, ‘Assume that scores are not probabilities on some unknown | scale so that spam filter i a ranker instead of class probability estimator. Because every test example receives a different score ‘hen the empirical probabilities are a Leading to sequence of | P-values of 1-1-0-1-1-0-1-1-0-0 in decreasing order of scores. In the above figure, P = 1 corresponds to vertical segment of ROC curve and P = 0 to horizontal segment. The problem is is caused by maintaining vertical segment following a horizontal U N IT BEYOND BINARY CLASSIFICATION, CONCEPT LEARNING PARTA SHORT QUESTIONS WITH SOLUTIONS — Qi. Define binary classification. es: ‘Model Papers, 1(b) Binary classification has certain concepts that are completely binary. For example, consider a notation of coverage curve that will not generalize more than two classes. The issues in case of having more than two classes are as follows, (Evaluation of multi-class performance. i)_Building multi-class models out of binary models. Q2,_ Define Regression. Anst Model Papers, a1(b) ‘The function estimator is also called as regressor. It can be defined as a mapping f : &¢—» R. The problem of regression is to leam a function estimator from examples (x,, f(x). When this is natural and innocuous generalisation of discrete classification, itis not without its consequences. Due to one reason the low resolution target variable is switched to one with infinite resolution. Q3. -What is unsupervised learning? 
Ans: Model Paper-III, Q1(b)

In unsupervised learning, the desired or target output is not supplied to the network while it is being trained. No teacher is required to present the desired patterns during the learning process, so the system learns by itself, recognising and adapting to the different structures in the input patterns. Because there is no teaching input, this kind of learning is often called adaptive vector quantization. Since the system organises itself, unsupervised learning can also be called self-organised learning: each neuron competes and cooperates with the others for weight updation based on the present input, and in this neuronal competition only the winning neurons undergo the learning process. In an unsupervised learning system, learning is carried out in the form of differential equations designed to work with the information available at the local synapse.

Q4. What is descriptive learning?

Ans: Model Paper-V, Q1(b)

Descriptive learning involves describing the data in order to produce a descriptive model, so the output of the task (a model) is the same as the output of learning. It makes no sense to employ a separate training set to generate a descriptive model, because the model should explain the actual data rather than some hold-out set. In descriptive learning, therefore, the task and the learning problem coincide.

Q5. What is predictive clustering?

Ans:

The difference between predictive and descriptive models can be observed in clustering tasks. Clustering can be understood as learning a new labelling function from unlabelled data, so a clustering can be defined in the same way as a classifier, namely as a mapping q : X -> C, where C = {c1, c2, ..., ck} is a set of new labels. This is the predictive view of clustering: the domain of the mapping is the entire instance space, and therefore it generalises to unseen instances.

Q6. What is LGG?

Ans:

The LGG (least general generalisation) of two instances is the nearest concept in the hypothesis space where the upward paths from both instances intersect. The important point is that this concept is unique; this is a special property of many logical hypothesis spaces and can be put to good use in learning. Such a hypothesis space is a lattice: a partial order in which every two elements have a least upper bound (lub) and a greatest lower bound (glb).

PART-B ESSAY QUESTIONS WITH SOLUTIONS

2.1 BEYOND BINARY CLASSIFICATION

2.1.1 Handling More Than Two Classes

Q9. Write about handling more than two classes.

Ans: Model Papers, Q3(a)

Some concepts in binary classification are essentially binary; for example, the notion of a coverage curve does not generalise to more than two classes. Two issues arise when there are more than two classes:
(i) Evaluating multi-class performance.
(ii) Building multi-class models out of binary models.

(i) Evaluating multi-class performance: Classification tasks often have more than two classes. For example, when a patient is tested for a rheumatic disease, the doctor needs to classify the patient into one of several variants. The performance of a classifier over k classes can be assessed through a k-by-k contingency table, and its accuracy is obtained by dividing the sum of the counts on the descending diagonal by the number of test instances.

(ii) Building multi-class models out of binary models: A common way to construct a multi-class classifier is to train a number of two-class models (for instance, linear classifiers) and combine them into one k-class classifier; this can be done in several ways. One method, called the one-versus-rest scheme, trains k binary classifiers, as illustrated in the sketch below.
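The following minimal Python sketch illustrates both decoding schemes. It is not code from the text: the train_binary helper is an assumption of this sketch (it implements a basic linear classifier between the two class means), and the tiny data set at the end is hypothetical.

from itertools import combinations

def train_binary(X, y):
    # basic linear classifier: w = mean(pos) - mean(neg), threshold at the midpoint
    pos = [x for x, t in zip(X, y) if t > 0]
    neg = [x for x, t in zip(X, y) if t < 0]
    mp = [sum(v) / len(pos) for v in zip(*pos)]
    mn = [sum(v) / len(neg) for v in zip(*neg)]
    w = [p - n for p, n in zip(mp, mn)]
    t = sum(wi * (p + n) / 2 for wi, p, n in zip(w, mp, mn))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) - t

def one_vs_rest(X, labels, classes):
    # one scorer per class: class c versus all remaining classes
    scorers = {}
    for c in classes:
        y = [+1 if l == c else -1 for l in labels]
        scorers[c] = train_binary(X, y)
    # predict the class whose scorer is most confident
    return lambda x: max(classes, key=lambda c: scorers[c](x))

def one_vs_one(X, labels, classes):
    # one scorer per pair of classes; decide by voting
    scorers = {}
    for a, b in combinations(classes, 2):
        pair = [(x, +1 if l == a else -1) for x, l in zip(X, labels) if l in (a, b)]
        scorers[(a, b)] = train_binary([p[0] for p in pair], [p[1] for p in pair])
    def predict(x):
        votes = {c: 0 for c in classes}
        for (a, b), s in scorers.items():
            votes[a if s(x) > 0 else b] += 1
        return max(classes, key=lambda c: votes[c])
    return predict

# hypothetical three-class data
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.0, 0.6], [9.0, 11.0]]
labels = ["a", "a", "b", "b", "a", "c"]
print(one_vs_rest(X, labels, ["a", "b", "c"])([1.2, 1.5]))   # 'a'
print(one_vs_one(X, labels, ["a", "b", "c"])([1.2, 1.5]))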
The first of these binary classifiers separates class C1 from C2, ..., Ck, the second separates C2 from the remaining classes, and so on. When training the i-th classifier, the instances of class Ci are treated as positive examples and all remaining instances as negative examples. Sometimes the classes are learned in a fixed order, in which case k-1 models are learned, the i-th separating Ci from Ci+1, ..., Ck for 1 <= i < k. An alternative is the one-versus-one scheme, which trains a binary classifier for every pair of classes. One-versus-rest and one-versus-one are the most commonly used methods for turning binary classifiers into multi-class classifiers. To force a decision in the one-versus-rest scheme, a class ordering can be fixed, either before or after learning; in the one-versus-one scheme, voting can be used to arrive at a decision, which is equivalent to distance-based decoding. Another option is to derive scores from coverage counts, i.e., from the number of examples of each class that are classified as positive by each binary classifier. This method is generally applicable and frequently works well, although several issues arise once there are more than two classes.

In general, a k-class learning problem can be addressed with binary classifiers as follows:
(i) Break the problem into l binary learning problems.
(ii) Train l binary classifiers on two-class versions of the original data.
(iii) Combine the predictions of these l classifiers into a single k-class prediction.
The common ways of performing the first and third steps are the one-versus-rest and one-versus-one schemes, although nothing prevents the use of other schemes.

Q10. Explain about regression.

Ans:

If the number of parameters of a model is underestimated, the resulting bias cannot be reduced even with more training data; on the other hand, a model with a large number of parameters will depend heavily on the particular training sample, so that even small differences in the training sample can result in a very different model. This is called the bias-variance dilemma. A less complex model suffers less from variability due to random variations in the training data, but it introduces a systematic bias that cannot be removed by large amounts of training data; a high-complexity model can eliminate such bias, but it incurs non-systematic errors due to variance. This can be made more precise by observing that the expected squared loss on an example x decomposes as

E[(f(x) - f^(x))^2] = (f(x) - E[f^(x)])^2 + E[(f^(x) - E[f^(x)])^2]

The first term is zero if the function estimators get it right on average; otherwise the learning algorithm exhibits a systematic bias of some kind. The second term quantifies the variance in the estimates f^(x) resulting from variations in the training set. The figure below depicts this through a dartboard metaphor.

In this neuronal competition, only the winning neurons undergo the learning process. In an unsupervised learning system, learning is carried out in the form of differential equations designed to work with the information available at the local synapse. For instance, the word "Pen" produces an image in our mind which captures the generic properties of all instances of pens that we have seen; the image produced corresponds to the centroid (codebook vector or quantization vector) of the "cluster" of images of pens in our brain. Here the cluster produces a classification structure within a data set. Both these learning strategies operate iteratively by adjusting the weights in the network.
‘On comparing these two learning strategies, itis observed that supervised learning uses pattern classification for each ‘training pattern, whereas unsupervised learning uses clustered patterns to produce decision class codebooks. Secondly, the type of learning is usually off-line in supervised method and on-line in unsupervised method. Thus, learning algorithms define architecture dependent method to convert pattern information into weights for generation of intemal modes. MACHINE LEARNING [JNTU-KAKINADA) Labelled C, Labelled C, Labelled €, Here g,=|{reD| 8) ~ tue « ls) = ¢} and ¢ is shorthand for {xe} Ix) =o) Concept Learning) 11 If all the concepts which does not cover atleast one ‘among the instances, are not ruled out then the hypothesis space is de+ creased to 32 conjunctive concepts. Insisting that any hypothesis cover all the instances decrease this to four concepts. The least general one is that one found in example, Is called their least general generalisation (LGG), The below. algorithm formalises the procedure 10 apply a pair wise LGG operation repeatedly to an instance and present hypothesis because they have same logical form. The structure of hypothesis space ensures that result is independent of order in which the instances are processed. The LGG of two instances is nearest concept in hypothesis space where paths upward from both instances intersect, The fat here i that the point is unique and it isa special property of many logical hypothesis spaces and can be sed in ‘good way in learning. Such type of hypothesis space is lattice, a partial order where two elements have least ‘upper bound (ub) and greatest Jower bound (glb). The LGG of st of instances is least upper bound of set of generalisations of instances. Tha isa posible generalisations sues general as LG, The LGG is most conservative generalisations that can be leamed from data 1, x-+ first instance form D; 2 Hex 3. _iffinstances left then 4, x«—next instance from D; 5. H—LGG(H,x); 6 end 7. return Ht Internal Disfunction | ust be made richer slightly by enabling a restricted form of disfunction called as intemal disfunction. Ifa dolphin is 3 metres [ln sd ote dolphin 4 etre long ten the condones fA ite cab ded tothe concept. This cin be wt ‘ens length = (3, 4] that means length = 3 V length = 4. This makes sense for features that have more than two values ie, for ‘nstance the intemal disfunction teeth = [many, few] is true and can be dropped The below algorithm depicts calculation of LGG of two conjunctions that have intemal ds funtion. © tour For energy feature fdo Iff= y, is conjunct in x and f= »,is conjunct in of then. ‘Add f= combine — 1D (,¥,) 1023 Rat if Seca ite ee =e 288 Pate ThroWah the Hypothente Bpmce Q15. Explain about path through the hypothesis space, Ans: Consider a sea of animals that belong to same species. The length of them is in metres regardiens of gells, prominent bewk and few or many teeth. By using these features, the first animal ean be depicted through below conjunction, Length = 3 4 Gills = no ~ Beak = yes «Teeth = many Even the next has same features but it is a metre large. So drop the length condition and then generalise the conjunction Gills = no » Beak = yes 4 Teeth = many The third nal eth Sg a aah, Vic Liber! che Gills = no 4 Beak = yes. . Alll the remaining animals “ni fy i om in UNIT-2 (Beyond Binary Length = [3, (3, 4) & Beak = yes Length = (3, 4] & Gills = yes Gills = no & Beak = yes Length = [3, 4] & Gills = no & BEak= yes Inthe above figures, there are two general hypotheses. 
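The LGG procedure with internal disjunction described above can be sketched in a few lines of Python. This is an illustration only: the dictionary representation of a conjunction and the function names are my own, and the feature domains are supplied so that a condition can be dropped once its internal disjunction allows every value (as with teeth = [many, few] above). The instances follow the dolphin example used in the text.

# A hypothesis maps each feature to the set of values it allows (internal disjunction);
# a feature missing from the dict places no condition on the instance.
def lgg_conj_id(h, x, domains):
    new = {}
    for f, allowed in h.items():
        vals = allowed | {x[f]}            # extend the internal disjunction with x's value
        if vals != domains[f]:             # keep the condition only if it still excludes something
            new[f] = vals
    return new

def lgg_set(instances, domains):
    # the first instance is the initial, most specific hypothesis;
    # the pairwise LGG is then applied repeatedly, as in the algorithm above
    h = {f: {v} for f, v in instances[0].items()}
    for x in instances[1:]:
        h = lgg_conj_id(h, x, domains)
    return h

domains = {"Length": {3, 4, 5}, "Gills": {"yes", "no"},
           "Beak": {"yes", "no"}, "Teeth": {"many", "few"}}
p1 = {"Length": 3, "Gills": "no", "Beak": "yes", "Teeth": "many"}
p2 = {"Length": 4, "Gills": "no", "Beak": "yes", "Teeth": "many"}
p3 = {"Length": 3, "Gills": "no", "Beak": "yes", "Teeth": "few"}
print(lgg_set([p1, p2, p3], domains))
# {'Length': {3, 4}, 'Gills': {'no'}, 'Beak': {'yes'}}  i.e. Length = [3,4] ^ Gills = no ^ Beak = yes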
Every concept between least general one and one of most nes is possible hypothesis. That i, i it covers all the postive and no negatives. Mathematically the set of hypothesis which agre with data is convex set, It is possible to interpolate between any two members of set. Ifthe concept that is less general than other and more general than other is found then itis a member of the set. {tis means that, itis possible to describe the set ofall possible hypothesis by its least and most general m ‘The concept is said to be complete ifit covers the positive examples. And itis said to be consistent fit oa negative examples. The version space is set ofall complete and consistent concepts, This set will be convex and fully defined by. 2.14 MACHINE LEARNING [JNTU-KAKINADA) NE Q16. Write in short about the followin; () Most general consistent hypothesis: (ll) Closed concepts. Ans: @ Most General Consistent Hypothesis: ‘Consider five positive examples pl: Length = 3.» Gills = no Beak = yes - Teeth = many (p2:: Length = 4 4 Gills = no a Beak = yes 4 Teeth = many p3: Length =3 Gills = no » Beak = yes a Teeth few | ‘p4: Length = 5 » Gills = no, Beak = yes a #p5:Leagih = 5 Gills = no» Beak ‘will not depict that D and £ are logically equivalent, since %, C %, — the extension of D is a proper subset of extensions or , there are instances in % that are covered by E but not D. However, no ‘witnesses’ are present in data and thus wh is concemed D and £ are not distinguishable. As shown below fig, rerun tention fo clocd concep an A hypothesis space. 17. Discuss about beyond conjunctive concepts. anes Model Papers, 3(b) ‘The conjunetive normal form expression (CNF) is ‘conjunction of clauses. The conjunction of literals are trivially in CNF ‘here every Gifumetion contang single literal. The CNTF expfeelod semua Dect there can occur Hsu Glaus, Amethod is used to learn horn theories where every clause A —» B ishomn clause ..4 isa conjunction of literals and B is single literal. For ease of notation the attention is restricted to Boolean feature and then fforf= true and for f= false is written {the hom theory does not cover a positive then all the clauses that violate the positive need to be dropped where A + B violates a positive when all literals in the conjunction A are true in example and B is false. Thing s are interesting when nega- tives are considered. Then one oF miore clauses nee to be determined toad to theory to exclude the negative. For example if the current hypothesis covers the negative. 2 Many Teeth Gills « Short . Beak, To exclude this, the below Hom clause can be added to the theory. Many Teeth A Gills 4 Short + Beak. ri = . Wire bedi to ageey su e Because there are other clauses that exclude negative, this would be the most specific one and less risky. The most specific clause that excludes a negative is unique if negative has only one literal set to false, For example if the covered negative is Many Teeth A Gills —Short , Beak, ‘Then there is choice between the below Hom clauses a a Many Teeth » Gills > Short . asia ‘ail ‘ Many Teeth «Gills Beak, laa moe sren gai The less literals are set to true in negative example the more general the clauses excluding the negative are consider the Lo hetrue B Brel Ueda abb de s See > = CS Bslailgny tadinest betas) baci 2.18 MACHINE LEARNING [JNTU-KAKINADA} 3 he tue; 10. forall s € Srepeat step 11 and step 12. 11, p =the conjunetion of literals true in S; 12. Q + the set of literals false in s; 13. forallge QdohHha@—a)s 14. end for 1S. end if "16 endif 17. 
retum h The above algorithm maintains a lists of negative examples from which if builds the hypothesis. Rather than adding new negative examples to list, it tries to find negatives with less literals se to true because it results in more general clauses. This is possible when there is access to a membership oracle Mb that depicts whether a particular example is member of concent that is learned or not. “The above hom algorithm is an active learning algorithm that builds its own training examples and asks the membership 5 ‘oracle to build them. The list of selected negative examples from which hypothesis is rebuilt is core of algorithm. The runtime of algorithm is quadratic in m and n, It earns horn theory equal to target theory. Q18. Write is short about usage of first order logic. ss Ans: Model PapeciV, 036) Usage of First Order Logic eke : ‘One way to move beyond conjunctive concepts defined by simple features is to use a richer logical language. The ealer used languages are proportional i ever litera is a proposition such that ills = yes standing for ‘the dolphin has il fiom which larger expression are bul by wsing logical connections. First order predicate logic or fist ode logic fr sho will generalise this by Constructing more complex literals from predicates and terms. For example, the first order lier can 24, Pair of (Gill). The Dolphin 42 and Pair of (Gill) terms refer to objects. The Bodypart is binary att Model Papers, Q1() sn machine earning the tree models ae popular and easy to understand, The tree models represent high flexibility at some price They can capture the complex non-linear relationships and they are even prone to noise in data set. The trees might have sro muoredeaves among. which the lft leaf provides logic about the parent fade andthe right leaf provides the information Ans: 3. Write in brief about regression trees. Ans: w= DO-7F ver Inthe below equation y indicates the mean of target values in Y. 2 MACHINE LEARNING [JNTU-KAKINADA) GL Write in short about clustering trees. Model Paper, ate) trees area ype of decision trees that split the instances into homogeneous Sieh. ‘They provide symbolic see ee talperyfeed learning algorithm, The regression wove determine (he veo Ae Sea SES TENERS Sais taht clustered around mean value inthe segment. And the ‘variance of target values is nothing but the average euchean distance fo mean. ‘A solution for this would be to use a vet calculated 35, tor of target values, The cluster dissimilarity is of a group of instances D can be Lois) 35 he elit similarity earned bythe weighted average cuter dismilesty over the chtien. Ths ane wed UNIT-3 (Tree Models, Rule Models) 3.3 PART-B Jaggi ESSAY QUESTIONS WITH SOLUTIONS % 3.1. TREE MopELs é : e 34.4 Decision Trees G3. Give an introduction about Tree Models, ans: ‘Model Paper, 4a) chine learning the tree models are They can capture the complex non-linea we leaves among which the node Popular and easy to understand, The tree models represent high flexibility at some ar relationships and they are even prone to noise in data set. The trees might have left leaf provides logic about the parent node and the right leaf provides the information ‘ne concept of themodels can be applied on various tasks suchas classification, probability estimation, ranking, clustering soi gression, The ee models are grouping models whose purpose is to minimise the diversity in leaves where the notion of versity dependent on the task, The diversity can be interpreted as a type of variance. 
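The variance measure used by regression trees above can be read as Var(Y) = (1/|Y|) * sum over y in Y of (y - mean(Y))^2, and a candidate split is scored by the weighted average of this quantity over its children; the same pattern carries over to the average cluster dissimilarity used by clustering trees. A minimal Python sketch (illustrative names, hypothetical target values):

def variance(ys):
    # Var(Y) = (1/|Y|) * sum((y - mean(Y))^2): impurity of a regression-tree leaf
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def split_variance(children):
    # weighted average variance over the children of a candidate split
    n = sum(len(ys) for ys in children)
    return sum(len(ys) / n * variance(ys) for ys in children)

left, right = [10, 12, 11], [30, 34, 29, 31]      # hypothetical target values in two children
print(variance(left + right), split_variance([left, right]))   # the split reduces the variance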
A feature tee is @ tree that has internal node labelled with feature and the edges are connected to internal node is labelled with literal. A set of literals at some node is called a spit. Every leaf determines a logical expression that is conjunetion of literals found in path between that leaf and root. The ension of it is called imstance space segment. It represents various conjunctive concepts in hypothesis space. The learning problem involves the decision related to the suitable concept for solving the task. The rule leamears will learn the concepts one atime. The tree leamers perform top-down search for these concepts. The below algorithm promotes a generic learning method that is common to most of the tree learners. 1. _ if Homogeneous (D) then return Label (D); S< Bestsplit (D, 7); Split D into the subsets D, based on literals S; for every i repeat step 5 ifD, = 6 then g T, —Growtree (D, F) else Tis leaf labelled as Label (D); 6 end for 2 % 7. return tree with root labelled as Sand children 7, above algorithm is based on divide-and-conquer that divides the data into subsets to construct the tree for crys co Cotbaes batons ia et Tews of algorithms are gredy. Another alternative for his would yeep MACHINE LEARNING [JNTU-KAKINADa) For example, in case ofthe boolean features N is divided N, and Nv. Consider two classes, N° for positives and 1° negatives nN. The decision about the utility of feature to spit the examples into positives and negatives depends pon the cng, cases like ‘D® and D° = 4 DP=$ and DP = D®. The 2 children of split are said.to be pure. tis even important nm sneesue the impurity of set of positives negatives, Depending upon the relative magnitude of and It should nor change ‘when both of them are multiplied with same amount, In such case impurity is defined in terms of proportion, P= P1(% + i®). Other than this, impurity should not even change when positive and negative classes swap. In such cas, it should ny change when p is replaced with 1 P. A function that results in O when P =0 or 1 is required. It should reach the maximum for P = 1/2. ‘The below functions can fit the bill, 1. Minority Class min (?,1— P) A ‘It measures the proportion of mis classified examples when leafs have labels as majority class. It is referred as error ra, The error depends upon the purity of examples. The impurity measure can be written as 1/2-/ P-1/2/. 2, GiniIndex2 P(1- P) e It indicates the expected error when the examples in leaf are labeled randomly. It will be positive with probability Poy negative with probability | — ‘P. Further more the probability of false positives will be P (1 ~ P) and the probability of fle negative will be (I~ >) P. ; 3, Entropy— ? log, P-(1- P)log, (1- P) Itindicates the expected information in the form of bts about class of randomly drawn examples. The prediction of message ‘and expected information length depends upon the purity of set of examples. ‘ The above three functions are plotted inthe below figure, i 05, YU \ | LN (Tree Model Rule Models) ally the impurity values such a Kp (Nand Imp (N;) oF children on impurity Curve ae 6 2. Connect these two values by straight line to depict the weighted average of the two. ical 3. The P provides the exact interpolation point because the empirical probability of parent is weighted average of empiri probabilities of children with same weights. ie, bolls) in tay fea ‘The impurity measures can be adapted to k> 2 classes by computing the sum of per-class impurities in one-versus- manner. npleted. . 
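The three impurity measures just listed, and the weighted average used to score a split, can be sketched as follows. This is illustrative only (function names and the example counts are my own); note that all three functions are zero at p = 0 or p = 1 and reach their maximum at p = 1/2.

import math

def minority_class(p): return min(p, 1 - p)              # error rate of majority-class labelling
def gini(p):           return 2 * p * (1 - p)            # expected error of random labelling
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # expected information in bits

def weighted_impurity(children, imp):
    # Imp({D1,...,Dl}) = sum(|Dj|/|D| * Imp(pj)), each child given as (positives, negatives)
    total = sum(pos + neg for pos, neg in children)
    return sum((pos + neg) / total * imp(pos / (pos + neg)) for pos, neg in children)

children = [(8, 2), (2, 8)]      # hypothetical split of 10 positives and 10 negatives
for f in (minority_class, gini, entropy):
    print(f.__name__, weighted_impurity(children, f))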
The é-class entrapy can be defined as )”~ log; and k-class Gini index as ya -F). = co Consider the below algorithm, Input = data NV; set of features F. Output ; feature Fto split on Lee 2, foreach fe E repeat step 3 to step 5 p 3, Split into, ....N, based on values v, off 4. iflmp ((N,..-}) Gomsicer the consecutive examples of some other class when task is classification, consider the examples whose tinge, are different when task is regression and consider the examples whose dissimilarity is large when the task is > __ Every potential threshold can be evaluated in the form of distinct binary feature. ‘The supervised rule homogeneous set of and to find the rule bodies for. of rules ie. the rule list. The second met Qi7. Explain about Ans: The purpose of rule lea the homogeneity. A down ward homogeneity can be measure and because of this the weighted avera ‘one of the children that has added literal as true. Any But, all the impurity measures produce th when P< 1/2 and h hm os ee Tn UNIT-3 (Tree Models, Rule Models) 3.11 The above algorithm is also called as covering algorithm and itacts as the base Tor most of the rule learning systems. It An algorithm for learning the single rule is as follows, 1. etme; 2, L€ set if available literals; 3. if (not Homogeneous (D)) repeat step 4 to step 7 4, 1 Best Literal (D, 1); 5. bebak 6. De fee D| beoversx}; 7. L«-L/{I0 € L/D makes use of same feature as 1}; 8. endif 9. Ce label D); 10. rif b then class = C; 11, return The above algorithm makes use of functions like Homogeneous (D) and label (D) in specialization. The function Best Literal (D, L) is used to select the best literal for adding it wi data D. The algorithm terminates when D drops below a particular size, ‘Gi6. How rule lists are turned into ranker or probability estimatotor? Ans: ‘The rule list cab be turned into a ranker or probability estimator ‘similar to that of decision trees. The local class. distributions ofthe rules can be accessed with the help of covering algorithm. The scores can be therefore be maintained based on the probabilities, ‘For example, consider two classes for which the instances ‘cab be ranked based on decreased empirical probability of positive clas thereby generating a coverage curve with one segment for every rule. The ranking order of rules isnot related to their order deciding the requirement of further ith the rule from candidates of L in in rule lst. ‘Consider the concepts X and Y, () Length=4K, iid () Beak=yesK,~5 i -2,i5 ' The rule list XY-can be built by using the above eoncepts as rule bodies, if Length = 4 Then class = © [1+,3-] else if Beak = yes then class = ® (4*,1-] else class = © ‘The coverage curve for the above rule list is as follows, 3.12 MACHINE LEARNING [JN’ ‘The initial segment of curve indicates the Instances Covered by only F Since it has highest proportion of positives, ii considered first in coverage curve. The second coverage segment indicates rule X followed by the third coverage segment (.) indicates the default rule. This segment comes last because it does not cover the positives. ‘The rule list can even be constructed in opposite oder ie, YX, If Beak = yes then class @ (5+,3-] else if length =4 then class = @ else class = @ ‘The coverage cure fortis rule is shown inthe above figure. The ist segment here indicates the frst segment ofr it and second and third segmen’s re ied in between rlex and default. The rule list XY makes less ranking erors than that of and even has berter AUC. 
The XY is optional and achieves 0.80 accuracy and ¥X’manages to have 0.70 All the rule lst consis of information that is not available than the other lists. For the segment X°Y- the overlap is not accessible for any of lst. This scenario in the above igure is depicted by doted Segment that connects segment ¥ from YX and 1X from rule list XY. Therefore rule overlap can be accessed by ‘wo rule lists, Thus, there are multiple connestions in between the rule lists and decision trees. The rule lists are said to decison tees where the empirical probebltes that are connected ith the ules generate the convex ROC a wel curves on training data. The empirical probabilities can be accessed due to coverage algorithm that deletes the training covered by a rule before the next rule is leat. The rule lists generat the probabilities suitable to taining set. The rule ‘eorderd by Some rule leaing algorithms in literature after rating all the ules. In such cas, the convent is rule coverage is re-evaluated in reordered rule list. - 3 ae 19. Discuss about learning unordered rule sets. Ans: > = Learning Unordered Rule Sets Pa 3.13 NIT-3 (Tree Models, Rule Models) Tbe true; 2. L€set of available literals; 3. _ ifnot Homogencous repeat step 4 to step 7 4, |< Best literal (DLC): 5. be bl; 6 De {{xeD|x is covered by b}; 7. LEL\ {© D\1 use the feature same as 1}; 8, endif 3 9. |e ifb then class = Cs 10. return r Inthe above algorithm the best literal is selected with respect to the learnt class C,, This class C, indicates the head of the se eae Write about rule sets for ranking and probability estimations. = Ans: Generally, the rule set containing a Set rules have 2 different ways to overlap the rules. They even have 2 instance space segments Which are more an ‘rules. And most of these segments are empty since rules are mutually exclusive. ‘Therefore, the coverage of these segments is required to be estimated. Consider the below rules set (©) iflengh=4 then class= © (14341 if eg hen cls = gi r counts canbe used fr the instances hat sre 3.14 MACHINE LEARNING [JNTU-K. “A comparison between the rule set with the below rule list XY Zis as follows, length =4 then class = © [1+,3-] else if Beak = yes then class = © [4¢,1-] else if length = 5 then class = © [0+,1-] “The coverage curve for this rule list is depicted in the above figure with 2" line. The rule set coverage curve is not tobe convex on training set because the coverage counts need to be estimated. For example, ifrule C coverage p, then it does not affect the performance of rule list. But it breaks the tie in between X Y and Z thereby introducing the concavity. In order to tum ‘a ranker into classifier, the best operating point on coverage curve must be determined. ‘ If the performance criterion is the accuracy then the point is said to be optimal, This can be achieved by. instances along with p* >0.5 as Positive andthe remaining as negative. If this found to be problematic then the assigned with highest erage by randomly choosing the tie, $o,1oj fe evaluated as +24] and itis even in trusting as pure al has average recall of 0.5 without concerning the class distribution. Therefore a good measure would be lang-nee-0. lavg-reo-0.5) = pe ~ fet? ‘average recall. And the related subgroup evaluation measure weighted related accuracy can be Another difference between the classification rule leaming and subgroup discovery is that the former doesnot focus on rules, where as the latter focuses on it. 
This can be dealt by assigning the weights to examples being decreased wwhcr there is an increase in coverage of example by newly leamed rule, One method would be to assign 1 to example weights © half when a new rule covers examples. The search heuristics can then be evaluated with respect 1 the qoulstive weight of covered examples instead of their number. The weight covering algorithm is as follows, Roe 2. Ifexamples im D have weight I repeat step 3 10 step $ 7 LeamnRule (D): 4. add rattheend of & 2 class when evaluation measure (that learns rules) handles multiple class. What is associated rule mining? Explain. Rule Mining ‘Association rule mining is a type of rule based machine learning method that is used to discover the relationship in between variables in huge database. The purpose of itis to find the strong rules in databases by using certain measures of interestingness. MACHINE LEARNING [JNTU-KAKINAD A) Tn the above example, cight customers have bought apples, mangoes, grapes and orange. Every transaction consists of some ‘even possible for pairs or sets of items. These might amount of items. For every item, the transactions can be listed. This Some 16 stem sets by using the subset relation in between the transaction sets as partial order in the form of lattice. Let the supp) denote the number of transactions. The frequent item sets can exceed the specified support threshold fA set of frequent item sey js end to be convex and fully determined by the lower boundary of largest item sets. The frequent item set can be found by the telow given enumerative breadth-frist or level-wise search algorithm because the item set support has monotonicity propeny, 6 Lo Mee 2. initialise priority queue Q that contains empty item set. 3. if Q is not empty repeat step 4 to step 4.1 next item set available at font of; S. maxe me 6. for every posible extension P of I repeat step 7 to step 10 7. if Supp (P= £) go to step & else go to step 10 8. max ¢ false; 9. add P at the back of O; 10. endif I. end for 12, ifmax = true then M<— MU {D}; a Fcc UNIT 3 (Tree Models, Rule Models) puild the association rules. These rules are of form x. The above algorithm is used to ether, Th Algorithm is used to mine the frequent item sets and there after the bodies B and heads # are selected from Av frequent item sets m by discarding the rules that have oo a i the rules that have confidence less then the specified confidence threshold. Consider the L Ree 2, M& Frequent items (D,); | 3. foreach m © M repeat step 410 6 4, foreach H Cm and B.C m such that H > B=} repeat step 5t0 step 6 5. if'supp (8 U H)/Supp(B) 2 C, then R <~ RU(. if B then H.) 6 endif 7. end for 8. retumR ‘When the above algorithm is run using the threshold 3 and confidence threshold 0.6 the below association rules are generated, If Mangoes then grapes Support 3, confidence 3/3 If grapes then Mangoes ‘Support 3, confidence 3/5 ftrue them grapes Support $ confidence 5/8 ‘The associated rule mining also has post processing stage with superflows rules filtered out. There is a quantity called lift which is mostly used in post-processing. It can be defined as follows, Lit (it 8 then #2) = se SOE =I ‘The heads of the association rules consist of multiple items. Q23. Explain about first order rule learning. Ans: First order rule learning is nothing but earning about the nodes in a graph. 
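The support and confidence computations behind the association rules described above can be sketched directly in Python. This sketch is illustrative only: it enumerates candidate item sets by brute force rather than with the priority-queue, level-wise search of the algorithm above, the function names are my own, and the transactions are hypothetical (they are not the eight transactions of the text's example).

from itertools import combinations

def support(D, itemset):
    # number of transactions containing every item in the item set
    return sum(1 for t in D if itemset <= t)

def frequent_itemsets(D, f0):
    items = set().union(*D)
    freq = []
    for k in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, k) if support(D, set(c)) >= f0]
        if not level:            # by monotonicity no larger frequent item sets exist
            break
        freq.extend(level)
    return freq

def association_rules(D, f0, c0):
    rules = []
    for m in frequent_itemsets(D, f0):
        for r in range(len(m)):                          # body B subset of m, head H = m \ B
            for body in map(frozenset, combinations(m, r)):
                head = m - body
                conf = support(D, m) / support(D, body)
                if head and conf >= c0:
                    rules.append((set(body), set(head), support(D, m), conf))
    return rules

# hypothetical transactions
D = [{"mangoes", "grapes"}, {"grapes"}, {"mangoes", "grapes", "apples"},
     {"grapes", "oranges"}, {"apples"}, {"grapes", "mangoes"}]
for body, head, s, c in association_rules(D, f0=3, c0=0.6):
    print(f"if {sorted(body) or ['true']} then {sorted(head)}  support {s}, confidence {c:.2f}")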
There are various approaches based on logic ng language Prolog, Learning of fist order rues is called inductive logic programming, The notations in itare writen Rules are written back-to-front that is in ‘head-if-body’ fashion. ‘Variables are said tobe implicily universally quantified. “The variables begin with capital leter; constants, predicates andthe funetion symbols begin wit lower case UNIT i4 LINEAR MODELS, DISTANCE BASED MODELS QUESTIONS WITH SOLUTIONS 1. Define linear models ied Model Papers, Q1(@) The linear models are non-numerical inherently and only handle the numeric features through thresholds for converting them into multiple internals, They are opposite diagrammatically in dealing the numerical features directly» But they need to reprocess the non-numerical features. The linear models geometrically make use of lines and planes for constructing the model. [Aparticular increase or decrease in features will have the same feature without concering the value or other feature. They ae sad be simple and portable to variations in the taining data. “The linear models are parametric and have a fixed form with less number of parameters required to be learned from the data. Tis not same as the tree or rule models which have doesn't have their structure fixed. They are stable where as tree models are not. G2. Write is short about least squares method Ans: ‘The least squares method is used to leam the linear modes for clasification and regression. The regression problem learns function estimator 7 : > R from the examples (x, f(x)). Here = ¥ Rt The difference between actual and estimated function values on training examples is called residuals ,= f'x)~ f (x). ‘The least squares method is introduced by Carl and has j such that >" ,€ } is reduced. A simple case of single feature called univariate regression. Q3. Write in brief about multivariate linear regression amet ; Model Papers, Nia) “The multivariate linear regression can be written in matrix form as shown below b 4) In the above equation ya and € are the m-vectors and bis said be scalar, If there are features, x becomes n-by-d matrix ad. becomes d-vector of regression coefiiients. They canbe simplified by using the homogeneous coordinates a BE Gh ccf U N IT LINEAR MODELS, DISTANCE BASED pS SIA GROUP : PART-A SHORT QUESTIONS WITH SOLUTIONS 1. Define linear models Ans: Model Papers, Q1(@) The linear models are non-numerical inherently and only handle the numeric features through thresholds for converting them into multiple internals. They are opposite diagrammatically in dealing the numerical features directly. But they need to preprocess the non-numerical features. The linear models geometrically make use of lines and planes for constructing the model. ‘Aparticular increase or decrease in features will have the same feature without concerning the value or other feature. They are said be simple and portable to variations inthe training data. ‘The linear models are parametric and have a fixed form with less number of parameters required to be learned from the data. Itis not same as the tree or rule models which have doesn’t have their structure fixed. They are stablé where as tree models are not Q2. Write is short about least squares method Ans: “The least squares method is used to leam the linear models for classification and regression. The regression problem leas function estimator j': ¥—>R from the examples (x, f(x). Here = XR’. 
The difference between actual and ‘estimated function ‘values on training examples is called residuals €, = lx) — FG). ‘The least squares method is introduced by Carl and has / such that 7,7 isreduced. simple case of single feature is called univariate regression. Q3. Write in brief about multivariate linear regression Ans: Modal Paper-tt, Q1(a) “The multivariate linear regression can be written in matrix form as shown below :) yratkb+e * In the above equation, y,a,x and ¢ are the n-vectors and b is said be scalar. If there are d features, x becomes n-by-d matrix and b becomes d-vector of regression coefficients. They can be simplified by using the homogeneous coordinates GG JG) ————— ~y 4.2 MACHINE LEARNING [JNTU-KAKINADA) Q4. Define regression Ans: Regression is a method used to avoid the overfitting by applying the additional constraints to weight vector. A comimon | ‘method isto assure thatthe weights are small in magnitude on average. This is called shrinkage, This can be illustrated by writing the least squares regression problem as optimisation problem we = arg, min (y— XJ" (y= X,) ‘The regularised version of this optinisation is as follows, w* = arg, min (y —X,)' (y —X) +2 pol? Here, |hw’= 0,1? is squared norm of vector w or dot product ww. And 2. is scalar that determines the amount of regularization. This has a closed form solution as shown below Ww =OX+ AD XY Here indicates the identity matrix that has 1's on diagonal and 0's on all the other places. Q5. Define perception ~ Ans: : . Model Papers, (6) The perception is a linear classifier that achieves perfect separation based on linearly separable data. It is proposed by a simple neural network. It iterates over training set by updating the weight vector for every incorrectly classified example. For example, consider x, as misclassified positive example that has y,=+1 and w.x,<¢. It is required to determine w’ such that wx, >w.x, that can move the decision boundary towards past x, This is possible by determining the new weight vector as w”=w +n x, where 0 < 1, $ 1 is learning rate, There is w’.x,= w.x,+ my,.x,> wx, as per the requirement, Ifx is misclassified example then there is y,=—1 and w.x,> 1. ‘The new weight vector in such cases canbe calculated as w' w—nx,and wx = way m2, < 1.x, These two cases can be combined into single rule. wow + nya, _ Q6. Discuss in brief about distance based models Ans: Distance based models are second type of models that have strong geometric intuitions. This algorithm ‘will detect the outliers to consider the distance between points in dataset. For example, consider a metric to compute the distance between two instance x, atid x, called d(x, x,). These type of models work on the concept of distance. In machine learning, the concept of distance isnot related to the physical distance between two points, rather it can be distance between two points by considering mode of transport between them. The travelling distance between two cities by plane is less than compared to train. Similarly in chess the distance completely depends upon the picee used. 
The concept of distance is different based onthe entity and mode of tel The commonly ued distance mis aeEueliem, Manhntan, Mabalanbis and Minko Write about distance based-clustering TRednunec ase cusrog meds sina distance bse casi A distance mec ia etd bul eral and distance based decision rule, The distance metric might indirectly encode the learning target inthe absence of explicit target the objective can be compat clusters determination corresponding to distance merc. Fo 10 work as optimisation criteria, ae 4.2 MACHINE LEARNING [JNTU-KAKINADA] Q4. Define regr An sion Regression is « method used to avoid the overfitting by applying the additional constraints to weight vector. A comimon method is to assure that the weights are small in magnitude on average. This is called shrinkage. This can be illustrated by writing the least squares regression problem as optimisation problem w* = arg, min (y= X,)" =X) The regularised version of this optinisation is as follows, w = ang, min (~X,)/ (9X) 4 A hw? Here, [hu >, w? is squared norm of vector w or dot product ww. And 2 is scalar that determines the amount of regularization. This has a closed form solution as shown below W RONHAD IN Here I indicates the identity matrix that has 1's on diagonal and 0's on all the other places. QS. Define perception Ans: : 2 Model Paper, a1(s) ‘The perception is a linear classifier that achieves perfect separation based on linearly separable data. It is proposed by a simple neural network, It iterates over training set by updating the weight vector for every incorrectly classified example, For example, consider-x, as misclassified positive example that has y,=-+1 and w.x, < ¢, It is required to determine w" such that w’.x, > wx, that ean move the decision boundary towards past x, This is possible by determining the new weight vector as w"= Ww +n +, where O< 71$ 1 is learning rate, There is w'.x,= w.x,+ -4,> Wx, a8 per the requirement. Ifx, is misclassified example then there is y,=-I and w., > 1. The new weight vector in such eases ean be calculated as w'= Ww — mx, and w'x)= W.2j— 13) < vx, These two cases can be combined ino single rule. wisw+ nya, Q6. Discuss in brief about distance based models Distance based models are second type of models that have strong geometric intuitions. This algorithm will detect the ‘outliers to consider the distance between points in dataset. For example, consider a metric to compute the distance between two instance x, aiid x, “called d(, x,). These type of models work on the concept of distance, In machine learning, the concept of distance i not related tothe physical distance between two pins, rather it can be distance between two points by considering ‘mode of transport between them. “The travelling distance between two cities by plane is less than compared to train. Similarly in chess the distance completely depends upon the piece used. The concept of distance is different based on the entity and mode of travel, The commonly used distance metrics are Euclidean, Manhattan, Mahalanobis and Minkouski. Q7. Write about distance based:clustering actngrnu apa a pee Ch ate The distance based clustering methods is similar to distance based classifier. 
A distance metric isa method to build examplars ‘based decision rule, The distance metric ‘might iaoe cope eneraec nomena Models, Distance based Models) 7, Initialize randomly Kv 3 By hy © RM Repeat step 3 to step 6 for Hy, 4: Assign x € D to argmin j Dis, (x, 1); 4, for/= 110k repeat step 5 to step 6, 5, _D,« {x € D| xis assigned to cluster}; 64 FOtUET Hy oy By Q9. Whatis silhouettes Ans: A technique called silhouettes is used to detect the poor quality of clustering. For a data point x, let d(x,, D) indicate the average distance of x, to data points in cluster D,. And let (0) indicate the index of cluster to which x belongs. Let a(x) = 4: (i) be average distance of x, to point to cluster Di) and let (x) = mink , hx, D,) denote the average distance to points in neighhouring cluster, Incase of a(x) > B(x), the different between B(x) ~ a(x) will be negative. The members of neighbouring cluster are close to than members of its cluster on average. To obtain normalised value, it is beter to divide by a (x). Then the below equation i sonal bea) als) SG) rax (a6), BOD) ‘Assithouette will sort and plot s(x) for instances that are grouped by cluster. Q10. What is hierarchical clustering? Ans: ‘The clustering methods make use of examplars for representing the predictine clustering, These are even methods that represent clusters by using trees. They use the features for navigating the instances space, Tree's here are called dendograms, that are defined in terms of distance measure. For this reason, they divide the given data and represent the descriptive clustering. ‘A dendogram for a given set D isa binary tree wit elements of D as leaves The intemal node is subset of elements in leaves of subtree, The node level isthe distance between the clusters represented by children of node. Leaves have level 0. This definition works by means of a method to measure the distance between two clusters. Here, a linkage function is required to turn ‘the pair wise point distances into pair wise cluster distances. UNIT-4 (Linear Models, Distance based Models) The regression finds a solution such that w w+ thf) a+bh The regression coefficient for a feature x and target, variable y would be On riate linear regression can be understood through the below given steps, 1. Normalisation of feature by dividing the values by feature’s variance. * Calculation of covariance of target variable and normalised feature ‘The sum of residuals of least squares solution is zero Lie G+5x))=n(p-a-53)=0 ‘The results follows because a =j~6 . This property makes linear regression susceptible to outliers. Even though itis, susceptible to outliers, the least squares method works well for | simple method. There are even vatiants of least squares method. | The ordinary least squares method assumes that y-values are contaminated with random noise. The total least squares might generalise this to a situation where x and y values are noisy. Q12. Discuss about multivariate linear regression. ‘Ans: Multivariate Linear Regression 4 ‘The multivariate linear regression can be written in matrix form as shown below ai) aOR pee ".C Jo(")o() (3) (eG : yratXb+e : Ini shove sont aba issaidbe scalar. Ifthere are d features, mby-e mat. and b becomes d-vector of regression coefficients. They can b simplified by using the homogeneous coordinates ¢ C-EDRr) yerwre ‘Consider another form of expressi Mode! 
Papers, a5(a) jon X*) that is n-vector Gyo d= With Op 2) j-4 entry which is product of row of x" (the /* column of 4.5 and 27 is Trevery feature is zero ~ centred the n-vector containing the required covariance. The features in "univariate need to be normalised to contain unit variance. And this can be achieved in ease of multiv scans of d-by-4 scaling matrix (a diagonal matrix with diagonal entries I/ng,). ‘The scaling matrix can be obtained by inveriing § when Sis the diagonal matrix with diagonal entries no, i The solution for multivariate regression problem would be wes xy Amore elaboration matrix than Sis XXy! Xty ‘The features in case of (X“X)" are uncorrelated rather than being zero centered. The covariance matrix Lis said to be diagonal with entries 6, ‘The matrix is diagonal with entries ng, — because of the following reasons > MXen(L+M) <> Theenities of M=0, since columns of X are zer0 centred. f By assuming these, the (X7X)"' decreases to scaling matrix S*!, It acts as transformation that decorrelates the centres: and normalises the features. A multivariate regression problem ‘can be decomposed into d univariate problems by assuming the uncorrelated features. The Correlation features is considered because it can harm in certain situations if ignored. Ifa problem {s decomposed into two univariate regression problems then learning of nearly constant function may occur. eS Q13. Explain in briof about regularised regression and usage of least-squares regression for classification. | Regression is a method used to avoid the overfitting by applying the additional constraints to weight vector. A common ‘method is to assure that the weights are small in magnitude on average. This is called shrinkage. This can be illustrated by ‘writing the least squares regression problem as optimisation art 7 a : 4.6 ‘Another form of least squares regression is called ridge regression that improves the numerical stability of matrix inversion by adding A to diagonal of X"X. Lasso provides an alternative for regu. ised regression. It stands for least absolute shrinkage and selection operator’. The ridge regulations term &,w? is replaced by sum of absolute weights ¥, |w|. The result rrunk Weights and others set to 0. It needs isation techniques to be used due to lack of some certain nutnerical opti of closed form solution, consis Usage of Least Squares Regression for Classification ‘The technique called linear regression can be used for learning a binary classifier by encoding the two classes as real numbers. For example, the Pos positives examples can be labelled with y°=+ I and negative examples can be labeled as _y°= 1. Further more, the X"y = Pos ®— Negu®. Here p® and 1° are d-vector holding mean values of features for positive and negative examples. For example in univariate case, consider the below ‘equations,, 3x9, =Pos w®—Neg p° Sny=n(,+ #5) = Pos p°—Neg p°-% 7 Since %= Pos u®—Neg »°and j= pos —neg then the covariance between x and y can be rewritten as 6,,=2 pos.neg (u®~ n°) ‘The slope of regression lin i a follows, b=2posneg # a eects veers 5+b(x+3) can be used for determining the decision boundary. The point (x. ) must be 0 0 ifx=0 MACHINE LEARNING [JNTU-KAKINADA] Heuristic Learning Q14, Explain about the perception. Ans: Model PapertV, Q5(a) ‘The perception is a linear classifier that achieves perfect separation based on linearly separable data. It is proposed by a simple neural network. 
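The closed-form solutions discussed above, w = (X'X)^-1 X'y for ordinary least squares and w = (X'X + lambda*I)^-1 X'y for the regularised (ridge) version, can be sketched with NumPy as follows. This is an illustration under my own naming, using homogeneous coordinates (a prepended column of ones) and synthetic data; note that this simple form also shrinks the intercept, exactly as the formula above is stated.

import numpy as np

def least_squares(X, y):
    # ordinary least squares in homogeneous coordinates: w = (X'X)^-1 X'y
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

def ridge(X, y, lam):
    # regularised solution: w = (X'X + lambda*I)^-1 X'y
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.solve(X1.T @ X1 + lam * np.eye(X1.shape[1]), X1.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3 + rng.normal(scale=0.1, size=50)
print(least_squares(X, y))      # approximately [3, 2, -1, 0.5]
print(ridge(X, y, lam=1.0))     # weights shrunk slightly towards zero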
It iterates over training set by updating the weight vector for every incorrectly cl ‘example, consider x, as misclassified p y,=+1 and w.x, <1. Tis required to determine w’ such that w’-x > w.x, that can move the decision boundary towards past x. This is possible by determining the new weight vector as w’=w + x, where 0 <7 $ Vis learning rate, There is wx, ~w ox + ny, ,> wax, as per the requirement. If-x is misclassified example then there is y,=—1 and w.x,> ¢. The new weight vector in such cases can be calculated as w/ = w — n x, and w’.x = wx ~ nr. x, 1 —&, and the sum of slack variables is added to objective function. This results in the below soft margin optimisation problem, +> amin 5 Iwi +e 38, wt, Subject to y, (w. x,—)21-E and &20,1 yyy ‘The S-shaped ‘sigmoid function is called as logistic function to find applications in various areas. P@|d@)= 7® =—® without assuming the magnitude of mean distances or of 6 there the below equation is obtained. oa 2 20-Pi eo) oe see) exp (rd) __ By considering scaling factor , the assumptions is unit vector isnt required, The assumption that 2° and 7” a a NIT-4 (Linear Models, Distance based Models) nm The d, has effect of moving decision boundary from w-r™=1t0x= (u®+ yO)? halfway between ‘wo class | ceistic mapping then > | gice mapping then becomes d and effect of two parameters is shown in below figures 1 1 +exp(-y(d dy) SBR ane MACHINE LEARNING [JNTU-KAKINAD, Tasted whether if i lassified by evaluating y The above algorithm is a simple counting algorithm. Where example % example x, = (py) and 2)" Cp) fog DP ce, y,, The dot product x, is the key component. By assuming bivariate aettional simplicity the dot pret can be Son as 2, 3, = 5 San eae arent es (22.2) . med prducrotheseis (32,72). («293 )=¥9 <9 709 The above equation is equa to (35) (213) £9194) = (35s) 2+ (124) #23 31% ‘This erm canbe seized by extendending feature vector with third feature 127 ‘This generates the following feature space 6 (5) = (+22 vai s-(sjo}) sUsheels)abaderbo3 ranapnn = ba} Define &(a9,) = (52%) and replace 5x, with & (93) in dual perception algoritn to generate kemel perpen that leams a type of non linear decision boundaries 1. a O fori si<|Dj : 3. _ if converged = false repeat step 4 to step 7 ; 2. converged € false 4, converged <= true 5. fori=110|D| repeat step 6 6 whiteyi 7, ay k(4)) sOdoqea, +! converged «false UNIT-4 (Linear Models, Distance based Models) 4a3s 42 DISTANCE BASED MODELS R 4.2.1 Intreduetion ale Q19. Give a brief introduction on distance based models. Ans: adet Papert 8) Distance Rased Models Distance based models are second type of models that have strong geometric i! caters © consider the distance between point in dataset. For example, consider a metric to compute the distance between 60 instance x, and x, called d(x, x,). These type of models work on the concept of distance. In machine learning, the concept of distance is not related to the physival distance between two points, rather it can be distance between two points by considering mode of transport bet veea them. The travelltiag distance between two cities by plane is less than compared to train. Similarty in chess the distance comletely depends upon the piece used. The concept of distance is different based on the entity and mode of travel. The commonly used distance metrics are Eucli¢ean, Manhattan, Mahalanobis and Minkouski. ‘The Euclidean distance is ordinary distance between two points in euctidean space. 40.04) VGA ERY OBS = Siew fa ons. 
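The mistake-driven update rule w <- w + eta * y_i * x_i described above can be sketched as a small training loop. This is illustrative only: the homogeneous encoding of the threshold t (an extra input fixed at -1) and the toy data are assumptions of the sketch, not taken from the text.

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    # append a constant -1 input so the last weight plays the role of the threshold t
    X1 = np.hstack([X, -np.ones((X.shape[0], 1))])
    w = np.zeros(X1.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X1, y):
            if yi * (w @ xi) <= 0:          # misclassified (or on the boundary)
                w += eta * yi * xi          # w <- w + eta * y_i * x_i
                converged = False
        if converged:
            break
    return w[:-1], w[-1]                    # weight vector and threshold

# hypothetical linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, t = perceptron(X, y)
print(w, t)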
This algortim wall detect the P Figure: Euclidean distance . + Manhattan distance is rectilinear distance that can be defined as sum of lengths of projection of lune segment between points on coordinate axes. 4.14 MACHINE LEARNING [JNTU-KAKINADA} —— ES NTUKAKINADA} ‘Afier determining the exemplars th Q20. Explain about neighbours Exemplars Ans: Model Papers, 05(0) Neighbours and exemplars Developing a model in terms of number of prototypical instances or exemplars is one of the key idea for distance based model. Another key idea is to define the decision rule in terms of nearest exemplars or neighbours. A basic property of mean of a set of vectors is to minimize the sum of squared euclidean distances to those vectors. Decrease in sum of squared euclidean distances of given set of points is same as decreasing average squared euclidean distance. Consider a point called geometric median to reduce the total euclidean distance as exemplar. It corresponds to median or middle value of set of numbers. In case of multivariate data there is no closed form expression for geometric median that must be calculated through successive approximation. For this reason, the distance based methods use squared euclidean distance. In some cases the exemplar is restricted to act as a data point. In such cases, medoid is used to differentiate from centroid. A medoid can be determined by calculating the total distance for every data point to other point inorder to select a point that can reduce it, This is called O (2°) operation for n points without concem of distance metric. ‘Therefore medoids does not have any computational reason to ‘opt a distance metric over another. The below figure illustrates 10 data points where the exemplars provide different results. can create decision boundary as perpendicular bisector of line segment that connects two exemplars, An alternative for this would be to use decision rule, Ifx is ne ‘must be classified otherw to 1 then positive sified. Orelag an instance to class of nearest exemplar must be classified, The e negative must b same decision boundary is obtained even if euclidean distance is used as closeness measure. This is shown in below figure, Figure: Data sot of 10 points for distance metrics ‘The basic linear classifier is allowed to be interpreted {from distance based perspective as a way to build exemplars to. reduce squared euclidean distance in all the classes and thereby using nearest exemplar decision rule, With this perspective ‘many new possibilities get opened. The structure of decision boundary can be known when manhattan distance for distance for distance rule is used. This is shown in below figure, through fixed angle in 45 degrees. ‘The distance based perspective is sent ions exemplar decision rule works better incase Of multiple exemplars. This again provides a multiclass vetsion of basi linerlassifie. This stated forts exemplars in blow 4.16 ng process of Kenearest neighbour involves ions that are possible, They increase initially with in k and later on decrease, When the k increases, the bias also increases and variance decreases. It is not easy to decide the type of value for k suitable for given data set. But this question can be avoided by using distance weighting. The note count of an exemplar depends upon the distance between the Figure: 7-Nearest Neighbour ‘The figures uses the reciprocal of distance to an exemplar ‘as weight of its note with this the decision boundaries are "represented because the mode! 
Q22. Write about distance based clustering.

Ans:

Distance based clustering methods are similar in spirit to distance based classifiers: a distance metric is combined with a way of constructing exemplars and a distance based decision rule. In the absence of an explicit target variable, the distance metric indirectly encodes the learning target, so the objective becomes finding clusters that are compact with respect to that metric. For this, a notion of cluster compactness is required that can serve as the optimisation criterion. The scatter matrix of a data matrix X, with mean row vector μ and 1 an n-vector of ones, is

S = (X − 1μ)ᵀ (X − 1μ) = Σ_{i=1}^{n} (Xᵢ − μ)ᵀ(Xᵢ − μ)

The scatter of X is defined as

Scat(X) = Σ_{i=1}^{n} ‖Xᵢ − μ‖²

which is equal to the trace of the scatter matrix. Let D be split into K subsets D₁ ⊎ … ⊎ D_K = D, let μⱼ denote the mean of Dⱼ, let S be the scatter matrix of D and Sⱼ the scatter matrices of the Dⱼ. These matrices satisfy the relationship

S = Σⱼ Sⱼ + B

Here B is the scatter matrix obtained when every point in Dⱼ is replaced by the corresponding centroid μⱼ. Each Sⱼ is called a within-cluster scatter matrix and describes the compactness of the j-th cluster; B is the between-cluster scatter matrix and describes the spread of the cluster centroids. Taking traces, the decomposition becomes

Scat(D) = Σⱼ Scat(Dⱼ) + Σⱼ |Dⱼ| ‖μⱼ − μ‖²

K-Means Algorithm

The K-means problem is to find the partition that minimises the total within-cluster scatter. It is NP-complete, so there is no efficient method guaranteed to find the global minimum. The algorithm for the K-means problem is as follows.
1. Randomly initialise K vectors μ₁, …, μ_K ∈ ℝᵈ;
2. repeat steps 3 to 6 until convergence;
3. assign each x ∈ D to argminⱼ Dis₂(x, μⱼ);
4. for j = 1 to K repeat steps 5 and 6;
5. Dⱼ ← {x ∈ D | x is assigned to cluster j};
6. μⱼ ← (1/|Dⱼ|) Σ_{x∈Dⱼ} x;
7. return μ₁, …, μ_K.

The algorithm iterates between splitting the data using the nearest-centroid decision rule and recalculating the centroids from the resulting partition. The figure below shows the algorithm on a small data set containing three clusters.

Figure: The First Iteration of 3-Means on Gaussian Mixture Data

The K-means algorithm should be run multiple times and the solution with the smallest within-cluster scatter selected.

Q24. Write about clustering around medoids and silhouettes.

Ans:

The K-means algorithm can be adapted to work with medoids instead of means, giving an alternative called Partitioning Around Medoids (PAM).
1. Pick K data points μ₁, …, μ_K ∈ D at random;
2. repeat steps 3 to 7 until no further improvement is possible;
3. assign each x ∈ D to argminⱼ Dis(x, μⱼ);
4. for j = 1 to K: Dⱼ ← {x ∈ D | x is assigned to cluster j};
5. Q ← Σⱼ Σ_{x∈Dⱼ} Dis(x, μⱼ);
6. for every medoid m and every non-medoid o, calculate the improvement in Q obtained by swapping m with o;
7. select the pair with the maximum improvement and swap;
8. return μ₁, …, μ_K.

The algorithm attempts to improve the clustering by swapping medoids with other data points. The quality Q of the clustering is calculated as the total distance of all points to their nearest medoid. There are K(n − K) pairs of one medoid and one non-medoid, and evaluating Q for a swap requires iterating over the n − K non-medoid points, so the computational cost of one iteration is quadratic in the number of data points. For large data sets, PAM can be run on a sample and Q evaluated on the whole data set; this should be repeated multiple times on different samples.
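For concreteness, here is a minimal sketch of the K-means iteration described above, using squared Euclidean distance in the assignment step and randomly chosen data points as the initial centroids; the restart-and-keep-best logic mentioned earlier is left out.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Assign each point to its nearest centroid, then recompute each
    centroid as the mean of its cluster; stop when nothing changes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: assignments can no longer change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)   # roughly (1.33, 1.33) and (8.33, 8.33)
```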
The exemplar-based clustering methods represent the clusters by means of exemplars only. This ignores the shape of the clusters and can lead to counter-intuitive results.

Figure: After Rescaling the Y-axis, this Configuration has Higher Between-Cluster Scatter than the Intended One

The two data sets in the figure are identical apart from a rescaling of the y-axis, yet K-means finds different clusterings: after rescaling, the two centroids are further apart in the unintended solution, which therefore appears to be the better one. The real issue is that estimating the 'shape' of clusters cannot be done with cluster means or medoids alone, since only the trace of the scatter matrices is taken into account.

Silhouettes

A technique called silhouettes is used to detect poor-quality clusterings. For a data point x_i, let d(x_i, Dⱼ) denote the average distance of x_i to the data points in cluster Dⱼ, and let j(i) denote the index of the cluster to which x_i belongs. Let a(x_i) = d(x_i, D_{j(i)}) be the average distance of x_i to the points of its own cluster, and let b(x_i) = min_{k≠j(i)} d(x_i, D_k) be the average distance to the points of its nearest neighbouring cluster. If a(x_i) > b(x_i), the difference b(x_i) − a(x_i) is negative: the members of the neighbouring cluster are, on average, closer to x_i than the members of its own cluster. To obtain a normalised value, the difference is divided by the larger of the two quantities, giving the silhouette value

s(x_i) = (b(x_i) − a(x_i)) / max(a(x_i), b(x_i))

Figure: Silhouette for Clustering using Squared Euclidean Distance

In the figure, the squared Euclidean distance is used to build the silhouette, but the method can be applied to other distance metrics as well. It is also possible to calculate average silhouette values per cluster, and over the whole data set.

4.2.5 Hierarchical Clustering

Q25. Discuss about hierarchical clustering.

Ans:

The clustering methods considered so far make use of exemplars to represent a predictive clustering. There are also methods that represent clusters by means of trees. Unlike tree models, which use features to navigate the instance space, the trees used here, called dendrograms, are defined purely in terms of a distance measure; for this reason they partition only the given data and represent a descriptive clustering.

A dendrogram for a given data set D is a binary tree with the elements of D as its leaves. Each internal node represents the subset of elements contained in the leaves of its subtree. The level of a node is the distance between the two clusters represented by its children; leaves have level 0.

This definition presupposes a way of measuring the distance between two clusters: a linkage function is required to turn pairwise point distances into pairwise cluster distances. A linkage function L : 2^X × 2^X → ℝ calculates the distance between arbitrary subsets of the instance space, given a distance metric Dis : X × X → ℝ. The common linkage functions are as follows.

1. Single Linkage: defines the distance between two clusters as the smallest pairwise distance between their elements.
L_single(A, B) = min_{x∈A, y∈B} Dis(x, y)

2. Complete Linkage: defines the distance between two clusters as the largest pairwise distance.
L_complete(A, B) = max_{x∈A, y∈B} Dis(x, y)

3. Average Linkage: defines the distance between two clusters as the average pairwise distance.
L_average(A, B) = (1 / (|A||B|)) Σ_{x∈A} Σ_{y∈B} Dis(x, y)

4. Centroid Linkage: defines the cluster distance as the distance between the cluster means.
L_centroid(A, B) = Dis( Σ_{x∈A} x / |A| , Σ_{y∈B} y / |B| )
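A minimal sketch of these four linkage functions, assuming Euclidean pairwise distances and clusters given as explicit lists of points:

```python
import math
from itertools import product

def single_linkage(A, B, dist=math.dist):
    return min(dist(x, y) for x, y in product(A, B))

def complete_linkage(A, B, dist=math.dist):
    return max(dist(x, y) for x, y in product(A, B))

def average_linkage(A, B, dist=math.dist):
    return sum(dist(x, y) for x, y in product(A, B)) / (len(A) * len(B))

def centroid_linkage(A, B, dist=math.dist):
    mean = lambda C: tuple(sum(col) / len(C) for col in zip(*C))
    return dist(mean(A), mean(B))

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
print(single_linkage(A, B), complete_linkage(A, B))   # 3.0 and ~4.12
```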
An algorithm to create a dendrogram is as follows.
1. Initialise the clusters to singleton data points;
2. create a leaf at level 0 for every singleton cluster;
3. repeat steps 4 and 5 until all the data points are in one cluster;
4. find the pair of clusters X, Y with the lowest linkage l and merge them;
5. create a parent of X, Y at level l;
6. return the resulting binary tree with its linkage levels.

Hierarchical clustering using single linkage can be accomplished by calculating and sorting the pairwise distances between the data points, which can be done in O(n²) time for n points. Single as well as complete linkage define the cluster distance through a single pair of points. Centroid linkage can lead to non-intuitive dendrograms, because the cluster distance after a merge can be smaller than before, so the levels need not increase monotonically. The construction of a dendrogram is deterministic, but, like tree models, dendrograms have high variance: small changes in the data points can lead to large changes in the dendrogram. Hierarchical clustering does not require the number of clusters to be fixed in advance, but the distance measure and the linkage function do need to be chosen in advance.

UNIT-5  PROBABILISTIC MODELS, FEATURES AND MODEL ENSEMBLES

PART-A  SHORT QUESTIONS WITH SOLUTIONS

Q1. Define probabilistic model.

Ans:

Probabilistic models incorporate random variables and probability distributions into the model of an event or phenomenon. A probabilistic model produces a probability distribution, whereas a deterministic model gives a single possible outcome for an event. Probabilities are useful for expressing a model's expectation about the class of a given instance. For example, a probability estimation tree attaches a class probability distribution to every leaf of the tree, and instances filtered down to a particular leaf are labelled with that class distribution. In a similar way, a calibrated linear model translates the distance from the decision boundary into a class probability. These are discriminative probabilistic models: they model the posterior probability distribution P(Y | X), where Y is the target variable and X denotes the features, and they return a probability distribution over Y for a given X.

Q2. How is the naive Bayes model trained?

Ans:

Training of a probabilistic model is done by estimating the parameters of the distributions in the model. The parameter θ of a Bernoulli distribution is estimated by counting the number of successes d in n trials and setting θ̂ = d/n; for every class, the number of e-mails containing the word in question is counted. Such frequency estimates are usually smoothed by adding pseudo-counts, which represent the outcome of virtual trials drawn from a fixed distribution. The Laplace correction is the smoothing operation for the Bernoulli distribution with two virtual trials, one resulting in success and one in failure; the relative frequency estimate is changed to (d + 1)/(n + 2). For a categorical distribution over k categories, smoothing adds a pseudo-count to every category, giving the smoothed estimate (d + 1)/(n + k). The m-estimate generalises this by making the number of pseudo-counts m and the distribution p over the categories into parameters; the estimate for the i-th category is then (d_i + p_i m)/(n + m).

Q3. Write in short about Gaussian mixture models.

Ans:

An application of expectation-maximisation (EM) is to estimate the parameters of a Gaussian mixture model from data. Here the data points are produced by K normal distributions, each with its own mean μⱼ and covariance matrix Σⱼ; the proportion of points obtained from the j-th Gaussian is governed by a prior τ = (τ₁, …, τ_K). If every data point in the sample were labelled with the index of the Gaussian it came from, this would be a straightforward classification problem, solved by estimating each Gaussian and each τⱼ independently from the data points associated with it. An alternative way to model this is to introduce, for every data point, a Boolean vector Z_i = (z_i1, …, z_iK) in which exactly one component is set to 1, indicating that the i-th data point derives from the j-th Gaussian. Using these hidden indicator variables, the distribution can be written out and marginalised to obtain the general expression for the Gaussian mixture model.
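The following is a minimal one-dimensional sketch of the EM idea just described; the responsibilities computed in the E-step act as soft versions of the indicator vectors Z_i, and the data and starting values are invented for illustration only.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(xs, mus, variances, priors):
    """One EM iteration for a univariate Gaussian mixture.
    E-step: responsibilities resp[i][j] are soft z_ij indicators.
    M-step: re-estimate each component from responsibility-weighted data."""
    K, n = len(mus), len(xs)
    resp = []
    for x in xs:
        joint = [priors[j] * normal_pdf(x, mus[j], variances[j]) for j in range(K)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    n_j = [sum(resp[i][j] for i in range(n)) for j in range(K)]
    mus = [sum(resp[i][j] * xs[i] for i in range(n)) / n_j[j] for j in range(K)]
    variances = [sum(resp[i][j] * (xs[i] - mus[j]) ** 2 for i in range(n)) / n_j[j]
                 for j in range(K)]
    priors = [n_j[j] / n for j in range(K)]
    return mus, variances, priors

xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
params = ([0.0, 4.0], [1.0, 1.0], [0.5, 0.5])
for _ in range(20):
    params = em_step(xs, *params)
print(params)   # component means converge to roughly 1.0 and 5.07
```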
Q4. What are features?

Ans:

Features are the workhorses of machine learning. They are also called attributes and are defined as mappings from the instance space X to a feature domain F. Consider two features representing the age and the house number of a person. Both map into the integers, but they are handled differently: calculating the average age of a group of people is meaningful, but an average house number is not useful. The domain of a feature, as well as the operations that make sense on it, must therefore be considered; this in turn depends on whether the feature values are expressed on a meaningful scale. Even though house numbers are integers, they are really ordinals: number 10 has neighbours 8 and 12, but the distance between 8 and 10 need not be the same as the distance between 10 and 12. In the absence of a linear scale it is not meaningful to add or subtract house numbers, which precludes operations such as averaging.

Q5. List out various types of features.

Ans:

1. Quantitative Features: These have a meaningful numerical scale and involve a mapping into the reals. Even when a feature maps into a subset of the reals, such as age in years, statistics like the mean or standard deviation require the full scale of the reals.

2. Ordinal Features: These are features without a scale but with an ordering. The domain of an ordinal feature is a totally ordered set, such as a set of characters or strings. Even when the domain of a feature is a set of integers, calling the feature ordinal means that the scale is dispensed with. Ordinal features allow the mode and median as central-tendency statistics, and quantiles as dispersion statistics.

3. Categorical Features: Features without an ordering or scale are called categorical features. They do not allow any statistic apart from the mode.

Q7. What is discretisation?

Ans:

Discretisation is a method of transforming a quantitative feature into an ordinal feature. Each ordinal value is referred to as a bin and corresponds to an interval of the original quantitative feature. Unsupervised discretisation needs the number of bins to be decided beforehand. One option is to select the bins such that each of them contains approximately the same number of instances; this is called equal-frequency discretisation.

Q8. Write about normalisation.

Ans:

Normalisation is the process of adapting the scale of a quantitative feature, or adding a scale to an ordinal or categorical feature, in an unsupervised fashion. Feature normalisation is needed to neutralise the effect of different quantitative features being measured on different scales. If the features are approximately normal, they can be converted into z-scores by centring on the mean and dividing by the standard deviation. Feature normalisation can also be understood as expressing the feature on a [0, 1] scale, which can be accomplished in several ways: for example, the linear scaling f ↦ (f − l)/(h − l) can be applied if the highest and lowest values h and l of the feature are known.
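A minimal sketch of the two unsupervised transformations just described, z-score normalisation and equal-frequency discretisation; the exact bin-assignment convention is an illustrative choice.

```python
import statistics

def z_scores(values):
    """Centre on the mean and divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value an ordinal bin index so that the bins hold
    (almost) the same number of instances."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

ages = [23, 25, 31, 35, 41, 44, 52, 60]
print(z_scores(ages))
print(equal_frequency_bins(ages, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```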
Q9. Write about calibration.

Ans:

Calibration refers to supervised transformation methods that take the class labels into account. Feature calibration can be defined as a supervised feature transformation that adds a meaningful scale, carrying class information, to arbitrary features. This has numerous advantages. For example, it enables models that require a scale, such as linear classifiers, to handle categorical and ordinal features. It also enables the learning algorithm to choose whether to treat a feature as categorical, ordinal or quantitative. The problem of feature calibration can be stated as follows: given a feature F : X → F, construct a calibrated feature F^c : X → [0, 1] such that F^c(x) estimates the probability F^c(x) = P(⊕ | F(x)).

Q11. Explain about probabilistic models.

Ans:

Probabilistic Models

Probabilistic models incorporate random variables and probability distributions into the model of an event or phenomenon. A probabilistic model produces a probability distribution, whereas a deterministic model gives a single possible outcome for an event.

Probabilities are useful for expressing a model's expectation about the class of a given instance. For example, a probability estimation tree attaches a class probability distribution to every leaf of the tree, and instances filtered down to a particular leaf are labelled with that class distribution. In a similar way, a calibrated linear model translates the distance from the decision boundary into a class probability. These are discriminative probabilistic models: they model the posterior probability distribution P(Y | X), where Y is the target variable and X denotes the features, and they return a probability distribution over Y for a given X.

The other main class of probabilistic models are the generative models. They model the joint distribution P(Y, X) of the target Y and the feature vector X. Once the joint distribution is available, any conditional or marginal distribution involving the same variables can be derived. In particular, since P(X) = Σ_y P(Y = y, X), the posterior distribution is obtained as

P(Y | X) = P(Y, X) / Σ_y P(Y = y, X)

Alternatively, generative models can be described by the likelihood function P(X | Y), because P(Y, X) = P(X | Y) P(Y) and the target or prior distribution P(Y) is easily estimated or postulated. These models are called generative because it is possible to sample from the joint distribution and so obtain new data points together with their labels.

An attractive feature of the probabilistic perspective is that it allows learning to be viewed as a process of reducing uncertainty. For example, a uniform posterior distribution expresses that, before observing the instance to be classified, nothing is known about its class. If the posterior distribution is less uniform after observing the instance, the uncertainty about the class has been reduced. This can be repeated whenever new data is received, by using the posterior as the new prior, and the same approach can be applied to any unknown quantity we are faced with. Modelling a posterior distribution over the model parameters θ themselves has a number of advantages and is associated with the Bayesian perspective, in which probabilities do not have to be interpreted as estimates of relative frequencies but can carry the more general meaning of degrees of belief.
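As a small illustration of how a generative model yields the posterior via the formula above, the following sketch uses invented likelihoods and prior for a two-class spam/ham example; the numbers are not taken from the text.

```python
def posterior(likelihoods, prior):
    """P(Y=y | X=x) = P(X=x | Y=y) P(Y=y) / sum_y' P(X=x | Y=y') P(Y=y')."""
    joint = {y: likelihoods[y] * prior[y] for y in prior}
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}

# Hypothetical numbers: P(x | spam) = 0.3, P(x | ham) = 0.05, P(spam) = 0.2.
print(posterior({"spam": 0.3, "ham": 0.05}, {"spam": 0.2, "ham": 0.8}))
# {'spam': 0.6, 'ham': 0.4}
```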
