Data Warehousing & Data Mining — MAKAUT Organizer of 2023-24
DATA WAREHOUSING AND DATA MINING

Introduction to Data Warehousing
Classification and Prediction of Data Warehousing
Mining Time Series Data
Mining Data Streams
Web Mining
Recent Trends in Distributed Warehousing

NOTE: The MAKAUT course structure and syllabus of the 6th semester has been changed from 2021. Previously, DATA WAREHOUSING AND DATA MINING was in the 7th semester. The subject has been redesigned and shifted to the 6th semester as per the present curriculum, and its organization has changed slightly. Taking special care of this matter, we are providing the relevant MAKAUT university solutions and some model questions & answers for the newly introduced topics, so that students can get an idea about university question patterns.

INTRODUCTION TO DATA WAREHOUSING

Multiple Choice Type Questions

1. A data warehouse is an integrated collection of data because [WBUT 2009, 2015]
a) It is a collection of data of different types
b) It is a collection of data derived from multiple sources
c) It is a relational database
d) It contains summarized data
Answer: (b)

2. A data warehouse is said to contain a 'subject oriented' collection of data because [WBUT 2009, 2013]
a) Its contents have a common theme
b) It is built for a specific application
c) It cannot support multiple subjects
d) It is a generalization of 'object-oriented'
Answer: (a)

3. A data warehouse is said to contain a 'time-varying' collection of data because [WBUT 2010, 2013, 2015]
a) Its contents vary automatically with time
b) Its life-span is very limited
c) Every key structure of the data warehouse contains, either implicitly or explicitly, an element of time
d) Its content has an explicit time-stamp
Answer: (c)

4. Data warehousing is used for [WBUT 2010, 2012]
a) Decision Support System
b) OLTP applications
c) Database applications
d) Data Manipulation applications
Answer: (a)

5. Which of the following is TRUE? [WBUT 2010, 2012]
a) A data warehouse can be used for analytical processing only
b) A data warehouse can be used for information processing (query, report) and analytical processing
c) A data warehouse can be used for data mining only
d) A data warehouse can be used for information processing (query, report), analytical processing and data mining
Answer: (d)

6. Data warehouse architecture is just an overall guideline; it is not a blueprint for the data warehouse. [WBUT 2014]
a) True
b) False
Answer: (b)

7. The most distinguishing characteristic of DSS data is [WBUT 2011]
a) Granularity
b) Timespan
c) Dimensionality
d) Data currency
Answer: (c)

8. A data warehouse is built as a separate repository of data, different from the operational data of an enterprise, because [WBUT 2012, 2013]
a) It is necessary to keep the operational data free of any warehouse operations
b) A data warehouse cannot afford to allow corrupted data within it
c) A data warehouse contains summarized data whereas the operational database contains transactional data
d) None of these
Answer: (c)

9. Dimension data within a warehouse exhibits which one of the following properties? [WBUT 2012, 2015]
a) Dimension data consists of the minor part of the warehouse
b) The aggregated information is actually dimension data
c) It contains historical data
d) Dimension data is the information that is used to analyze the elemental transaction
Answer: (b)
10. A data warehouse is said to contain a 'time-varying' collection of data because [WBUT 2014, 2015, 2016, 2017]
a) its content has an explicit time-stamp
b) its life-span is very limited
c) it contains historical data
d) its contents vary automatically with time
Answer: (c)

11. The important aspect of the data warehouse environment is that data found within the data warehouse is [WBUT 2016, 2018]
a) subject-oriented
b) time-variant
c) integrated
d) all of these
Answer: (d)

12. Data warehousing is used for [WBUT 2016]
a) decision support system
b) OLAP applications
c) Database applications
d) Data manipulation applications
Answer: (a)

13. ............ is a subject-oriented, integrated, time-variant, non-volatile collection of data. [WBUT 2017]
a) Data Mining
b) Data Warehousing
c) Document Mining
d) Text Mining
Answer: (b)

Short Answer Type Questions

1. Define Data Marts. [WBUT 2008, 2010, 2011, 2015, 2016]
Define the types of Data Marts. [WBUT 2009, 2010, 2011, 2015]
Answer:
1st Part:
A data mart is a group of subjects that are organized in a way that allows them to assist departments in making specific decisions. For example, the advertising department will have its own data mart, while the finance department will have a data mart that is separate from it. In addition, each department has full ownership of the software, hardware and other components that make up its data mart.
2nd Part:
There are two types of data marts:
• Independent data marts — sourced from data captured from OLTP systems, external providers, or from data generated locally within a particular department or geographic area.
• Dependent data marts — sourced directly from enterprise data warehouses.

2. Define data mining. What are the advantages of data mining over traditional approaches? [WBUT 2009]
Answer:
1st Part:
Data mining, which is also known as knowledge discovery, is one of the most popular topics in information technology. It concerns the process of automatically extracting useful information, and it holds the promise of discovering hidden relationships that exist in large databases. These relationships represent valuable knowledge that is crucial for many applications.
Data mining is not confined to the analysis of data stored in data warehouses. It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, textual, spatial and multimedia data which are difficult to model with current multidimensional database technology.
2nd Part:
With the help of data mining, organizations are in a better position to predict future business trends, the possible amount of revenue that could be generated, the orders that could be expected, and the type of customers that could be approached. Traditional approaches are not able to generate such accurate results as they use simpler algorithms. One major advantage of data mining over a traditional statistical approach is its ability to deal with heterogeneous data fields. The advantages of data mining help businesses grow, help keep customers happy, and help in many other areas such as data management.

3. What is the importance of Association Rules in data mining? [WBUT 2009]
OR,
Explain support, confidence and frequent item set, and give a formal definition of association rule. [WBUT 2013]
OR,
What is an Association Rule? Define support, confidence and item set in association rule mining.
Answer:
To illustrate the concepts, we use a small example from the supermarket domain.
The set of items is I = {milk, bread, butter, beer} and a small database containing the items is shown below:

    Transaction 1: milk, bread
    Transaction 2: bread, butter
    Transaction 3: beer
    Transaction 4: milk, bread, butter
    Transaction 5: bread, butter

This is an example supermarket database with five transactions. An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers also buy butter.
To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database above, the itemset {milk, bread} has a support of 2/5 = 0.4, since it occurs in 40% of all transactions (2 out of 5 transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2/0.4 = 0.5 in the database above, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway. Thus, much data mining starts with the assumption that we only care about sets of items with high support, i.e., sets that appear together in many baskets. We then find association rules or causalities only involving high-support sets of items, which must appear in at least a certain percent of the baskets, called the support threshold. We use the term frequent itemset for "a set S that appears in at least fraction s of the baskets", where s is some chosen constant, typically 0.01 or 1%.
Association rules are statements of the form {X1, X2, ..., Xn} => Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y. The probability of finding Y, for us to accept this rule, is the confidence of the rule. We normally would search only for rules that had confidence above a certain threshold.
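The computations above are mechanical enough to script. The following is a minimal, illustrative Python sketch (our own code, not part of the university answer; the helper names support and confidence are our own) that reproduces the support and confidence figures for the five-transaction example:

    transactions = [
        {"milk", "bread"},
        {"bread", "butter"},
        {"beer"},
        {"milk", "bread", "butter"},
        {"bread", "butter"},
    ]

    def support(itemset, db):
        """Fraction of transactions containing every item in `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in db) / len(db)

    def confidence(lhs, rhs, db):
        """conf(lhs => rhs) = supp(lhs U rhs) / supp(lhs)."""
        return support(set(lhs) | set(rhs), db) / support(lhs, db)

    print(support({"milk", "bread"}, transactions))                 # 0.4
    print(confidence({"milk", "bread"}, {"butter"}, transactions))  # 0.5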
4. State the differences between Data Mart & Warehouse. [WBUT 2010, 2015]
Answer:
A data warehouse has a structure which is separate from a data mart, even though they may look similar. Because of this, it is difficult to coordinate the data across multiple departments. Each department will have its own view of how a data mart should look, and the data mart it uses will be specific to it. In contrast, a data warehouse is designed around the organization as a whole. Instead of being owned by one department, a data warehouse will generally be owned by the entire company.
While the data contained in data warehouses are granular, the information contained in data marts is not very granular at all. Another thing that separates data warehouses from data marts is that data warehouses contain larger amounts of information. The information held by data marts is often summarized. Data warehouses will not contain information that is biased on the part of a department; instead, they hold the information that is analyzed and processed by the organization. Much of the information held in data warehouses is historical in nature, and they are designed to process this information.

5. When is a Data Mart appropriate? [WBUT 2011]
Answer:
Data marts are created for:
a) Speeding up queries by reducing the volume of data to be scanned
b) Structuring data in a form suitable for user access tools
c) Partitioning data to impose access control strategies
d) Segmenting data onto different hardware platforms.
Data marts should not be used in other cases, as the operational costs of data marting can be high and, once a strategy is in place, it can be difficult to change it without incurring substantial cost.

6. What is sequence mining? [WBUT 2011]
Answer:
Sequence mining is a type of structured data mining in which the database and administrator look for sequences or trends in the data. This data mining is split into two fields. Itemset sequence mining typically is used in marketing, and string sequence mining is used in biology research. Sequence mining is different from regular trend mining because the data are more specific, which makes building an effective database difficult for database designers, and it can sometimes go awry if the sequence is any different from the common sequence.

7. Differentiate among Enterprise Warehouse, Data Mart and Virtual Warehouse. [WBUT 2013]
Answer:
Enterprise warehouse — collects all of the information about subjects spanning the entire organization.
Data mart — a subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to specific, selected groups, such as the marketing department.
Virtual warehouse — a set of views over operational databases. Only some of the possible summary views may be materialized.

8. How is a data warehouse different from a database? [WBUT 2013, 2016, 2018]
Answer:
A database is used to store data, while a data warehouse is mostly used to facilitate reporting and analysis. Basically, a database is just where the data is stored; in order to access or analyze this data, a database management system is required. However, a data warehouse does not necessarily require a DBMS. The purpose of a data warehouse is easy access to the data for a user. The data warehouse may also be used to analyze the data; however, the actual process of analysis is called data mining.
Some differences between a database and a data warehouse:
• A database is used for Online Transactional Processing (OLTP) but can be used for other purposes such as data warehousing.
• A data warehouse is used for Online Analytical Processing (OLAP). This reads historical data for users to make business decisions.
• In a database the tables and joins are complex, since they are normalized for an RDBMS. This reduces redundant data and saves storage space.
• In a data warehouse, the tables and joins are simple, since they are de-normalized. This is done to reduce the response time for analytical queries.
• Relational modeling techniques are used for RDBMS database design, whereas dimensional modeling techniques are used for data warehouse design.
• A database is optimized for write operations, while a data warehouse is optimized for read operations.
• In a database, the performance is low for analysis queries, while in a data warehouse, there is high performance for analytical queries.
• A data warehouse is a step ahead of a database. It includes a database in its structure.

9. What are the issues relating to the diversity of database types?
[WBUT 2016]
Answer:
• Handling of relational and complex types of data — The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems — The data is available at different data sources which may be structured, semi-structured or unstructured. Therefore mining the knowledge from them adds challenges to data mining.

10. Introduce the idea of Data Mining with example(s). What are the steps involved in the Knowledge Discovery in Databases (KDD) process? [WBUT 2016]
Answer:
Data Mining: Refer to Question No. 2 (1st Part) of Short Answer Type Questions.
Data mining example:
Mobile phone and utilities companies use data mining and business intelligence to predict 'churn', the term they use for when a customer leaves their company to get their phone/gas/broadband from another provider. They collate billing information, customer service interactions, website visits and other metrics to give each customer a probability score, then target offers and incentives to customers whom they perceive to be at a higher risk of churning.
Steps involved in the KDD process:
1. Identify the goal of the KDD process from the customer's perspective.
2. Understand the application domains involved and the knowledge that is required.
3. Select a target data set or subset of data samples on which discovery is to be performed.
4. Cleanse and pre-process data by deciding strategies to handle missing fields and alter the data as per the requirements.
5. Simplify the data sets by removing unwanted variables. Then, analyze useful features that can be used to represent the data, depending on the goal or task.
6. Match KDD goals with data mining methods to suggest hidden patterns.
7. Choose data mining algorithms to discover hidden patterns. This process includes deciding which models and parameters might be appropriate for the overall KDD process.
8. Search for patterns of interest in a particular representational form, which includes classification rules or trees, regression and clustering.
9. Interpret essential knowledge from the mined patterns.
10. Use the knowledge and incorporate it into another system for further action.
11. Document it and make reports for interested parties.
(A minimal code sketch of this process is given after Question 11 below.)

11. Does a data warehouse involve a transaction? Explain. [WBUT 2018]
Answer:
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
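Purely as an illustration of how the KDD steps in Question 10 chain together, here is a minimal Python sketch. It is our own toy example (the record layout and the min_count threshold are assumptions, not part of any university answer):

    from itertools import combinations
    from collections import Counter

    # Step 3: target data set — toy purchase records (one is dirty).
    records = [
        {"customer": "c1", "items": ["milk", "bread"]},
        {"customer": "c2", "items": None},               # missing field
        {"customer": "c3", "items": ["bread", "butter"]},
        {"customer": "c4", "items": ["milk", "bread", "butter"]},
    ]

    # Step 4: cleansing — here we simply drop records with missing item lists.
    clean = [r for r in records if r["items"]]

    # Steps 6-8: a trivial "mining" pass — count co-occurring item pairs.
    pair_counts = Counter()
    for r in clean:
        for pair in combinations(sorted(set(r["items"])), 2):
            pair_counts[pair] += 1

    # Steps 9-11: interpret and report patterns above a chosen threshold.
    min_count = 2  # assumed interestingness threshold
    for pair, n in pair_counts.items():
        if n >= min_count:
            print(f"pattern {pair} occurred in {n} baskets")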
Long Answer Type Questions

1. Define Data Warehouse and briefly describe its characteristics. [WBUT 2009, 2018]
Answer:
A data warehouse consists of a computer database responsible for the collection and storage of information for a specific organization. This collection of information is then used to manage information efficiently and analyze the collected data. Although data warehouses vary in overall design, the majority of them are subject oriented, meaning that the stored information is connected to objects or events that occur in reality. The data provided by the data warehouse for analysis provides information on a specific subject, rather than on the functions of the company, and is collected from varying sources into one time-variant unit.
A data warehouse has significantly different features from other enterprise-wide systems, particularly in how data is stored, managed and manipulated. There are four key characteristics which separate the data warehouse from other major operational systems:
1. Subject Orientation: Data organized by subject
2. Integration: Consistency of defining parameters
3. Non-volatility: Stable data storage medium
4. Time-variance: Timeliness of data and access terms

2. a) What are the different tiers in a typical 3-tier data warehousing architecture?
b) What do you mean by warehousing schema? Explain. [WBUT 2009]
Answer:
a) The bottom tier is a warehouse database server, which is almost always a relational database system. The middle tier is an OLAP server, which is typically implemented using either (1) a Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a Multidimensional OLAP (MOLAP) model, i.e., a special purpose server that directly implements multidimensional data and operations. The top tier is a client, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
b) The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities or objects and the relationships between them. Such a data model is appropriate for online transaction processing. Data warehouses, however, require a concise, subject-oriented schema which facilitates on-line data analysis.

3. a) Introduce the concept of data mining and cite two application areas. [WBUT 2010, 2012]
b) What are the different steps of a data mining task? [WBUT 2010, 2012]
c) Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into 3 clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
The distance function is Euclidean distance. Initially we assign A1, B2 and C1 as the centre of each cluster. Use the k-means algorithm to determine the three clusters. [WBUT 2010, 2013, 2014, 2016, 2018]
Answer:
a) 1st Part: Refer to Question No. 2 (1st Part) of Short Answer Type Questions.
2nd Part: An application of data mining is market segmentation. With market segmentation, a company will be able to find behaviors that are common among its customers. One can look for patterns among customers that seem to purchase the same products at the same time. Another application of data mining is called customer churn. Customer churn allows a company to estimate which customers are the most likely to stop purchasing its products or services and go to one of its competitors.
b) The following steps are usually followed in data mining. These steps are iterative, with the process moving backward whenever needed.
1. Develop an understanding of the application, of the relevant prior knowledge, and of the end user's goals.
2. Create a target data set to be used for discovery.
3. Clean and preprocess data (including handling missing data fields, noise in the data, accounting for time series, and known changes).
4. Reduce the number of variables and find invariant representations of data if possible.
5. Choose the data-mining task (classification, regression, clustering, etc.).
6. Choose the data-mining algorithm.
7. Search for patterns of interest (this is the actual data mining).
8. Interpret the pattern mined. If necessary, iterate through any of steps 1 through 7.
9. Consolidate knowledge discovered and prepare a report.
c) The given eight points were:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
The points A1, B2 and C1 were assumed to be the initial three cluster centres. There are two possible ways of clustering the given points using the k-means algorithm, depending on which cluster we assign the point B1 to (a code sketch that reproduces this computation follows the solution).
The first way:
1.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster whose centre is nearest.
=> The clustering after the first round is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
1.2. Round Two: no changes.
The second way:
2.1. Round One:
- Compute for each of the eight points its distance from each cluster centre, and assign it to the cluster whose centre is nearest.
=> The clustering after the first round is:
Cluster 1: A1, C2
Cluster 2: B2, A3, B1, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3, 9.5)
The new centre of Cluster 2: (6.5, 5.25)
The new centre of Cluster 3: (1.5, 3.5)
2.2. Round Two:
- Recompute for each of the eight points its distance from each cluster centre, and assign it to the cluster whose centre is nearest.
=> The clustering after the second round is:
Cluster 1: A1, B1, C2
Cluster 2: B2, A3, B3
Cluster 3: C1, A2
- Update the cluster centres:
The new centre of Cluster 1: (3.67, 9)
The new centre of Cluster 2: (7, 4.33)
The new centre of Cluster 3: (1.5, 3.5)
2.3. Round Three: no changes.
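The round-by-round arithmetic above can be checked mechanically. The following is an illustrative Python sketch of plain k-means (our own code, not part of the university answer); min() breaks the B1 tie toward the lower cluster index, which reproduces "the first way":

    import math

    points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
              "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
    centres = [(2, 10), (7, 5), (1, 2)]  # initial centres A1, B2, C1

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    while True:
        # Assignment step: each point goes to its nearest centre.
        clusters = [[] for _ in centres]
        for name, p in points.items():
            i = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
            clusters[i].append(name)
        # Update step: move each centre to the mean of its cluster.
        new = [tuple(sum(points[n][d] for n in c) / len(c) for d in (0, 1))
               for c in clusters]
        if new == centres:   # converged: centres no longer move
            break
        centres = new

    print(clusters)  # [['A1', 'B1', 'C2'], ['A3', 'B2', 'B3'], ['A2', 'C1']]
    print([tuple(round(x, 2) for x in c) for c in centres])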
4. a) Write down the design steps for a typical data warehouse. [WBUT 2015]
b) Explain the flowchart for the KDD process.
c) Define the Roll-up and Drill-down processes with a suitable example.
Answer:
a) In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area — The extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
2. Transforming the transactional data — The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e., conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
3. Loading the transformed data into a dimensional database — During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. The referential integrity needs to be maintained by the ETL tool to ensure consistency.
b) An outline of the steps of the KDD process:
The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:
1. Developing an understanding of
• the application domain
• the relevant prior knowledge
• the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing
• Removal of noise or outliers.
• Collecting necessary information to model or account for noise.
• Strategies for handling missing data fields.
• Accounting for time sequence information and known changes.
4. Data reduction and projection
• Finding useful features to represent the data depending on the goal of the task.
• Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task
• Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s)
• Selecting method(s) to be used for searching for patterns in the data.
• Deciding which models and parameters may be appropriate.
• Matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining: searching for patterns of interest in a particular representational form, or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
(Figure: flowchart of the KDD process, from target data selection through preprocessing and mining to knowledge.)
c) Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
For example, consider a cube whose location concept hierarchy is "street < city < province < country":
• Roll-up is performed by climbing up the concept hierarchy for the dimension location. On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
(Figure: drill-down on the time dimension, for item types Mobile, Modem, Phone and Security.)
• Drill-down is performed by stepping down a concept hierarchy for the dimension time. If the concept hierarchy is "day < month < quarter < year", on drilling down the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed, one or more dimensions are added to the data cube.
• It navigates from less detailed data to highly detailed data.
(A small code sketch of these two operations follows.)
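As a concrete illustration (our own, using pandas rather than an OLAP server; the column names and figures are assumed), rolling up and drilling down are essentially group-by operations at different levels of a hierarchy:

    import pandas as pd

    # A tiny fact table with a location hierarchy (city -> country).
    sales = pd.DataFrame({
        "country": ["India", "India", "India", "Canada", "Canada"],
        "city":    ["Kolkata", "Kolkata", "Delhi", "Toronto", "Vancouver"],
        "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
        "amount":  [100, 150, 200, 120, 80],
    })

    # Roll-up: aggregate away the city level, grouping by country only.
    rollup = sales.groupby("country")["amount"].sum()
    print(rollup)       # one row per country

    # Drill-down: reintroduce finer levels (country -> city -> quarter).
    drilldown = sales.groupby(["country", "city", "quarter"])["amount"].sum()
    print(drilldown)    # one row per (country, city, quarter)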
5. Write short notes on the following:
a) Strategic information
b) Data Mart
c) Issues and challenges in data mining [WBUT 2008]
d) Text mining
e) DBMS vis-a-vis Data Mining
Answer:
a) Strategic information:
Strategic information is needed for day-to-day operations and meeting government requirements, as well as for long and short range planning. Strategic information systems are used by organizations to achieve a competitive advantage. A Strategic Information System (SIS) is a system that helps companies change or otherwise alter their business strategy and/or structure. It is typically utilized to streamline and quicken the reaction time to environmental changes and to aid the organization in achieving a competitive advantage.
Strategic information systems rely on data gathered at all organizational levels and from all information systems, whether transaction processing systems, management information systems, decision support systems or executive information systems.
Key features of Strategic Information Systems are the following:
i) Decision support systems that enable the development of a strategic approach to align Information Systems (IS) or Information Technologies (IT) with an organization's business strategies.
ii) Primarily enterprise resource planning solutions that integrate/link the business processes to meet the enterprise objectives for the optimization of the enterprise resources.
iii) Database systems with "data mining" capabilities to make the best use of available corporate information for marketing, production, promotion and innovation. SIS systems also facilitate identification of data collection strategies to help optimize database marketing opportunities.
iv) Real-time information systems that intend to maintain a rapid response and quality indicators.
b) Data Mart: Refer to Question No. 1 of Short Answer Type Questions.
c) Issues and challenges in data mining:
One of the key issues raised by data mining technology is not a business or technological one, but a social one: the issue of individual privacy. Data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals' buying habits and preferences.
Another issue is that of data integrity. Clearly, data analysis can only be as good as the data that is being analyzed. A key implementation challenge is integrating conflicting or redundant data from different sources. For example, a bank may maintain credit card accounts on several different databases. The addresses (or even the names) of a single cardholder may be different in each. Software must translate data from one system to another and select the address most recently entered.
A hotly debated technical issue is whether it is better to set up a relational database structure or a multidimensional one. In a relational structure, data is stored in tables, permitting ad hoc queries. In a multidimensional structure, on the other hand, sets of cubes are arranged in arrays, with subsets created according to category. While multidimensional structures facilitate multidimensional data mining, relational structures thus far have performed better in client/server environments. And, with the explosion of the Internet, the world is becoming one big client/server environment.
Finally, there is the issue of cost.
While system hardware costs have dropped dramatically within the past five years, data mining and data warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the utility of the information being gleaned from the data, and the greater the pressure to increase the amount of data being collected and maintained, which increases the pressure for faster, more powerful data mining queries. This increases pressure for larger, faster systems, which are more expensive.
d) Text mining:
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation. In text mining, the goal is to discover heretofore unknown information — something that no one yet knows and so could not have yet written down.
The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can "read" text and will not have such for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.
e) DBMS vis-a-vis Data Mining:
DBMS and data mining answer different queries. Data mining helps in predicting the future, whereas a DBMS gives reports.
Example DBMS reports:
• Last month's sales for each service type
• Sales per service grouped by customer sex or age bracket
• List of customers who lapsed their policy
Questions answered using data mining:
• What characteristics do customers that lapse their policy have in common, and how do they differ from customers who renew their policy?
• Which motor insurance policy holders would be potential customers for my House Content Insurance policy?

6. What are data mining primitives? [MODEL QUESTION]
Answer:
(Figure: data mining primitives — project, attribute, weight, table, scan — layered on the SQL select primitive and used by data mining algorithms.)
Contrary to current belief, data mining is first and foremost a server technology. Until recently, most data mining tools operated on flat files and in-memory databases. With the shift of attention to larger databases, the need for implementing data mining algorithms on top of relational databases has been recognized. However, current relational technology, with its emphasis on efficient updates and exact identification of records, is ill-suited to the support of data mining algorithms. Data mining algorithms require:
• efficient sampling methods
• storage of intermediate results during execution of queries
• capacity to handle large amounts of data
• geometric indexes that allow users to search in the neighborhood of a record, and bit-mapped operations.
Data mining is supported by client/server technology, with the server technology being of great importance when working with large databases. All these functions can in principle be realized on top of existing implementations, as they are based on generalizations of the same mathematical structure, that is, first-order logic and algebra.
Can these ideas be brought into a general framework for the revision of the relational model? The figure above describes a number of fundamental data mining routines in relation to SQL primitives. In general, data mining routines can be divided into four groups:
1. Pre-processing: Routines in this group take samples, compress attribute values into clusters, and code tables into appropriate input formats for algorithms (that is, transform historical files into time series, multi-valued attributes into sets of binary attributes, tables into bitmaps, etc.).
2. Nearest neighbor search: Neighbor search can be interpreted as a generalization of exact identification of records by means of traditional keys.
3. N table scan algorithms: This is a class of algorithms that need to access a table several times during execution. It is not yet clear whether it is possible to speed up the database performance of these algorithms, although enriched user-defined functions may prove to be of help here.
4. Zoom scan algorithms: This is a group of algorithms that also access a table repeatedly, but in this case it is clear that an adaptation of relational technology could improve the speed of execution dramatically. These algorithms zoom into interesting sub-parts of the database during their execution. This basically involves the repeated execution of SQL statements that access similar or overlapping selections of records.
This short overview clearly indicates that it is already possible to draw a rough outline of an enriched relational model. A number of details still have to be worked out, however, both from a formal perspective (neighborhood topologies, tabular projections, zoom scan queries, etc.) and from an implementational view (that is, how to implement this model on a parallel database server capable of handling billions of records). Data mining and data warehousing represent radical technological advances which will necessarily alter our fundamental approach to database technology.

7. Explain scalable methods for mining sequential patterns. [MODEL QUESTION]
Answer:
Sequential pattern mining is computationally challenging because such mining may generate and/or test a combinatorially explosive number of intermediate subsequences. How can we develop efficient and scalable methods for sequential pattern mining?
Recent developments have made progress in two directions: (1) efficient methods for mining the full set of sequential patterns, and (2) efficient methods for mining only the set of closed sequential patterns, where a sequential pattern s is closed if there exists no sequential pattern s' such that s' is a proper super-sequence of s and s' has the same (frequency) support as s. Because all of the subsequences of a frequent sequence are also frequent, mining the set of closed sequential patterns may avoid the generation of unnecessary subsequences and thus lead to more compact results as well as more efficient methods than mining the full set. We will first examine methods for mining the full set and then study how they can be extended for mining the closed set. In addition, we discuss modifications for mining multilevel, multidimensional sequential patterns (i.e., with multiple levels of granularity).
We can give examples of three such approaches for sequential pattern mining, represented by the algorithms GSP, SPADE, and PrefixSpan, respectively.
GSP adopts a candidate generate-and-test approach using horizontal data format (where the data are represented as ⟨sequence_ID : sequence_of_itemsets⟩, as usual, where each itemset is an event). SPADE adopts a candidate generate-and-test approach using vertical data format (where the data are represented as ⟨itemset : (sequence_ID, event_ID)⟩). The vertical data format can be obtained by transforming from a horizontally formatted sequence database in just one scan. PrefixSpan is a pattern growth method, which does not require candidate generation.
All three approaches either directly or indirectly explore the Apriori property, stated as follows: every nonempty subsequence of a sequential pattern is a sequential pattern. (Recall that for a pattern to be called sequential, it must be frequent; that is, it must satisfy minimum support.) The Apriori property is antimonotonic (or downward-closed) in that, if a sequence cannot pass a test (e.g., regarding minimum support), all of its super-sequences will also fail the test. Use of this property to prune the search space can help make the discovery of sequential patterns more efficient.

CLASSIFICATION AND PREDICTION OF DATA WAREHOUSING

Multiple Choice Type Questions

1. Which of the following techniques are appropriate for data warehousing? [WBUT 2009, 2013, 2018]
a) Hashing on primary keys
b) Indexing on foreign keys of the fact table
c) Bit-map indexing
d) Join indexing
Answer: (c)

2. ...... is an example of predictive type of data mining whereas ...... is an example of descriptive type of data mining. [WBUT 2010, 2012]
a) Association Rule, Clustering
b) Association Rule, Classification
c) Classification, Clustering
d) Clustering, Classification
Answer: (c)

3. What is Metadata? [WBUT 2017]
a) Summarized data
b) Operational data
c) Data about data
d) None of these
Answer: (c)

4. The full form of OLAP is [WBUT 2017]
a) Online Analytical Processing
b) Online Advanced Processing
c) Online Advanced Preparation
d) Online Analytical Performance
Answer: (a)

5. The apriori algorithm is a [WBUT 2017]
a) top-down search
b) breadth first search
c) depth first search
d) bottom-up search
Answer: (d)

6. Classification rules are extracted from [WBUT 2017]
a) Root node
b) Decision tree
c) Siblings
d) Branches
Answer: (b)

7. Which of the following is a predictive model? [WBUT 2017]
a) Clustering
b) Regression
c) Summarization
d) Association rules
Answer: (b)

8. All sets of items whose support is greater than the user-specified minimum support are called [WBUT 2017]
a) Border set
b) Frequent set
c) Maximal frequent set
d) Lattice
Answer: (b)

Short Answer Type Questions

1. Explain the use of the Dynamic Itemset Counting algorithm. [WBUT 2009]
OR,
Describe the principle of the Dynamic Itemset Counting technique for frequent itemset generation. [WBUT 2010]
Answer:
Dynamic Itemset Counting (DIC) is an alternative to Apriori itemset generation. Here, itemsets are dynamically added and deleted as transactions are read. It relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so we only examine those itemsets whose subsets are all frequent. The algorithm stops after every M transactions to add more itemsets.
Train analogy: There are stations every M transactions. The passengers are itemsets. Itemsets can get on at any stop as long as they get off at the same stop in the next pass around the database. Only itemsets on the train are counted when they occur in transactions.
At the very beginning we can start counting 1-itemsets; at the first station we can start counting some of the 2-itemsets. At the second station we can start counting 3-itemsets as well as any more 2-itemsets that can be counted, and so on.
Itemsets are marked in four different ways as they are counted:
• Solid box: confirmed frequent itemset — an itemset we have finished counting that exceeds the support threshold minsupp.
• Solid circle: confirmed infrequent itemset — an itemset we have finished counting that is below minsupp.
• Dashed box: suspected frequent itemset — an itemset we are still counting that exceeds minsupp.
• Dashed circle: suspected infrequent itemset — an itemset we are still counting that is below minsupp.
DIC Algorithm:
1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with dashed circles. Leave all other itemsets unmarked.
2. While any dashed itemsets remain:
a) Read M transactions (if we reach the end of the transaction file, continue from the beginning). For each transaction, increment the respective counters for the itemsets that appear in the transaction and are marked with dashes.
b) If a dashed circle's count exceeds minsupp, turn it into a dashed square. If any immediate superset of it has all of its subsets as solid or dashed squares, add a new counter for it and make it a dashed circle.
Once a dashed itemset has been counted through all the transactions, make it solid and stop counting it.

2. State the Apriori Algorithm for frequent itemset generation. [WBUT 2019]
Answer:
The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Key concepts:
• Frequent itemsets: the sets of items which have minimum support (denoted by Li for the i-th itemset).
• Apriori property: any subset of a frequent itemset must be frequent.
• Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Algorithm:
• Find the frequent itemsets: the sets of items that have minimum support.
• A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
• Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
• Use the frequent itemsets to generate association rules.
Pseudo-code:
• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != empty; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return the union of all Lk;

(An illustrative Python version is sketched below.)
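A compact, runnable Python rendering of the pseudo-code above (our own sketch, not the textbook's; here min_support is a count, and candidates are generated by joining frequent (k-1)-itemsets and pruned by the Apriori property):

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return all itemsets whose support count is >= min_support."""
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        current = {frozenset([i]) for i in items}   # C1
        frequent = {}
        k = 1
        while current:
            # Count the support of each candidate in one pass over the data.
            counts = {c: sum(c <= t for t in transactions) for c in current}
            level = {c: n for c, n in counts.items() if n >= min_support}
            frequent.update(level)
            # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates;
            # prune any candidate with an infrequent k-subset (Apriori property).
            current = set()
            for a, b in combinations(list(level), 2):
                cand = a | b
                if len(cand) == k + 1 and all(
                        frozenset(s) in level for s in combinations(cand, k)):
                    current.add(cand)
            k += 1
        return frequent

    db = [{"milk", "bread"}, {"bread", "butter"}, {"beer"},
          {"milk", "bread", "butter"}, {"bread", "butter"}]
    print(apriori(db, min_support=2))
    # e.g. {bread}:4, {milk}:2, {butter}:3, {bread,butter}:3, {milk,bread}:2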
3. Describe the principle of the partitioning technique for frequent itemset generation and justify how it improves the efficiency of frequent itemset generation compared to the Apriori algorithm. [WBUT 2010, 2012]
Answer:
A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup times the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.
The Apriori algorithm needs many database scans, and for each scan frequent itemsets are searched by pattern matching, which is especially time-consuming for large frequent itemsets with long patterns. The Partition algorithm uses the intersection of transaction ids (tids) to determine the support count. Because it performs a breadth-first search, it partitions the database into blocks in which local frequent itemsets are found, to overcome memory limitations. An extra database scan is needed to determine the global frequency of the local frequent itemsets.

4. What is clustering? Discuss two main methods of clustering. [WBUT 2010]
Answer:
1st Part: Refer to Question No. 1(a) (1st Part) of Long Answer Type Questions.
2nd Part: The non-hierarchical techniques in general are faster to create from the historical database but require that the user make some decision about the number of clusters desired or the minimum "nearness" required for two records to be within the same cluster. These non-hierarchical techniques are often run multiple times, starting off with some arbitrary or even random clustering and then iteratively improving the clustering by shuffling some records around. These techniques sometimes create clusters with only one pass through the database, adding records to existing clusters when suitable ones exist and creating new clusters when no existing cluster is a good candidate for the given record. Because the definition of which clusters are formed can depend on these initial choices of which starting clusters should be chosen, or even how many, these techniques can be less repeatable than the hierarchical techniques and can sometimes create either too many or too few clusters, because the number of clusters is predetermined by the user, not determined solely by the patterns inherent in the database.

5. Suppose that the data mining task is to cluster the following ten points (with (x, y) representing location) into two clusters:
X1 (2,6), X2 (3,4), X3 (3,8), X4 (4,7), X5 (6,2), X6 (6,4), X7 (7,3), X8 (7,4), X9 (8,5), X10 (7,6)
The distance function is defined as |xi - xj| + |yi - yj|. Use the k-means or k-medoids algorithm to determine the two clusters. [WBUT 2012]
Answer:
Step 1: Initialize k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as medoids. The cost (distance) of each remaining point to each medoid is:

    Point   (x,y)   Cost to c1 (3,4)   Cost to c2 (7,4)
    X1      (2,6)         3                  7
    X3      (3,8)         4                  8
    X4      (4,7)         4                  6
    X5      (6,2)         5                  3
    X6      (6,4)         3                  1
    X7      (7,3)         5                  1
    X9      (8,5)         6                  2
    X10     (7,6)         6                  2

Each point is assigned to the nearer medoid (X2 and X8 are the medoids themselves). Then the clusters become:
Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one cluster, whilst the remaining points form the other cluster. The total cost involved is 20, where the cost between any two points is found using the formula
cost(x, c) = sum over i = 1..d of |xi - ci|
where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2.
Total cost is the summation of the costs of the data objects from the medoids of their clusters, so here:
Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))} + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
= (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Step 2: Select one of the non-medoids, O'. Let us assume O' = (7,3), so now the medoids are c1 = (3,4) and O' = (7,3). If c1 and O' are the new medoids, calculate the total cost involved using the formula in Step 1:

    Point   (x,y)   Cost to c1 (3,4)   Cost to O' (7,3)
    X1      (2,6)         3                  8
    X3      (3,8)         4                  9
    X4      (4,7)         4                  7
    X5      (6,2)         5                  2
    X6      (6,4)         3                  2
    X8      (7,4)         4                  1
    X9      (8,5)         6                  3
    X10     (7,6)         6                  3

Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
So the cost of swapping medoid c2 for O' is:
S = current total cost - past total cost = 22 - 20 = 2 > 0
So moving to O' would be a bad idea, and the previous choice was good. We try the other non-medoids similarly and find that our first choice was the best. So the configuration does not change and the algorithm terminates here (i.e., there is no change in the medoids). (A short sketch verifying these totals follows.)
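The two totals (20 and 22) can be verified mechanically. A small illustrative Python check (our own code, using the Manhattan cost from Step 1):

    def cost(x, c):
        """Manhattan distance between a point and a medoid."""
        return sum(abs(xi - ci) for xi, ci in zip(x, c))

    points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
              (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

    def total_cost(medoids):
        # Each point contributes its distance to the nearest medoid.
        return sum(min(cost(p, m) for m in medoids) for p in points)

    print(total_cost([(3, 4), (7, 4)]))  # 20 — original medoids
    print(total_cost([(3, 4), (7, 3)]))  # 22 — after the trial swap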
6. What is metadata in data warehousing? What is a metadata catalog? Discuss the different categories of metadata used in a data warehouse. [WBUT 2014]
OR,
a) What is metadata? [WBUT 2015]
b) What is the typical content of metadata of a data warehouse? [WBUT 2015]
OR,
What is Metadata? Explain different types of Metadata. [WBUT 2017]
Answer:
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. In terms of a data warehouse, we can define metadata as follows:
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.
A collection of metadata records combined with data management and search tools forms a data catalogue. The architecture of these data catalogues can vary from tightly held internal data indexes to widely accessible distributed catalogues spread across the Internet.
Metadata can be broadly categorized into three categories:
• Business Metadata: It has the data ownership information, business definition, and changing policies.
• Technical Metadata: It includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
• Operational Metadata: It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data as migrated and the transformations applied to it.

Long Answer Type Questions

1. a) What is clustering? Why is it difficult to handle categorical data for clustering? Distinguish between partition clustering and hierarchical clustering. [WBUT 2009]
OR,
What is clustering? What are the features of a good cluster? What do you mean by hierarchical clustering technique? [WBUT 2013, 2014, 2016, 2018]
Answer:
1st Part:
Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. There are two main types of clustering techniques: those that create a hierarchy of clusters and those that do not. The hierarchical clustering techniques create a hierarchy of clusters from small to big. With a hierarchy of clusters defined, it is possible to choose the number of clusters that are desired. At the extreme it is possible to have as many clusters as there are records in the database. In this case the records within each cluster are optimally similar to each other (since there is only one) and certainly different from the other clusters. The advantage of hierarchical clustering methods is that they allow the end user to choose from either many clusters or only a few.
2nd Part:
Since the process of grouping a set of physical or abstract objects into classes of similar objects is clustering, the similarity metric is important here because it is also used for outlier detection. A main-memory-based clustering algorithm can operate only on the following two data structures:
a) Data matrix
b) Dissimilarity matrix
So it is difficult to handle categorical data.
In partition clustering, every data sample is initially assigned to a cluster in some (possibly random) way. Samples are then iteratively transferred from cluster to cluster until some criterion function is minimized. Once the process is complete, the samples will have been partitioned into separate compact clusters. Examples of partition clustering methods are k-means and Lloyd's method.
In hierarchical clustering, each sample is initially considered a member of its own cluster, after which clusters are recursively combined in pairs according to some predetermined condition until eventually every point belongs to a single cluster. The resulting hierarchical structure may be represented by a binary tree or "dendrogram", from which the desired clusters may be extracted. Examples of hierarchical clustering methods are the single-link, Ward's, centroid, complete-link, group average, median, and parametric Lance-Williams methods.
b) Describe the working of the PAM algorithm. Compare its performance with CLARA and CLARANS. [WBUT 2009]
Answer:
The PAM k-medoids clustering algorithm evaluates a set of k objects considered to be representative objects (medoids) of k clusters within T objects, such that each non-selected object is clustered with the medoid to which it is most similar (i.e., closest in terms of the provided distance metric). The process operates by iteratively swapping one of the medoids with one of the objects, such that the total distance between non-selected objects and their medoids is reduced. The algorithm can be depicted as follows (an illustrative implementation is sketched after the complexity discussion below):
Step 1: Initialization — choose k medoids from the T objects randomly.
Step 2: Evaluation — calculate the cost D' - D for each swap of one medoid with one object, where D is the total distance before the swap and D' is the total distance after the swap.
Step 3: Selection — accept the swap with the best cost; if the cost is negative, go to Step 2; otherwise record the medoids and terminate the program.
The computational complexity of the PAM algorithm is O((1 + B)k(T - k)^2) based on the number of partitions per object, where B is the number of successful swaps. It can also be expressed as O'((1 + B)k^2(T - k)^2) based on the number of distance calculations, i.e., one partition per object is equivalent to k distance calculations. Clearly, this is time consuming even for a moderate number of objects and a small number of medoids.
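The following is a minimal illustrative PAM-style sketch in Python (our own code, not the textbook's). It greedily accepts cost-reducing medoid/object swaps until none remains; Manhattan distance is assumed, matching the worked example in Question 5. Note that, unlike the hand computation, it tries every possible swap, so it may settle on a cheaper medoid pair than the sample answer:

    import itertools

    def manhattan(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    def total_cost(objects, medoids):
        # Each object pays the distance to its nearest medoid.
        return sum(min(manhattan(o, m) for m in medoids) for o in objects)

    def pam(objects, medoids):
        """Greedy PAM: keep applying cost-reducing medoid/object swaps."""
        medoids = list(medoids)
        best = total_cost(objects, medoids)
        improved = True
        while improved:                      # Steps 2-3: evaluate, select
            improved = False
            for i, h in itertools.product(range(len(medoids)), objects):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                c = total_cost(objects, trial)
                if c < best:                 # accept an improving swap
                    best, medoids, improved = c, trial, True
        return medoids, best

    pts = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
           (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
    medoids, cost = pam(pts, [(3, 4), (7, 4)])  # Step 1 choice from Question 5
    print(medoids, cost)                         # locally optimal swap result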
The computational complexity of the CLARA algorithm is O(q(ks^2 + (T - k)) + Bks^2) based on the number of partitions per object, or O'(q(k^2 s^2 + k(T - k)) + Bk^2 s^2) based on the number of distance calculations, where q, s, k, B and T are the number of samples, the object size per sample, the number of medoids, the number of successful swaps for all samples tested, and the total number of objects, respectively. Clearly, the CLARA algorithm can deal with a larger number of objects than can the PAM algorithm if s << T.
The computational complexity of CLARANS is O((B + numlocal)(T - k)) based on the number of partitions per object, or O'((B + numlocal)k(T - k)) based on the number of distance calculations, where B is the number of test moves between nodes.

2. a) What is clustering? Why is it required for data warehousing and data mining? What are the differences between partitioning clustering and hierarchical clustering?
b) Suppose the following table shows the distances between different cities. Construct a hierarchical tree using the hierarchical clustering method.

          BA    FI    MI    NA    RM    TO
    BA     0   662   877   255   412   996
    FI   662     0   295   468   268   400
    MI   877   295     0   754   564   138
    NA   255   468   754     0   219   869
    RM   412   268   564   219     0   669
    TO   996   400   138   869   669     0

c) Discuss the K-means algorithm with a suitable example. [WBUT 2017]
Answer:
a) 1st Part: Refer to Question No. 1(a) (1st Part) of Long Answer Type Questions.
2nd Part: Clustering is required for data warehousing and data mining for the following reasons:
• Scalability — We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes — Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical and binary data.
• Discovery of clusters with arbitrary shape — The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to only distance measures that tend to find spherical clusters of small sizes.
• High dimensionality — The clustering algorithm should not only be able to handle low-dimensional data but also high dimensional spaces.
• Ability to deal with noisy data — Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability — The clustering results should be interpretable, comprehensible, and usable.
3rd Part: Refer to Question No. 1(a) (2nd Part) of Long Answer Type Questions.
b) The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single-link clustering, the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on. After merging MI with TO we obtain the following matrix:

             BA    FI   MI/TO   NA    RM
    BA        0   662    877   255   412
    FI      662     0    295   468   268
    MI/TO   877   295      0   754   564
    NA      255   468    754     0   219
    RM      412   268    564   219     0

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219, m = 2

             BA    FI   MI/TO   NA/RM
    BA        0   662    877     255
    FI      662     0    295     268
    MI/TO   877   295      0     564
    NA/RM   255   268    564       0

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255, m = 3

                BA/NA/RM    FI   MI/TO
    BA/NA/RM        0      268    564
    FI            268        0    295
    MI/TO         564      295      0

min d(i,j) = d(BA/NA/RM, FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268, m = 4

                   BA/FI/NA/RM   MI/TO
    BA/FI/NA/RM         0         295
    MI/TO             295           0

Finally, we merge the last two clusters at level 295. The process is summarized by a hierarchical tree (dendrogram) over BA, NA, RM, FI, MI, TO. (A code sketch of this single-link procedure follows.)
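The merge sequence above can be reproduced with a short single-link agglomeration loop. This is an illustrative Python sketch (our own; the dictionary-based distance matrix is our own encoding of the question's table):

    cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
    D = {  # symmetric distance matrix from the question
        ("BA","FI"):662, ("BA","MI"):877, ("BA","NA"):255, ("BA","RM"):412,
        ("BA","TO"):996, ("FI","MI"):295, ("FI","NA"):468, ("FI","RM"):268,
        ("FI","TO"):400, ("MI","NA"):754, ("MI","RM"):564, ("MI","TO"):138,
        ("NA","RM"):219, ("NA","TO"):869, ("RM","TO"):669,
    }

    def dist(a, b):  # look up an unordered city pair
        return D[(a, b)] if (a, b) in D else D[(b, a)]

    def link(c1, c2):
        # Single link: cluster distance = min distance over member pairs.
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [(c,) for c in cities]
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        level = link(clusters[i], clusters[j])
        print(f"merge {clusters[i]} + {clusters[j]} at level {level}")
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # merges at 138 (MI,TO), 219 (NA,RM), 255 (BA + NA/RM), 268 (+FI), 295 (all)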
min d(i,j) = d(NA, RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219, m = 2

             BA  |  FI  | MI/TO | NA/RM
    BA    |   0  |  662 |  877  |  255
    FI    |  662 |   0  |  295  |  268
    MI/TO |  877 |  295 |   0   |  564
    NA/RM |  255 |  268 |  564  |   0

min d(i,j) = d(BA, NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255, m = 3

                BA/NA/RM |  FI  | MI/TO
    BA/NA/RM |     0     |  268 |  564
    FI       |    268    |   0  |  295
    MI/TO    |    564    |  295 |   0

min d(i,j) = d(BA/NA/RM, FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268, m = 4

                   BA/FI/NA/RM | MI/TO
    BA/FI/NA/RM |       0      |  295
    MI/TO       |      295     |   0

Finally, we merge the last two clusters at level 295. The process is summarized by the following hierarchical tree:

[Dendrogram with leaves BA, NA, RM, FI, MI, TO]

c) K-means is an unsupervised learning algorithm that solves the clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The algorithm is composed of the following steps:
1. Place k points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
Suppose that we have n sample feature vectors x1, x2, ..., xn, all from the same class, and we know that they fall into k compact clusters, k < n.

confidence({C, E} → A) = σ(A, C, E) / σ(C, E) = 3/4 = 0.75

4. Write short notes on the following:
a) Dimensional Modeling [WBUT 2018]
b) Generalized Association Rule [WBUT 2010]
c) PAM Clustering Technique [WBUT 2010]
d) CLARANS clustering algorithm vis-à-vis PAM and CLARA
e) K-Medoid Algorithm [WBUT 2017]
Answer:
a) Dimensional Modeling:
A dimensional model is a data structure technique optimized for data warehousing tools. The concept of dimensional modelling was developed by Ralph Kimball and is comprised of "fact" and "dimension" tables. A dimensional model is designed to read, summarize and analyze numeric information such as values, balances, counts, weights, etc. in a data warehouse. In contrast, relational models are optimized for addition, updating and deletion of data in a real-time online transaction system.
These dimensional and relational models have their own unique ways of data storage, each with specific advantages. For instance, in the relational model, normalization and ER models reduce redundancy in data. On the contrary, a dimensional model arranges data in such a way that it is easier to retrieve information and generate reports. Hence, dimensional models are used in data warehouse systems and are not a good fit for relational systems.

b) Generalized Association Rule:
In most real-life applications, taxonomies (is-a hierarchies) over the items are available. In general, a taxonomy can be represented as a directed acyclic graph (DAG). Given a set of transactions T, each of which is a set of items, and a taxonomy Tax, the problem of mining generalized association rules is to discover all rules of the form X → Y, with the user-specified minimum support and minimum confidence. X and Y can be sets of items at any level of the taxonomy, such that no item in Y is an ancestor of any item in X. For example, there might be a rule which says that "50% of transactions that contain soft drinks also contain snacks; 5% of all transactions contain both these items".
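A common way to mine such rules is to extend every transaction with the ancestors of its items and then run an ordinary frequent-itemset pass over the extended transactions. The sketch below illustrates the idea; the taxonomy, item names and data are invented for illustration and are not from the original text:

    from itertools import combinations
    from collections import Counter

    # is-a taxonomy stored as child -> parent edges (a DAG in general)
    parent = {"cola": "soft drink", "juice": "soft drink",
              "chips": "snacks", "nuts": "snacks"}

    def ancestors(item):
        found = set()
        while item in parent:
            item = parent[item]
            found.add(item)
        return found

    transactions = [{"cola", "chips"}, {"juice", "nuts"},
                    {"cola", "nuts"}, {"juice", "bread"}]

    # extend each transaction with the ancestors of its items
    extended = [t.union(*(ancestors(i) for i in t)) for t in transactions]

    # count supports of 1- and 2-itemsets over the extended transactions
    counts = Counter()
    for t in extended:
        for size in (1, 2):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1

    n = len(transactions)
    pair = ("snacks", "soft drink")
    print("support =", counts[pair] / n)                           # 0.75 on this toy data
    print("confidence =", counts[pair] / counts[("soft drink",)])  # rule: soft drink -> snacks

Because every extended transaction contains the higher-level items "soft drink" and "snacks", rules at any level of the taxonomy fall out of the same counting pass.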
c) PAM Clustering Technique:
PAM stands for "Partitioning Around Medoids". The algorithm is intended to find a sequence of objects called medoids that are centrally located in clusters. Objects that are tentatively defined as medoids are placed into a set S of selected objects. If O is the set of all objects, then U = O - S is the set of unselected objects. The goal of the algorithm is to minimize the average dissimilarity of objects to their closest selected object. Equivalently, we can minimize the sum of the dissimilarities between each object and its closest selected object.
The algorithm has two phases:
(i) In the first phase, BUILD, a collection of k objects is selected for an initial set S.
(ii) In the second phase, SWAP, one tries to improve the quality of the clustering by exchanging selected objects with unselected objects.

d) CLARANS clustering algorithm vis-à-vis PAM and CLARA:
The medoid search in PAM or CLARA can be abstracted as searching for k subgraphs in a graph of n points, and based on this understanding a PAM-like clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) is defined. While PAM searches the whole graph and CLARA searches some random subgraphs, CLARANS randomly samples a set and selects the k medoids by a hill-climbing search over subgraphs. CLARANS selects the neighbouring objects of medoids as candidates for new medoids, and it samples subsets to verify medoids multiple times to avoid bad samples. Obviously, this repeated sampling for medoid verification is time consuming, which limits CLARANS from clustering very large datasets in an acceptable time period.

e) K-Medoid Algorithm:
Partitioning Around Medoids, or the K-medoids algorithm, is a partitional clustering algorithm which is slightly modified from the K-means algorithm. The basic idea of this algorithm is to first compute the k representative objects, which are called medoids. After finding the set of medoids, each object of the data set is assigned to the nearest medoid; that is, object i is put into cluster v_j when medoid m_j is nearer than any other medoid. The algorithm proceeds in two steps:
BUILD-step: This step sequentially selects k "centrally located" objects, to be used as the initial medoids.
SWAP-step: If the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, then the swap is carried out. This is continued until the objective function can no longer be decreased.
The algorithm is as follows:
1. Initially select k random points as the medoids from the given n data points of the data set.
2. Associate each data point to the closest medoid by using any of the most common distance metrics.
3. For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih. If TC_ih < 0, i is replaced by h.
4. Repeat steps 2-3 until there is no change in the medoids.

MINING TIME SERIES DATA

1. To optimize data warehouse design, which one is done?
[WBUT 2011]
a) Normalization of fact tables and denormalization of dimension tables
b) Normalization of fact tables and dimension tables
c) Denormalization of fact tables and dimension tables
d) Normalization of dimension tables and denormalization of fact tables
Answer: (a)

2. In ............., the value of an attribute is examined as it varies over time. [MODEL QUESTION]
a) Trend Analysis    b) Classification    c) .............    d) Prediction
Answer: (a)

Long Answer Type Questions

1. a) Define decision tree. [WBUT 2009, 2010]
b) What are the advantages and disadvantages of the decision tree approach over other approaches for data mining? [WBUT 2009]
c) Discuss briefly the tree construction principle. [WBUT 2009]
OR,
Describe the basic algorithm for decision tree induction. [WBUT 2013]
d) What is a classification problem? What is the difference between supervised and unsupervised classification? [WBUT 2009]
Answer:
a) Decision trees are powerful and popular tools for classification and prediction. The attractiveness of decision trees is due to the fact that, in contrast to neural networks, decision trees represent rules.
A decision tree is a classifier in the form of a tree structure, where each node is either:
• a leaf node - indicates the value of the target attribute (class) of examples, or
• a decision node - specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} ( |S_v| / |S| ) × Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}). Note that the first term in the equation for Gain is just the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A. The expected entropy described by this second term is simply the sum of the entropies of each subset S_v, weighted by the fraction of examples |S_v|/|S| that belong to S_v. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A. Put another way, Gain(S, A) is the information provided about the target attribute value, given the value of some other attribute A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
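Both Entropy(S) and Gain(S, A) reduce to simple frequency counting. A minimal Python sketch follows; the toy records and attribute names are invented for illustration:

    from math import log2
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum over classes of p_i * log2(p_i)
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(records, attr, target):
        # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
        labels = [r[target] for r in records]
        g = entropy(labels)
        for v in {r[attr] for r in records}:
            sv = [r[target] for r in records if r[attr] == v]
            g -= len(sv) / len(records) * entropy(sv)
        return g

    data = [{"outlook": "sunny", "play": "no"},
            {"outlook": "sunny", "play": "no"},
            {"outlook": "rain",  "play": "yes"},
            {"outlook": "rain",  "play": "yes"}]
    print(gain(data, "outlook", "play"))   # 1.0: outlook separates the classes perfectly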
A decision tree can be constructed top-down using the information gain in the following way:
1. Begin at the root node.
2. Determine the attribute with the highest information gain which is not already used as an ancestor node.
3. Add a child node for each possible value of that attribute.
4. Attach all examples to the child node where the attribute values of the examples are identical to the attribute value attached to the node.
5. If all examples attached to a child node can be classified uniquely, add that classification to the node and mark it as a leaf node.
6. Go back to step 2 if there are unused attributes left; otherwise add the classification of the majority of the examples attached to the child node.

b) Advantages of the Decision Tree model:
• Decision tree methods tend to produce models that are easy to interpret. At each non-terminal node, a decision is based upon just one predictor variable, and this makes it easy to follow. For example, to explain a particular classification one need only look at the series of simple decisions that led to it. The final tree model can in fact be cast into a set of rules one can follow to classify a given case. In comparison, generalized linear models use linear combinations of variables that can be difficult to interpret or explain.
• Tree models make no assumptions about the distribution of the underlying data and they are thus a non-parametric procedure. This can be especially useful if the distribution of the data is indeed unknown, something that happens a lot in practice.
• Decision tree methods are easily able to handle both categorical and continuous variables.
• Decision tree methods have a built-in feature selection method that makes them immune to the presence of useless variables. Such variables are ignored and they do not affect the tree building process. This is a common problem with over-parameterized datasets.
• Tree models are very adept at revealing complex interactions between variables. Each branch of a tree can contain different combinations of variables, and the same variable can appear more than once in different parts of the tree. This can reveal how a variable can depend on another and in what specific context this dependency exists.
• Decision tree models are extremely robust to the effect of outliers. This is so because the models are constructed in a frequency-based technique where one is counting the instances in a split. Outliers that occur in the independent variables do not affect the tree-growing process because the values used to split each node are not likely to be the outlier values. Outliers in the dependent variable go into their own nodes and do not affect the rest of the tree.
• Tree models offer several ways of dealing with missing values that can often minimize or eliminate the effect of such values on model performance. In many other methods the whole set of instances with missing values would be omitted from analysis. This advantage is present both when training the tree model and when applying the model on new data for class prediction.

Disadvantages of Decision Tree Models:
Decision trees are indeed powerful data mining tools, but they are not perfect and are suited only to certain types of problems. The main weaknesses in decision tree modeling are listed below:
• Classification trees can be unstable, and small variations in the data (such as those introduced by randomization) can cause very different-looking trees to be generated. This is especially so in cases where the goodness of splits for the different variables are close to each other in value. In this case, a small variation in the data is enough to influence the picking of one split over another.
  ○ Many of these problems are dealt with by using stratified sampling when creating testing and training datasets. In this case, the training and testing datasets have the same class distribution.
  ○ Such problems are also a symptom of a dataset that is too small. In such cases, it is important to use the entire dataset to train and test the model.
This is the idea behind the cross-validation technique. Since the final model is built on the whole dataset, randomization of this same data will not cause too much instability.
• Some of the tree models generated may be very large and complex. This can make the model difficult to interpret, understand or justify.
• Tree models really shine when applied to classification problems; unfortunately, they are not very good at estimation tasks (as in regression), i.e., those having a numerical dependent variable.
• Decision tree models are computationally expensive to train.

c) Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees. Decision tree programs construct a decision tree T from a set of training cases.

Constructing a Decision Tree using ID3:

function ID3
Input: (R: a set of non-target attributes,
        C: the target attribute,
        S: a training set)
returns a decision tree;
begin
  If S is empty, return a single node with value Failure;
  If S consists of records all with the same value for the target attribute, return a single leaf node with that value;
  If R is empty, then return a single node with the value of the most frequent of the values of the target attribute that are found in records of S; [in that case there may be errors, i.e., examples that will be improperly classified];
  Let A be the attribute with largest Gain(A, S) among attributes in R;
  Let {aj | j = 1, 2, ..., m} be the values of attribute A;
  Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting respectively of records with value aj for A;
  Return a tree with root labeled A and arcs labeled a1, a2, ..., am going respectively to the trees ID3(R - {A}, C, S1), ID3(R - {A}, C, S2), ..., ID3(R - {A}, C, Sm);
  Recursively apply ID3 to subsets {Sj | j = 1, 2, ..., m} until they are empty;
end

d) Classification is the process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to the specific variable(s) one is trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad".
With supervised classification, we identify examples of the information classes (e.g., land cover types) of interest in an image. These are called "training sites". The image processing software system is then used to develop a statistical characterization of the reflectance for each information class. This stage is often called "signature analysis" and may involve developing a characterization as simple as the mean or the range of reflectance on each band, or as complex as detailed analyses of the mean, variances and covariance over all bands. Once a statistical characterization has been achieved for each information class, the image is then classified by examining the reflectance of each pixel and making a decision about which of the signatures it resembles most.
Unsupervised classification is a method which examines a large number of unknown pixels and divides them into a number of classes based on natural groupings present in the image values. Unlike supervised classification, unsupervised classification does not require analyst-specified training data. The basic premise is that values within a given cover type should be close together in the measurement space (i.e., have similar gray levels), whereas data in different classes should be comparatively well separated (i.e., have very different gray levels).

2. a) What is the use of Regression? What may be the reasons for not using the linear regression model to estimate the output data? [WBUT 2016]
Answer:
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships.
Linear Regression Disadvantages:
• By its nature, linear regression only looks at linear relationships between dependent and independent variables. That is, it assumes there is a straight-line relationship between them. Sometimes this is incorrect.
• Linear regression looks at a relationship between the mean of the dependent variable and the independent variables.
• Linear regression is sensitive to outliers.
• Linear regression assumes that the data are independent. This is often, but not always, sensible. Two common cases where it does not make sense are clustering in space and time.

b) How is time series data used in pattern analysis? Give the formula for Pearson's r. [WBUT 2016]
Answer:
Time series data have a temporal order that makes their analysis distinctly different from other data analysis. The goal of time series analysis can be divided into characterization or prediction. There is typically a consistent pattern contaminated with random noise, which requires filtering to aid in identifying the underlying pattern. The pattern itself can be divided into the main trend and a seasonal component.
Pearson's correlation coefficient, when applied to a population, is

    ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)

where:
• cov is the covariance
• σ_X is the standard deviation of X
• σ_Y is the standard deviation of Y

c) Explain Bayesian classification. [WBUT 2016]
Answer:
Bayesian classification is based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as "Naive".
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

    P(c|x) = P(x|c) P(c) / P(x)

and, for predictors x1, ..., xn assumed independent given the class,

    P(c|X) = P(x1|c) × P(x2|c) × ... × P(xn|c) × P(c)

Above,
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
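The factored form above translates almost line for line into counting code. The sketch below is illustrative only (the toy data and helper names are assumptions); it scores each class by P(c) × Π P(xi|c), estimating every probability by relative frequency:

    from collections import Counter, defaultdict

    def train(rows, labels):
        # P(c): class priors; P(x_i|c): per-attribute conditional frequencies
        prior = Counter(labels)
        cond = defaultdict(Counter)
        for row, c in zip(rows, labels):
            for i, v in enumerate(row):
                cond[(c, i)][v] += 1
        return prior, cond, len(labels)

    def classify(x, prior, cond, n):
        best, best_p = None, -1.0
        for c, nc in prior.items():
            p = nc / n                         # P(c)
            for i, v in enumerate(x):
                p *= cond[(c, i)][v] / nc      # P(x_i | c), frequency estimate
            if p > best_p:
                best, best_p = c, p
        return best, best_p

    rows = [("red", "round"), ("red", "round"), ("green", "long"), ("green", "round")]
    labels = ["apple", "apple", "banana", "apple"]
    model = train(rows, labels)
    print(classify(("red", "round"), *model))   # ('apple', 0.5)

In practice a Laplace correction is added to the counts so that a single unseen attribute value does not zero out the whole product.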
3. a) What are the uses of training data set and test data set for a decision tree classification scheme? [WBUT 2010, 2012, 2018]
b) Discuss the principle of the FP-tree Growth algorithm. [WBUT 2010, 2013]
OR,
Discuss the different phases of the FP-tree growth algorithm. [WBUT 2012]
OR,
Define FP-tree. Discuss the method of computing an FP-tree. [WBUT 2018]
Answer:
a) Decision trees need two kinds of data: training and testing.
Training data, which are usually the bigger part of the data, are used for constructing trees. The more training data collected, the higher the accuracy of the results. The other group of data, testing, is used to get the accuracy rate and misclassification rate of the decision tree.

b) The FP-Growth algorithm allows frequent itemset discovery without candidate itemset generation. It is a two-step approach:
Step 1: Build a compact data structure called the FP-tree, built using 2 passes over the data-set.
Step 2: Extract frequent itemsets directly from the FP-tree, by traversal through the FP-tree.
The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database. It consists of:
1) One root labeled as "null", with a set of item-prefix subtrees as children, and a frequent-item-header table;
2) Nodes in the item-prefix subtrees, each consisting of three fields:
a) Item-name: registers which item is represented by the node;
b) Count: the number of transactions represented by the portion of the path reaching the node;
c) Node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3) Entries in the frequent-item-header table, each consisting of two fields:
a) Item-name: the same as in the node;
b) Head of node-link: a pointer to the first node in the FP-tree carrying the item-name.
Additionally, the frequent-item-header table can hold the support count for each item.

Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item. Sort F in support-descending order as FList, the list of frequent items.
2. Create the root of an FP-tree, T, and label it as "null". For each transaction Trans in DB do the following:
• Select the frequent items in Trans and sort them according to the order of FList. Let the sorted frequent-item list in Trans be [p | P], where p is the first element and P is the remaining list. Call insert_tree([p | P], T).
• The function insert_tree([p | P], T) is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, with its count initialized to 1, its parent link linked to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
By using this algorithm, the FP-tree is constructed in two scans of the database. The first scan collects and sorts the set of frequent items; the second constructs the FP-tree. After constructing the FP-tree it is possible to mine it to find the complete set of frequent patterns.

Algorithm 2: FP-Growth
Input: A database DB, represented by an FP-tree constructed according to Algorithm 1, and a minimum support threshold.
Output: The complete set of frequent patterns.
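Algorithm 1 above is straightforward to prototype. The sketch below builds the tree with two passes over a toy transaction list; the class layout, the data, and the use of plain Python lists for the node-links are illustrative assumptions:

    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 1, {}

    def build_fp_tree(transactions, min_sup):
        # Pass 1: collect frequent items, sorted in support-descending order (FList)
        freq = Counter(item for t in transactions for item in t)
        flist = [i for i, c in freq.most_common() if c >= min_sup]
        rank = {item: r for r, item in enumerate(flist)}

        root = Node(None, None)                  # the "null" root
        header = {item: [] for item in flist}    # item-name -> node-links
        # Pass 2: insert each transaction with its frequent items in FList order
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item in node.children:
                    node.children[item].count += 1   # shared prefix: bump the count
                else:
                    child = Node(item, node)
                    node.children[item] = child
                    header[item].append(child)       # extend the node-link chain
                node = node.children[item]
        return root, header

    txns = [{"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}, {"a", "c"}]
    root, header = build_fp_tree(txns, min_sup=2)
    print({i: sum(n.count for n in header[i]) for i in header})   # support per item

Summing the counts along each item's node-links recovers that item's support, which is exactly the role the header table plays in the mining phase.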
