dwm 2
---------------------------------------------------------------------------------------------------------------------------------------------------
Module 1 – Data Warehousing Fundamentals.
Q1 Compare OLTP And OLAP.
Ans.
Comparison of OLAP and OLTP:
1. OLAP stands for Online Analytical Processing; OLTP stands for Online Transaction Processing.
2. OLAP is well known as an online database query management system; OLTP is well known as an online database modifying system.
3. OLAP makes use of a data warehouse; OLTP makes use of a standard database management system.
4. In an OLAP database, tables are not normalized; in an OLTP database, tables are normalized.
5. OLAP stores large amounts of data, typically in TB or PB; OLTP data is relatively small (MB to GB), as historical data is archived.
6. OLAP only needs backups from time to time; in OLTP, the backup and recovery process is maintained rigorously.
7. OLAP data is generally used by executives such as the CEO, MD, and GM; OLTP data is managed by clerks and managers.
8. OLAP is market-oriented; OLTP is customer-oriented.
9. OLAP design is subject-oriented; OLTP design is application-oriented.
10. OLAP improves the efficiency of business analytics; OLTP enhances user productivity.
Module 3 – Classification.
Q1 Explain Decision Tree Based Classification Approach With Example.
Ans.
Decision tree-based classification is a popular machine learning approach used for both predictive modeling and decision
support. The decision tree is a tree-like model where each node represents a decision or a test on an attribute, each branch
represents the outcome of the test, and each leaf node represents the class label or the decision.
Decision Tree-Based Classification Process:
1. Data Collection:
• Collect a dataset with labeled examples. Each example consists of a set of attributes and the corresponding class
label.
2. Data Preprocessing:
• Preprocess the data by handling missing values, encoding categorical variables, and splitting the dataset into
training and testing sets.
3. Decision Tree Construction:
• Use a decision tree algorithm (e.g., ID3, C4.5, CART) to construct the tree. The algorithm selects the best
attribute at each node based on criteria such as information gain or Gini impurity.
4. Decision Tree Training:
• Train the decision tree on the training dataset. The tree is recursively grown by making decisions at each node,
splitting the data based on the selected attribute.
5. Decision Making (Classification):
• Once the decision tree is trained, it can be used to classify new, unseen instances. Starting from the root node,
each instance traverses the tree based on the attribute tests until it reaches a leaf node, which corresponds to the
predicted class label.
Example:
Let's consider a simple example of classifying whether a person will play golf based on weather conditions. The dataset
includes the following attributes: Outlook, Temperature, Humidity, and Wind.
Dataset:
Outlook    Temperature  Humidity  Wind    Play Golf
Sunny      Hot          High      Weak    No
Sunny      Hot          High      Strong  No
Overcast   Hot          High      Weak    Yes
Rainy      Mild         High      Weak    Yes
Rainy      Cool         Normal    Weak    Yes
Rainy      Cool         Normal    Strong  No
Overcast   Cool         Normal    Strong  Yes
Sunny      Mild         High      Weak    No
Sunny      Cool         Normal    Weak    Yes
Rainy      Mild         Normal    Weak    Yes
Sunny      Mild         Normal    Strong  Yes
Overcast   Mild         High      Strong  Yes
Overcast   Hot          Normal    Weak    Yes
Rainy      Mild         High      Strong  No
Decision Tree:
                  Outlook
                /    |    \
           Sunny  Overcast  Rainy
             |       |        |
         Humidity   Yes      Wind
          /    \            /    \
       High   Normal     Weak  Strong
         |      |          |       |
         No    Yes        Yes      No
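To make this concrete, here is a small code sketch (not part of the original answer) that fits a decision tree to the dataset above, assuming scikit-learn and pandas are installed. It uses CART with the entropy criterion to approximate ID3-style information-gain splits; all names are illustrative.

# Sketch: fitting a decision tree to the play-golf dataset above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                    "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                    "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal",
                    "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":        ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                    "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "PlayGolf":    ["No","No","Yes","Yes","Yes","No","Yes",
                    "No","Yes","Yes","Yes","Yes","Yes","No"],
})

# One-hot encode the categorical attributes so the tree can split on them.
X = pd.get_dummies(data.drop(columns="PlayGolf"))
y = data["PlayGolf"]

# CART with entropy approximates the ID3-style information-gain criterion.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Classify a new, unseen day: Sunny outlook, mild, normal humidity, weak wind.
new_day = pd.get_dummies(pd.DataFrame(
    [{"Outlook": "Sunny", "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak"}]
)).reindex(columns=X.columns, fill_value=0)
print(clf.predict(new_day))   # expected: ['Yes'] per the tree above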
Module 4 – Clustering.
Q1 Explain K-Means And K-Medoids Algorithm.
Ans.
K-Means Algorithm:
K-Means is a clustering algorithm that partitions a dataset into K clusters, where each data point belongs to the cluster
with the nearest mean (centroid). The algorithm aims to minimize the sum of squared distances between data points and
their assigned cluster centroids.
Steps Of The K-Means Algorithm:
1. Initialization:
• Randomly select K initial centroids, one for each cluster.
2. Assignment:
• Assign each data point to the cluster whose centroid is the closest (usually using Euclidean distance).
3. Update Centroids:
• Recalculate the centroids as the mean of all data points in each cluster.
4. Repeat:
• Repeat steps 2 and 3 until convergence (when centroids no longer change significantly) or a specified number of
iterations is reached.
5. Output:
• The final clusters and their centroids.
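These steps translate directly into a short NumPy sketch (illustrative code, not a library implementation; the function name and sample data are made up):

# A minimal K-Means sketch in plain NumPy, following the numbered steps above.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assignment: each point joins the cluster of its nearest centroid
        #    (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its members
        #    (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until convergence: stop when centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 5. Output: final cluster labels and centroids.
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5], [8.5, 9.0]])
print(*kmeans(X, k=2), sep="\n")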
K-Medoids Algorithm:
K-Medoids is a variation of K-Means that, instead of using the mean as the centroid, uses the actual data point from the
cluster that minimizes the sum of distances to other points in the cluster. This makes K-Medoids more robust to outliers,
as the medoid is less sensitive to extreme values.
Steps Of The K-Medoids Algorithm:
1. Initialization:
• Randomly select K initial data points as medoids.
2. Assignment:
• For each data point, assign it to the cluster represented by the closest medoid (using a distance metric such as
Euclidean distance).
3. Update Medoids:
• For each cluster, select the data point that minimizes the sum of distances to other points in the cluster as the new
medoid.
4. Repeat:
• Repeat steps 2 and 3 until convergence or a specified number of iterations is reached.
5. Output:
• The final clusters and their medoids.
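A matching sketch for K-Medoids follows, using a simplified update rule (full PAM also evaluates swaps with non-medoid points); the data and names are illustrative:

# Minimal K-Medoids sketch in NumPy, following the numbered steps above.
import numpy as np

def kmedoids(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise distances; medoids are always actual data points, which makes
    # the result less sensitive to extreme values than a mean-based centroid.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # 1. Initialization: pick K data points as the initial medoids.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        # 2. Assignment: each point joins its closest medoid's cluster.
        labels = D[:, medoids].argmin(axis=1)
        # 3. Update: in each cluster, the new medoid is the member with the
        #    smallest total distance to the other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        # 4. Repeat until the medoids stop changing.
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    # 5. Output: final cluster labels and the medoid points themselves.
    return labels, X[medoids]

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [8.0, 9.0], [9.0, 9.0], [9.0, 8.0]])
print(*kmedoids(X, k=2), sep="\n")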
Applications Of Clustering:
Image segmentation, customer segmentation, market segmentation, anomaly detection, biological network analysis, document clustering, genomics, natural language processing, and many more.
Module 5 – Mining Frequent Patterns And Association.
Q1 Explain Apriori Algorithm And Steps Of Apriori Algorithm.
Ans.
The Apriori algorithm is a popular algorithm for mining frequent itemsets and generating association rules from
transactional databases. It was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. The Apriori algorithm
works based on the "apriori property," which states that if an itemset is frequent, then all of its subsets must also be
frequent. The algorithm uses this property to efficiently discover frequent itemsets.
Steps of the Apriori Algorithm:
1. Initialize:
• Create a table to store the support count of each itemset.
• Scan the transaction database to count the support of each individual item.
2. Generate Frequent 1-Itemsets:
• Identify frequent 1-itemsets by filtering out items with support below a predefined threshold (minimum support).
3. Generate Candidate 2-Itemsets:
• Create candidate 2-itemsets by combining every pair of distinct frequent 1-itemsets {A} and {B} into {A, B}.
4. Scan Database for Support Count:
• Scan the transaction database to count the support of each candidate 2-itemset.
• Prune candidate 2-itemsets that do not meet the minimum support threshold.
5. Generate Candidate k-Itemsets:
• Create candidate k-itemsets by joining frequent (k-1)-itemsets: for each pair of frequent (k-1)-itemsets A and B whose first (k-2) items are equal, generate the candidate A ∪ B.
6. Scan Database for Support Count (Repeat):
• Scan the transaction database to count the support of each candidate k-itemset.
• Prune candidate k-itemsets that do not meet the minimum support threshold.
7. Repeat Until No More Frequent Itemsets:
• Repeat steps 5 and 6 to generate candidate k-itemsets and scan the database until no more frequent itemsets can be
found.
8. Generate Association Rules:
• Use the frequent itemsets to generate association rules that meet a predefined confidence threshold.
• An association rule has the form A -> B, where A and B are itemsets, and the rule's confidence is the ratio of the
support of {A, B} to the support of {A}.
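The steps above can be illustrated with a short, self-contained sketch on a made-up transaction database (the items, minimum support of 3, and 0.8 confidence threshold are all arbitrary demo choices):

# Illustrative Apriori sketch over a toy transaction database.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 3  # absolute support count

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions)

# Frequent 1-itemsets: filter individual items by minimum support.
items = {i for t in transactions for i in t}
freq = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Join (k-1)-itemsets into k-itemset candidates, prune, and count support.
while freq[-1]:
    prev = freq[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    # Apriori property: keep a candidate only if all its subsets are frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
    freq.append({c for c in candidates if support(c) >= min_support})

frequent_itemsets = [s for level in freq for s in level]

# Rule generation: A -> B with confidence = support(A ∪ B) / support(A).
for itemset in frequent_itemsets:
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support(itemset) / support(antecedent)
            if conf >= 0.8:
                print(set(antecedent), "->", set(itemset - antecedent), round(conf, 2))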
What is KDD?
KDD (Knowledge Discovery in Databases) is the overall process of discovering useful knowledge from large volumes of data; data mining is one step within it. The main steps are:
1. Data Cleaning:
○ Removes incorrect or inconsistent data.
2. Data Integration:
○ Combines data from multiple sources.
3. Data Selection:
○ Picks only the data needed for analysis from the database.
4. Data Transformation:
○ Changes data into a format suitable for analysis (e.g., summarizing
data).
5. Data Mining:
○ Uses advanced techniques to find patterns in the data.
6. Pattern Evaluation:
○ Checks the patterns to see if they are meaningful or interesting.
7. Knowledge Presentation:
○ Presents the discovered knowledge using visualizations or reports.
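As a rough illustration only, the sketch below walks hypothetical sales data through these steps using pandas; every table, column, and threshold is made up.

# Minimal, hypothetical KDD pipeline sketch; each step mirrors a stage above.
import pandas as pd

# Data integration: combine data from two (made-up) sources.
sales_a = pd.DataFrame({"customer": ["ann", "bob", None], "amount": [10.0, 25.0, 5.0]})
sales_b = pd.DataFrame({"customer": ["cat", "bob"], "amount": [12.0, -999.0]})
data = pd.concat([sales_a, sales_b], ignore_index=True)

# Data cleaning: drop missing customers and an obviously invalid sentinel value.
data = data.dropna(subset=["customer"])
data = data[data["amount"] > 0]

# Data selection: keep only the columns needed for this analysis.
data = data[["customer", "amount"]]

# Data transformation: summarize per customer (total spend).
summary = data.groupby("customer")["amount"].sum()

# Data mining (trivially simple here): flag high spenders as a "pattern".
pattern = summary[summary > 20]

# Pattern evaluation / knowledge presentation: report the finding.
print("High spenders:\n", pattern)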
Clustering:
1. Choose K centers for the clusters. These centers should be far apart.
2. Group the data into clusters based on which center is closest to each data
point.
3. After grouping, calculate new centers for the clusters based on the data
within them.
4. Repeat the process, grouping data based on the updated centers.
5. Continue until the cluster centers no longer change and no data moves
between clusters.
Association rules have an IF-THEN structure. For example:
1. Antecedent (IF): The first item or group of items bought (e.g., a domain).
2. Consequent (THEN): The related item or group often bought with it (e.g., plugins or extensions).
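A tiny worked example with hypothetical counts shows how such a rule would be scored:

# Hypothetical counts: scoring the rule {domain} -> {plugins}.
total_orders = 200
orders_with_domain = 80               # support count of the antecedent {domain}
orders_with_domain_and_plugins = 60   # support count of {domain, plugins}

support = orders_with_domain_and_plugins / total_orders           # 0.30
confidence = orders_with_domain_and_plugins / orders_with_domain  # 0.75
print(support, confidence)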
Web Mining:
Web mining applies data mining techniques to extract knowledge from web data,
like web pages, hyperlinks, and website usage logs. It helps discover useful
patterns from large datasets. Web mining processes involve:
● Collecting data
● Preprocessing data
● Discovering knowledge
● Analyzing patterns
The internet's importance makes web mining a valuable research area. It focuses
on extracting knowledge from web data, using at least one of the following:
● Web content
● Structure
● Usage (web logs)
Web Content Mining:
This method extracts meaningful data, information, or knowledge from web page
content. It scans and analyzes texts, images, and groups of web pages.
Key Points:
1. Agent-Based Approach:
Involves intelligent systems to find and filter information:
○ Intelligent Search Agents: Use user profiles and domain
characteristics to find relevant information.
○ Information Filtering/Categorization: Retrieve, filter, and
categorize documents using retrieval techniques.
○ Personalized Web Agents: Learn user preferences and find web
information based on similar users.
2. Data-Based Approach:
This organizes semi-structured web data into structured formats, enabling
easier analysis using database queries and mining tools.
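As one minimal illustration of the data-based approach, the sketch below uses Python's standard-library html.parser to turn semi-structured HTML into structured (href, anchor text) records; the page content is made up.

# Sketch: extracting a structured link table from semi-structured HTML.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []      # structured output: (href, anchor text) pairs
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = '<html><body><a href="/docs">Docs</a> and <a href="/blog">Blog</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)   # [('/docs', 'Docs'), ('/blog', 'Blog')]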
Web Structure Mining:
Web structure mining analyzes the organization and relationships between web
pages through hyperlinks. It uses graph theory to understand the structure of
websites.
1. Hyperlink Patterns:
Analyzes connections between pages, with hyperlinks serving as bridges
between locations.
2. Document Structure:
Examines the tree-like structure of web pages using HTML or XML tags.
Key Terms:
1. Link-Based Classification:
Predicts the category of a web page using its content, links, and HTML
tags.
2. Link-Based Clustering:
Groups similar web pages together and separates dissimilar ones.
3. Link Type Prediction:
Determines the type or purpose of a link between two pages.
4. Link Strength and Cardinality:
Measures the importance or number of links between pages.
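As an illustration of hyperlink-graph analysis, the sketch below runs a simplified PageRank (a well-known link-analysis algorithm, though not named above) over a made-up three-page graph:

# Simplified PageRank over a tiny hypothetical hyperlink graph.
graph = {            # page -> pages it links to
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
}

pages = list(graph)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # power iteration until the ranks stabilize
    new_rank = {}
    for p in pages:
        # Sum the rank flowing in from every page that links to p.
        incoming = sum(rank[q] / len(graph[q]) for q in pages if p in graph[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})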
Q6) Write a short note on web usage mining; also state any two of its
applications. (5m)
Web Usage Mining focuses on analyzing user behavior while interacting with the
web. It discovers patterns in user navigation through data like web logs (records
of web activity). These patterns are used to improve personalization, website
design, system performance, business intelligence, and more.
The main data available from users is their path through web pages (the
sequence of pages they visit). While most tools analyze text, they often ignore
valuable link information.
Web usage mining uses four main techniques to analyze user navigation
patterns (a sketch of one of them follows the list):
1. Association Rules
2. Sequential Patterns
3. Clustering
4. Classification Mining
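As a small illustration of the sequential-pattern technique, the sketch below counts page-to-page transitions in hypothetical session logs:

# Counting the most common page-to-page transitions in made-up web logs.
from collections import Counter

sessions = [                      # hypothetical navigation paths from web logs
    ["home", "products", "cart", "checkout"],
    ["home", "blog", "products", "cart"],
    ["home", "products", "reviews"],
]

transitions = Counter()
for path in sessions:
    # Each consecutive pair of pages is one navigation step.
    transitions.update(zip(path, path[1:]))

# Frequent transitions suggest how to improve navigation and page layout.
for (src, dst), count in transitions.most_common(3):
    print(f"{src} -> {dst}: {count}")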
Applications:
1. Personalization: Tailoring web content and recommendations to individual users based on their navigation patterns.
2. Website Design Improvement: Restructuring pages and links based on how users actually navigate the site.