
GUJARAT TECHNOLOGICAL UNIVERSITY

BE – SEMESTER - VI EXAMINATION – SUMMER 2023

Subject Code : 3163209          Date : 18/07/2023
Subject Name : Data Mining and Business Intelligence          Total Marks : 70

Q:1
Q.1 (a) Define: Data Mart, Enterprise Warehouse & Virtual Warehouse. [03]
⮚ Data Mart :

∙ Definition: A Data Mart is a subset of a data warehouse that is focused on a specific area, department, or
function within an organization. It is designed to serve the needs of a particular group of users by
providing them with tailored data and reports.

∙ Key Characteristics:
o Scope: Limited to specific business areas (e.g., sales, finance, marketing).
o Purpose: Provides targeted data to meet the needs of a specific group of users.
o Size: Smaller in size compared to a full data warehouse.
o Data Sources: Often derived from a larger data warehouse, but can also source data directly from
transactional systems.
∙ Example: A retail company might have separate data marts for sales, inventory, and customer service,
each containing data relevant to its respective department.

⮚ Enterprise Warehouse :

∙ Definition: An Enterprise Warehouse is a comprehensive and centralized repository that aggregates data
from across the entire organization. It provides a unified view of enterprise-wide data, facilitating
comprehensive analysis and reporting.

∙ Key Characteristics:
o Scope: Enterprise-wide, covering all business functions and departments.
o Purpose: Provides a holistic view of the organization’s data to support strategic decision-making.
o Size: Large, encompassing all relevant data across the organization.
o Data Integration: Integrates data from various sources, including internal systems (ERP, CRM,
etc.) and external data.
∙ Example: A multinational corporation might use an enterprise warehouse to consolidate data from its
global operations, providing executives with insights into overall business performance.

⮚ Virtual Warehouse :

∙ Definition: A Virtual Warehouse is a logical data warehouse that integrates data from multiple sources
without physically storing the data in a centralized repository. Instead, it uses a virtual integration layer
to provide a unified view of the data.
∙ Key Characteristics:
o Scope: Can cover specific areas (like a data mart) or be enterprise-wide (like an enterprise
warehouse).
o Purpose: Provides a consolidated view of data without the need for physical data storage,
enabling real-time access and analysis.
o Data Integration: Utilizes data virtualization or federation techniques to integrate data from
various sources in real-time.
o Physical Storage: Does not require separate physical storage for the integrated data; instead, it
uses pointers to the original data sources.
∙ Example: An organization might use a virtual warehouse to integrate data from various cloud services
and on-premises databases, allowing users to query and analyze data as if it were stored in a single
repository.

Q.1 (b) A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data – Justify. [04]

A data warehouse is defined as a subject-oriented, integrated, time-variant, and nonvolatile collection of data.
Here’s a justification for each of these characteristics:

1. Subject-Oriented

Justification:
∙ Focus on Business Subjects: A data warehouse is organized around the key subjects or areas of an
organization, such as customers, sales, products, or finance, rather than being organized around the
applications or functions.
∙ Purpose-Driven Data: This organization allows for better analysis and decision-making as the data is
structured to provide information on specific business aspects, aiding in strategic planning and
performance analysis.

2. Integrated

Justification:
∙ Data Consolidation: Data in a data warehouse is integrated from multiple, heterogeneous sources
such as operational databases, external sources, and flat files.
∙ Consistent Data Format: Integration involves consolidating data into a consistent format, resolving
issues related to naming conventions, data types, and coding structures. This ensures that data from
different sources is uniform and can be used cohesively.

3. Time-Variant

Justification:
∙ Historical Data Storage: A data warehouse maintains historical data, often spanning several years.
This historical perspective is crucial for trend analysis, forecasting, and time-based comparisons.
∙ Timestamped Data: Data in a data warehouse is associated with time dimensions, such as daily,
weekly, monthly, or yearly time periods. This allows users to track changes and analyze data over
time.

4. Nonvolatile

Justification:
∙ Stable Data: Once data is entered into the data warehouse, it is not updated or deleted but is instead
appended with new data. This stability is essential for consistent reporting and analysis.
∙ Read-Only Operations: Data warehouses are optimized for read operations, such as querying and
reporting, rather than transactional operations. This nonvolatility ensures that historical data remains
unchanged, providing a reliable basis for analysis.

Summary

∙ Subject-Oriented: Data warehouses are designed to focus on specific subjects or business areas,
facilitating relevant data analysis and reporting.
∙ Integrated: They consolidate data from multiple sources into a cohesive and consistent format, ensuring
uniformity and completeness.
∙ Time-Variant: They store historical data with time stamps, allowing for comprehensive trend analysis
and temporal comparisons.
∙ Nonvolatile: Data remains stable over time, enabling reliable and consistent reporting and analysis
without the need for frequent updates or deletions.

These characteristics collectively ensure that a data warehouse serves as a robust foundation for business
intelligence, providing accurate, comprehensive, and reliable data for informed decision-making.

Q.1 (c) What are different sources of information? Explain the term data, information and
knowledge with suitable example. [07]

⮚ Data mining involves extracting useful information from various types of data sources. Here are different
sources of information commonly used in data mining:

1. Flat Files
∙ Definition: Simple text files that contain data without structured relationships.
∙ Usage: Often used for data storage, transfer, and simple datasets.
∙ Example: CSV (Comma-Separated Values) files, log files.

2. Relational Databases
∙ Definition: Databases that use a structured query language (SQL) to manage and query data organized
into tables with rows and columns.
∙ Usage: Commonly used in business applications for structured data storage and retrieval.
∙ Example: MySQL, Oracle Database, Microsoft SQL Server.

3. Data Warehouse
∙ Definition: Centralized repositories that store integrated data from multiple sources, optimized for
querying and analysis.
∙ Usage: Used for business intelligence, reporting, and data analysis.
∙ Example: Amazon Redshift, Google BigQuery, Teradata.

4. Transactional Database
∙ Definition: Databases that manage real-time transaction data, ensuring data integrity through ACID
(Atomicity, Consistency, Isolation, Durability) properties.
∙ Usage: Used in applications requiring real-time data processing, such as banking and e-commerce.
∙ Example: PostgreSQL, SQLite, IBM Db2.
5. Multimedia Database
∙ Definition: Databases that store and manage multimedia data such as images, audio, and video.
∙ Usage: Used in applications involving digital media, like media libraries and social media platforms.
∙ Example: Oracle Multimedia, IBM DB2 Content Manager.

6. Spatial Database
∙ Definition: Databases optimized to store and query spatial data like maps, geographic information
systems (GIS).
∙ Usage: Used in applications involving location-based services and geographic mapping.
∙ Example: PostGIS (an extension of PostgreSQL), Oracle Spatial.

7. Time-Series Database
∙ Definition: Databases designed to handle time-series data, which consists of sequences of data points
indexed by time.
∙ Usage: Used for applications involving time-stamped data like financial markets and sensor data.
∙ Example: InfluxDB, TimescaleDB.

8. World Wide Web (WWW)


∙ Definition: A vast collection of interconnected documents and resources on the internet.
∙ Usage: Used for extracting web data through web scraping, APIs, or web mining.
∙ Example: Data extracted from websites like social media platforms and online news portals.

9. Time-Series Data
∙ Definition: Sequences of data points collected or recorded at specific time intervals.
∙ Usage: Used for analyzing trends, seasonal patterns, and forecasting.
∙ Example: Stock prices, weather data.

10. Streaming Data


∙ Definition: Continuous flow of data generated in real-time by various sources.
∙ Usage: Used for real-time analytics and monitoring.
∙ Example: Data from IoT devices, real-time social media feeds.

11. Relational Data


∙ Definition: Structured data stored in relational databases, organized in tables with relationships between
them.
∙ Usage: Used for complex queries and data analysis using SQL.
∙ Example: Data in customer relationship management (CRM) systems.

12. Cloud Data


∙ Definition: Data stored in cloud-based storage solutions, accessible over the internet.
∙ Usage: Used for scalable and flexible data storage and processing.
∙ Example: Data in services like AWS S3, Google Cloud Storage, Microsoft Azure Blob Storage.

Data :

∙ Data are any facts, numbers or text that can be processed by a computer.
∙ Organizations are accumulating vast and growing amounts of data in different formats.
∙ Example:
o Operational data : sales, cost, inventory, payroll
o Non-operational data : forecast data, macroeconomic data
o Metadata : data about the data itself

Information :
∙ This is what you get when you process and organize data. It provides context and meaning to the raw
data.
∙ The patterns, associations, or relationships among all this data can provide information.
∙ Example:
o Analysis of retail point-of-sale transaction data can yield information about which products are
selling.

Knowledge :
∙ It is what you get when you analyze information and apply your experience and understanding.
∙ Information can be converted into knowledge about historical patterns and future trends.
∙ Example:
o Summary information about retail supermarket sales can be used by the retailer to determine which
items are most suitable for promotional efforts.

----------------------------------------------------------------------------------------------------------------------------------------------------------

Q:2

Q.2 (a) Differentiate Fact table vs. Dimension table. [03]


∙ Definition :
o Fact Table : Contains quantitative data for analysis, often numeric.
o Dimension Table : Contains descriptive attributes related to the data in the fact table.
∙ Purpose :
o Fact Table : Stores measurable data and business metrics.
o Dimension Table : Provides context to facts, enabling data slicing and dicing.
∙ Content :
o Fact Table : Numeric values like sales amounts, quantities, performance metrics.
o Dimension Table : Descriptive information like product details, dates, customer information.
∙ Granularity :
o Fact Table : High granularity, contains detailed transactional data.
o Dimension Table : Lower granularity, contains summarized descriptive data.
∙ Example Entries :
o Fact Table : Sales transactions, order quantities, revenue.
o Dimension Table : Product names, customer addresses, time periods.
∙ Data Type :
o Fact Table : Mostly numeric and quantitative.
o Dimension Table : Mostly textual and descriptive attributes.
∙ Normalization :
o Fact Table : Typically denormalized for query performance.
o Dimension Table : Often more normalized to avoid data redundancy.
∙ Storage Size :
o Fact Table : Can be very large due to the high volume of transactions.
o Dimension Table : Generally smaller compared to fact tables.
∙ Example :
o Fact Table : Fact_Sales (columns: DateKey, ProductKey, SalesAmount)
o Dimension Table : Dim_Product (columns: ProductKey, ProductName, Category)

Q.2 (b) Define noise. Explain binning methods for data smoothing. [04]
⮚ Noise in data mining :
∙ Noisy data are data with a large amount of additional meaningless information called noise.
∙ This includes corrupted data, as well as any data that a user or system cannot understand and interpret
correctly.
∙ Noisy data are data that are corrupted, distorted, or have a low signal-to-noise ratio.
Data = True Signal + Noise

∙ Noisy data unnecessarily increase the amount of storage space required and can adversely affect any
data mining analysis results.

⮚ Data Smoothing : Removing noise from a data set is termed data smoothing.
∙ The following techniques can be used for smoothing :
o Binning, Regression, Clustering, Outlier analysis

⮚ Binning (Bucketing) :
∙ Binning is a technique where we sort the data and then partition it into bins.
∙ Then we may replace the noisy data with the bin mean, the bin median, or the bin boundaries.
∙ Smoothing by bin mean :
o Values in a bin are replaced by the mean value of the bin.

∙ Smoothing by bin median :
o Values in a bin are replaced by the median value of the bin.

∙ Smoothing by bin boundaries :
o The minimum and maximum values of the bin are taken as the bin boundaries, and each value in the
bin is replaced by the closest boundary value.

∙ There are two methods of dividing data into bins :

1) Equal Frequency Binning:


Bins have an equal frequency.

For example, equal frequency:


Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13] : Bin 1
[15, 35, 50, 55] : Bin 2
[72, 92, 204, 215] : Bin 3

2) Equal Width Binning:

Bins have equal width, with the boundary of each bin defined as
[min + w], [min + 2w], …, [min + nw]
where w = (max - min) / (number of bins).

For example, equal Width:


Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
range : w = (215-5) / 3 = 70

Bin 1 range : [5+70] = 75


Bin 2 range : [5+ 2(70)] = 145
Bin 3 range : [5+ 3(70)] = 215

Bin 1 : [5, 10, 11, 13, 15, 35, 50, 55, 72]
Bin 2 : [92]
Bin 3 : [204, 215]
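A minimal Python sketch reproducing the two worked examples above (the bin size of four values and the width w = 70 are taken directly from the example; numpy is assumed to be available):

import numpy as np

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]   # already sorted

# Equal-frequency binning: three bins of four values each.
per_bin = 4
freq_bins = [data[i:i + per_bin] for i in range(0, len(data), per_bin)]
print("Equal-frequency bins:", freq_bins)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed = [[round(float(np.mean(b)), 1)] * len(b) for b in freq_bins]
print("Smoothed by bin means:", smoothed)

# Equal-width binning: w = (max - min) / number_of_bins = (215 - 5) / 3 = 70.
w = (max(data) - min(data)) / 3
b1, b2 = min(data) + w, min(data) + 2 * w                   # boundaries 75 and 145
width_bins = [[x for x in data if x <= b1],
              [x for x in data if b1 < x <= b2],
              [x for x in data if x > b2]]
print("Equal-width bins:", width_bins)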

Q.2 (c) Explain different OLAP operation with example. [07]
⮚ OLAP Operations :

1) Roll-up :

∙ The roll-up operation (also known as drill-up or aggregation operation) performs aggregation on a data
cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
∙ Roll-up is like zooming out on the data cube.
∙ The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for
location is defined as the order street < city < province or state < country.
∙ The roll-up operation aggregates the data by ascending the location hierarchy from the level of the city to
the level of the country.

2) Drill-down :

∙ The drill-down operation (also called roll-down) is the reverse operation of roll-up. Drill-down is like
zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down can
be performed by either stepping down a concept hierarchy for a dimension or adding additional
dimensions.

∙ The figure shows a drill-down operation performed on the dimension time by stepping down a concept
hierarchy which is defined as day < month < quarter < year. Drill-down occurs by descending the time
hierarchy from the level of the quarter to the more detailed level of the month.
3) Slice :
∙ A slice is a subset of the cube corresponding to a single value for one or more members of a
dimension. For example, a slice operation is executed when the user wants a selection on one
dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice operation
performs a selection on one dimension of the given cube, thus resulting in a subcube.

4) Dice :
∙ The dice operation defines a subcube by performing a selection on two or more dimensions.
5) Pivot :
The pivot operation is also called rotation. Pivot is a visualization operation which rotates the data axes in
view to provide an alternative presentation of the data. It may involve swapping the rows and columns or
moving one of the row dimensions into the column dimensions.
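The figures referenced above are not reproduced here. As a rough illustration only, the same operations can be mimicked on a small pandas data frame; the location, time and amount values below are made up:

import pandas as pd

# Tiny illustrative "cube" with location and time dimensions (made-up values).
sales = pd.DataFrame({
    "country": ["India", "India", "Canada", "Canada"],
    "city":    ["Mumbai", "Delhi", "Toronto", "Vancouver"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 150, 200, 250],
})

rollup = sales.groupby("country")["amount"].sum()            # roll-up: city level -> country level
slice_q1 = sales[sales["quarter"] == "Q1"]                   # slice: one value on one dimension
dice = sales[(sales["quarter"] == "Q1") & (sales["country"] == "India")]   # dice: two dimensions
pivot = sales.pivot_table(index="city", columns="quarter",
                          values="amount", aggfunc="sum")    # pivot: rotate the axes in view
print(rollup, slice_q1, dice, pivot, sep="\n\n")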

OR Q.2 (c) What is Data Mining? Explain Data mining as one step of Knowledge Discovery
Process. [07]

⮚ Data Mining :
∙ Data mining is concerned with solving problems by analyzing data already present in databases.
∙ "Extraction of interesting (nontrivial, implicit, previously unknown and potentially useful) information
or patterns from huge amounts of data."
∙ Data mining is the process of identifying valid, novel, potentially useful and ultimately understandable
patterns in data.

⮚ Knowledge discovery from databases (KDD) :

∙ With the enormous amount of data stored in files, databases and other repositories, it is increasingly
important, if not necessary, to develop powerful means for the analysis and interpretation of such data
and for the extraction of interesting knowledge that could help in decision making.
∙ Actually, data mining is only one step of the entire KDD process.

∙ The knowledge discovery in databases process comprises a few steps leading from raw data
collection to some form of new knowledge.

∙ The iterative process consists of the following steps :

∙ Data cleaning : phase in which noisy data and irrelevant data are removed from the collection.
∙ Data integration : phase in which multiple, often heterogeneous, data sources may be combined into a
common source.
∙ Data selection : phase in which the data relevant to the analysis is decided on and retrieved from the
data collection.
∙ Data transformation : also known as data consolidation, the phase in which the selected data is
transformed into forms appropriate for the mining procedure.
∙ Data mining : the crucial step in which clever techniques are applied to extract patterns of potentially
useful information.
∙ Pattern evaluation : phase in which strictly interesting patterns representing knowledge are identified
based on given measures.
∙ Knowledge representation : final phase in which the discovered knowledge is visually represented to the
user.
o This essential step uses visualization techniques to help users understand and interpret the data
mining results.

∙ It is common to combine some of these steps together; for instance, data cleaning and data integration
can be performed together as a preprocessing phase to generate a data warehouse.

∙ Data selection and data transformation can also be combined, where the consolidation of the data is the
result of the selection or, as in the case of a data warehouse, the selection is done on already
transformed data.

∙ KDD is an iterative process: once the discovered knowledge is presented to the user, the evaluation
measures can be enhanced and the mining can be further refined.

----------------------------------------------------------------------------------------------------------------------------------------------------------
Q:3
Q.3 (a) List and describe the methods for handling the missing values in data cleaning. [03]

⮚ Data Cleaning :
o Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted,
duplicated or insufficient data from a dataset.

⮚ Missing or incomplete values :

o Incomplete/missing values can be handled using techniques such as the following :
1) Ignore the tuple
2) Fill in the missing value by human
3) Use of global constant
4) Imputation

1. Ignore the Tuple


∙ Description: This method involves removing any data records (rows) that contain missing values from the
dataset.
∙ Usage:
o Commonly used when the number of missing values is small compared to the overall dataset.
o Applied when the missing data is considered random and not crucial to the analysis.
∙ Example: If a dataset has 1,000 records and 50 of them have at least one missing value, those 50 records
are removed, leaving 950 complete records for analysis.

2. Fill in the Missing Value by Human


∙ Description: A knowledgeable individual manually examines and fills in missing values based on their
understanding and context of the data.
∙ Usage:
o Used when the dataset is small, and domain expertise is required to make informed decisions
about missing values.
o Applied when the missing values are few and the context is critical.
∙ Example: In a customer database, a missing address could be filled in by looking up the customer’s
information in another source or by contacting the customer directly.

3. Use Global Constant


∙ Description: Replace missing values with a global constant that indicates missingness. This constant
could be a specific number or string that doesn’t naturally occur in the dataset.
∙ Usage:
o Suitable for categorical data where missing values can be explicitly flagged.
o Useful when you want to retain all records and explicitly identify missing data.
∙ Example: In a dataset of survey responses, missing values for income might be replaced with a constant
like “-9999” or “Unknown”.
4. Imputation
∙ Imputation involves replacing missing data with substituted values. There are various imputation
techniques:

a. Mean Imputation
o Description: Replace missing values with the mean of the observed values for that variable.
o Usage: Commonly used for continuous numerical data.
b. Median Imputation
o Description: Replace missing values with the median of the observed values for that variable.
o Usage: Useful for skewed distributions or when data contains outliers.
c. Mode Imputation
o Description: Replace missing values with the mode (most frequent value) of the observed values for that
variable.
o Usage: Commonly used for categorical data.
d. Predictive Imputation
o Description: Use statistical models or machine learning algorithms to predict and replace missing values.
o Usage: Suitable for both continuous and categorical data, particularly when relationships between
variables can be leveraged.
∙ Example:
o Mean Imputation: For a dataset with missing ages, calculate the mean age of the available data
and replace missing values with this mean.
o Median Imputation: For a dataset with skewed income data, use the median income to replace
missing values.
o Mode Imputation: For a dataset with missing gender values, replace missing values with the most
frequently occurring gender.
o Predictive Imputation: Use a regression model to predict and fill in missing values based on other
variables.
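As a brief illustration of techniques 1, 3 and 4, here is a pandas sketch on a made-up table; the column names and values are hypothetical, not taken from any dataset in this paper:

import pandas as pd

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age":    [25, None, 40, 35, None],
    "income": [30000, 42000, None, 39000, 41000],
    "gender": ["F", "M", None, "F", "M"],
})

complete_only = df.dropna()                                  # 1) ignore the tuple
df["income"] = df["income"].fillna(-9999)                    # 3) global constant flag
df["age"] = df["age"].fillna(df["age"].mean())               # 4a) mean imputation
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])   # 4c) mode imputation
print(complete_only, df, sep="\n\n")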

Q.3 (b) What is market basket analysis? Explain the two measures of rule interestingness:
support and confidence. [04]

⮚ Market basket analysis is a data mining technique used to uncover associations between items in large
datasets, typically transactional data from retail settings. This method helps identify patterns of items
frequently purchased together, enabling businesses to optimize product placements, promotions, and
inventory management. The analysis is commonly used to understand customer purchasing behavior and
to drive marketing strategies.

⮚ Two key measures used in market basket analysis to evaluate the "interestingness" of discovered rules
are support and confidence:

1. Support :

o Support measures how frequently the items in the rule appear together in the dataset. It is defined as
the proportion of transactions in the dataset that contain all the items in the rule.
o Mathematically, support for an association rule A → B (where A and B are itemsets) is given by:

Support(A → B) = (number of transactions containing A ∪ B) / (total number of transactions)

o Support is a useful measure because it helps identify how common a particular combination of items is
within the entire dataset. Higher support values indicate more frequent combinations, which can be
more significant for business decisions.

2. Confidence :

o Confidence measures the likelihood that the items in the consequent (right-hand side) of the rule are
purchased when the items in the antecedent (left-hand side) are purchased. It is defined as the
proportion of transactions containing the antecedent that also contain the consequent.
o Mathematically, confidence for an association rule A → B is given by:

Confidence(A → B) = Support(A ∪ B) / Support(A)

o In other words, confidence is a conditional probability: it tells us how often the items in B appear in
transactions that contain A. Higher confidence values indicate a stronger association between the
antecedent and consequent, suggesting that the occurrence of A strongly implies the occurrence of B.

o Example to Illustrate :

o Consider a small dataset of transactions:


{bread, milk, eggs}
{bread, butter}
{milk, eggs}
{bread, milk}
{bread, eggs}
{milk, butter}

o Let's analyze the rule: bread→milk


o Support: The rule bread→milk appears in 2 out of 6 transactions.
Support(bread→milk) = 2/6 ≈ 0.33

o Confidence: Of the 4 transactions that contain bread, 2 also contain milk.
Confidence(bread→milk) = 2/4 = 0.5

o This means that 33% of all transactions contain both bread and milk, and if a customer buys bread, there
is a 50% chance they will also buy milk.
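The same numbers can be reproduced with a few lines of Python over the six transactions listed above (a minimal sketch, not a full association-rule miner):

transactions = [
    {"bread", "milk", "eggs"}, {"bread", "butter"}, {"milk", "eggs"},
    {"bread", "milk"}, {"bread", "eggs"}, {"milk", "butter"},
]
A, B = {"bread"}, {"milk"}

both = sum(1 for t in transactions if A <= t and B <= t)   # transactions containing A and B
ante = sum(1 for t in transactions if A <= t)              # transactions containing A

support = both / len(transactions)                         # 2/6 ≈ 0.33
confidence = both / ante                                   # 2/4 = 0.50
print(f"support = {support:.2f}, confidence = {confidence:.2f}")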

Q.3 (c) State the Apriori Property. Generate large itemsets and association rules using Apriori
algorithm on the following data set with minimum support value and minimum confidence
value set as 50% and 75% respectively. [07]
⮚ The Apriori property, also known as the downward closure property, is a fundamental principle used in the Apriori algorithm for mining
frequent itemsets and association rules in large databases. The Apriori property states:

"If an itemset is frequent, then all of its subsets must also be frequent."

⮚ In other words, for an itemset to be considered frequent, every subset of that itemset must appear in the database at least as frequently as
the itemset itself. Conversely, if an itemset is infrequent, then all of its supersets (itemsets that contain the original itemset) are also
infrequent.
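The transaction table for this question is not reproduced in the text, so the numeric answer cannot be shown here. As a rough illustration of how the algorithm uses the Apriori property, the sketch below mines frequent itemsets from a hypothetical set of transactions with minimum support 50%; association rules would then be formed from each frequent itemset and kept only if their confidence is at least 75%.

from itertools import combinations

# Hypothetical transactions; NOT the exam's data set.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "C"}]
min_support = 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# L1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

k = 2
while levels[-1]:
    prev = levels[-1]
    # Join step, then prune using the Apriori property: a candidate survives
    # only if every (k-1)-subset of it is already frequent.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    levels.append({c for c in candidates if support(c) >= min_support})
    k += 1

for size, itemsets in enumerate(levels, start=1):
    if itemsets:
        print(f"Frequent {size}-itemsets:", [sorted(s) for s in itemsets])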


OR Q.3 (a) Discuss click-stream analysis using data mining. [03]


⮚ Click-stream analysis involves collecting, processing, and analyzing the sequence of clicks made by users
while navigating a website or using a web application. This process captures data on the user's online
behavior, including the pages they visit, the time spent on each page, and the actions they perform (e.g.,
clicking links, making purchases). Data mining techniques are employed to extract meaningful insights
from this vast amount of raw click-stream data.

Key Components of Click-Stream Analysis

1. Data Collection:
∙ Logs: Web servers generate logs containing details of user interactions.
∙ Cookies: Small pieces of data stored on the user's device to track visits and activities.
∙ Tags: Embedded scripts (e.g., JavaScript) on web pages to capture user actions in real-time.
∙ Tools: Analytics platforms like Google Analytics, Adobe Analytics, and specialized software for
click-stream data collection.

2. Data Preprocessing:
∙ Cleaning: Removing irrelevant data, dealing with missing values, and filtering out bot traffic.
∙ Sessionization: Dividing the click-stream into distinct user sessions based on predefined inactivity
thresholds.
∙ Transformation: Converting raw data into a structured format suitable for analysis (e.g., time
spent on pages, sequence of page views).
3. Data Mining Techniques:
∙ Pattern Recognition: Identifying common navigation paths and frequent sequences of actions.
∙ Association Rule Mining: Discovering relationships between pages or actions (e.g., users who
view page A often proceed to page B).
∙ Clustering: Grouping users based on similar browsing behaviors to identify different user
segments.
∙ Classification: Assigning predefined categories to users based on their behavior (e.g., new
visitors, returning customers, high-value customers).
∙ Sequential Pattern Mining: Finding patterns in the order of user actions (e.g., a typical purchase
funnel).

Applications of Click-Stream Analysis

1. User Behavior Analysis:


∙ Understanding user preferences and interests.
∙ Identifying popular content and high-traffic areas of the site.

2. Website Optimization:
∙ Improving site navigation and user interface based on user interaction data.
∙ Identifying and resolving bottlenecks or drop-off points in the user journey.

3. Personalization:
∙ Tailoring content, recommendations, and marketing messages to individual users or user
segments.
∙ Enhancing the user experience by presenting relevant information and offers.

4. Conversion Rate Optimization:


∙ Analyzing the paths leading to conversions (e.g., purchases, sign-ups) to optimize the funnel.
∙ Testing and validating changes (e.g., A/B testing) to improve conversion rates.

5. Fraud Detection:
∙ Monitoring unusual patterns of behavior that may indicate fraudulent activities.
∙ Implementing security measures to protect against fraud.

6. Marketing and Campaign Analysis:


∙ Evaluating the effectiveness of online marketing campaigns.
∙ Tracking user responses to different marketing channels and strategies.

Click-stream analysis using data mining involves collecting and analyzing user navigation data
to gain insights into user behavior, optimize websites, personalize user experiences, and
improve conversion rates. By leveraging data mining techniques, businesses can make
data-driven decisions to enhance their online presence and achieve their objectives.
However, it also requires addressing challenges related to data volume, privacy, and
integration to be effective.

OR Q.3 (b) Explain the Min-max data normalization method with suitable
example. [04]

⮚ Min-max normalization is a technique used in data transformation. (Data transformation is
the process of converting data from its original format or structure into a format that is
suitable for analysis.)
OR Q.3 (c) Explain three-tier Data Warehouse Architecture. [07]
⮚ Three-tier Data Warehouse Architecture :

1. Bottom Tier :
o The bottom tier is the data warehouse database server.
o It is almost always a relational database system.
o It handles query processing and data materialization, including data partitioning.
o Back-end tools are used at this tier to feed data into the warehouse.

2. Middle Tier :
o The middle tier is an OLAP server, implemented either as a Relational OLAP (ROLAP) server, i.e., an
extended relational DBMS that maps multidimensional operations onto relational operations, or as a
Multidimensional OLAP (MOLAP) server, i.e., a special-purpose server that directly implements
multidimensional data and operations.
o Hybrid OLAP (HOLAP) combines both approaches and gives the user more flexibility.
o The OLAP server supports multidimensional data organized using star or snowflake schemas.

3. Top Tier :
o The top tier is the front-end client layer.
o It contains query and reporting tools and data mining tools.
o It accepts client queries and acts as the interface between the user and the middle tier.

⮚ Advantages of 3-tier Architecture :

∙ Achieves a clean logical separation between the layers.
∙ Provides high performance.
∙ Allows the business logic layer to be developed independently.
∙ Can handle more complex problems.

----------------------------------------------------------------------------------------------------------------------------------------------------------

Q:4
Q.4 (a) Discuss following terms.
1) Supervised learning 2) Correlation analysis 3) Tree pruning [03]

1) Supervised Learning

Definition: Supervised learning is a type of machine learning where the model is trained on labeled data. This
means that each training example is paired with an output label. The goal is for the model to learn the mapping
from inputs to outputs so it can accurately predict the labels for new, unseen data.

Key Points:
∙ Training Process: The algorithm learns from a training dataset that includes input-output pairs.
∙ Applications: Commonly used in tasks like classification (e.g., spam detection, image recognition) and
regression (e.g., predicting house prices).
∙ Examples: Linear regression, logistic regression, decision trees, support vector machines, and neural
networks.

2) Correlation Analysis

Definition: Correlation analysis is a statistical method used to assess the strength and direction of the linear
relationship between two numerical variables. It quantifies how changes in one variable are associated with
changes in another variable.

Key Points:
∙ Correlation Coefficient: The most common measure is the Pearson correlation coefficient, which ranges
from -1 to 1. A value close to 1 implies a strong positive correlation, -1 implies a strong negative
correlation, and 0 implies no correlation.
∙ Interpretation: Helps in identifying and quantifying the relationship between variables, but it does not
imply causation.
∙ Applications: Widely used in exploratory data analysis to identify potential relationships between
variables.
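A quick illustration of the Pearson coefficient on made-up paired measurements (numpy assumed available; the values are hypothetical, e.g. advertising spend vs. sales):

import numpy as np

x = np.array([10, 20, 30, 40, 50], dtype=float)
y = np.array([12, 22, 29, 44, 53], dtype=float)

# Pearson r = cov(x, y) / (std(x) * std(y)); np.corrcoef returns the 2x2 correlation matrix.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation coefficient: {r:.3f}")   # close to +1 -> strong positive correlation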
3) Tree Pruning

Definition: Tree pruning is a technique used in decision tree algorithms to remove sections of the tree that
provide little to no power in predicting target variables. The goal is to reduce the complexity of the model and
enhance its generalization to new data.

Key Points:
∙ Purpose: Prevents overfitting by trimming nodes that add little predictive value.
∙ Methods:
∙ Pre-pruning (early stopping): Stops the growth of the tree early based on certain criteria (e.g.,
maximum tree depth, minimum number of samples required to split a node).
∙ Post-pruning: Removes branches from a fully grown tree that have little importance. This can be
done using techniques such as cost complexity pruning or reduced error pruning.
∙ Applications: Used to improve the performance and interpretability of decision tree models in
classification and regression tasks.

These concepts are fundamental in the field of data mining and machine learning, each playing a crucial role in
building, analyzing, and optimizing models.
Q.4 (b) Differentiate Association vs. Classification. [04]
∙ Definition :
o Association : Discovers relationships or associations between variables in large datasets.
o Classification : Assigns items to predefined categories or classes based on input features.
∙ Goal :
o Association : To find frequent patterns, correlations, or associations among a set of items or variables.
o Classification : To predict the category or class label of new observations based on training data.
∙ Output :
o Association : Association rules (e.g., "If A, then B").
o Classification : A model that can assign class labels to new data points.
∙ Type of Learning :
o Association : Typically unsupervised learning.
o Classification : Supervised learning.
∙ Data Requirement :
o Association : Requires transactional or categorical data to find co-occurrences.
o Classification : Requires labeled data with predefined classes.
∙ Data Characteristics :
o Association : Often works with large datasets with categorical variables.
o Classification : Can work with various types of data including numerical, categorical, and text data.
∙ Interpretability :
o Association : Typically provides easily interpretable rules.
o Classification : Model interpretability varies (e.g., decision trees are interpretable, but neural networks
are less so).
∙ Example Techniques :
o Association : Apriori, Eclat, FP-Growth.
o Classification : Decision Trees, Naive Bayes, k-Nearest Neighbors, Support Vector Machines, Neural
Networks.
∙ Application Areas :
o Association : Market basket analysis, recommendation systems, event correlation.
o Classification : Spam detection, image recognition, medical diagnosis, credit scoring.

Q.4 (c) Explain Bayes' Theorem and calculate the Naïve Bayesian Classification for the given example: [07]
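The worked table for this question is not reproduced in the text, so the numeric classification cannot be shown here. For reference, Bayes' theorem states P(H|X) = P(X|H) · P(H) / P(X); the naïve Bayesian classifier additionally assumes class-conditional independence, so P(X|Ci) = Πk P(xk|Ci), and a tuple X is assigned to the class Ci that maximizes P(X|Ci) · P(Ci). A minimal sketch of this calculation on hypothetical categorical data (not the exam's table):

from collections import Counter

# Hypothetical training tuples (attributes, class label); NOT the exam data.
train = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "no_play"),
    ({"outlook": "rain",  "windy": "no"},  "play"),
    ({"outlook": "rain",  "windy": "yes"}, "no_play"),
    ({"outlook": "sunny", "windy": "no"},  "play"),
]

def naive_bayes(x):
    class_counts = Counter(label for _, label in train)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(train)                              # prior P(Ci)
        for attr, val in x.items():                           # likelihoods P(xk | Ci)
            matches = sum(1 for feats, label in train
                          if label == c and feats[attr] == val)
            score *= matches / n_c
        scores[c] = score                                     # proportional to P(Ci | X)
    return max(scores, key=scores.get), scores

print(naive_bayes({"outlook": "sunny", "windy": "no"}))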

OR Q.4 (a) Draw the topology of a multilayer, feed-forward Neural Network. [03]

[Note : Just for information]

⮚ A Multilayer Feed-Forward Neural Network (MFFNN) is an interconnected Artificial Neural Network with
multiple layers; its neurons have weights associated with them and compute their results using
activation functions.
⮚ It is a type of Neural Network in which the flow is from the input units to the output units: it has no
loops and no feedback, and no signal moves in the backward direction, that is, from the output layer to
the hidden or input layers.
⮚ This network has one or more hidden layers, which makes it a multilayer neural network, and it is
feed-forward because signals are propagated only in the forward direction. The network consists of the
following layers:
o Input Layer: The starting layer of the network; its connections carry weights associated with the
input signals.
o Hidden Layer: This layer lies after the input layer and contains multiple neurons that perform the
computations and pass the results to the output units.
o Output Layer: The layer that contains the output units or neurons and receives the processed data
from the hidden layer; if further hidden layers are connected, the weighted values are passed on to
them for further processing to obtain the desired result.
⮚ Topology of a Multilayer Feed-Forward Neural Network :
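⮚ The figure itself is not reproduced in the text; an indicative sketch of a small 3-4-2 topology is:
o Input layer (x1, x2, x3) → Hidden layer (h1, h2, h3, h4) → Output layer (y1, y2)
o Every unit in one layer is connected by a weighted link to every unit in the next layer, and there are no
backward or lateral connections.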

OR Q.4 (b) Explain data mining application for fraud detection. [04]
Data mining is highly effective in detecting and preventing fraud by analyzing large datasets to uncover patterns,
anomalies, and relationships indicative of fraudulent activities. Here's an explanation of how data mining is
applied in fraud detection:

Steps Involved in Fraud Detection Using Data Mining


1. Data Collection:
∙ Sources: Transaction logs, financial records, customer profiles, network logs, and more.
∙ Integration: Combining data from multiple sources to create a comprehensive dataset for
analysis.
2. Data Preparation:
∙ Cleaning: Removing noise, duplicates, and correcting inconsistencies in the data.
∙ Transformation: Converting data into a suitable format, such as normalizing values or encoding
categorical variables.
∙ Feature Engineering: Creating new features or variables that can help in detecting fraud, such as
transaction frequency, amount variations, or geographic patterns.
3. Exploratory Data Analysis:
∙ Visualization: Using charts and graphs to identify patterns and outliers in the data.
∙ Statistical Analysis: Applying statistical methods to understand the distribution and relationships
within the data.
4. Model Building:
∙ Supervised Learning: Using labeled data where fraudulent and non-fraudulent instances are
known to train classification models like logistic regression, decision trees, support vector
machines, or neural networks.
∙ Unsupervised Learning: Applying clustering or anomaly detection techniques to identify unusual
patterns without prior labels. Techniques include k-means clustering, DBSCAN, or isolation
forests.
5. Model Evaluation:
∙ Metrics: Evaluating models using metrics like accuracy, precision, recall, F1 score, and Area Under
the ROC Curve (AUC-ROC) to ensure they effectively distinguish between fraudulent and
legitimate activities.
∙ Cross-Validation: Using methods like k-fold cross-validation to ensure the model's robustness and
generalizability.
6. Deployment and Monitoring:
∙ Real-Time Detection: Implementing models to analyze transactions in real-time and flag
suspicious activities.
∙ Continuous Learning: Continuously updating models with new data to adapt to evolving fraud
patterns.

Applications in Various Industries :

o Banking and Finance: Credit Card Fraud, Insurance Fraud


o E-commerce: Transaction Fraud, Account Takeover
o Telecommunications: Subscription Fraud, Call Routing Fraud
o Healthcare: Billing Fraud, Prescription Fraud

Benefits of Data Mining in Fraud Detection

o Efficiency: Automates the detection process, allowing for real-time analysis and quicker response to
fraudulent activities.
o Accuracy: Provides high accuracy in identifying fraud by leveraging complex algorithms and large
datasets.
o Scalability: Can handle vast amounts of data and adapt to new fraud patterns as they emerge.
o Cost-Effectiveness: Reduces the costs associated with fraud by preventing losses and minimizing the
need for extensive manual investigations.

OR Q.4 (c) Define linear and nonlinear regression using figures. Calculate the value of Y for
X=100 based on Linear regression prediction method. [07]

⮚ https://youtu.be/zUQr6HAAKp4 (Similar one)
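⮚ The exam's data table and figures are not reproduced here. Linear regression fits a straight line y = a + b·x to the data, while nonlinear regression fits a curve such as a polynomial or exponential. As a hedged illustration only, the least-squares fit and the prediction at X = 100 can be computed as below on made-up (x, y) pairs:

import numpy as np

# Hypothetical observations; NOT the exam's table.
x = np.array([10, 20, 30, 40, 50], dtype=float)
y = np.array([12, 25, 33, 46, 55], dtype=float)

# Least-squares estimates for the line y = a + b*x.
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

x_new = 100
print(f"y = {a:.2f} + {b:.2f}x  ->  prediction at X = {x_new}: {a + b * x_new:.2f}")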

----------------------------------------------------------------------------------------------------------------------------------------------------------

Q:5
Q.5 (a) Describe web mining using example. [03]
Web mining refers to the application of data mining techniques to extract valuable insights and knowledge from
web data, including web content, web structure, and web usage data. It can be broadly categorized into three
main areas: web content mining, web structure mining, and web usage mining.

Types of Web Mining

1. Web Content Mining:

∙ Definition: Involves extracting useful information from the content of web pages, which includes
text, images, videos, and audio.

∙ Example: A search engine like Google uses web content mining to index web pages and retrieve
relevant pages based on user queries. For instance, when a user searches for "best Italian
restaurants in New York," the search engine mines the web content to return a list of relevant
restaurants with descriptions, reviews, and ratings.

2. Web Structure Mining:

∙ Definition: Analyzes the structure of the web, which involves understanding the connections and
relationships between different web pages, often represented as a graph.

∙ Example: PageRank, the algorithm used by Google, is a prime example of web structure mining. It
evaluates the importance of web pages based on the number and quality of links pointing to
them. For instance, a web page that is linked by many high-authority pages is deemed more
important and is ranked higher in search results.

3. Web Usage Mining:

∙ Definition: Focuses on analyzing user behavior through the data generated by web interactions,
such as server logs, cookies, and clickstream data.

∙ Example: An e-commerce site like Amazon uses web usage mining to analyze customer behavior. By
tracking the sequence of pages a user visits, items they click on, and their purchase history,
Amazon can recommend products tailored to individual user preferences. For instance, if a user
frequently views and buys electronic gadgets, the site will suggest new gadgets that might
interest them.

Q.5 (b) What is Big Data? What is big data analytic? [04]

⮚ Big Data refers to extremely large and complex datasets that cannot be easily managed, processed, or
analyzed using traditional data processing tools. These datasets are characterized by the following five
V's:

1. Volume: The sheer amount of data generated, often measured in terabytes, petabytes, or even exabytes.
The scale of data is enormous, coming from various sources such as social media, sensors, transactions,
and more.
2. Velocity: The speed at which data is generated, collected, and processed. Big Data often involves
real-time or near-real-time data streams, requiring rapid processing to derive timely insights.
3. Variety: The diverse types of data, including structured data (e.g., databases), semi-structured data (e.g.,
XML, JSON files), and unstructured data (e.g., text, images, videos).

4. Veracity: The accuracy and trustworthiness of the data.

5. Value: The potential insights and benefits that can be derived from analyzing the data.

⮚ What is Big Data Analytics in Data Mining?

Big Data Analytics in data mining refers to the process of examining large and varied datasets to uncover hidden
patterns, correlations, trends, and other useful information that can aid in decision-making and strategic
planning. Big Data Analytics leverages advanced data mining techniques, algorithms, and tools specifically
designed to handle the volume, velocity, and variety of Big Data.

Key Components of Big Data Analytics

1. Data Collection and Storage:


∙ Sources: Data is collected from various sources like social media platforms, IoT devices,
transaction logs, sensors, and more.
∙ Storage Solutions: Due to the massive size of Big Data, traditional databases are often
insufficient. Distributed storage systems like Hadoop Distributed File System (HDFS) and cloud
storage solutions are commonly used.

2. Data Processing:
∙ Batch Processing: Techniques like MapReduce and tools like Apache Hadoop are used to process
large batches of data.
∙ Stream Processing: Tools like Apache Kafka, Apache Flink, and Apache Storm handle real-time
data streams, allowing for immediate analysis and response.

3. Data Analysis:
∙ Descriptive Analytics: Summarizing historical data to understand what has happened in the past.
∙ Predictive Analytics: Using statistical models and machine learning algorithms to predict future
trends and behaviors.
∙ Prescriptive Analytics: Recommending actions based on the analysis to achieve desired
outcomes.

4. Data Visualization:
∙ Tools and Techniques: Visualizing the results of Big Data Analytics using tools like Tableau, Power BI,
and D3.js to make the insights understandable and actionable.

Applications of Big Data Analytics

1. Healthcare:
∙ Predictive Modeling: Predicting disease outbreaks and patient outcomes based on historical
health data.
∙ Personalized Medicine: Tailoring treatment plans based on individual genetic information and
medical history.

2. Retail:
∙ Customer Insights: Analyzing customer behavior to personalize shopping experiences and
improve customer satisfaction.
∙ Supply Chain Optimization: Improving inventory management and logistics through predictive
analytics.

3. Finance:
∙ Fraud Detection: Identifying fraudulent transactions in real-time using anomaly detection
techniques.
∙ Risk Management: Assessing and mitigating financial risks through predictive models.

4. Manufacturing:
∙ Predictive Maintenance: Predicting equipment failures before they occur to minimize downtime
and maintenance costs.
∙ Quality Control: Analyzing production data to improve product quality and reduce defects.

Example of Big Data Analytics Process

1. Data Collection: A retail company collects data from various sources such as transaction logs, customer
feedback, social media interactions, and IoT devices in stores.
2. Data Storage: The collected data is stored in a distributed storage system like HDFS.
3. Data Processing: Using Hadoop MapReduce, the company processes large batches of
transaction data to identify purchasing patterns.
4. Data Analysis: Applying machine learning algorithms, the company predicts customer
buying behaviors and identifies potential new product recommendations.
5. Data Visualization: The insights are visualized using tools like Tableau to create
dashboards that display sales trends, customer segments, and product performance.

Q.5 (c) Define the term “Information Gain”. Explain the steps of the ID3
Algorithm for generating Decision Tree. [07]
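The worked answer (entropy tables and the resulting tree) is not reproduced in the text. For reference, information gain is the expected reduction in entropy obtained by splitting the training set S on attribute A:
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) · Entropy(Sv).
ID3 repeatedly (1) computes the gain of each remaining attribute, (2) splits on the attribute with the highest gain, and (3) recurses on each branch until all tuples in a node belong to one class or no attributes remain. A minimal sketch of the gain computation on made-up data:

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def information_gain(rows, labels, attr):
    """Expected entropy reduction from splitting the rows on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical training tuples; NOT the exam's data set.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))   # 1.0: the attribute splits the classes perfectly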

OR Q.5 (a) Define: 1) Data Node 2) Name Node 3) Text mining [03]
1) Data Node
Definition: In the context of Hadoop, a Data Node is a component of the Hadoop Distributed File System (HDFS)
responsible for storing the actual data. Each Data Node manages the storage attached to it and performs read
and write operations upon request from the Name Node.
Key Points:
∙ Function: Stores blocks of data and serves read/write requests from clients.
∙ Communication: Regularly communicates with the Name Node to report the status of data blocks it
stores.
∙ Fault Tolerance: Replicates data blocks to ensure fault tolerance and high availability.

2) Name Node
Definition: The Name Node is a critical component of HDFS that manages the metadata of the file system. It
maintains the directory tree of all files and tracks the locations of the data blocks across the Data Nodes.
Key Points:
∙ Function: Keeps track of the structure of the file system and the mapping of data blocks to Data Nodes.
∙ Responsibility: Handles namespace operations like opening, closing, renaming files and directories, and
determining the mapping of blocks to Data Nodes.
∙ High Availability: Can be configured for high availability with a secondary Name Node or a standby Name
Node to avoid a single point of failure.

3) Text Mining
Definition: Text mining, also known as text data mining or text analytics, refers to the process of deriving
meaningful information and patterns from unstructured text data. It involves techniques from natural language
processing (NLP), machine learning, and information retrieval.
Key Points:
∙ Goal: Extract useful insights, identify patterns, and transform text into structured data for further
analysis.
∙ Applications: Sentiment analysis, topic modeling, document classification, information extraction, and
summarization.
∙ Techniques: Includes processes like tokenization, stemming, lemmatization, named entity recognition,
and sentiment analysis.
∙ Tools and Libraries: Common tools and libraries for text mining include NLTK, SpaCy, Apache OpenNLP,
and commercial software like IBM Watson and Google Cloud Natural Language API.
These definitions provide a high-level overview of essential components in data processing frameworks like
Hadoop (Data Node and Name Node) and a key concept in data analysis (Text Mining).

OR Q.5 (b) Explain partitioning and hierarchical methods of clustering. [04]


⮚ Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It can
be defined as "A way of grouping the data points into different clusters, consisting of similar data
points. The objects with the possible similarities remain in a group that has less or no similarities with
another group."

⮚ Partitioning Clustering :

∙ It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning clustering is the K-Means Clustering
algorithm.
∙ In this type, the dataset is divided into a set of k groups, where k is used to define the number of
predefined groups. The cluster center is created in such a way that the distance between the data points
of one cluster is minimum as compared to the distance to another cluster centroid.

Key Characteristics:
∙ Flat Structure: The result is a flat set of clusters, without any inherent hierarchy.
∙ Predefined Number of Clusters: The number of clusters k is often specified in advance.
∙ Optimization Objective: Commonly aims to minimize within-cluster variance or maximize
between-cluster variance.

⮚ Hierarchical Clustering :

∙ Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no
requirement of pre-specifying the number of clusters to be created. In this technique, the dataset is
divided into clusters to create a tree-like structure, which is also called a dendrogram. The observations
or any number of clusters can be selected by cutting the tree at the correct level. The most common
example of this method is the Agglomerative Hierarchical algorithm.

Key Characteristics:

∙ Tree Structure: Produces a hierarchical tree of clusters.


∙ No Predefined Number of Clusters: The number of clusters can be determined by cutting the
dendrogram at the desired level.
∙ Distance Metrics: Utilizes various distance metrics to measure the similarity between clusters.
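A brief sketch of both families using scikit-learn (assumed to be installed); the six 2-D points are made up for illustration:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

X = np.array([[1, 2], [1, 4], [2, 3],      # one loose group
              [8, 8], [9, 9], [8, 10]])    # another loose group

# Partitioning method: K-Means, with k fixed in advance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("K-Means labels:      ", kmeans.labels_)

# Hierarchical method: agglomerative clustering merges clusters bottom-up;
# asking for n_clusters=2 corresponds to cutting the dendrogram at that level.
agglo = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
print("Agglomerative labels:", agglo.labels_)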

OR Q.5 (c) What is distributed file system? Explain HDFS architecture in detail. [07]

⮚ A distributed file system (DFS) is a type of file system that manages files and directories spread across
multiple physical locations (typically on different servers or nodes) and makes them accessible as if they
were on a single local file system. The main purpose of a DFS is to provide a way to store, access, and
manage data efficiently across a network, allowing multiple users and applications to access and share
files concurrently.

⮚ HDFS :
o NameNode(Master)
o DataNode(Slave)

∙ NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (Slaves).
The NameNode is mainly used for storing the metadata, i.e., the data about the data. Metadata can be the
transaction logs that keep track of the user's activity in the Hadoop cluster.
∙ Metadata can also be the name of a file, its size, and the information about the location (block number,
block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster
communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
∙ DataNode: DataNodes work as Slaves. DataNodes are mainly utilized for storing the data in a Hadoop
cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are,
the more data the Hadoop cluster can store. So it is advised that a DataNode should have a high storage
capacity to store a large number of file blocks.

⮚ High Level Architecture Of Hadoop :

∙ File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into
multiple blocks of size 128MB, which is the default, and you can also change it manually.

∙ Let's understand this concept of breaking a file into blocks with an example. Suppose you upload a file of
400MB to HDFS; this file gets divided into blocks of 128MB + 128MB + 128MB + 16MB = 400MB, meaning
4 blocks are created, each of 128MB except the last one. Hadoop doesn't know or care about what data is
stored in these blocks, so it considers the final file block as a partial record, as it has no idea regarding
its contents. In the Linux file system, the size of a file block is about 4KB, which is much less than the
default size of file blocks in the Hadoop file system. As we all know, Hadoop is mainly configured for
storing large-scale data, up to petabytes; this is what makes the Hadoop file system different from other
file systems, as it can be scaled. Nowadays, file blocks of 128MB to 256MB are used in Hadoop.

∙ Replication in HDFS: Replication ensures the availability of the data. Replication means making a copy of
something, and the number of times you make a copy of that particular thing is expressed as its
Replication Factor. As we have seen with file blocks, HDFS stores the data in the form of blocks, and
Hadoop is also configured to make copies of those file blocks.
∙ By default, the Replication Factor for Hadoop is set to 3, which can be configured, meaning you can
change it manually as per your requirement. In the above example we created 4 file blocks, which means
that 3 replicas or copies of each file block are made, i.e., a total of 4 × 3 = 12 blocks are stored for backup
purposes (a short arithmetic sketch of this follows the list below).
∙ This is because for running Hadoop we are using commodity hardware (inexpensive system hardware)
which can be crashed at any time. We are not using the supercomputer for our Hadoop setup. That is
why we need such a feature in HDFS which can make copies of that file blocks for backup purposes, this
is known as fault tolerance.
∙ One thing we also need to notice is that after making so many replicas of our file blocks we use much
more storage, but for large organizations the data is far more important than the storage, so nobody
minds this extra storage. You can configure the replication factor in your hdfs-site.xml file.
∙ Rack Awareness: A rack is nothing but the physical collection of nodes in our Hadoop cluster (maybe 30
to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the
NameNode chooses the closest DataNode to achieve maximum performance while performing read/write
operations, which reduces the network traffic.
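A tiny sketch of the block and replica arithmetic used in the example above (default block size 128 MB, replication factor 3):

import math

block_size_mb, replication = 128, 3
file_size_mb = 400

blocks = math.ceil(file_size_mb / block_size_mb)    # 4 blocks: 128 + 128 + 128 + 16 MB
stored = blocks * replication                       # 4 x 3 = 12 physical blocks on DataNodes
print(f"{blocks} logical blocks, {stored} replicated blocks stored")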

⮚ HDFS Architecture :

----------------------------------------------------------------------------------------------------------------------------------------------------------
