DMBI Summer 23
Subject Code : 3163209        Date : 18/07/2023
Subject Name : Data Mining and Business Intelligence        Total Marks : 70
Q:1
Q.1 (a) Define: Data Mart, Enterprise Warehouse & Virtual Warehouse. [03]
⮚ Data Mart :
∙ Definition: A Data Mart is a subset of a data warehouse that is focused on a specific area, department, or
function within an organization. It is designed to serve the needs of a particular group of users by
providing them with tailored data and reports.
∙ Key Characteristics:
o Scope: Limited to specific business areas (e.g., sales, finance, marketing).
o Purpose: Provides targeted data to meet the needs of a specific group of users.
o Size: Smaller in size compared to a full data warehouse.
o Data Sources: Often derived from a larger data warehouse, but can also source data directly from
transactional systems.
∙ Example: A retail company might have separate data marts for sales, inventory, and customer service,
each containing data relevant to its respective department.
⮚ Enterprise Warehouse :
∙ Definition: An Enterprise Warehouse is a comprehensive and centralized repository that aggregates data
from across the entire organization. It provides a unified view of enterprise-wide data, facilitating
comprehensive analysis and reporting.
∙ Key Characteristics:
o Scope: Enterprise-wide, covering all business functions and departments.
o Purpose: Provides a holistic view of the organization’s data to support strategic decision-making.
o Size: Large, encompassing all relevant data across the organization.
o Data Integration: Integrates data from various sources, including internal systems (ERP, CRM,
etc.) and external data.
∙ Example: A multinational corporation might use an enterprise warehouse to consolidate data from its
global operations, providing executives with insights into overall business performance.
⮚ Virtual Warehouse :
∙ Definition: A Virtual Warehouse is a logical data warehouse that integrates data from multiple sources
without physically storing the data in a centralized repository. Instead, it uses a virtual integration layer
to provide a unified view of the data.
∙ Key Characteristics:
o Scope: Can cover specific areas (like a data mart) or be enterprise-wide (like an enterprise
warehouse).
o Purpose: Provides a consolidated view of data without the need for physical data storage,
enabling real-time access and analysis.
o Data Integration: Utilizes data virtualization or federation techniques to integrate data from
various sources in real-time.
o Physical Storage: Does not require separate physical storage for the integrated data; instead, it
uses pointers to the original data sources.
∙ Example: An organization might use a virtual warehouse to integrate data from various cloud services
and on-premises databases, allowing users to query and analyze data as if it were stored in a single
repository.
Q.1 (b) Justify: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data." [04]
⮚ A data warehouse is defined as a subject-oriented, integrated, time-variant, and nonvolatile collection of data. Here's a justification for each of these characteristics:
1. Subject-Oriented
Justification:
∙ Focus on Business Subjects: A data warehouse is organized around the key subjects or areas of an
organization, such as customers, sales, products, or finance, rather than being organized around the
applications or functions.
∙ Purpose-Driven Data: This organization allows for better analysis and decision-making as the data is
structured to provide information on specific business aspects, aiding in strategic planning and
performance analysis.
2. Integrated
Justification:
∙ Data Consolidation: Data in a data warehouse is integrated from multiple, heterogeneous sources
such as operational databases, external sources, and flat files.
∙ Consistent Data Format: Integration involves consolidating data into a consistent format, resolving
issues related to naming conventions, data types, and coding structures. This ensures that data from
different sources is uniform and can be used cohesively.
3. Time-Variant
Justification:
∙ Historical Data Storage: A data warehouse maintains historical data, often spanning several years. This historical perspective is crucial for trend analysis, forecasting, and time-based comparisons.
∙ Timestamped Data: Data in a data warehouse is associated with time dimensions, such as daily, weekly, monthly, or yearly time periods. This allows users to track changes and analyze data over time.
4. Nonvolatile
Justification:
∙ Stable Data: Once data is entered into the data warehouse, it is not updated or deleted but is instead appended with new data. This stability is essential for consistent reporting and analysis.
∙ Read-Only Operations: Data warehouses are optimized for read operations, such as querying and reporting, rather than transactional operations. This nonvolatility ensures that historical data remains unchanged, providing a reliable basis for analysis.
Summary
∙ Subject-Oriented: Data warehouses are designed to focus on specific subjects or business areas,
facilitating relevant data analysis and reporting.
∙ Integrated: They consolidate data from multiple sources into a cohesive and consistent format, ensuring
uniformity and completeness.
∙ Time-Variant: They store historical data with time stamps, allowing for comprehensive trend analysis
and temporal comparisons.
∙ Nonvolatile: Data remains stable over time, enabling reliable and consistent reporting and analysis
without the need for frequent updates or deletions.
These characteristics collectively ensure that a data warehouse serves as a robust foundation for business
intelligence, providing accurate, comprehensive, and reliable data for informed decision-making.
Q.1 (c) What are different sources of information? Explain the term data, information and
knowledge with suitable example. [07]
⮚ Data mining involves extracting useful information from various types of data sources. Here are different
sources of information commonly used in data mining:
1. Flat Files
∙ Definition: Simple text files that contain data without structured relationships.
∙ Usage: Often used for data storage, transfer, and simple datasets.
∙ Example: CSV (Comma-Separated Values) files, log files.
2. Relational Databases
∙ Definition: Databases that use a structured query language (SQL) to manage and query data organized
into tables with rows and columns.
∙ Usage: Commonly used in business applications for structured data storage and retrieval.
∙ Example: MySQL, Oracle Database, Microsoft SQL Server.
3. Data Warehouse
∙ Definition: Centralized repositories that store integrated data from multiple sources, optimized for
querying and analysis.
∙ Usage: Used for business intelligence, reporting, and data analysis.
∙ Example: Amazon Redshift, Google BigQuery, Teradata.
4. Transactional Database
∙ Definition: Databases that manage real-time transaction data, ensuring data integrity through ACID
(Atomicity, Consistency, Isolation, Durability) properties.
∙ Usage: Used in applications requiring real-time data processing, such as banking and e-commerce.
∙ Example: PostgreSQL, SQLite, IBM Db2.
5. Multimedia Database
∙ Definition: Databases that store and manage multimedia data such as images, audio, and video.
∙ Usage: Used in applications involving digital media, like media libraries and social media platforms.
∙ Example: Oracle Multimedia, IBM DB2 Content Manager.
6. Spatial Database
∙ Definition: Databases optimized to store and query spatial data like maps, geographic information
systems (GIS).
∙ Usage: Used in applications involving location-based services and geographic mapping.
∙ Example: PostGIS (an extension of PostgreSQL), Oracle Spatial.
7. Time-Series Database
∙ Definition: Databases designed to handle time-series data, which consists of sequences of data points
indexed by time.
∙ Usage: Used for applications involving time-stamped data like financial markets and sensor data.
∙ Example: InfluxDB, TimescaleDB.
9. Time-Series Data
∙ Definition: Sequences of data points collected or recorded at specific time intervals.
∙ Usage: Used for analyzing trends, seasonal patterns, and forecasting.
∙ Example: Stock prices, weather data.
Data :
∙ Data are any facts, numbers, or text that can be processed by a computer.
∙ Organizations are accumulating vast and growing amounts of data in different formats.
∙ Examples:
o Operational data : sales, cost, inventory, payroll
o Non-operational data : forecast data, macroeconomic data
o Meta data : data about the data itself
Information :
∙ This is what you get when you process and organize data. It provides context and meaning to the raw data.
∙ The patterns, associations, or relationships among all this data can provide information.
∙ Example:
o Analysis of retail point-of-sale transaction data can yield information about which products are selling.
Knowledge :
∙ It is what you get when you analyze information and apply your experience and understanding.
∙ Information can be converted into knowledge about historical patterns and future trends.
∙ Example:
o Summary information about retail supermarket sales can be used by the retailer to determine which items are most suitable for promotional efforts.
----------------------------------------------------------------------------------------------------------------------------------------------------------
Q:2
Q.2 (a) Differentiate fact table and dimension table. [03]
Feature | Fact Table | Dimension Table
Definition | Contains quantitative data for analysis, often numeric | Contains descriptive attributes related to data in the fact table
Purpose | Stores measurable data and business metrics | Provides context to facts, enabling data slicing and dicing
Example Entries | Sales transactions, order quantities, revenue | Product names, customer addresses, time periods
Data Type | Mostly numeric and quantitative | Mostly textual and descriptive attributes
Storage Size | Can be very large due to high volume of transactions | Generally smaller compared to fact tables
Q.2 (b) Define noise. Explain binning methods for data smoothing. [04]
⮚ Noise in data mining :
∙ Noisy data are data with a large amount of additional meaningless information called noise.
∙ This includes data corruption, and also any data that a user system cannot understand and interpret correctly.
∙ Noisy data are data that are corrupted, distorted, or have a low signal-to-noise ratio.
Data = True Signal + Noise
∙ Noisy data unnecessarily increase the amount of storage space required and can adversely affect any data mining analysis results.
⮚ Data Smoothing : Removing noise from a data set is termed data smoothing.
∙ The following ways can be used for smoothing :
o Binning, Regression, Clustering, Outlier analysis
⮚ Binning (Bucketing) :
∙ Binning is a technique where we sort the data and then partition the data into equal-frequency bins.
∙ Then you may replace the noisy data with the bin mean, bin median, or bin boundaries.
∙ Smoothing by bin mean method :
o Values in a bin are replaced by the mean value of the bin.
Sorted data : 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Bin 1 : [5, 10, 11, 13] → bin mean 9.75
Bin 2 : [15, 35, 50, 55] → bin mean 38.75
Bin 3 : [72, 92, 204, 215] → bin mean 145.75
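A minimal Python sketch of equal-frequency binning with smoothing by bin means, using the same twelve values as above:

```python
# Minimal sketch: equal-frequency binning with smoothing by bin means.
data = sorted([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
n_bins = 3
size = len(data) // n_bins  # 4 values per bin (equal frequency)

smoothed = []
for i in range(0, len(data), size):
    bin_values = data[i:i + size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))  # replace each value by its bin mean

print(smoothed)
# [9.75, 9.75, 9.75, 9.75, 38.75, 38.75, 38.75, 38.75, 145.75, 145.75, 145.75, 145.75]
```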
Q.2 (c) Explain different OLAP operation with example. [07]
⮚ OLAP Operations :
1) Roll-up :
∙ The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
∙ Roll-up is like zooming out on the data cube.
∙ The figure shows the result of a roll-up operation performed on the dimension location. The concept hierarchy for location is defined as the order street < city < province or state < country.
∙ The roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the level of country.
2) Drill-down :
∙ The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions.
∙ The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy defined as day, month, quarter, and year. Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month.
3) Slice :
∙ A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For example, a slice operation is performed when the user wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one dimension of the given cube, thus resulting in a subcube.
4) Dice :
∙ The dice operation defines a subcube by performing a selection on two or more dimensions.
5) Pivot :
∙ The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in view to provide an alternative presentation of the data. It may involve swapping the rows and columns, or moving one of the row dimensions into the column dimensions.
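The operations above can be imitated on a small, hypothetical sales table with pandas; this is a sketch for intuition only, since a real OLAP server works on pre-aggregated cubes:

```python
import pandas as pd

# Hypothetical sales data with dimensions (country, city, quarter, item) and measure (sales).
df = pd.DataFrame({
    "country": ["India", "India", "Canada", "Canada"],
    "city":    ["Mumbai", "Delhi", "Toronto", "Vancouver"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item":    ["Phone", "Laptop", "Phone", "Laptop"],
    "sales":   [100, 150, 80, 120],
})

roll_up    = df.groupby(["country", "quarter"])["sales"].sum()            # climb city -> country
drill_down = df.groupby(["country", "city", "quarter"])["sales"].sum()    # step to a more detailed level
slice_q1   = df[df["quarter"] == "Q1"]                                    # fix a single dimension value
dice       = df[(df["quarter"] == "Q1") & (df["item"] == "Phone")]        # select on two or more dimensions
pivot      = df.pivot_table(values="sales", index="city", columns="item", aggfunc="sum")  # rotate the axes
```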
OR Q.2 (c) What is Data Mining? Explain Data mining as one step of Knowledge Discovery
Process. [07]
⮚ Data Mining :
∙ Data mining is concerned with solving problems by analyzing existing data.
∙ “Extraction of interesting ( nontrivial, implicit, previously unknown and potentially useful) information
or patterns from huge amount of data.”
∙ Data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
∙ With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and interpretation of such data and for the extraction of interesting knowledge that could help in decision making.
∙ Actually data mining is a part of KDD.
∙ The knowledge discovery in databases (KDD) process comprises a few steps leading from raw data collection to some form of new knowledge.
∙ Data cleaning : the phase in which noisy data and irrelevant data are removed from the collection.
∙ Data integration : multiple data sources, often heterogeneous, may be combined into a common source.
∙ Data selection : the data relevant to the analysis is decided on and retrieved from the data collection.
∙ Data transformation : also known as data consolidation; the phase in which the selected data is transformed into forms appropriate for the mining procedure.
∙ Data mining : the crucial step in which clever techniques are applied to extract patterns of potentially useful information.
∙ Pattern evaluation : strictly interesting patterns representing knowledge are identified based on given measures.
∙ Knowledge representation : the final phase in which the discovered knowledge is visually represented to the user.
o This essential step uses visualization techniques to help users understand and interpret the data mining results.
∙ It is common to combine some of these steps; for instance, data cleaning and data integration can be performed together as a preprocessing phase to generate a data warehouse.
∙ Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of a data warehouse, the selection is done on transformed data.
∙ KDD is an iterative process: once the discovered knowledge is presented to the user, the evaluation measures can be enhanced and the mining can be further refined.
----------------------------------------------------------------------------------------------------------------------------------------------------------
Q:3
Q.3 (a) List and describe the methods for handling the missing values in data cleaning. [03]
⮚ Data Cleaning :
o Data cleaning is the process of correcting or removing inaccurate, corrupted, improperly formatted, duplicated, or incomplete data from a dataset.
a. Mean Imputation
o Description: Replace missing values with the mean of the observed values for that variable.
o Usage: Commonly used for continuous numerical data.
b. Median Imputation
o Description: Replace missing values with the median of the observed values for that variable.
o Usage: Useful for skewed distributions or when data contains outliers.
c. Mode Imputation
o Description: Replace missing values with the mode (most frequent value) of the observed values for that
variable.
o Usage: Commonly used for categorical data.
d. Predictive Imputation
o Description: Use statistical models or machine learning algorithms to predict and replace missing values.
o Usage: Suitable for both continuous and categorical data, particularly when relationships between
variables can be leveraged.
∙ Example:
o Mean Imputation: For a dataset with missing ages, calculate the mean age of the available data
and replace missing values with this mean.
o Median Imputation: For a dataset with skewed income data, use the median income to replace
missing values.
o Mode Imputation: For a dataset with missing gender values, replace missing values with the most
frequently occurring gender.
o Predictive Imputation: Use a regression model to predict and fill in missing values based on other
variables.
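A minimal pandas sketch of the imputation methods above, using a hypothetical dataset with missing age, income, and gender values:

```python
import pandas as pd

# Hypothetical dataset with missing values (None becomes NaN).
df = pd.DataFrame({
    "age":    [25, 30, None, 45, None],
    "income": [30000, None, 52000, 61000, 58000],
    "gender": ["F", "M", None, "M", "F"],
})

df["age"]    = df["age"].fillna(df["age"].mean())          # mean imputation
df["income"] = df["income"].fillna(df["income"].median())  # median imputation (robust to skew)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0]) # mode imputation for categorical data
print(df)
```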
Q.3 (b) What is market basket analysis? Explain the two measures of rule interestingness:
support and confidence. [04]
⮚ Market basket analysis is a data mining technique used to uncover associations between items in large
datasets, typically transactional data from retail settings. This method helps identify patterns of items
frequently purchased together, enabling businesses to optimize product placements, promotions, and
13
inventory management. The analysis is commonly used to understand customer purchasing behavior and
to drive marketing strategies.
⮚ Two key measures used in market basket analysis to evaluate the "interestingness" of discovered rules
are support and confidence:
1. Support :
o Support measures how frequently the items in the rule appear together in the dataset. It is defined as
the proportion of transactions in the dataset that contain all the items in the rule.
o For an association rule A → B (where A and B are itemsets), support is given by:
Support(A → B) = (number of transactions containing both A and B) / (total number of transactions)
o Support is a useful measure because it helps identify how common a particular combination of items is
within the entire dataset. Higher support values indicate more frequent combinations, which can be
more significant for business decisions.
2. Confidence :
o Confidence measures the likelihood that the items in the consequent (right-hand side) of the rule are purchased when the items in the antecedent (left-hand side) are purchased. It is defined as the proportion of transactions containing the antecedent that also contain the consequent.
o For an association rule A → B, confidence is given by:
Confidence(A → B) = Support(A → B) / Support(A) = (number of transactions containing both A and B) / (number of transactions containing A)
o In other words, confidence is a conditional probability: it tells us how often items in B appear in transactions that contain A. Higher confidence values indicate a stronger association between the antecedent and consequent, suggesting that the occurrence of A strongly implies the occurrence of B.
o Example to Illustrate :
o This means that 33% of all transactions contain both bread and milk, and if a customer buys bread, there
is a 50% chance they will also buy milk.
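A minimal sketch of computing support and confidence for the rule bread → milk, using hypothetical transactions chosen to match the 33% / 50% figures quoted above:

```python
# Hypothetical transactions; the rule of interest is bread -> milk.
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
    {"bread"}, {"milk"}, {"eggs"},
]
n = len(transactions)
count_bread = sum(1 for t in transactions if "bread" in t)             # 4 transactions contain bread
count_both  = sum(1 for t in transactions if {"bread", "milk"} <= t)   # 2 contain bread and milk

support    = count_both / n            # 2/6 ≈ 0.33
confidence = count_both / count_bread  # 2/4 = 0.50
print(round(support, 2), round(confidence, 2))
```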
Q.3 (c) State the Apriori Property. Generate large itemsets and association rules using Apriori
algorithm on the following data set with minimum support value and minimum confidence
value set as 50% and 75% respectively. [07]
⮚ The Apriori property, also known as the downward closure property, is a fundamental principle used in the Apriori algorithm for mining
frequent itemsets and association rules in large databases. The Apriori property states:
"If an itemset is frequent, then all of its subsets must also be frequent."
⮚ In other words, for an itemset to be considered frequent, every subset of that itemset must appear in the database at least as frequently as
the itemset itself. Conversely, if an itemset is infrequent, then all of its supersets (itemsets that contain the original itemset) are also
infrequent.
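A minimal sketch of how the Apriori algorithm iterates level by level with a 50% minimum support, using hypothetical transactions rather than the exam's table:

```python
from itertools import combinations

# Hypothetical transactions (not the exam's dataset); minimum support = 50%.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
min_support = 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 2
while frequent:
    print(f"L{k-1}:", [sorted(f) for f in frequent])
    # Candidate generation, then pruning by the Apriori property:
    # a k-itemset can be frequent only if every (k-1)-subset of it is frequent.
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1
```

Association rules are then generated from each frequent itemset and kept only if their confidence (support of the whole itemset divided by support of the antecedent) meets the 75% minimum confidence threshold.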
OR Q.3 (a) Explain click-stream analysis using data mining. [03]
1. Data Collection:
∙ Logs: Web servers generate logs containing details of user interactions.
∙ Cookies: Small pieces of data stored on the user's device to track visits and activities.
∙ Tags: Embedded scripts (e.g., JavaScript) on web pages to capture user actions in real-time.
∙ Tools: Analytics platforms like Google Analytics, Adobe Analytics, and specialized software for click-stream data collection.
2. Data Preprocessing:
∙ Cleaning: Removing irrelevant data, dealing with missing values, and filtering out bot traffic.
∙ Sessionization: Dividing the click-stream into distinct user sessions based on predefined inactivity thresholds.
∙ Transformation: Converting raw data into a structured format suitable for analysis (e.g., time
spent on pages, sequence of page views).
3. Data Mining Techniques:
∙ Pattern Recognition: Identifying common navigation paths and frequent sequences of actions.
∙ Association Rule Mining: Discovering relationships between pages or actions (e.g., users who view page A often proceed to page B).
∙ Clustering: Grouping users based on similar browsing behaviors to identify different user
segments.
∙ Classification: Assigning predefined categories to users based on their behavior (e.g., new
visitors, returning customers, high-value customers).
∙ Sequential Pattern Mining: Finding patterns in the order of user actions (e.g., a typical purchase
funnel).
2. Website Optimization:
∙ Improving site navigation and user interface based on user interaction data.
∙ Identifying and resolving bottlenecks or drop-off points in the user journey.
3. Personalization:
∙ Tailoring content, recommendations, and marketing messages to individual users or user
segments.
∙ Enhancing the user experience by presenting relevant information and offers.
5. Fraud Detection:
∙ Monitoring unusual patterns of behavior that may indicate fraudulent activities.
∙ Implementing security measures to protect against fraud.
Click-stream analysis using data mining involves collecting and analyzing user navigation data
to gain insights into user behavior, optimize websites, personalize user experiences, and
improve conversion rates. By leveraging data mining techniques, businesses can make
data-driven decisions to enhance their online presence and achieve their objectives.
However, it also requires addressing challenges related to data volume, privacy, and
integration to be effective.
OR Q.3 (b) Explain the Min-max data normalization method with suitable
example. [04]
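As a minimal sketch of the standard method: min-max normalization linearly rescales a value v of attribute A from its original range [min_A, max_A] into a new range [new_min, new_max] using v' = (v - min_A) / (max_A - min_A) × (new_max - new_min) + new_min. A small illustration with hypothetical income values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

# Hypothetical incomes mapped to [0, 1]; e.g. 73600 -> (73600 - 12000) / (98000 - 12000) ≈ 0.716
print(min_max_normalize([12000, 14000, 73600, 98000]))
```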
OR Q.3 (c) Explain three-tier Data Warehouse Architecture. [07]
⮚ Three-tier Data Warehouse Architecture :
1. Bottom Tier :
o The bottom tier is the data warehouse database server, which is almost always a relational database system.
o Back-end tools and utilities feed data into this tier from operational databases and external sources (extraction, cleaning, transformation, loading, and refreshing).
o This tier also handles query processing, materialized views, and data partitioning.
2. Middle Tier :
o The middle tier is an OLAP server, typically implemented as one of the following:
o Relational OLAP (ROLAP) : an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
o Multidimensional OLAP (MOLAP) : a special-purpose server that directly implements multidimensional data and operations.
o Hybrid OLAP (HOLAP) : combines ROLAP and MOLAP to give the user greater flexibility.
o The OLAP server operates on data organized by schemas such as the star and snowflake schemas.
3. Top Tier :
o The top tier is the front-end client layer.
o It contains query and reporting tools, analysis tools, and data mining tools.
o Client queries are submitted through this layer, which acts as the interface to the middle-tier OLAP server.
----------------------------------------------------------------------------------------------------------------------------------------------------------
Q:4
Q.4 (a) Discuss following terms.
1) Supervised learning 2) Correlation analysis 3) Tree pruning [03]
1) Supervised Learning
Definition: Supervised learning is a type of machine learning where the model is trained on labeled data. This
means that each training example is paired with an output label. The goal is for the model to learn the mapping
from inputs to outputs so it can accurately predict the labels for new, unseen data.
Key Points:
∙ Training Process: The algorithm learns from a training dataset that includes input-output pairs.
∙ Applications: Commonly used in tasks like classification (e.g., spam detection, image recognition) and regression (e.g., predicting house prices).
∙ Examples: Linear regression, logistic regression, decision trees, support vector machines, and neural
networks.
2) Correlation Analysis
Definition: Correlation analysis is a statistical method used to assess the strength and direction of the linear
relationship between two numerical variables. It quantifies how changes in one variable are associated with
changes in another variable.
Key Points:
∙ Correlation Coefficient: The most common measure is the Pearson correlation coefficient, which ranges
from -1 to 1. A value close to 1 implies a strong positive correlation, -1 implies a strong negative
correlation, and 0 implies no correlation.
∙ Interpretation: Helps in identifying and quantifying the relationship between variables, but it does not
imply causation.
∙ Applications: Widely used in exploratory data analysis to identify potential relationships between
variables.
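A minimal sketch of correlation analysis on hypothetical advertising-spend and sales figures, assuming NumPy is available:

```python
import numpy as np

# Hypothetical data: advertising spend vs. sales.
ads   = np.array([10, 20, 30, 40, 50])
sales = np.array([15, 25, 34, 48, 55])

r = np.corrcoef(ads, sales)[0, 1]  # Pearson correlation coefficient, in [-1, 1]
print(round(r, 3))                 # close to +1 -> strong positive linear relationship
```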
3) Tree Pruning
Definition: Tree pruning is a technique used in decision tree algorithms to remove sections of the tree that
provide little to no power in predicting target variables. The goal is to reduce the complexity of the model and
enhance its generalization to new data.
Key Points:
∙ Purpose: Prevents overfitting by trimming nodes that add little predictive value.
∙ Methods:
∙ Pre-pruning (early stopping): Stops the growth of the tree early based on certain criteria (e.g.,
maximum tree depth, minimum number of samples required to split a node).
∙ Post-pruning: Removes branches from a fully grown tree that have little importance. This can be
done using techniques such as cost complexity pruning or reduced error pruning.
∙ Applications: Used to improve the performance and interpretability of decision tree models in
classification and regression tasks.
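A minimal sketch of both pruning styles using scikit-learn (assuming it is installed): ccp_alpha applies cost-complexity post-pruning, while max_depth and min_samples_split act as pre-pruning criteria.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

full   = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X_train, y_train)  # post-pruning
# Pre-pruning alternative: DecisionTreeClassifier(max_depth=3, min_samples_split=10)

print("nodes:", full.tree_.node_count, "->", pruned.tree_.node_count)
print("test accuracy:", full.score(X_test, y_test), pruned.score(X_test, y_test))
```

Larger ccp_alpha values prune more aggressively; the value used here is illustrative only.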
These concepts are fundamental in the field of data mining and machine learning, each playing a crucial role in
building, analyzing, and optimizing models.
Q.4 (b) Differentiate Association vs. Classification. [04]
Feature | Association | Classification
Goal | To find frequent patterns, correlations, or associations among a set of items or variables. | To predict the category or class label of new observations based on training data.
Output | Association rules (e.g., "If A, then B"). | A model that can assign class labels to new data points.
Data Characteristics | Often works with large datasets with categorical variables. | Can work with various types of data including numerical, categorical, and text data.
Q.4 (c) Explain Bayes' Theorem and calculate Naïve Bayesian Classification for the given example: [07]
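Bayes' theorem states P(H|X) = P(X|H) · P(H) / P(X); the naïve Bayesian classifier additionally assumes the attributes are conditionally independent given the class. A minimal sketch with a hypothetical categorical dataset (not the exam's given table):

```python
import pandas as pd

# Hypothetical training data (illustration only).
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
    "windy":   ["no", "yes", "no", "no", "yes", "yes"],
    "play":    ["no", "no", "yes", "yes", "no", "yes"],
})

def naive_bayes(df, target, query):
    """Score each class by P(class) * product of P(attribute=value | class); pick the maximum."""
    scores = {}
    for cls, grp in df.groupby(target):
        score = len(grp) / len(df)                # prior P(class)
        for attr, value in query.items():
            score *= (grp[attr] == value).mean()  # likelihood P(value | class), no smoothing
        scores[cls] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(df, "play", {"outlook": "sunny", "windy": "yes"}))
```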
OR Q.4 (a) Draw the topology of a multilayer, feed-forward Neural Network. [03]
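In outline, a multilayer feed-forward network has an input layer, one or more hidden layers, and an output layer; every unit in one layer is connected to every unit in the next, and signals flow only forward (no cycles). A minimal NumPy sketch of a 3-4-2 topology (layer sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input layer (3 units) -> hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)   # hidden layer (4 units) -> output layer (2 units)

def forward(x):
    h = np.tanh(x @ W1 + b1)   # hidden-layer activations
    return h @ W2 + b2         # output layer; signals flow strictly input -> hidden -> output

print(forward(np.array([0.5, -1.0, 2.0])))
```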
OR Q.4 (b) Explain data mining application for fraud detection. [04]
Data mining is highly effective in detecting and preventing fraud by analyzing large datasets to uncover patterns,
anomalies, and relationships indicative of fraudulent activities. Here's an explanation of how data mining is
applied in fraud detection:
o Efficiency: Automates the detection process, allowing for real-time analysis and quicker response to
fraudulent activities.
o Accuracy: Provides high accuracy in identifying fraud by leveraging complex algorithms and large
datasets.
o Scalability: Can handle vast amounts of data and adapt to new fraud patterns as they emerge.
o Cost-Effectiveness: Reduces the costs associated with fraud by preventing losses and minimizing the need for extensive manual investigations.
OR Q.4 (c) Define linear and nonlinear regression using figures. Calculate the value of Y for
X=100 based on Linear regression prediction method. [07]
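A minimal sketch of least-squares linear regression prediction (Y = a + bX) at X = 100, using hypothetical (X, Y) pairs rather than the exam's table:

```python
import numpy as np

# Hypothetical data (illustration only).
x = np.array([10, 20, 30, 40, 50], dtype=float)
y = np.array([12, 22, 35, 44, 53], dtype=float)

# Least-squares estimates: slope b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept a = ȳ - b·x̄
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"Y = {a:.2f} + {b:.2f}·X")
print("prediction at X = 100:", a + b * 100)   # 2.00 + 1.04 * 100 = 106.0 for this hypothetical data
```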
----------------------------------------------------------------------------------------------------------------------------------------------------------
Q:5
Q.5 (a) Describe web mining using example. [03]
Web mining refers to the application of data mining techniques to extract valuable insights and knowledge from
web data, including web content, web structure, and web usage data. It can be broadly categorized into three
main areas: web content mining, web structure mining, and web usage mining.
1. Web Content Mining :
∙ Definition: Involves extracting useful information from the content of web pages, which includes text, images, videos, and audio.
∙ Example: A search engine like Google uses web content mining to index web pages and retrieve
relevant pages based on user queries. For instance, when a user searches for "best Italian
restaurants in New York," the search engine mines the web content to return a list of relevant
restaurants with descriptions, reviews, and ratings.
2. Web Structure Mining :
∙ Definition: Analyzes the structure of the web, which involves understanding the connections and relationships between different web pages, often represented as a graph.
∙ Example: PageRank, the algorithm used by Google, is a prime example of web structure mining. It
evaluates the importance of web pages based on the number and quality of links pointing to
them. For instance, a web page that is linked by many high-authority pages is deemed more
important and is ranked higher in search results.
3. Web Usage Mining :
∙ Definition: Focuses on analyzing user behavior through the data generated by web interactions, such as server logs, cookies, and clickstream data.
∙ Example: An e-commerce site like Amazon uses web usage mining to analyze customer behavior. By
tracking the sequence of pages a user visits, items they click on, and their purchase history,
Amazon can recommend products tailored to individual user preferences. For instance, if a user
frequently views and buys electronic gadgets, the site will suggest new gadgets that might
interest them.
Q.5 (b) What is Big Data? What is big data analytic? [04]
⮚ Big Data refers to extremely large and complex datasets that cannot be easily managed, processed, or
analyzed using traditional data processing tools. These datasets are characterized by the following five
V's:
1. Volume: The sheer amount of data generated, often measured in terabytes, petabytes, or even exabytes.
The scale of data is enormous, coming from various sources such as social media, sensors, transactions,
and more.
2. Velocity: The speed at which data is generated, collected, and processed. Big Data often involves real-time or near-real-time data streams, requiring rapid processing to derive timely insights.
3. Variety: The diverse types of data, including structured data (e.g., databases), semi-structured data (e.g.,
XML, JSON files), and unstructured data (e.g., text, images, videos).
4. Veracity: The quality, accuracy, and trustworthiness of the data, which can vary greatly across sources.
5. Value: The potential insights and benefits that can be derived from analyzing the data.
Big Data Analytics in data mining refers to the process of examining large and varied datasets to uncover hidden
patterns, correlations, trends, and other useful information that can aid in decision-making and strategic
planning. Big Data Analytics leverages advanced data mining techniques, algorithms, and tools specifically
designed to handle the volume, velocity, and variety of Big Data.
2. Data Processing:
∙ Batch Processing: Techniques like MapReduce and tools like Apache Hadoop are used to process
large batches of data.
∙ Stream Processing: Tools like Apache Kafka, Apache Flink, and Apache Storm handle real-time
data streams, allowing for immediate analysis and response.
3. Data Analysis:
∙ Descriptive Analytics: Summarizing historical data to understand what has happened in the past.
∙ Predictive Analytics: Using statistical models and machine learning algorithms to predict future trends and behaviors.
∙ Prescriptive Analytics: Recommending actions based on the analysis to achieve desired
outcomes.
4. Data Visualization:
∙ Tools and Techniques: Visualizing the results of Big Data Analytics using tools like Tableau, Power BI,
and D3.js to make the insights understandable and actionable.
Applications of Big Data Analytics :
1. Healthcare:
∙ Predictive Modeling: Predicting disease outbreaks and patient outcomes based on historical
health data.
∙ Personalized Medicine: Tailoring treatment plans based on individual genetic information and
medical history.
2. Retail:
∙ Customer Insights: Analyzing customer behavior to personalize shopping experiences and
improve customer satisfaction.
∙ Supply Chain Optimization: Improving inventory management and logistics through predictive
analytics.
3. Finance:
∙ Fraud Detection: Identifying fraudulent transactions in real-time using anomaly detection
techniques.
∙ Risk Management: Assessing and mitigating financial risks through predictive models.
4. Manufacturing:
∙ Predictive Maintenance: Predicting equipment failures before they occur to minimize downtime
and maintenance costs.
∙ Quality Control: Analyzing production data to improve product quality and reduce defects.
Example :
1. Data Collection: A retail company collects data from various sources such as transaction logs, customer feedback, social media interactions, and IoT devices in stores.
feedback, social media interactions, and IoT devices in stores.
2. Data Storage: The collected data is stored in a distributed storage system like HDFS.
3. Data Processing: Using Hadoop MapReduce, the company processes large batches of
transaction data to identify purchasing patterns.
4. Data Analysis: Applying machine learning algorithms, the company predicts customer
buying behaviors and identifies potential new product recommendations.
5. Data Visualization: The insights are visualized using tools like Tableau to create
dashboards that display sales trends, customer segments, and product performance.
Q.5 (c) Define the term “Information Gain”. Explain the steps of the ID3
Algorithm for generating Decision Tree. [07]
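As a reminder: information gain is the reduction in entropy obtained by partitioning a dataset S on an attribute A, Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v); ID3 repeatedly selects the attribute with the highest gain as the split node and recurses on each partition until the partitions are pure or no attributes remain. A minimal sketch with a hypothetical mini-dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)) over values v of attribute A."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical mini-dataset: (outlook, windy) -> play
rows   = [("sunny", "no"), ("sunny", "yes"), ("overcast", "no"), ("rain", "no")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # gain of splitting on outlook (= 1.0 here)
```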
OR Q.5 (a) Define: 1) Data Node 2) Name Node 3) Text mining [03]
1) Data Node
Definition: In the context of Hadoop, a Data Node is a component of the Hadoop Distributed File System (HDFS)
responsible for storing the actual data. Each Data Node manages the storage attached to it and performs read
and write operations upon request from the Name Node.
Key Points:
∙ Function: Stores blocks of data and serves read/write requests from clients.
∙ Communication: Regularly communicates with the Name Node to report the status of data blocks it
stores.
∙ Fault Tolerance: Replicates data blocks to ensure fault tolerance and high availability.
2) Name Node
Definition: The Name Node is a critical component of HDFS that manages the metadata of the file system. It
maintains the directory tree of all files and tracks the locations of the data blocks across the Data Nodes.
Key Points:
∙ Function: Keeps track of the structure of the file system and the mapping of data blocks to Data Nodes.
∙ Responsibility: Handles namespace operations like opening, closing, and renaming files and directories, and determining the mapping of blocks to Data Nodes.
∙ High Availability: Can be configured for high availability with a secondary Name Node or a standby Name
Node to avoid a single point of failure.
3) Text Mining
Definition: Text mining, also known as text data mining or text analytics, refers to the process of deriving
meaningful information and patterns from unstructured text data. It involves techniques from natural language
processing (NLP), machine learning, and information retrieval.
Key Points:
∙ Goal: Extract useful insights, identify patterns, and transform text into structured data for further
analysis.
∙ Applications: Sentiment analysis, topic modeling, document classification, information extraction, and
summarization.
∙ Techniques: Includes processes like tokenization, stemming, lemmatization, named entity recognition,
and sentiment analysis.
∙ Tools and Libraries: Common tools and libraries for text mining include NLTK, SpaCy, Apache OpenNLP,
and commercial software like IBM Watson and Google Cloud Natural Language API.
These definitions provide a high-level overview of essential components in data processing frameworks like
Hadoop (Data Node and Name Node) and a key concept in data analysis (Text Mining).
OR Q.5 (b) Explain partitioning and hierarchical clustering methods. [04]
⮚ Partitioning Clustering :
∙ It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning clustering is the K-Means Clustering
algorithm.
∙ In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points within one cluster is minimal compared to their distance to another cluster centroid.
Key Characteristics:
∙ Flat Structure: The result is a flat set of clusters, without any inherent hierarchy.
∙ Predefined Number of Clusters: The number of clusters k is often specified in advance.
∙ Optimization Objective: Commonly aims to minimize within-cluster variance or maximize between-cluster variance.
⮚ Hierarchical Clustering :
∙ Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no
requirement of pre-specifying the number of clusters to be created. In this technique, the dataset is
divided into clusters to create a tree-like structure, which is also called a dendrogram. The observations
or any number of clusters can be selected by cutting the tree at the correct level. The most common
example of this method is the Agglomerative Hierarchical algorithm.
Key Characteristics:
∙ Tree Structure: The result is a dendrogram (a tree of nested clusters) rather than a flat partition.
∙ No Predefined Number of Clusters: Clusters are obtained by cutting the dendrogram at the desired level.
∙ Approaches: Agglomerative (bottom-up merging) or divisive (top-down splitting).
OR Q.5 (c) What is distributed file system? Explain HDFS architecture in detail. [07]
⮚ A distributed file system (DFS) is a type of file system that manages files and directories spread across
multiple physical locations (typically on different servers or nodes) and makes them accessible as if they
were on a single local file system. The main purpose of a DFS is to provide a way to store, access, and
manage data efficiently across a network, allowing multiple users and applications to access and share
files concurrently.
⮚ HDFS has a master/slave architecture consisting of two main types of nodes :
o NameNode (Master)
o DataNode (Slave)
∙ NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e., the data about the data. Metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.
∙ Metadata can also be the name of a file, its size, and the information about the location (block number, block IDs) on the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
∙ DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store. It is therefore advised that DataNodes have a high storage capacity to hold a large number of file blocks.
∙ File Blocks in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB each, which is the default size and can also be changed manually.
∙ Let's understand this concept of breaking a file into blocks with an example. Suppose you have uploaded a file of 400 MB to HDFS; this file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB. This means 4 blocks are created, each of 128 MB except the last one. Hadoop doesn't know or care about what data is stored in these blocks, so it considers the final file block as a partial record, as it has no idea regarding the data. In the Linux file system, the size of a file block is about 4 KB, which is very much less than the default block size in the Hadoop file system. As we all know, Hadoop is mainly configured for storing large-scale data, which is in petabytes; this is what makes the Hadoop file system different from other file systems, as it can be scaled. Nowadays, file blocks of 128 MB to 256 MB are used in Hadoop.
∙ Replication in HDFS: Replication ensures the availability of the data. Replication means making a copy of something, and the number of times you make a copy of that particular thing is expressed as its Replication Factor. As we have seen with file blocks, HDFS stores the data in the form of various blocks, and at the same time Hadoop is also configured to make copies of those file blocks.
∙ By default, the Replication Factor in Hadoop is set to 3, which is configurable, meaning you can change it manually as per your requirement. In the above example we made 4 file blocks, which means that 3 replicas (copies) of each file block are made, so a total of 4 × 3 = 12 blocks are kept for backup purposes.
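A tiny sketch reproducing the block and replica arithmetic above (128 MB default block size, replication factor 3):

```python
file_size_mb  = 400
block_size_mb = 128   # HDFS default block size
replication   = 3     # HDFS default replication factor

full_blocks  = file_size_mb // block_size_mb                 # 3 full blocks of 128 MB
last_block   = file_size_mb - full_blocks * block_size_mb    # 16 MB partial block
total_blocks = full_blocks + (1 if last_block else 0)        # 4 blocks in total

print(total_blocks, "blocks,", total_blocks * replication, "replicas stored")  # 4 blocks, 12 replicas
```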
∙ This is because, for running Hadoop, we are using commodity hardware (inexpensive system hardware) which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why we need a feature in HDFS that can make copies of the file blocks for backup purposes; this is known as fault tolerance.
∙ One thing we also need to notice is that after making so many replicas of our file blocks we are using extra storage, but for large organizations the data is far more important than the storage, so nobody minds this extra storage. You can configure the Replication Factor in your hdfs-site.xml file.
∙ Rack Awareness: A rack is nothing but a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while performing read/write operations, which reduces network traffic.
⮚ HDFS Architecture :
----------------------------------------------------------------------------------------------------------------------------------------------------------