DWDM Set-2
UNIT-I
Ans. Because operational databases store huge amounts of data, you may wonder, “Why not
perform online analytical processing directly on such databases instead of spending additional time
and resources to construct a separate data warehouse?” A major reason for such a separation is to
help promote the high performance of both systems. An operational database is designed and tuned
for known tasks and workloads, such as indexing and hashing using primary keys, searching for
particular records, and optimizing “canned” queries. On the other hand, data warehouse queries are
often complex. They involve the computation of large data groups at summarized levels, and may
require the use of special data organization, access, and implementation methods based on
multidimensional views. Processing OLAP queries in operational databases would substantially
degrade the performance of operational tasks. Moreover, an operational database supports the
concurrent processing of multiple transactions. Concurrency control and recovery mechanisms (e.g.,
locking and logging) are required to ensure the consistency and robustness of transactions. An OLAP
query often needs read-only access to data records for summarization and aggregation. Concurrency
control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution
of concurrent transactions and thus substantially reduce the throughput of an OLTP system. Finally,
the separation of operational databases from data warehouses is based on the different structures,
contents, and uses of the data in these two systems.
b) Compare and contrast On-Line Analytical Processing with On-Line Transaction Processing.
Ans.
Ø Users: OLTP serves clerks, DBAs, and IT professionals; OLAP serves knowledge workers such as managers, executives, and analysts.
Ø Function: OLTP supports day-to-day operations; OLAP supports decision making and long-term informational requirements.
Ø Data: OLTP manages current, detailed, up-to-date data; OLAP manages historical, summarized, and consolidated data.
Ø Database design: OLTP adopts an entity-relationship (ER), application-oriented design; OLAP adopts a star or snowflake, subject-oriented design.
Ø Access patterns: OLTP consists of short, atomic read/write transactions; OLAP consists mostly of read-only, complex queries over large volumes of data.
Ø Size and metric: OLTP databases are typically measured in gigabytes and by transaction throughput; OLAP systems may reach terabytes and are measured by query throughput and response time.
Ans. Extraction, Transformation, and Loading: Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:
Ø Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Ø Data cleaning, which detects errors in the data and rectifies them when possible.
Ø Data transformation, which converts data from legacy or host format to warehouse format.
Ø Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
Ø Refresh, which propagates the updates from the data sources to the warehouse.
Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems usually provide a good set of data warehouse management tools.
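To make the ETL steps above concrete, here is a minimal, illustrative sketch in Python/pandas; the source tables, column names (sale_id, sale_date, amount), and values are invented for the example and are not part of any particular warehouse.

```python
# Minimal ETL sketch with pandas; the "sources", column names, and values are invented.
import pandas as pd

# Extraction: gather data from multiple, heterogeneous sources (here, two in-memory
# tables standing in for files or operational databases).
source1 = pd.DataFrame({"sale_id": [1, 2, 2],
                        "sale_date": ["2024-01-03", "2024-01-05", "2024-01-05"],
                        "amount": ["100.5", "250.0", "250.0"]})
source2 = pd.DataFrame({"sale_id": [3, None],
                        "sale_date": ["2024-02-01", "2024-02-02"],
                        "amount": ["75.0", "20.0"]})
staging = pd.concat([source1, source2], ignore_index=True)

# Cleaning: detect and remove errors (duplicate rows, rows with a missing key).
staging = staging.drop_duplicates().dropna(subset=["sale_id"])

# Transformation: convert the source (string/legacy) format to the warehouse format.
staging["sale_date"] = pd.to_datetime(staging["sale_date"])
staging["amount"] = staging["amount"].astype(float)

# Load: sort, summarize, and compute a simple monthly summary "view".
staging = staging.sort_values("sale_date")
monthly_view = staging.groupby(staging["sale_date"].dt.to_period("M"))["amount"].sum()
print(monthly_view)
```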
The querying of multidimensional databases can be based on a starnet model, which consists of
radial lines emanating from a central point, where each line represents a concept hierarchy for a
dimension. Each abstraction level in the hierarchy is called a footprint. These represent the
granularities available for use by OLAP operations such as drill-down and roll-up.
A starnet query model for the AllElectronics data warehouse is shown in Figure 4.13. This starnet
consists of four radial lines, representing concept hierarchies for the dimensions location, customer,
item, and time, respectively. Each line consists of footprints representing abstraction levels of the
dimension. For example, the time line has four footprints: “day,” “month,” “quarter,” and “year.” A
concept hierarchy may involve a single attribute (e.g., date for the time hierarchy) or several
attributes (e.g., the concept hierarchy for location involves the attributes street, city, province or
state, and country). In order to examine the item sales at AllElectronics, users can roll up along the
time dimension from month to quarter, or, say, drill down along the location dimension from country
to city. Concept hierarchies can be used to generalize data by replacing low-level values (such as
“day” for the time dimension) by higher-level abstractions (such as “year”), or to specialize data by
replacing higher-level abstractions with lower-level values.
Star Schema:
Ø Each dimension in a star schema is represented with only one dimension table.
Ø The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
Ø There is a fact table at the center. It contains the keys to each of four dimensions.
Ø The fact table also contains the attributes, namely dollars sold and units sold.
Ø Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
Characteristics of Star Schema:
Ø Every dimension in a star schema is represented with only one dimension table.
Ø The dimension table is joined to the fact table using a foreign key
Ø The Star schema is easy to understand and provides optimal disk usage.
Ø The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a Country lookup table as an OLTP design would have.
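As an illustration of how a star schema is queried, the following small pandas sketch joins a toy fact table to its location and item dimension tables through their foreign keys and then aggregates dollars_sold; the tiny tables and values are hypothetical.

```python
# Star-schema join sketch: a fact table keyed to dimension tables (toy data).
import pandas as pd

location = pd.DataFrame({"location_key": [1, 2],
                         "city": ["Vancouver", "Victoria"],
                         "province_or_state": ["British Columbia"] * 2,
                         "country": ["Canada"] * 2})
item = pd.DataFrame({"item_key": [10, 11], "item_name": ["Sony-TV", "Laptop"]})
sales = pd.DataFrame({"location_key": [1, 1, 2], "item_key": [10, 11, 10],
                      "dollars_sold": [400.0, 900.0, 350.0], "units_sold": [1, 1, 1]})

# Join the fact table to its dimensions via foreign keys, then aggregate the measure.
cube = sales.merge(location, on="location_key").merge(item, on="item_key")
print(cube.groupby(["country", "item_name"])["dollars_sold"].sum())
```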
Advantages:
Ø The star schema is the simplest and easiest schema to understand, and queries need fewer joins, which gives good query performance and optimal disk usage.
Fact Constellation Schema:
Ø A Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A fact constellation schema can be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.
A fact constellation schema is shown in the figure below.
Ø This schema defines two fact tables, sales, and shipping. Sales are treated along four dimensions, namely, time, item, branch, and
location.
Ø The schema contains a fact table for sales that includes keys to each of the four dimensions, along with two measures: Rupee_sold and
units_sold.
Ø The shipping table has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and to_location, and two measures:
Rupee_cost and units_shipped.
Ø It is also possible to share dimension tables between fact tables. For example, time, item, and location dimension tables are shared
between the sales and shipping fact table.
Disadvantages:
(i) Complex due to multiple fact tables
(ii) It is difficult to manage
(iii) Dimension Tables are very large.
2.A) What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).
The three main types of data warehouse usage are information processing, analytical processing, and data mining. Let's discuss each one and
then delve into the motivation behind OLAP mining (OLAM).
1. Information Processing:
Information processing in a data warehouse involves collecting, storing, and managing large volumes of data from various sources to support
day-to-day business operations. The primary goal is to provide a centralized repository of integrated data that can be accessed and updated in
real-time to support transactional activities.
Key characteristics:
Real-time data updates: Information processing focuses on capturing and maintaining the most current state of the data to support
operational processes.
OLTP (Online Transaction Processing): The focus is on efficient handling of frequent and small-scale transactions.
Use cases:
Inventory management
Customer relationship management (CRM) systems
2. Analytical Processing:
Analytical processing in a data warehouse involves querying and analyzing historical data to gain insights, identify patterns, and make
strategic decisions. It emphasizes providing fast response times for complex analytical queries and data summarization.
Key characteristics:
Historical data analysis: Analytical processing deals with large volumes of historical data to identify trends and patterns over
time.
OLAP (Online Analytical Processing): The focus is on supporting complex queries and multidimensional analysis.
Use cases:
Sales trend analysis and reporting
Budgeting and financial forecasting
3. Data Mining:
Data mining in a data warehouse involves the discovery of valuable patterns, correlations, and insights from large datasets. It uses statistical
and machine learning techniques to find hidden relationships within the data and predict future trends.
Key characteristics:
Advanced data analysis: Data mining goes beyond standard analytical processing by discovering new knowledge and patterns in
the data.
Predictive modeling: It involves building models that can predict future trends and behaviors based on historical data.
Use cases:
Fraud detection
Recommender systems
OLAP mining (OLAM) is a combination of Online Analytical Processing (OLAP) and data mining techniques. The motivation behind
OLAM is to extend the capabilities of traditional OLAP systems by incorporating data mining algorithms to discover deeper insights and
patterns from the multidimensional data stored in the data warehouse.
1. Enhanced Decision Support: OLAM enhances decision-making processes by providing advanced analytical capabilities. It allows
users to uncover hidden relationships and patterns in data that may not be apparent through traditional OLAP analysis alone.
2. Pattern Discovery: OLAM employs data mining techniques to discover previously unknown patterns, trends, and associations in
multidimensional data. This can lead to actionable insights and a deeper understanding of business processes.
3. Predictive Analysis: By combining OLAP with data mining, OLAM enables predictive analysis, allowing organizations to make
data-driven forecasts and anticipate future trends based on historical data.
4. Deeper Insights: OLAM goes beyond standard OLAP aggregations and drill-downs to reveal deeper insights into data. It enables
users to identify outliers, anomalies, and other valuable patterns that may influence business strategies.
5. Complex Data Relationships: Data mining algorithms in OLAM can uncover complex relationships between dimensions that may
not be apparent through simple OLAP queries.
Overall, OLAP mining (OLAM) bridges the gap between OLAP and data mining, allowing organizations to make more informed decisions
and gain a competitive edge by leveraging the power of advanced analytics on multidimensional data.
2 B) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and
charge, where charge is the fee that a doctor charges a patient for a visit. Enumerate three classes of schemas that are popularly
used for modeling data warehouses and explain.
In the context of a data warehouse with dimensions such as time, doctor, and patient, and measures such as count and charge, here are
three popular classes of schemas used for modeling data warehouses:
1. Star Schema:
The star schema is a widely used and simple schema design for data warehousing. In this schema, there is one central fact table that holds the
measures (count and charge) and is surrounded by dimension tables (time, doctor, and patient) that provide context to the measures. The fact
table contains foreign keys to link to the dimension tables.
Explanation:
Fact Table: Contains the quantitative measures (count and charge) and foreign keys to connect to the dimension tables.
Dimension Tables: Each dimension table represents a specific attribute or dimension, such as time, doctor, and patient. These
tables contain descriptive attributes related to the respective dimension.
Advantages:
Simple structure that is easy to understand; queries require fewer joins, giving fast aggregation of count and charge across the time, doctor, and patient dimensions.
2. Snowflake Schema:
The snowflake schema is an extension of the star schema where dimension tables are further normalized into multiple related tables. In this
schema, the dimension tables are broken down into sub-dimensions, reducing data redundancy and improving data integrity.
Explanation:
Fact Table: Same as in the star schema, contains the measures and foreign keys.
Dimension Tables: Dimension tables might be further normalized into sub-dimension tables. For example, the doctor dimension
may have separate tables for doctor details, specialty, and location, with relationships between them.
Advantages:
Reduced data redundancy and improved data integrity for the dimension data, at the cost of more joins at query time.
3. Fact Constellation Schema (Galaxy Schema):
The fact constellation schema, also known as the galaxy schema, is a complex schema design that consists of multiple fact tables sharing dimension tables. This schema is used when dealing with heterogeneous data with different grain levels.
Explanation:
Fact Tables: Multiple fact tables, each containing different measures related to specific business processes. For example, one fact
table may store patient-related measures, while another fact table stores doctor-related measures.
Dimension Tables: Shared dimension tables are used across all fact tables to maintain consistency and reduce redundancy.
Advantages:
Supports complex scenarios with multiple independent business processes or varying grain levels of data.
Each of these schema designs has its own advantages and trade-offs. The choice of schema depends on the specific requirements of the data
warehouse, the complexity of the data being analyzed, and the preferred querying and reporting performance.
A data warehouse usually adopts a three-tier architecture. The bottom tier is a warehouse database server, which is almost always a relational database system; data from operational databases and external sources are fed into it through application program interfaces known as gateways. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding for Databases) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
The middle tier is an OLAP server that is typically implemented using either
(a)a relational OLAP(ROLAP) model, that is an extended relational DBMS that maps operations on multidimensional data to standard
relational operations, or
(b) a multidimensional OLAP(MOLAP) model that is a special-purpose server that directly implements multidimensional data and
operations.
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and data mining tools (e.g., trend analysis and prediction).
Enterprise Warehouse
An enterprise warehouse collects all of the information about subjects spanning the entire organization. It supports corporate-wide data integration, usually from one or more operational systems or external data providers, and it is cross-functional in scope. It generally contains detailed as well as summarized information and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of users. The scope is confined to particular
selected subjects. For example, a marketing data mart may restrict its subjects to the customer, items, and sales. The data contained in the
data marts tend to be summarized.
Independent Data Mart: An independent data mart is sourced from data captured from one or more operational systems or external data providers, or from data generated locally within a particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses.
Virtual Warehouses
A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
OLAP OPERATIONS:
Ø In the multidimensional model, the records are organized into various dimensions, and each dimension includes multiple levels of
abstraction described by concept hierarchies.
Ø This organization provides users with the flexibility to view data from various perspectives.
Ø A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Hence, OLAP supports a user-friendly environment for interactive data analysis.
Ø Consider a data cube for the sales of a shop. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.
(i) Roll-up
(ii) Drill-down
(iii) Slice
(iv) Dice
(v) Pivot
Roll-up:
Ø The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is like zooming out on the data cube.
Ø The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for location is defined as the order street < city < province or state < country.
Ø The roll-up operation aggregates the data by ascending the location hierarchy from the level of the city to the level of the country.
Ø When a roll-up is performed by dimension reduction, one or more dimensions are removed from the cube.
Ø For example, consider a sales data cube having two dimensions, location and time. Roll-up may be performed by removing the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Drill-Down
Ø The drill-down operation is the reverse operation of roll-up.
Ø It is also called roll-down operation.
Ø It navigates from less detailed data to more detailed data. Drill-down can be performed by either stepping down a concept hierarchy
for a dimension or adding additional dimensions.
Ø Figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy which is defined as day,
month, quarter, and year.
Ø Drill-down appears by descending the time hierarchy from the level of the quarter to a more detailed level of the month.
Ø Because a drill-down adds more details to the given data, it can also be performed by adding a new dimension to a cube.
Slice:
Ø A slice is a subset of the cube corresponding to a single value for one or more members of a dimension.
Ø The slice operation selects one particular dimension of a given cube and provides a new sub cube.
Ø For example, a slice operation is executed when the customer wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one dimension of the given cube, thus resulting in a sub cube.
Ø Here, slice is performed on the dimension "time" using the criterion time = "Q1".
(Figure: Slice operation)
Dice:
Ø The dice operation performs a selection on two or more dimensions of the given cube, producing a sub cube. For example, a dice using the criteria (location = "Chennai" or "Hyderabad") and (time = "Q1" or "Q2") selects a sub cube on the location and time dimensions.
Pivot (Rotate):
Ø The pivot operation rotates the data axes in view to provide an alternative presentation of the data, for example by interchanging the rows and columns of a 2-D slice.
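The sketch below imitates roll-up, drill-down, slice, and dice on a toy sales table using pandas group-by and filtering; the cities, items, and sales figures are invented purely for illustration.

```python
# Sketch of roll-up, drill-down, slice, and dice on a toy sales cube (hypothetical data).
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Hyderabad", "Hyderabad", "Chennai", "Chennai"],
    "country": ["India", "India", "India", "India"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["TV", "Phone", "TV", "Phone"],
    "sales":   [100, 150, 80, 120]})

rollup    = sales.groupby(["country", "quarter"])["sales"].sum()        # city -> country
drilldown = sales.groupby(["city", "quarter", "item"])["sales"].sum()   # more detail
slice_q1  = sales[sales["quarter"] == "Q1"]                             # slice: time = "Q1"
dice      = sales[(sales["quarter"] == "Q1") & (sales["item"] == "TV")] # dice: two criteria
print(rollup, slice_q1, sep="\n")
```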
c)Define data warehouse. Explain about data warehouse implementation.
Ans. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making process. Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries
be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support
highly efficient cube computation techniques, access methods, and query processing techniques. In
this section, we present an overview of methods for the efficient implementation of data warehouse
systems.
We also discuss how OLAP data can be indexed, using either bitmap or join indices.
The compute cube Operator and the Curse of Dimensionality:
One approach to cube computation extends SQL so as to include a compute cube operator. The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation. This can require excessive storage space, especially for large numbers of dimensions. We start with an intuitive look at what is involved in the efficient computation of data cubes.
A data cube is a lattice of cuboids. Suppose that you want to create a data cube for AllElectronics
sales that contains the following: city, item, year, and sales in dollars. You want to be able to analyze
the data, with queries such as the following: “Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.” “Compute the sum of sales, grouping by item.
What is the total number of cuboids, or group-by’s, that can be computed for this data cube? Taking
the three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars as
the measure, the total number of cuboids, or groupby’s, that can be computed for this data cube is
2^3 = 8. The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item,
year), (city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are
not grouped).
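A quick way to see why a 3-dimensional cube has 2^3 = 8 cuboids is to enumerate every subset of the dimension set, as in this small Python sketch.

```python
# Enumerating all 2^n cuboids (group-by's) of a 3-dimensional cube, as described above.
from itertools import combinations

dims = ["city", "item", "year"]
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]
print(len(cuboids))   # 8 = 2^3, including the empty (apex) group-by ()
for c in cuboids:
    print(c)
```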
There are three choices for data cube materialization given a base cuboid: 1. No materialization: Do
not precompute any of the “nonbase” cuboids. This leads to computing expensive multidimensional
aggregates on-the-fly, which can be extremely slow. 2. Full materialization: Precompute all of the
cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically
requires huge amounts of memory space in order to store all of the precomputed cuboids. 3. Partial
materialization: Selectively compute a proper subset of the whole set of possible cuboids.
Alternatively, we may compute a subset of the cube, which contains only those cells that satisfy
some user-specified criterion, such as where the tuple count of each cell is above some threshold.
We will use the term subcube to refer to the latter case, where only some of the cells may be
precomputed for various cuboids. Partial materialization represents an interesting trade-off between
storage space and response time.
Indexing OLAP Data: Bitmap Index and Join Index: The bitmap indexing method is popular in OLAP
products because it allows quick searching in data cubes. The bitmap index is an alternative
representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a distinct
bit vector, Bv, for each value v in the attribute’s domain. If a given attribute’s domain consists of n
values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the
attribute has the value v for a given row in the data table, then the bit representing that value is set
to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
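The following toy Python sketch builds a bitmap index for one attribute: one bit vector per distinct value, with a 1 in the position of every row (RID) that holds that value. The item values are hypothetical.

```python
# Toy bitmap index: one bit vector per distinct value of an attribute (hypothetical data).
rows = ["Sony-TV", "Laptop", "Sony-TV", "Phone"]       # the 'item' column, in RID order

bitmap = {}
for rid, value in enumerate(rows):
    vector = bitmap.setdefault(value, [0] * len(rows))
    vector[rid] = 1                                    # set the bit for this row

print(bitmap)
# {'Sony-TV': [1, 0, 1, 0], 'Laptop': [0, 1, 0, 0], 'Phone': [0, 0, 0, 1]}
```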
The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast,
join indexing registers the joinable rows of two relations from a relational database. For example, if
two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record
contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations,
respectively. Hence, the join index records can identify joinable tuples without performing costly join
operations. Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys, from the joinable relation. The star schema model of data
warehouses makes join indexing attractive for crosstable search, because the linkage between a fact
table and its corresponding dimension tables comprises the fact table’s foreign key and the
dimension table’s primary key. Join indexing maintains relationships between attribute values of a
dimension (e.g., within a dimension table) and the corresponding rows in the fact table. Join indices
may span multiple dimensions to form composite join indices. We can use join indices to identify
subcubes that are of interest.
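A join index can be pictured as a precomputed mapping from a dimension value to the fact-table RIDs it joins with. The sketch below uses the tuple IDs mentioned in the text (T57, T238, T459, T884); the street assigned to T459 is invented just to complete the toy data.

```python
# Toy join index: precomputed (dimension value -> fact-table RIDs) pairs.
sales_facts = {"T57": ("Main Street", "Sony-TV"),
               "T238": ("Main Street", "Laptop"),
               "T459": ("5th Avenue", "Sony-TV"),   # street here is made up
               "T884": ("Main Street", "Phone")}

join_index_location = {}
for rid, (street, _item) in sales_facts.items():
    join_index_location.setdefault(street, []).append(rid)

print(join_index_location["Main Street"])   # ['T57', 'T238', 'T884']
```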
Join indexing. In Example 3.4, we defined a star schema for AllElectronics of the form “sales star
[time, item, branch, location]: dollars sold = sum (sales in dollars).” An example of a join index
relationship between the sales fact table and the location and item dimension tables is shown in
Figure 4.16. For example, the “Main Street” value in the location dimension table joins with tuples
T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item dimension table
joins with tuples T57 and T459 of the sales fact table. The corresponding join index tables are shown in the figure.
Efficient Processing of OLAP Queries The purpose of materializing cuboids and constructing OLAP
index structures is to speed up query processing in data cubes. Given materialized views, query
processing should proceed as follows:
1. Determine which operations should be performed on the available cuboids: This involves
transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the
query into corresponding SQL and/or OLAP operations. For example, slicing and dicing a data cube
may correspond to selection and/or projection operations on a materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be applied: This involves
identifying all of the materialized cuboids that may potentially be used to answer the query, pruning
the set using knowledge of “dominance” relationships among the cuboids, estimating the costs of
using the remaining materialized cuboids, and selecting the cuboid with the least cost.
Example 4.9 OLAP query processing. Suppose that we define a data cube for AllElectronics of the
form “sales cube [time, item, location]: sum(sales in dollars).” The dimension hierarchies used are
“day < month < quarter < year” for time; “item name < brand < type” for item; and “street < city <
province or state < country” for location. Suppose that the query to be processed is on {brand,
province or state}, with the selection constant “year = 2010.” Also, suppose that there are four
materialized cuboids available, as follows:
cuboid 1: {year, item_name, city}
cuboid 2: {year, brand, country}
cuboid 3: {year, brand, province_or_state}
cuboid 4: {item_name, province_or_state}, where year = 2010
“Which of these four cuboids should be selected to process the query?” Finer-
granularity data cannot be generated from coarser-granularity data. Therefore, cuboid 2 cannot be
used because country is a more general concept than province or state. Cuboids 1, 3, and 4 can be
used to process the query because (1) they have the same set or a superset of the dimensions in the
query, (2) the selection clause in the query can imply the selection in the cuboid, and (3) the
abstraction levels for the item and location dimensions in these cuboids are at a finer level than
brand and province or state, respectively.
UNIT-II
Data Cleaning − In this step, the noise and inconsistent data are removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation − In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, knowledge is represented.
b) What are the major issues in data mining?
The major issues in data mining can be grouped into three classes:
Ø Mining methodology and user interaction issues: mining different kinds of knowledge, interactive mining at multiple levels of abstraction, incorporation of background knowledge, data mining query languages, presentation and visualization of results, and handling noisy or incomplete data.
Ø Performance issues: efficiency and scalability of data mining algorithms, and parallel, distributed, and incremental mining methods.
Ø Issues relating to the diversity of data types: handling relational and complex types of data, and mining information from heterogeneous databases and global information systems such as the Web.
Data mining can be performed on the following kinds of data sources/repositories:
1. Flat Files
2. Relational Databases
3. DataWarehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)
Flat Files
Flat files are defined as data files in text form or binary form with a structure that can be easily extracted by data mining algorithms. Data stored in flat files have no relationship or path among themselves; for example, if a relational database is stored in flat files, there will be no relations between the tables. Flat files are represented by a data dictionary. Eg: CSV file.
Application: Used in data warehousing to store data, used in carrying data to and from servers, etc.
Relational Databases
A relational database is defined as a collection of data organized in tables with rows and columns. The physical schema of a relational database defines the structure of the tables, while the logical schema defines the relationships among the tables. The standard API of relational databases is SQL.
Application: Data mining, ROLAP model, etc.
DataWarehouse
A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making. There are three types of data warehouse: enterprise data warehouse, data mart, and virtual warehouse.
Multimedia Databases
Multimedia databases consist of audio, video, image, and text media. They can be stored on object-oriented databases. They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on-demand, news-on-demand, musical databases, etc.
Spatial Databases
Spatial databases store geographical information, with data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, global positioning, etc.
Time-series Databases
Time-series databases contain stock exchange data and user-logged activities. They handle arrays of numbers indexed by time, date, etc., and require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
WWW
The World Wide Web (WWW) is a collection of documents and resources such as audio, video, and text, which are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessed via web browsers over the Internet. It is the most heterogeneous repository, as it collects data from multiple resources. It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, job search, research, studying, etc.
(OR)
4a) Briefly discuss about types of attributes and measurements.
Attribute: An attribute is a data field that represents a characteristic or feature of a data object.
1. Nominal Attributes: Values are names of things or symbols (categories), with no meaningful order.
Example:
Attribute: Colors; Values: Black, Green, Brown, Red
Attribute: Profession; Values: Teacher, Businessman, Peon
2. Binary Attributes: Binary data has only 2 values/states. For example: yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (e.g., Gender).
ii) Asymmetric: Both values are not equally important (e.g., Result).
3. Ordinal Attributes: Values have a meaningful order or ranking, but the magnitude between successive values is not known (e.g., grade: A, B, C; size: small, medium, large).
4. Numeric Attributes: Quantitative, measurable values, which may be interval-scaled (e.g., temperature in °C) or ratio-scaled (e.g., salary, age).
b) Explain about similarity and dissimilarity between simple attributes and data objects.
Ø Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0, 1].
Ø Dissimilarity measure
– Numerical measure of how different two data objects are.
– Lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.
Ø Proximity refers to a similarity or dissimilarity.
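As a small illustration, the following Python functions compute one common dissimilarity measure (Euclidean distance) and one common similarity measure (cosine similarity) for two numeric data objects; the two sample vectors are arbitrary.

```python
# Simple similarity / dissimilarity measures for two numeric data objects (a sketch).
import math

def euclidean_dissimilarity(x, y):
    # Dissimilarity: 0 when identical, grows as the objects differ.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # Similarity in [0, 1] for non-negative vectors: higher means more alike.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

p, q = [1.0, 3.0, 2.0], [2.0, 3.0, 1.0]
print(euclidean_dissimilarity(p, q), cosine_similarity(p, q))
```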
(i) min-max normalization, (ii) z-score normalization, (iii) z-score normalization using the mean absolute deviation instead of the standard deviation, and (iv) normalization by decimal scaling.
b) Z-Score Normalization
The values for an attribute, A, are normalized based on the mean (Ā) and standard deviation (σA) of A. A value v of A is normalized to v' = (v − Ā) / σA.
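A short Python sketch of the four normalization methods listed above, on a made-up set of values; the [0, 1] target range for min-max normalization is an arbitrary choice.

```python
# Sketch of the four normalization methods listed above (pure Python, toy data).
import math
import statistics

values = [200.0, 300.0, 400.0, 600.0, 986.0]
new_min, new_max = 0.0, 1.0

# (i) min-max normalization to the range [new_min, new_max]
min_max = [(v - min(values)) / (max(values) - min(values)) * (new_max - new_min) + new_min
           for v in values]

# (ii) z-score normalization: v' = (v - mean) / std
mean, std = statistics.mean(values), statistics.pstdev(values)
z_score = [(v - mean) / std for v in values]

# (iii) z-score using the mean absolute deviation instead of the standard deviation
mad = sum(abs(v - mean) for v in values) / len(values)
z_score_mad = [(v - mean) / mad for v in values]

# (iv) decimal scaling: divide by 10^j, with the smallest j such that max |v'| < 1
j = math.ceil(math.log10(max(abs(v) for v in values)))
decimal_scaled = [v / (10 ** j) for v in values]

print(min_max, z_score, z_score_mad, decimal_scaled, sep="\n")
```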
UNIT-III
5 . a) Use the C4.5 algorithm to build a decision tree for classifying the following objects.
The C4.5 algorithm is a decision tree algorithm used for classification. It works by
recursively partitioning the dataset into subsets based on the attributes that provide the best
split, measured using information gain or another appropriate metric. Let's build a decision
tree for the given objects based on the attributes: Size, Color, and Shape.
Step 1: Calculate the entropy of the whole dataset: Entropy(S) = − Σ p_i log2(p_i), where p_i is the proportion of examples in S that belong to class i.
Step 2: Calculate the information gain for each attribute (Size, Color, Shape):
For Size:
For Color:
For Shape:
Now, compare the information gains for Size, Color, and Shape. The attribute with the
highest information gain is Shape.
So, the decision tree should start by splitting on the Shape attribute. Here's the decision tree:
1. If Shape is Round:
o Classify as A.
2. If Shape is Cube:
o Classify as B.
This decision tree will correctly classify the given objects into classes A and B based on their
Size, Color, and Shape attributes.
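Since the original object table is not reproduced here, the sketch below shows how the entropy and information-gain calculations behind such a tree can be coded, using a small hypothetical set of (Size, Color, Shape) objects in which Shape separates the classes perfectly.

```python
# Entropy and information gain, as used when choosing the split attribute
# (the objects and labels here are hypothetical).
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    base, total = entropy(labels), len(rows)
    gain = base
    for value in set(r[attribute_index] for r in rows):
        subset = [lbl for r, lbl in zip(rows, labels) if r[attribute_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# columns: Size, Color, Shape
rows   = [("Big", "Red", "Round"), ("Big", "Red", "Cube"),
          ("Small", "Green", "Round"), ("Small", "Green", "Cube")]
labels = ["A", "B", "A", "B"]
print([information_gain(rows, labels, i) for i in range(3)])  # Shape has the highest gain
```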
The C4.5 algorithm, developed by Ross Quinlan as an improvement over his earlier ID3
algorithm, introduced several new features and enhancements. Here are the key differences
and new features of C4.5 compared to the original ID3 algorithm:
1. Handling of Continuous Attributes:
o C4.5 can handle both discrete and continuous attributes, whereas ID3 can only
handle discrete attributes. C4.5 achieves this by dynamically selecting
threshold values for continuous attributes, allowing it to split data effectively
based on numerical values.
2. Gain Ratio instead of Information Gain:
o C4.5 uses a modified information gain metric called "Information Gain Ratio"
or "Gain Ratio" to overcome ID3's bias towards attributes with many values.
Gain Ratio adjusts for the number of branches an attribute can split into, which
helps prevent attributes with many values from dominating the tree.
3. Pruning:
o C4.5 performs post-pruning (error-based pruning) of the generated tree, removing branches that do not improve predictive accuracy and thereby reducing overfitting; ID3 builds the full tree without pruning.
4. Handling of Missing Values:
o C4.5 can handle training and test instances with missing attribute values by distributing them fractionally among the branches, whereas ID3 cannot.
5. Rule Generation:
o C4.5 can generate rules from the decision tree, which can be easier to interpret
and provide insights into the decision-making process. ID3 primarily produces
a decision tree without directly generating rules.
6. Reduced Attribute Inclusion:
o C4.5 can reduce attribute inclusion in the tree by considering whether adding
an attribute to a node will result in significant improvement. This feature helps
produce simpler and more efficient trees compared to ID3, which tends to
include all attributes.
8. Scalability:
o C4.5 is generally more scalable than ID3 due to its pruning mechanism and
better handling of attributes with many values. This makes it suitable for
larger datasets.
o C4.5 can handle numeric class labels by converting them into a binary
classification problem, whereas ID3 primarily works with categorical class
labels.
Overall, C4.5 improved upon several limitations of the original ID3 algorithm, making it a
more versatile and effective algorithm for decision tree generation, especially in scenarios
involving real-world datasets with diverse attributes and data types.
Cross-validation
The practice of cross-validation is to take a dataset and randomly split it into a number of evenly sized segments, called folds. The machine learning algorithm is trained on all but one fold. Cross-
segments, called folds. The machine learning algorithm is trained on all but one fold. Cross-
validation then tests each fold against a model trained on all of the other folds. This means
that each trained model is tested on a segment of the data that it has never seen before. The
process is repeated with a different fold being hidden during training and then tested until all
folds have been used exactly once as a test and been trained on during every other iteration.
The training data is split into five folds. During each iteration, a different fold is set aside to
be used as test data.
The outcome of cross-validation is a set of test metrics that give a reasonable forecast of how
accurately the trained model will be able to predict on data that it has never seen before
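A minimal 5-fold cross-validation sketch, assuming scikit-learn is available; the Iris dataset and decision tree model are placeholders for whatever model and data are being evaluated.

```python
# 5-fold cross-validation sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(criterion="entropy", random_state=0)

# Each of the 5 folds is held out once as the test segment.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())
```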
1. Overfitting: Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations, so that it performs well on the training data but poorly on unseen test data. Regularization, pruning, and cross-validation help to address overfitting.
2. Underfitting: On the other hand, underfitting occurs when a model is too simple to
capture the underlying structure of the data. This often happens when the model is too
constrained or lacks the capacity to represent complex relationships. Underfitting can
lead to poor performance on both the training and test data. To address underfitting,
one might need to use more complex models or feature engineering techniques.
3. Data leakage: Data leakage occurs when information from the test set (or future data)
leaks into the training process, leading to overly optimistic performance estimates.
This can happen if one inadvertently uses information from the test set to inform
model training or hyperparameter tuning. To prevent data leakage, it's essential to
properly partition the data into training and test sets and ensure that no information
from the test set is used during training.
4. Selection bias: Selection bias occurs when the process of selecting the model or
evaluating its performance is biased in some way. For example, if only a subset of the
available data is used for model evaluation, the results may not be representative of
the model's performance on unseen data. To mitigate selection bias, it's important to
use techniques like stratified sampling or cross-validation to ensure that the evaluation
process is fair and unbiased.
Navigating these pitfalls requires careful consideration of the data, model selection process,
and evaluation techniques. By being aware of these potential challenges and employing best
practices in model selection and evaluation, one can develop more robust and reliable
machine learning models.
(OR)
The holdout method and cross-validation are both techniques used for evaluating the
performance of machine learning models, but they differ in how they partition the data and
assess model performance:
1. Holdout Method:
o Partitioning: In the holdout method, the dataset is split into two subsets: a
training set and a validation/test set. Typically, a larger portion of the data
(e.g., 70-80%) is used for training, and the remaining portion (e.g., 20-30%) is
used for validation or testing.
o Training and Evaluation: The model is trained on the training set and then
evaluated on the validation or test set. The performance metrics computed on
the validation/test set are used to estimate the model's generalization
performance.
o Advantages:
Simple, fast, and computationally inexpensive; works well when the dataset is large.
o Disadvantages:
Limited use of data for training and evaluation, which can lead to less
reliable performance estimates, especially with smaller datasets.
2. Cross-Validation:
o Partitioning and Evaluation: The dataset is split into k folds; each fold is held out once as the test set while the model is trained on the remaining k − 1 folds, and the k scores are averaged.
o Advantages:
Uses all of the data for both training and evaluation, giving a more reliable and less variable estimate of generalization performance.
o Disadvantages:
Computationally more expensive, since the model must be trained k times.
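For comparison, a holdout evaluation can be sketched as a single train/test split, again assuming scikit-learn; the 70/30 split ratio is just one common choice.

```python
# Holdout split sketch: ~70% training, ~30% test (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```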
Bootstrap is a resampling technique used in statistics and machine learning for estimating the
distribution of statistics or parameters by repeatedly sampling data points from the original
dataset with replacement. It is particularly useful when the dataset is small or when
uncertainty about a statistic needs to be quantified. Bootstrap has various applications in
classification, including model evaluation, feature selection, and estimating prediction
uncertainty. Let's delve into each aspect:
1. Model Evaluation:
o Bootstrap Accuracy Estimation: Draw many bootstrap samples (sampling with replacement), train the classifier on each sample, and evaluate it on the instances not drawn (the "out-of-bag" instances); averaging these scores (as in the .632 bootstrap) gives a robust performance estimate, especially for small datasets.
2. Feature Selection:
o Feature Subset Selection: For each bootstrap sample, train the classification
model using a subset of features randomly selected from the original feature
set. By evaluating the model's performance on each bootstrap sample, one can
assess the importance of different features in classification.
3. Prediction Uncertainty:
o By training the classifier on many bootstrap samples and aggregating their predictions, one can estimate how stable the predicted class (or class probability) for each instance is and attach confidence intervals to predictions.
4. Outlier Detection:
o Instances that are consistently misclassified or that receive highly unstable predictions across bootstrap replicates can be flagged as potential outliers.
Overall, bootstrap provides a powerful tool for assessing model performance, feature
importance, prediction uncertainty, and outlier detection in classification tasks. It leverages
resampling with replacement to generate multiple datasets, allowing for more robust
estimates and insights into the underlying data distribution. By utilizing bootstrap techniques,
practitioners can make more informed decisions in classification model development and
evaluation.
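A rough sketch of bootstrap-based model evaluation, assuming numpy and scikit-learn are available: each replicate trains on a sample drawn with replacement and tests on the out-of-bag rows; the dataset, model, and 100 replicates are arbitrary choices.

```python
# Bootstrap evaluation sketch: resample with replacement, test on out-of-bag rows.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
scores = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))       # sample with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)       # out-of-bag rows
    if len(oob) == 0:
        continue
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))
print("bootstrap accuracy estimate:", np.mean(scores))
```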
UNIT-IV
7 a) Explain about support counting in frequent itemset generation. [14M] [CO4] [July – 2023]
1. Definition of Support:
o The support of an itemset X is the number (or fraction) of transactions in the dataset that contain all of the items in X.
2. Counting Support:
o The process of support counting involves scanning the dataset to count the
occurrences of each itemset.
3. Thresholding:
o Frequent itemsets are those whose support exceeds this minimum support
threshold.
o Itemsets that do not meet the minimum support threshold are considered
infrequent and are not considered further in the frequent itemset generation
process.
4. Example:
o If the itemset {bread, butter} appears in 3 out of 10 transactions, its support is 3 (or 30%); with a minimum support threshold of 20%, it is frequent.
Support counting is a crucial step in frequent itemset generation as it helps identify patterns
of co-occurrence between items in a dataset. Frequent itemsets with high support are
considered significant and may indicate meaningful associations between items.
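A tiny pure-Python sketch of support counting over a made-up transaction list; the items and the 0.6 minimum-support threshold are illustrative only.

```python
# Counting support for candidate itemsets over a toy transaction list (hypothetical data).
transactions = [{"bread", "butter", "milk"},
                {"bread", "butter"},
                {"bread", "jam"},
                {"butter", "milk"}]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"bread"}, transactions))            # 0.75
print(support({"bread", "butter"}, transactions))  # 0.5 -> infrequent if min_sup = 0.6
```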
b) Explain about compact representation of frequent Item set.
The number of frequent itemsets can be very large. For instance, suppose you are dealing with a store that is trying to find relationships among over 100 items. According to the Apriori principle, if an itemset is frequent, all of its subsets must be frequent, so a frequent 100-itemset has
C(100, 1) = 100 frequent 1-itemsets,
C(100, 2) = 4,950 frequent 2-itemsets,
C(100, 3) = 161,700 frequent 3-itemsets,
and the list goes on. If one were to enumerate all the frequent itemsets that are subsets of this larger 100-itemset, there would be close to 2^100 of them. This is far too many itemsets to store on an average computer or to count support for, and so it
is for this reason that alternative representations have been derived which reduce the initial
set but can be used to generate all other frequent itemsets. The Maximal and Closed Frequent
Itemsets are two such representations that are subsets of the larger frequent itemset that will
be discussed in this section. The table below provides the basic information about these two
representations and how they can be identified.
6. Compare the Apriori and FP-Growth algorithms for frequent itemset mining in transactional databases.
Here's a comparison between the Apriori and FP-Growth algorithms for frequent itemset
mining in transactional databases:
1. Apriori Algorithm:
o Pros:
Simple to understand and implement; prunes the candidate space using the Apriori (anti-monotone) property; works reasonably well on sparse datasets.
o Cons:
Multiple passes over the dataset are required, which can be inefficient
for large datasets.
2. FP-Growth Algorithm:
o Pros:
Requires only two passes over the dataset, making it more scalable for
large datasets.
Does not generate candidate itemsets explicitly, avoiding the need for
costly candidate generation and pruning steps.
o Cons:
Construction of the FP-tree can be memory-intensive for very large
datasets with high dimensionality.
After executing both algorithms, we can compare them based on the following factors:
1. Execution Time: Compare the time taken by each algorithm to find frequent itemsets.
2. Memory Usage: Analyze the memory footprint of each algorithm, considering factors
such as data preprocessing and internal data structures.
3. Scalability: Assess how well each algorithm performs as the size of the dataset
increases.
By analyzing these factors, we can determine which algorithm is more suitable for frequent
itemset mining in transactional databases under the given constraints and requirements.
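One way to compare the two algorithms empirically is sketched below; it assumes the third-party mlxtend library is installed, and the four transactions are invented for the example. Timing both calls on the same one-hot encoded data gives a rough feel for factor 1 (execution time).

```python
# Comparing Apriori and FP-Growth on the same toy data (assumes mlxtend is installed).
import time
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

transactions = [["A", "B", "E"], ["A", "B", "D"], ["A", "B", "E"], ["A", "C", "E"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

for name, algo in [("Apriori", apriori), ("FP-Growth", fpgrowth)]:
    start = time.perf_counter()
    frequent = algo(onehot, min_support=0.6, use_colnames=True)
    print(name, f"{time.perf_counter() - start:.4f}s")
    print(frequent)
```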
(OR)
Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
Let's first find all frequent items using the Apriori and FP-Growth algorithms for the given
database with a minimum support threshold of 60% and minimum confidence of 80%. After
that, we'll compare the efficiency of the two algorithms.
Given Data:
Apriori Algorithm:
Step 1 (Initialization):
{A}: 4
{B}: 3
{C}: 2
{D}: 2
{E}: 3
o Prune items with support less than 60% (min_sup): {C}, {D}
2. For k = 2:
{A, B}: 3
{A, E}: 3
{B, E}: 3
The frequent item sets found using Apriori are {A, B}, {A, E}, {B, E}.
FP-Growth Algorithm:
The FP-Growth algorithm constructs an FP-tree and mines frequent item sets efficiently in a
single pass. I'll outline the steps without showing the tree structure.
The frequent item sets found using FP-Growth are {A, B}, {A, E}, {B, E}.
Comparison of Efficiency:
Apriori requires multiple passes over the data and generates candidate item sets,
which can be time-consuming for large databases. In this case, it required two passes
(one for frequent 1-item sets and one for frequent 2-item sets).
FP-Growth only requires a single pass over the data to build the FP-Tree and mine
frequent item sets directly. This is more efficient, especially for larger databases.
In this example, with a small database, the efficiency difference may not be very noticeable.
However, as the database size grows, FP-Growth's efficiency advantage becomes more
apparent, making it a preferred choice for frequent item set mining in many practical
applications.
Market basket analysis is a data mining and analytical technique used by businesses to
understand the purchasing behavior of customers based on the items they buy. It is a
valuable method for uncovering associations and patterns between products that are
frequently purchased together. Market basket analysis is often employed in the retail
industry, but it can be applied to various fields, such as e-commerce, grocery stores,
and even online services like streaming platforms.
1. Frequent Item Sets:
Market basket analysis begins with the identification of frequent item sets. These are sets of items (products) that are frequently purchased together in transactions.
2. Support:
Support is a key metric used in market basket analysis. It measures how often a
particular item or item set appears in the transactions. The support of an item set is
calculated as the number of transactions containing that item set divided by the total
number of transactions.
3. Confidence:
Confidence measures how likely item B is to be purchased when item A is purchased. It is calculated as the support of the item set (A and B) divided by the support of A.
4. Lift:
Lift is a metric that indicates the strength of association between items. It is calculated as the confidence of the rule (A and B) divided by the support of the second item (B). Lift greater than 1 suggests a positive association, less than 1 indicates a negative association, and equal to 1 means no association.
5. Apriori Algorithm:
Market basket analysis is often performed using the Apriori algorithm or similar
techniques. This algorithm is used to generate frequent item sets efficiently and
discover association rules that reveal item combinations that occur more frequently
than expected by chance.
Example:
Suppose you're analyzing sales data for a grocery store. After applying market basket
analysis, you find that customers who buy bread (item A) are very likely to also buy
butter (item B) with a high confidence value. This information can be used to
optimize product placement in the store, create targeted promotions, or enhance
product recommendations for customers.
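The bread-and-butter example can be quantified with a few lines of arithmetic; the transaction counts used here are made up solely to show how support, confidence, and lift are computed.

```python
# Worked support / confidence / lift computation for the bread-and-butter example
# (the transaction counts are hypothetical).
total = 100                 # total transactions
count_A = 30                # transactions containing bread
count_B = 25                # transactions containing butter
count_AB = 20               # transactions containing both

support_AB = count_AB / total                 # 0.20
confidence = count_AB / count_A               # 0.67: P(butter | bread)
lift = confidence / (count_B / total)         # 2.67 > 1 -> positive association
print(support_AB, round(confidence, 2), round(lift, 2))
```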
Cross-selling: Identify items that are often purchased together and offer bundle deals
or promotions to increase sales.
Store Layout and Merchandising: Improve store layout by placing related products
in close proximity to encourage additional purchases.
Market basket analysis provides valuable insights into customer behavior and can lead to
increased sales, improved customer satisfaction, and better decision-making for businesses in
various industries.
UNIT-V
Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth observation database, in the identification of groups of houses in a city according to house type, value, and geographic location, as well as in the identification of groups of automobile insurance policy holders with a high average claim cost.
Scalability:
Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions of
objects. Clustering on a sample of a given large data set may lead to biased results.
High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of
constraints. Suppose that your job is to choose the locations for a given number of
new automatic banking machines (ATMs) in a city. To decide upon this, you may
cluster households while considering constraints such as the city’s rivers and highway
networks, and the type and number of customers per cluster. A challenging task is to
find groups of data with good clustering behavior that satisfy specified constraints.
A. K-Means, while a popular and widely used clustering algorithm, has several additional issues
beyond its basic strengths and weaknesses that can affect its performance and suitability for
different datasets or applications. Some of these additional issues include:
1. Selection of K (Number of clusters): Determining the appropriate number of clusters (K) can be
challenging. It often requires domain knowledge, visual inspection, or trial-and-error methods.
Choosing an incorrect K value can lead to suboptimal clustering.
2. Cluster Initialization: K-Means is sensitive to the initial placement of cluster centroids. Random
initialization can result in different clustering outcomes. Poor initialization can lead to slow
convergence or suboptimal clustering.
3. Convergence to Local Optima: K-Means seeks to minimize the objective function (e.g., the sum
of squared distances between data points and cluster centroids). However, it may converge to a
local optimum, which might not be the global best solution. Multiple initializations and restarts
from different initial positions can mitigate this, but it's not foolproof.
4. Impact of Outliers: Outliers can significantly influence the centroids' position, leading to
misleading cluster boundaries. As K-Means is highly sensitive to outliers, it might assign them to the
nearest cluster even if they don't represent the cluster well.
5. Scalability to High-Dimensional Data: K-Means may face challenges when dealing with high-
dimensional data. High-dimensional spaces can make distance calculations less meaningful,
impacting the quality of clustering.
6. Cluster Shape Assumptions: K-Means assumes clusters to be spherical and equally sized. If the
data contains non-spherical or irregularly shaped clusters, K-Means might not perform well.
7. Inability to Handle Non-Globular Clusters: Clusters that are not convex or globular in shape
might be challenging for K-Means to accurately identify.
8. Distance Metric Selection: The choice of distance metric (Euclidean distance being the default)
affects the clustering results. Using inappropriate distance measures for the data can lead to
suboptimal clustering.
9. Overcoming Curse of Dimensionality: K-Means can struggle with high-dimensional data, where
the "curse of dimensionality" affects the meaningfulness of distances between points.
Preprocessing techniques or feature selection can mitigate this issue.
C. Suppose that the data-mining task is to cluster the following eight points (representing locations) into three clusters: A1 (2, 10); A2 (2, 5); A3 (8, 4); B1 (5, 8); B2 (7, 5); B3 (6, 4); C1 (1, 2); C2 (4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the centers of each cluster, respectively. Use the k-means algorithm to determine the three cluster centers after the first round of execution.
Here's how to determine the three cluster centers after the first round of k-means execution
for the given data points and initial cluster assignments:
Data Points:
A1 (2, 10)
A2 (2, 5)
A3 (8, 4)
B1 (5, 8)
B2 (7, 5)
B3 (6, 4)
C1 (1, 2)
C2 (4, 9)
Initial Centroids:
Cluster 1: A1 (2, 10)
Cluster 2: B1 (5, 8)
Cluster 3: C1 (1, 2)
Calculate the Euclidean distance between each data point and all three initial
centroids.
Assign each data point to the cluster with the closest centroid.
Distance Calculations and Assignment:
Computing the Euclidean distance from every point to the three initial centroids and assigning each point to its nearest centroid gives:
Cluster 1: {A1}
Cluster 2: {A3, B1, B2, B3, C2}
Cluster 3: {A2, C1}
(For example, A2 (2, 5) is at distance 5 from A1, 4.24 from B1, and 3.16 from C1, so it joins Cluster 3; C2 (4, 9) is closest to B1, at distance 1.41.)
New Centroids:
Calculate the mean of the data points assigned to each cluster; these means become the new centroids.
New centroid of Cluster 1: mean of A1 = (2, 10)
New centroid of Cluster 2: mean of A3, B1, B2, B3, C2 = (6, 6)
New centroid of Cluster 3: mean of A2, C1 = (1.5, 3.5)
Summary:
After the first round of k-means, the new cluster centers are:
Cluster 1: (2, 10)
Cluster 2: (6, 6)
Cluster 3: (1.5, 3.5)
Note:
This is just one iteration of the k-means algorithm. In practice, you would repeat steps 1 and
2 until the centroids no longer change significantly (convergence). This indicates that the
algorithm has likely found a local minimum in the squared distance between points and their
assigned cluster centers.
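The first iteration can be checked with a short Python sketch over the eight given points; it reproduces the assignments and the new centers (2, 10), (6, 6), and (1.5, 3.5) reported above.

```python
# First k-means iteration on the eight points above, verifying the centers reported.
import math

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centroids = {1: points["A1"], 2: points["B1"], 3: points["C1"]}

# Assign every point to its nearest centroid (Euclidean distance),
# then recompute each centroid as the mean of its assigned points.
clusters = {k: [] for k in centroids}
for name, p in points.items():
    nearest = min(centroids, key=lambda k: math.dist(p, centroids[k]))
    clusters[nearest].append(p)

new_centroids = {k: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
                 for k, pts in clusters.items()}
print(new_centroids)   # {1: (2.0, 10.0), 2: (6.0, 6.0), 3: (1.5, 3.5)}
```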
(OR)
The K-Means algorithm is a widely used clustering technique known for its simplicity and
efficiency. However, it also has certain strengths and weaknesses that are important to consider
when using this method.
Strengths of K-Means:
1. Simplicity: K-Means is easy to understand and implement, with an intuitive objective of minimizing within-cluster squared distances.
2. Scalability: It is computationally efficient and works well with large datasets, making it suitable
for clustering in big data environments.
3. Speed: K-Means is fast and can converge quickly, making it useful for initial exploratory data
analysis and as a starting point for other clustering methods.
4. Versatility: It can work with various types of data and is suitable for many applications in
different domains.
5. Well-suited for spherical clusters: K-Means performs well when clusters are approximately
spherical and equally sized.
Weaknesses of K-Means:
1. Sensitivity to initial centroids: The algorithm's performance heavily depends on the initial
placement of centroids, which can lead to different results for different initializations.
2. Cluster shape and size assumption: K-Means assumes that clusters are spherical and of
approximately equal size, which may not reflect the actual structure of the data. It may perform
poorly with irregularly shaped or overlapping clusters.
3. Vulnerability to outliers: Outliers can significantly impact K-Means clustering results, as the
algorithm tends to assign them to the nearest cluster even if they don't belong to any.
4. Hard assignment of data points: K-Means provides a strict assignment of data points to clusters,
which might not represent the uncertainty or fuzziness in the data, unlike fuzzy clustering methods.
5. Difficulty with non-linear data: K-Means is not suitable for finding clusters in non-linear or
complex geometric structures within the data.
This method is based on the notion of density. The basic idea is to continue growing the given cluster
as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a
given cluster, the radius of a given cluster has to contain at least a minimum number of points.
Density-based spatial clustering of applications with noise (DBSCAN) clustering method. Clusters are
dense regions in the data space, separated by regions of the lower density of points. The DBSCAN
algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are suitable only
for compact and well-separated clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data. Real-life data may contain irregularities such as:
1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.
The DBSCAN algorithm uses two parameters:
1. eps: It defines the neighbourhood around a data point, i.e., if the distance between two points is lower than or equal to ‘eps’ then they are considered neighbours. If the eps value is chosen too small, then a large part of the data will be considered as outliers. If it is chosen very large, then the clusters will merge and the majority of the data points will be in the same cluster. One way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbours (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. The minimum value of MinPts should be at least 3.
DBSCAN ALGORITHM:
1. For every point, find all points within its eps neighbourhood and mark the point as a core point if it has at least MinPts neighbours.
2. For each core point that has not yet been assigned to a cluster, create a new cluster.
3. Recursively collect all points that are density-reachable from the core point and assign them to the same cluster.
4. Points that are not density-reachable from any core point are labelled as noise.
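A minimal DBSCAN sketch, assuming scikit-learn is available; the toy 2-D points and the eps/MinPts values are illustrative, and the point labelled -1 in the output is treated as noise.

```python
# DBSCAN sketch on toy 2-D points (assumes scikit-learn; eps and min_samples are illustrative).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8], [25, 25]])
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise
```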
Grid-based Method:
Advantages: The major advantage of this method is fast processing time. It is dependent only on the number of cells in each dimension in the quantized space.
Model-based Methods:
In this method, a model is hypothesized for each cluster to find the best fit of the data for a given model. This method locates the clusters by clustering the density function, reflecting the spatial distribution of the data points. It also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
Constraint-based Method:
In this method, the clustering is performed by the incorporation of user- or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide us with an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application requirement.
Clustering methods can be classified into the following categories:
Ø Partitioning Method
Ø Hierarchical Method
Ø Density-based Method
Ø Grid-Based Method
Ø Model-Based Method
Ø Constraint-based Method