DWDM Set-2

The document discusses data warehousing and OLAP. It explains why separate data warehouses are used instead of operational databases for analytics. It also compares OLTP and OLAP and describes ETL tools and metadata repositories. Finally, it discusses star schema and fact constellation schemas for multidimensional modeling.

Set-2

UNIT-I

1. a) Why to have a separate Data Warehouse? Explain.

Ans. Because operational databases store huge amounts of data, you may wonder, “Why not
perform online analytical processing directly on such databases instead of spending additional time
and resources to construct a separate data warehouse?” A major reason for such a separation is to
help promote the high performance of both systems. An operational database is designed and tuned
for known tasks and workloads, such as indexing and hashing using primary keys, searching for
particular records, and optimizing "canned" queries. On the other hand, data warehouse queries are
often complex. They involve the computation of large data groups at summarized levels, and may
require the use of special data organization, access, and implementation methods based on
multidimensional views. Processing OLAP queries in operational databases would substantially
degrade the performance of operational tasks. Moreover, an operational database supports the
concurrent processing of multiple transactions. Concurrency control and recovery mechanisms (e.g.,
locking and logging) are required to ensure the consistency and robustness of transactions. An OLAP
query often needs read-only access to data records for summarization and aggregation. Concurrency
control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution
of concurrent transactions and thus substantially reduce the throughput of an OLTP system. Finally,
the separation of operational databases from data warehouses is based on the different structures,
contents, and uses of the data in these two systems.

b) Compare and contrast On-Line Analytical Processing with On-Line Transaction Processing.

Ans.

Feature-by-feature comparison of OLTP and OLAP:

Characteristic: OLTP is a system used to manage operational data; OLAP is a system used to manage informational data.

Users: OLTP serves clerks, clients, and information technology professionals; OLAP serves knowledge workers, including managers, executives, and analysts.

System orientation: An OLTP system is customer-oriented, and transaction and query processing are done by clerks, clients, and IT professionals; an OLAP system is market-oriented, and data analysis is done by knowledge workers such as managers, executives, and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making; an OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity, which makes the data easier to use for informed decision making.

Database size: OLTP is roughly 100 MB to GB; OLAP is roughly 100 GB to TB.

Database design: An OLTP system usually uses an entity-relationship (ER) data model and an application-oriented database design; an OLAP system typically uses a star or snowflake model and a subject-oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations; an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization, and also deals with data that originate from different organizations, integrating information from many data stores.

Volume of data: OLTP data volumes are not very large; because of their large volume, OLAP data are stored on multiple storage media.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions, which require concurrency control and recovery techniques; accesses to OLAP systems are mostly read-only operations, because data warehouses store historical data.

Access mode: OLTP is read/write; OLAP is mostly read.

Inserts and updates: OLTP has short, fast inserts and updates initiated by end users; OLAP is refreshed by periodic, long-running batch jobs.

Number of records accessed: OLTP accesses tens of records; OLAP accesses millions.

Normalization: OLTP databases are fully normalized; OLAP databases are partially normalized (denormalized).

Processing speed: OLTP is very fast; OLAP speed depends on the amount of data involved, batch data refresh, and query complexity, and complex queries may take many hours, though query speed can be improved by creating indexes.

c) Explain ETL tools and metadata repository.

Ans. Extraction, Transformation, and Loading (ETL): Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:

Ø Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.

Ø Data cleaning, which detects errors in the data and rectifies them when possible.

Ø Data transformation, which converts data from legacy or host format to warehouse format.

Ø Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.

Ø Refresh, which propagates the updates from the data sources to the warehouse.

Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems usually provide a good set of data warehouse management tools. The metadata repository stores information about the data warehouse and its contents, such as the warehouse schema, the definitions of the source data and the mappings from the operational environment to the warehouse, and the algorithms used for summarization.
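A minimal, illustrative sketch of these ETL steps in Python with pandas (the tables, column names, and file target are hypothetical, not from the document):

```python
import pandas as pd

# Extract: in practice, gather data from multiple heterogeneous sources
# (e.g., pd.read_csv or database gateways); small inline frames stand in here.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, None],
    "customer_id": [10, 11, 10, 12],
    "order_date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-10"],
    "amount": [120.0, -5.0, 80.0, 60.0],          # note one obviously bad value
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["East", "West", "East"],
})

# Clean: detect errors and rectify them when possible.
orders = orders.dropna(subset=["order_id"])        # drop rows missing the key
orders["amount"] = orders["amount"].clip(lower=0)  # repair negative amounts

# Transform: convert source formats into the warehouse format.
orders["order_date"] = pd.to_datetime(orders["order_date"])
staged = orders.merge(customers, on="customer_id", how="left")

# Load: sort, summarize, and consolidate into a warehouse table (monthly summary).
monthly = (staged
           .assign(month=staged["order_date"].dt.to_period("M").astype(str))
           .groupby(["month", "region"], as_index=False)["amount"].sum())

# Refresh: propagate the new summary into the warehouse store (CSV as a stand-in).
monthly.to_csv("warehouse_monthly_sales.csv", index=False)
print(monthly)
```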

2a) Describe a starnet query model for querying multidimensional databases.

The querying of multidimensional databases can be based on a starnet model, which consists of
radial lines emanating from a central point, where each line represents a concept hierarchy for a
dimension. Each abstraction level in the hierarchy is called a footprint. These represent the
granularities available for use by OLAP operations such as drill-down and roll-up.

A starnet query model for the AllElectronics data warehouse is shown in Figure 4.13. This starnet
consists of four radial lines, representing concept hierarchies for the dimensions location, customer,
item, and time, respectively. Each line consists of footprints representing abstraction levels of the
dimension. For example, the time line has four footprints: “day,” “month,” “quarter,” and “year.” A
concept hierarchy may involve a single attribute (e.g., date for the time hierarchy) or several
attributes (e.g., the concept hierarchy for location involves the attributes street, city, province or
state, and country). In order to examine the item sales at AllElectronics, users can roll up along the time dimension from month to quarter, or, say, drill down along the location dimension from country to city. Concept hierarchies can be used to generalize data by replacing low-level values (such as
“day” for the time dimension) by higher-level abstractions (such as “year”), or to specialize data by
replacing higher-level abstractions with lower-level values.
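A toy sketch of the starnet's radial lines as concept hierarchies, with simple roll-up/drill-down lookups; the customer levels are assumed for illustration, and the Python representation itself is not part of the text:

```python
# One concept hierarchy per dimension, listed from the most detailed footprint
# (left) to the most general (right).
starnet = {
    "time":     ["day", "month", "quarter", "year"],
    "location": ["street", "city", "province_or_state", "country"],
    "item":     ["item_name", "brand", "type"],
    "customer": ["name", "category", "group"],   # assumed levels for illustration
}

def roll_up(dimension, level):
    """Return the next-coarser footprint along a dimension, if any."""
    levels = starnet[dimension]
    i = levels.index(level)
    return levels[i + 1] if i + 1 < len(levels) else None

def drill_down(dimension, level):
    """Return the next-finer footprint along a dimension, if any."""
    levels = starnet[dimension]
    i = levels.index(level)
    return levels[i - 1] if i > 0 else None

print(roll_up("time", "month"))           # quarter
print(drill_down("location", "country"))  # province_or_state
```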

b) Describe OLAP operations in the Multidimensional Data Model.


1 B. Explain the star schema and fact constellation schemas.

Star Schema:
Ø Each dimension in a star schema is represented by only one dimension table.

Ø This dimension table contains the set of attributes.

Ø The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.

Ø There is a fact table at the center. It contains the keys to each of four dimensions.

Ø The fact table also contains the attributes, namely dollars sold and units sold.

Ø Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains
the attribute set {location_key, street, city, province_or_state,country}. This constraint may cause data redundancy. For example,
"Vancouver" and "Victoria" both the cities are in the Canadian province of British Columbia. The entries for such cities may cause data
redundancy along the attributes province_or_state and country.
Characteristics of Star Schema:
Ø Every dimension in a star schema is represented by only one dimension table.

Ø Each dimension table contains a set of descriptive attributes.

Ø A dimension table is joined to the fact table using a foreign key.

Ø The dimension tables are not joined to each other.

Ø The fact table contains the dimension keys and the measures.

Ø The Star schema is easy to understand and provides optimal disk usage.

Ø The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a separate Country lookup table, as an OLTP design would have.

Ø The schema is widely supported by BI Tools.

Advantages:

Ø (i) Simplest and Easiest

Ø (ii) It optimizes navigation through database

Ø (iii) Most suitable for Query Processing
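A minimal pandas sketch of the star schema described above: a central fact table joined to a dimension table through its foreign key (the tables and values are made up for illustration):

```python
import pandas as pd

# Dimension table (one table per dimension, holding descriptive attributes).
# Note the redundancy mentioned above: Vancouver and Victoria repeat the province/country.
dim_location = pd.DataFrame({
    "location_key": [1, 2],
    "city": ["Vancouver", "Victoria"],
    "province_or_state": ["British Columbia", "British Columbia"],
    "country": ["Canada", "Canada"],
})

# Fact table: keys to each dimension plus the measures dollars_sold and units_sold.
fact_sales = pd.DataFrame({
    "time_key": [10, 10, 11],
    "item_key": [100, 101, 100],
    "location_key": [1, 2, 1],
    "dollars_sold": [1200.0, 800.0, 450.0],
    "units_sold": [3, 2, 1],
})

# A typical star-schema query: join fact to dimension on the foreign key,
# then aggregate the measures by a dimension attribute.
sales_by_city = (fact_sales
                 .merge(dim_location, on="location_key")
                 .groupby("city", as_index=False)[["dollars_sold", "units_sold"]]
                 .sum())
print(sales_by_city)
```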


Fact Constellation Schema:
Ø A Fact constellation means two or more fact tables sharing one or more dimensions. It is also called Galaxy schema.

Ø A fact constellation schema describes the logical structure of a data warehouse or data mart. It can be designed with a collection of de-normalized fact tables and shared (conformed) dimension tables.
A fact constellation schema is shown in the figure below.
Ø This schema defines two fact tables, sales, and shipping. Sales are treated along four dimensions, namely, time, item, branch, and
location.

Ø The schema contains a fact table for sales that includes keys to each of the four dimensions, along with two measures: Rupee_sold and
units_sold.

Ø The shipping table has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and to_location, and two measures:
Rupee_cost and units_shipped.

Ø It is also possible to share dimension tables between fact tables. For example, time, item, and location dimension tables are shared
between the sales and shipping fact table.

Disadvantages:
(i) Complex due to multiple fact tables
(ii) It is difficult to manage
(iii) Dimension Tables are very large.

2.A) What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).

The three main types of data warehouse usage are information processing, analytical processing, and data mining. Let's discuss each one and
then delve into the motivation behind OLAP mining (OLAM).

1. Information Processing:

Information processing in a data warehouse involves collecting, storing, and managing large volumes of data from various sources to support
day-to-day business operations. The primary goal is to provide a centralized repository of integrated data that can be accessed and updated in
real-time to support transactional activities.

Key characteristics:

 Real-time data updates: Information processing focuses on capturing and maintaining the most current state of the data to support
operational processes.

 OLTP (Online Transaction Processing): The focus is on efficient handling of frequent and small-scale transactions.

Use cases:

 Online order processing

 Inventory management
 Customer relationship management (CRM) systems

2. Analytical Processing:

Analytical processing in a data warehouse involves querying and analyzing historical data to gain insights, identify patterns, and make
strategic decisions. It emphasizes providing fast response times for complex analytical queries and data summarization.

Key characteristics:

 Historical data analysis: Analytical processing deals with large volumes of historical data to identify trends and patterns over
time.

 OLAP (Online Analytical Processing): The focus is on supporting complex queries and multidimensional analysis.

Use cases:

 Business intelligence reporting

 Key performance indicator (KPI) analysis

 Market trend analysis

3. Data Mining:

Data mining in a data warehouse involves the discovery of valuable patterns, correlations, and insights from large datasets. It uses statistical
and machine learning techniques to find hidden relationships within the data and predict future trends.

Key characteristics:

 Advanced data analysis: Data mining goes beyond standard analytical processing by discovering new knowledge and patterns in
the data.

 Predictive modeling: It involves building models that can predict future trends and behaviors based on historical data.

Use cases:

 Customer segmentation and profiling

 Fraud detection

 Recommender systems

Now, let's discuss the motivation behind OLAP mining (OLAM):

OLAP mining (OLAM) is a combination of Online Analytical Processing (OLAP) and data mining techniques. The motivation behind
OLAM is to extend the capabilities of traditional OLAP systems by incorporating data mining algorithms to discover deeper insights and
patterns from the multidimensional data stored in the data warehouse.

Key motivations for OLAP mining (OLAM) include:

1. Enhanced Decision Support: OLAM enhances decision-making processes by providing advanced analytical capabilities. It allows
users to uncover hidden relationships and patterns in data that may not be apparent through traditional OLAP analysis alone.

2. Pattern Discovery: OLAM employs data mining techniques to discover previously unknown patterns, trends, and associations in
multidimensional data. This can lead to actionable insights and a deeper understanding of business processes.

3. Predictive Analysis: By combining OLAP with data mining, OLAM enables predictive analysis, allowing organizations to make
data-driven forecasts and anticipate future trends based on historical data.
4. Deeper Insights: OLAM goes beyond standard OLAP aggregations and drill-downs to reveal deeper insights into data. It enables
users to identify outliers, anomalies, and other valuable patterns that may influence business strategies.

5. Complex Data Relationships: Data mining algorithms in OLAM can uncover complex relationships between dimensions that may
not be apparent through simple OLAP queries.

Overall, OLAP mining (OLAM) bridges the gap between OLAP and data mining, allowing organizations to make more informed decisions
and gain a competitive edge by leveraging the power of advanced analytics on multidimensional data.

2 B) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and
charge, where charge is the fee that a doctor charges a patient for a visit. Enumerate three classes of schemas that are popularly
used for modeling data warehouses and explain.

In the context of a data warehouse with dimensions like time, doctor, and patient, and measures such as count and charge, here are
three popular classes of schemas used for modeling data warehouses:

1. Star Schema:

The star schema is a widely used and simple schema design for data warehousing. In this schema, there is one central fact table that holds the
measures (count and charge) and is surrounded by dimension tables (time, doctor, and patient) that provide context to the measures. The fact
table contains foreign keys to link to the dimension tables.

Explanation:

 Fact Table: Contains the quantitative measures (count and charge) and foreign keys to connect to the dimension tables.

 Dimension Tables: Each dimension table represents a specific attribute or dimension, such as time, doctor, and patient. These
tables contain descriptive attributes related to the respective dimension.

Advantages:

 Simple and easy to understand.

 Fast query performance as there are limited joins involved.

 Denormalized structure allows for efficient aggregation of data.

2. Snowflake Schema:

The snowflake schema is an extension of the star schema where dimension tables are further normalized into multiple related tables. In this
schema, the dimension tables are broken down into sub-dimensions, reducing data redundancy and improving data integrity.

Explanation:

 Fact Table: Same as in the star schema, contains the measures and foreign keys.

 Dimension Tables: Dimension tables might be further normalized into sub-dimension tables. For example, the doctor dimension
may have separate tables for doctor details, specialty, and location, with relationships between them.

Advantages:

 Reduced data redundancy due to normalization.

 Improved data integrity and consistency.

 Potentially better storage efficiency.


3. Fact Constellation (Galaxy) Schema:

The fact constellation schema, also known as the galaxy schema, is a complex schema design that consists of multiple fact tables sharing
dimension tables. This schema is used when dealing with heterogeneous data with different grain levels.

Explanation:

 Fact Tables: Multiple fact tables, each containing different measures related to specific business processes. For example, one fact
table may store patient-related measures, while another fact table stores doctor-related measures.

 Dimension Tables: Shared dimension tables are used across all fact tables to maintain consistency and reduce redundancy.

Advantages:

 Supports complex scenarios with multiple independent business processes or varying grain levels of data.

 Provides flexibility in organizing data for different analytical purposes.

Each of these schema designs has its own advantages and trade-offs. The choice of schema depends on the specific requirements of the data
warehouse, the complexity of the data being analyzed, and the preferred querying and reporting performance.

3. a) Explain datawarehouse architecture and models.

Data warehouses often adopt a three-tier architecture


Ø The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources. These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
The middle tier is an OLAP server that is typically implemented using either

(a) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or

(b) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.

The top tier is a front end client layer, which contains query and reporting tools, analysis tools and data mining tools(ex: trend analysis,
prediction….)

Types of Data Warehouse Models

Enterprise Warehouse

An enterprise warehouse collects all of the information about subjects spanning the entire organization. It supports corporate-wide data integration, usually from one or more operational systems or external data providers, and it is cross-functional in scope. It generally contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, UNIX superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.

Data Mart

A data mart includes a subset of corporate-wide data that is of value to a specific collection of users. The scope is confined to particular
selected subjects. For example, a marketing data mart may restrict its subjects to the customer, items, and sales. The data contained in the
data marts tend to be summarized.

Data Marts is divided into two parts:

Independent Data Mart: An independent data mart is sourced from data captured from one or more operational systems or external data providers, or from data generated locally within a particular department or geographic area.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses.

Virtual Warehouses
A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

b) Explain OLAP operations and different types of OLAP Server architectures.

OLAP OPERATIONS:
Ø In the multidimensional model, the records are organized into various dimensions, and each dimension includes multiple levels of
abstraction described by concept hierarchies.
Ø This organization support users with the flexibility to view data from various perspectives.

Ø A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and searching of the records at hand. Hence, OLAP supports a user-friendly environment for interactive data analysis.

Ø Consider the OLAP operations which are to be performed on multidimensional data.

Ø Consider a data cube for the sales of a shop. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.

OLAP has five different operations:


(i) Roll-up

(ii) Drill-down

(iii) Slice

(iv) Dice

(v) Pivot

Roll-up:
Ø The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is like zooming out on the data cube.

Ø It is also known as the drill-up or aggregation operation.

Ø The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for location is defined as the order street < city < province or state < country.

Ø The roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the level of country.

Ø When roll-up is performed by dimension reduction, one or more dimensions are removed from the cube.

Ø For example, consider a sales data cube having the two dimensions location and time. Roll-up may be performed by removing the time dimension, resulting in an aggregation of the total sales by location rather than by location and by time.

Drill-Down
Ø The drill-down operation is the reverse operation of roll-up.
Ø It is also called roll-down operation.

Ø Drill-down is like zooming-in on the data cube.

Ø It navigates from less detailed data to more detailed data. Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions.

Ø Figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy which is defined as day,
month, quarter, and year.

Ø Drill-down appears by descending the time hierarchy from the level of the quarter to a more detailed level of the month.

Ø Because a drill-down adds more details to the given data, it can also be performed by adding a new dimension to a cube.

Slice:
Ø A slice is a subset of the cube corresponding to a single value for one or more members of a dimension.

Ø The slice operation produces a new sub-cube by selecting on one particular dimension of a given cube.

Ø For example, a slice operation is executed when the user makes a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So the slice operation performs a selection on one dimension of the given cube, thus resulting in a sub-cube.

Ø Here slice is performed on the dimension "time" using the criterion time = "Q1".

Ø It forms a new sub-cube by selecting one or more dimensions.

(Figure: the slice operation.)
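A compact pandas sketch of roll-up, drill-down, slice, and dice on a toy sales table (the data, column names, and the use of pandas are illustrative assumptions; dice, listed above but not detailed, selects on two or more dimensions):

```python
import pandas as pd

# Toy sales data with time, location, and item dimensions at their finest levels.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "item":    ["TV", "PC", "TV", "Phone"],
    "dollars_sold": [1000, 600, 1200, 300],
})

# Roll-up: climb the location hierarchy from city to country (coarser granularity).
rollup = sales.groupby(["quarter", "country", "item"], as_index=False)["dollars_sold"].sum()

# Drill-down: descend the time hierarchy from quarter to month (finer granularity).
drilldown = sales.groupby(["month", "city", "item"], as_index=False)["dollars_sold"].sum()

# Slice: select a single value on one dimension (time = "Q1"), yielding a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Vancouver")]
```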
c) Define data warehouse. Explain data warehouse implementation.

Ans. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries
be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support
highly efficient cube computation techniques, access methods, and query processing techniques. In
this section, we present an overview of methods for the efficient implementation of data warehouse
systems.

how to compute data cubes efficiently.

how OLAP data can be indexed, using either bitmap or join indices.

how OLAP queries are processed

various types of warehouse servers for OLAP processing.

The Compute Cube Operator and the Curse of Dimensionality

One approach to cube computation extends SQL so as to include a compute cube operator. The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation. This can require excessive storage space, especially for large numbers of dimensions. We start with an intuitive look at what is involved in the efficient computation of data cubes.

A data cube is a lattice of cuboids. Suppose that you want to create a data cube for AllElectronics
sales that contains the following: city, item, year, and sales in dollars. You want to be able to analyze
the data, with queries such as the following: “Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.” “Compute the sum of sales, grouping by item.”

What is the total number of cuboids, or group-by’s, that can be computed for this data cube? Taking the three attributes city, item, and year as the dimensions for the data cube, and sales in dollars as the measure, the total number of cuboids, or group-by’s, that can be computed for this data cube is 2^3 = 8. The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are not grouped).
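A small illustration of this count: with Python's itertools, every subset of {city, item, year} corresponds to one cuboid, giving 2^3 = 8 group-by’s (a sketch, not part of the original text):

```python
from itertools import combinations

dimensions = ["city", "item", "year"]

# Every subset of the dimensions defines one cuboid (group-by);
# the empty subset () is the apex cuboid, the full set is the base cuboid.
cuboids = [combo
           for r in range(len(dimensions) + 1)
           for combo in combinations(dimensions, r)]

print(len(cuboids))   # 8, i.e., 2**3
for c in cuboids:
    print(c)          # (), ('city',), ..., ('city', 'item', 'year')
```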

There are three choices for data cube materialization given a base cuboid:

1. No materialization: Do not precompute any of the “nonbase” cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.

2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.

3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids. Alternatively, we may compute a subset of the cube, which contains only those cells that satisfy some user-specified criterion, such as where the tuple count of each cell is above some threshold. We use the term subcube to refer to the latter case, where only some of the cells may be precomputed for various cuboids. Partial materialization represents an interesting trade-off between storage space and response time.

Indexing OLAP Data: Bitmap Index and Join Index

The bitmap indexing method is popular in OLAP
products because it allows quick searching in data cubes. The bitmap index is an alternative
representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a distinct
bit vector, Bv, for each value v in the attribute’s domain. If a given attribute’s domain consists of n
values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the
attribute has the value v for a given row in the data table, then the bit representing that value is set
to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
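A minimal sketch of a bitmap index for one attribute, following the description above (the sample values are made up):

```python
# Build a bitmap index for the attribute "item" over a small table.
# One bit vector per distinct value; bit i is 1 if row (RID) i has that value.
rows = ["TV", "PC", "TV", "Phone", "PC"]          # attribute values for RIDs 0..4

bitmap = {}
for rid, value in enumerate(rows):
    bitmap.setdefault(value, [0] * len(rows))[rid] = 1

print(bitmap["TV"])     # [1, 0, 1, 0, 0]
print(bitmap["PC"])     # [0, 1, 0, 0, 1]

# A conjunctive selection such as item = "TV" AND item = "PC" reduces to fast
# bitwise operations over the bit vectors (all zeros here, since no row is both).
tv_and_pc = [a & b for a, b in zip(bitmap["TV"], bitmap["PC"])]
```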

The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast,
join indexing registers the joinable rows of two relations from a relational database. For example, if
two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record
contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations,
respectively. Hence, the join index records can identify joinable tuples without performing costly join
operations. Join indexing is especially useful for maintaining the relationship between a foreign key2
and its matching primary keys, from the joinable relation. The star schema model of data
warehouses makes join indexing attractive for crosstable search, because the linkage between a fact
table and its corresponding dimension tables comprises the fact table’s foreign key and the
dimension table’s primary key. Join indexing maintains relationships between attribute values of a
dimension (e.g., within a dimension table) and the corresponding rows in the fact table. Join indices
may span multiple dimensions to form composite join indices. We can use join indices to identify
subcubes that are of interest.

Join indexing. In Example 3.4, we defined a star schema for AllElectronics of the form “sales star
[time, item, branch, location]: dollars sold = sum (sales in dollars).” An example of a join index
relationship between the sales fact table and the location and item dimension tables is shown in
Figure 4.16. For example, the “Main Street” value in the location dimension table joins with tuples
T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item dimension table
joins with tuples T57 and T459 of the sales fact table. The corresponding join index tables are shown
in the accompanying figure.
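A toy sketch of the join index idea: precomputed lists of joinable fact-table RIDs per dimension value, using the Main Street and Sony-TV tuples mentioned above (the dictionary representation is an illustrative assumption):

```python
# Join index from dimension attribute values to the RIDs of joinable fact tuples.
location_join_index = {
    "Main Street": ["T57", "T238", "T884"],   # sales tuples joining with this location
}
item_join_index = {
    "Sony-TV": ["T57", "T459"],               # sales tuples joining with this item
}

# A composite lookup (location = "Main Street" AND item = "Sony-TV") is just an
# intersection of the precomputed RID lists, with no join performed at query time.
matching = set(location_join_index["Main Street"]) & set(item_join_index["Sony-TV"])
print(matching)   # {'T57'}
```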
Efficient Processing of OLAP Queries

The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows:

1. Determine which operations should be performed on the available cuboids: This involves
transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the
query into corresponding SQL and/or OLAP operations. For example, slicing and dicing a data cube
may correspond to selection and/or projection operations on a materialized cuboid.

2. Determine to which materialized cuboid(s) the relevant operations should be applied: This involves
identifying all of the materialized cuboids that may potentially be used to answer the query, pruning
the set using knowledge of “dominance” relationships among the cuboids, estimating the costs of
using the remaining materialized cuboids, and selecting the cuboid with the least cost.

Example 4.9 OLAP query processing. Suppose that we define a data cube for AllElectronics of the
form “sales cube [time, item, location]: sum(sales in dollars).” The dimension hierarchies used are
“day < month < quarter < year” for time; “item name < brand < type” for item; and “street < city <
province or state < country” for location. Suppose that the query to be processed is on {brand,
province or state}, with the selection constant “year = 2010.” Also, suppose that there are four
materialized cuboids available, as follows:

cuboid 1: {year, item name, city}

cuboid 2: {year, brand, country}

cuboid 3: {year, brand, province or state}

cuboid 4: {item name, province or state},

where year = 2010.

“Which of these four cuboids should be selected to process the query?” Finer-
granularity data cannot be generated from coarser-granularity data. Therefore, cuboid 2 cannot be
used because country is a more general concept than province or state. Cuboids 1, 3, and 4 can be
used to process the query because (1) they have the same set or a superset of the dimensions in the
query, (2) the selection clause in the query can imply the selection in the cuboid, and (3) the
abstraction levels for the item and location dimensions in these cuboids are at a finer level than
brand and province or state, respectively.

UNIT-II

3a) Explain data mining as a KDD Process?

Data Mining is defined as extracting information from


huge sets of data. In other words, we can say that data
mining is the procedure of mining knowledge from data.
The information or knowledge extracted so can be used
for any of the following applications −
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

Some people treat data mining as a synonym for knowledge discovery from data (KDD), while others view data mining as an essential step in the process of knowledge discovery. The steps involved in the knowledge discovery process are:

 Data Cleaning − In this step, the noise and inconsistent data is removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task
are retrieved from the database.
 Data Transformation − In this step, data is transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations.
 Data Mining − In this step, intelligent methods are applied
in order to extract data patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.
b) What are the major issues in data mining?

Ans. Data mining is a dynamic and fast-expanding field with great strengths. The major issues can be divided into five groups:
a) Mining Methodology
b) User Interaction
c) Efficiency and scalability
d) Diverse Data Types Issues
e) Data mining society
a)Mining Methodology:
It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different
users may be interested in different kinds of knowledge.
Therefore it is necessary for data mining to cover a broad
range of knowledge discovery task.
 Mining knowledge in multidimensional space – when
searching for knowledge in large datasets, we can explore the
data in multidimensional space.
 Handling noisy or incomplete data − the data cleaning
methods are required to handle the noise and incomplete
objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered
patterns will be poor.
 Pattern evaluation − the patterns discovered may be uninteresting because they either represent common knowledge or lack novelty; interestingness measures are needed to guide and evaluate the discovery process.
b) User Interaction:
 Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns,
providing and refining data mining requests based on the
returned results.
 Incorporation of background knowledge − To guide discovery
process and to express the discovered patterns, the
background knowledge can be used. Background knowledge
may be used to express the discovered patterns not only in
concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − Data
Mining Query language that allows the user to describe ad hoc
mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data
mining.
 Presentation and visualization of data mining results − Once
the patterns are discovered it needs to be expressed in high
level languages, and visual representations. These
representations should be easily understandable.

c) Efficiency and scalability


There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order
to effectively extract the information from huge amount of data
in databases, data mining algorithm must be efficient and
scalable.
 Parallel, distributed, and incremental mining algorithms − The
factors such as huge size of databases, wide distribution of
data, and complexity of data mining methods motivate the
development of parallel and distributed data mining
algorithms. These algorithms divide the data into partitions
which is further processed in a parallel fashion. Then the
results from the partitions is merged. The incremental
algorithms, update databases without mining the data again
from scratch.
d)Diverse Data Types Issues
 Handling of relational and complex types of data − The
database may contain complex data objects, multimedia data
objects, spatial data, temporal data etc. It is not possible for
one system to mine all these kind of data.
Mining information from heterogeneous databases and
global information -- The data is available at different data
sources on LAN or WAN. These data source may be structured,
semi structured or unstructured. Therefore mining the
knowledge from them adds challenges to data mining.
e) Data Mining and Society
 Social impacts of data mining – With data mining penetrating
our everyday lives, it is important to study the impact of data
mining on society.
 Privacy-preserving data mining – while data mining helps scientific discovery, business management, economic recovery, and security protection, it may also pose a risk to privacy; privacy-preserving methods aim to obtain valid mining results without disclosing sensitive personal information.
 Invisible data mining – we cannot expect everyone in society
to learn and master data mining techniques. More and more
systems should have data mining functions built within so that
people can perform data mining or use data mining results
simply by mouse clicking, without any knowledge of data
mining algorithms.

c)Which technologies are used in datamining?


1. Statistics:
 It uses the mathematical analysis to express representations,
model and summarize empirical data or real world
observations.
 Statistical analysis involves the collection of methods,
applicable to large amount of data to conclude and report the
trend.
2. Machine learning
 Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
 When the new data is entered in the computer, algorithms
help the data to grow or change due to machine learning.
 In machine learning, an algorithm is constructed to predict the
data from the available database (Predictive analysis).
 It is related to computational statistics.

The four types of machine learning are:


a. Supervised learning
 It is based on the classification.
 It is also called as inductive learning. In this method, the
desired outputs are included in the training dataset.
b. Unsupervised learning
 Unsupervised learning is based on clustering. Clusters are
formed on the basis of similarity measures and desired outputs
are not included in the training dataset.
c. Semi-supervised learning

Semi-supervised learning adds some desired outputs (labeled examples) to an otherwise unlabeled training dataset in order to learn the appropriate functions. This method avoids the need for a large number of labeled examples (i.e., desired outputs).
d. Active learning
 Active learning is a powerful approach for analyzing data efficiently.
 The algorithm is designed so that it chooses the data points for which it needs the desired outputs (labels) and asks the user to provide them, so the user plays an important role in this type of learning.
3. Information retrieval
 Information retrieval deals with searching for and retrieving information, where the representations of the semantics of objects (text, images) may be uncertain.
For example: Finding relevant information from a large document.

4. Database systems and data warehouse


 Databases are used for the purpose of recording data as well as for data warehousing.
 Online transaction processing (OLTP) uses databases for day-to-day transaction purposes.
 Data warehouses are used to store historical data, which helps in taking strategic business decisions.
 Data warehouses are used for online analytical processing (OLAP), which helps to analyze the data.
5. PatternRecognition:
Pattern recognition is the automated recognition of
patterns and regularities in data. Pattern recognition is
closely related to artificial intelligence and machine
learning, together with applications such as data mining
and knowledge discovery in databases (KDD), and is often
used interchangeably with these terms.
6. Visualization:
It is the process of extracting and visualizing the
data in a very clear and understandable way without any
form of reading or writing by displaying the results in the
form of pie charts, bar graphs, statistical representation
and through graphical forms as well.
7. Algorithms:
To perform data mining techniques we have to design best
algorithms.
8. High Performance Computing:
High Performance Computing most generally refers
to the practice of aggregating computing power in a way
that delivers much higher performance than one could get
out of a typical desktop computer or workstation in order
to solve large problems in science, engineering, or
business.

d) What kind of data can be mined?

1. Flat Files
2. Relational Databases
3. DataWarehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)

Flat Files
 Flat files are defined as data files in text
form or binary form with a structure that
can be easily extracted by data mining
algorithms.
 Data stored in flat files have no
relationship or path among themselves,
like if a relational database is stored on flat
file, and then there will be no relations
between the tables.
 Flat files are represented by data dictionary. Eg: CSV
file.
 Application: Used in Data Warehousing to
store data, Used in carrying data to and
from server, etc.

Relational Databases
A Relational database is defined as the
collection of data organized in tables
with rows and columns.
 Physical schema in Relational databases is
a schema which defines the structure of
tables.
 Logical schema in Relational databases is
a schema which defines the relationship
among tables.
 Standard API of relational database is SQL.
 Application: Data Mining, ROLAP model, etc.
DataWarehouse
 A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making.
 There are three types of
datawarehouse: Enterprise
datawarehouse, Data Mart and Virtual
Warehouse.

 Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
 Application: Business decision making, Data mining,
etc.
Transactional Databases
 Transactional databases are a collection
of data organized by time stamps, date,
etc to represent transaction in
databases.
 This type of database has the capability
to roll back or undo its operation when a
transaction is not completed or
committed.
 Highly flexible system where users can
modify information without changing any
sensitive information.
 Follows ACID property of DBMS.
 Application: Banking, Distributed systems, Object
databases, etc.

Multimedia Databases
 Multimedia databases consists audio, video, images
and text media.
 They can be stored on Object-Oriented Databases.
 They are used to store complex information in pre-
specified formats.
 Application: Digital libraries, video-
on demand, news-on demand,
musical database, etc

Spatial Database
 Store geographical information.
 Stores data in the form of coordinates, topology,
lines, polygons, etc.
 Application: Maps, Global positioning, etc.
Time-series Databases
 Time series databases contain stock exchange data
and user logged activities.
 Handles array of numbers indexed by time, date, etc.
 It requires real-time analysis.
 Application: eXtremeDB, Graphite, InfluxDB, etc.

WWW
 WWW refers to World wide web is a
collection of documents and resources like
audio, video, text, etc which are identified
by Uniform Resource Locators (URLs)
through web browsers, linked by HTML
pages, and accessible via the Internet
network.
 It is the most heterogeneous repository as it collects
data from multiple resources.
 It is dynamic in nature as Volume of data is
continuously increasing and changing.
 Application: Online shopping, Job search, Research,
studying, etc.

(OR)

4a) Briefly discuss about types of attributes and measurements.

Attribute:

An attribute is a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc. The attribute types can be represented as follows:

1. Nominal Attributes – related to names: The values of a nominal attribute are names of things or some kind of symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes.

Example:

Attribute: Color; Values: black, green, brown, red.
2. Binary Attributes: Binary data has only 2
values/states. For Example yes or no, affected or
unaffected, true or false.
i) Symmetric: Both values are equally important (Gender).
ii) Asymmetric: Both values are not equally important (Result).

3. Ordinal Attributes: An ordinal attribute contains values that have a meaningful sequence or ranking (order) between them, but the magnitude between successive values is not known; the order shows what is more important but does not indicate by how much.

Example: Attribute: Grade; Values: O, S, A, B, C, D, F.
4. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numeric attributes are of two types:

i. An interval-scaled attribute has values whose differences are interpretable, but it does not have a true reference (zero) point. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature is twice the value of another day's, we cannot say that it is twice as hot.

ii. A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute the differences between values, and the mean, median, mode, quantile range, and five-number summary can be given.
5. Discrete: Discrete attributes have a finite or countably infinite set of values, which can be numerical or categorical.

Example: Attribute: Profession; Values: teacher, businessman, peon.
Attribute: ZIP code; Values: 521157, 521301.

6. Continuous: Continuous attributes have an infinite number of possible values and are typically represented as floating-point numbers; there can be many values between 2 and 3.

Example: Attribute: Height; Values: 5.4, 5.7, 6.2, etc.
Attribute: Weight; Values: 50, 65, 70, 73, etc.

General Characteristics of Data Sets


Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different than moderate or high-dimensional data. Indeed, the difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality. Because of this, an important motivation in preprocessing the data is dimensionality reduction.

Sparsity: For some data sets, such as those with asymmetric


features, most attributes of an object have values of 0; in
many cases, fewer than 1% of the entries are non-zero.

Resolution: It is frequently possible to obtain data at


different levels of resolution, and often the properties of the
data are different at different resolutions. For instance, the
surface of the Earth seems very uneven at a resolution of a
few meters, but is relatively smooth at a resolution of tens
of kilometers.

b) Explain about similarity and dissimilarity between simple attributes and dataobjects.

 Similarity measure
– Numerical measure of how alike two data objects are.
– Higher when objects are more alike.
– Often falls in the range [0, 1].
 Dissimilarity measure
– Numerical measure of how different two data objects are.
– Lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.
 Proximity refers to either a similarity or a dissimilarity.
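A small Python sketch of similarity and dissimilarity between simple attribute values (nominal and interval/ratio); the specific formulas shown are common textbook choices, assumed here for illustration:

```python
def dissim_nominal(x, y):
    """Nominal attributes: dissimilarity is 0 if the values match, 1 otherwise."""
    return 0 if x == y else 1

def sim_nominal(x, y):
    """Nominal similarity is the complement of the dissimilarity."""
    return 1 - dissim_nominal(x, y)

def dissim_interval(x, y):
    """Interval/ratio attributes: absolute difference of the values."""
    return abs(x - y)

def sim_interval(x, y):
    """One common choice: map the dissimilarity into the range (0, 1]."""
    return 1.0 / (1.0 + dissim_interval(x, y))

print(dissim_nominal("red", "blue"), sim_nominal("red", "red"))      # 1 1
print(dissim_interval(5.0, 2.0), round(sim_interval(5.0, 2.0), 3))   # 3.0 0.25
```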

c) What are the value ranges of the following normalization methods?

(i) min-max normalization (ii) z-score normalization (iii) z-score normalization using the mean
absolute deviation instead of standard deviation(iv) normalization by decimal scaling.

a) Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, v_i, of A to v_i' in the range [new_min_A, new_max_A] by computing

v_i' = ((v_i - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A. Its values fall within [new_min_A, new_max_A].

Example (min-max normalization): Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716.
b) Z-score normalization: The values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A. A value, v_i, of A is normalized to v_i' by computing

v_i' = (v_i - Ā) / σ_A

where Ā and σ_A are the mean and standard deviation, respectively, of attribute A. The normalized values are not bounded to a fixed range. If the mean absolute deviation s_A is used in place of the standard deviation, the same formula applies with σ_A replaced by s_A, which is more robust to outliers.

Example (z-score normalization): Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.

c) Normalization by decimal scaling: Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value, v_i, of A is normalized to v_i' by computing

v_i' = v_i / 10^j

where j is the smallest integer such that max(|v_i'|) < 1. The normalized values therefore lie in the range (-1, 1).

Example (decimal scaling): Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
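A short Python sketch implementing the three normalization methods above and reproducing the worked examples (the function names are illustrative):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear map of [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: center on the mean, scale by the standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Decimal scaling: divide by 10^j, where j is the smallest integer such that max(|v'|) < 1."""
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

# The worked examples from the text:
print(round(min_max(73600, 12000, 98000), 3))                   # 0.716
print(round(z_score(73600, 54000, 16000), 3))                   # 1.225
print(decimal_scaling(-986, 986), decimal_scaling(917, 986))    # -0.986 0.917
```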

UNIT-III

5. a) Use the C4.5 algorithm to build a decision tree for classifying the following objects.

Class Size Color Shape

A Small Yellow Round

A Big Yellow Round

A Big Red Round

A Small Red Round.

B Small Black Round

B Big Black Cube

B Big Yellow Cube

B Big Black Round

B Small Yellow Cube

The C4.5 algorithm is a decision tree algorithm used for classification. It works by
recursively partitioning the dataset into subsets based on the attributes that provide the best
split, measured using information gain or another appropriate metric. Let's build a decision
tree for the given objects based on the attributes: Size, Color, and Shape.

Step 1: Calculate the entropy of the entire dataset:

Entropy(Dataset) = -p(A) * log2(p(A)) - p(B) * log2(p(B))

Where:

 p(A) is the proportion of Class A objects in the dataset.

 p(B) is the proportion of Class B objects in the dataset.


p(A) = 4/9 p(B) = 5/9

Entropy(Dataset) = - (4/9) * log2(4/9) - (5/9) * log2(5/9) ≈ 0.991

Step 2: Calculate the information gain for each attribute (Size, Color, Shape):

Information Gain(Attribute) = Entropy(Dataset) - Weighted Sum of Entropy(Child Datasets)

For Size:

 Small has 4 objects (2 A, 2 B); Big has 5 objects (2 A, 3 B).

Entropy(Small) = - (2/4) * log2(2/4) - (2/4) * log2(2/4) = 1.0
Entropy(Big) = - (2/5) * log2(2/5) - (3/5) * log2(3/5) ≈ 0.971

Weighted Entropy(Size) = (4/9) * 1.0 + (5/9) * 0.971 ≈ 0.984

Information Gain(Size) = 0.991 - 0.984 ≈ 0.007

For Color:

 Yellow has 4 objects (2 A, 2 B); Red has 2 objects (both A); Black has 3 objects (all B).

Entropy(Yellow) = - (2/4) * log2(2/4) - (2/4) * log2(2/4) = 1.0
Entropy(Red) = 0 (pure: all class A)
Entropy(Black) = 0 (pure: all class B)

Weighted Entropy(Color) = (4/9) * 1.0 + (2/9) * 0 + (3/9) * 0 ≈ 0.444

Information Gain(Color) = 0.991 - 0.444 ≈ 0.547

For Shape:

 Round has 6 objects (4 A, 2 B); Cube has 3 objects (all B).

Entropy(Round) = - (4/6) * log2(4/6) - (2/6) * log2(2/6) ≈ 0.918
Entropy(Cube) = 0 (pure: all class B)

Weighted Entropy(Shape) = (6/9) * 0.918 + (3/9) * 0 ≈ 0.612

Information Gain(Shape) = 0.991 - 0.612 ≈ 0.379

Step 3: Apply C4.5's splitting criterion, the gain ratio:

Plain information gain (ID3's criterion) would favor Color here. C4.5 instead divides the gain by the split information of the attribute:

Gain Ratio(Attribute) = Information Gain(Attribute) / Split Info(Attribute)

Split Info(Size) = - (4/9) * log2(4/9) - (5/9) * log2(5/9) ≈ 0.991, so Gain Ratio(Size) ≈ 0.007
Split Info(Color) = - (4/9) * log2(4/9) - (2/9) * log2(2/9) - (3/9) * log2(3/9) ≈ 1.53, so Gain Ratio(Color) ≈ 0.357
Split Info(Shape) = - (6/9) * log2(6/9) - (3/9) * log2(3/9) ≈ 0.918, so Gain Ratio(Shape) ≈ 0.413

Shape has the highest gain ratio, so the decision tree starts by splitting on the Shape attribute. Here's the decision tree:

1. If Shape is Cube:

o Classify as B (all three Cube objects belong to class B).

2. If Shape is Round:

o Classify as A if Color is Red or Yellow; classify as B if Color is Black.

This decision tree correctly classifies all nine objects into classes A and B; the Shape and Color attributes suffice, and Size is not needed.
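A short Python sketch that reproduces the calculation above (entropy, information gain, and C4.5's gain ratio) for the nine objects; it illustrates the computation and is not a full C4.5 implementation:

```python
from math import log2
from collections import Counter

# (class, size, color, shape) for the nine objects in the table above.
data = [
    ("A", "Small", "Yellow", "Round"), ("A", "Big", "Yellow", "Round"),
    ("A", "Big", "Red", "Round"),      ("A", "Small", "Red", "Round"),
    ("B", "Small", "Black", "Round"),  ("B", "Big", "Black", "Cube"),
    ("B", "Big", "Yellow", "Cube"),    ("B", "Big", "Black", "Round"),
    ("B", "Small", "Yellow", "Cube"),
]
ATTRS = {"Size": 1, "Color": 2, "Shape": 3}   # column index of each attribute

def entropy(rows):
    """Entropy of the class label over a list of rows."""
    counts = Counter(r[0] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain_and_ratio(rows, idx):
    """Information gain and C4.5 gain ratio for the attribute at column idx."""
    total, base = len(rows), entropy(rows)
    weighted, split_info = 0.0, 0.0
    for value in set(r[idx] for r in rows):
        subset = [r for r in rows if r[idx] == value]
        p = len(subset) / total
        weighted += p * entropy(subset)
        split_info -= p * log2(p)
    gain = base - weighted
    return gain, (gain / split_info if split_info > 0 else 0.0)

for name, idx in ATTRS.items():
    gain, ratio = gain_and_ratio(data, idx)
    print(f"{name}: gain = {gain:.3f}, gain ratio = {ratio:.3f}")
# Size: 0.007 / 0.007, Color: 0.547 / 0.357, Shape: 0.379 / 0.413 -> split on Shape.
```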

2. What are the new features of the C4.5 algorithm compared with Quinlan's original ID3 algorithm for decision-tree generation?

The C4.5 algorithm, developed by Ross Quinlan as an improvement over his earlier ID3
algorithm, introduced several new features and enhancements. Here are the key differences
and new features of C4.5 compared to the original ID3 algorithm:

1. Handling Continuous Attributes:

o C4.5 can handle both discrete and continuous attributes, whereas ID3 can only
handle discrete attributes. C4.5 achieves this by dynamically selecting
threshold values for continuous attributes, allowing it to split data effectively
based on numerical values.

2. Information Gain Ratio:

o C4.5 uses a modified information gain metric called "Information Gain Ratio"
or "Gain Ratio" to overcome ID3's bias towards attributes with many values.
Gain Ratio adjusts for the number of branches an attribute can split into, which
helps prevent attributes with many values from dominating the tree.

3. Pruning:

o C4.5 incorporates a pruning mechanism to reduce overfitting. Pruning


involves removing branches from the tree that do not significantly improve
accuracy on the validation data. ID3 does not have built-in pruning.

4. Handling Missing Values:

o C4.5 can handle missing attribute values by assigning probabilities to missing


values based on available data. ID3 typically treats missing values as a
separate category, which can lead to suboptimal decisions.

5. Rule Generation:

o C4.5 can generate rules from the decision tree, which can be easier to interpret
and provide insights into the decision-making process. ID3 primarily produces
a decision tree without directly generating rules.
6. Reduced Attribute Inclusion:

o C4.5 can reduce attribute inclusion in the tree by considering whether adding
an attribute to a node will result in significant improvement. This feature helps
produce simpler and more efficient trees compared to ID3, which tends to
include all attributes.

7. Handling Unequal Class Distribution:

o C4.5 handles datasets with unequal class distribution more effectively by


considering the class distribution when calculating information gain. It
addresses the problem of bias toward attributes with many instances of a
single class.

8. Scalability:

o C4.5 is generally more scalable than ID3 due to its pruning mechanism and
better handling of attributes with many values. This makes it suitable for
larger datasets.

9. Improved Handling of Numeric Class Labels:

o C4.5 can handle numeric class labels by converting them into a binary
classification problem, whereas ID3 primarily works with categorical class
labels.

Overall, C4.5 improved upon several limitations of the original ID3 algorithm, making it a
more versatile and effective algorithm for decision tree generation, especially in scenarios
involving real-world datasets with diverse attributes and data types.

b) Explain the concept of cross validation with suitable example.

Cross-validation
The practice of cross-validation is to take a dataset and randomly split it into a number of equally sized segments, called folds. The machine learning algorithm is trained on all but one fold. Cross-validation then tests each fold against a model trained on all of the other folds. This means that each trained model is tested on a segment of the data that it has never seen before. The process is repeated with a different fold being held out during training and then tested, until every fold has been used exactly once as the test set and has been trained on during every other iteration.

The training data is split into five folds. During each iteration, a different fold is set aside to
be used as test data.
The outcome of cross-validation is a set of test metrics that give a reasonable forecast of how
accurately the trained model will be able to predict on data that it has never seen before.
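As an illustration of the five-fold procedure described above, the following sketch uses scikit-learn; the synthetic dataset and the decision-tree model are arbitrary stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the example dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Five folds: each fold is held out once as the test segment
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Each of the five scores comes from a fold the model never saw during training, and their mean is the cross-validated estimate of generalization accuracy.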

4. Elaborate on the pitfalls of model selection and evaluation?


Model selection and evaluation are crucial steps in the process of developing machine
learning models. However, they come with several pitfalls that can lead to suboptimal
performance or misleading conclusions. Here are some common pitfalls:

1. Overfitting: One of the most common pitfalls in model selection is overfitting.


Overfitting occurs when a model learns the training data too well, capturing noise and
random fluctuations rather than underlying patterns. This can lead to poor
generalization performance on unseen data. To avoid overfitting, it's important to use
techniques like cross-validation and regularization.

2. Underfitting: On the other hand, underfitting occurs when a model is too simple to
capture the underlying structure of the data. This often happens when the model is too
constrained or lacks the capacity to represent complex relationships. Underfitting can
lead to poor performance on both the training and test data. To address underfitting,
one might need to use more complex models or feature engineering techniques.

3. Data leakage: Data leakage occurs when information from the test set (or future data)
leaks into the training process, leading to overly optimistic performance estimates.
This can happen if one inadvertently uses information from the test set to inform
model training or hyperparameter tuning. To prevent data leakage, it's essential to
properly partition the data into training and test sets and ensure that no information
from the test set is used during training.

4. Selection bias: Selection bias occurs when the process of selecting the model or
evaluating its performance is biased in some way. For example, if only a subset of the
available data is used for model evaluation, the results may not be representative of
the model's performance on unseen data. To mitigate selection bias, it's important to
use techniques like stratified sampling or cross-validation to ensure that the evaluation
process is fair and unbiased.

5. Improper validation: Using inappropriate validation techniques can also lead to


misleading results. For example, using simple holdout validation instead of more
robust techniques like k-fold cross-validation can lead to high variance in
performance estimates, especially with limited data. It's crucial to choose the right
validation strategy based on the specific characteristics of the data and the modeling
task.

6. Ignoring model assumptions: Different machine learning algorithms make different


assumptions about the underlying data distribution and relationships between
variables. Ignoring these assumptions can lead to models that perform poorly or
behave unexpectedly. It's important to understand the assumptions of the chosen
model and validate whether they hold true for the given dataset.

7. Failure to consider computational resources: Some models may perform well in


terms of predictive accuracy but require excessive computational resources or
memory. Failing to consider computational constraints can lead to models that are
impractical or expensive to deploy in real-world settings. It's essential to balance
model performance with computational feasibility when selecting a model.

8. Limited interpretability: While complex models may offer higher predictive


accuracy, they often lack interpretability, making it challenging to understand how
they arrive at their predictions. This can be problematic in applications where
interpretability is important, such as healthcare or finance. It's essential to strike a
balance between model complexity and interpretability based on the specific
requirements of the problem domain.

Navigating these pitfalls requires careful consideration of the data, model selection process,
and evaluation techniques. By being aware of these potential challenges and employing best
practices in model selection and evaluation, one can develop more robust and reliable
machine learning models.
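To illustrate pitfall 3 (data leakage) in code, the sketch below assumes scikit-learn and a synthetic dataset; placing the scaler inside a pipeline guarantees that scaling statistics are computed only from the training portion of each cross-validation split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Wrong: scaling the full dataset first lets test-fold statistics leak into training.
# Right: put the scaler inside a pipeline so it is re-fitted on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print("Leak-free CV accuracy:", scores.mean())
```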

(OR)

6.a) Distinguish between holdout method and cross-validation

The holdout method and cross-validation are both techniques used for evaluating the
performance of machine learning models, but they differ in how they partition the data and
assess model performance:

1. Holdout Method:

o Partitioning: In the holdout method, the dataset is split into two subsets: a
training set and a validation/test set. Typically, a larger portion of the data
(e.g., 70-80%) is used for training, and the remaining portion (e.g., 20-30%) is
used for validation or testing.

o Training and Evaluation: The model is trained on the training set and then
evaluated on the validation or test set. The performance metrics computed on
the validation/test set are used to estimate the model's generalization
performance.

o Advantages:

 Simple and easy to implement.

 Suitable for large datasets when computational resources are limited.

o Disadvantages:

 Results may have high variance depending on the random partitioning


of the data.

 Limited use of data for training and evaluation, which can lead to less
reliable performance estimates, especially with smaller datasets.

2. Cross-Validation:

o Partitioning: Cross-validation involves partitioning the dataset into k subsets


(folds) of approximately equal size. The model is trained k times, each time
using k-1 folds for training and the remaining fold for validation. This process
is repeated k times, with each fold used exactly once as the validation set.

o Training and Evaluation: For each iteration of training, a different fold is


held out as the validation set, and the model is trained on the remaining folds.
The performance metrics are then averaged over all k iterations to obtain a
more reliable estimate of the model's performance.

o Advantages:

 Provides a more robust estimate of the model's performance by


utilizing the entire dataset for both training and evaluation.

 Helps to mitigate the variance in performance estimates associated


with the holdout method.

o Disadvantages:

 Computationally more expensive, especially with large k values or


complex models.

 May not be suitable for very large datasets due to computational


constraints.
In summary, while the holdout method involves a single split of the data into training and
validation/test sets, cross-validation partitions the data into multiple subsets and iteratively
trains and evaluates the model on different combinations of these subsets. Cross-validation
generally provides a more reliable estimate of model performance, but it can be
computationally more expensive compared to the holdout method
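A small sketch contrasting the two approaches (scikit-learn and the iris dataset are used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier()

# Holdout: one 70/30 split, giving a single performance estimate
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Cross-validation: five estimates, averaged, using the whole dataset
cv_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

print("Holdout accuracy:", holdout_acc)
print("5-fold CV accuracy:", cv_scores.mean(), "+/-", cv_scores.std())
```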
b) Explain in detail about bootstrap. Give its role in classification.

Bootstrap is a resampling technique used in statistics and machine learning for estimating the
distribution of statistics or parameters by repeatedly sampling data points from the original
dataset with replacement. It is particularly useful when the dataset is small or when
uncertainty about a statistic needs to be quantified. Bootstrap has various applications in
classification, including model evaluation, feature selection, and estimating prediction
uncertainty. Let's delve into each aspect:

1. Model Evaluation:

Bootstrap can be used to estimate the performance of a classification model. Here's


how it works:

o Bootstrap Sampling: Generate multiple bootstrap samples by randomly


selecting data points from the original dataset with replacement. Each
bootstrap sample is of the same size as the original dataset.

o Model Training and Testing: Train the classification model on each


bootstrap sample and evaluate its performance on the data points not included
in the bootstrap sample. This process provides an estimate of the model's
performance on unseen data.

o Aggregation: Aggregate the performance metrics (e.g., accuracy, precision,


recall) obtained from multiple bootstrap samples to obtain a more robust
estimate of the model's performance.

2. Feature Selection:

Bootstrap can also be used for feature selection in classification tasks:

o Bootstrap Sampling: Generate multiple bootstrap samples from the original


dataset.

o Feature Subset Selection: For each bootstrap sample, train the classification
model using a subset of features randomly selected from the original feature
set. By evaluating the model's performance on each bootstrap sample, one can
assess the importance of different features in classification.

3. Prediction Uncertainty:

Bootstrap can help estimate prediction uncertainty in classification tasks:


o Bootstrap Sampling: Generate multiple bootstrap samples from the original
dataset.

o Model Training: Train multiple classification models, each on a different


bootstrap sample.

o Prediction Aggregation: For a given test instance, make predictions using


each of the trained models. The variability in predictions across the models
provides an estimate of prediction uncertainty.

4. Outlier Detection:

Bootstrap can aid in outlier detection by estimating the robustness of classification


predictions:

o Bootstrap Sampling: Generate multiple bootstrap samples from the original


dataset.

o Model Training and Testing: Train the classification model on each


bootstrap sample and evaluate its performance. Instances that are consistently
misclassified across bootstrap samples may be considered outliers.

Overall, bootstrap provides a powerful tool for assessing model performance, feature
importance, prediction uncertainty, and outlier detection in classification tasks. It leverages
resampling with replacement to generate multiple datasets, allowing for more robust
estimates and insights into the underlying data distribution. By utilizing bootstrap techniques,
practitioners can make more informed decisions in classification model development and
evaluation.
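A minimal sketch of bootstrap-based model evaluation (the data points left out of a bootstrap sample, often called the out-of-bag points, serve as the test set); the dataset, model, and number of bootstrap rounds are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=7)
rng = np.random.default_rng(7)
n, scores = len(X), []

for _ in range(100):                      # 100 bootstrap rounds
    idx = rng.integers(0, n, size=n)      # sample n indices with replacement
    oob = np.setdiff1d(np.arange(n), idx) # out-of-bag points never seen in training
    if len(oob) == 0:
        continue
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

print("Bootstrap (out-of-bag) accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
```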

D.What is Bayes’ theorem? Explain about Naïve-Bayes classifier.

Bayesian classifiers are statistical classifiers.


They can predict class membership probabilities, such as the probability
that a given tuple belongs to a particular class.
 Bayesian classification is based on Bayes’ theorem.
 Let X be a data tuple. In Bayesian terms, X is considered “evidence” and it is described
by measurements made on a set of n attributes.
 Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
 For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the “evidence” or observed data tuple X.
 P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
 Bayes’ theorem is useful in that it provides a way of calculating the posterior
probability, P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
Naïve Bayesian Classification:
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements
made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum
posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely, that
is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we
maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of
class-conditional independence is made. This presumes that the values of the attributes are
conditionally independent of one another, given the class label of the tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
5. We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training
tuples.
6. For each attribute, we look at whether the attribute is categorical or continuous-valued.
For instance, to compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
 If Ak is continuous-valued, then a little more work is needed: P(xk|Ci) is typically
computed from a Gaussian distribution whose mean and standard deviation are estimated
from the values of Ak for the training tuples of class Ci.
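A minimal, from-scratch sketch of the computation described above (the tiny weather-style dataset is hypothetical, and Laplace smoothing is omitted for brevity):

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: (outlook, windy) -> class
data = [
    (("sunny", "false"), "no"),
    (("sunny", "true"),  "no"),
    (("rainy", "false"), "yes"),
    (("rainy", "true"),  "no"),
    (("overcast", "false"), "yes"),
    (("overcast", "true"),  "yes"),
]

# Prior probabilities P(Ci)
class_counts = Counter(label for _, label in data)
total = len(data)

# Counts needed for the conditional probabilities P(xk | Ci)
cond_counts = defaultdict(Counter)   # key: (class, attribute index) -> value counts
for attrs, label in data:
    for k, value in enumerate(attrs):
        cond_counts[(label, k)][value] += 1

def predict(x):
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / total                                 # P(Ci)
        for k, value in enumerate(x):
            score *= cond_counts[(c, k)][value] / cc       # naive independence: product of P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

print(predict(("rainy", "false")))   # classify an unseen tuple
```

The returned score is proportional to P(X|Ci)P(Ci); the denominator P(X) is ignored because it is the same for every class.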

UNIT-IV

7 a) Explain about support counting in frequent itemset generation. [14M] [CO4] July – 2023

Support counting is a fundamental concept in frequent itemset generation algorithms like


Apriori and FP-Growth. It refers to the process of counting the occurrences of itemsets in a
dataset, which helps identify the frequent itemsets — sets of items that occur together with
sufficient frequency.

Here's how support counting works:

1. Definition of Support:

o Support measures the frequency of occurrence of an itemset in the dataset.

o Mathematically, the support of an itemset X, denoted as supp(X), is the


proportion of transactions in the dataset that contain all the items in X.
o For example, if we have a dataset of 100 transactions and the itemset {milk,
bread} appears in 20 transactions, then the support of {milk, bread} is 20/100
= 0.2 or 20%.

2. Counting Support:

o The process of support counting involves scanning the dataset to count the
occurrences of each itemset.

o For larger itemsets, support counting typically involves candidate generation


and pruning to efficiently count support without examining every possible
combination of items.

o For instance, in the Apriori algorithm, support counting is done incrementally.


Initially, single items' support is counted. Then, larger itemsets' support is
derived from the support of their subsets, utilizing the Apriori property.

3. Thresholding:

o After support counting, a minimum support threshold is set by the user or


determined automatically.

o Frequent itemsets are those whose support exceeds this minimum support
threshold.

o Itemsets that do not meet the minimum support threshold are considered
infrequent and are not considered further in the frequent itemset generation
process.

4. Example:

o Suppose we have a transaction dataset with the following transactions.

o T1: {milk, bread, eggs}


o T2: {milk, bread}
o T3: {bread, butter}
o T4: {milk, eggs}
o T5: {bread, eggs}
o To calculate the support of {milk, bread}, we count how many transactions
contain both milk and bread. From the dataset, T1 and T2 contain both milk
and bread, so the support of {milk, bread} is 2/5 = 0.4 or 40%.

o Similarly, the support of {bread, eggs} is 2/5 = 0.4 or 40%.

Support counting is a crucial step in frequent itemset generation as it helps identify patterns
of co-occurrence between items in a dataset. Frequent itemsets with high support are
considered significant and may indicate meaningful associations between items.
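The support computation can be expressed in a few lines of Python; the sketch below reuses the five example transactions from above:

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"milk", "bread"}, transactions))   # 0.4, matching the example above
print(support({"bread", "eggs"}, transactions))   # 0.4

# Support counts for all 2-itemsets in one pass over the data
counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[pair] += 1
print(counts.most_common(3))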
b) Explain about compact representation of frequent Item set.

The number of frequent itemsets can be very large. For instance, suppose a store is trying to find
relationships among more than 100 items. According to the Apriori principle, if an itemset is
frequent, all of its subsets must also be frequent, so a single frequent 100-itemset implies
C(100, 1) = 100 frequent 1-itemsets,
C(100, 2) frequent 2-itemsets,
C(100, 3) frequent 3-itemsets,
and so on. Summing over all sizes, the number of frequent itemsets that are subsets of this larger
100-itemset is close to 2^100. This is far more than one would want to store, or compute the
support of, on an average computer, and it is for this reason that alternative representations have
been derived which reduce the initial set but can still be used to generate all other frequent
itemsets. The maximal and closed frequent itemsets are two such representations, both subsets of
the larger set of frequent itemsets, and they are discussed in this section.

Maximal and Closed Frequent Itemsets

A maximal frequent itemset is a frequent itemset for which none of its immediate supersets is
frequent. A closed frequent itemset is a frequent itemset for which no immediate superset has the
same support count.

[Figure: relationship between frequent, closed frequent, and maximal frequent itemsets]

In conclusion to this section, it is important to point out the relationship between frequent
itemsets, closed frequent itemsets, and maximal frequent itemsets. Closed and maximal frequent
itemsets are subsets of the frequent itemsets, and the maximal frequent itemsets are the more
compact representation because they are a subset of the closed frequent itemsets. Closed frequent
itemsets are nevertheless more widely used than maximal frequent itemsets because, when
efficiency is more important than space, they preserve the support counts of their subsets, so no
additional pass over the data is needed to recover this information.
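A small sketch showing how maximal and closed frequent itemsets can be identified once the frequent itemsets and their support counts are known (the itemsets and counts below are hypothetical):

```python
# Hypothetical frequent itemsets with their support counts
frequent = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 5,
    frozenset({"a", "b"}): 4,
    frozenset({"b", "c"}): 3,
}

def is_maximal(itemset, frequent):
    """Maximal: no proper superset of the itemset is frequent."""
    return not any(itemset < other for other in frequent)

def is_closed(itemset, frequent):
    """Closed: no proper superset has exactly the same support."""
    return not any(itemset < other and frequent[other] == frequent[itemset]
                   for other in frequent)

for itemset, sup in frequent.items():
    print(set(itemset), sup,
          "maximal" if is_maximal(itemset, frequent) else "",
          "closed" if is_closed(itemset, frequent) else "")
```

Running this shows, for example, that {a} is neither maximal nor closed (its superset {a, b} is frequent with the same support), while {a, b} is both, illustrating that every maximal frequent itemset is also closed.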

6. Compare the Apriori and FP-Growth algorithms for frequent itemset mining in transactional databases.

Here's a comparison between the Apriori and FP-Growth algorithms for frequent itemset
mining in transactional databases:

1. Apriori Algorithm:

o Pros:

 Conceptually simple and easy to understand.

 Suitable for datasets with a small number of frequent itemsets.

 Works well with sparse datasets.

o Cons:

 Generates a large number of candidate itemsets, leading to high


memory usage and computational overhead.

 Multiple passes over the dataset are required, which can be inefficient
for large datasets.

 Pruning of candidate itemsets based on the Apriori property may still


result in a large number of unnecessary candidates.

2. FP-Growth Algorithm:

o Pros:

 Efficiently mines frequent itemsets using a tree-based structure (FP-


tree), resulting in reduced memory usage and computational overhead.

 Requires only two passes over the dataset, making it more scalable for
large datasets.

 Does not generate candidate itemsets explicitly, avoiding the need for
costly candidate generation and pruning steps.

o Cons:
 Construction of the FP-tree can be memory-intensive for very large
datasets with high dimensionality.

 May require additional memory to store the FP-tree structure,


especially for datasets with a large number of unique items.

 Initial preprocessing step to construct the FP-tree may introduce


overhead, especially for datasets with a large number of transactions.

After executing both algorithms, we can compare them based on the following factors:

1. Execution Time: Compare the time taken by each algorithm to find frequent itemsets.

2. Memory Usage: Analyze the memory footprint of each algorithm, considering factors
such as data preprocessing and internal data structures.

3. Scalability: Assess how well each algorithm performs as the size of the dataset
increases.

By analyzing these factors, we can determine which algorithm is more suitable for frequent
itemset mining in transactional databases under the given constraints and requirements.

(OR)

8 a) A database has four transactions. Let min_sup = 60% and min_conf = 80%.

TID Date items_bought

100 10/15/2022 {K,A, B, D}

200 10/15/2022 {D,A, C,E,B}

300 10/19/2022 {C,A, B,E}

400 10/22/2022 {B,A,D}

Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two
mining processes.

Let's first find all frequent items using the Apriori and FP-Growth algorithms for the given
database with a minimum support threshold of 60% and minimum confidence of 80%. After
that, we'll compare the efficiency of the two algorithms.

Given Data:

 Database with four transactions.

 Minimum Support (min_sup) = 60%

 Minimum Confidence (min_conf) = 80%

Apriori Algorithm:

Step 1 (Frequent 1-itemsets):

1. Count the support of each item across the four transactions:

o {A}: 4, {B}: 4, {C}: 2, {D}: 3, {E}: 2, {K}: 1

o With min_sup = 60%, an itemset must appear in at least 0.6 × 4 = 2.4, i.e. 3, transactions.

o Prune the items below this threshold: {C}, {E}, {K}

o F1 (frequent 1-itemsets): {A}, {B}, {D}

Step 2 (Frequent 2-itemsets):

2. Generate candidate 2-itemsets from F1: {A, B}, {A, D}, {B, D}

o Count support for each candidate: {A, B}: 4, {A, D}: 3, {B, D}: 3

o No candidate falls below min_sup, so F2: {A, B}, {A, D}, {B, D}

Step 3 (Frequent 3-itemsets):

3. Generate the candidate 3-itemset {A, B, D} and count its support: 3 (transactions 100, 200 and 400).

o F3: {A, B, D}. No candidate 4-itemsets can be generated, so the algorithm stops.

The frequent itemsets found using Apriori are {A}, {B}, {D}, {A, B}, {A, D}, {B, D} and {A, B, D}.

FP-Growth Algorithm:

The FP-Growth algorithm constructs an FP-tree and mines frequent itemsets without generating
candidates. Outline (without drawing the tree structure):

1. Scan the database once to find the frequent items and their counts (A: 4, B: 4, D: 3) and order
them by descending support.

2. Scan the database a second time to build the FP-tree, inserting each transaction with its frequent
items in this order.

3. Mine frequent itemsets from the FP-tree by building the conditional pattern base and conditional
FP-tree of each frequent item, starting from the least frequent item (D), then B, then A.

The frequent itemsets found using FP-Growth are the same: {A}, {B}, {D}, {A, B}, {A, D}, {B, D} and {A, B, D}.

Comparison of Efficiency:

 Apriori requires multiple passes over the data and generates candidate itemsets,
which can be time-consuming for large databases. In this case, it required three passes
(one each for the frequent 1-, 2- and 3-itemsets).

 FP-Growth requires only two passes over the data (one to count item supports and one to
build the FP-tree) and then mines frequent itemsets directly from the tree without candidate
generation. This is more efficient, especially for larger databases.

In this example, with a small database, the efficiency difference may not be very noticeable.
However, as the database size grows, FP-Growth's efficiency advantage becomes more
apparent, making it a preferred choice for frequent item set mining in many practical
applications.
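To double-check the hand computation above, here is a brute-force sketch (not the full Apriori candidate-generation procedure) that counts support for every candidate itemset over the four transactions; with min_sup = 60%, an itemset must appear in at least 3 of the 4 transactions:

```python
from itertools import combinations

transactions = [
    {"K", "A", "B", "D"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
min_sup = 0.6 * len(transactions)   # 60% of 4 transactions = 2.4, i.e. at least 3

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    found_any = False
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_sup:
            frequent[cand] = count
            found_any = True
    if not found_any:      # no frequent k-itemsets, so no larger ones can exist
        break

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(set(itemset), count)
# Expected: {A} 4, {B} 4, {D} 3, {A,B} 4, {A,D} 3, {B,D} 3, {A,B,D} 3
```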

b) Explain market basket analysis

Market basket analysis is a data mining and analytical technique used by businesses to
understand the purchasing behavior of customers based on the items they buy. It is a
valuable method for uncovering associations and patterns between products that are
frequently purchased together. Market basket analysis is often employed in the retail
industry, but it can be applied to various fields, such as e-commerce, grocery stores,
and even online services like streaming platforms.

Here's a detailed explanation of market basket analysis:

1. Frequent Item Sets:

 Market basket analysis begins with the identification of frequent item sets. These are
sets of items (products) that are frequently purchased together in transactions.

2. Support:

 Support is a key metric used in market basket analysis. It measures how often a
particular item or item set appears in the transactions. The support of an item set is
calculated as the number of transactions containing that item set divided by the total
number of transactions.

3. Confidence:

 Confidence is another important metric. It measures the likelihood that if a customer


buys one item (A), they will also buy another item (B). Confidence is calculated as
the support of both items together (A and B) divided by the support of the first item
(A).
4. Lift:

 Lift is a metric that indicates the strength of association between items. It is calculated
as the confidence of the item set (A and B) divided by the support of the second item
(B). Lift greater than 1 suggests a positive association, less than 1 indicates a negative
association, and equal to 1 means no association.

5. Apriori Algorithm:

 Market basket analysis is often performed using the Apriori algorithm or similar
techniques. This algorithm is used to generate frequent item sets efficiently and
discover association rules that reveal item combinations that occur more frequently
than expected by chance.

Example:

 Suppose you're analyzing sales data for a grocery store. After applying market basket
analysis, you find that customers who buy bread (item A) are very likely to also buy
butter (item B) with a high confidence value. This information can be used to
optimize product placement in the store, create targeted promotions, or enhance
product recommendations for customers.
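The three metrics above can be computed directly from transaction counts; the counts in this sketch are hypothetical numbers chosen only to illustrate the bread-and-butter example:

```python
# Hypothetical transaction counts for the bread/butter example
n_transactions = 1000
n_bread = 300           # transactions containing bread (A)
n_butter = 250          # transactions containing butter (B)
n_bread_and_butter = 200

support_a  = n_bread / n_transactions              # supp(A)
support_b  = n_butter / n_transactions             # supp(B)
support_ab = n_bread_and_butter / n_transactions   # supp(A and B)

confidence = support_ab / support_a                # conf(A -> B)
lift       = confidence / support_b                # lift(A -> B)

print(f"support(A,B)={support_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# confidence = 0.67 and lift = 2.67 > 1, so bread and butter are positively associated
```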

Use Cases and Benefits of Market Basket Analysis:

 Cross-selling: Identify items that are often purchased together and offer bundle deals
or promotions to increase sales.

 Inventory Management: Optimize stock levels and reduce waste by understanding


which items are frequently bought together.

 Store Layout and Merchandising: Improve store layout by placing related products
in close proximity to encourage additional purchases.

 Recommendation Systems: Enhance recommendation engines in e-commerce


platforms by suggesting related items to customers based on their purchase history.

 Pricing Strategies: Develop pricing strategies based on the analysis of customer


purchasing patterns.

 Customer Segmentation: Group customers with similar purchase behavior to tailor


marketing efforts more effectively.

Market basket analysis provides valuable insights into customer behavior and can lead to
increased sales, improved customer satisfaction, and better decision-making for businesses in
various industries.

UNIT-V

9a) Explain Applications and typical requirements of clustering?


Applications:

Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.

In business, clustering can help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.

In biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionality, and gain insight into structures inherent in populations.

Clustering may also help in the identification of areas of similar land use in
an earth observation database and in the identification of groups of houses
in a city according to house type, value,and geographic location, as well as
the identification of groups of automobile insurance policy holders with a
high average claim cost.

Clustering is also called data segmentation in some applications because


clustering partitions large data sets into groups according to their similarity.

Clustering can also be used for outlier detection,Applications of outlier detection


include the detection of credit card fraud and the monitoring of criminal activities in
electronic commerce.
Requirements:

Scalability:
Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions of
objects. Clustering on a sample of a given large data set may lead to biased results.

Highly scalable clustering algorithms are needed.


Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape:


Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any
shape. It is important to develop algorithms that can detect clusters of arbitrary
shape.
Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster
analysis (such as the number of desired clusters). The clustering results can be
quite sensitive to input parameters. Parameters are often difficult to determine,
especially for data sets containing high-dimensional objects. This not only burdens
users, but it also makes the quality of clustering difficult to control.
Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data.
Some clustering algorithms are sensitive to such data and may lead to clusters
of poor quality.

Incremental clustering and insensitivity to the order of input records:


Some clustering algorithms cannot incorporate newly inserted data (i.e., database
updates) into existing clustering structures and, instead, must determine a new
clustering from scratch. Some clustering algorithms are sensitive to the order of input
data.That is, given a set of data objects, such an algorithm may return dramatically
different clusterings depending on the order of presentation of the input objects.

It is important to develop incremental clustering algorithms and algorithms that are
insensitive to the order of input.

High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many
clustering algorithms are good at handling low-dimensional data, involving only two
to three dimensions. Human eyes are good at judging the quality of clustering for up
to three dimensions. Finding clusters of data objects in high-dimensional space is
challenging, especially considering that such data can be sparse and highly skewed.
Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of
constraints. Suppose that your job is to choose the locations for a given number of
new automatic banking machines (ATMs) in a city. To decide upon this, you may
cluster households while considering constraints such as the city’s rivers and highway
networks, and the type and number of customers per cluster. A challenging task is to
find groups of data with good clustering behavior that satisfy specified constraints.

b) Explain additional issues in k-means?

A. K-Means, while a popular and widely used clustering algorithm, has several additional issues
beyond its basic strengths and weaknesses that can affect its performance and suitability for
different datasets or applications. Some of these additional issues include:

1. Selection of K (Number of clusters): Determining the appropriate number of clusters (K) can be
challenging. It often requires domain knowledge, visual inspection, or trial-and-error methods.
Choosing an incorrect K value can lead to suboptimal clustering.

2. Cluster Initialization: K-Means is sensitive to the initial placement of cluster centroids. Random
initialization can result in different clustering outcomes. Poor initialization can lead to slow
convergence or suboptimal clustering.

3. Convergence to Local Optima: K-Means seeks to minimize the objective function (e.g., the sum
of squared distances between data points and cluster centroids). However, it may converge to a
local optimum, which might not be the global best solution. Multiple initializations and restarts
from different initial positions can mitigate this, but it's not foolproof.

4. Impact of Outliers: Outliers can significantly influence the centroids' position, leading to
misleading cluster boundaries. As K-Means is highly sensitive to outliers, it might assign them to the
nearest cluster even if they don't represent the cluster well.

5. Scalability to High-Dimensional Data: K-Means may face challenges when dealing with high-
dimensional data. High-dimensional spaces can make distance calculations less meaningful,
impacting the quality of clustering.

6. Cluster Shape Assumptions: K-Means assumes clusters to be spherical and equally sized. If the
data contains non-spherical or irregularly shaped clusters, K-Means might not perform well.

7. Inability to Handle Non-Globular Clusters: Clusters that are not convex or globular in shape
might be challenging for K-Means to accurately identify.

8. Distance Metric Selection: The choice of distance metric (Euclidean distance being the default)
affects the clustering results. Using inappropriate distance measures for the data can lead to
suboptimal clustering.

9. Overcoming Curse of Dimensionality: K-Means can struggle with high-dimensional data, where
the "curse of dimensionality" affects the meaningfulness of distances between points.
Preprocessing techniques or feature selection can mitigate this issue.
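The first three issues above (choice of K, initialization, and local optima) are commonly mitigated with the elbow method, k-means++ initialization, and multiple restarts; a minimal scikit-learn sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Elbow method: SSE (inertia) for a range of K values; k-means++ and n_init=10
# restarts reduce the sensitivity to initial centroids discussed above.
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The K at which the SSE curve bends sharply (the "elbow") is a reasonable choice.
```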
C.Suppose that the data-mining task is to cluster the following. The eight points (representing
location) into three clusters:A1 (2;10) ; A2 (2;5) ; A3 (8;4) ; B1 (5;8) ; B2 (7;5) ; B3 (6;4) ; C1(1;2) ;
C2 (4;9). The distance function is Euclidean distance.Suppose initially we assign A1, B1, and C1 as
the center of each cluster, respectively. Use the k-means algorithm to determine:the three cluster
centers after the first round of execution.

Here's how to determine the three cluster centers after the first round of k-means execution
for the given data points and initial cluster assignments:

Data Points:

 A1 (2, 10)

 A2 (2, 5)

 A3 (8, 4)

 B1 (5, 8)

 B2 (7, 5)

 B3 (6, 4)

 C1 (1, 2)

 C2 (4, 9)

Initial Cluster Centers:

 Cluster 1: A1 (2, 10)

 Cluster 2: B1 (5, 8)

 Cluster 3: C1 (1, 2)

Step 1: Assign Data Points to Clusters

 Calculate the Euclidean distance between each data point and all three initial
centroids.

 Assign each data point to the cluster with the closest centroid.

Distance Calculations:

Data Point | Distance to A1 (2,10) | Distance to B1 (5,8) | Distance to C1 (1,2) | Assigned Cluster
A1 (2, 10) | 0 (same point) | 3.61 | 8.06 | Cluster 1
A2 (2, 5)  | 5.00 | 4.24 | 3.16 | Cluster 3
A3 (8, 4)  | 8.49 | 5.00 | 7.28 | Cluster 2
B1 (5, 8)  | 3.61 | 0 (same point) | 7.21 | Cluster 2
B2 (7, 5)  | 7.07 | 3.61 | 6.71 | Cluster 2
B3 (6, 4)  | 7.21 | 4.12 | 5.39 | Cluster 2
C1 (1, 2)  | 8.06 | 7.21 | 0 (same point) | Cluster 3
C2 (4, 9)  | 2.24 | 1.41 | 7.62 | Cluster 2

Step 2: Recalculate Centroids

 Calculate the mean of the data points assigned to each cluster; these means become the new
centroids.

 New centroid of Cluster 1 (only A1): (2, 10)

 New centroid of Cluster 2: mean of A3, B1, B2, B3, C2 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)

 New centroid of Cluster 3: mean of A2, C1 = ((2+1)/2, (5+2)/2) = (1.5, 3.5)

Summary:

After the first round of k-means, the new cluster centers are:

 Cluster 1: (2, 10)

 Cluster 2: (6, 6)

 Cluster 3: (1.5, 3.5)

Note:

This is just one iteration of the k-means algorithm. In practice, you would repeat steps 1 and
2 until the centroids no longer change significantly (convergence). This indicates that the
algorithm has likely found a local minimum in the squared distance between points and their
assigned cluster centers.
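The first iteration above can be reproduced with a short NumPy sketch (the printed assignments and new centers should match the corrected table and summary):

```python
import numpy as np

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5),  "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
centers = {1: np.array(points["A1"]), 2: np.array(points["B1"]), 3: np.array(points["C1"])}

# Step 1: assign each point to its nearest centre (Euclidean distance)
assignment = {}
for name, p in points.items():
    d = {c: np.linalg.norm(np.array(p) - centre) for c, centre in centers.items()}
    assignment[name] = min(d, key=d.get)

# Step 2: recompute each centre as the mean of its assigned points
for c in centers:
    members = [points[n] for n, a in assignment.items() if a == c]
    centers[c] = np.mean(members, axis=0)

print(assignment)   # expected: A1 -> 1; A3, B1, B2, B3, C2 -> 2; A2, C1 -> 3
print(centers)      # expected: (2, 10), (6, 6), (1.5, 3.5)
```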

(OR)

10a) Strengths and weaknesses of K-means, and explain Bisecting K-means?

The K-Means algorithm is a widely used clustering technique known for its simplicity and
efficiency. However, it also has certain strengths and weaknesses that are important to consider
when using this method.

Strengths of K-Means:

1. Simplicity: K-Means is straightforward and easy to understand, making it relatively easy to


implement and apply, especially for large datasets.

2. Scalability: It is computationally efficient and works well with large datasets, making it suitable
for clustering in big data environments.
3. Speed: K-Means is fast and can converge quickly, making it useful for initial exploratory data
analysis and as a starting point for other clustering methods.

4. Versatility: It can work with various types of data and is suitable for many applications in
different domains.

5. Well-suited for spherical clusters: K-Means performs well when clusters are approximately
spherical and equally sized.

Weaknesses of K-Means:

1. Sensitivity to initial centroids: The algorithm's performance heavily depends on the initial
placement of centroids, which can lead to different results for different initializations.

2. Cluster shape and size assumption: K-Means assumes that clusters are spherical and of
approximately equal size, which may not reflect the actual structure of the data. It may perform
poorly with irregularly shaped or overlapping clusters.

3. Vulnerability to outliers: Outliers can significantly impact K-Means clustering results, as the
algorithm tends to assign them to the nearest cluster even if they don't belong to any.

4. Hard assignment of data points: K-Means provides a strict assignment of data points to clusters,
which might not represent the uncertainty or fuzziness in the data, unlike fuzzy clustering methods.

5. Difficulty with non-linear data: K-Means is not suitable for finding clusters in non-linear or
complex geometric structures within the data.
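Bisecting K-means, also asked for in the question above, addresses some of these weaknesses: it starts with all points in a single cluster and repeatedly picks one cluster (commonly the one with the largest SSE or the largest size) and splits it into two with standard 2-means, until K clusters are obtained. Because each split is a small 2-means problem, it is less sensitive to initialization and often produces more balanced clusters. A minimal sketch follows; the function name and the largest-SSE splitting rule are illustrative choices, and recent scikit-learn releases also ship a BisectingKMeans estimator:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans(X, k, random_state=0):
    """Repeatedly split the cluster with the largest SSE using 2-means until k clusters exist."""
    clusters = [X]
    while len(clusters) < k:
        # pick the cluster with the largest within-cluster SSE
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        worst = clusters.pop(int(np.argmax(sse)))
        labels = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(worst)
        clusters += [worst[labels == 0], worst[labels == 1]]
    return clusters

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
for i, c in enumerate(bisecting_kmeans(X, 4)):
    print("cluster", i, "size", len(c), "centroid", c.mean(axis=0).round(2))
```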

b) Explain Density based clustering and Merits of DBSCAN?

This method is based on the notion of density. The basic idea is to continue growing a given cluster
as long as the density in the neighborhood exceeds some threshold; that is, for each data point within
a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Density-based spatial clustering of applications with noise (DBSCAN) is a representative density-based
clustering method. Clusters are dense regions in the data space, separated by regions of lower point
density. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea
is that, for each point of a cluster, the neighbourhood of a given radius (eps) has to contain at least
a minimum number of points (MinPts).

Why DBSCAN?

Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for

finding spherical-shaped clusters or convex clusters. In other words, they are suitable only

for compact and well-separated clusters. Moreover, they are also severely affected by the

presence of noise and outliers in the data.


Real life data may contain irregularities, like:

1. Clusters can be of arbitrary shape such as those shown in the figure below.

2. Data may contain noise.

The DBSCAN algorithm requires two parameters:

1. eps: It defines the neighbourhood around a data point, i.e., if the distance between two
points is less than or equal to eps, they are considered neighbours. If the eps value is chosen
too small, a large part of the data will be considered outliers; if it is chosen very large,
clusters will merge and the majority of the data points will end up in the same cluster. One
way to find a suitable eps value is based on the k-distance graph.

2. MinPts: The minimum number of neighbours (data points) within the eps radius. The larger the
dataset, the larger the value of MinPts that should be chosen. As a general rule, MinPts can be
derived from the number of dimensions D in the dataset as MinPts >= D + 1; the minimum value of
MinPts should be at least 3.

Merits of DBSCAN:
 It can discover clusters of arbitrary shape, unlike partitioning methods that tend to find
spherical clusters.
 It is robust to noise and outliers; points in low-density regions are labelled as noise rather
than forced into a cluster.
 The number of clusters does not need to be specified in advance; it follows from eps and MinPts.

Other clustering approaches:
 Grid-based methods: the major advantage of this method is fast processing time; it is dependent
only on the number of cells in each dimension of the quantized space.
 Model-based methods: a model is hypothesized for each cluster to find the best fit of the data
for a given model. This method locates the clusters by fitting a density function and reflects the
spatial distribution of the data points. It also provides a way to automatically determine the
number of clusters based on standard statistics, taking outliers or noise into account, and
therefore yields robust clustering methods.
 Constraint-based methods: the clustering is performed by the incorporation of user- or
application-oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide an interactive way of communicating with the
clustering process and can be specified by the user or the application requirement.
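A scikit-learn sketch of DBSCAN on a non-spherical dataset (the two-moons data and the eps/MinPts values are illustrative choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that K-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighbourhood radius; min_samples corresponds to MinPts in the text above
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                       # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```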

D. Measures for clusters.

Cluster quality is commonly evaluated with unsupervised measures such as cluster cohesion (e.g., the
within-cluster sum of squared errors, SSE), cluster separation (e.g., the between-cluster sum of
squares), and combined measures such as the silhouette coefficient; supervised measures such as
entropy and purity compare a clustering against externally supplied class labels.

Clustering methods themselves can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
