DM Unit 2

1....Discuss various challenges of Data Mining.

Data mining, the process of extracting knowledge from data, has become increasingly
important as the amount of data generated by individuals, organizations, and machines
has grown exponentially. Some of the key challenges are:

1]Data Quality

The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, duplications, or inconsistencies, which may lead to
inaccurate results.

Data quality issues can arise due to a variety of reasons, including data entry errors, data
storage issues, data integration problems, and data transmission errors.

To address these challenges, data mining users must apply data cleaning and data
preprocessing techniques to improve the quality of the data.
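As a rough illustration, the sketch below (assuming pandas is available; the table and column names are hypothetical) shows typical cleaning steps: dropping duplicates, fixing inconsistent formatting, and imputing missing values.

```python
import pandas as pd

# Hypothetical customer records with a duplicate row, inconsistent casing, and a missing age
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Delhi", "mumbai", "mumbai", "Pune"],
    "age": [34.0, 28.0, 28.0, None],
})

df = df.drop_duplicates()                        # remove duplicate records
df["city"] = df["city"].str.title()              # make the casing consistent
df["age"] = df["age"].fillna(df["age"].mean())   # impute the missing age with the mean
print(df)
```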

2]Data Complexity

Data complexity refers to the vast amounts of data generated by various sources, such as sensors, social media, and the Internet of Things (IoT). The complexity of the data may make it challenging to process, analyze, and understand.

To address this challenge, data mining users apply techniques such as clustering and classification to organize and simplify the data.

3]Data Privacy and Security

Data privacy and security are another significant challenge in data mining.

As more data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The data may contain personal, sensitive, or confidential information that must be protected.

To address this challenge, data mining users must apply data anonymization and data
encryption techniques to protect the privacy and security of the data.
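A very small sketch of one such technique, pseudonymization by one-way hashing, is shown below (the function name and salt are illustrative only; real anonymization schemes involve much more than this).

```python
import hashlib

def pseudonymize(value: str, salt: str = "unit2-demo-salt") -> str:
    """Replace an identifier with a one-way hash so records can still be linked
    for analysis without exposing the original value (illustrative sketch only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

print(pseudonymize("alice@example.com"))
```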

4]Scalability:

Data mining algorithms must be scalable to handle large datasets efficiently. As the size of the dataset increases, the time and computational resources required to perform data mining operations also increase.

To address this challenge, data mining users use distributed computing frameworks.

5]Interpretability

Data mining algorithms can produce complex models that are difficult to interpret. This is
because the algorithms use a combination of statistical and mathematical techniques to
identify patterns and relationships in the data.

To address this challenge, data mining users use visualization techniques to represent the data and the models visually.

6]Ethics

Data mining also raises ethical concerns, such as collecting or using personal data without consent, introducing bias into automated decisions, and drawing conclusions that may harm individuals or groups. To address this challenge, organizations should follow ethical and legal guidelines for how data is collected, analyzed, and used.

2...What are the applications of Data Mining?

Data mining is the process of discovering patterns and insights in large datasets. It is
important because it allows organizations to make informed decisions based on data-
driven insights. Data mining is widely used in many areas, such as marketing and customer
relationship management, fraud detection and risk management, healthcare and medical
research, and manufacturing and supply chain management. These applications enable
organizations to optimize operations, improve customer experiences, reduce risks, and
make data-driven decisions.

Some of the major application areas are:

...Financial Data Analysis:

.Design and construction of data warehouses for multidimensional data analysis and data
mining.

.Loan payment prediction and customer credit policy analysis.

.Classification and clustering of customers for targeted marketing.

.Detection of money laundering and other financial crimes.

Retail Industry:

.Design and Construction of data warehouses based on the benefits of data mining.

.Multidimensional analysis of sales, customers, products, time and region.


.Analysis of effectiveness of sales campaigns.

.Customer Retention.

.Product recommendation and cross-referencing of items.

Telecommunication Industry:

.Multidimensional Analysis of Telecommunication data.

.Fraudulent pattern analysis.

.Identification of unusual patterns.

.Multidimensional association and sequential patterns analysis.

.Mobile Telecommunication services.

.Use of visualization tools in telecommunication data analysis.

Biological Data Analysis:

.Semantic integration of heterogeneous, distributed genomic and proteomic databases.

.Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.

.Discovery of structural patterns and analysis of genetic networks and protein pathways.

.Association and path analysis.

.Visualization tools in genetic data analysis.

Other Scientific Applications:

.Data Warehouses and data preprocessing.

.Graph-based mining.

.Visualization and domain specific knowledge.

Intrusion Detection:

.Development of data mining algorithms for intrusion detection.

.Association and correlation analysis, aggregation to help select and build discriminating
attributes.

.Analysis of Stream data.

.Distributed data mining.


.Visualization and query tools.

3...Explain OLAP server architectures in detail.

Online Analytical Processing(OLAP) refers to a set of software tools used for data analysis
in order to make business decisions. OLAP provides a platform for gaining insights from
databases retrieved from multiple database systems at the same time. It is based on a
multidimensional data model, which enables users to extract and view data from various
perspectives. A multidimensional database is used to store OLAP data.

Type of OLAP servers:

The three major types of OLAP servers are as follows:

1)ROLAP

2)MOLAP

3)HOLAP

>Relational OLAP (ROLAP):

.Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as
relational tables. ROLAP servers are used to bridge the gap between the relational back-
end server and the client’s front-end tools. ROLAP servers store and manage warehouse
data using RDBMS, and OLAP middleware fills in the gaps.

Benefits:

.It is compatible with data warehouses and OLTP systems.

.The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a
result, ROLAP does not limit the amount of data that can be stored.

Limitations:

.SQL functionality is constrained.

.It’s difficult to keep aggregate tables up to date.

>Multidimensional OLAP (MOLAP):


. Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional
views of data. Storage utilization in multidimensional data stores may be low if the data set
is sparse.

.MOLAP stores data on discs in the form of a specialized multidimensional array structure.
It is used for OLAP, which is based on the arrays’ random access capability. Dimension
instances determine array elements, and the data or measured value associated with each
cell is typically stored in the corresponding array element. The multidimensional array is
typically stored in MOLAP in a linear allocation based on nested traversal of the axes in
some predetermined order.

.However, unlike ROLAP, which stores only records with non-zero facts, all array elements
are defined in MOLAP, and as a result, the arrays tend to be sparse, with empty elements
occupying a larger portion of them. MOLAP systems typically include provisions such as
advanced indexing and hashing to locate data while performing queries for handling sparse
arrays, because both storage and retrieval costs are important when evaluating online
performance. MOLAP cubes are ideal for slicing and dicing data and can perform complex
calculations. When the cube is created, all calculations are pre-generated.

Benefits:

Suitable for slicing and dicing operations.

Outperforms ROLAP when data is dense.

Capable of performing complex calculations.

Limitations:

.It is difficult to change the dimensions without re-aggregating.

.Since all calculations are performed when the cube is built, a large amount of data cannot
be stored in the cube itself.

>Hybrid OLAP (HOLAP):

.Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, offering the greater scalability of ROLAP and the faster computation of MOLAP. HOLAP servers are capable of storing large amounts of detailed data. On the one hand, HOLAP benefits from ROLAP’s greater scalability; on the other hand, it makes use of cube technology for faster performance and summary-type information. Because detailed data is stored in a relational database, the cubes are smaller than in MOLAP.

Benefits:

.HOLAP combines the benefits of MOLAP and ROLAP.

.Provide quick access at all aggregation levels.

Limitations:

.Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely
complex.

.There is a greater likelihood of overlap, particularly in their functionalities.

4....What is KDD? Explain various steps involved in KDD.

In the context of computer science, “Data Mining” can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

.Data mining is needed to extract useful information from large datasets and use it to make predictions or support better decision-making. Nowadays, data mining is used in almost all places where a large amount of data is stored and processed.

.For example: the banking sector, market basket analysis, and network intrusion detection.

>KDD Process

KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets.

.The KDD process is iterative and usually requires multiple passes through its steps to extract accurate knowledge from the data.

.The following steps are included in the KDD process:

1)Data Cleaning

.Data cleaning is defined as removal of noisy and irrelevant data from collection.

.Cleaning in case of Missing values.


.Cleaning noisy data, where noise is a random error or variance in a measured variable.

.Cleaning with data discrepancy detection and data transformation tools.

2)Data Integration

Data integration is defined as the process of combining heterogeneous data from multiple sources into a common source (a data warehouse).

.Data integration uses data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.

3)Data Selection

Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection. For this we can use neural networks, decision trees, Naive Bayes, clustering, and regression methods.

4)Data Transformation

.Data Transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. Data Transformation is a two-step process:

1.Data Mapping: Assigning elements from the source base to the destination to capture transformations.

2.Code Generation: Creation of the actual transformation program.

5)Data Mining

.Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model using classification or characterization.

6)Pattern Evaluation

.Pattern Evaluation is defined as identifying strictly increasing patterns representing knowledge based on given measures. It finds the interestingness score of each pattern, and uses summarization and visualization to make the data understandable to the user.

7)Knowledge Representation:

This involves presenting the results in a way that is meaningful and can be used to make
decisions.

Note: KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results. Preprocessing of databases consists of data cleaning and data integration.

5....List various steps involved in Data Pre-processing.

.Data preprocessing is an important step in the data mining process.

. It refers to the cleaning, transforming, and integrating of data in order to make it ready for
analysis.

.The goal of data preprocessing is to improve the quality of the data and to make it more
suitable for the specific data mining task.

. Some common steps in data preprocessing include:

1)Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.

2)Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can
be used for data integration.

3)Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.

4)Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
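As an illustrative sketch of feature extraction (assuming scikit-learn is available; the data here is random and only stands in for a real dataset), Principal Component Analysis can project the records onto a smaller number of dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 records, 10 original features

pca = PCA(n_components=3)               # keep only 3 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 3): same records, fewer dimensions
print(pca.explained_variance_ratio_)    # how much variance each component preserves
```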

5)Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
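A small sketch of equal-width versus equal-frequency binning, assuming pandas is available (the ages are made up for illustration):

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

equal_width = pd.cut(ages, bins=4)    # equal-width: each interval spans the same range
equal_freq  = pd.qcut(ages, q=4)      # equal-frequency: roughly the same count per bin

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```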
6)Data Normalization: This involves scaling the data to a common range, such as between
0 and 1 or -1 and 1. Normalization is often used to handle data with different units and
scales. Common normalization techniques include min-max normalization, z-score
normalization, and decimal scaling.
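The three normalization techniques mentioned above can be sketched in a few lines of NumPy (the values are made up for illustration):

```python
import numpy as np

x = np.array([120.0, 250.0, 380.0, 990.0])

min_max = (x - x.min()) / (x.max() - x.min())              # min-max: scaled to [0, 1]
z_score = (x - x.mean()) / x.std()                         # z-score: zero mean, unit variance
decimal = x / 10 ** np.ceil(np.log10(np.abs(x).max()))     # decimal scaling: divide by a power of 10

print(min_max, z_score, decimal, sep="\n")
```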

.Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of
the analysis results.

.The specific steps involved in data preprocessing may vary depending on the nature of the
data and the analysis goals.

.By performing these steps, the data mining process becomes more efficient and the
results become more accurate.

6....Describe about various types of data.

In data mining, data can come in various forms and types, each requiring different
techniques and tools for analysis. Here are some common types of data encountered in
data mining:

1. **Structured Data**:

- **Relational Data**: Data stored in tables, often in relational databases, where each
row represents a record and each column represents an attribute.

- **Transactional Data**: Data generated from transactions, such as sales records in a retail store. Each transaction is a record containing items bought, transaction ID, etc.

2. **Unstructured Data**:

- **Text Data**: Data in the form of text, such as emails, documents, social media posts,
etc. Text mining techniques are used to analyze and extract information from such data.

- **Multimedia Data**: Includes images, audio, and video files. Image processing, audio
analysis, and video analysis techniques are applied to mine this type of data.

3. **Semi-Structured Data**:

- **XML/JSON Data**: Data stored in formats like XML or JSON, which have some
structure but do not fit neatly into relational databases.

- **Web Data**: Data from web pages, which might include semi-structured elements
like HTML tags along with unstructured content.
4. **Time-Series Data**:

- Data collected over time, such as stock prices, sensor readings, or weather data.
Analysis involves identifying patterns, trends, and seasonal variations.

5. **Spatial Data**:

- Data related to geographic locations, such as maps, satellite imagery, or GPS coordinates. Techniques like spatial clustering and spatial regression are used to analyze this data.

6. **Graph Data**:

- Data represented as nodes and edges, such as social networks, biological networks, or
transportation networks. Graph mining techniques are used to find patterns and
relationships in such data.

7. **Streaming Data**:

- Data that is continuously generated and arrives in real-time, such as data from IoT
devices, social media feeds, or financial transactions. Real-time data processing and
mining techniques are required for this type of data.

8. **High-Dimensional Data**:

- Data with a large number of attributes (dimensions), such as genomics data or image
pixel data. Techniques like dimensionality reduction are used to manage and analyze high-
dimensional data.

9. **Metadata**:

- Data about data, providing information such as the source, context, or characteristics of
the primary data. Metadata is useful for data management and understanding data
provenance.

.Different types of data require different preprocessing, analysis, and mining techniques to
extract meaningful insights. Understanding the nature of the data is crucial for selecting
appropriate data mining methods.

7...Differentiate between ROLAP vs MOLAP vs HOLAP

ROLAP (Relational Online Analytical Processing), MOLAP (Multidimensional Online Analytical Processing), and HOLAP (Hybrid Online Analytical Processing) are different approaches to OLAP systems used for data warehousing and analysis. Here are the key differences:
1. **ROLAP (Relational OLAP):**

- **Data Storage:** Uses relational databases to store data.

- **Architecture:** Data is stored in relational tables and a multidimensional view is created dynamically.

- **Performance:** Slower query performance compared to MOLAP due to on-the-fly calculations.

- **Scalability:** Highly scalable, can handle large amounts of data.

- **Flexibility:** More flexible for handling complex queries.

- **Example:** Implemented using SQL databases.

2. **MOLAP (Multidimensional OLAP):**

- **Data Storage:** Uses multidimensional databases (often in a cube format) to store data.

- **Architecture:** Pre-calculates and stores the data in multidimensional cubes.

- **Performance:** Faster query performance due to pre-calculated and stored aggregates.

- **Scalability:** Limited scalability compared to ROLAP due to the size of the pre-calculated data.

- **Flexibility:** Less flexible for complex queries but excels in speed for standard OLAP
queries.

- **Example:** Implemented using specialized OLAP databases like Microsoft Analysis Services.

3. **HOLAP (Hybrid OLAP):**

- **Data Storage:** Combines both relational and multidimensional storage.

- **Architecture:** Stores summary data in multidimensional cubes and detailed data in relational databases.

- **Performance:** Offers a balance between ROLAP and MOLAP, providing good query
performance with the ability to handle large datasets.

- **Scalability:** More scalable than MOLAP and can manage larger datasets by
leveraging relational storage for detailed data.
- **Flexibility:** Balances the flexibility of ROLAP with the performance benefits of
MOLAP.

- **Example:** Microsoft SQL Server Analysis Services can be configured to use HOLAP.

.These approaches offer different trade-offs in terms of performance, scalability, and flexibility, making them suitable for different types of data analysis requirements.

8...Describe different Data Mining tasks in detail.

. Introduction to Data Mining Tasks:

The data mining tasks can be classified generally into two types based on what a specific
task tries to achieve.

. Those two categories are descriptive tasks and predictive tasks.

. The descriptive data mining tasks characterize the general properties of data whereas
predictive data mining tasks perform inference on the available data set to predict how a
new data set will behave.

..Different Data Mining Tasks

There are a number of data mining tasks such as classification, prediction, time-series
analysis, association, clustering, summarization etc.

All these tasks are either predictive data mining tasks or descriptive data mining tasks. A
data mining system can execute one or more of the above specified tasks as part of data
mining.

..Predictive data mining tasks come up with a model from the available data set that is
helpful in predicting unknown or future values of another data set of interest. A medical
practitioner trying to diagnose a disease based on the medical test results of a patient can
be considered a predictive data mining task. These predictive tasks include:

1)Classification

.Classification derives a model to determine the class of an object based on its attributes.

.A collection of records will be available, each record with a set of attributes.

. One of the attributes is the class attribute, and the goal of the classification task is to assign a class to new records as accurately as possible.

.Classification can be used in direct marketing, that is, to reduce marketing costs by targeting a set of customers who are likely to buy a new product.
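A minimal classification sketch, assuming scikit-learn is available and using its bundled Iris dataset as a stand-in for real records, looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # records with attributes and a class label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # derive the model
y_pred = model.predict(X_test)                    # assign classes to new records

print("Accuracy:", accuracy_score(y_test, y_pred))
```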
2)Prediction:

.Prediction task predicts the possible values of missing or future data.

.Prediction involves developing a model based on the available data and this model is used
in predicting future values of a new data set of interest.

3)Time – Series Analysis:

.Time series is a sequence of events where the next event is determined by one or more of
the preceding events.

.Time series reflects the process being measured and there are certain components that
affect the behavior of a process.

..Descriptive data mining tasks usually find patterns describing the data and come up with new, significant information from the available data set. A retailer trying to identify products that are purchased together can be considered a descriptive data mining task.

1)Association:

Association discovers the association or connection among a set of items. Association identifies the relationships between objects.

. Association analysis is used for commodity management, advertising, catalog design, direct marketing, etc.

2)Clustering:

.Clustering is used to identify data objects that are similar to one another.

. The similarity can be decided based on a number of factors like purchase behavior,
responsiveness to certain actions, geographical locations and so on.
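A minimal clustering sketch, assuming scikit-learn is available (the purchase-behaviour numbers are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per customer: [annual spend, visits per month]
customers = np.array([[200, 2], [220, 3], [1500, 12], [1600, 10], [800, 6], [780, 7]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # representative centre of each cluster
```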

3)Summarization:

.Summarization is the generalization of data.

. A set of relevant data is summarized, which results in a smaller set that gives aggregated information about the data.

9...Discuss about similarity and Dissimilarity Measures in detail.

Measuring similarity and dissimilarity in data mining is an important task that helps
identify patterns and relationships in large datasets.
.To quantify the degree of similarity or dissimilarity between two data points or objects, mathematical functions called similarity and dissimilarity measures are used.

.Similarity measures produce a score that indicates the degree of similarity between two
data points, while dissimilarity measures produce a score that indicates the degree of
dissimilarity between two data points.

These measures are crucial for many data mining tasks, such as identifying duplicate records, clustering, classification, and anomaly detection.

>>>>Similarity Measure

.A similarity measure is a mathematical function that quantifies the degree of similarity between two objects or data points. It is a numerical score measuring how alike two data points are.

.It takes two data points as input and produces a similarity score as output, typically
ranging from 0 (completely dissimilar) to 1 (identical or perfectly similar).

.A similarity measure can be based on various mathematical techniques such as Cosine similarity, Jaccard similarity, and Pearson correlation coefficient.
.Similarity measures are generally used to identify duplicate records, equivalent instances, or clusters.
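As a small sketch, cosine similarity between two count vectors can be computed with NumPy (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = np.array([3, 0, 1, 2])   # e.g. word counts of one document
doc2 = np.array([2, 1, 0, 2])
print(cosine_similarity(doc1, doc2))
```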

>>>Dissimilarity Measure

.A dissimilarity measure is a mathematical function that quantifies the degree of dissimilarity between two objects or data points. It is a numerical score measuring how different two data points are.

.It takes two data points as input and produces a dissimilarity score as output, ranging from
0 (identical or perfectly similar) to 1 (completely dissimilar). A few dissimilarity measures
also have infinity as their upper limit.

.A dissimilarity measure can be obtained by using different techniques such as Euclidean distance, Manhattan distance, and Hamming distance.

.Dissimilarity measures are often used in identifying outliers, anomalies, or clusters.
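The three distance measures named above can be computed directly with SciPy, assuming it is available (the vectors are made up for illustration):

```python
from scipy.spatial import distance

p = [1, 0, 1, 1, 0]
q = [0, 0, 1, 1, 1]

print("Euclidean:", distance.euclidean(p, q))
print("Manhattan:", distance.cityblock(p, q))
print("Hamming  :", distance.hamming(p, q))   # fraction of positions that differ
```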

10...Explain about various data pre-processing steps in detail.

(Same as Question 5.)

11...Explain about Bitmap indexing and Join indexing.

Bitmap indexing and join indexing are techniques used in data mining and databases to
optimize query performance, especially in large datasets. Here’s an explanation of each:

>>>Bitmap Indexing

**Definition:**

Bitmap indexing is a technique that uses bit arrays (bitmaps) to represent the presence or
absence of a value in a given column.

. This is particularly effective for columns with a low cardinality, meaning columns that
have a limited number of distinct values.

**How it Works:**

- For each distinct value in a column, a separate bitmap (bit array) is created.

- Each bitmap has a bit for every row in the table, indicating whether the row contains the
specific value.
- For example, for a column “Gender” with values “Male” and “Female,” there would be two
bitmaps: one for “Male” and one for “Female.” If a row contains “Male,” the corresponding
bit in the “Male” bitmap would be set to 1, and the bit in the “Female” bitmap would be set
to 0.
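A toy sketch of this idea in Python (plain integers are used as bit arrays; the columns are hypothetical):

```python
# One bitmap per distinct value of a low-cardinality column; queries use bitwise operations.
genders = ["Male", "Female", "Female", "Male", "Female"]
regions = ["North", "North", "South", "South", "North"]

def build_bitmaps(column):
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)   # set the bit for this row
    return bitmaps

gender_idx = build_bitmaps(genders)
region_idx = build_bitmaps(regions)

# Rows where gender == "Female" AND region == "North", answered with one bitwise AND
match = gender_idx["Female"] & region_idx["North"]
print([i for i in range(len(genders)) if (match >> i) & 1])   # -> [1, 4]
```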

**Advantages:**

- Efficient for querying and filtering on low-cardinality columns.

- Enables fast combination of conditions using bitwise operations (AND, OR, NOT).

- Reduces the storage space required compared to traditional indexing methods for low-
cardinality data.

**Disadvantages:**

- Less effective for high-cardinality columns (columns with many distinct values).

- Can become inefficient for update-intensive environments because updating the bitmaps
requires modifying multiple bit arrays.

>>>> Join Indexing

**Definition:**

Join indexing is a technique that pre-computes and stores the relationships between
tables, facilitating faster join operations.

.It is particularly useful in a star schema or snowflake schema in data warehouses.

**How it Works:**

- A join index maintains the mapping between the rows of two or more tables that are
frequently joined together.

- For example, if there are two tables, “Orders” and “Customers,” a join index would store
pairs of row identifiers from “Orders” and “Customers” that are related based on a
common key, such as “CustomerID.”
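A toy sketch of a join index in Python (the table layouts are hypothetical and kept as simple tuples):

```python
customers = [(101, "Alice"), (102, "Bob")]        # (CustomerID, Name)
orders    = [(1, 101), (2, 102), (3, 101)]        # (OrderID, CustomerID)

# Pre-compute the pairs of row positions that join on CustomerID
cust_pos = {cid: i for i, (cid, _) in enumerate(customers)}
join_index = [(cust_pos[cid], j) for j, (_, cid) in enumerate(orders) if cid in cust_pos]
print(join_index)                                  # -> [(0, 0), (1, 1), (0, 2)]

# Later queries reuse the stored pairs instead of re-deriving the join
for ci, oi in join_index:
    print(customers[ci][1], "placed order", orders[oi][0])
```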

**Advantages:**

- Speeds up join operations by avoiding the need to repeatedly compute joins during query
execution.

- Enhances performance for complex queries involving multiple joins.

- Useful in environments where join operations are frequent and involve large tables.
**Disadvantages:**

- Requires additional storage space to maintain the join index.

- Needs to be updated whenever the underlying tables are modified, which can add
overhead in environments with frequent updates.

- Can be complex to manage, especially as the number of tables and join conditions
increases.

>>> Use Cases

- **Bitmap Indexing:** Ideal for scenarios with low-cardinality columns and read-heavy
workloads, such as data warehouses and decision support systems.

- **Join Indexing:** Best suited for environments where complex join operations are
frequent, such as data warehouses with star or snowflake schemas.

12...Explain about Data Cleaning , Data transformation and Data Reduction.

These are the Steps Involved in Data Preprocessing:

1. >>>>>>Data Cleaning:

.The data can have many irrelevant and missing parts. To handle this, data cleaning is done.

. It involves handling of missing data, noisy data etc.

(a). Missing Data:

This situation arises when some values are missing in the data. It can be handled in various ways.

Some of them are:

Ignore the tuples:

This approach is suitable only when the dataset we have is quite large and multiple values
are missing within a tuple.

Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values manually,
by attribute mean or the most probable value.
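For instance, filling missing values with the attribute mean takes one line with pandas, assuming it is available (the marks are made up):

```python
import pandas as pd

scores = pd.DataFrame({"student": ["A", "B", "C", "D"],
                       "marks": [78.0, None, 91.0, None]})

scores["marks"] = scores["marks"].fillna(scores["marks"].mean())   # impute with the attribute mean
print(scores)
```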

(b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.

It can be handled in the following ways:

Binning Method:

This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
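For example, smoothing by bin means can be sketched with NumPy (the sample values are illustrative):

```python
import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))

bins = np.split(data, 3)                                        # 3 equal-size bins of sorted data
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # each value replaced by its bin mean -> [9, 9, 9, 22, 22, 22, 29, 29, 29]
```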

Regression:

Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

Clustering:

This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.

2. >>>>>Data Transformation:

This step is taken in order to transform the data in appropriate forms suitable for mining
process.

This involves following ways:

Normalization:

It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.

Attribute Selection:

In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.

Discretization:

This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.

Concept Hierarchy Generation:


Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.

3. >>>>>Data Reduction:

Data reduction is a crucial step in the data mining process that involves reducing the size of
the dataset while preserving the important information.

Some common steps involved in data reduction are:

Feature Selection: This involves selecting a subset of relevant features from the dataset.

Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information.

Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information.

Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid.

Compression: This involves compressing the dataset while preserving the important
information.
