DM Unit 2
Data mining, the process of extracting knowledge from data, has become increasingly
important as the amount of data generated by individuals, organizations, and machines
has grown exponentially. Some of the main challenges are:
1]Data Quality
The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, duplications, or inconsistencies, which may lead to
inaccurate results.
Data quality issues can arise due to a variety of reasons, including data entry errors, data
storage issues, data integration problems, and data transmission errors.
To address these challenges, data mining users must apply data cleaning and data
preprocessing techniques to improve the quality of the data.
2]Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such as
sensors, social media, and the Internet of Things. The complexity of the data may make it
challenging to process, analyze, and understand.
To address this challenge, data mining users apply techniques such as clustering and
classification.
3]Data Privacy and Security
As more data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks
increases. The data may contain personal, sensitive, or confidential information that must
be protected.
To address this challenge, data mining users must apply data anonymization and data
encryption techniques to protect the privacy and security of the data.
4]Scalability:
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of
the dataset increases, the time and computational resources required to perform data
mining operations also increase.
To address this challenge, data mining users use distributed computing frameworks.
5]Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is
because the algorithms use a combination of statistical and mathematical techniques to
identify patterns and relationships in the data.
To address this challenge, data mining users use visualization techniques to represent the
data and the models visually.
6]Ethics
Data mining is the process of discovering patterns and insights in large datasets. It is
important because it allows organizations to make informed decisions based on data-
driven insights. Data mining is widely used in many areas, such as marketing and customer
relationship management, fraud detection and risk management, healthcare and medical
research, and manufacturing and supply chain management. These applications enable
organizations to optimize operations, improve customer experiences, reduce risks, and
make data-driven decisions.
Data mining has a wide range of applications, such as marketing and customer relationship
management, fraud detection and risk management, healthcare and medical research,
and manufacturing and supply chain management.
.Design and construction of data warehouses for multidimensional data analysis and data
mining.
Retail Industry:
.Design and Construction of data warehouses based on the benefits of data mining.
.Customer Retention.
Telecommunication Industry:
.Discovery of structural patterns and analysis of genetic networks and protein pathways.
.Graph-based mining.
Intrusion Detection:
.Association and correlation analysis, aggregation to help select and build discriminating
attributes.
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis
in order to make business decisions. OLAP provides a platform for gaining insights from
data retrieved from multiple database systems at the same time. It is based on a
multidimensional data model, which enables users to extract and view data from various
perspectives. A multidimensional database is used to store OLAP data.
The main types of OLAP are:
1)ROLAP
2)MOLAP
3)HOLAP
.Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as
relational tables. ROLAP servers are used to bridge the gap between the relational back-
end server and the client’s front-end tools. ROLAP servers store and manage warehouse
data using RDBMS, and OLAP middleware fills in the gaps.
Benefits:
.The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a
result, ROLAP does not limit the amount of data that can be stored.
Limitations:
.Multidimensional On-Line Analytical Processing (MOLAP) stores data on disk in a
specialized multidimensional array structure. OLAP operations rely on the arrays’ random
access capability. Dimension instances determine the array elements, and the data or
measured value associated with each cell is typically stored in the corresponding array
element. The multidimensional array is typically stored in a linear allocation based on a
nested traversal of the axes in some predetermined order.
.However, unlike ROLAP, which stores only records with non-zero facts, all array elements
are defined in MOLAP, and as a result, the arrays tend to be sparse, with empty elements
occupying a larger portion of them. MOLAP systems typically include provisions such as
advanced indexing and hashing to locate data while performing queries for handling sparse
arrays, because both storage and retrieval costs are important when evaluating online
performance. MOLAP cubes are ideal for slicing and dicing data and can perform complex
calculations. When the cube is created, all calculations are pre-generated.
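The linear allocation described above can be sketched as follows. This is a minimal illustration, assuming a row-major traversal (last axis varies fastest) and a hypothetical 3 x 4 x 2 cube; the names `linear_offset`, `cells`, and `sparse_cells` are made up for the example and do not come from any particular MOLAP product.

```python
def linear_offset(index, shape):
    """Row-major offset of a cell: nested traversal with the last axis varying fastest."""
    offset = 0
    for i, size in zip(index, shape):
        if not 0 <= i < size:
            raise IndexError(f"index {index} out of bounds for shape {shape}")
        offset = offset * size + i
    return offset

# Hypothetical cube: 3 products x 4 quarters x 2 regions.
shape = (3, 4, 2)

# Dense storage: every cell exists, even when it holds no fact.
cells = [0.0] * (3 * 4 * 2)

# Record one fact: product 1, quarter 2, region 0 sold 150 units.
cells[linear_offset((1, 2, 0), shape)] = 150.0

# Sparse alternative: keep only non-empty cells, located by hashing the offset.
sparse_cells = {linear_offset((1, 2, 0), shape): 150.0}
```

Here the cell (1, 2, 0) lands at offset 12 of 24. The dense list shows why MOLAP arrays tend to be sparse, and the dictionary is one simple way to picture the hashing provisions mentioned above.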
Benefits:
Limitations:
.Since all calculations are performed when the cube is built, a large amount of data cannot
be stored in the cube itself.
.Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP.
HOLAP offers greater scalability than MOLAP and faster computation than ROLAP.
HOLAP servers are capable of storing large amounts of detailed data. On the one hand,
HOLAP benefits from ROLAP’s greater scalability. On the other hand, it makes use of cube
technology for faster performance and summary-type information. Because detailed data
is stored in a relational database, the cubes are smaller than in MOLAP.
Benefits:
Limitations:
.Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely
complex.
In the context of computer science, “Data Mining” can be referred to as knowledge mining
from data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging. Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the
nontrivial extraction of implicit, previously unknown and potentially useful information
from data stored in databases.
.The need for data mining is to extract useful information from large datasets and use it to
make predictions or support better decision-making. Nowadays, data mining is used in
almost all places where a large amount of data is stored and processed.
.For example: the banking sector, market basket analysis, and network intrusion detection.
>KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets.
.The KDD process is iterative: it requires multiple passes through the steps below
to extract accurate knowledge from the data.
1)Data Cleaning
.Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
2)Data Integration
.Data integration is performed using data migration tools, data synchronization tools, and
the ETL (Extract-Transform-Load) process.
3)Data Selection
Data selection is defined as the process where data relevant to the analysis is decided and
retrieved from the data collection. For this we can use neural networks, decision trees,
Naive Bayes, clustering, and regression methods.
4)Data Transformation
.Data Transformation is defined as the process of transforming data into the appropriate
form required by the mining procedure. Data Transformation is a two-step process:
5)Data Mining
.Data mining is defined as the application of techniques to extract potentially useful
patterns. It transforms task-relevant data into patterns, and decides the purpose of the
model using classification or characterization.
6)Pattern Evaluation
7)Knowledge Representation:
This involves presenting the results in a way that is meaningful and can be used to make
decisions.
Note: KDD is an iterative process where evaluation measures can be enhanced, mining can
be refined, new data can be integrated and transformed in order to get different and more
appropriate results. Preprocessing of databases consists of Data cleaning and Data
Integration.
.Data preprocessing refers to the cleaning, transforming, and integrating of data in order
to make it ready for analysis.
.The goal of data preprocessing is to improve the quality of the data and to make it more
suitable for the specific data mining task.
1)Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
2)Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can
be used for data integration.
3)Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
4)Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
5)Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
6)Data Normalization: This involves scaling the data to a common range, such as between
0 and 1 or -1 and 1. Normalization is often used to handle data with different units and
scales. Common normalization techniques include min-max normalization, z-score
normalization, and decimal scaling.
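The three normalization techniques named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the `incomes` sample are assumptions made for the example, and the z-score uses the population standard deviation.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by 10**j, with j the smallest integer so that max |v| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

incomes = [200, 300, 400, 600, 1000]  # made-up sample data
scaled = min_max(incomes)             # [0.0, 0.125, 0.25, 0.5, 1.0]
standardized = z_score(incomes)       # mean 0, standard deviation 1
shifted = decimal_scaling(incomes)    # divides by 10**4, so 1000 becomes 0.1
```

Min-max keeps the shape of the distribution but is sensitive to outliers, because a single extreme value stretches the range. Z-score normalization is the usual choice when the minimum and maximum are unknown or unreliable.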
.Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of
the analysis results.
.The specific steps involved in data preprocessing may vary depending on the nature of the
data and the analysis goals.
.By performing these steps, the data mining process becomes more efficient and the
results become more accurate.
In data mining, data can come in various forms and types, each requiring different
techniques and tools for analysis. Here are some common types of data encountered in
data mining:
1. **Structured Data**:
- **Relational Data**: Data stored in tables, often in relational databases, where each
row represents a record and each column represents an attribute.
2. **Unstructured Data**:
- **Text Data**: Data in the form of text, such as emails, documents, social media posts,
etc. Text mining techniques are used to analyze and extract information from such data.
- **Multimedia Data**: Includes images, audio, and video files. Image processing, audio
analysis, and video analysis techniques are applied to mine this type of data.
3. **Semi-Structured Data**:
- **XML/JSON Data**: Data stored in formats like XML or JSON, which have some
structure but do not fit neatly into relational databases.
- **Web Data**: Data from web pages, which might include semi-structured elements
like HTML tags along with unstructured content.
4. **Time-Series Data**:
- Data collected over time, such as stock prices, sensor readings, or weather data.
Analysis involves identifying patterns, trends, and seasonal variations.
5. **Spatial Data**:
6. **Graph Data**:
- Data represented as nodes and edges, such as social networks, biological networks, or
transportation networks. Graph mining techniques are used to find patterns and
relationships in such data.
7. **Streaming Data**:
- Data that is continuously generated and arrives in real-time, such as data from IoT
devices, social media feeds, or financial transactions. Real-time data processing and
mining techniques are required for this type of data.
8. **High-Dimensional Data**:
- Data with a large number of attributes (dimensions), such as genomics data or image
pixel data. Techniques like dimensionality reduction are used to manage and analyze high-
dimensional data.
9. **Metadata**:
- Data about data, providing information such as the source, context, or characteristics of
the primary data. Metadata is useful for data management and understanding data
provenance.
.Different types of data require different preprocessing, analysis, and mining techniques to
extract meaningful insights. Understanding the nature of the data is crucial for selecting
appropriate data mining methods.
- **Scalability:** Limited scalability compared to ROLAP due to the size of the pre-
calculated data.
- **Flexibility:** Less flexible for complex queries but excels in speed for standard OLAP
queries.
- **Performance:** Offers a balance between ROLAP and MOLAP, providing good query
performance with the ability to handle large datasets.
- **Scalability:** More scalable than MOLAP and can manage larger datasets by
leveraging relational storage for detailed data.
- **Flexibility:** Balances the flexibility of ROLAP with the performance benefits of
MOLAP.
- **Example:** Microsoft SQL Server Analysis Services can be configured to use HOLAP.
The data mining tasks can be classified generally into two types based on what a specific
task tries to achieve.
. The descriptive data mining tasks characterize the general properties of data whereas
predictive data mining tasks perform inference on the available data set to predict how a
new data set will behave.
There are a number of data mining tasks such as classification, prediction, time-series
analysis, association, clustering, summarization etc.
All these tasks are either predictive data mining tasks or descriptive data mining tasks. A
data mining system can execute one or more of the above specified tasks as part of data
mining.
.Predictive data mining tasks come up with a model from the available data set that is
helpful in predicting unknown or future values of another data set of interest. A medical
practitioner trying to diagnose a disease based on the medical test results of a patient can
be considered as a predictive data mining task. The predictive tasks include:
1)Classification
.Classification derives a model to determine the class of an object based on its attributes.
. One of the attributes is the class attribute, and the goal of the classification task is to
assign a class label to a new set of records as accurately as possible.
2)Prediction
.Prediction involves developing a model based on the available data and this model is used
in predicting future values of a new data set of interest.
3)Time-Series Analysis
.A time series is a sequence of events where the next event is determined by one or more of
the preceding events.
.A time series reflects the process being measured, and there are certain components that
affect the behavior of a process.
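As a small illustration of the classification task described above, the sketch below assigns a class to a new record by copying the class of its closest training record. The nearest-neighbour rule is only one possible classifier and is chosen here for brevity; the function name, attributes, and records are hypothetical.

```python
import math

def nearest_neighbor_classify(training, query):
    """Return the class label of the training record closest to `query`."""
    best_label, best_dist = None, math.inf
    for attributes, label in training:
        dist = math.dist(attributes, query)  # Euclidean distance between attribute vectors
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical records: (age, income in thousands) with a credit-risk class attribute.
training = [
    ((25, 30), "high"),
    ((45, 90), "low"),
    ((35, 60), "low"),
    ((22, 20), "high"),
]

print(nearest_neighbor_classify(training, (40, 80)))  # prints "low"
print(nearest_neighbor_classify(training, (24, 25)))  # prints "high"
```

In practice the attributes would be normalized first, so that an attribute with a large numeric range does not dominate the distance.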
.Descriptive data mining tasks usually find patterns that describe the data and come up with
new, significant information from the available data set. A retailer trying to identify
products that are purchased together can be considered as a descriptive data mining task.
1)Association:
2)Clustering:
.Clustering is used to identify data objects that are similar to one another.
. The similarity can be decided based on a number of factors like purchase behavior,
responsiveness to certain actions, geographical locations and so on.
3)Summarization:
. A set of relevant data is summarized, which results in a smaller set that gives aggregated
information about the data.
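Summarization can be pictured as a group-and-aggregate step. The sketch below is a minimal example, assuming records stored as dictionaries; the `summarize` helper and the `sales` data are made up for illustration.

```python
from collections import defaultdict

def summarize(records, group_key, value_key):
    """Collapse records into one aggregated row (count, total, mean) per group."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for rec in records:
        group = totals[rec[group_key]]
        group["count"] += 1
        group["total"] += rec[value_key]
    return {k: {**v, "mean": v["total"] / v["count"]} for k, v in totals.items()}

# Made-up sales records.
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 60.0},
]

summary = summarize(sales, "region", "amount")
# summary["North"] is {"count": 2, "total": 180.0, "mean": 90.0}
```

Three detailed records become two aggregated rows, which is the "smaller set" the definition refers to.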
Measuring similarity and dissimilarity in data mining is an important task that helps
identify patterns and relationships in large datasets.
.To quantify the degree of similarity or dissimilarity between two data points or objects,
mathematical functions called similarity and dissimilarity measures are used.
.Similarity measures produce a score that indicates the degree of similarity between two
data points, while dissimilarity measures produce a score that indicates the degree of
dissimilarity between two data points.
Data similarity and dissimilarity are important measures in data mining that help in
identifying patterns and trends in datasets. Similarity measures are used to determine how
similar two datasets or data points are, while dissimilarity measures are used to determine
how different they are.
These measures are crucial for many data mining tasks, such as identifying duplicate
records, clustering, classification, and anomaly detection.
>>>>Similarity Measure
.It takes two data points as input and produces a similarity score as output, typically
ranging from 0 (completely dissimilar) to 1 (identical or perfectly similar).
>>>Dissimilarity Measure
.It takes two data points as input and produces a dissimilarity score as output, ranging from
0 (identical or perfectly similar) to 1 (completely dissimilar). A few dissimilarity measures
also have infinity as their upper limit.
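A few widely used measures illustrate the two definitions above. The sketch assumes numeric vectors for Euclidean distance and cosine similarity, and sets of items for the Jaccard measures; the choice of these particular measures is an illustration, not a list taken from the notes.

```python
import math

def euclidean_distance(a, b):
    """Dissimilarity: 0 for identical points, with no upper limit."""
    return math.dist(a, b)

def cosine_similarity(a, b):
    """Similarity of direction: 1 for parallel vectors, 0 for orthogonal ones.
    For vectors with negative components the score can go down to -1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical (a convention)
    return len(a & b) / len(a | b)

def jaccard_dissimilarity(a, b):
    """Dissimilarity in [0, 1], obtained as 1 minus the similarity."""
    return 1.0 - jaccard_similarity(a, b)

print(euclidean_distance((0, 0), (3, 4)))                            # prints 5.0
print(jaccard_similarity({"milk", "bread"}, {"bread", "eggs"}))      # one shared item out of three
```

Euclidean distance is an example of a dissimilarity measure whose upper limit is infinity, while the Jaccard measures stay within the 0 to 1 range.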
(Same as 5th)
Bitmap indexing and join indexing are techniques used in data mining and databases to
optimize query performance, especially in large datasets.
>>>Bitmap Indexing
**Definition:**
Bitmap indexing is a technique that uses bit arrays (bitmaps) to represent the presence or
absence of a value in a given column.
. This is particularly effective for columns with a low cardinality, meaning columns that
have a limited number of distinct values.
**How it Works:**
- For each distinct value in a column, a separate bitmap (bit array) is created.
- Each bitmap has a bit for every row in the table, indicating whether the row contains the
specific value.
- For example, for a column “Gender” with values “Male” and “Female,” there would be two
bitmaps: one for “Male” and one for “Female.” If a row contains “Male,” the corresponding
bit in the “Male” bitmap would be set to 1, and the bit in the “Female” bitmap would be set
to 0.
**Advantages:**
- Enables fast combination of conditions using bitwise operations (AND, OR, NOT).
- Reduces the storage space required compared to traditional indexing methods for low-
cardinality data.
**Disadvantages:**
- Less effective for high-cardinality columns (columns with many distinct values).
- Can become inefficient for update-intensive environments because updating the bitmaps
requires modifying multiple bit arrays.
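The bitmap idea can be sketched with plain Python integers used as bit arrays, where bit i corresponds to row i. The helper names and the extra "status" column are assumptions added for illustration; real database engines store and compress bitmaps very differently.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is 1 when row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Decode a bitmap back into the list of matching row numbers."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

# Two low-cardinality columns of a hypothetical five-row table.
gender = ["Male", "Female", "Female", "Male", "Female"]
status = ["Single", "Married", "Single", "Married", "Married"]

gender_index = build_bitmap_index(gender)
status_index = build_bitmap_index(status)

# WHERE gender = 'Female' AND status = 'Married' becomes a single bitwise AND.
matches = gender_index["Female"] & status_index["Married"]
print(rows_of(matches))  # prints [1, 4]

# WHERE gender = 'Male' OR status = 'Single' becomes a bitwise OR.
either = gender_index["Male"] | status_index["Single"]
print(rows_of(either))   # prints [0, 2, 3]
```

Each distinct value costs one bit per row, which is why the technique suits low-cardinality columns and becomes wasteful when a column has many distinct values.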
>>>Join Indexing
**Definition:**
Join indexing is a technique that pre-computes and stores the relationships between
tables, facilitating faster join operations.
**How it Works:**
- A join index maintains the mapping between the rows of two or more tables that are
frequently joined together.
- For example, if there are two tables, “Orders” and “Customers,” a join index would store
pairs of row identifiers from “Orders” and “Customers” that are related based on a
common key, such as “CustomerID.”
**Advantages:**
- Speeds up join operations by avoiding the need to repeatedly compute joins during query
execution.
- Useful in environments where join operations are frequent and involve large tables.
**Disadvantages:**
- Needs to be updated whenever the underlying tables are modified, which can add
overhead in environments with frequent updates.
- Can be complex to manage, especially as the number of tables and join conditions
increases.
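A join index can be sketched as a precomputed list of row-identifier pairs. The example below assumes in-memory tables held as lists of dictionaries, with the list position acting as the row identifier; the table contents and the `build_join_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical tables; a row's identifier is its position in the list.
customers = [
    {"CustomerID": 10, "Name": "Asha"},
    {"CustomerID": 11, "Name": "Ben"},
]
orders = [
    {"OrderID": 1, "CustomerID": 11, "Amount": 250},
    {"OrderID": 2, "CustomerID": 10, "Amount": 90},
    {"OrderID": 3, "CustomerID": 11, "Amount": 40},
]

def build_join_index(left, right, key):
    """Precompute (left_row_id, right_row_id) pairs for rows that share `key`."""
    right_rows = defaultdict(list)
    for rid, row in enumerate(right):
        right_rows[row[key]].append(rid)
    return [
        (lid, rid)
        for lid, row in enumerate(left)
        for rid in right_rows.get(row[key], [])
    ]

# Built once, ahead of query time.
join_index = build_join_index(orders, customers, "CustomerID")

def joined_rows():
    """Answer the join by following the stored pairs instead of re-matching keys."""
    return [{**orders[o], **customers[c]} for o, c in join_index]

print(join_index)  # prints [(0, 1), (1, 0), (2, 1)]
```

The cost noted under the disadvantages is visible here: any insert, delete, or key change in either table means `join_index` has to be rebuilt or patched.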
- **Bitmap Indexing:** Ideal for scenarios with low-cardinality columns and read-heavy
workloads, such as data warehouses and decision support systems.
- **Join Indexing:** Best suited for environments where complex join operations are
frequent, such as data warehouses with star or snowflake schemas.
1. >>>>>>Data Cleaning:
.The data can have many irrelevant and missing parts. Data cleaning is done to handle
these.
Missing Data:
This situation arises when some values are missing from the data. It can be handled in
various ways.
Ignoring the tuple:
This approach is suitable only when the dataset we have is quite large and multiple values
are missing within a tuple.
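Two common ways of handling missing values can be sketched as follows: dropping tuples that have several missing values, and filling the remaining gaps with the column mean. Mean imputation is an assumption added here as one example of filling; missing values are represented as `None`, and the helper names and rows are made up.

```python
import statistics

def drop_sparse_tuples(rows, max_missing=1):
    """Ignore tuples that have more than `max_missing` missing (None) values."""
    return [r for r in rows if sum(v is None for v in r) <= max_missing]

def fill_with_mean(rows, column):
    """Replace missing values in one numeric column with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = statistics.fmean(known)
    return [
        tuple(mean if i == column and v is None else v for i, v in enumerate(r))
        for r in rows
    ]

# Made-up (age, income) tuples with gaps.
rows = [(25, 50000.0), (None, None), (31, None), (40, 70000.0)]

kept = drop_sparse_tuples(rows)          # the tuple with two missing values is dropped
cleaned = fill_with_mean(kept, column=1) # the remaining gap becomes 60000.0
```

Dropping tuples throws information away, so it is reasonable only when the dataset is large. Filling keeps every record but can distort the distribution when many values are missing.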
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated
by faulty data collection, data entry errors, etc.
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and each segment is handled separately: its values can be replaced
by the segment mean or by the nearest segment boundary value.
Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).
Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will
fall outside the clusters.
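The binning method described above can be sketched as smoothing by bin means: sort the data, cut it into segments of equal size, and replace each value by the mean of its segment. The function name and the `prices` sample are assumptions for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into bins of `bin_size`, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up sample, already sorted

print(smooth_by_bin_means(prices, bin_size=3))
# prints [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

If the number of values is not a multiple of `bin_size`, the last bin is simply shorter. Smoothing by bin boundaries would follow the same loop, replacing each value by the closer of the bin's minimum and maximum.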
2. >>>>>Data Transformation:
This step transforms the data into forms suitable for the mining process.
Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
Attribute Construction:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
Discretization:
This is done to replace the raw values of a numeric attribute with interval labels or
conceptual labels.
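Discretization can be sketched with two simple binning schemes, equal width and equal frequency. Both functions return a bin number per value, which could then be mapped to interval or conceptual labels. The function names and the `ages` sample are made up for illustration.

```python
def equal_width_bins(values, k):
    """Assign each value a bin number 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:  # constant column: everything falls into one bin
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin numbers so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

ages = [5, 12, 19, 23, 35, 41, 58, 64, 90]  # made-up sample

print(equal_width_bins(ages, 3))      # prints [0, 0, 0, 0, 1, 1, 1, 2, 2]
print(equal_frequency_bins(ages, 3))  # prints [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Hypothetical conceptual labels for the three bins.
labels = ["young", "middle-aged", "senior"]
age_groups = [labels[b] for b in equal_frequency_bins(ages, 3)]
```

Equal-width bins are easy to interpret but can be very uneven when the data is skewed, while equal-frequency bins keep the counts balanced at the cost of irregular interval boundaries.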
3. >>>>>Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of
the dataset while preserving the important information.
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information.
Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information.
Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid.
Compression: This involves compressing the dataset while preserving the important
information.
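Two of the reduction techniques above, sampling and feature selection, can be sketched with the standard library. The variance threshold used for feature selection is an assumption chosen for simplicity and is only one of many selection criteria; the helper names and the `rows` data are hypothetical.

```python
import random
import statistics

def simple_random_sample(rows, n, seed=0):
    """Sampling: pick n rows without replacement (seeded so the run is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def drop_low_variance_features(rows, threshold):
    """Feature selection sketch: keep only columns whose variance exceeds `threshold`."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if statistics.pvariance(col) > threshold]
    return keep, [tuple(r[i] for i in keep) for r in rows]

# Made-up dataset: the middle column is constant and carries no information.
rows = [(1.0, 5.0, 0.0), (2.0, 5.0, 1.0), (3.0, 5.0, 0.0), (4.0, 5.0, 1.0)]

sample = simple_random_sample(rows, 2)
kept_columns, reduced = drop_low_variance_features(rows, threshold=0.1)
# kept_columns is [0, 2]; each row of `reduced` has two values instead of three
```

Sampling reduces the number of rows, while feature selection reduces the number of columns, so the two are often combined.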