Unit 1 DM
Multimedia database
Multimedia databases are used to store multimedia data such as
images, animation, audio, and video along with text. This data is stored in
the form of multiple file types,
like .txt (text), .jpg (images), .swf (Flash animation), .mp3 (audio), etc.
Spatial database
A spatial database is a database that is enhanced to store and access
spatial data or data that defines a geometric space. These data are
often associated with geographic locations and features, or
constructed features like cities. Data in spatial databases is stored as
coordinates: points, lines, polygons, and topology.
Flat files
Flat files are data files in text or binary form with a structure that can
be easily extracted by data mining algorithms.
Data stored in flat files has no relationships or paths among the
records; for example, if a relational database is stored in flat files,
there will be no relations between the tables.
Data characterization: a summarization of the general characteristics
or features of a target class of data.
Data discrimination: a comparison of the general features of a target
class of data against the general features of one or more contrasting
classes.
Classification
Classification is a data mining technique that categorizes items in a
collection based on some predefined properties. It uses methods like
IF-THEN rules, decision trees, or neural networks to predict a class,
essentially classifying a collection of items.
Classification is a supervised learning technique used to categorize
data into predefined classes or labels.
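As an illustration of the IF-THEN style of classifier mentioned above, here is a minimal sketch (the attribute names, threshold, and risk labels are all hypothetical):

```python
# IF-THEN rule classifier: assigns a predefined class label to each item
# based on its attributes (hypothetical credit-risk example).
def classify(customer):
    if customer["income"] >= 50000 and not customer["has_defaulted"]:
        return "low_risk"
    elif customer["income"] >= 50000:
        return "medium_risk"
    else:
        return "high_risk"

customers = [
    {"income": 60000, "has_defaulted": False},
    {"income": 70000, "has_defaulted": True},
    {"income": 20000, "has_defaulted": False},
]
labels = [classify(c) for c in customers]
print(labels)  # ['low_risk', 'medium_risk', 'high_risk']
```

A decision tree or neural network would learn such rules from labeled training data instead of having them hand-written.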
Prediction
Finding missing data in a database is very important for the accuracy
of the analysis. Prediction is the data mining functionality
that helps the analyst estimate missing numeric values; if a
class label is missing instead, classification is used. Prediction is
very important in business intelligence and is widely used: a model is
fitted to the available data and then used to estimate the missing or
unavailable values.
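A minimal sketch of predicting a missing numeric value, using a simple least-squares line fitted to the complete records (the ages and incomes are illustrative):

```python
# Predict a missing income from age with simple linear regression,
# fitted on the records where income is recorded.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx  # slope and intercept

ages    = [25, 30, 35, 40, 45]                  # complete records
incomes = [30000, 36000, 42000, 48000, 54000]
slope, intercept = fit_line(ages, incomes)

missing_age = 50  # record whose income is not recorded
predicted_income = slope * missing_age + intercept
print(round(predicted_income))  # 60000
```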
Association Analysis
Association analysis is a functionality of data mining. It relates two
or more attributes of the data: it discovers the relationships between the
data and the rules that bind them. It is also known as Market
Basket Analysis because of its wide use in retail sales.
The suggestion that Amazon shows at the bottom, “Customers who
bought this also bought…”, is a real-world example of association
analysis.
It relates items that occur together in transactions and estimates the
probability of the same combination occurring again. This helps
companies improve their sales of various items.
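The probability estimates behind rules like “customers who bought X also bought Y” are the rule's support and confidence. A small sketch over made-up baskets:

```python
# Support and confidence for the rule {bread} -> {butter}
# over illustrative market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]
n = len(transactions)
both  = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support    = both / n      # fraction of all baskets containing both items
confidence = both / bread  # P(butter | bread): how often the rule holds
print(support, confidence)  # support = 0.5, confidence ≈ 0.667
```

Algorithms such as Apriori enumerate all rules whose support and confidence exceed user-given thresholds.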
Cluster Analysis
Clustering is an unsupervised learning technique that groups similar
data points together based on their features. The goal is to identify
underlying structures or patterns in the data. Some common clustering
algorithms include k-means, hierarchical clustering, and DBSCAN.
This data mining functionality is similar to classification, but in this
case the class label is unknown: similar objects are grouped in a
cluster, and there are large differences between one cluster and another.
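A tiny k-means sketch on one-dimensional points (the data and starting centers are illustrative); no class labels are given, and the two groups emerge from the values themselves:

```python
# Minimal k-means (k = 2) on 1-D points: assign each point to the
# nearest center, then move each center to the mean of its cluster.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

points = [1, 2, 3, 20, 21, 22]  # two obvious groups
print(kmeans_1d(points, centers=[1, 20]))  # [2.0, 21.0]
```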
Outlier Analysis
When data appears that cannot be grouped into any of the classes, we use
outlier analysis. There will be occurrences of data whose
attributes/features differ from those of all the other classes or clusters.
These outstanding data are called outliers. They are usually
considered noise or exceptions, and the analysis of these outliers is
called outlier mining.
Outlier analysis is important to understand the quality of data. If there
are too many outliers, you cannot trust the data or draw patterns out of
it.
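A common simple way to flag outliers is a z-score rule: treat any value more than, say, two standard deviations from the mean as an exception. A sketch with made-up sensor readings:

```python
# Z-score outlier detection: values far from the mean (in units of
# standard deviation) are flagged as outliers.
import statistics

values = [52, 50, 51, 49, 50, 48, 51, 120]  # 120 is an anomalous reading
mean  = statistics.mean(values)
stdev = statistics.pstdev(values)

outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)  # [120]
```

If a large fraction of the data were flagged this way, that would be a signal to distrust the data set rather than the individual points.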
Interestingness of Patterns
A data mining system has the potential to generate thousands or even
millions of patterns, or rules. This raises a serious question for data
mining: you may wonder, “Are all of the patterns interesting?”
Typically, not; only a small fraction of the patterns potentially
generated would be of interest to any given user.
We can classify a data mining system according to the kind of
knowledge mined. It means the data mining system is classified based
on functionalities such as
Association Analysis
Classification
Prediction
Cluster Analysis
Characterization
Discrimination
Classification based on the Applications Adapted
Finance
Telecommunications
E-Commerce
Medical Sector
Stock Markets
Database Attributes
Data Warehouse dimensions of interest
For example, suppose that you are a manager of AllElectronics in
charge of sales in the United States and Canada. You would like to
study the buying trends of customers in Canada. Rather than mining
the entire database, you can specify that only the data relating to
Canadian customer purchases, together with the attributes of interest,
be retrieved. The attributes specified in this way are referred to as
relevant attributes.
Characterization & Discrimination
Association
Classification
Clustering
Prediction
Outlier analysis
For instance, if studying the buying habits of customers in Canada,
you may choose to mine associations between customer profiles and
the items that these customers like to buy.
Background knowledge to be used in the discovery process
No Coupling
Loose Coupling
Semi-Tight Coupling
Tight Coupling
No Coupling
No coupling means that a Data Mining system will not utilize any
function of a Data Base or Data Warehouse system.
It may fetch data from a particular source (such as a file system),
process data using some data mining algorithms, and then store the
mining results in another file.
Drawbacks of No Coupling
Without a DB/DW system, the data mining system may spend a
substantial amount of time finding, collecting, cleaning, and
transforming data, and it cannot take advantage of the scalable
storage, indexing, and query-processing facilities that such systems
provide.
Loose Coupling
In loose coupling, the data mining system uses some facilities or
services of a database or data warehouse system. The data is fetched
from a data repository managed by these (DB/DW) systems.
Data mining approaches are used to process the data and then the
processed data is saved either in a file or in a designated area in a
database or data warehouse.
Loose coupling is better than no coupling because it can fetch any
portion of data stored in Databases or Data Warehouses by using
query processing, indexing, and other system facilities.
Semi-Tight Coupling
Semi-tight coupling means that besides linking a Data Mining system
to a Data Base/Data Warehouse system, efficient implementations of a
few essential data mining primitives can be provided in the DB/DW
system. These primitives can include sorting, indexing, aggregation,
histogram analysis, multiway join, and precomputation of some
essential statistical measures, such as sum, count, max, min, and
standard deviation.
Tight Coupling
Tight coupling means that a Data Mining system is smoothly
integrated into the Data Base/Data Warehouse system. The data
mining subsystem is treated as one functional component of an
information system. Data mining queries and functions are optimized
based on mining query analysis, data structures, indexing schemes,
and query processing methods of a DB or DW system.
Performance Issues
There can be performance-related issues, such as the efficiency and
scalability of data mining algorithms and the need for parallel,
distributed, and incremental mining methods.
Data Preprocessing
1. Data Cleaning
Missing Values
Imagine that you need to analyze AllElectronics sales and customer
data. You note that many tuples have no recorded value for several
attributes, such as customer income. How can you go about filling in
the missing values for this attribute? There are several methods to fill
in the missing values.
Those are,
a. Ignore the tuple: This is usually done when the class label is
missing (classification). This method is not very effective unless
the tuple contains several attributes with missing values.
b. Fill in the missing value manually: In general, this approach is
time-consuming and may not be feasible given a large data set
with many missing values.
c. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
“Unknown” or −∞.
d. Use the attribute mean or median to fill in the missing
value: Replace all missing values in the attribute by the mean or
median of that attribute's values.
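Method (d) above can be sketched in a few lines; `None` marks a missing income value, and the data are illustrative:

```python
# Mean imputation: replace missing (None) income values with the mean
# of the recorded values for that attribute.
incomes = [30000, None, 45000, None, 60000]

known = [v for v in incomes if v is not None]
mean_income = sum(known) / len(known)  # 45000.0

filled = [v if v is not None else mean_income for v in incomes]
print(filled)  # [30000, 45000.0, 45000, 45000.0, 60000]
```

Using the median instead (`statistics.median(known)`) is more robust when the attribute contains outliers.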
Noisy Data
Noise is a random error or variance in a measured variable. Data
smoothing techniques are used to eliminate noise and extract the
useful patterns. The different techniques used for data smoothing are:
a. Binning: Binning methods smooth a sorted data value by
consulting its “neighbourhood,” that is, the values around it. The
sorted values are distributed into several “buckets,” or bins.
Because binning methods consult the neighbourhood of values,
they perform local smoothing.
There are three kinds of binning. They are:
o Smoothing by Bin Means: In this method, each value in a
bin is replaced by the mean value of the bin. For example,
the mean of the values 4, 8, and 15 in Bin 1 is 9.
Therefore, each original value in this bin is replaced by the
value 9.
o Smoothing by Bin Medians: In this method, each value in a
bin is replaced by the median value of the bin. For
example, the median of the values 4, 8, and 15 in Bin 1 is
8. Therefore, each original value in this bin is replaced by
the value 8.
o Smoothing by Bin Boundaries: In this method, the
minimum and maximum values in each bin are identified
as the bin boundaries. Each bin value is then replaced by
the closest boundary value. For example, the middle value
8 of the values 4, 8, and 15 in Bin 1 is replaced by the
nearest boundary, i.e., 4.
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
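The three smoothings above can be reproduced programmatically; this sketch simply recomputes the worked example:

```python
# Equal-frequency binning of the sorted prices, then smoothing by
# bin means, bin medians, and bin boundaries.
import statistics

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

by_means   = [[statistics.mean(b)] * len(b) for b in bins]    # 9, 22, 29
by_medians = [[statistics.median(b)] * len(b) for b in bins]  # 8, 21, 28
by_boundaries = [
    # replace each value with whichever boundary (min or max) is closer
    [min((b[0], b[-1]), key=lambda bound: abs(v - bound)) for v in b]
    for b in bins
]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```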
2. Data Integration
Data integration is the process of combining data from multiple
sources into a single, unified view. This process involves identifying
and accessing the different data sources, mapping the data to a
common format. Different data sources may include multiple data
cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze
data that is spread across multiple systems or platforms, in order to
gain a more complete and accurate understanding of the data.
Data integration strategy is typically described using a triple (G, S, M)
approach, where G denotes the global schema, S denotes the schema
of the heterogeneous data sources, and M represents the mapping
between the queries of the source and global schema.
There are several issues that can arise when integrating data from
multiple sources, including the entity identification problem
(matching equivalent real-world entities across sources), redundancy
(attributes derivable from other attributes), and data value conflicts
(the same entity having different attribute values in different sources).
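The (G, S, M) idea can be illustrated with a toy mapping M that renames source attributes (all names here are hypothetical) into one global schema G:

```python
# M: mapping from source-schema attribute names (S) to the global
# schema (G), applied to records from two heterogeneous sources.
mapping = {
    "cust_id":    "customer_id",
    "CustomerNo": "customer_id",
    "income":     "annual_income",
    "salary":     "annual_income",
}

source_a = {"cust_id": 1, "income": 50000}     # schema of source A
source_b = {"CustomerNo": 2, "salary": 62000}  # schema of source B

unified = [{mapping[k]: v for k, v in rec.items()}
           for rec in (source_a, source_b)]
print(unified)
```

Queries against the global schema can then be answered uniformly, regardless of which source a record came from.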
3. Data Reduction
Imagine that you have selected data from the AllElectronics data
warehouse for analysis. The data set will likely be huge! Complex data
analysis and mining on huge amounts of data can take a long time,
making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data. That is, mining on
the reduced data set should be more efficient yet produce the same (or
almost the same) analytical results.
In simple words, data reduction is a technique used in data mining to
reduce the size of a dataset while still preserving the most important
information. This can be beneficial in situations where the dataset is
too large to be processed efficiently, or where the dataset contains a
large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used
in data mining, including dimensionality reduction, numerosity
reduction (for example, sampling), and data compression.
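One simple numerosity-reduction technique is random sampling: keep a small subset that still approximates the statistics of the full data set. An illustrative sketch:

```python
# Numerosity reduction via simple random sampling without replacement:
# a 10% sample standing in for the full data set.
import random

random.seed(0)  # deterministic for the illustration
data = list(range(1, 1001))          # 1,000 values, mean = 500.5
sample = random.sample(data, k=100)  # reduced representation

full_mean   = sum(data) / len(data)
sample_mean = sum(sample) / len(sample)
print(len(sample), full_mean)  # 100 500.5
```

Mining on `sample` is ten times cheaper, and for aggregate patterns the results should be close to those on `data`.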
4. Data Transformation
Data transformation in data mining refers to the process of converting
raw data into a format that is suitable for analysis and modelling. The
goal of data transformation is to prepare the data for data mining so
that it can be used to extract useful insights and knowledge.
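One common transformation is min-max normalization, which rescales an attribute to the [0, 1] range so that differently scaled attributes become comparable. A sketch with illustrative values:

```python
# Min-max normalization: v' = (v - min) / (max - min), mapping the
# attribute's range onto [0, 1].
values = [200, 300, 400, 600, 1000]  # illustrative raw attribute values

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```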