
Unit 1

Database Data (or) Relational database


A database system, also called a database management system
(DBMS), consists of a collection of interrelated data, known as a
database, and a set of software programs to manage and access the
data.
A relational database is a collection of tables, each of which is
assigned a unique name. Each table consists of a set of attributes
(columns or fields) and usually stores a large set of tuples (records or
rows). Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values.
Example:
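A minimal sketch using Python's built-in sqlite3 module (the customer table and its values are invented for illustration) shows a relation, its attributes, and a unique key:

import sqlite3

# One relation (table) with a unique name; cust_id is the unique key.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer (
    cust_id INTEGER PRIMARY KEY,   -- unique key identifying each tuple
    name    TEXT,                  -- attributes (columns or fields)
    city    TEXT,
    income  REAL)""")
conn.execute("INSERT INTO customer VALUES (1, 'Smith', 'Vancouver', 52000.0)")
conn.execute("INSERT INTO customer VALUES (2, 'Lee', 'Toronto', 38000.0)")

# Each row returned is one tuple (record) of the relation.
for row in conn.execute("SELECT * FROM customer"):
    print(row)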

Data warehouse data


A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually residing
at a single site. Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data loading, and
periodic data refreshing.
A data warehouse is defined as the collection of data integrated from
multiple sources. Later this data can be mined for decision making.
A data warehouse is usually modelled by a multidimensional data
structure, called a data cube, in which each dimension corresponds to
an attribute or a set of attributes in the schema, and each cell stores
the value of some aggregate measure such as count or sum. A data
cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
Example:
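Since a data cube stores aggregate measures over dimensions, a rough sketch with pandas (the sales facts below are hypothetical) can illustrate the idea by precomputing a sum measure over two dimensions:

import pandas as pd

# Hypothetical sales facts: each row is one source record.
sales = pd.DataFrame({
    "item":    ["TV", "TV", "phone", "phone"],
    "quarter": ["Q1", "Q2", "Q1",    "Q2"],
    "amount":  [400,  560,  300,     480],
})

# Aggregate the 'amount' measure along the item and quarter dimensions.
cube = sales.pivot_table(index="item", columns="quarter",
                         values="amount", aggfunc="sum")
print(cube)  # a 2-D slice of the data cube; each cell holds a sum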
Transactional data
A transactional database is a collection of records organized by
timestamps, dates, etc., each representing a transaction. In general, each
record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web
page. A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the transaction,
such as the items purchased.
This type of database has the capability to roll back (undo) operations
when a transaction is not completed or committed, and it follows the
ACID properties of a DBMS; a small sketch of this rollback behaviour
follows the example table below.
Example:
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Popcorn, Coke, Egg, Milk
T4 Popcorn, Bread, Egg, Milk
T5 Coke, Egg, Milk
Fig: Transactional data
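The rollback behaviour can be sketched with Python's sqlite3 module; the purchases table and the simulated failure are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (tid TEXT, item TEXT)")

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute("INSERT INTO purchases VALUES ('T1', 'Bread')")
        raise RuntimeError("payment failed")  # simulate an aborted transaction
except RuntimeError:
    pass

# The insert was rolled back, so the table is still empty.
print(conn.execute("SELECT COUNT(*) FROM purchases").fetchone())  # prints (0,)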

Multimedia database
Multimedia databases are used to store multimedia data such as
images, animation, audio, and video along with text. This data is stored
in the form of multiple file types,
like .txt (text), .jpg (images), .mp4 (video), .mp3 (audio), .swf (animation), etc.
Spatial database
A spatial database is a database that is enhanced to store and access
spatial data or data that defines a geometric space. These data are
often associated with geographic locations and features, or
constructed features like cities. Data on spatial databases are stored as
coordinates, points, lines, polygons and topology.

World Wide Web


The World Wide Web is a collection of documents and resources such
as audio, video, and text. These resources are identified by URLs and
linked together through hypertext (HTML) pages. Online shopping,
job hunting, and research are some of its uses.
It is the most heterogeneous repository, as it collects data from many
different sources, and it is dynamic in nature: the volume of data is
continuously growing and changing.
Text data (Flat File)
Flat files are a type of structured data that are stored in a plain text
format. They are called “flat” because they have no hierarchical
structure, unlike a relational database table. Flat files typically consist
of rows and columns of data, with each row representing a single
record and each column representing a field or attribute within that
record. They can be stored in various formats such as CSV, tab-
separated values (TSV) and fixed-width format.

 Flat files are defined as data files in text or binary form with
a structure that can be easily extracted by data mining
algorithms.
 Data stored in flat files has no relationships or access paths among
itself; for example, if a relational database is stored as flat files,
there are no relations between the tables.
Example:
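A small sketch of reading a flat file with Python's csv module; the file contents are invented for illustration:

import csv
import io

# A CSV flat file: one record per row, one field per column, no relations.
flat_file = io.StringIO("cust_id,name,city\n1,Smith,Vancouver\n2,Lee,Toronto\n")

for record in csv.DictReader(flat_file):
    print(record)  # e.g. {'cust_id': '1', 'name': 'Smith', 'city': 'Vancouver'}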

Time series database


Time-series data is a sequence of data points collected over time
intervals, allowing us to track changes over time. Time-series data can
track changes over milliseconds, days, or even years.
A time series database (TSDB) is a database optimized for time-
stamped or time series data. Time series data are simply
measurements or events that are tracked, monitored, downsampled,
and aggregated over time. This could be server metrics, application
performance monitoring, network data, sensor data, events, clicks,
trades in a market, and many other types of analytics data.
Example:
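A brief pandas sketch (the CPU readings are made up) of time-stamped measurements being downsampled and aggregated, the core operations described above:

import pandas as pd

# One CPU-usage reading every 15 minutes, indexed by timestamp.
readings = pd.Series(
    [31.0, 35.5, 40.2, 38.7, 45.1, 50.3, 48.9, 52.4],
    index=pd.date_range("2024-01-01 00:00", periods=8, freq="15min"),
)

# Downsample to hourly resolution, aggregating by mean.
print(readings.resample("1h").mean())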
What is Data Mart?
A Data Mart is focused on a single functional area of an organization
and contains a subset of the data stored in a Data Warehouse. A Data
Mart is an abbreviated version of a Data Warehouse, designed for
use by a specific department, unit, or set of users in an organization,
e.g., Marketing, Sales, HR, or Finance. It is often controlled by a
single department in an organization.


Data Mining Functionalities


Data mining is important because there is so much data out there, and
it's impossible for people to look through it all by themselves.
Data mining uses various functionalities to analyze the data and find
patterns, trends, and other information that would be hard for people
to find on their own.
Data mining functionalities are used to specify the kinds of patterns to
be found in data mining tasks. In general, such data mining tasks can
be classified into two categories: descriptive and predictive.
Descriptive data mining
Similarities and patterns in data may be discovered using descriptive
data mining.
This kind of mining focuses on transforming raw data into
information that can be used in reports and analyses. It provides
summary knowledge about the data, for instance, counts and averages.
It describes what is happening inside the data without any prior
hypothesis and exhibits the common features in the data. In
simple words, you get to know the general properties of the data
present in the database.

Predictive data mining


These kinds of mining tasks perform inference on the current data in
order to make predictions.
This helps analysts understand characteristics that are not explicitly
available, for instance, predicting next quarter's business performance
from the performance of the previous quarters. In general, predictive
analysis predicts or infers unknown characteristics from the previously
available data.

The following are data mining functionalities:

 Class/Concept Description (Characterization and


Discrimination)
 Classification
 Prediction
 Association Analysis
 Cluster Analysis
 Outlier Analysis
Class/Concept Description: Characterization and Discrimination
Data is associated with classes or concepts.
Class: A collection of things sharing a common attribute
Example: Classes of items – computers and printers
Concept: An abstract or general idea derived from specific
instances.
Example: Concepts of customers – bigSpenders and
budgetSpenders.
It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions.
These descriptions can be derived using data
characterization, data discrimination, or both.

Data characterization

Data characterization is a summarization of the general characteristics
or features of a target class of data.
Data summarization can be done based on statistical measures and
plots.
The output of data characterization can be presented in various forms,
including pie charts, bar charts, curves, and multidimensional data
cubes.
Example: A customer relationship manager at AllElectronics may
order the following data mining task: summarize the characteristics of
customers who spend more than $5000 a year at AllElectronics.
The result is a general profile of these customers, such as that they
are 40 to 50 years old, employed, and have excellent credit ratings.
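A minimal sketch of such a characterization with pandas, assuming a hypothetical customer table with age, annual_spend, and credit_rating attributes:

import pandas as pd

customers = pd.DataFrame({
    "age":           [42, 47, 45, 50, 41],
    "annual_spend":  [5200, 6100, 5750, 8900, 5300],
    "credit_rating": ["excellent"] * 5,
})

# Characterize the target class: customers spending more than $5000 a year.
target = customers[customers["annual_spend"] > 5000]
print(target["age"].describe())                # statistical summary of age
print(target["credit_rating"].value_counts())  # dominant categorical feature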

Data discrimination

Data discrimination is one of the functionalities of data mining. It
compares the data between two classes. Generally, it maps the
target class against a predefined contrasting group or class. It compares
and contrasts the characteristics of the classes
using a set of rules called discriminant rules.
Example: A customer relationship manager at AllElectronics may
want to compare two groups of customers: those who shop for
computer products regularly (e.g., more than twice a month) and those
who rarely shop for such products (e.g., less than three times a year).
The resulting description provides a general comparative profile of
these customers, such as that 80% of the customers who frequently
purchase computer products are between 20 and 40 years old and have
a university education, whereas 60% of the customers who
infrequently buy such products are either seniors or youths and have
no university degree.
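A small pandas sketch of contrasting the two classes (the shopper data is hypothetical):

import pandas as pd

shoppers = pd.DataFrame({
    "group":      ["frequent"] * 3 + ["infrequent"] * 3,
    "age":        [25, 32, 38, 67, 17, 70],
    "has_degree": [True, True, False, False, False, False],
})

# Contrast the general features of the target and contrasting classes.
print(shoppers.groupby("group").agg(mean_age=("age", "mean"),
                                    degree_rate=("has_degree", "mean")))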

Classification
Classification is a data mining technique that categorizes items in a
collection based on some predefined properties. It uses methods like
IF-THEN rules, decision trees, or neural networks to predict a class,
essentially classifying a collection of items.
Classification is a supervised learning technique used to categorize
data into predefined classes or labels.
Example:
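A minimal sketch with scikit-learn's decision tree classifier; the training data and class labels are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, income]; labels are predefined classes.
X = [[25, 30000], [40, 90000], [35, 60000], [22, 20000]]
y = ["budgetSpender", "bigSpender", "bigSpender", "budgetSpender"]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[30, 75000]]))  # assign an unseen customer to a class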
Prediction
Finding missing data in a database is very important for the accuracy
of an analysis. Prediction is the data mining functionality that helps
the analyst estimate missing numeric values; if the missing value is a
class label, classification is used instead. Prediction is very important
in business intelligence and is very popular. One common method is to
estimate the missing or unavailable numeric data using regression
analysis.
Example:
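A minimal sketch, assuming income is predicted from age with scikit-learn's linear regression (one common way to estimate a missing numeric value):

from sklearn.linear_model import LinearRegression

# Tuples with known income are used to predict a tuple whose income is missing.
known_age    = [[25], [30], [40], [50]]
known_income = [30000, 38000, 52000, 67000]

model = LinearRegression().fit(known_age, known_income)
print(model.predict([[35]]))  # estimated income for the incomplete tuple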
Association Analysis
Association analysis is a functionality of data mining that relates two
or more attributes of the data. It discovers relationships between the
data and the rules that bind them. It is also known as Market
Basket Analysis because of its wide use in retail sales.
The suggestion Amazon shows at the bottom of a product page,
“Customers who bought this also bought...”, is a real-world example of
association analysis.
It relates transactions containing similar items and finds the
probability of the same combination occurring again. This helps
companies improve their sales of various items.
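A minimal sketch computing support and confidence for one candidate rule over the transaction table from the transactional-data example above:

# Transactions T1-T5 from the transactional-data example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Popcorn", "Bread"},
    {"Popcorn", "Coke", "Egg", "Milk"},
    {"Popcorn", "Bread", "Egg", "Milk"},
    {"Coke", "Egg", "Milk"},
]

# Evaluate the candidate rule {Milk} => {Egg}.
antecedent, consequent = {"Milk"}, {"Egg"}
both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

print("support:", both / len(transactions))  # 3/5 = 0.6
print("confidence:", both / ante)            # 3/4 = 0.75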
Cluster Analysis
Clustering is an unsupervised learning technique that groups similar
data points together based on their features. The goal is to identify
underlying structures or patterns in the data. Some common clustering
algorithms include k-means, hierarchical clustering, and DBSCAN.
This data mining functionality is similar to classification, but in this
case the class labels are unknown. Similar objects are grouped in a
cluster, and there are large differences between one cluster and another.
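A minimal sketch with scikit-learn's k-means; the points are invented and no class labels are supplied:

from sklearn.cluster import KMeans

# Unlabelled points: no class labels are given, unlike classification.
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)  # cluster assignment discovered for each point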
Outlier Analysis
When data appears that cannot be grouped into any of the classes, we
use outlier analysis. There will be occurrences of data whose
attributes/features differ from those of all the other classes or clusters.
Such exceptional data objects are called outliers. They are usually
considered noise or exceptions, and the analysis of these outliers is
called outlier mining.
Outlier analysis is also important for understanding the quality of the
data: if there are too many outliers, you cannot trust the data or draw
reliable patterns from it.
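A small sketch of flagging outliers by their distance from the mean (a simple z-score-style rule; the salary values, in thousands, are invented):

from statistics import mean, stdev

salaries = [48, 52, 50, 49, 51, 250]  # one suspicious value

# Flag values more than two standard deviations from the mean.
mu, sigma = mean(salaries), stdev(salaries)
outliers = [x for x in salaries if abs(x - mu) > 2 * sigma]
print(outliers)  # [250]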
Interestingness of Patterns
A data mining system has the potential to generate thousands or even
millions of patterns, or rules. So, “are all of the patterns
interesting?” Typically not: only a small fraction of the patterns
potentially generated would be of interest to any given user.
This raises some serious questions for data mining. You may wonder,

1. What makes a pattern interesting?


2. Can a data mining system generate all the interesting patterns?
3. Can a data mining system generate only interesting patterns?
To answer the first question, a pattern is interesting if it is

1. easily understood by humans,
2. valid on new or test data with some degree of certainty,
3. potentially useful, and
4. novel.
The second question, “Can a data mining system generate all the
interesting patterns?”, refers to the completeness of a data mining
algorithm. It is often unrealistic and inefficient for data mining
systems to generate all possible patterns. Instead, user-provided
constraints and interestingness measures should be used to focus the
search. A data mining algorithm is complete if it mines all interesting
patterns.
Finally, the third question, “Can a data mining system generate
only interesting patterns?”, is an optimization problem in data
mining. It is highly desirable for data mining systems to generate only
interesting patterns. An interesting pattern represents knowledge.

Classification of Data Mining systems


Data mining is considered an interdisciplinary field. It draws on a
set of various disciplines such as statistics, database systems, machine
learning, visualization, and information science. Classifying
data mining systems helps users understand the systems and match
their requirements to such systems.
Data mining discovers patterns and extracts useful information from
large datasets. Organizations need to analyze and interpret data using
data mining systems as data grows rapidly. With an exponential
increase in data, active data analysis is necessary to make sense of it
all.
Data mining (DM) systems can be classified based on various factors.

 Classification based on Types of Data Mined


 Classification based on Type of knowledge Mined
 Classification based on Type of Technique Utilized
 Classification based on Application Domain
Classification based on Types of Data Mined

A data mining system can be classified based on the type of data
mined, the data model used, or the application of the data.
For example: relational databases, transactional databases,
multimedia databases, textual data, the World Wide Web (WWW),
etc.

Classification based on Type of knowledge Mined

We can classify a data mining system according to the kind of
knowledge mined. That is, the data mining system is classified based
on functionalities such as

 Association Analysis
 Classification
 Prediction
 Cluster Analysis
 Characterization
 Discrimination
Classification based on Type of Technique Utilized

We can classify a data mining system according to the kind of


techniques used. We can describe these techniques according to the
degree of user interaction involved or the methods of analysis
employed.
Data Mining systems use various techniques, including Statistics,
Machine Learning, Database Systems, Information retrieval,
Visualization, and pattern recognition.

Classification based on Application Domain

We can classify a data mining system according to the application
domain it is adapted for. These applications are as follows:

 Finance
 Telecommunications
 E-Commerce
 Medical Sector
 Stock Markets

Data mining Task primitives


A data mining task can be specified in the form of a data mining
query, which is input to the data mining system. A data mining query
is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining
system during the mining process to discover interesting patterns.
Here is the list of Data Mining Task Primitives

 Set of task relevant data to be mined.


 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This specifies the portions of the database or the set of data in which
the user is interested.
This portion includes the following

 Database Attributes
 Data Warehouse dimensions of interest
For example, suppose that you are a manager of AllElectronics in
charge of sales in the United States and Canada, and you would like to
study the buying trends of customers in Canada. Rather than mining
the entire database, you can specify just the portion of the data that
matters; the attributes involved are referred to as relevant attributes.

Kind of knowledge to be mined

This specifies the data mining functions to be performed, such as

 Characterization & Discrimination
 Association
 Classification
 Clustering
 Prediction
 Outlier analysis
For instance, if studying the buying habits of customers in Canada,
you may choose to mine associations between customer profiles and
the items that these customers like to buy.
Background knowledge to be used in discovery process

Users can specify background knowledge, or knowledge about the
domain to be mined. This knowledge is useful for guiding the
knowledge discovery process and for evaluating the patterns found;
it includes user beliefs about relationships in the data.
There are several kinds of background knowledge. Concept
hierarchies are a popular form of background knowledge, which allow
data to be mined at multiple levels of abstraction.
Example:
A concept hierarchy for the attribute (or dimension) age arranges its
values into levels of abstraction. The root node represents the most
general abstraction level, denoted as all.
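A sketch of such a hierarchy (reconstructed in text here, since the original figure is not reproduced; the level names are typical examples):

all
├── youth
├── middle_aged
└── senior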

Interestingness measures and thresholds for pattern evaluation

Interestingness measures are used to separate interesting
patterns from uninteresting ones. They may be used to guide
the mining process or, after discovery, to evaluate the discovered
patterns. Different kinds of knowledge may have different
interestingness measures.
For example, interestingness measures for association rules include
support and confidence.
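For an association rule A ⇒ B, these are commonly defined as:

support(A ⇒ B) = P(A ∪ B), the fraction of transactions containing both A and B
confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A)

A rule is usually reported only if it meets user-specified minimum support and minimum confidence thresholds.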
Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be
displayed. Users can choose from different forms of knowledge
presentation, such as

 rules, tables, reports, charts, graphs, decision trees, and cubes.

Integration of Data mining system with a Data warehouse


A data mining system is integrated with a database or data
warehouse system so that it can do its tasks effectively. A
data mining system operates in an environment where it needs
to communicate with other data systems, such as a database or
data warehouse system.
There are different possible integration (coupling) schemes, as follows:

 No Coupling
 Loose Coupling
 Semi-Tight Coupling
 Tight Coupling
No Coupling
No coupling means that a Data Mining system will not utilize any
function of a Data Base or Data Warehouse system.
It may fetch data from a particular source (such as a file system),
process data using some data mining algorithms, and then store the
mining results in another file.

Drawbacks of No Coupling

 First, without using a Database/Data Warehouse system, a Data


Mining system may spend a substantial amount of time finding,
collecting, cleaning, and transforming data.
 Second, there are many tested, scalable algorithms and data
structures implemented in database and data warehouse
systems that such a data mining system cannot take advantage of.

Loose Coupling
In this Loose coupling, the data mining system uses some facilities /
services of a database or data warehouse system. The data is fetched
from a data repository managed by these (DB/DW) systems.
Data mining approaches are used to process the data and then the
processed data is saved either in a file or in a designated area in a
database or data warehouse.
Loose coupling is better than no coupling because it can fetch any
portion of data stored in Databases or Data Warehouses by using
query processing, indexing, and other system facilities.
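A rough sketch of loose coupling in Python: the database (here an in-memory sqlite3 instance with invented sales data) is used only to fetch the relevant portion of the data, and the mining itself happens outside it:

import sqlite3
from sklearn.cluster import KMeans

# The DB system fetches the relevant portion of data via a query...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("CA", 120.0, 3), ("CA", 80.0, 2),
                  ("CA", 310.0, 6), ("US", 300.0, 7)])
rows = conn.execute(
    "SELECT amount, qty FROM sales WHERE region = 'CA'").fetchall()

# ...and the mining algorithm runs outside the DBMS, storing results elsewhere.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(rows)
print(labels)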

Drawbacks of Loose Coupling

 It is difficult for loose coupling to achieve high scalability and


good performance with large data sets.

Semi-Tight Coupling
Semi-tight coupling means that besides linking a data mining system
to a database/data warehouse system, efficient implementations of a
few essential data mining primitives can be provided in the DB/DW
system. These primitives can include sorting, indexing, aggregation,
histogram analysis, multiway join, and precomputation of some
essential statistical measures, such as sum, count, max, min, and
standard deviation.

Advantage of Semi-Tight Coupling

 This Coupling will enhance the performance of Data Mining


systems

Tight Coupling
Tight coupling means that a data mining system is smoothly
integrated into the database/data warehouse system. The data
mining subsystem is treated as one functional component of the
information system. Data mining queries and functions are optimized
based on mining query analysis, data structures, indexing schemes,
and query processing methods of the DB or DW system.

Major issues in Data Mining

Data mining, the process of extracting knowledge from data, has
become increasingly important as the amount of data generated by
individuals, organizations, and machines has grown
exponentially. Data mining is not an easy task: the algorithms used
can get very complex, and data is not always available in one place; it
needs to be integrated from various heterogeneous data sources.
The above factors may lead to some issues in data mining. These
issues are mainly divided into three categories, which are given
below:

1. Mining Methodology and User Interaction


2. Performance Issues
3. Diverse Data Types Issues

Mining Methodology and User Interaction


It refers to the following kinds of issues

 Mining different kinds of knowledge in databases − Different


users may be interested in different kinds of knowledge.
Therefore, it is necessary for data mining to cover a broad range
of knowledge discovery task.
 Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns,
providing and refining data mining requests based on the
returned results.
 Data mining query languages and ad hoc data mining − Data
Mining Query language that allows the user to describe ad hoc
mining tasks, should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once
patterns are discovered, they need to be expressed in high-level
languages and visual representations. These representations
should be easily understandable.
 Handling noisy or incomplete data − The data cleaning
methods are required to handle the noise and incomplete objects
while mining the data regularities. If the data cleaning methods
are not there then the accuracy of the discovered patterns will be
poor.
 Pattern evaluation − The patterns discovered may be
uninteresting because they represent common knowledge or
lack novelty, so effective measures are needed to evaluate
pattern interestingness.

Performance Issues
There can be performance-related issues such as follows

 Efficiency and scalability of data mining algorithms − In
order to effectively extract information from the huge amounts of
data in databases, data mining algorithms must be efficient and
scalable.
 Parallel, distributed, and incremental mining algorithms −
Factors such as the huge size of databases, the wide distribution of
data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are then
processed in parallel. Incremental algorithms update the mined
knowledge as the database changes, without mining the entire
data again from scratch.

Diverse Data Types Issues

 Handling of relational and complex types of data − The


database may contain complex data objects, multimedia data
objects, spatial data, temporal data etc. It is not possible for one
system to mine all these kinds of data.
 Mining information from heterogeneous databases and
global information systems − The data is available from different
data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds further challenges to data mining.

Data Preprocessing

What is Data Preprocessing?


Data preprocessing is a crucial step in data mining. It involves
transforming raw data into a clean, structured, and suitable format for
mining. Proper data preprocessing helps improve the quality of the
data, enhances the performance of algorithms, and ensures more
accurate and reliable results.

Why Preprocess the Data?


In the real world, many databases and data warehouses
contain noisy, missing, and inconsistent data due to their huge size.
Low-quality data leads to low-quality data mining results.
Noisy: Containing errors or outliers. E.g., Salary = “-10”
Noisy data may come from

 Human or computer error at data entry.


 Errors in data transmission.
Missing: lacking certain attribute values or containing only aggregate
data. E.g., Occupation = “”
Missing (incomplete) data may come from

 “Not applicable” data value when collected.


 Human/hardware/software problems.
Inconsistent: Data inconsistency means that different versions of
the same data appear in different places. For example, a ZIP code may
be saved in one table in the format 1234-567, while in another
table it may be represented as 1234567.
Inconsistent data may come from

 Errors in data entry.


 Merging data from different sources with varying formats.
 Differences in the data collection process.
Data preprocessing is used to improve the quality of the data and of
the mining results. The goal of data preprocessing is to enhance the
accuracy, efficiency, and reliability of data mining algorithms.

Major Tasks in Data Preprocessing


Data preprocessing is an essential step in the knowledge discovery
process, because quality decisions must be based on quality data.
Data preprocessing involves data cleaning, data integration, data
reduction, and data transformation.
Steps in Data Preprocessing
1. Data Cleaning

Data cleaning is a process that "cleans" the data by filling in
missing values, smoothing noisy data, identifying or removing
outliers, and resolving inconsistencies in the data.
If users believe the data are dirty, they are unlikely to trust the results
of any data mining that has been applied to it.
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct
inconsistencies in the data.

Missing Values

Imagine that you need to analyze AllElectronics sales and customer
data. You note that many tuples have no recorded value for several
attributes such as customer income. How can you go about filling in
the missing values for this attribute? There are several methods to fill
in the missing values.
Those are,

a. Ignore the tuple: This is usually done when the class label is
missing (in classification tasks). This method is not very effective
unless the tuple contains several attributes with missing values.
b. Fill in the missing value manually: In general, this approach is
time consuming and may not be feasible given a large data set
with many missing values.
c. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
“Unknown” or −∞.
d. Use the attribute mean or median to fill in the missing
value: Replace all missing values of the attribute by the mean or
median of that attribute's values (a sketch follows below).
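A minimal pandas sketch of method (d), filling a missing income with the attribute mean (the table is hypothetical):

import pandas as pd

customers = pd.DataFrame({"name":   ["Smith", "Lee", "Patel"],
                          "income": [52000.0, None, 38000.0]})

# Method (d): replace the missing income with the attribute mean.
customers["income"] = customers["income"].fillna(customers["income"].mean())
print(customers)  # Lee's income becomes 45000.0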

Noisy Data
Noise is a random error or variance in a measured variable. Data
smoothing techniques are used to eliminate noise and extract the
useful patterns. The different techniques used for data smoothing are:
a. Binning: Binning methods smooth a sorted data value by
consulting its “neighbourhood,” that is, the values around it. The
sorted values are distributed into several “buckets,” or bins.
Because binning methods consult the neighbourhood of values,
they perform local smoothing.
There are three kinds of binning. They are:
o Smoothing by bin means: In this method, each value in a
bin is replaced by the mean value of the bin. For example,
the mean of the values 4, 8, and 15 in Bin 1 is 9, so each
original value in this bin is replaced by the value 9.
o Smoothing by bin medians: In this method, each value in a
bin is replaced by the median value of the bin. For
example, the median of the values 4, 8, and 15 in Bin 1 is
8, so each original value in this bin is replaced by the
value 8.
o Smoothing by bin boundaries: In this method, the
minimum and maximum values in each bin are identified
as the bin boundaries, and each bin value is then replaced by
the closest boundary value. For example, the middle value
8 in Bin 1 (4, 8, 15) is replaced by its nearest boundary,
i.e., 4.
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
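This worked example can be reproduced with a short Python sketch (smoothing by means and by boundaries):

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

# Partition into equal-frequency bins of size 3.
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]

# Smoothing by bin boundaries: replace each value with the closest boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]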

b. Regression: Data smoothing can also be done by regression, a
technique used to predict numeric values in a given data
set. It analyses the relationship between a target variable
(dependent) and its predictor variables (independent).
o Regression is a form of supervised machine learning
that tries to predict a continuous-valued
attribute.
o Regression can be done in two ways: linear regression involves
finding the “best” line to fit two attributes (or variables) so
that one attribute can be used to predict the other; multiple
linear regression is an extension of linear regression,
where more than two attributes are involved and the data
are fit to a multidimensional surface.
c. Clustering: Clustering helps in identifying outliers. Similar
values are organized into clusters, and those values which fall
outside the clusters are known as outliers.

2. Data Integration
Data integration is the process of combining data from multiple
sources into a single, unified view. This process involves identifying
and accessing the different data sources, mapping the data to a
common format. Different data sources may include multiple data
cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze
data that is spread across multiple systems or platforms, in order to
gain a more complete and accurate understanding of the data.
Data integration strategy is typically described using a triple (G, S, M)
approach, where G denotes the global schema, S denotes the schema
of the heterogeneous data sources, and M represents the mapping
between the queries of the source and global schema.

Example: To understand the (G, S, M) approach, let us consider a


data integration scenario that aims to combine employee data from
two different HR databases, database A and database B. The global
schema (G) would define the unified view of employee data,
including attributes like EmployeeID, Name, Department, and Salary.
In the schema of heterogeneous sources, database A (S1) might have
attributes like EmpID, FullName, Dept, and Pay, while database B's
schema (S2) might have attributes like ID, EmployeeName,
DepartmentName, and Wage. The mappings (M) would then define
how the attributes in S1 and S2 map to the attributes in G, allowing
for the integration of employee data from both systems into the global
schema.
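A rough Python sketch of the (G, S, M) idea using the employee example above: the mapping dictionaries play the role of M, renaming each source schema into the global schema G (only three of the attributes are shown):

import pandas as pd

# Source schemas S1 and S2 with heterogeneous attribute names (hypothetical rows).
db_a = pd.DataFrame({"EmpID": [1], "FullName": ["Smith"], "Pay": [52000]})
db_b = pd.DataFrame({"ID": [2], "EmployeeName": ["Lee"], "Wage": [38000]})

# M: mappings from each source schema to the global schema G.
m1 = {"EmpID": "EmployeeID", "FullName": "Name", "Pay": "Salary"}
m2 = {"ID": "EmployeeID", "EmployeeName": "Name", "Wage": "Salary"}

# Apply the mappings and integrate both sources under G.
g = pd.concat([db_a.rename(columns=m1), db_b.rename(columns=m2)],
              ignore_index=True)
print(g)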

Issues in Data Integration

There are several issues that can arise when integrating data from
multiple sources, including:

a. Data Quality: Data from different sources may have varying
levels of accuracy, completeness, and consistency, which can
lead to data quality issues in the integrated data.
b. Data Semantics: Integrating data from different sources can be
challenging because the same data element may have different
meanings across sources.
c. Data Heterogeneity: Different sources may use different data
formats, structures, or schemas, making it difficult to combine
and analyze the data.

3. Data Reduction
Imagine that you have selected data from the AllElectronics data
warehouse for analysis. The data set will likely be huge! Complex data
analysis and mining on huge amounts of data can take a long time,
making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data. That is, mining on
the reduced data set should be more efficient yet produce the same (or
almost the same) analytical results.
In simple words, data reduction is a technique used in data mining to
reduce the size of a dataset while still preserving the most important
information. This can be beneficial in situations where the dataset is
too large to be processed efficiently, or where the dataset contains a
large amount of irrelevant or redundant information.

There are several different data reduction techniques that can be used
in data mining, including:

a. Data sampling: This technique involves selecting a subset of the
data to work with, rather than using the entire dataset. This can
be useful for reducing the size of a dataset while still preserving
the overall trends and patterns in the data (a sketch follows this
list).
b. Dimensionality reduction: This technique involves reducing the
number of features in the dataset, either by removing features
that are not relevant or by combining multiple features into a
single feature.
c. Data compression: This is the process of altering, encoding, or
transforming the structure of data in order to save space. By
reducing duplication and encoding data in binary form, data
compression creates a compact representation of information.
It involves techniques such as lossy or lossless
compression to reduce the size of a dataset.
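A minimal pandas sketch of technique (a), data sampling (the dataset is invented):

import pandas as pd

data = pd.DataFrame({"amount": range(1000)})

# Data sampling: keep a 10% random subset instead of the full dataset.
sample = data.sample(frac=0.10, random_state=0)
print(len(sample))  # 100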

4. Data Transformation
Data transformation in data mining refers to the process of converting
raw data into a format that is suitable for analysis and modelling. The
goal of data transformation is to prepare the data for data mining so
that it can be used to extract useful insights and knowledge.

Data transformation typically involves several steps, including:

1. Smoothing: A process used to remove noise from the
dataset, using techniques including binning, regression, and
clustering.
2. Attribute construction (or feature construction): In this, new
attributes are constructed and added from the given set of
attributes to help the mining process.
3. Aggregation: In this, summary or aggregation operations are
applied to the data. For example, daily sales data may be
aggregated to compute monthly and annual totals.
4. Data normalization: This process involves rescaling all data
variables into a small range, such as −1.0 to 1.0, or 0.0 to 1.0 (a
sketch follows this list).
5. Generalization: It converts low-level data attributes to high-
level data attributes using a concept hierarchy. For example, age
values in numerical form (e.g., 22) are converted into categorical
values (young, old).
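A tiny sketch of min-max normalization into the range [0.0, 1.0] (step 4; the age values are invented):

ages = [22, 35, 47, 60]

# Min-max normalization: rescale every value into the range [0.0, 1.0].
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]
print(normalized)  # [0.0, ~0.34, ~0.66, 1.0]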
