Data Mining and Data Warehousing
... vast amounts of heterogeneous data ... integration process easier to learn and use, faster to implement and maintain, ... designed to store data according to the mathematical set theory as expressed in the relational paradigm. In many ... method for cataloging data is not the most efficient method for storing and ...

... other lists. When dealing with data that ... with limits in row length or table size, can, in some cases, represent such ... products to support data outside the ... such data remains relational and, for the most part, SQL based. This fact ... products unnecessarily difficult to set up and manage, and too inefficient, for ...

Data mining tools predict future trends ... analyses offered by data mining move beyond the analyses of past events ... questions that traditionally were too ...

... applications illustrate its relevance to ... deliver the value of data mining to end users. The past two decades have seen a ... of data has taken place at an explosive ...
The phases depicted start with the raw data and finish with the extracted knowledge, which is acquired as a result of the following stages:

Selection: Selecting or segmenting the data according to some criteria, e.g. all those people who own a car. In this way subsets of the data can be determined.

Preprocessing: This is the data cleansing stage, where information that is deemed unnecessary and may slow down queries is removed; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, since data drawn from several sources may be formatted inconsistently, e.g. sex may be recorded as f or m and also as 1 or 0.

Transformation: The data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made usable and navigable.

Data mining: This stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.

Applications of Data mining

Data mining has many and varied fields of application, some of which are listed below.

1. Retail/Marketing
   - Identify buying patterns from customers
   - Find associations among customer demographic characteristics
   - Market basket analysis
2. Banking
   - Detect patterns of fraudulent credit card use
   - Identify 'loyal' customers
   - Predict customers likely to change their credit card affiliation
   - Determine credit card spending by customer groups
3. Insurance and Health Care
   - Claims analysis, i.e. which medical procedures are claimed together

... more attributes that denote the class of a tuple and these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.

2. Associations:
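The kind of association behind market basket analysis, or behind finding which medical procedures are claimed together, can be illustrated with a minimal sketch. It is not taken from the source: the baskets and item names are invented, and support and confidence are used here as the usual textbook measures of how strong a co-occurrence is. The helper names (`pair_counts`, `support`, `confidence`) are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; the item names are illustrative only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
]


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every pair a single canonical (a, b) form.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of both item sets divided by support of the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


counts = pair_counts(baskets)
bread_milk_support = support({"bread", "milk"}, baskets)
bread_to_milk = confidence({"bread"}, {"milk"}, baskets)
print(counts[("bread", "milk")], bread_milk_support, bread_to_milk)
```

With these four invented baskets, bread and milk appear together in two of them, so the support of the pair is 0.5, and two of the three baskets containing bread also contain milk, giving a confidence of 2/3 for "bread implies milk". A real tool would search over many item sets rather than checking one pair by hand.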
... far reaching, then what you gain by ... win before you even fight. When your strategic thinking is shallow and near-sighted, then what you gain by your ...

... to meet the needs of transaction ... defined as any centralized data repository which can be queried for business benefit but this will be more ...

The first phase in data ... preserve the security and integrity of ... while giving you access to the broadest possible base of data. The resulting database or data warehouse may consume hundreds of gigabytes - or even terabytes - of disk space. What is required, then, are efficient techniques for storing and retrieving massive amounts of information. Increasingly, large organizations have found that only parallel processing systems offer sufficient bandwidth.

The data warehouse thus retrieves data from a variety of heterogeneous operational databases. The data is then transformed and delivered to the data warehouse/store based on a selected model (or mapping definition). The data transformation and movement processes are executed whenever an update to the warehouse data is required, so there should be some form of automation to manage and ...

... end-user finds and understands the data ... should at the very least contain; The ... The algorithm used for summarization; The mapping from the operational environment to the data warehouse.

Data cleansing is an important aspect of creating an efficient data warehouse in that it is the removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times. The cleansing stage has to be as dynamic as possible to accommodate all types of queries, even those which may require low-level information. Data should be extracted from production sources at regular intervals and pooled centrally, but the cleansing process has to remove duplication and reconcile differences between various styles of data collection.
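The two cleansing tasks named here, removing duplication and reconciling different styles of data collection, can be sketched as follows. This is a minimal illustration under stated assumptions, not the process the text describes in detail: the record layout, the field names, the sample values, and the choice that 1 means f and 0 means m are all invented for the example.

```python
# Assumed mapping from each source's coding to one warehouse coding.
# The text only says sex may be recorded as f/m or 1/0; which digit
# corresponds to which letter is an assumption made for this sketch.
SEX_CODES = {"f": "f", "m": "m", "1": "f", "0": "m"}

# Hypothetical records extracted from two operational sources.
source_a = [
    {"customer_id": 101, "sex": "F", "city": "Leeds"},
    {"customer_id": 102, "sex": "m", "city": "York"},
]
source_b = [
    {"customer_id": 102, "sex": 0, "city": "York"},
    {"customer_id": 103, "sex": 1, "city": "Hull"},
]


def normalise(record):
    """Return a copy of record with the sex field in a single coding."""
    raw = str(record["sex"]).strip().lower()
    if raw not in SEX_CODES:
        raise ValueError(f"unrecognised sex code: {record['sex']!r}")
    return {**record, "sex": SEX_CODES[raw]}


def cleanse(*sources):
    """Pool all sources, reconcile codings, keep one record per customer_id."""
    pooled = {}
    for source in sources:
        for record in source:
            clean = normalise(record)
            # First occurrence wins; a real process needs a merge policy.
            pooled.setdefault(clean["customer_id"], clean)
    return list(pooled.values())


warehouse_rows = cleanse(source_a, source_b)
for row in warehouse_rows:
    print(row)
```

Customer 102 appears in both sources with different codings for the same value; after normalisation the two records agree, and only one is kept. Keeping the first occurrence is the simplest possible rule and is used here only to keep the sketch short.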
The current detail data is central in importance as it:
- Reflects the most recent happenings, which are usually the most interesting;
- Is voluminous, as it is stored at the lowest level of granularity;
- Is almost always stored on disk storage, which is fast to access but expensive and complex to manage.

... measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business.

Load Processing: Many steps must be taken to load new or updated data into the data warehouse, including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and
Uses of Data Warehousing metadata update. These steps