Data Mining and Data Warehousing

Data mining and data warehousing techniques can help companies gain valuable insights from large databases. The presented paper discusses how data mining can extract hidden predictive patterns to help businesses make knowledge-driven decisions. It provides an introduction to basic data mining technologies and how data warehouse architectures can integrate these tools to deliver value. As data storage has exponentially increased, data mining is necessary to analyze massive databases and answer important questions for various applications like marketing, fraud detection, and product development.

Uploaded by

Peter Asane
Copyright
© All Rights Reserved

PAPER PRESENTATION

ON

DATA MINING & DATA WAREHOUSE


Presented by:
P.v.surendranathReddy, 4th B.TECH (IT), CELL NO: 9866857915
Y.AnilkumarReddy, 4th B.TECH (IT), CELL NO: 9703779377

VAAGDEVI INSTITUTE OF TECHNOLOGY AND SCIENCES


PEDDASETTIPALLI (VILL), PRODDATUR,
KADAPA (DT), ANDHRA PRADESH.
ABSTRACT:

In today's fast-paced, information-based economy, companies must be able to integrate vast amounts of heterogeneous data and applications from disparate sources in order to support strategic IT initiatives such as Business Intelligence, Business Process Management, Business Process Reengineering, Business Activity Monitoring and Business Performance Management. Since its inception, it has continued to build on its unique software architecture to make the integration process easier to learn and use, faster to implement and maintain, and operate at the best performance possible; in other words, Simply Faster Integration.

Relational database management systems (RDBMSs) are designed to store data according to the most efficient method of data cataloging, which is that defined by mathematical set theory as expressed in the relational paradigm. In many cases, however, the most efficient method for cataloging data is not the most efficient method for storing and retrieving such data. Where relational databases do well is where the data is most appropriately managed as flat lists having simple data types, involving few associations with data in other lists. When dealing with data that must be kept in complex interdependent structures, or when data must be rapidly retrieved by following paths of associations rather than by simply walking down simple lists, the relational database begins to show characteristics such as multiple-index management and traversal and complex normalized schema structures. These impediments, along with limits in row length or table size, can, in some cases, represent such profound encumbrances that an RDBMS must be regarded as impractical for certain data management tasks. Although leading RDBMS vendors have been introducing features that enable their products to support data outside the relational paradigm, the fundamental means of management and access of such data remains relational and, for the most part, SQL based. This fact will continue to make RDBMS products unnecessarily difficult to set up and manage, and too inefficient, for some kinds of databases.


An Introduction to Data Mining

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today’s business environment, together with a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users. The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate.

Figure 1 shows the data explosion and the growing base of data.

Data storage became easier as large amounts of computing power became available at low cost; that is, the falling cost of processing power and storage made data cheap.

An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on.

The term data mining has been stretched beyond its limits to apply to any form of data analysis. Some of the numerous definitions of Data Mining, or Knowledge Discovery in Databases, are:

Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.

Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.
The following diagram summarizes some of the stages/processes identified in data mining and knowledge discovery.

The phases depicted start with the raw data and finish with the extracted knowledge, which is acquired as a result of the following stages:

• Selection: Selecting or segmenting the data according to some criteria, e.g. all those people who own a car; in this way subsets of the data can be determined.
• Preprocessing: This is the data cleansing stage, where information that is deemed unnecessary and may slow down queries is removed; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, as there is a possibility of inconsistent formats because the data is drawn from several sources, e.g. sex may be recorded as f or m and also as 1 or 0.
• Transformation: The data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made useable and navigable.
• Data mining: This stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.

Applications of Data Mining

Data mining has many and varied fields of application, some of which are listed below.

1. Retail/Marketing
• Identify buying patterns from customers
• Find associations among customer demographic characteristics
• Market basket analysis

2. Banking
• Detect patterns of fraudulent credit card use
• Identify 'loyal' customers
• Predict customers likely to change their credit card affiliation
• Determine credit card spending by customer groups

3. Insurance and Health Care
• Claims analysis, i.e. which medical procedures are claimed together
• Predict which customers will buy new policies
• Identify behaviour patterns of risky customers

4. Medicine
• Characterise patient behaviour to predict office visits
• Identify successful medical therapies for different illnesses

Data Mining Functions

Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are:

1. Classification

Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.

2. Associations

Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.

Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users’ ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. The data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.

Data Warehousing
Introduction

When your strategy is deep and far reaching, then what you gain by your calculations is much, so you can win before you even fight. When your strategic thinking is shallow and near-sighted, then what you gain by your calculations is little, so you lose before you do battle. Much strategy prevails over little strategy, so those with no strategy can only be defeated. So it is said that victorious warriors win first and then go to war, while defeated warriors go to war first and then seek to win.

It is obvious to anyone who culls through the voluminous information technology (I/T) literature, attends industry seminars, user group meetings or expositions, reads the ever accelerating new product announcements of I/T vendors, or listens to the advice of industry gurus and analysts, that there are four subjects that overwhelmingly dominate I/T industry attention as we move into the late 1990s:

Why we need Data Warehousing

Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. A data warehouse is a relational database management system (RDBMS) designed specifically to meet the needs of decision support rather than transaction processing systems. It can be loosely defined as any centralized data repository which can be queried for business benefit, but this will be more clearly defined later.
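Once records are pooled in a central repository of this kind, a mining function can be run directly against them. As a rough illustration, the confidence factor described under Associations can be computed as the share of records containing the items on one side of a rule that also contain the items on the other side. The sketch below is only an assumption-laden example: the function name `confidence` and the sample records are hypothetical and do not come from any particular data mining product.

```python
def confidence(records, antecedent, consequent):
    """Share of records containing `antecedent` that also contain `consequent`."""
    antecedent = set(antecedent)
    consequent = set(consequent)
    # Records that contain every item on the left-hand side of the rule.
    matching = [r for r in records if antecedent <= set(r)]
    if not matching:
        return 0.0
    # Of those, count the records that also contain the right-hand side.
    hits = sum(1 for r in matching if consequent <= set(r))
    return hits / len(matching)


# Hypothetical records; each record is the set of items it contains.
RECORDS = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C"},
    {"A", "D"},
    {"B", "C", "E"},
]

if __name__ == "__main__":
    value = confidence(RECORDS, {"A", "B", "C"}, {"D", "E"})
    print(f"Confidence of A, B, C -> D, E: {value:.0%}")
```

With these sample records, four contain A, B and C, and three of those also contain D and E, so the rule has a confidence factor of 75%. A real tool would also have to search for candidate rules, which this sketch does not attempt.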
Data warehousing is a new, powerful technique making it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. As well as integrating data throughout an enterprise, regardless of location, format, or communication requirements, it is possible to incorporate additional or expert information. It is the logical link between what the managers see in their decision support EIS applications and the company's operational activities.

In other words, the data warehouse provides data that is already transformed and summarized, therefore making it an appropriate environment for more efficient DSS and EIS applications.

Characteristics of A Data Warehouse

According to Bill Inmon, author of Building the Data Warehouse and the guru who is widely considered to be the originator of the data warehousing concept, there are generally four characteristics that describe a data warehouse:

• Subject-Oriented: Data are organized according to subject instead of application, e.g. an insurance company using a data warehouse would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.). The data organized by subject contain only the information necessary for decision support processing.
• Integrated: When data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application gender might be coded as "m" and "f", in another by 0 and 1. When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention, e.g. gender data is transformed to "m" and "f".
• Time-Variant: The data warehouse contains a place for storing data that are five to 10 years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.
• Non-Volatile: Data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed.

Processes In Data Warehousing

The first phase in data warehousing is to "insulate" your current operational information, i.e. to preserve the security and integrity of mission-critical OLTP applications, while giving you access to the broadest possible base of data. The resulting database or data warehouse may consume hundreds of gigabytes, or even terabytes, of disk space; what is required then are efficient techniques for storing and retrieving massive amounts of information. Increasingly, large organizations have found that only parallel processing systems offer sufficient bandwidth.

The data warehouse thus retrieves data from a variety of heterogeneous operational databases. The data is then transformed and delivered to the data warehouse/store based on a selected model (or mapping definition). The data transformation and movement processes are executed whenever an update to the warehouse data is required, so there should be some form of automation to manage and execute these functions. The information that describes the model and definition of the source data elements is called "metadata". The metadata is the means by which the end-user finds and understands the data in the warehouse and is an important part of the warehouse. The metadata should at the very least contain:

• The structure of the data;
• The algorithm used for summarization;
• The mapping from the operational environment to the data warehouse.

Data cleansing is an important aspect of creating an efficient data warehouse in that it is the removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times. The cleansing stage has to be as dynamic as possible to accommodate all types of queries, even those which may require low-level information. Data should be extracted from production sources at regular intervals and pooled centrally, but the cleansing process has to remove duplication and reconcile differences between various styles of data collection.
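The cleansing step can be illustrated with a minimal sketch: records pooled from two operational sources are brought onto one coding convention and duplicates are removed. Everything specific here is an assumption made for illustration, including the field names, the sample records, and in particular the mapping of 1 to "m" and 0 to "f", since the text does not say which number stands for which value.

```python
# Hypothetical operational records pooled from two source systems.
RAW_RECORDS = [
    {"customer_id": 101, "gender": "m", "source": "billing"},
    {"customer_id": 102, "gender": "F", "source": "billing"},
    {"customer_id": 101, "gender": 1, "source": "claims"},
    {"customer_id": 103, "gender": 0, "source": "claims"},
]

# Assumed mapping onto the warehouse convention ("m"/"f").
# Which number means which gender is a guess made for this example.
GENDER_CODES = {"m": "m", "f": "f", 1: "m", 0: "f"}


def normalize_gender(value):
    """Map a source-specific gender code onto the warehouse convention."""
    key = value.lower() if isinstance(value, str) else value
    return GENDER_CODES[key]


def cleanse(records):
    """Apply one coding convention, then drop duplicate customers."""
    seen = set()
    cleaned = []
    for record in records:
        row = {
            "customer_id": record["customer_id"],
            "gender": normalize_gender(record["gender"]),
        }
        if row["customer_id"] in seen:
            continue  # same customer already pooled from another source
        seen.add(row["customer_id"])
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    for row in cleanse(RAW_RECORDS):
        print(row)
```

Keeping the first record seen for each customer is the simplest possible duplicate rule; a real warehouse load would more likely reconcile conflicting values between sources than discard them.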
The current detail data is central in importance as it:

• Reflects the most recent happenings, which are usually the most interesting;
• Is voluminous, as it is stored at the lowest level of granularity;
• Is almost always stored on disk storage, which is fast to access but expensive and complex to manage.

Uses of Data Warehousing

• Retail: Analysis of scanner check-out data; tracking, analysis, and tuning of sales promotions; and so on.
• Telecommunications: Analysis of call volumes, equipment, sales, customers, profitability, costs, and inventory.

Criteria for a Data Warehouse

The criteria for data warehouse RDBMS are as follows:

• Load Performance: Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business.
• Load Processing: Many steps must be taken to load new or updated data into the data warehouse, including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. These steps must be executed as a single, seamless unit of work.
• Data Quality Management: The shift to fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size. While loading and preparation are necessary steps, they are not sufficient. Query throughput is the measure of success for a data warehouse application. As more questions are answered, analysts are catalysed to ask more creative and insightful questions.
• Query Performance: Fact-based management and ad-hoc analysis must not be slowed or inhibited by the performance of the data warehouse RDBMS; large, complex queries for key business operations must complete in seconds, not days.

CONCLUSION

Our strategic analysis of data warehousing is as follows: Strategy is about, and only about, building advantage. The business need to build, compound, and sustain advantage is the most fundamental and dominant business need, and it is insatiable. Advantage is built through deep and far-reaching strategic thinking. The strategic ideas that support data warehousing as a strategic initiative are learning, maneuverability, prescience, and foreknowledge. Data warehousing meets the fundamental business need to compete in a superior manner across the elementary strategic dimension of time. Data warehousing is a rare instance of a rising tide strategy. A rising tide strategy occurs when an action yields tremendous leverage. Data warehousing raises the ability of all employees to serve their customers and out-think their competitors.

REFERENCES

1. Ralph Kimball, The Data Warehouse Toolkit (New York, NY: John Wiley & Sons, Inc., 1996), pp. 15-16
2. W. H. Inmon, Claudia Imhoff, and Ryan Sousa, Corporate Information Factory (New York, NY: John Wiley & Sons, Inc., 1998), pp. 87-100
3. Len Silverston, W. H. Inmon, and Kent Graziano, The Data Model Resource Book (New York, NY: John Wiley & Sons, Inc., 1997)
4. Douglas Hackney, Understanding and Implementing Successful Data Marts (Reading, MA: Addison-Wesley, 1997), pp. 52-54, 183-84, 257, 307-309
5. White Paper, available at http://www.informatica.com.
6. Hackney, op. cit.
7. Informatica, op. cit.
