1 DM intro

The document discusses the importance of data warehousing and data mining in today's dynamic business environment, highlighting the need for organizations to extract valuable knowledge from vast amounts of data. It explains the concepts of data, information, knowledge, and wisdom, and emphasizes the role of data mining in automating the discovery of patterns and relationships in data for decision support. Additionally, it outlines the differences between data processing and data mining, the significance of data warehouses, and the challenges faced in data mining implementation.

Uploaded by

getanehcourse

DATA WAREHOUSING

AND DATA MINING


Why the focus shifts to
“Knowledge”
• We are living in dynamic/complex environment;
an environment which is characterized by:
– Competitors
• very strong competition
– Market
• volatility of the market
• The business landscape is changing rapidly and non-linearly
– Customers/Consumers
• customers have reached the level of prosumers
• Prosumers are more educated consumers, who provide feedback
regarding the products/services they need
– Professionals
• The high turnover rate of professionals
• Diminishing Individual Experience
Data, Information, Knowledge, Wisdom & Truth
[Figure: hierarchy running from FACT to TRUTH with increasing depth of meaning, moving from explicit to tacit]
• DATA: dispersed elements
• INFORMATION: patterned data
• KNOWLEDGE: validated platform for action
• WISDOM: implicitly knowing how to generate, access and integrate knowledge
Data/Information Overload
• Data is being produced (generated & collected) at
an alarming rate because of:
– The computerization of business & scientific transactions
– Advances in data collection tools, ranging from scanned
texts & image platforms to satellite remote sensing
systems
– Popular use of WWW as a global information system
• With the phenomenal rate of growth of data, users
expect more sophisticated, useful and valuable
information
– A marketing manager is no longer satisfied with a
simple listing of marketing contacts, but wants
detailed information about customers’ past
purchasing behavior and predictions of future
purchases
Too much data & too little
knowledge
• There is a need to extract knowledge (useful
information) from the massive data.
– Competitive pressure is strong, and prediction requires useful
information
• Faced with enormous volumes of data, human analysts
with no special tools can no longer make sense of it.
– Data mining can automate the process of finding patterns
& relationships in raw data and the results can be utilized
for decision support. That is why data mining is used,
especially in science and business areas.
• If we know how to reveal valuable knowledge hidden
in raw data, data might be one of our most valuable
assets.
– Data mining is the tool that involves retrospective analysis
to extract diamonds of knowledge from historical data &
predict future outcomes.
What is data mining?
• Data Mining is a technology that uses various
techniques to discover hidden knowledge
from heterogeneous and distributed
historical data stored in large databases,
warehouses and other massive information
repositories so as to find patterns in data that are:
– valid: not only represent the current state, but also
hold on new data with some certainty
– novel: non-obvious to the system, so that they are
generated as new facts
– useful: should be possible to act on the item or
problem
– understandable: humans should be able to
interpret the pattern
Why DM Now?
• Four main reasons for DM now:
– The competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers’ (prosumers’) needs?
• How to manage the high turnover rate of
professionals?
– Massive data collection: data is produced at an alarming
rate & is being warehoused
– The computing power is available and is
affordable
– DM commercial products and machine learning
algorithms are available
Why DM Now: Massive data
collection
• Massive data collection: large
databases (data warehouses) are growing
at unprecedented rates to manage the
explosive growth in stored data.
• Examples of massive data sets
– Google: Order of 10 billion Web pages indexed
• 100’s of millions of site visitors per day
– MEDLINE text database: 17 million published
articles
– Retail transaction data: EBay, Amazon, Wal-Mart:
order of 100 million transactions per day
• Visa, MasterCard: similar or larger numbers
Why DM Now: Powerful computers
• Powerful computers: The computing power is available
and is also affordable
– The need for improved computational engines can now
be met in a cost-effective manner with parallel
multiprocessor computer technology.
• Technological Driving Factors
– Larger, cheaper memory (in hundred GBs, not in MBs)
• Moore’s law for magnetic disk density
“capacity doubles every 18 months”
• Storage cost per byte falling rapidly
– Faster, cheaper processors (in GHz, not in MHz)
• the CRAY of 15 years ago is now on your desk
– Success of Relational Databases and the World Wide
Web
• everybody is a “data owner”
Why DM Now: DM
algorithms
• Commercial products (for data mining) are
available
– Data mining algorithms have matured &
there are reliable tools that consistently
outperform older statistical methods.
– New ideas in machine learning/statistics
• Boosting, SVMs, decision trees, non-parametric
Bayes, text models, etc
– Existence of around 20-30 mining tool vendors
– Existence of many embedded products
• Fraud detection
• Customer relationship management
• Health care
• E-commerce applications
Example: Why Data Mining
• Customer relationship management:
– Which of my customers are likely to be the most
loyal, and which are most likely to leave for a
competitor?
• Credit ratings:
– Given a database of 100,000 names, which persons
are the least likely to default on their credit cards?
• Targeted marketing:
– Identify likely responders to sales promotions
• Fraud detection/Network intrusion detection
– Which types of transactions are likely to be
fraudulent, given the demographics and
transactional history of a particular customer?
Data Mining helps extract such useful
information
Database Processing vs. Data Mining
• Query
– Database processing: well defined; Structured Query Language (SQL)
– Data mining: poorly defined; no precise query language
– Comments: the data miner might not know exactly what he wants to see
• Data
– Database processing: operational data
– Data mining: non-operational data
– Comments: the data have been cleansed and modified to better support the mining process
• Output
– Database processing: precise; a subset of the database
– Data mining: not a subset of the database
– Comments: the output is some hidden useful patterns & knowledge in the data
Query Examples
• Database
– Find all credit applicants with first name ‘Alex’.
– Identify customers who have purchased more than
Birr 10,000 in the last month.
– Find all customers who have purchased Bread.
• Data Mining
– Find all credit applicants who have no credit risks.
(classification)
– Identify customers with similar buying habits.
(clustering)
– Find all items which are frequently purchased with
Bread. (association rules)
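The contrast between the two kinds of query can be sketched in Python. The baskets below are invented for illustration, and the names `bread_baskets`, `co_counts` and `rule_stats` are hypothetical. The database-style query filters on a precise condition and returns a subset of the data; the mining-style step instead counts which items co-occur with Bread and reports support and confidence, the usual measures for association rules.

```python
from collections import Counter

# Hypothetical market-basket data, invented for illustration.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
    {"Bread", "Butter", "Eggs"},
]

# Database-style query: a precise condition, the answer is a subset of the data.
bread_baskets = [t for t in transactions if "Bread" in t]

# Mining-style question: which items are frequently purchased with Bread?
co_counts = Counter(item for t in bread_baskets for item in t if item != "Bread")

def rule_stats(item):
    """Support and confidence of the rule Bread -> item."""
    both = co_counts[item]
    support = both / len(transactions)      # share of all baskets with both items
    confidence = both / len(bread_baskets)  # share of Bread baskets that also hold item
    return support, confidence

for item, _ in co_counts.most_common():
    s, c = rule_stats(item)
    print(f"Bread -> {item}: support={s:.2f}, confidence={c:.2f}")
```

The first result is already present in the database; the second is a pattern that no single record states explicitly.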
Data Mining works with Data
Warehouse
• Data Warehouse provides the
Enterprise with a memory
• Data Mining provides the
Enterprise with
intelligence
Data Warehouse
• Data warehouse
– A data warehouse is a relational database management
system responsible for the collection and storage of
data to support management decision making and
problem solving.
– It enables managers and other business professionals to
undertake data mining, online analytical processing,
market research and decision support
– Current evolution of Decision Support Systems (DSSs)
• Data mart
– A subset of a data warehouse for small and medium-
size businesses or departments within larger companies
Data Warehouse Stores
Heterogeneous Data
[Figure: data flow from heterogeneous sources into the data warehouse]
• Sources: relational databases, hierarchical databases, network databases, flat files, spreadsheets
• Data extraction process
• Data cleanup process
• Data warehouse
• Query and analysis tools
• End user access
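The flow from heterogeneous sources through extraction and cleanup into the warehouse can be sketched as follows. This is a minimal illustration under stated assumptions, not a real warehouse loader: the two sources, their field names and the functions `extract` and `clean` are all hypothetical.

```python
import csv
import io

# Two hypothetical sources with different formats (invented for illustration).
flat_file = "customer,amount\n alice ,120\nBOB,\ncarol,75\n"
relational_rows = [{"cust_name": "Dave", "total": 200.0}]

def extract():
    """Data extraction: pull records from each source into one (name, amount) shape."""
    for row in csv.DictReader(io.StringIO(flat_file)):
        yield row["customer"], row["amount"]
    for row in relational_rows:
        yield row["cust_name"], row["total"]

def clean(records):
    """Data cleanup: normalize names and drop records whose amount is missing."""
    for name, amount in records:
        if amount in ("", None):
            continue
        yield name.strip().title(), float(amount)

# The "warehouse" here is just a list holding the consolidated, cleaned records.
warehouse = list(clean(extract()))
print(warehouse)
```

Query and analysis tools would then read from `warehouse` rather than from the individual sources.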
Data Warehouse as part of Data
Mining
Data warehousing
• Data warehouse is an integrated, subject-
oriented, time-variant, non-volatile database
that provides support for decision making.
• Integrated: centralized, consolidated database
that integrates data derived from the entire
organization.
– Consolidates data from multiple & diverse sources
with diverse formats.
– Helps managers to better understand the company’s
operations.
• Subject-Oriented: Data warehouse contains
data organized by topics.
– E.g. Sales, marketing, finance, etc.
• E.g. Sales, marketing, finance, etc.
Data warehousing
• Time variant: In contrast to the operational
database that focuses on current transactions,
the data warehouse represents the flow of data
through time.
– Data warehouse contains data that reflect what
happened last week, last month, past five years,
and so on.
• Non volatile: Once data enter the data
warehouse, they are never removed, because
the data in the warehouse represent the
company’s entire history.
– Because data is added all the time, the
warehouse is growing.
Database & data warehouse:
Differences
• The data warehouse and operational
environments are separated. Data warehouse
receives its data from operational databases.
– Data warehouse environment is characterized by
read-only transactions to very large data sets.
– Operational environment is characterized by
numerous update transactions to a few data entities
at a time.
– Data warehouse contains historical data over a long
time horizon.
• Ultimately Information is created from data
warehouses. Such Information becomes the basis for
rational decision making.
• The data found in data warehouse is analyzed to
discover previously unknown data characteristics,
relationships, dependencies, or trends.
Data Processing Technologies
• OLAP – Online Analytical Processing
– refers to an advanced data analysis environment that
supports decision making.
– Access to multidimensional databases providing
managerially useful display techniques
• Data mining tools analyze the data, uncover
problems or opportunities hidden in the data
relationships.
• E.g.: Credit system : who are likely not to pay
their debts?
– Crime Database : Who are likely to commit what
kind of crime?
• OLAP provides top-down, query-driven analysis
– Data mining provides bottom-up, discovery-driven
analysis
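A top-down, query-driven OLAP operation can be sketched as a roll-up over a small multidimensional data set. The sales facts, the dimension names and the `roll_up` helper below are hypothetical and only illustrate the idea; a real OLAP server would work against a multidimensional database.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
sales = [
    ("North", "Bread", "Q1", 100),
    ("North", "Milk", "Q1", 80),
    ("South", "Bread", "Q1", 60),
    ("North", "Bread", "Q2", 120),
    ("South", "Milk", "Q2", 90),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, *dims):
    """Sum the amount over every dimension that is not listed in dims."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[i] for i in idx)
        totals[key] += fact[-1]
    return dict(totals)

print(roll_up(sales, "region"))             # totals per region
print(roll_up(sales, "region", "quarter"))  # drill down to region by quarter
```

The analyst decides in advance which dimensions to aggregate over, which is what makes the analysis query-driven; a data mining tool would instead search the same facts for patterns nobody asked for explicitly.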
Business Intelligence
• BI takes advantage of data mining and data
warehousing to help organizations gather
their information in a timelier and more
valuable manner
• BI:
– keeps the organization informed about market trends,
– alerts it to new market potentials,
– helps to determine how competitors are doing
• Without such information and knowledge
the organization may suffer false growth or
setbacks
Data Mining & Business
Intelligence
Data Mining vs. Knowledge Discovery in
Databases
• KDD is often used as a synonym for Data
Mining.
– Some authors define KDD as the whole process
involving: data selection → data pre-processing
(cleaning) → data transformation → mining → result
evaluation → visualization
– Data Mining, on the other hand, refers to the
modeling step using the various techniques to
extract useful information/pattern from the data.
• KDD is the process of finding useful
information and patterns in data
• DM is the use of algorithms to extract hidden
patterns & knowledge in data
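The KDD stages can be sketched as a chain of small functions, with mining as only one step among them. Everything here is a hypothetical illustration: the records, the thresholds (age 30, amount 100) and the function names are assumptions, and the final print stands in for the visualization stage.

```python
from collections import Counter

# Hypothetical raw records (invented): (customer, age, purchase amount).
raw = [("a", 25, 40.0), ("b", None, 55.0), ("c", 41, 300.0),
       ("d", 38, 280.0), ("e", 23, 35.0)]

def select(records):
    """Data selection: keep only the attributes relevant to the task."""
    return [(age, amount) for _, age, amount in records]

def preprocess(records):
    """Cleaning: drop records with missing values."""
    return [r for r in records if None not in r]

def transform(records):
    """Transformation: discretize numeric values into categories."""
    return [("young" if age < 30 else "older",
             "high" if amount >= 100 else "low") for age, amount in records]

def mine(records):
    """Mining: count how often each (age group, spend level) pair occurs."""
    return Counter(records)

def evaluate(patterns, min_count=2):
    """Result evaluation: keep only patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

patterns = evaluate(mine(transform(preprocess(select(raw)))))
print(patterns)
```

In this reading, data mining corresponds to `mine` alone, while KDD is the whole chain.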
Stages in data mining: The KDD
process
CRoss Industry Standard Process
for Data Mining (CRISP-DM)
Hybrid Knowledge Discovery
Process
Origins of Data Mining
[Figure: timeline from pre-1960 through the 1960’s, 1970’s, 1980’s and 1990’s]
• Hardware (sensors, storage, computation)
• Relational Databases
• AI, Pattern Recognition, Machine Learning, Data Mining
• “Pencil and Paper”, EDA, “Data Dredging”, “Flexible Models”
DM: Intersection of Many Fields
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
[Figure: Data Mining at the centre, surrounded by Machine Learning (ML), Statistics (stats), Data structure & algorithm analysis, Visualization (viz), Databases (DB), Human Computer Interaction (HCI), High-Performance Parallel Computing, Information retrieval]
Data Mining Metrics
• How to measure the effectiveness or usefulness of
data mining approach?
• Return on Investment (ROI)
– From an overall business or usefulness perspective
a measure such as ROI is used
– ROI compares costs of DM techniques against
savings or benefits from its use
• Accuracy in classification
– Analyze true positive and false positive to calculate
recall, precision of the system
– Measure percentage of correct classification
• Space/Time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
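The ROI and classification measures above can be made concrete with a short sketch. It uses the standard definitions: ROI = (benefit − cost) / cost, precision = TP / (TP + FP) and recall = TP / (TP + FN), so recall also needs the false negatives. The labels and money figures are invented for illustration, and the function names are hypothetical.

```python
def roi(benefit, cost):
    """Return on investment: net benefit relative to cost."""
    return (benefit - cost) / cost

def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical ground truth and model output (1 = fraudulent, 0 = legitimate).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)

# Hypothetical project: 100,000 spent on DM, 150,000 saved.
print(roi(150_000, 100_000))
```

Running time and memory use would be measured separately, for example by timing the algorithm on data sets of increasing size.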
Data Mining implementation
issues
• Scalability
– Applicability of data mining techniques to perform
well with massive real world data sets
– Techniques should also work regardless of the
amount of available main memory
• Real World Data
– Real world data are noisy and have many missing
attribute values. Algorithms should be able to work
even in the presence of these problems
• Updates
– Database can not be assumed to be static. The data
is frequently changing.
– However, many data mining algorithms work with
static data sets. This requires that the algorithm be
completely rerun any time the database changes.
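One way to avoid a complete rerun when the database changes is to keep a model that can be updated incrementally. The sketch below is a hypothetical illustration, not a technique named in the slides: it maintains item frequencies and folds each new transaction into the running counts. Missing or duplicated items in a transaction are tolerated by converting it to a set first.

```python
from collections import Counter

class IncrementalItemCounts:
    """Keeps item frequencies up to date as new transactions arrive."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, transaction):
        """Fold one new transaction into the running counts."""
        self.counts.update(set(transaction))  # count each item once per transaction
        self.n_transactions += 1

    def support(self, item):
        """Fraction of the transactions seen so far that contain item."""
        if self.n_transactions == 0:
            return 0.0
        return self.counts[item] / self.n_transactions

model = IncrementalItemCounts()
for t in [["Bread", "Milk"], ["Bread"], ["Milk", "Eggs"]]:
    model.update(t)
before = model.support("Bread")   # Bread appears in 2 of 3 transactions

model.update(["Bread", "Eggs"])   # the database changes; no rerun needed
after = model.support("Bread")    # Bread appears in 3 of 4 transactions
print(before, after)
```

Not every mining algorithm admits such an update rule, which is why static algorithms remain an implementation issue.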
Data Mining implementation
issues
• High dimensionality:
– A conventional database schema may be composed of
many different attributes. The problem here is that all
attributes may not be needed to solve a given DM problem.
– The use of unnecessary attributes may increase the overall
complexity and decrease the efficiency of an algorithm.
– The solution is dimensionality reduction (reduce the
number of attributes). But, determining which attributes
are not needed is a tough task!
• Overfitting
– The size and representativeness of the dataset determine
whether a model built from a given database state
also fits future database states.
– Overfitting occurs when the model fits the training data but does
not fit future states; it can be caused by a small or
unbalanced training database.
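Overfitting can be illustrated by comparing a model that memorizes a small training set with a simpler rule. The data below are invented, with one deliberately mislabeled training record; noise is used here as the trigger, which is an assumption of this sketch alongside the small, unrepresentative training set discussed above. The names `memorizer`, `fit_threshold` and `simple_rule` are hypothetical.

```python
# Hypothetical data (invented): the true rule is label = 1 when x >= 9.
# The training point (15, 0) is a deliberately mislabeled (noisy) record.
train = [(0, 0), (3, 0), (6, 0), (9, 1), (12, 1), (15, 0), (18, 1), (21, 1)]
test = [(1, 0), (5, 0), (10, 1), (14, 1), (16, 1), (20, 1)]

def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    _, label = min(train, key=lambda point: abs(point[0] - x))
    return label

def accuracy(model, data):
    """Fraction of records in data that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def fit_threshold(data):
    """Pick the cut-off t that makes the rule 'x >= t' most accurate on data."""
    def score(t):
        return accuracy(lambda x: int(x >= t), data)
    return max((x for x, _ in data), key=score)

threshold = fit_threshold(train)

def simple_rule(x):
    """Single-threshold model learned from the training data."""
    return int(x >= threshold)

print("memorizer  :", accuracy(memorizer, train), accuracy(memorizer, test))
print("simple rule:", accuracy(simple_rule, train), accuracy(simple_rule, test))
```

The memorizer is perfect on the training data because it reproduces the noisy record too, and that is exactly what makes it worse on the unseen records.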
Data Mining implementation
issues
• Ease of Use of the DM tool
– Since data mining problems are often not precisely
stated, interfaces may be needed with both domain
and technical experts
– Although some techniques may work well, they may
not be accepted by users if they are difficult to use or
understand

• Application
– Determining the intended use for the information obtained
from the DM tool is a challenge.
– Indeed, how business executives can effectively use the
output is sometimes considered the most difficult part,
because the results are of a type that has not previously
been known.
– Business practices may have to be modified to determine
how to effectively use the information uncovered
Focus area
• Designing efficient DM algorithms &
architectures
– that are scalable to the number of features and instances
extracted from the high dimensional database
• Data miners that handle large, heterogeneous data
(including multimedia data, spatial data, …)
• Presentation of DM results
– To easily view and understand the output of the DM
algorithms there is a need to use knowledge
representation (decision tree, rules, equations, semantic
networks) and visualization techniques (such as graphs,
bar charts, etc.).
• Integration of DM functions into traditional DBMS in
order to design an intelligent database
