1 DM intro
[Figure: hierarchy of depth of meaning, running from explicit to tacit]
• DATA: dispersed elements
• INFORMATION: patterned data
• KNOWLEDGE: validated platform for action
• WISDOM: implicitly knowing how to generate, access and integrate knowledge
• TRUTH
Data/Information Overload
• Data is being produced (generated & collected) at
an alarming rate because of:
– The computerization of business & scientific transactions
– Advances in data collection tools, ranging from scanned
texts & image platforms to satellite remote sensing
systems
– Popular use of WWW as a global information system
• With the phenomenal rate of growth of data, users
expect more sophisticated, useful and valuable
information
– A marketing manager is no longer satisfied with a
simple listing of marketing contacts, but wants
detailed information about customers' past
purchasing behavior and predictions of future
purchases
Too much data & too little knowledge
• There is a need to extract knowledge (useful
information) from the massive data.
– Competitive pressure is strong, and responding to it requires
useful information for prediction
• Facing enormous volumes of data, human analysts
with no special tools can no longer make sense of it.
– Data mining can automate the process of finding patterns
& relationships in raw data and the results can be utilized
for decision support. That is why data mining is used,
especially in science and business areas.
• If we know how to reveal valuable knowledge hidden
in raw data, data might be one of our most valuable
assets.
– Data mining is a tool that applies retrospective analysis
to extract diamonds of knowledge from historical data &
predict future outcomes.
What is data mining?
• Data Mining is a technology that uses various
techniques to discover hidden knowledge
from heterogeneous and distributed
historical data stored in large databases,
warehouses and other massive information
repositories so as to find patterns in data that are:
– valid: not only represent the current state, but also
hold on new data with some certainty
– novel: non-obvious to the system, so that they are
generated as new facts
– useful: it should be possible to act on the item or
problem
– understandable: humans should be able to
interpret the pattern
Why DM Now?
• Four main reasons why DM now?
– The competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers' (prosumers') needs?
• How to manage the high turnover rate of professionals?
• Data Mining
– Find all credit applicants who pose no credit risk.
(classification)
– Identify customers with similar buying habits.
(clustering)
– Find all items which are frequently purchased with
bread. (association rules)
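The "frequently purchased with bread" task can be sketched in a few lines. This is only a minimal illustration, not a full association-rule miner: the transactions, the function name `items_bought_with` and the `min_confidence` threshold are all assumptions made for the example. Confidence here means the fraction of baskets containing bread that also contain the other item.

```python
from collections import Counter

# Hypothetical market-basket transactions (illustrative data, not from the slides).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]


def items_bought_with(item, baskets, min_confidence=0.5):
    """Return {other_item: confidence} for simple rules item -> other_item.

    Confidence = (baskets containing both items) / (baskets containing item).
    Rules below min_confidence (an assumed threshold) are dropped.
    """
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return {}
    # Count how often each other item co-occurs with the target item.
    counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }


rules = items_bought_with("bread", transactions)
print(rules)
```

With this toy data, butter appears in 3 of the 4 bread baskets and jam in 2 of 4, so both pass the 0.5 threshold, while milk (1 of 4) does not.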
Data Mining works with Data Warehouse
• Data Warehouse provides the
Enterprise with a memory
[Figure: roots of data mining. Labels: Hardware (sensors, storage, computation); Relational Databases; AI; Pattern Recognition; Machine Learning; “Flexible Models”; EDA; “Pencil and Paper”; “Data Dredging”; Data Mining]
DM: Intersection of Many Fields
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
[Figure: fields overlapping with data mining. Labels: Machine Learning (ML); Statistics (stats); Data structure & algorithm analysis; Human Computer Interaction (HCI); High-Performance Parallel Computing; Information retrieval]
Data Mining Metrics
• How to measure the effectiveness or usefulness of
a data mining approach?
• Return on Investment (ROI)
– From an overall business or usefulness perspective,
a measure such as ROI is used
– ROI compares the costs of DM techniques against the
savings or benefits from their use
• Accuracy in classification
– Analyze true positives, false positives and false negatives
to calculate the recall and precision of the system
– Measure the percentage of correct classifications
• Space/Time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
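The accuracy and ROI metrics above can be written out as a short sketch. The confusion-matrix counts and the cost and benefit figures are made-up numbers for illustration, and the ROI formula used, (benefit - cost) / cost, is one common definition assumed here; the slides only say that ROI compares costs against savings or benefits.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Precision: of everything predicted positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: percentage of correct classifications overall.
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


def roi(benefit, cost):
    """Return on investment as a fraction of cost (assumed definition)."""
    return (benefit - cost) / cost


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
project_roi = roi(benefit=150_000, cost=100_000)
print(metrics, project_roi)
```

Note that recall needs the false negatives as well as the true positives, which is why a classifier can have high precision and still miss many positive cases.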
Data Mining implementation issues
• Scalability
– Applicability of data mining techniques to perform
well with massive real world data sets
– Techniques should also work regardless of the
amount of available main memory
• Real World Data
– Real world data are noisy and have many missing
attribute values. Algorithms should be able to work
even in the presence of these problems
• Updates
– A database cannot be assumed to be static. The data
changes frequently.
– However, many data mining algorithms work with
static data sets. This requires that the algorithm be
completely rerun any time the database changes.
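As a contrast to a complete rerun, some statistics can be maintained incrementally as records arrive. The sketch below is a hypothetical illustration, not a general data mining algorithm: the class name `RunningMean` is an assumption, and skipping missing values (represented as `None`) is just one simple way to tolerate incomplete real world data.

```python
class RunningMean:
    """Maintain the mean of a stream of values without storing or rescanning them."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new record into the mean; missing values (None) are skipped."""
        if value is None:
            return self.mean
        self.count += 1
        # Incremental update: shift the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean


stats = RunningMean()
for purchase in [10, None, 20, 30]:  # hypothetical purchase amounts, one missing
    stats.update(purchase)
mean_before = stats.mean

# A new record arrives after the database changes: no rerun over old data is needed.
mean_after = stats.update(40)
print(mean_before, mean_after)
```

Each update costs constant time and memory, which also speaks to the scalability point about not depending on the amount of available main memory.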
Data Mining implementation issues
• High dimensionality:
– A conventional database schema may be composed of
many different attributes. The problem here is that not all
attributes may be needed to solve a given DM problem.
– The use of unnecessary attributes may increase the overall
complexity and decrease the efficiency of an algorithm.
– The solution is dimensionality reduction (reduce the
number of attributes). But, determining which attributes
are not needed is a tough task!
• Overfitting
– The size and representativeness of the dataset determine
whether the model built from a given database state
also fits future database states.
– Overfitting occurs when the model fits the training data but
does not fit future states, which can be caused by a small
or unbalanced training database.
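A common symptom of overfitting is a gap between accuracy on the training data and accuracy on data held out for testing. The sketch below makes that gap visible with a deliberately naive "model" that simply memorizes its training rows. The data, the attribute names and the helper functions are all hypothetical, chosen so the training set is small and unbalanced as described above.

```python
from collections import Counter


def train_memorizer(rows):
    """Build a 'model' that memorizes every training row.

    Unseen inputs fall back to the majority label of the training set.
    """
    table = {features: label for features, label in rows}
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: table.get(features, majority)


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(features) == label for features, label in rows) / len(rows)


# Hypothetical training data: (age_band, income_band) -> credit label.
# Small and unbalanced: three "safe" rows, one "risky" row.
train = [
    (("young", "low"), "risky"),
    (("young", "high"), "safe"),
    (("old", "high"), "safe"),
    (("old", "low"), "safe"),
]
# Hypothetical holdout data standing in for a future database state.
holdout = [
    (("middle", "low"), "risky"),
    (("middle", "high"), "safe"),
    (("young", "low"), "risky"),
    (("middle", "low"), "risky"),
]

model = train_memorizer(train)
train_acc = accuracy(model, train)
holdout_acc = accuracy(model, holdout)
print(train_acc, holdout_acc)
```

The memorizer is perfect on the rows it has seen but wrong on half of the holdout rows, because the unseen "middle" age band falls back to the majority label learned from the unbalanced training set.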
Data Mining implementation issues
• Ease of Use of the DM tool
– Since data mining problems are often not precisely
stated, interfaces may be needed with both domain
and technical experts
– Although some techniques may work well, they may
not be accepted by users if they are difficult to use or
understand
• Application
– Determining the intended use for the information obtained
from the DM tool is a challenge.
– Indeed, how business executives can effectively use the
output is sometimes considered the most difficult part,
because the results are of a type that has not previously
been known.
– Business practices may have to be modified to determine
how to effectively use the information uncovered
Focus area
• Designing efficient DM algorithms &
architectures
– that are scalable to the number of features and instances
extracted from the high dimensional database