1 DM Intro
• Datasets
– UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
Discussion
Review 5+ sources (books and articles) on one of the topics below, write a report
(overview, significance, steps involved, applications, review of 2+
related local and international research works, and concluding
remarks), and present it in class within 10 minutes. Send the
PPT, Doc & PDF files to the class, together with one or two major
sources; cc: [email protected]
1. Data Warehouses, Data Mining and Business Intelligence
2. Predictive Modeling
3. Descriptive Modeling
4. Data Mining Models (like CRISP, Hybrid, & other models)
5. Text Mining
6. Web Mining
7. Sentiment/opinion mining
8. Knowledge Mining
9. Multimedia Data Mining
What is data mining?
• Data mining is a technology that uses various
techniques to discover hidden knowledge in
heterogeneous and distributed historical data
stored in large databases, warehouses, and other
massive information repositories, so as to find patterns
in the data that are:
– valid: they not only represent the current state but also hold
on new data with some certainty
– novel: they are non-obvious to the system and are generated
as new facts
– useful: it should be possible to act on the item or
problem
– understandable: humans should be able to interpret
the pattern
Why DM Now?
• Four main reasons why DM is needed now:
– Competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers' (prosumers') needs?
• How to manage the high turnover rate of
professionals?
• Data Mining
– Find all credit applicants who have no credit risks.
(classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items which are frequently purchased with Bread.
(association rules)
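The association-rule task above (items frequently purchased with Bread) can be sketched in a few lines. This is a minimal illustration, not a full Apriori implementation: the transactions, the function name items_bought_with, and the min_confidence threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical market-basket transactions (illustrative data, not from the slides).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
    {"Bread", "Butter", "Eggs"},
]

def items_bought_with(target, baskets, min_confidence=0.5):
    """Return {item: confidence} for rules 'target -> item' that reach min_confidence.

    Confidence is the fraction of baskets containing `target` that also contain `item`.
    """
    with_target = [b for b in baskets if target in b]
    if not with_target:
        return {}
    # Count how often each other item co-occurs with the target item.
    counts = Counter(item for b in with_target for item in b if item != target)
    return {
        item: n / len(with_target)
        for item, n in counts.items()
        if n / len(with_target) >= min_confidence
    }

rules = items_bought_with("Bread", transactions)
print(rules)
```

With this toy data, Butter appears in 3 of the 4 Bread baskets and Milk in 2 of 4, so both pass the 0.5 threshold, while Eggs (1 of 4) does not.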
Data Mining works with Data Warehouse
[Figure: diagram of the fields and technologies around data mining. Labels: Hardware (sensors, storage, computation); Relational Databases; AI; Pattern Recognition; Machine Learning; "Flexible Models"; EDA; "Pencil and Paper"; "Data Dredging"; Data Mining]
DM: Intersection of Many Fields
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
[Figure: fields intersecting with data mining. Labels: Machine Learning (ML); Statistics (stats); Data structure & algorithm analysis; Human Computer Interaction (HCI); High-Performance Parallel Computing; Information retrieval]
Data Mining Metrics
• How to measure the effectiveness or usefulness of data
mining approach?
• Return on Investment (ROI)
– From an overall business or usefulness perspective a
measure such as ROI is used
– ROI compares costs of DM techniques against savings or
benefits from its use
• Accuracy in classification
– Analyze true positives, false positives, and false negatives to
calculate the recall and precision of the system
– Measure percentage of correct classification
• Space/Time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
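The ROI and classification measures above can be computed directly from counts. The sketch below assumes binary labels (1 = positive) and the common ROI definition (benefit minus cost, divided by cost); the slides only say that ROI compares costs against benefits, so that exact formula, the function names, and the sample numbers are assumptions.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute precision, recall, and accuracy from paired label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

def roi(benefit, cost):
    """Return on investment as a fraction: (benefit - cost) / cost (assumed definition)."""
    return (benefit - cost) / cost

# Hypothetical labels and predictions for eight cases.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)
print(roi(benefit=150000, cost=100000))
```

Precision uses false positives and recall uses false negatives, which is why both error types have to be counted, not only the correct classifications.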
Data Mining implementation issues
• Scalability
– Data mining techniques must perform well on
massive real-world data sets
– Techniques should also work regardless of the amount of
available main memory
• Real World Data
– Real world data are noisy and have many missing attribute
values. Algorithms should be able to work even in the
presence of these problems
• Updates
– A database cannot be assumed to be static. The data
change frequently.
– However, many data mining algorithms work with static data
sets, so the algorithm must be completely rerun any
time the database changes.
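For the missing attribute values mentioned under Real World Data, one simple and common workaround is to fill each gap with the column mean before mining. This is only one of several possible strategies and is not prescribed by the slides; the function name impute_mean and the sample rows are assumptions for illustration.

```python
def impute_mean(rows, missing=None):
    """Replace missing numeric values in each column with that column's mean.

    A column with no observed values is left unchanged.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [r[c] for r in rows if r[c] is not missing]
        means.append(sum(present) / len(present) if present else missing)
    return [
        [means[c] if r[c] is missing else r[c] for c in range(n_cols)]
        for r in rows
    ]

# Hypothetical records: [age, income], with None marking a missing value.
rows = [
    [25, 50000.0],
    [None, 60000.0],
    [35, None],
    [30, 70000.0],
]
cleaned = impute_mean(rows)
print(cleaned)
```

Mean imputation keeps every record usable but shrinks the variance of the filled column, so it is a trade-off rather than a fix for noisy data.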
Data Mining implementation issues
• High dimensionality:
– A conventional database schema may be composed of many
different attributes. The problem here is that all attributes may not
be needed to solve a given DM problem.
– The use of unnecessary attributes may increase the overall
complexity and decrease the efficiency of an algorithm.
– The solution is dimensionality reduction (reduce the number of
attributes). But, determining which attributes are not needed is a
tough task!
• Overfitting
– The size and representativeness of the dataset determine
whether a model built from a given database state also fits
future database states.
– Overfitting occurs when the model fits the training data but does not
fit future states; it is often caused by a small or unbalanced training
database.
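As a small illustration of the dimensionality reduction mentioned under High dimensionality, the sketch below drops attributes whose values never vary, since a constant attribute cannot help distinguish records. This low-variance filter is only one simple heuristic chosen for the example; the function name drop_low_variance, the threshold parameter, and the sample attributes are assumptions.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, [names[i] for i in keep]

# Hypothetical numeric attributes; country_code is constant in this sample.
names = ["age", "country_code", "income"]
rows = [
    [25, 1, 50000],
    [35, 1, 60000],
    [30, 1, 70000],
]
reduced, kept = drop_low_variance(rows, names)
print(kept)
print(reduced)
```

A filter like this only catches obviously uninformative attributes; deciding which of the remaining attributes matter for a given DM problem is the hard part noted above.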
Data Mining implementation issues
• Ease of Use of the DM tool
– Since data mining problems are often not precisely stated,
interfaces may be needed with both domain and technical
experts
– Although some techniques may work well, they may not be
accepted by users if they are difficult to use or understand
• Application
– Determining the intended use for the information obtained from
the DM tool is a challenge.
– Indeed, how business executives can effectively use the output is
sometimes considered the most difficult part, because the results
are of a type that has not previously been known.
– Business practices may have to be modified to determine how to
effectively use the information uncovered
Focus area
• Designing efficient DM algorithms & architectures
– that scale with the number of features and instances
extracted from a high-dimensional database
• Data miners that handle large, heterogeneous data
(including multimedia data, spatial data, …)
• Presentation of DM results
– To easily view and understand the output of the DM
algorithms there is a need to use knowledge
representation (decision tree, rules, equations, semantic
networks) and visualization techniques (such as graphs,
bar charts, etc.).
• Integration of DM functions into traditional DBMS in
order to design an intelligent database