Data Mining (DM) Intro
Data/Information Overload
• Data is being produced (generated & collected) at an
alarming rate because of:
– The computerization of business & scientific transactions
– Advances in data collection tools, ranging from scanned
texts & image platforms to satellite remote sensing systems
– The popular use of the WWW as a global information system
• With this phenomenal rate of data growth, users expect
more sophisticated, useful, and valuable information
– A marketing manager is no longer satisfied with a simple
listing of marketing contacts, but wants detailed information
about customers' past purchasing behavior and predictions
of future purchases
Too much data & too little knowledge
• There is a need to extract knowledge (useful information)
from the massive data.
– Competitive pressures are strong, and competing well requires
useful information for prediction
• Facing such enormous volumes of data, human analysts with
no special tools can no longer make sense of it.
– Data mining can automate the process of finding patterns &
relationships in raw data, and the results can be utilized for
decision support. That is why data mining is used, especially
in science and business areas.
• Data Mining
– Find all credit applicants who pose no credit risk. (Classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items that are frequently purchased with bread.
(Association rules; a sketch follows below)
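
As a rough illustration of the association-rule example above, the following minimal Python sketch counts how often other items appear in the same transactions as bread; the transactions list and the items_bought_with helper are made up for illustration, and a real system would use a dedicated algorithm such as Apriori or FP-growth.

from collections import Counter

# Hypothetical transaction data: each transaction is the set of items in one basket.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def items_bought_with(target, transactions, min_support=0.5):
    # Return items that co-occur with `target` in at least `min_support`
    # of all transactions (a crude stand-in for association-rule mining).
    co_counts = Counter()
    for basket in transactions:
        if target in basket:
            co_counts.update(basket - {target})
    n = len(transactions)
    return {item: cnt / n for item, cnt in co_counts.items() if cnt / n >= min_support}

print(items_bought_with("bread", transactions))  # e.g. {'milk': 0.5, 'butter': 0.5}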
Data Mining works with Data Warehouse
• Data mart
– A subset of a data warehouse for small and medium-size
businesses or departments within larger companies
Data Warehouse Stores Heterogeneous Data
[Figure: heterogeneous sources (relational, hierarchical, and network
databases, flat files, spreadsheets) pass through an extraction process
and a data cleanup process into the data warehouse, which end users
access through query and analysis tools.]
Data warehousing
• Data warehouse is an integrated, subject-oriented,
time-variant, non-volatile database that provides
support for decision making.
• Integrated → centralized, consolidated database that
integrates data derived from the entire organization.
• Consolidates data from multiple & diverse sources with
diverse formats.
• Helps managers to better understand the company’s
operations.
[Figure: data mining at the confluence of hardware (sensors, storage,
computation), relational databases, AI, pattern recognition, machine
learning, and EDA, spanning approaches from "pencil and paper" to
"flexible models" and "data dredging".]
DM: Intersection of Many Fields
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
[Figure: related fields include machine learning (ML), statistics,
data structure & algorithm analysis, human-computer interaction (HCI),
high-performance parallel computing, and information retrieval.]
Data Mining Metrics
• How do we measure the effectiveness or usefulness of a data
mining approach?
• Return on Investment (ROI)
– From an overall business or usefulness perspective, a measure
such as ROI is used
– ROI compares the costs of DM techniques against the savings or
benefits from their use
• Accuracy in classification
– Analyze true positives and false positives to calculate the recall
and precision of the system (see the sketch after this list)
– Measure the percentage of correctly classified instances
• Space/Time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
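
To make the accuracy metrics concrete, here is a minimal Python sketch (with made-up labels, where 1 stands for "no credit risk") that computes precision, recall, and the percentage of correct classifications from true and predicted labels; library implementations such as scikit-learn add averaging options and edge-case handling.

def classification_metrics(y_true, y_pred, positive=1):
    # Count true positives, false positives, and false negatives for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(y_true)
    return precision, recall, accuracy

# Hypothetical labels: 1 = "no credit risk", 0 = "credit risk"
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.666...)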
Data Mining implementation issues
• Scalability
– Data mining techniques should perform well on massive
real-world data sets
– Techniques should also work regardless of the amount of
available main memory
• Real World Data
– Real-world data are noisy and have many missing attribute
values. Algorithms should be able to work even in the presence
of these problems (a simple imputation sketch follows this list)
• Updates
– The database cannot be assumed to be static; the data is
frequently changing.
– However, many data mining algorithms work only with static data
sets. This requires that the algorithm be completely rerun any
time the database changes.
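
The missing-value problem mentioned under "Real World Data" can be handled in many ways; one common, simple approach (not prescribed by these slides) is mean imputation, sketched below in Python with a hypothetical impute_missing helper and made-up customer records.

def impute_missing(records, attribute):
    # Replace missing (None) values of `attribute` with the mean of the observed values.
    observed = [r[attribute] for r in records if r[attribute] is not None]
    mean = sum(observed) / len(observed)
    for r in records:
        if r[attribute] is None:
            r[attribute] = mean
    return records

# Hypothetical customer records with one missing "age" value
customers = [{"age": 30}, {"age": None}, {"age": 50}]
print(impute_missing(customers, "age"))  # [{'age': 30}, {'age': 40.0}, {'age': 50}]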
Data Mining implementation issues
• High dimensionality
– A conventional database schema may be composed of many
different attributes. The problem is that not all of these attributes
may be needed to solve a given DM problem.
– The use of unnecessary attributes may increase the overall
complexity and decrease the efficiency of an algorithm.
– The solution is dimensionality reduction (reduce the number of
attributes). But determining which attributes are not needed is a
tough task! (A PCA sketch follows this list.)
• Overfitting
– The size and representativeness of the dataset determine
whether a model built from the current database state also fits
future database states.
– Overfitting occurs when the model fails to generalize to future
states, typically because the training database is too small or
unbalanced.
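
As one concrete way to reduce the number of attributes, the sketch below applies principal component analysis (PCA) from scikit-learn to a small made-up data matrix; PCA is just one of several dimensionality-reduction techniques and is used here purely as an illustration under those assumptions.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 records described by 4 attributes, where some
# attributes are redundant (highly correlated) and add little information.
X = np.array([
    [1.0,  2.0, 1.1, 0.9],
    [2.0,  4.1, 2.0, 2.1],
    [3.0,  6.0, 3.1, 2.9],
    [4.0,  8.2, 3.9, 4.0],
    [5.0,  9.9, 5.0, 5.1],
    [6.0, 12.1, 6.1, 5.9],
])

# Project the 4 original attributes onto 2 derived features that retain
# most of the variance, reducing dimensionality before mining.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (6, 2)
print(pca.explained_variance_ratio_)  # most variance lies in the first component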
Data Mining implementation issues
• Ease of Use of the DM tool
–Since data mining problems are often not precisely stated,
interfaces may be needed with both domain and technical
experts
–Although some techniques may work well, they may not be
accepted by users if they are difficult to use or understand
• Application
– Determining the intended use for the information obtained from
the DM tool is a challenge.
– Indeed, how business executives can effectively use the output is
sometimes considered the most difficult part, because the results
are often of a kind that has not previously been known.
– Business practices may have to be modified to determine how to
effectively use the information uncovered
Focus area
• Designing efficient DM algorithms & architectures
– that are scalable to the number of features and instances
extracted from high-dimensional databases