
DM Intro - 1

Data mining is used to extract useful knowledge from large amounts of data. As data is increasingly collected electronically, there is too much data for humans to analyze alone. Data mining uses algorithms to automatically find patterns and relationships in raw data. The results can be used for decision making. Data mining is now widely used because computing power and data storage are cheaper, commercial data mining tools exist, and organizations collect massive amounts of data. Data warehouses store historical data from across an organization to support data mining and business intelligence applications.

Introduction to Data Mining (DM)
Data/Information Overload
• Data is being produced (generated & collected) at an
alarming rate because of:
– The computerization of business & scientific transactions
– Advances in data collection tools, ranging from scanned
texts & image platforms to satellite remote sensing
systems
– Popular use of WWW as a global information system
• With the phenomenal rate of growth of data, users
expect more sophisticated, useful, and valuable
information
– A marketing manager is no longer satisfied with a
simple listing of marketing contacts, but wants
detailed information about customers' past purchasing
behavior and predictions of future purchases
Too much data & too little knowledge
• There is a need to extract knowledge (useful information)
from the massive data.
– Competitive pressure is strong, so organizations need useful
information for prediction
• Faced with enormous volumes of data, human analysts
without special tools can no longer make sense of it.
– Data mining can automate the process of finding patterns &
relationships in raw data and the results can be utilized for
decision support. That is why data mining is used, especially in
science and business areas.

• If we know how to reveal valuable knowledge hidden in
raw data, data might be one of our most valuable assets.
– Data mining is the tool that uses retrospective analysis to
extract diamonds of knowledge from historical data & predict
future outcomes.
What is data mining?
• Data Mining is a technology that uses various
techniques to discover hidden knowledge from
heterogeneous and distributed historical data
stored in large databases, warehouses and other
massive information repositories, in order to find
patterns in data that are:
– valid: they not only represent the current state, but
also hold on new data with some certainty
– novel: they are non-obvious to the system and are
generated as new facts
– useful: it should be possible to act on them
– understandable: humans should be able to interpret
the pattern
Why DM Now?
• Four main reasons why DM now:
– The competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers' (prosumers') needs?
• How to manage the high turnover rate of
professionals?
– Massive data collection: data is produced at an alarming
rate & is being warehoused
– The computing power is available and is affordable
– DM commercial products and machine learning
algorithms are available
Why use DM Now: Massive data collection
• Massive data collection: large databases
(data warehouses) are growing at
unprecedented rates to manage the explosive
growth in stored data.
• Examples of massive data sets
– Google: on the order of 10 billion Web pages indexed
• Hundreds of millions of site visitors per day
– MEDLINE text database: 17 million published
articles
– Retail transaction data: eBay, Amazon, Wal-Mart:
on the order of 100 million transactions per day
• Visa, MasterCard: similar or larger numbers
Why use DM Now: Powerful computers
• Powerful computers: The computing power is available and
is also affordable
–The need for improved computational engines can now be
met in a cost-effective manner with parallel multiprocessor
computer technology.
• Technological Driving Factors
– Larger, cheaper memory (in hundreds of GBs, not in MBs)
• Moore’s law for magnetic disk density
“capacity doubles every 18 months”
• Storage cost per byte falling rapidly
– Faster, cheaper processors (in GHz, not in MHz)
• the CRAY of 15 years ago is now on your desk
– Success of Relational Databases and the World Wide Web
• everybody is a “data owner”
Why DM Now: DM algorithms
• Commercial products (for data mining) are
available
– Data mining algorithms have matured & there are
reliable tools that consistently outperform older statistical
methods.
– New ideas in machine learning/statistics
• Boosting, SVMs, decision trees, non-parametric Bayes,
text models, etc.
– Existence of around 20-30 mining tool vendors
– Existence of many embedded products
• Fraud detection
• Customer relationship management
• Health care
• E-commerce applications
Example: Why Data Mining
• Customer relationship management:
– Which of my customers are likely to be the most loyal,
and which are most likely to leave for a competitor?

• Fraud detection/Network intrusion detection
– Which types of transactions are likely to be fraudulent,
given the demographics and transactional history of a
particular customer?

Data Mining helps extract such useful information


Database Processing vs. Data Mining Processing
• Query
– Database: well defined; Structured Query Language (SQL)
– Data mining: poorly defined; no precise query language
– Comments: the data miner might not know exactly what he wants to see
• Data
– Database: operational data
– Data mining: non-operational data
– Comments: the data have been cleansed and modified to better support the mining process
• Output
– Database: precise; a subset of the database
– Data mining: not a subset of the database
– Comments: the output is some hidden useful patterns & knowledge in the database
Query Examples
• Database
– Find all credit applicants with first name ‘Alex’.
– Identify customers who have purchased more than Birr
10,000 worth of goods in the last month.
– Find all customers who have purchased bread.

• Data Mining
– Find all credit applicants who pose no credit risk.
(classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with bread.
(association rules)
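The contrast between the two kinds of question can be made concrete with a small sketch. The transactions, function names and the 0.5 confidence threshold below are illustrative assumptions, not part of the lecture; the first function answers a database-style lookup, while the second derives a simple association pattern (items bought together with bread) that is not stored anywhere in the data.

```python
from collections import Counter

# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def customers_who_bought(item, baskets):
    """Database-style query: indexes of the baskets that contain `item`."""
    return [i for i, basket in enumerate(baskets) if item in basket]

def frequently_bought_with(item, baskets, min_confidence=0.5):
    """Mining-style question: which items co-occur with `item` often enough?

    The confidence of the rule item -> other is
    (baskets containing both) / (baskets containing item).
    """
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return {}
    co_counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: count / len(with_item)
        for other, count in co_counts.items()
        if count / len(with_item) >= min_confidence
    }

bread_buyers = customers_who_bought("bread", transactions)
rules = frequently_bought_with("bread", transactions)
print(bread_buyers)
print(rules)
```

The lookup returns rows that already exist; the rule (here, that butter tends to accompany bread) is a pattern computed across rows, which is why it is not a subset of the database.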
Data Mining works with Data Warehouse

• Data Warehouse provides the Enterprise with a memory

• Data Mining provides the Enterprise with intelligence
Data Warehouse as part of Data Mining
Data Mining & Business Intelligence
Data Warehouse
• Data warehouse
– A data warehouse is a relational database management system
responsible for the collection and storage of data to support
management decision making and problem solving.
– It enables managers and other business professionals to carry
out data mining, online analytical processing, market research
and decision support
– It is the current evolution of Decision Support Systems (DSSs)

• Data mart
– A subset of a data warehouse for small and medium-size
businesses or departments within larger companies
Data Warehouse Stores Heterogeneous Data
[Figure: relational databases, hierarchical databases, network databases, flat files and spreadsheets pass through a data extraction process and a data cleanup process into the data warehouse, which end users access through query and analysis tools.]
Data warehousing
• A data warehouse is an integrated, subject-oriented,
time-variant, non-volatile database that provides
support for decision making.
• Integrated → centralized, consolidated database that
integrates data derived from the entire organization.
– Consolidates data from multiple & diverse sources with
diverse formats.
– Helps managers to better understand the company’s
operations.
• Subject-Oriented → Data warehouse contains data
organized by topics.
– E.g. Sales, marketing, finance, etc.
Data warehousing
• Time-variant → In contrast to the operational
database, which focuses on current transactions, the
data warehouse represents the flow of data through
time.
– The data warehouse contains data that reflect what happened
last week, last month, over the past five years, and so on.
• Non-volatile → Once data enter the data
warehouse, they are never removed, because the
data in the warehouse represent the company’s
entire history.
– Because data is added all the time, the warehouse keeps
growing.
Database & data warehouse: Differences
• The data warehouse and operational environments
are separated. The data warehouse receives its data from
operational databases.
– The data warehouse environment is characterized by read-only
transactions on very large data sets.
– The operational environment is characterized by numerous
update transactions on a few data entities at a time.
– The data warehouse contains historical data over a long time
horizon.
• Ultimately, information is created from data warehouses. Such
information becomes the basis for rational decision making.
• The data found in the data warehouse is analyzed to discover
previously unknown data characteristics, relationships,
dependencies, or trends.
Data Processing Technologies
• OLAP – Online Analytical Processing
– Refers to an advanced data analysis environment that
supports decision making.
– Provides access to multidimensional databases with
managerially useful display techniques
• Data mining tools analyze the data to uncover problems or
opportunities hidden in the data relationships.
– E.g. credit system: who is likely not to pay their
debts?
– E.g. crime database: who is likely to commit what kind
of crime?
• OLAP provides top-down, query-driven analysis
• Data mining provides bottom-up, discovery-driven analysis
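As a rough illustration of what "query-driven" means for OLAP, the sketch below rolls up a tiny fact table along dimensions chosen by the analyst. The sales records, the dimension names and the `roll_up` helper are assumptions made up for this example; real OLAP servers work on multidimensional cubes, not Python lists.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
sales = [
    ("North", "Bread", "Q1", 120),
    ("North", "Milk", "Q1", 80),
    ("South", "Bread", "Q1", 200),
    ("North", "Bread", "Q2", 150),
    ("South", "Milk", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, by):
    """Sum the sales amount over the dimensions named in `by`."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[p] for p in positions)
        totals[key] += fact[-1]  # the last field is the amount
    return dict(totals)

# The analyst decides which view to ask for (top-down, query-driven).
by_region = roll_up(sales, by=("region",))
by_region_quarter = roll_up(sales, by=("region", "quarter"))
print(by_region)
print(by_region_quarter)
```

In both calls the analyst already knows which breakdown to request. A data mining tool, by contrast, would be asked to find which breakdowns or relationships are interesting without being told where to look.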
Data Mining vs. Knowledge Discovery in
Databases
• KDD is often used as a synonym for Data Mining.
– Some authors define KDD as the whole process involving:
data selection ➔ data pre-processing (cleaning) ➔ data
transformation ➔ mining ➔ result evaluation ➔
visualization
– Data Mining, on the other hand, refers to the modeling
step, which uses various techniques to extract useful
information/patterns from the data.
• KDD is the process of finding useful information and
patterns in data
• DM is the use of algorithms to extract hidden
patterns & knowledge in data
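The KDD steps listed above can be sketched as a chain of small functions. Everything here is a toy assumption for illustration: the records, the field names, the choice to drop incomplete records, min-max scaling as the transformation, and a mean threshold as the "mining" step. Only the order of the stages comes from the slide.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"customer": "A", "age": 25, "spend": 120.0, "notes": "x"},
    {"customer": "B", "age": None, "spend": 300.0, "notes": "y"},
    {"customer": "C", "age": 41, "spend": 80.0, "notes": "z"},
    {"customer": "D", "age": 38, "spend": 310.0, "notes": "w"},
]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Pre-processing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, field):
    """Transformation: rescale one numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, field):
    """Mining (toy model): label records above the mean as 'high'."""
    threshold = mean(r[field] for r in records)
    return {r["customer"]: "high" if r[field] > threshold else "low" for r in records}

def evaluate(pattern):
    """Evaluation: summarize the result so that a person can judge it."""
    highs = sorted(c for c, label in pattern.items() if label == "high")
    return f"{len(highs)} of {len(pattern)} customers are high spenders: {highs}"

selected = select(raw, ["customer", "age", "spend"])
cleaned = clean(selected)
scaled = transform(cleaned, "spend")
pattern = mine(scaled, "spend")
summary = evaluate(pattern)
print(summary)
```

Only the `mine` step corresponds to data mining in the narrow sense; the surrounding functions are what the broader KDD definition adds.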
Stages in data mining: The KDD process
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Origins of Data Mining
[Figure: timeline from pre-1960 through the 1960s, 1970s, 1980s and 1990s, with the labels hardware (sensors, storage, computation), relational databases, AI, pattern recognition, machine learning, "pencil and paper", EDA, "data dredging", "flexible models" and data mining.]
DM: Intersection of Many Fields
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
[Figure: data mining at the center, surrounded by machine learning (ML), statistics (stats), data structure & algorithm analysis, visualization (viz), databases (DB), human-computer interaction (HCI), high-performance parallel computing and information retrieval.]
Data Mining Metrics
• How to measure the effectiveness or usefulness of a data
mining approach?
• Return on Investment (ROI)
– From an overall business or usefulness perspective, a
measure such as ROI is used
– ROI compares the costs of DM techniques against the savings or
benefits from their use
• Accuracy in classification
– Analyze true positives, false positives and false negatives to
calculate the recall and precision of the system
– Measure the percentage of correct classifications
• Space/Time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
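A minimal sketch of how these metrics are usually computed is given below. The label vectors and the cost and benefit figures are invented for illustration, and ROI is taken here as (benefit − cost) / cost, which is a common definition but an assumption as far as this lecture is concerned.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision and recall from two label sequences."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision: of everything predicted positive, how much was right?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: of everything actually positive, how much was found?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def roi(benefit, cost):
    """Return on investment, assumed here to be (benefit - cost) / cost."""
    return (benefit - cost) / cost

# Hypothetical ground truth and model output (1 = fraudulent, 0 = legitimate).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = classification_metrics(actual, predicted)
project_roi = roi(benefit=150_000, cost=100_000)
print(metrics)
print(project_roi)
```

Precision and recall are reported separately because a model can score well on one and poorly on the other, for example by flagging almost every transaction as fraudulent.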
Data Mining implementation issues
• Scalability
–Data mining techniques should perform well with massive
real-world data sets
–Techniques should also work regardless of the amount of
available main memory
• Real World Data
–Real-world data are noisy and have many missing attribute
values. Algorithms should be able to work even in the
presence of these problems
• Updates
–The database cannot be assumed to be static. The data
changes frequently.
–However, many data mining algorithms work with static data
sets. This requires that the algorithm be completely rerun any
time the database changes.
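One way around the "complete rerun" problem, for statistics that allow it, is to keep a summary that can be updated record by record. The class below is a hedged sketch of that idea for a running mean; the class name, the sample values and the choice to skip missing values are assumptions for illustration, and many real mining algorithms cannot be made incremental this easily.

```python
class RunningMean:
    """Mean that is updated one value at a time, without rescanning old data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new value into the mean; missing values are skipped."""
        if value is None:  # missing attribute value
            return
        self.count += 1
        # Shift the old mean toward the new value by 1/count of the gap.
        self.mean += (value - self.mean) / self.count

stats = RunningMean()
for v in [10.0, None, 20.0, 30.0]:  # initial load, including one missing value
    stats.update(v)
mean_before = stats.mean

stats.update(40.0)  # the database changed: one new record arrives
mean_after = stats.mean
print(mean_before, mean_after)
```

The new record is absorbed in constant time and constant memory, which also speaks to the scalability concern, since the full data set never has to sit in main memory.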
Data Mining implementation issues
• High dimensionality:
–A conventional database schema may be composed of many
different attributes. The problem here is that not all attributes
may be needed to solve a given DM problem.
–The use of unnecessary attributes may increase the overall
complexity and decrease the efficiency of an algorithm.
–The solution is dimensionality reduction (reduce the number of
attributes). But determining which attributes are not needed is a
tough task!
• Overfitting
–The size and representativeness of the dataset determine
whether the model built from a given database state also
fits future database states.
–Overfitting occurs when the model does not fit future states,
which is typically caused by a small or unbalanced training
database.
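Overfitting can be seen in miniature with a model that simply memorizes its training rows. The data, the attribute names and the "memorizer" itself are assumptions built for this illustration; the point is only the gap between accuracy on the training data and accuracy on records the model has not seen.

```python
from collections import Counter

def train_memorizer(rows):
    """Overfit on purpose: remember the label of every exact training row."""
    table = {features: label for features, label in rows}
    # For unseen feature combinations, fall back to the most common label.
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: table.get(features, default)

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(features) == label for features, label in rows) / len(rows)

# Hypothetical data: features are (age_band, income_band), label is credit risk.
train = [
    (("young", "low"), "risky"),
    (("young", "high"), "safe"),
    (("old", "low"), "safe"),
]
test = [
    (("young", "low"), "risky"),
    (("old", "high"), "safe"),
    (("middle", "low"), "risky"),
    (("middle", "high"), "safe"),
]

model = train_memorizer(train)
train_accuracy = accuracy(model, train)
test_accuracy = accuracy(model, test)
print(train_accuracy, test_accuracy)
```

The model is perfect on the small training set but loses accuracy on later records, because three rows cannot represent the combinations that appear in the future database state.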
Data Mining implementation issues
• Ease of Use of the DM tool
–Since data mining problems are often not precisely stated,
interfaces may be needed with both domain and technical
experts
–Although some techniques may work well, they may not be
accepted by users if they are difficult to use or understand

• Application
– Determining the intended use for the information obtained from
the DM tool is a challenge.
– Indeed, how business executives can effectively use the output is
sometimes considered the most difficult part, because the results
are of a type that has not previously been known.
– Business practices may have to be modified to determine how to
effectively use the information uncovered
Focus area
• Designing efficient DM algorithms & architectures
– that are scalable to the number of features and instances
extracted from the high dimensional database
• Data miners that handle large, heterogeneous data
(including multimedia data, spatial data, …)
• Presentation of DM results
– To easily view and understand the output of the DM
algorithms there is a need to use knowledge representation
(decision tree, rules, equations, semantic networks) and
visualization techniques (such as graphs, bar charts, etc.).
• Integration of DM functions into traditional DBMS in
order to design an intelligent database
Assignment
Review the literature (books and articles) & write a report
(overview, significance, steps involved, applications, review of
2+ related local and international research works and
concluding remarks) and present it in class.
1. Data Warehouses
2. Data Mining & Knowledge discovery in databases
3. Exploratory Data Analysis
4. Predictive Modeling
5. Descriptive Modeling
6. Data Mining Models (like CRISP, SEMMA & three other models)
