1 DM Intro

Uploaded by

fikru
Copyright
© © All Rights Reserved
Business Intelligence and Data Mining


Why the focus shifts to “Knowledge”
• We live in a dynamic, complex environment, which is characterized by:
– Competitors: very strong competition
– Market: the market is volatile
• The business landscape is changing rapidly and non-linearly
– Customers: customers have reached the level of prosumers
• Prosumers are more educated consumers who provide feedback on the products/services they need
– Professionals: high turnover rate of professionals
• Diminishing individual experience
Data, Information, Knowledge, Wisdom & Truth
• What is Data and Information? Are they different from
Knowledge? Wisdom? Truth?
• fact != data != information != knowledge != wisdom != truth
• Data: unorganized and unprocessed facts; static; a set of discrete facts about events
– No meaning is attached to it, so it may have multiple meanings
– Example: what does “Alex” mean?
• Information: aggregation of data that makes decision making easier
– Meaning is attached and contextualized
– Answers questions: what, who, when, where
• Knowledge: includes facts about real-world entities and the relationships between them. It is an understanding gained through experience
– Answers the ‘how’ question
• Wisdom: embodies principles, insight and morals by integrating knowledge
– Answers the ‘why’ question
• Truth: making the mind think and believe in doing what is true for all not for narrow
Data/Information Overload
• Data is being produced (generated & collected) at an alarming rate because of:
– The computerization of business & scientific transactions
– Advances in data collection tools, ranging from scanned texts & image platforms to satellite remote sensing systems
– Popular use of the WWW as a global information system
• With the phenomenal rate of growth of data, users expect more sophisticated, useful and valuable information
– A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers' past purchasing behavior and predictions of future purchases
Too much data & too little knowledge
• There is a need to extract knowledge (useful information) from the massive data.
– Competitive pressures are strong, which creates a need for useful information for prediction
• Faced with enormous volumes of data, human analysts with no special tools can no longer make sense of it.
– Data mining can automate the process of finding patterns & relationships in raw data, and the results can be used for decision support. That is why data mining is used, especially in science and business.
• If we know how to reveal the valuable knowledge hidden in raw data, data might be one of our most valuable assets.
– Data mining is the tool that uses retrospective analysis to extract diamonds of knowledge from historical data & predict future outcomes.
The Way Forward
Topics and areas covered:
• Introducing Data mining (DM)
– Meaning of Data Mining; Essence of DM; Relationship between Data Mining, Data Warehousing and On-line Analytical Processing; Issues in DM; The KDD/DM Process Model
• Data Preparation for Knowledge Discovery
– Data Exploration; Quality Data Preparation; Major Tasks in Data Preprocessing; Data Cleaning; Data Integration; Data Reduction; Data Transformation
• DM tasks: classification
– Concepts of Classification; K-Nearest Neighbour; Decision Trees; Naive Bayes; Neural Networks
• DM tasks: Clustering
– Overview of Clustering; Partitioning algorithms: K-Means & K-Medoids; Hierarchical Clustering: Agglomerative & Divisive Algorithms; Single-link, Double link & Average link clustering
• DM tasks: association rules discovery
– Overview of Pattern Discovery; Frequent Pattern Finding and Association Rules Discovery; Apriori algorithm; Pattern-Growth Approach
Reference
• Jiawei Han and Micheline Kamber, (2006), Data Mining:
Concepts and Techniques, 2nd edition, Morgan Kaufmann.
• M.H.Dunham, (2002) Data Mining, Introductory and Advanced
Topics, Prentice Hall.
• Tan, Steinbach, Kumar, (2006), Introduction to Data Mining,
Addison-Wesley, ISBN 0-321-32136-7
• Chakrabarti, Soumen (2003). Mining the Web: Discovering
Knowledge from Hypertext, Morgan Kaufmann Publishers.
• Scime, Anthony (2005). Web Mining: Applications and
Techniques, Idea group Inc.

• Datasets
– UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
Discussion
Review at least five sources (books and articles) and write a report covering an overview, significance, steps involved, applications, a review of at least two related local and international research works, and concluding remarks. Present it in class within 10 minutes. Send the PPT, Doc & PDF files and one or two major sources to the class; cc at: [email protected]
1. Data Warehouses, Data Mining and Business Intelligence
2. Predictive Modeling
3. Descriptive Modeling
4. Data Mining Models (like CRISP, Hybrid, & other models)
5. Text Mining
6. Web Mining
7. Sentiment/opinion mining
8. Knowledge Mining
9. Multimedia Data Mining
What is data mining?
• Data mining is a technology that uses various techniques to discover hidden knowledge in heterogeneous and distributed historical data stored in large databases, warehouses and other massive information repositories, in order to find patterns in the data that are:
– valid: they not only represent the current state, but also hold on new data with some certainty
– novel: non-obvious to the system, generated as new facts
– useful: it should be possible to act on the item or problem
– understandable: humans should be able to interpret the pattern
Why DM Now?
• Four main reasons why DM now:
– The competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy the needs of customers (prosumers)?
• How to manage the high turnover rate of professionals?
– Massive data collection: data is produced at an alarming rate & is being warehoused
– Computing power is available and affordable
– Commercial DM products and machine learning algorithms are available
Why DM Now: Massive data collection
• Massive data collection: large databases (data
warehouses) are growing at unprecedented rates
to manage the explosive growth in stored data.
• Examples of massive data sets
– Google: Order of 10 billion Web pages indexed
• 100’s of millions of site visitors per day
– MEDLINE text database: 17 million published
articles
– Retail transaction data: eBay, Amazon, Wal-Mart: on the order of 100 million transactions per day
• Visa, MasterCard: similar or larger numbers
Why DM Now: Powerful computers
• Powerful computers: The computing power is available and
is also affordable
– The need for improved computational engines can now be
met in a cost-effective manner with parallel multiprocessor
computer technology.
• Technological Driving Factors
– Larger, cheaper memory (in hundred GBs, not in MBs)
• Moore’s law for magnetic disk density
“capacity doubles every 18 months”
• Storage cost per byte falling rapidly
– Faster, cheaper processors (in GHz, not in MHz)
• the CRAY of 15 years ago is now on your desk
– Success of Relational Databases and the World Wide Web
• everybody is a “data owner”
Why DM Now: DM algorithms
• Commercial products (for data mining) are available
– Data mining algorithms have matured & there are reliable tools that consistently outperform older statistical methods.
– New ideas in machine learning/statistics
• Boosting, SVMs, decision trees, non-parametric Bayes,
text models, etc
– Existence of around 20-30 mining tool vendors
– Existence of many embedded products
• Fraud detection
• Customer relationship management
• Health care
• E-commerce applications
Example: Why Data Mining
• Customer relationship management:
– Which of my customers are likely to be the most loyal,
and which are most likely to leave for a competitor?
• Credit ratings:
– Given a database of 100,000 names, which persons are
the least likely to default on their credit cards?
• Targeted marketing:
– Identify likely responders to sales promotions
• Fraud detection/Network intrusion detection
– Which types of transactions are likely to be fraudulent,
given the demographics and transactional history of a
particular customer?
Data Mining helps extract such useful information
Database Processing vs. Data Mining Processing
• Query
– Database: well defined; Structured Query Language (SQL)
– Data mining: poorly defined; no precise query language
– Comments: the data miner might not know exactly what he wants to see
• Data
– Database: operational data
– Data mining: non-operational data
– Comments: the data have been cleansed and modified to better support the mining process
• Output
– Database: precise; a subset of the database
– Data mining: not a subset of the database
– Comments: the output is some hidden useful patterns & knowledge in the database
Query Examples
• Database
– Find all credit applicants with first name ‘Alex’.
– Identify customers who have purchased more than Birr
10,000 in the last month.
– Find all customers who have purchased Bread

• Data Mining
– Find all credit applicants who have no credit risks.
(classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items which are frequently purchased with Bread.
(association rules)
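The contrast between the two kinds of request can be sketched in Python. The transaction list, item names and the `min_support` value below are made up for illustration. The first function is an ordinary lookup; the second counts co-occurrences, which is only the first step toward association rule discovery, not a full algorithm.

```python
from collections import Counter

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Jam"},
    {"Milk", "Eggs"},
    {"Bread", "Butter", "Jam"},
]

def bought_item(transactions, item):
    """Database-style query: indexes of the transactions that contain item."""
    return [i for i, t in enumerate(transactions) if item in t]

def frequent_companions(transactions, item, min_support=0.4):
    """Mining-style question: which items occur together with item
    in at least min_support of all transactions?"""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        if item in t:
            counts.update(t - {item})  # count every other item in the basket
    return {other: c / n for other, c in counts.items() if c / n >= min_support}

bread_rows = bought_item(transactions, "Bread")
companions = frequent_companions(transactions, "Bread")
print(bread_rows)
print(companions)
```

The query returns rows that already exist in the data, while the second call returns a pattern (which items go with Bread) that is stored nowhere explicitly.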
Data Mining works with Data Warehouse

• Data Warehouse provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
Data Warehouse
• Data warehouse
– A data warehouse is a relational database management
system responsible for the collection and storage of
data to support management decision making and
problem solving.
– It enables managers and other business professionals to
undertake data mining, online analytical processing,
market research and decision support
– Current evolution of Decision Support Systems (DSSs)
• Data mart
– A subset of a data warehouse for small and medium-
size businesses or departments within larger companies
Data Warehouse Stores Heterogeneous Data
[Figure: relational databases, hierarchical databases, network databases, flat files and spreadsheets pass through a data extraction process and a data cleanup process into the data warehouse, which end users access through query and analysis tools.]
Data Warehouse as part of Data Mining
Data warehousing
• Data warehouse is an integrated, subject-oriented,
time-variant, non-volatile database that provides
support for decision making.
• Integrated: a centralized, consolidated database that integrates data derived from the entire organization.
– Consolidates data from multiple & diverse sources with diverse formats.
– Helps managers to better understand the company’s operations.
• Subject-oriented: the data warehouse contains data organized by topics.
– E.g. sales, marketing, finance, etc.
Data warehousing
• Time-variant: in contrast to the operational database, which focuses on current transactions, the data warehouse represents the flow of data through time.
– The data warehouse contains data that reflect what happened last week, last month, in the past five years, and so on.
• Non-volatile: once data enter the data warehouse, they are never removed, because the data in the warehouse represent the company’s entire history.
– Because data is added all the time, the warehouse keeps growing.
Database & data warehouse: Differences
• The data warehouse and operational environments
are separated. Data warehouse receives its data from
operational databases.
– Data warehouse environment is characterized by read-only
transactions to very large data sets.
– Operational environment is characterized by numerous
update transactions to a few data entities at a time.
– Data warehouse contains historical data over a long time
horizon.
• Ultimately Information is created from data warehouses. Such
Information becomes the basis for rational decision making.
• The data found in data warehouse is analyzed to discover
previously unknown data characteristics, relationships,
dependencies, or trends.
Data Processing Technologies
• OLAP – Online Analytical Processing
– refers to an advanced data analysis environment that
supports decision making.
– provides access to multidimensional databases, with managerially useful display techniques
• Data mining tools analyze the data and uncover problems or opportunities hidden in the data relationships.
– E.g. credit system: who is likely not to pay their debts?
– E.g. crime database: who is likely to commit what kind of crime?
• OLAP provides top-down, query-driven analysis
• Data mining provides bottom-up, discovery-driven analysis
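As a rough illustration of "top-down, query-driven" analysis, the sketch below rolls a small fact table up along dimensions that the analyst chooses in advance. The sales figures, the dimension layout and the `roll_up` helper are hypothetical; real OLAP servers offer far richer operations.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, quarter, amount).
sales = [
    ("North", "Q1", 120.0),
    ("North", "Q2", 80.0),
    ("South", "Q1", 200.0),
    ("South", "Q2", 150.0),
    ("North", "Q1", 30.0),
]

def roll_up(facts, dims):
    """Sum the amount over the chosen dimensions (0 = region, 1 = quarter)."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dims)  # group by the requested dimensions
        totals[key] += fact[2]
    return dict(totals)

by_region = roll_up(sales, dims=(0,))
by_region_quarter = roll_up(sales, dims=(0, 1))
print(by_region)
print(by_region_quarter)
```

Here the analyst decides which question to ask (totals by region, then by region and quarter). A data mining tool would instead search the same facts for patterns nobody asked about.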
Business Intelligence
• BI takes advantage of data mining and data warehousing to help organizations gather their information in a timelier and more valuable manner
• BI:
– keeps the organization informed about market trends,
– alerts it to new market potential,
– helps it determine how competitors are doing
• Without such information and knowledge, the organization may suffer false growth or setbacks
Data Mining & Business Intelligence
Data Mining vs. Knowledge Discovery in Databases (KDD)
• KDD is often used as a synonym for Data Mining.
– Some authors define KDD as the whole process, involving: data selection → data pre-processing (cleaning) → data transformation → mining → result evaluation → visualization
– Data Mining, on the other hand, refers to the modeling step, which uses various techniques to extract useful information/patterns from the data.
• KDD is the process of finding useful information and patterns in data
• DM is the use of algorithms to extract hidden patterns & knowledge in data
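The KDD stages listed above (selection, cleaning, transformation, mining, evaluation) can be sketched as a chain of small functions. Everything here is an assumption made for illustration: the records, the field names, the debt-to-income ratio and the toy threshold "model". Visualization is left out, and the model is evaluated on the same records it was built from only to keep the sketch short.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "income": 20.0, "debt": 5.0, "defaulted": 0},
    {"id": 2, "income": 15.0, "debt": 12.0, "defaulted": 1},
    {"id": 3, "income": None, "debt": 3.0, "defaulted": 0},
    {"id": 4, "income": 40.0, "debt": 8.0, "defaulted": 0},
    {"id": 5, "income": 10.0, "debt": 9.0, "defaulted": 1},
]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Pre-processing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: derive a debt-to-income ratio."""
    return [{"ratio": r["debt"] / r["income"], "defaulted": r["defaulted"]}
            for r in records]

def mine(records):
    """Mining (toy model): the ratio threshold that best separates the classes."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(r["ratio"] for r in records):
        correct = sum((r["ratio"] >= candidate) == bool(r["defaulted"])
                      for r in records)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

def evaluate(records, threshold):
    """Evaluation: share of records the threshold rule classifies correctly."""
    correct = sum((r["ratio"] >= threshold) == bool(r["defaulted"])
                  for r in records)
    return correct / len(records)

selected = select(raw, ["income", "debt", "defaulted"])
cleaned = clean(selected)
prepared = transform(cleaned)
threshold = mine(prepared)
score = evaluate(prepared, threshold)
print(threshold, score)
```

Only `mine` corresponds to data mining in the narrow sense; the other functions are the surrounding KDD steps.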
Stages in data mining: The KDD process
CRoss Industry Standard Process for Data Mining (CRISP-DM)
Hybrid Knowledge Discovery Process
Origins of Data Mining
[Figure: timeline (pre-1960, 1960s, 1970s, 1980s, 1990s) showing hardware (sensors, storage, computation), relational databases, AI, pattern recognition, machine learning, “pencil and paper” analysis, EDA, “data dredging” and “flexible models” leading to data mining.]
DM: Intersection of Many Fields
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
[Figure: Data Mining at the center, surrounded by Machine Learning (ML), Statistics (stats), Data structure & algorithm analysis, Visualization (viz), Databases (DB), Human Computer Interaction (HCI), High-Performance Parallel Computing, and Information retrieval.]
Data Mining Metrics
• How to measure the effectiveness or usefulness of a data mining approach?
• Return on Investment (ROI)
– From an overall business or usefulness perspective, a measure such as ROI is used
– ROI compares the costs of DM techniques against the savings or benefits from their use
• Accuracy in classification
– Analyze true positives, false positives and false negatives to calculate the recall and precision of the system
– Measure the percentage of correct classifications
• Space/Time complexity
– Running time: how fast the algorithm runs
– Storage or memory space requirement
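The classification and ROI measures above can be written out as a short sketch. The counts and money amounts are invented, the ROI formula used is the common (benefit − cost) / cost form, and the functions assume non-zero denominators.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # share of predicted positives that are right
    recall = tp / (tp + fn)                     # share of actual positives that are found
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all cases classified correctly
    return precision, recall, accuracy

def roi(benefit, cost):
    """Return on investment as a fraction of the cost."""
    return (benefit - cost) / cost

# Hypothetical results of a fraud-detection model on 100 transactions.
precision, recall, accuracy = classification_metrics(tp=40, fp=10, tn=45, fn=5)
# Hypothetical project: 100,000 spent on DM, 150,000 saved.
project_roi = roi(benefit=150_000, cost=100_000)
print(precision, recall, accuracy, project_roi)
```

With these numbers precision is 0.8, recall is about 0.89, accuracy is 0.85 and ROI is 0.5 (a 50% return).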
Data Mining implementation issues
• Scalability
– Ability of data mining techniques to perform well on massive real-world data sets
– Techniques should also work regardless of the amount of
available main memory
• Real World Data
– Real world data are noisy and have many missing attribute
values. Algorithms should be able to work even in the
presence of these problems
• Updates
– Databases cannot be assumed to be static; the data changes frequently.
– However, many data mining algorithms work with static data sets, which requires the algorithm to be completely rerun any time the database changes.
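For the missing-value problem mentioned under Real World Data, one simple and common workaround is to fill each gap with the mean of the observed values. This is only one option among several and is not prescribed by the slides; the rows and the `impute_mean` helper are hypothetical.

```python
from statistics import mean

def impute_mean(rows, attribute):
    """Return copies of rows with missing values of attribute replaced
    by the mean of the observed values."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [
        dict(r, **{attribute: fill if r[attribute] is None else r[attribute]})
        for r in rows
    ]

# Hypothetical customer records; None marks a missing age.
customers = [
    {"name": "A", "age": 20},
    {"name": "B", "age": None},
    {"name": "C", "age": 40},
    {"name": "D", "age": 30},
]

completed = impute_mean(customers, "age")
print(completed)
```

The original rows are left unchanged, so the raw data stays available if a different treatment of missing values is tried later.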
Data Mining implementation issues
• High dimensionality:
– A conventional database schema may be composed of many different attributes. The problem is that not all attributes may be needed to solve a given DM problem.
– The use of unnecessary attributes may increase the overall complexity and decrease the efficiency of an algorithm.
– The solution is dimensionality reduction (reducing the number of attributes). But determining which attributes are not needed is a tough task!
• Overfitting
– The size and representativeness of the dataset determine whether the model built for a given database state also fits future database states.
– Overfitting occurs when the model fits the current database state but does not fit future states; a small or unbalanced training database is a common cause.
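Overfitting can be made concrete with a deliberately naive sketch: a "model" that memorizes its training examples and falls back to the majority label otherwise. The attribute values, labels and the tiny, unbalanced training set are all invented for illustration.

```python
from collections import Counter

def fit_memorizer(train):
    """Memorize every training example; fall back to the majority label."""
    table = {features: label for features, label in train}
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: table.get(features, majority)

def accuracy(model, data):
    """Share of examples in data that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

# Hypothetical small, unbalanced training set: (age band, income band) -> buys?
train = [
    (("young", "low"), "no"),
    (("young", "high"), "no"),
    (("old", "low"), "no"),
    (("mid", "high"), "yes"),
]
# Hypothetical later database state with cases not seen during training.
holdout = [
    (("mid", "low"), "yes"),
    (("old", "high"), "yes"),
    (("young", "low"), "no"),
    (("mid", "mid"), "yes"),
]

model = fit_memorizer(train)
train_accuracy = accuracy(model, train)
holdout_accuracy = accuracy(model, holdout)
print(train_accuracy, holdout_accuracy)
```

The model is perfect on the data it was built from and poor on the later data, which is the gap between current and future database states described above.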
Data Mining implementation issues
• Ease of Use of the DM tool
– Since data mining problems are often not precisely stated,
interfaces may be needed with both domain and technical
experts
– Although some techniques may work well, they may not be
accepted by users if they are difficult to use or understand

• Application
– Determining the intended use for the information obtained from
the DM tool is a challenge.
– Indeed, how business executives can effectively use the output is sometimes considered the most difficult part, because the results are of a type that has not previously been known.
– Business practices may have to be modified to determine how to
effectively use the information uncovered
Focus area
• Designing efficient DM algorithms & architectures
– that are scalable to the number of features and instances extracted from high-dimensional databases
• Data miners that handle large, heterogeneous data (including multimedia data, spatial data, …)
• Presentation of DM results
– To easily view and understand the output of the DM
algorithms there is a need to use knowledge
representation (decision tree, rules, equations, semantic
networks) and visualization techniques (such as graphs,
bar charts, etc.).
• Integration of DM functions into traditional DBMS in
order to design an intelligent database
