
Lecture 5 Introduction To Data Mining Business Intelligence

This document provides an introduction to data mining and business intelligence. It discusses key concepts like data mining, knowledge discovery in databases (KDD), and the CRISP-DM process for data mining projects. It also covers trends leading to large data sets, examples of big data, data storage predictions, and the evolution of database technology to support data mining and analysis.

Uploaded by

lasithrandima123
Copyright
© All Rights Reserved

Introduction to Data Mining & Business Intelligence

Lecture 5 Conducted by
Ms. Akila Brahmana
Department of ICT
Faculty of Technology
University of Ruhuna
Objectives
On completion of this lecture you should know:
❑ What is Data Mining?
❑ The basic ramifications of Data Mining
❑ KDD, Data Query and Data Mining
❑ Basic understanding of PDCA cycle
❑ Current applications of Data Mining
Trends leading to Data Flood
❑ More data is generated:
❑ Bank, telecom, other business transactions ...
❑ Scientific data: astronomy, biology, etc
❑ Web, text, and ecommerce
Big Data Examples
❑ The New York Stock Exchange generates about one
terabyte of new trade data per day.
❑ More than 500 terabytes of new data are ingested into
Facebook's databases every day. This data is mainly
generated by photo and video uploads, message
exchanges, comments, etc.
❑ A single jet engine can generate more than 10 terabytes
of data in 30 minutes of flight time. With many
thousands of flights per day, the data generated reaches
many petabytes.
Storage and analysis are a big problem
Data and storage predictions for the year 2025
❑ The storage industry will ship 42ZB of capacity.
❑ 90ZB of data will be created on IoT devices by 2025.
❑ By 2025, 49 percent of data will be stored in public cloud
environments.
❑ Nearly 30 percent of the data generated will be consumed in
real time by 2025.
Evolution of Database Technology
❑ 1960s:
❑ Data collection, database creation, IMS and network DBMS
❑ 1970s:
❑ Relational data model, relational DBMS implementation
❑ 1980s:
❑ RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
❑ 1990s—2000s:
❑ Data mining and data warehousing, multimedia databases, and
Web databases
Data Mining: A Definition
Please ask yourself:

What is gold mining?


What Is Data Mining?
❑ Data mining (knowledge discovery in databases):
❑ Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) information or patterns from data in large
databases
❑ Alternative names and their “inside stories”:
❑ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
❑ What is not data mining?
❑ (Deductive) query processing.
❑ Expert systems or statistical programs
Data Mining (DM)
The process of employing one or more computer learning techniques
to automatically analyze and extract knowledge from data (Roiger
and Geatz, 2003).
Data mining is the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data using machine
learning, statistical and visualization techniques (Frawley et al., 1992).

Many experts agree that data mining should not be automatic:
human intervention and interpretation are essential.
Data Mining is a subset of Business Intelligence (BI)
Knowledge Discovery in Databases (KDD)
❑ Data Mining (DM) is one step of the KDD process.
❑ DM is an information extraction process and KDD is making sense
of the information.
❑ But now no distinction is made between the two.
❑ The application of the scientific method occurs in DM.
An Example
❑ A leading supermarket chain employs a data mining expert to analyze
local buying patterns.
❑ Analysis: When a customer buys honey on Friday or Sunday, he or she
usually also buys bread.
❑ Observation: On Friday, more people buy honey and bread together.
❑ Business Benefit: The supermarket chain can use this information in
various ways to increase revenue. For instance, they can move the
bread shelf closer to the honey shelf and make sure that bread and
honey are sold at full price during the weekend.
Stages of Data Mining/KDD

Data mining: the core of the knowledge discovery process.


Steps of a KDD Process
❑ Learning the application domain:
❑ Relevant prior knowledge and goals of application
❑ Creating a target data set: data selection
❑ Data cleaning and preprocessing: (may take 60% of effort!)
❑ Data reduction and transformation:
❑ Find useful features, dimensionality/variable reduction, invariant representation.
❑ Choosing functions of data mining
❑ Summarization, classification, regression, association, clustering.
❑ Choosing the mining algorithm(s)
❑ Data mining: search for patterns of interest
❑ Pattern evaluation and knowledge presentation
❑ Visualization, transformation, removing redundant patterns, etc.
❑ Use of discovered knowledge
Plan-Do-Check-Act (PDCA) cycle

[Figure: the Plan-Do-Check-Act (PDCA) cycle of the scientific method, running Plan → Do → Check → Act and back to Plan.]


Data Mining Lifecycle
[Figure: the KDD or data mining lifecycle in the framework of the PDCA cycle (Plan, Do, Check, Act). Stages shown: Problem identification, Collation of data, Data preprocessing, Choosing an algorithm, Data processing, Model construction and evaluation, Interpretation of the discovered knowledge, and Taking action, with iteration between stages.]
Steps of DM
Problem Identification:
❑ The problem should be meaningful.
❑ We also need to set the level of expectation for the solution, say
80% or 98% satisfaction.
❑ Without business understanding and requirements, useful data
mining cannot be done.
Steps of DM (Cont.)
Collection of Data:
❑ The problem definition provides us with the scope of relevant
data.
❑ A data mining technique may require millions and often billions
of cases of data.
❑ However, typically, a data mining technique is applied to a few
hundred or a few thousand instances of data.
Steps of DM (Cont.)
Data Preprocessing:
Preprocessing depends on the data source:
❑ If the data comes from a data warehouse, no pre-processing of data is
usually required because the warehouse data has already been filtered,
cleaned and missing values taken care of.
❑ For transactional data, it needs to be organized and cleaned such that
a data mining technique can be readily applied.
❑ The data has to be made consistent across sources. For example, in
one database male and female may be represented as M and F, and in
another database it may be represented as 1 and 0. Such anomalies
have to be removed and any representation has to be made uniform.
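The M/F versus 1/0 example above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field name `sex`, and the `harmonize` helper are hypothetical, and the mapping of 1 to male and 0 to female is an assumption based on the order in which the slide lists the codes.

```python
# Hypothetical records from two source databases that encode sex differently.
source_a = [{"id": 1, "sex": "M"}, {"id": 2, "sex": "F"}]
source_b = [{"id": 3, "sex": 1}, {"id": 4, "sex": 0}]

# Assumed mapping: the slide lists M/F and 1/0 in the same order,
# so 1 is taken to mean male and 0 to mean female.
TO_UNIFORM = {"M": "male", "F": "female", 1: "male", 0: "female"}


def harmonize(records):
    """Return copies of the records with 'sex' mapped to one uniform vocabulary."""
    cleaned = []
    for record in records:
        value = record["sex"]
        if value not in TO_UNIFORM:
            # Surface unexpected codes instead of silently guessing.
            raise ValueError(f"unknown sex code: {value!r}")
        cleaned.append({**record, "sex": TO_UNIFORM[value]})
    return cleaned


combined = harmonize(source_a) + harmonize(source_b)
print([r["sex"] for r in combined])
```

Raising an error on an unknown code is a design choice for the sketch; a real project might instead route such records to a review queue.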
Steps of DM (Cont.)
Algorithm Selection:
❑ Nowadays a good number of data mining algorithms are
available for public use.
❑ In general, parametric algorithms are relatively more suited for
the data mining task. This involves choosing the optimal
parameters to receive the best solution.
Steps of DM (Cont.)
Data Processing:
❑ This may involve data normalization, data transformation or data
integration.
❑ Some algorithms cannot work with categorical data, some cannot
work with numerical data, and yet, some others cannot work with
either unless the values meet certain criteria.
❑ Another important part of this task is data splitting, which is
about deciding which part of data is to be used for model
building (training data) and which part for model testing (test
data).
❑ This step is identified as data preparation in CRISP-DM.
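Two of the tasks named above, normalization and data splitting, can be sketched as follows. This is a minimal illustration using only the Python standard library; min-max scaling is one common normalization among several, and the function names, the 25% test fraction, and the sample values are assumptions, not part of the lecture.

```python
import random


def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


incomes = [20_000, 35_000, 50_000, 80_000]  # made-up values
normalized = min_max_normalize(incomes)
print(normalized)

train, test = train_test_split(range(8))
print(len(train), len(test))
```

The fixed seed makes the split reproducible, which helps when comparing models on the same training and test data.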
Steps of DM (Cont.)
Model Construction and Evaluation:
❑ Model evaluation or testing is an important step for maximizing
the amount of information that can be extracted from the dataset.
❑ If we see the model performances to be unacceptable, we follow
the iterative path of choosing a different data mining algorithm or
having a different set of features from the dataset.
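The evaluate-then-iterate decision can be illustrated with a small sketch. It assumes accuracy as the performance measure and reuses the 80% expectation level mentioned under Problem Identification; the labels, the two candidate models, and their predictions are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of test instances whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical test labels and the predictions of two candidate models.
actual = ["approve", "reject", "approve", "approve", "reject"]
candidates = {
    "model_a": ["approve", "approve", "approve", "reject", "reject"],
    "model_b": ["approve", "reject", "approve", "approve", "approve"],
}
TARGET = 0.80  # assumed expectation level set during problem identification

results = {}
for name, predicted in candidates.items():
    score = accuracy(predicted, actual)
    # Below the target, follow the iterative path: another algorithm or feature set.
    results[name] = (score, "acceptable" if score >= TARGET else "iterate")
    print(name, results[name])
```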
Steps of DM (Cont.)
Discovering Knowledge:
❑ Final stage of DM.
❑ Verify the quality of the knowledge.
❑ If satisfied, go ahead for implementation.
Steps of DM (Cont.)
Taking Action:
❑ We may act based on the discovered knowledge, which could
bring rewards.
❑ Taking action can simply mean applying the model to new
instances.
❑ This step is identified as deployment in CRISP-DM.
Data Mining and Business Intelligence
Types of Knowledge
❑ Shallow knowledge:
❑ It is simply what makes up a computer's response. For example,
we may learn that our stock exchange generally follows the
lead of Wall Street, but we wouldn't necessarily know why.
❑ Deep knowledge:
❑ It is the underlying reason behind such relationships. For
example, which gene is responsible for diabetes.
Steps of Data Mining for Business
Cross-Industry Standard Process for Data Mining (CRISP-DM):
❑ Business Understanding
❑ Data understanding
❑ Data Preparation
❑ Modelling
❑ Evaluation
❑ Deployment
Data Query versus Data Mining
❑ Data Query
❑ A list of customers who used MasterCard to buy medicine from a
pharmacy.
❑ A list of employees who will reach retiring age next year.
❑ A list of residents in a locality who became diabetic before reaching
the age of 50.
❑ Find all customers who have purchased diapers.
Data Query versus Data Mining
❑ Data Mining
❑ Develop a profile of MasterCard holders who will take
advantage of the forthcoming sale promotion of the pharmacy.
❑ Develop a list of employees, who are likely to avail themselves
of the voluntary early retirement scheme when they reach the
retiring age.
❑ Construct some rules about the lifestyle of residents of a locality
which may reduce the occurrence of diabetes at an early age.
❑ Find all items which are frequently purchased with diapers.
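The diapers pair of examples shows the difference compactly: the query retrieves records that match a stated condition, while the mining task looks for a pattern nobody specified in advance. The sketch below uses made-up market baskets and plain co-occurrence counting, which is a much simpler stand-in for a real association miner.

```python
from collections import Counter

# Made-up market baskets; each set is one transaction.
transactions = [
    {"diapers", "milk", "wipes"},
    {"diapers", "wipes"},
    {"bread", "milk"},
    {"diapers", "milk", "bread"},
]

# Data query: retrieve the transactions that contain diapers.
with_diapers = [t for t in transactions if "diapers" in t]

# Data mining (simplest form): count which other items co-occur with diapers.
co_purchased = Counter(
    item for t in with_diapers for item in t if item != "diapers"
)

print(len(with_diapers))
print(sorted(co_purchased.items()))
```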
What are the Problems Suitable for Data-Mining ?
The areas where data mining applications are likely to be successful
have these characteristics:
❑ Require knowledge-based decisions
❑ Have a changing environment
❑ Have sub-optimal current methods
❑ Have accessible, sufficient, and relevant data
❑ Provide a high payoff for the right decisions
The Learning Process
What is Learning?
It’s a process to gather knowledge.

Four Levels of Learning:
❑ Facts (simple truths)
❑ Concepts (relationships)
❑ Procedures (algorithms)
❑ Principles (All pervading truths)
Types of Learning
❑ Supervised Learning
❑ Learning with the help of a supervisor
❑ Example
❑ In a biomedical study, medical records for a set of healthy
patients and patients with heart disease have been collected.
❑ Applying a data mining technique to this study would mean learning
what combination of attributes (obesity, high cholesterol, smoking
habit, etc.) characterizes patients with heart disease and
distinguishes them from healthy patients.
Types of Learning (Cont.)
❑ Supervised learning data structure.

            Obesity   High-Cholesterol   …   Smoker   Class
Patient 1   Yes       Yes                …   Yes      Sick
…           …         …                  …   …        …
Patient m   No        No                 …   No       Healthy
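To make the role of the Class column concrete, here is a toy sketch of learning from labeled rows. It is a deliberately simplified one-attribute rule learner, not a technique named in the lecture: for each attribute it tests the rule "attribute is true, so the patient is Sick" and keeps the attribute whose rule agrees with the most labels. All patient rows are invented.

```python
# Invented labeled data: (attributes, class label) per patient.
patients = [
    ({"obesity": True, "high_cholesterol": True, "smoker": True}, "Sick"),
    ({"obesity": True, "high_cholesterol": False, "smoker": True}, "Sick"),
    ({"obesity": False, "high_cholesterol": True, "smoker": False}, "Healthy"),
    ({"obesity": False, "high_cholesterol": False, "smoker": False}, "Healthy"),
    ({"obesity": True, "high_cholesterol": False, "smoker": False}, "Healthy"),
]


def rule_hits(attribute, rows):
    """Count rows where 'attribute is True -> Sick, else Healthy' matches the label."""
    hits = 0
    for features, label in rows:
        predicted = "Sick" if features[attribute] else "Healthy"
        hits += predicted == label
    return hits


attributes = ["obesity", "high_cholesterol", "smoker"]
scores = {a: rule_hits(a, patients) for a in attributes}
best_attribute = max(scores, key=scores.get)

print(scores)
print(best_attribute)
```

The class labels act as the supervisor here: without them there would be nothing to score the candidate rules against.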
Types of Learning (Cont.)
❑ Unsupervised Learning
❑ Learning without a supervisor

❑ Example
❑ A credit card company wants to promote credit card insurance.
Types of Learning (Cont.)

❑ Unsupervised learning data structure.

            Home Insurance   Life Insurance   …   Income Range
Person 1    Yes              Yes              …   50-60K
…           …                …                …   …
Person m    Yes              No               …   40-50K
Types of Learning (Cont.)
❑ Reinforcement Learning
❑ Learning from incidents
❑ Example
❑ Some players have trouble arriving on time to the practice
match.
❑ To lift team spirit, the coach orders all the players to run 5 extra
laps in the stadium. The coach claims that this measure had
to be applied only once a year.
The History of Data Mining
❑ 1700-1939:
❑ First Generation of Data Mining.
❑ It was based on Statistics.
❑ 1940-1989:
❑ Second Generation of Data Mining.
❑ First introduction of Artificial Intelligence (AI) in Data Mining.
❑ 1990-onwards:
❑ Third Generation of Data Mining.
❑ People introduced better techniques by combining Statistics and AI.
Data Mining Strategies
Classification
Example
❑ A bank wishes to determine the credit risk of a credit card
applicant. The application is either approved or rejected.
Data Mining Strategies Cont.
Association
Example
❑ A leading supermarket chain had 100,000 point-of-sale
transactions last month. An association rule miner observes that
25,000 of these transactions include both banana and bread and
8,000 transactions include three items – banana, bread and honey.
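The counts in this example are enough to compute the two standard measures of an association rule. The sketch below uses the usual definitions, which the slide itself does not spell out: support is the fraction of all transactions containing an itemset, and confidence of a rule A → B is the count of transactions containing both A and B divided by the count containing A. The rule {banana, bread} → honey is chosen here for illustration.

```python
total_transactions = 100_000
banana_and_bread = 25_000          # transactions with both banana and bread
banana_bread_and_honey = 8_000     # transactions with all three items

# Support: share of all transactions that contain the itemset.
support_pair = banana_and_bread / total_transactions
support_triple = banana_bread_and_honey / total_transactions

# Confidence of {banana, bread} -> honey: of the baskets with banana and
# bread, the share that also contain honey.
confidence = banana_bread_and_honey / banana_and_bread

print(f"support(banana, bread)        = {support_pair:.2%}")
print(f"support(banana, bread, honey) = {support_triple:.2%}")
print(f"confidence(-> honey)          = {confidence:.2%}")
```

So 25% of all baskets contain banana and bread, 8% contain all three items, and 32% of the banana-and-bread baskets also contain honey.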
Data Mining Strategies Cont.
Clustering
Example
❑ Clustering could be used by an insurance company to group
important customers according to age, types of policies
purchased, duration of membership, and prior claims history.
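A minimal sketch of clustering, assuming k-means as the algorithm (the lecture does not name one) and using age as the only attribute to keep the example short. The ages are made up, and the starting centroids are simply the youngest and oldest customers.

```python
def k_means_1d(values, centroids, max_iterations=100):
    """Cluster numbers around the given starting centroids (plain 1-D k-means)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return clusters, centroids


ages = [22, 25, 27, 58, 61, 64]  # made-up customer ages
clusters, centroids = k_means_1d(ages, centroids=[min(ages), max(ages)])
print(clusters)
print(centroids)
```

No class label is supplied anywhere; the two age groups emerge from the data alone, which is what distinguishes clustering from classification.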
Data Mining Strategies Cont.
Estimation
Example
❑ We are interested in estimating the blood sugar level of a new
hospital patient.
Data Mining Strategies Cont.
Novelty Detection
Example
❑ The heartbeat record of a healthy patient to an untrained eye is
either plain noise or full of features or spikes.
Data Mining Strategies Cont.
Sequence Detection
Example
❑ Thrombosis is a potential complication of collagen diseases.
Statistics, Machine Learning and Data Mining
❑ Statistics:
❑ More theory-based
❑ More focused on testing hypotheses
❑ Machine learning
❑ More heuristic
❑ Focused on improving performance of a learning agent
❑ Also looks at real-time learning and robotics, areas that are not
part of data mining
❑ Data Mining and Knowledge Discovery
❑ Integrates theory and heuristics
❑ Focus on the entire process of knowledge discovery, including data
cleaning, learning, and integration and visualization of results
Applications
❑ Banking: loan/credit card approval
❑ Predict good customers based on old customer profiles
❑ Customer relationship management (CRM)
❑ Identify those who are likely to leave for a competitor.
❑ Targeted marketing
❑ Identify likely respondents to promotions
❑ Fraud detection: telecommunications, financial transactions
❑ From an online stream of events identify fraudulent transactions
❑ Manufacturing and production
❑ Automatically adjust knobs when process parameters change
Applications Cont’d
❑ Medicine: disease outcome, effectiveness of treatments
❑ Analyze patient disease history: find relationship between
disease and symptoms.
❑ Molecular/Pharmaceutical: identify new drugs
❑ Scientific data analysis
❑ Identify new galaxies by searching for sub clusters
❑ Web site/store design and promotion
❑ Find affinity of a visitor to pages and modify layout
Challenges of Data Mining
❑ Size of dataset
❑ High dimensionality
❑ Over-fitting
❑ Missing and noisy data
❑ Rapidly changing data
❑ Mixed dataset
❑ Human intervention and interpretation
Future of Data Mining
❑ Credit risk assessment, customer relationships management, attrition
of small business customers, early weather warning, stock price
prediction, quick machinery fault detection, brain tumor prediction and
other such issues are already seeing the introduction of data mining
technology in their solution strategies.

❑ The long-term prospects are truly exciting. Data mining technology
has already opened a new dimension in medical research. For
example, a gene data analyst can tell us who has breast cancer and
who does not.
Summary
❑ Data mining: discovering interesting patterns from large amounts of
data
❑ A natural evolution of database technology, in great demand, with
wide applications
❑ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
❑ Mining can be performed in a variety of information repositories
❑ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Thank You!
