Lecture 5 Introduction To Data Mining Business Intelligence
Business Intelligence
Lecture 5 Conducted by
Ms. Akila Brahmana
Department of ICT
Faculty of Technology
University of Ruhuna
Objectives
On completion of this lecture you should know:
❑ What is Data Mining?
❑ The basic ramifications of Data Mining
❑ KDD, Data Query and Data Mining
❑ Basic understanding of PDCA cycle
❑ Current applications of Data Mining
Trends leading to Data Flood
❑ More data is generated:
❑ Bank, telecom, other business transactions ...
❑ Scientific data: astronomy, biology, etc.
❑ Web, text, and ecommerce
Big Data Examples
❑ The New York Stock Exchange generates about one
terabyte of new trade data per day.
❑ Statistics show that more than 500 terabytes of new data
are ingested into the databases of the social media
site Facebook every day. This data is mainly
generated through photo and video uploads,
message exchanges, comments, etc.
❑ A single jet engine can generate more than 10 terabytes of
data in 30 minutes of flight time. With many
thousands of flights per day, the data generated reaches
many petabytes.
Storage and analysis is a big problem
Data and storage predictions for the year 2025
❑ The storage industry will ship 42ZB of capacity.
❑ 90ZB of data will be created on IoT devices by 2025.
❑ By 2025, 49 percent of data will be stored in public cloud
environments.
❑ Nearly 30 percent of the data generated will be consumed in
real time by 2025.
Evolution of Database Technology
❑ 1960s:
❑ Data collection, database creation, IMS and network DBMS
❑ 1970s:
❑ Relational data model, relational DBMS implementation
❑ 1980s:
❑ RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
❑ 1990s—2000s:
❑ Data mining and data warehousing, multimedia databases, and
Web databases
Data Mining: A Definition
Please ask yourself:
[Figure: KDD or data mining lifecycle in the framework of the PDCA (Plan, Do, Check, Act) cycle. Stages shown: collation of data, data preprocessing, choosing an algorithm, data processing, model construction and evaluation, interpretation of the discovered knowledge, and taking action, with iteration between stages.]
Steps of DM
Problem Identification:
❑ The problem should be meaningful.
❑ We also need to set the level of expectation for the solution, say
80% or 98% satisfaction.
❑ Without business understanding and requirements, useful data
mining cannot be done.
Steps of DM (Cont.)
Collection of Data:
❑ The problem definition provides us with the scope of relevant
data.
❑ A data mining technique may require millions and often billions
of cases of data.
❑ However, typically, a data mining technique is applied to a few
hundred or a few thousand instances of data.
Steps of DM (Cont.)
Data Preprocessing:
Preprocessing depends on the data source:
❑ If the data comes from a data warehouse, pre-processing is usually
not required because the warehouse data has already been filtered and
cleaned, and its missing values have been handled.
❑ Transactional data needs to be organized and cleaned so that
a data mining technique can be readily applied.
❑ The data has to be made consistent across sources. For example, in
one database male and female may be represented as M and F, and in
another database they may be represented as 1 and 0. Such inconsistencies
have to be removed and the representation made uniform.
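The M/F versus 1/0 example above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, the field names and the `GENDER_MAP` table are hypothetical, and the choice that 1 means male and 0 means female is an assumption, since the slide does not state which code is which.

```python
# Hypothetical records from two sources that encode gender differently.
source_a = [{"id": 1, "gender": "M"}, {"id": 2, "gender": "F"}]
source_b = [{"id": 3, "gender": 1}, {"id": 4, "gender": 0}]

# Assumed mapping: 1 -> male and 0 -> female is an illustrative choice only.
GENDER_MAP = {"M": "male", "F": "female", 1: "male", 0: "female"}


def harmonize(records):
    """Return copies of the records with gender in one uniform representation."""
    cleaned = []
    for record in records:
        code = record["gender"]
        if code not in GENDER_MAP:
            # Surface unexpected codes instead of silently guessing.
            raise ValueError(f"Unknown gender code: {code!r}")
        cleaned.append({**record, "gender": GENDER_MAP[code]})
    return cleaned


# Both sources now share one representation and can be combined.
combined = harmonize(source_a) + harmonize(source_b)
print(combined)
```

Raising an error on an unknown code is a deliberate choice in this sketch: an unmapped value usually signals a third representation that still needs to be made uniform.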
Steps of DM (Cont.)
Algorithm Selection:
❑ Nowadays a good number of data mining algorithms are
available for public use.
❑ In general, parametric algorithms are relatively more suited for
the data mining task. Using them involves choosing the optimal
parameters to obtain the best solution.
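As a rough sketch of what choosing parameters can look like, the code below tries several values of a single parameter and keeps the one that scores best. The data, the one-parameter threshold classifier and the candidate values are all hypothetical, and "best" here only means best among the candidates tried.

```python
# Hypothetical validation data: (feature value, true class label).
validation = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]


def predict(x, threshold):
    """A one-parameter model: predict class 1 when x reaches the threshold."""
    return 1 if x >= threshold else 0


def accuracy(threshold, data):
    """Fraction of instances the model classifies correctly."""
    correct = sum(1 for x, label in data if predict(x, threshold) == label)
    return correct / len(data)


# Assumed candidate parameter values to try.
candidates = [1.5, 2.5, 3.5, 4.5, 5.5]

# Keep the candidate with the highest accuracy on the validation data.
best_threshold = max(candidates, key=lambda t: accuracy(t, validation))
print(best_threshold, accuracy(best_threshold, validation))
```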
Steps of DM (Cont.)
Data Processing:
❑ This may involve data normalization, data transformation or data
integration.
❑ Some algorithms cannot work with categorical data, some cannot
work with numerical data, and yet, some others cannot work with
either unless the values meet certain criteria.
❑ Another important part of this task is data splitting, which is
about deciding which part of data is to be used for model
building (training data) and which part for model testing (test
data).
❑ This step is identified as data preparation in CRISP-DM.
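Two of the tasks above, normalization and data splitting, can be sketched as follows. This is a minimal illustration using min-max normalization, which is only one of several normalization schemes. The sample values, the 30% test fraction and the function names are assumptions.

```python
import random


def min_max_normalize(xs):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (training data, test data)."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed makes the split repeatable
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


values = [12.0, 18.0, 30.0, 24.0, 6.0]   # hypothetical numeric attribute
normalized = min_max_normalize(values)
print(normalized)

instances = list(range(10))              # hypothetical instance identifiers
train, test = train_test_split(instances)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set that only contains the last records collected, which may not be representative of the whole dataset.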
Steps of DM (Cont.)
Model Construction and Evaluation:
❑ Model evaluation or testing is an important step for maximizing
the amount of information that can be extracted from the dataset.
❑ If we see the model performances to be unacceptable, we follow
the iterative path of choosing a different data mining algorithm or
having a different set of features from the dataset.
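The iterative path described above can be sketched as a loop over candidate models that stops at the first one meeting the expectation level set during problem identification. The test data, the three toy models and the 80% requirement are hypothetical. Real candidates would be models trained by different algorithms or on different feature sets.

```python
# Hypothetical test data: (feature value, true class label).
test_data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1),
             (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]


# Toy stand-ins for models built with different algorithms or features.
def always_zero(x):
    return 0


def threshold_at_7(x):
    return 1 if x >= 7 else 0


def threshold_at_4(x):
    return 1 if x >= 4 else 0


candidate_models = [
    ("always_zero", always_zero),
    ("threshold_at_7", threshold_at_7),
    ("threshold_at_4", threshold_at_4),
]

REQUIRED_ACCURACY = 0.8  # assumed expectation level


def accuracy(model, data):
    """Fraction of test instances the model predicts correctly."""
    return sum(1 for x, label in data if model(x) == label) / len(data)


def select_model(candidates, data, required):
    """Evaluate candidates in turn and return the first acceptable one."""
    for name, model in candidates:
        score = accuracy(model, data)
        print(f"{name}: accuracy {score:.2f}")
        if score >= required:
            return name, score
    return None, None  # no candidate met the requirement


chosen_name, chosen_score = select_model(
    candidate_models, test_data, REQUIRED_ACCURACY
)
```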
Steps of DM (Cont.)
Discovering Knowledge:
❑ Final stage of DM.
❑ Verify the quality of the knowledge.
❑ If satisfied, go ahead for implementation.
Steps of DM (Cont.)
Taking Action:
❑ We may act based on the discovered knowledge, which could
bring rewards.
❑ Taking action can simply mean applying the model to new
instances.
❑ This step is identified as deployment in CRISP-DM.
Data Mining and Business Intelligence
Types of Knowledge
❑ Shallow knowledge:
❑ It is simply what makes up a computer's response. For example,
we may learn that our stock exchange generally follows the
lead of Wall Street, but we wouldn't necessarily know why.
❑ Deep knowledge:
❑ It is the underlying reason behind such relationships. For
example, which gene is responsible for diabetes.
Steps of Data Mining for Business
Cross-Industry Standard Process for Data Mining (CRISP-DM):
❑ Business Understanding
❑ Data Understanding
❑ Data Preparation
❑ Modelling
❑ Evaluation
❑ Deployment
Data Query versus Data Mining
❑ Data Query
❑ A list of customers who used MasterCard to buy medicine from a
pharmacy.
❑ A list of employees who will reach retiring age next year.
❑ A list of residents in a locality who became diabetic before reaching
the age of 50.
❑ Find all customers who have purchased diapers.
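A data query retrieves records that match a stated condition, and nothing more. The last example above can be sketched as a simple filter. The transaction records and customer names are hypothetical.

```python
# Hypothetical transaction records.
transactions = [
    {"customer": "Asha", "items": {"diapers", "milk"}},
    {"customer": "Nimal", "items": {"bread", "milk"}},
    {"customer": "Kumari", "items": {"diapers", "baby wipes"}},
    {"customer": "Asha", "items": {"bread"}},
]


def customers_who_bought(item, records):
    """Return the distinct customers with at least one purchase of the item."""
    return sorted({r["customer"] for r in records if item in r["items"]})


print(customers_who_bought("diapers", transactions))
```

The answer is fully determined by the stored data: the query only looks up facts that are already recorded.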
Data Query versus Data Mining (Cont.)
❑ Data Mining
❑ Develop a profile of MasterCard holders who will take
advantage of the forthcoming sale promotion of the pharmacy.
❑ Develop a list of employees, who are likely to avail themselves
of the voluntary early retirement scheme when they reach the
retiring age.
❑ Construct some rules about the lifestyle of residents of a locality
which may reduce the occurrence of diabetes at an early age.
❑ Find all items which are frequently purchased with diapers.
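In contrast to a query, the last example above asks for a pattern that is not stored anywhere and has to be derived from many transactions. A minimal sketch, assuming hypothetical baskets and an arbitrary 60% cut-off, counts how often each item appears in baskets that contain diapers. Practical association-rule mining uses more elaborate measures and algorithms than this simple co-occurrence count.

```python
from collections import Counter

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"diapers", "baby wipes", "milk"},
    {"diapers", "baby wipes"},
    {"diapers", "milk", "bread"},
    {"bread", "milk"},
    {"diapers", "baby wipes", "bread"},
]


def frequently_bought_with(target, baskets, min_share=0.6):
    """Map each item to the share of target-containing baskets it appears in,
    keeping only items whose share is at least min_share."""
    with_target = [b for b in baskets if target in b]
    if not with_target:
        return {}
    counts = Counter(item for b in with_target for item in b if item != target)
    total = len(with_target)
    return {item: count / total
            for item, count in counts.items()
            if count / total >= min_share}


print(frequently_bought_with("diapers", baskets))
```

Here baby wipes appear in three of the four baskets that contain diapers, so they pass the cut-off, while milk and bread (two of four each) do not.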
What Problems are Suitable for Data Mining?
The areas where data mining applications are likely to be successful
have these characteristics:
❑ Require knowledge-based decisions
❑ Have a changing environment
❑ Have sub-optimal current methods
❑ Have accessible, sufficient, and relevant data
❑ Provide high payoff for the right decisions
The Learning Process
What is Learning?
It is the process of gathering knowledge.
❑ Example
❑ A credit card company wants to promote credit card insurance.
Types of Learning (Cont.)