Chapter 3-IB
INTELLIGENCE
Definition of Data Mining
• Valid means that the discovered patterns should hold true on new
data with a sufficient degree of certainty.
• Novel means that the patterns are not previously known to the
user within the context of the system being analyzed.
■ Types of patterns
1. Association
2. Prediction
3. Cluster (segmentation)
4. Sequential (or time series) relationships
Data Mining Applications
■ Consolidation of databases and other data repositories into a single location in the
form of a data warehouse.
■ Significant reduction in the cost of hardware and software for data storage and
processing.
■ Entertainment industry
– analyze viewer data to decide what programs to show during prime
time and how to maximize returns by knowing where to insert
advertisements;
– predict the financial success of movies before they are produced to
make investment decisions and to optimize the returns;
– forecast the demand at different locations and different times to
better schedule entertainment events and to optimally allocate
resources.
■ Sports.
■ Healthcare.
■ Insurance.
■ Travel industry
■ Government and defense.
■ Brokerage and securities trading.
■ Retailing and logistics.
Characteristics and Objectives of DM
The following are the major characteristics and objectives
of data mining:
▪The source of data for DM is often (but not always) a consolidated
data warehouse.
▪The DM environment is usually a client-server or a Web-based
information systems architecture.
▪Data is the most critical ingredient for DM; it may include
soft/unstructured data.
– the data sources (e.g., where the relevant data are stored and in what form;
what the process of collecting the data is, automated versus manual; who
the collectors of the data are; and how often the data are updated)
– the variables (e.g., What are the most relevant variables? Are the variables
independent of each other, i.e., do they stand as a complete information source
without overlapping or conflicting information?)
Step 2: Data Understanding
■ Data sources for data selection can vary. Normally, data sources for
business applications include:
– demographic data (such as income, education, number of
households, and age),
– sociographic data (such as hobby, club membership, and
entertainment),
– transactional data (sales record, credit card spending, issued
checks), and so on.
■ Data can be categorized as quantitative and qualitative.
■ Qualitative data, also known as categorical data, contains both nominal and
ordinal data.
–Nominal data has finite nonordered values (e.g., gender data, which has two
values: male and female).
–Ordinal data has finite ordered values. For example, customer credit ratings are
considered ordinal data because the ratings can be excellent, fair, and bad.
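The nominal/ordinal distinction maps directly onto a typed categorical representation. A minimal sketch using pandas' `Categorical` type (the column values follow the chapter's gender and credit-rating examples; the specific labels are otherwise assumptions):

```python
import pandas as pd

# Nominal: finite, non-ordered values (gender)
gender = pd.Categorical(["male", "female", "female"], ordered=False)

# Ordinal: finite, ordered values (credit ratings: bad < fair < excellent)
rating = pd.Categorical(
    ["fair", "excellent", "bad"],
    categories=["bad", "fair", "excellent"],  # explicit order
    ordered=True,
)

# Order-aware operations only make sense for ordinal data
print(rating.min(), rating.max())  # bad excellent
```

Declaring the order explicitly lets downstream steps (sorting, comparisons, ordinal encoding) respect the rating scale instead of treating it alphabetically.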
Step 3: Data Preparation/ Data Preprocessing
- The goal is to take the data identified in the previous step and
prepare it for analysis by data mining methods.
- Compared to the other steps, data preprocessing consumes the most
time and effort. (Why?)
- Real-world data is:
- incomplete (lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data)
- noisy (containing errors or outliers)
- inconsistent (containing discrepancies in codes or names).
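The three problems above can be illustrated in a short pandas sketch; the column names, the cap of 120 for a plausible age, and the code mapping are illustrative assumptions, not rules from the chapter:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 42, 230],        # a missing value and a noisy outlier
    "gender": ["M", "male", "F", "F"],    # inconsistent codes for the same value
})

# Incomplete: fill the missing age with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Noisy: treat implausible ages (> 120) as errors and replace with the median
df.loc[df["age"] > 120, "age"] = df["age"].median()

# Inconsistent: map variant codes onto one coding scheme
df["gender"] = df["gender"].replace({"male": "M", "female": "F"})

print(df)
```

In practice the fill and replacement rules come from domain knowledge; the point here is only that each data-quality problem gets an explicit, repeatable treatment.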
Step 3: Data Preparation Stages
■ Data Preparation involves four main steps:
1. Data Consolidation
■ In other cases, one might choose to create new variables based on the
existing ones to magnify the information found in the variables in the
data set.
4. Data Reduction:
■ Even though data miners like to have large data sets, too much data is also
a problem.
■ Some data sets may include millions of records. Even though computing power is
increasing exponentially, processing such large data sets may not be practical or
feasible. In such cases, one may need to sample a subset of the data for analysis.
■ The analyst should be careful in selecting a subset of the data that reflects the
essence of the complete data set and is not specific to a subgroup or subcategory.
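A sketch of this reduction step: draw a simple random sample and check that a subgroup keeps roughly its original share, so the sample still reflects the whole data set. The record structure and the 1% rate are illustrative assumptions:

```python
import random

random.seed(42)  # reproducible sample

# Toy "large" data set: 25% of records are from the "south" region
records = [{"id": i, "region": "south" if i % 4 == 0 else "north"}
           for i in range(100_000)]

# 1% simple random sample
sample = random.sample(records, k=len(records) // 100)

south_share = sum(r["region"] == "south" for r in sample) / len(sample)
print(len(sample), round(south_share, 2))  # south stays near its 25% share
```

If the subgroup shares drift badly, stratified sampling (sampling each subgroup separately) is the usual remedy.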
Step 4: Model Building
■ In this step, various modeling techniques are selected and applied to a prepared
data set in order to address the specific business need.
■ Some methods may have specific requirements on the way that the data is to be
formatted; thus, stepping back to the data preparation step is often necessary.
■ Depending on the business need, the data mining task can be a classification, an
association, or a clustering type.
Step 5: Testing and Evaluation
■ In step 5, the developed models are assessed and evaluated for their accuracy.
■ This step assesses the degree to which the selected model (or models) meets the
business objectives and, if so, to what extent.
■ Another option is to test the developed model(s) in a real-world scenario if time and
budget constraints permit.
■ This step is a critical and challenging task. No value is added by the data mining task
until the business value obtained from discovered patterns is identified and
recognized.
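Accuracy assessment is usually done on data the model has not seen. A minimal sketch with scikit-learn on a bundled toy data set (the model choice and the 30% holdout split are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; train only on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the unseen holdout portion
acc = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {acc:.2f}")
```

Reporting accuracy on the training data itself would overstate how well the model meets the business objective, which is why the holdout split comes first.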
■ The success of this step depends on the interaction among data analysts,
business analysts, and decision makers (such as business managers).
■ Data analysts may not have a full understanding of the data
mining objectives and what they mean to the business,
■ and business analysts and decision makers may not have the technical
knowledge to interpret the results of sophisticated mathematical solutions;
hence, interaction among them is necessary.
Step 6: Deployment
■ Depending on the requirements, the deployment phase can be as simple as
generating a report or as complex as implementing a repeatable data mining
process across the enterprise.
■ The deployment step may also include maintenance activities for the deployed
models.
■ Over time, the models (and the patterns embedded within them) built on the
old data may become obsolete, irrelevant, or misleading.
SEMMA
■ SEMMA ="sample, explore, modify, model, and assess."
■ Sample= Beginning with a statistically representative sample of the
data.
– Popular classification tasks include credit approval (i.e., good or bad credit risk).
■ A tree can be "learned" by splitting the source set into subsets based on an attribute
value test (inputs). This process is repeated on each derived subset in a recursive
manner called "recursive partitioning".
■ The basic idea is to ask questions whose answers would provide the most
information.
■ Decision trees are one of the most popular machine learning algorithms, and they can
handle both categorical and numerical data.
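Recursive partitioning can be sketched in plain Python: at each node, try candidate attribute-value tests, keep the one with the highest information gain (entropy reduction), and recurse on the resulting subsets. The toy applicant records and attribute names below are made up for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, attrs):
    """Return (gain, attribute, value) for the most informative test."""
    base, best = entropy(labels), None
    for a in attrs:
        for v in {r[a] for r in rows}:
            yes = [l for r, l in zip(rows, labels) if r[a] == v]
            no  = [l for r, l in zip(rows, labels) if r[a] != v]
            if not yes or not no:
                continue
            remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
            if best is None or base - remainder > best[0]:
                best = (base - remainder, a, v)
    return best

def build(rows, labels, attrs):
    """Recursively partition until a subset is pure (or no split helps)."""
    if len(set(labels)) == 1 or best_split(rows, labels, attrs) is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    _, a, v = best_split(rows, labels, attrs)
    yes = [(r, l) for r, l in zip(rows, labels) if r[a] == v]
    no  = [(r, l) for r, l in zip(rows, labels) if r[a] != v]
    return {"test": (a, v),
            "yes": build([r for r, _ in yes], [l for _, l in yes], attrs),
            "no":  build([r for r, _ in no], [l for _, l in no], attrs)}

rows = [{"income": "high", "student": "no"},
        {"income": "high", "student": "yes"},
        {"income": "low",  "student": "yes"},
        {"income": "low",  "student": "no"}]
labels = ["reject", "approve", "approve", "reject"]

tree = build(rows, labels, ["income", "student"])
print(tree)  # the root test picks "student", the attribute with the most information
```

Here the `student` attribute separates the labels perfectly (gain 1 bit), while `income` gains nothing, so the root asks about `student`, exactly the "most informative question first" idea from the text.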
The tree has three types of nodes: a root node, internal (decision) nodes, and leaf (terminal) nodes.
■ Each subset is a cluster, such that objects in a cluster are similar to one
another, yet dissimilar to objects in other clusters.
■ Cluster analysis has been used extensively for fraud detection and market
segmentation of customers in CRM systems.
■ Clustering can also be used for outlier detection, where outliers (values that are
“far away” from any cluster) may be more interesting than common cases.
Cluster Analysis for Data Mining
■ Applications of outlier detection include the detection of credit card fraud and
the monitoring of criminal activities in e-commerce.
– Ex, unusual cases in credit card transactions, such as very expensive and
infrequent purchases, may be of interest as possible fraudulent activities.
■ Clustering is framed as unsupervised learning; that is, for this type of algorithm we only
have one set of input data (not labeled), about which we must obtain information without
previously knowing what the output will be.
■ On the other hand, classification belongs to supervised learning, which means that we
know the input data (labeled in this case) and we know the possible output of the
algorithm.
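The clustering-for-outlier-detection idea above can be sketched with scikit-learn's k-means: cluster the unlabeled points, then flag anything far from every cluster center as a possible fraud candidate. The two-blob data, the choice of k=2, and the distance threshold of 5.0 are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two groups of "normal" transactions around (0, 0) and (10, 10)
normal = rng.normal(loc=[[0, 0]] * 50 + [[10, 10]] * 50, scale=0.5)
outlier = np.array([[30.0, 30.0]])          # one very unusual transaction
X = np.vstack([normal, outlier])

# Unsupervised: no labels are given to the algorithm
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each point to its nearest cluster center
dist = np.min(km.transform(X), axis=1)
flags = dist > 5.0                          # threshold is an assumption
print(int(flags.sum()), "outlier(s) flagged")
```

The normal points sit close to one of the two centers, while the unusual transaction is far from both, which is exactly the "far away from any cluster" criterion described in the text.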
Association Rule Mining
■ Association rule mining (also known as affinity analysis or market-basket
analysis) is a popular data mining method.
■ Part of machine learning family
■ It aims to find interesting relationships (affinities) between variables (items)
in large databases.
■ The input to market-basket analysis is simple point-of-sale transaction data,
where the products and/or services purchased together are tabulated under a
single transaction instance.
■ The outcome of the analysis is invaluable information.
(1) putting the items next to each other to make it more convenient for the
customers to pick them up together and not forget to buy one when buying the
others (increasing sales volume);
(2) promoting the items as a package (do not put one on sale if the other(s) are on
sale);
(3) placing them apart from each other so that the customer has to walk the aisles
to search for it, and by doing so potentially seeing and buying other items.
Are all association rules interesting and useful?
■ A Generic Rule: X → Y [S%, C%]
■ X, Y: products and/or services
■ X: Left-hand-side (LHS) or (antecedent),
■ Y: Right-hand-side (RHS) or (consequent)
■ S: Support: how often X and Y appear together in the same transaction
■ C: Confidence: how often Y appears in the transactions that contain X
■ Example: {Laptop Computer, Antivirus Software} → {Extended Service
Plan} [30%, 70%]
■ Support refers to the percentage of baskets where the rule was true (both left
and right side products were present).
■ Confidence measures the percentage of transactions containing the LHS
products that also contain the RHS products.
■ Lift measures how much more frequently the LHS and RHS occur together
than would be expected if they were independent.
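The three measures follow directly from basket counts. A sketch on made-up transactions, loosely echoing the laptop/antivirus/service-plan rule above:

```python
# Toy point-of-sale baskets (illustrative data)
baskets = [
    {"laptop", "antivirus", "plan"},
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "plan"},
    {"antivirus"},
    {"laptop", "plan"},
]
X, Y = {"laptop", "antivirus"}, {"plan"}   # rule: X -> Y

n = len(baskets)
support_xy = sum(X | Y <= b for b in baskets) / n   # P(X and Y together)
support_x  = sum(X <= b for b in baskets) / n       # P(X)
support_y  = sum(Y <= b for b in baskets) / n       # P(Y)

confidence = support_xy / support_x                 # P(Y | X)
lift = confidence / support_y                       # ratio vs. independence
print(support_xy, round(confidence, 2), round(lift, 2))
```

Here support is 0.4 (the full itemset appears in 2 of 5 baskets), confidence is 2/3 (the plan appears in 2 of the 3 baskets containing both laptop and antivirus), and a lift above 1 means the pairing occurs more often than chance would predict.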
Apriori Algorithm
• Finds itemsets that are common to at least a minimum number (the minimum support) of the transactions
• uses a bottom-up approach
– frequent subsets are extended one item at a time (the size of frequent
subsets increases from one-item subsets to two-item subsets, then three-
item subsets, and so on), and
– groups of candidates at each level are tested against the data for
minimum support
– see the figure…
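The bottom-up candidate generation described above can be sketched in a few lines: frequent 1-itemsets are extended to 2-itemsets, then 3-itemsets, and each level of candidates is tested against the minimum support. The grocery baskets and the support threshold of 3 are illustrative assumptions:

```python
# Toy transaction data (illustrative)
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 3  # absolute count; an assumption

def count(itemset):
    """Number of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets)

# Level 1: frequent single items
items = sorted({i for b in baskets for i in b})
frequent = {1: [frozenset([i]) for i in items
               if count(frozenset([i])) >= min_support]}

# Bottom-up: extend frequent k-itemsets by one item, test candidates
k = 1
while frequent[k]:
    candidates = {a | b for a in frequent[k] for b in frequent[k]
                  if len(a | b) == k + 1}
    frequent[k + 1] = [c for c in candidates if count(c) >= min_support]
    k += 1

result = [set(s) for level in frequent.values() for s in level]
print(result)
```

Only unions of already-frequent itemsets become candidates, which is the Apriori pruning idea: any superset of an infrequent itemset must itself be infrequent, so it is never tested.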
Data Mining Myths
Myth: Data mining provides instant, crystal-ball-like predictions.
Reality: Data mining is a multistep process that requires thoughtful, proactive design and use.

Myth: Data mining is not yet viable for business applications.
Reality: The current state of the art is ready to go for almost any business.

Myth: Data mining is only for large firms that have lots of customer data.
Reality: If the data accurately reflect the business or its customers, a company can use data mining.
Common Data Mining Mistakes
■ Selecting the wrong problem for data mining
■ Being sloppy about keeping track of the data mining procedure and results
■ Running mining algorithms repeatedly and blindly, without thinking about the
next stage.