MOD 5 BUSAN
The increase in the use of data-mining techniques in business has been caused largely by three events:
• The explosion in the amount of data being produced and electronically tracked.
• The ability to electronically warehouse these data.
• The affordability of computer power to analyze the data.
Observation Set of recorded values of variables associated with a single entity.
Unsupervised learning • A descriptive data-mining technique used to identify relationships between
observations.
• Thought of as high-dimensional descriptive analytics.
• There is no outcome variable to predict; instead, qualitative
judgments are used to evaluate and compare the results.
Cluster Analysis • Goal of clustering is to segment observations into similar groups based on
observed variables.
• Can be employed during the data-preparation step to identify variables or
observations that can be aggregated or removed from consideration.
• Used to identify outliers.
market segmentation Clustering is commonly used in marketing to divide customers into different
homogeneous groups; this application is known as market segmentation.
• Clustering methods:
hierarchical clustering Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then
sequentially merges the most similar clusters to create a series of nested clusters.
k-means clustering assigns each observation to one of k clusters in a manner such that the observations
assigned to the same cluster are as similar as possible.
• Both methods depend on how the similarity of two observations is defined; hence, we need a way to measure
similarity (or dissimilarity) between observations.
Measuring Similarity Between Observations:
When observations include numeric variables,
Euclidean distance is the most common method to measure dissimilarity between observations.
• KTC is a financial advising company that provides personalized
financial advice to its clients.
• KTC would like to segment its customers into several groups (or
clusters) so that the customers within a group are similar, and
customers in different groups are dissimilar, with respect to key characteristics.
• For each customer, KTC has an observation of seven variables:
Age, Female, Income, Married, Children, Car Loan, Mortgage.
• Example: The observation u = (61, 0, 57881, 1, 2, 0, 0)
corresponds to a 61-year-old male with an annual income of
$57,881, married with two children, but no car loan and no
mortgage.
Euclidean distance becomes smaller as a pair of observations become more similar with respect to their
variable values.
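As a rough illustration of the calculation, the Python sketch below computes the Euclidean distance between the observation u given above and a second, made-up observation v (v is not from the KTC data):

```python
# Sketch of Euclidean distance between two KTC-style observations
# (Age, Female, Income, Married, Children, Car Loan, Mortgage).
import math

u = [61, 0, 57881, 1, 2, 0, 0]   # the 61-year-old married male from the example
v = [32, 1, 42000, 0, 0, 1, 0]   # hypothetical second customer (invented values)

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across all variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance(u, v))  # dominated by the Income difference,
                                 # which is why standardization matters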
Euclidean distance • is highly influenced by the scale on which variables are measured:
• It is common to standardize the units of each variable j of each
observation u so that no single variable dominates the distance because of its scale.
• When clustering observations solely on the basis of categorical variables
encoded as 0–1, a better measure of similarity between two observations
can be achieved by counting the number of variables with matching
values.
The simplest overlap measure is called the matching coefficient and is computed as the number of variables with matching values divided by the total number of variables.
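A minimal Python sketch of the matching coefficient, assuming two observations described by four hypothetical 0–1 variables (Female, Married, Car Loan, Mortgage):

```python
# Sketch: matching coefficient for two observations with 0-1 categorical variables.
u = [0, 1, 0, 0]
v = [0, 0, 1, 0]

def matching_coefficient(a, b):
    """Proportion of variables on which the two observations have matching values."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

print(matching_coefficient(u, v))  # 2 of 4 variables match -> 0.5
```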
k-Means Clustering Suitable when you know how many clusters you want and you have a larger data
set (e.g., more than 500 observations).
k-Means Clustering Partitions the observations into k clusters,
which is appropriate when the goal is to summarize the data with k “average” observations
(cluster centroids) that describe the data with the minimum amount of error.
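A sketch of k-means clustering with scikit-learn, assuming scikit-learn is available; the small customer matrix below is invented for illustration and is not KTC's data set:

```python
# Sketch: standardize the variables, then run k-means with k = 2.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.array([
    [61, 57881, 2],   # Age, Income, Children (hypothetical values)
    [32, 42000, 0],
    [45, 98500, 3],
    [28, 39000, 1],
    [57, 61000, 2],
    [35, 44500, 0],
])

X_std = StandardScaler().fit_transform(X)        # put variables on a common scale
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)

print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # the k "average" observations, in standardized units
```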
Association rules • If-then statements which convey the likelihood of certain items being
purchased together.
market basket analysis • Although association rules are an important tool in market basket analysis, they are also
applicable to other disciplines.
Antecedent • The collection of items (or item set) corresponding to the if portion of the
rule.
Consequent • The item set corresponding to the then portion of the rule.
Support count • The number of transactions in the data set that include a given item set.
Confidence Helps identify reliable association rules; computed as the support count of the combined antecedent and consequent item set divided by the support count of the antecedent.
Lift ratio A measure used to evaluate the efficiency of a rule; it compares the rule's confidence to the confidence that would be expected if the antecedent and consequent were independent, so values greater than 1 suggest a useful rule.
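To make these measures concrete, here is a Python sketch that computes support counts, confidence, and the lift ratio for one candidate rule over a small, made-up set of transactions:

```python
# Sketch: support count, confidence, and lift ratio for the rule
# "if diapers, then beer" over five invented shopping-cart transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

antecedent = {"diapers"}
consequent = {"beer"}

n = len(transactions)
support_antecedent = sum(antecedent <= t for t in transactions)           # 4
support_rule = sum((antecedent | consequent) <= t for t in transactions)  # 3
support_consequent = sum(consequent <= t for t in transactions)           # 3

confidence = support_rule / support_antecedent        # 3/4 = 0.75
lift = confidence / (support_consequent / n)          # 0.75 / 0.6 = 1.25

print(confidence, lift)  # lift > 1: the rule does better than chance
```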
Evaluating Association Rules: • An association rule is ultimately judged on how actionable it is and how
well it explains the relationship between item sets.
• For example, Walmart mined its transactional data to uncover strong
evidence of the association rule, “If a customer purchases a Barbie doll,
then a customer also purchases a candy bar.”
• An association rule is useful if it is well supported and explains an
important previously unknown relationship.
Text • like numerical data, may contain information that can help solve problems
and lead to better decisions.
Text mining • is the process of extracting useful information from text data.
Text data • is often referred to as unstructured data because in its raw form, it cannot
be stored in a traditional structured database (rows and columns).
Audio and video data are also examples of unstructured data.
Data mining with text data is more challenging than data mining with traditional numerical data, because it
requires more preprocessing to convert the text to a format amenable for analysis.
Voice of the Customer at Triad Airline: • Triad solicits feedback from its customers through a follow-up
e-mail the day after the customer has completed a flight.
• The survey asks the customer to rate various aspects of the flight and
asks the respondent to type comments into a dialog box in the e-mail; the feedback includes:
• Quantitative feedback from the ratings.
• Comments entered by the respondents, which need to be analyzed.
corpus A collection of text documents to be analyzed is called a corpus.
Voice of the Customer at Triad Airline: • To be analyzed, text data needs to be converted to structured data
(rows and columns of numerical data) so that the tools of
descriptive statistics, data visualization, and data mining can be applied.
• Think of converting a group of documents into a matrix of rows
and columns where each row corresponds to a document and each
column corresponds to a particular word.
presence/absence or binary term-document matrix • A matrix with the rows representing documents and the
columns representing words.
• Entries in the matrix indicate either the presence or the
absence of a particular word in a particular document.
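A Python sketch of building a presence/absence term-document matrix from a few invented comments, using a hand-picked term list in the spirit of Triad's (the comments below are not from the Triad survey):

```python
# Sketch: binary term-document matrix (rows = documents, columns = terms).
import re

documents = [
    "The flight was delayed and the service was horrible",
    "Great seat, friendly service",
    "My seat would not recline and the crew was rude",
]
terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

def tokens(doc):
    """Lowercase the text and keep only alphabetic words (simple tokenization)."""
    return set(re.findall(r"[a-z]+", doc.lower()))

matrix = [[1 if term in tokens(doc) else 0 for term in terms] for doc in documents]

for row in matrix:
    print(row)   # 1 = term present in the document, 0 = absent
```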
Voice of the Customer at Triad Airline (cont.): • Creating the list of terms to use in the presence/absence matrix
can be a complicated matter:
• Too many terms result in a matrix with many columns,
which may be difficult to manage and could yield
meaningless results.
• Too few terms may miss important relationships.
• Term frequency along with the problem context are often used as
a guide.
• In Triad’s case, management used word frequency and the
context of having a goal of satisfied customers to come up with
the following list of terms they feel are relevant for categorizing
the respondent’s comments: delayed, flight, horrible, recline,
rude, seat, and service.
Preprocessing Text Data for Analysis: • The text-mining process converts unstructured text into numerical
data and applies quantitative techniques.
• Which terms become the headers of the columns of the term-
document matrix can greatly impact the analysis.
Tokenization is the process of dividing text into separate terms, referred to as tokens:
• Symbols and punctuations must be removed from the
document, and all letters should be converted to
lowercase.
• Different forms of the same word, such as “stacking,”
“stacked,” and “stack” probably should not be
considered as distinct terms.
Stemming is the process of converting a word to its stem or root word.
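A short sketch of tokenization followed by stemming; it assumes the NLTK package is installed for its PorterStemmer (any stemmer could be substituted), and the comment text is invented:

```python
# Sketch: tokenize a comment, then stem each token.
import re
from nltk.stem import PorterStemmer

comment = "The seats were not reclining, and boarding was delayed again."

# Tokenization: lowercase, strip punctuation/symbols, split into terms.
tokens = re.findall(r"[a-z]+", comment.lower())

# Stemming: map different forms of a word to a common root.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(tokens)
print(stems)   # e.g., "delayed" -> "delay", "boarding" -> "board", "seats" -> "seat"
```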
Preprocessing Text Data for Analysis (cont.): • The goal of preprocessing is to generate a list of the most relevant
terms that is sufficiently small so as to lend itself to analysis:
• Frequency can be used to eliminate words from
consideration as tokens.
• Low-frequency words probably will not be very useful
as tokens.
• Consolidating words that are synonyms can reduce the
set of tokens.
• Most text-mining software gives the user the ability to
manually specify terms to include or exclude as tokens.
• The use of slang, humor, and sarcasm can cause interpretation
problems and might require more sophisticated data cleansing
and subjective intervention on the part of the analyst to avoid
misinterpretation.
• Data preprocessing parses the original text data down to the set of
tokens deemed relevant for the topic being studied.
• When the documents in a corpus contain many words and when
the frequency of word occurrence is important to the context of
the business problem, preprocessing can be used to develop a
frequency term-document matrix.
frequency term-document matrix A matrix whose rows represent documents and columns represent tokens; the
entries in the matrix are the frequency of occurrence of each token in each
document.
Movie Reviews: • A new action film has been released, and we now have a sample
of 10 reviews from movie critics.
• Using preprocessing techniques, we have reduced the number of
tokens to only two: “great” and “terrible.”
• Table 4.8 displays the corresponding frequency term-document
matrix.
• To demonstrate the analysis of a frequency term-document matrix
with descriptive data mining, we apply k-means clustering with k
= 2 to the frequency term-document matrix to obtain the two
clusters in Figure 4.5.
census • collects data from every element in the population of interest.
• There are many potential difficulties associated with taking a census; it may be:
• Expensive.
• Time consuming.
• Misleading.
• Unnecessary.
• Impractical.
Statistical inference • uses sample data to make estimates of or draw conclusions about one or
more characteristics of a population.
sampled population • The population from which the sample is drawn.
frame is a list of elements from which the sample will be selected.
Parameter • A measurable factor that defines a characteristic of a population, process,
or system.
• Sampling from a Finite Population:
Statisticians recommend selecting a probability sample when sampling from a finite
population because a probability sample allows you to make valid statistical
inferences about the population.
simple random sample • of size n from a finite population of size N is a sample selected
such that each possible sample of size n has the same probability
of being selected.
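A minimal Python sketch of drawing a simple random sample from a finite population, using the standard library; the frame of 300 element IDs is invented for illustration:

```python
# Sketch: simple random sample of size n from a finite population of size N.
import random

population = list(range(1, 301))   # a frame listing N = 300 element IDs
n = 30

random.seed(1)                            # seed only to make the illustration reproducible
sample = random.sample(population, n)     # every possible sample of size n is equally likely
print(sample)
```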
Sampling from an Infinite Population: • With an infinite population, you cannot select a simple random
sample because you cannot construct a frame consisting of all the
elements.
• Statisticians recommend selecting what is called a random
sample.
Random Sample (Infinite Population): • A random sample of size n from an infinite population is a sample selected such that
the following conditions are satisfied:
• Each element selected comes from the same population.
• Each element is selected independently.
Selecting a Sample • Care and judgment must be exercised in the selection process for a
random sample from an infinite population to ensure that:
• Each element selected comes from the same population.
• Each element is selected independently.
• Situations involving sampling from an infinite population are usually
associated with a process that operates over time.
sample statistic To estimate the value of a population parameter, compute a corresponding
characteristic of the sample, called a sample statistic.
point estimation Calculating the sample mean, sample standard deviation, and sample proportion is called point estimation:
• The sample mean x̄ is the point estimator of the population mean μ.
• The sample standard deviation s is the point estimator of the population standard deviation σ.
• The sample proportion p̄ is the point estimator of the population proportion p.
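A short Python sketch of computing these point estimates from a small, made-up sample (the income and mortgage values below are hypothetical):

```python
# Sketch: point estimates of the population mean, standard deviation, and proportion.
import statistics

incomes = [52000, 61000, 47500, 58200, 63900, 49800]   # hypothetical sample values
has_mortgage = [1, 0, 1, 1, 0, 0]                       # 1 = yes, 0 = no

x_bar = statistics.mean(incomes)                # point estimate of the population mean
s = statistics.stdev(incomes)                   # point estimate of the population standard deviation
p_bar = sum(has_mortgage) / len(has_mortgage)   # point estimate of the population proportion

print(x_bar, s, p_bar)
```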