MOD 5 BUSAN
The increase in the use of data-mining techniques in business has been caused largely by three events:
• The explosion in the amount of data being produced and electronically tracked.
• The ability to electronically warehouse these data.
• The affordability of computer power to analyze the data.
Observation Set of recorded values of variables associated with a single entity.
Unsupervised learning • A descriptive data-mining technique used to identify relationships between
observations.
• Thought of as high-dimensional descriptive analytics.
• There is no outcome variable to predict; instead, qualitative
judgments are used to evaluate and compare the results.
Cluster Analysis • Goal of clustering is to segment observations into similar groups based on
observed variables.
• Can be employed during the data-preparation step to identify variables or
observations that can be aggregated or removed from consideration.
• Used to identify outliers.
market segmentation Clustering is commonly used in marketing to divide customers into different
homogeneous groups; this application is known as market segmentation.
• Clustering methods:
hierarchical clustering Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then
sequentially merges the most similar clusters to create a series of nested clusters.
k-means clustering assigns each observation to one of k clusters in a manner such that the observations
assigned to the same cluster are as similar as possible.
• Both methods depend on how the similarity of two observations is defined; hence, we need a way to measure
similarity (or dissimilarity) between observations.
Measuring Similarity Between Observations:
When observations include numeric variables,
Euclidean distance is the most common method to measure dissimilarity between observations.
• KTC is a financial advising company that provides personalized
financial advice to its clients.
• KTC would like to segment its customers into several groups (or
clusters) so that the customers within a group are similar, and
customers in different groups are dissimilar, with respect to key characteristics.
• For each customer, KTC has an observation of seven variables:
Age, Female, Income, Married, Children, Car Loan, Mortgage.
• Example: The observation u = (61, 0, 57881, 1, 2, 0, 0)
corresponds to a 61-year-old male with an annual income of
$57,881, married with two children, but no car loan and no
mortgage.
Euclidean distance becomes smaller as a pair of observations become more similar with respect to their
variable values.
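As a rough illustration of the calculation, the Python sketch below computes the Euclidean distance between the observation u given above and a second, made-up observation v (v is not from the KTC data):

```python
# Sketch of Euclidean distance between two KTC-style observations
# (Age, Female, Income, Married, Children, Car Loan, Mortgage).
import math

u = [61, 0, 57881, 1, 2, 0, 0]   # the 61-year-old married male from the example
v = [32, 1, 42000, 0, 0, 1, 0]   # hypothetical second customer (invented values)

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across all variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance(u, v))  # dominated by the Income difference,
                                 # which is why standardization matters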
Euclidean distance • is highly influenced by the scale on which variables are measured:
• It is common to standardize the units of each variable j of each
observation u so that no single variable dominates the distance because of its scale.
• When clustering observations solely on the basis of categorical variables
encoded as 0–1, a better measure of similarity between two observations
can be achieved by counting the number of variables with matching
values.
The simplest overlap measure is called the matching coefficient and is computed as the number of variables with matching values divided by the total number of variables.
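A minimal Python sketch of the matching coefficient, assuming two observations described by four hypothetical 0–1 variables (Female, Married, Car Loan, Mortgage):

```python
# Sketch: matching coefficient for two observations with 0-1 categorical variables.
u = [0, 1, 0, 0]
v = [0, 0, 1, 0]

def matching_coefficient(a, b):
    """Proportion of variables on which the two observations have matching values."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

print(matching_coefficient(u, v))  # 2 of 4 variables match -> 0.5
```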
k-Means Clustering Suitable when you know how many clusters you want and you have a larger data
set (e.g., more than 500 observations).
k-Means Clustering Partitions the observations into k clusters,
which is appropriate when the goal is to summarize the data with k “average” observations
(cluster centroids) that describe the data with the minimum amount of error.
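A sketch of k-means clustering with scikit-learn, assuming scikit-learn is available; the small customer matrix below is invented for illustration and is not KTC's data set:

```python
# Sketch: standardize the variables, then run k-means with k = 2.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.array([
    [61, 57881, 2],   # Age, Income, Children (hypothetical values)
    [32, 42000, 0],
    [45, 98500, 3],
    [28, 39000, 1],
    [57, 61000, 2],
    [35, 44500, 0],
])

X_std = StandardScaler().fit_transform(X)        # put variables on a common scale
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)

print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # the k "average" observations, in standardized units
```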
Association rules • If-then statements which convey the likelihood of certain items being
purchased together.
market basket analysis • Although association rules are an important tool in market basket analysis, they are also
applicable to other disciplines.
Antecedent • The collection of items (or item set) corresponding to the if portion of the
rule.
Consequent • The item set corresponding to the then portion of the rule.
Support count • The number of transactions in the data set that include a given item set.
Confidence Helps identify reliable association rules; computed as the support count of the combined antecedent and consequent item set divided by the support count of the antecedent.
Lift ratio A measure used to evaluate the efficiency of a rule; it compares the rule's confidence to the confidence that would be expected if the antecedent and consequent were independent, so values greater than 1 suggest a useful rule.
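To make these measures concrete, here is a Python sketch that computes support counts, confidence, and the lift ratio for one candidate rule over a small, made-up set of transactions:

```python
# Sketch: support count, confidence, and lift ratio for the rule
# "if diapers, then beer" over five invented shopping-cart transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

antecedent = {"diapers"}
consequent = {"beer"}

n = len(transactions)
support_antecedent = sum(antecedent <= t for t in transactions)           # 4
support_rule = sum((antecedent | consequent) <= t for t in transactions)  # 3
support_consequent = sum(consequent <= t for t in transactions)           # 3

confidence = support_rule / support_antecedent        # 3/4 = 0.75
lift = confidence / (support_consequent / n)          # 0.75 / 0.6 = 1.25

print(confidence, lift)  # lift > 1: the rule does better than chance
```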
Evaluating Association Rules: • An association rule is ultimately judged on how actionable it is and how
well it explains the relationship between item sets.
• For example, Walmart mined its transactional data to uncover strong
evidence of the association rule, “If a customer purchases a Barbie doll,
then a customer also purchases a candy bar.”
• An association rule is useful if it is well supported and explains an
important previously unknown relationship.
Text • like numerical data, may contain information that can help solve problems
and lead to better decisions.
Text mining • is the process of extracting useful information from text data.
Text data • is often referred to as unstructured data because in its raw form, it cannot
be stored in a traditional structured database (rows and columns).
Audio and video data are also examples of unstructured data.
Data mining with text data is more challenging than data mining with traditional numerical data, because it
requires more preprocessing to convert the text to a format amenable for analysis.
Voice of the Customer at Triad Airline: • Triad solicits feedback from its customers through a follow-up
e-mail the day after the customer has completed a flight.
• The survey asks the customer to rate various aspects of the flight and
asks the respondent to type comments into a dialog box in the e-mail; the feedback includes:
• Quantitative feedback from the ratings.
• Comments entered by the respondents, which need to be analyzed.
corpus A collection of text documents to be analyzed is called a corpus.
Voice of the Customer at Triad Airline: • To be analyzed, text data needs to be converted to structured data
(rows and columns of numerical data) so that the tools of
descriptive statistics, data visualization, and data mining can be applied.
• Think of converting a group of documents into a matrix of rows
and columns where each row corresponds to a document and each
column corresponds to a particular word.
presence/absence or binary term-document matrix • A matrix with the rows representing documents and the
columns representing words.
• Entries in the matrix indicate either the presence or the
absence of a particular word in a particular document.
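A Python sketch of building a presence/absence term-document matrix from a few invented comments, using a hand-picked term list in the spirit of Triad's (the comments below are not from the Triad survey):

```python
# Sketch: binary term-document matrix (rows = documents, columns = terms).
import re

documents = [
    "The flight was delayed and the service was horrible",
    "Great seat, friendly service",
    "My seat would not recline and the crew was rude",
]
terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

def tokens(doc):
    """Lowercase the text and keep only alphabetic words (simple tokenization)."""
    return set(re.findall(r"[a-z]+", doc.lower()))

matrix = [[1 if term in tokens(doc) else 0 for term in terms] for doc in documents]

for row in matrix:
    print(row)   # 1 = term present in the document, 0 = absent
```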
Voice of the Customer at Triad Airline (cont.): • Creating the list of terms to use in the presence/absence matrix
can be a complicated matter:
• Too many terms result in a matrix with many columns,
which may be difficult to manage and could yield
meaningless results.
• Too few terms may miss important relationships.
• Term frequency along with the problem context are often used as
a guide.
• In Triad’s case, management used word frequency and the
context of having a goal of satisfied customers to come up with
the following list of terms they feel are relevant for categorizing
the respondent’s comments: delayed, flight, horrible, recline,
rude, seat, and service.
Preprocessing Text Data for Analysis: • The text-mining process converts unstructured text into numerical
data and applies quantitative techniques.
• Which terms become the headers of the columns of the term-
document matrix can greatly impact the analysis.
Tokenization is the process of dividing text into separate terms, referred to as tokens:
• Symbols and punctuations must be removed from the
document, and all letters should be converted to
lowercase.
• Different forms of the same word, such as “stacking,”
“stacked,” and “stack” probably should not be
considered as distinct terms.
Stemming is the process of converting a word to its stem or root word.
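A short sketch of tokenization followed by stemming; it assumes the NLTK package is installed for its PorterStemmer (any stemmer could be substituted), and the comment text is invented:

```python
# Sketch: tokenize a comment, then stem each token.
import re
from nltk.stem import PorterStemmer

comment = "The seats were not reclining, and boarding was delayed again."

# Tokenization: lowercase, strip punctuation/symbols, split into terms.
tokens = re.findall(r"[a-z]+", comment.lower())

# Stemming: map different forms of a word to a common root.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(tokens)
print(stems)   # e.g., "delayed" -> "delay", "boarding" -> "board", "seats" -> "seat"
```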
Preprocessing Text Data for Analysis (cont.): • The goal of preprocessing is to generate a list of the most relevant
terms that is sufficiently small so as to lend itself to analysis:
• Frequency can be used to eliminate words from
consideration as tokens.
• Low-frequency words probably will not be very useful
as tokens.
• Consolidating words that are synonyms can reduce the
set of tokens.
• Most text-mining software gives the user the ability to
manually specify terms to include or exclude as tokens.
• The use of slang, humor, and sarcasm can cause interpretation
problems and might require more sophisticated data cleansing
and subjective intervention on the part of the analyst to avoid
misinterpretation.
• Data preprocessing parses the original text data down to the set of
tokens deemed relevant for the topic being studied.
• When the documents in a corpus contain many words and when
the frequency of word occurrence is important to the context of
the business problem, preprocessing can be used to develop a
frequency term-document matrix.
frequency term-document matrix A matrix whose rows represent documents and columns represent tokens; the
entries in the matrix are the frequency of occurrence of each token in each
document.
Movie Reviews: • A new action film has been released, and we now have a sample
of 10 reviews from movie critics.
• Using preprocessing techniques, we have reduced the number of
tokens to only two: “great” and “terrible.”
• Table 4.8 displays the corresponding frequency term-document
matrix.
• To demonstrate the analysis of a frequency term-document matrix
with descriptive data mining, we apply k-means clustering with k
= 2 to the frequency term-document matrix to obtain the two
clusters in Figure 4.5.
census • collects data from every element in the population of interest.
• There are many potential difficulties associated with taking a census; it may be:
• Expensive.
• Time consuming.
• Misleading.
• Unnecessary.
• Impractical.
Statistical inference • uses sample data to make estimates of or draw conclusions about one or
more characteristics of a population.
sampled population • The population from which the sample is drawn.
frame is a list of elements from which the sample will be selected.
Parameter • A measurable factor that defines a characteristic of a population, process,
or system.
• Sampling from a Finite Population:
Statisticians recommend selecting a probability sample when sampling from a finite
population because a probability sample allows you to make valid statistical
inferences about the population.
simple random sample • of size n from a finite population of size N is a sample selected
such that each possible sample of size n has the same probability
of being selected.
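A minimal Python sketch of drawing a simple random sample from a finite population, using the standard library; the frame of 300 element IDs is invented for illustration:

```python
# Sketch: simple random sample of size n from a finite population of size N.
import random

population = list(range(1, 301))   # a frame listing N = 300 element IDs
n = 30

random.seed(1)                            # seed only to make the illustration reproducible
sample = random.sample(population, n)     # every possible sample of size n is equally likely
print(sample)
```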
Sampling from an Infinite Population: • With an infinite population, you cannot select a simple random
sample because you cannot construct a frame consisting of all the
elements.
• Statisticians recommend selecting what is called a random
sample.
Random Sample (Infinite Population): • A random sample of size n from an infinite population is a sample selected such that
the following conditions are satisfied:
• Each element selected comes from the same population.
• Each element is selected independently.
Selecting a Sample • Care and judgment must be exercised in the selection process for a
random sample from an infinite population to ensure that:
• Each element selected comes from the same population.
• Each element is selected independently.
• Situations involving sampling from an infinite population are usually
associated with a process that operates over time.
sample statistic To estimate the value of a population parameter, compute a corresponding
characteristic of the sample, called a sample statistic.
point estimation Calculating the sample mean, sample standard deviation, and sample proportion is called point estimation:
• The sample mean x̄ is the point estimator of the population mean μ.
• The sample standard deviation s is the point estimator of the population standard deviation σ.
• The sample proportion p̄ is the point estimator of the population proportion p.
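A short Python sketch of computing these point estimates from a small, made-up sample (the income and mortgage values below are hypothetical):

```python
# Sketch: point estimates of the population mean, standard deviation, and proportion.
import statistics

incomes = [52000, 61000, 47500, 58200, 63900, 49800]   # hypothetical sample values
has_mortgage = [1, 0, 1, 1, 0, 0]                       # 1 = yes, 0 = no

x_bar = statistics.mean(incomes)                # point estimate of the population mean
s = statistics.stdev(incomes)                   # point estimate of the population standard deviation
p_bar = sum(has_mortgage) / len(has_mortgage)   # point estimate of the population proportion

print(x_bar, s, p_bar)
```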