Data mining
...“The gold you seek may be sparse and very hard to find…it
may not even exist at all…To find it will require a methodical
and organisational approach.”
...“You will be luckier if you have the right tools and know how
to use them.”
...“No tools automatically generate knowledge but help in the
analysis process.”
...“In most cases you will not find the answer, just part of the
jigsaw puzzle.”
“Data miners are the business detectives.”
How?
Through better understanding of its
data.
Written by JBacon 14/7/13 10
For organisational learning to take place, data
from many sources must be gathered together
and organised in a consistent and useful way –
hence, Data Warehousing.
whilst
Data Mining techniques make use of the data in a
Data Warehouse.
[Diagram: Customers, Orders, Transactions, Vendors, etc. are copied, organized and summarized]
Data Miners:
• “Business detectives”
whilst
Data Mining makes use of the data available
and accuracy has to be balanced with
timeliness.
Data Mining.
Not so …
Data Mining is an iterative, learning process
Data Mining takes conscientious, long-term hard
work and commitment, but…
◦ Turning:
Data into Information
Information into Action
Action into Value
Typical examples:
features from dates and/or time
features from telephone numbers, addresses,
product codes, identification numbers, etc.
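As a rough illustration of deriving features from a date and from a telephone number, here is a minimal Python sketch. The function names, the sample date and the sample phone number are all made up for this example, and the phone format (area code, space, local number) is an assumption rather than anything stated in these notes.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def date_features(d: date) -> dict:
    """Derive simple features from a calendar date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": DAY_NAMES[d.weekday()],  # weekday(): Monday = 0
        "is_weekend": d.weekday() >= 5,         # Saturday or Sunday
    }


def phone_features(number: str) -> dict:
    """Split an 'area-code local-number' string (assumed format) into parts."""
    area_code, _, local = number.partition(" ")
    return {"area_code": area_code, "local_number": local}


order_date = date(2013, 7, 14)         # illustrative value only
print(date_features(order_date))
print(phone_features("01632 960123"))  # made-up number
```

Each derived field (day of week, weekend flag, area code) can then be treated as a variable in its own right.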
Interval variable.
binary ordinal
nominal
In Statistical Inference we are drawing
conclusions (inferring something) about the
population based on results from sample(s).
Inferential statistics – to enable us to
draw conclusions (infer) from the data
(sample) and make decisions (about the
population) based on these conclusions.
Qualitative data, such as products, channels,
regions, and descriptions are a main focus of
data mining.
The data are often represented in a frequency
distribution and then represented graphically in:
Bar Chart – bars show number of times different
values occur.
Note: the length of each bar is in
proportion to the frequency.
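A frequency distribution and a simple text bar chart can be sketched in Python as follows. The list of regions is invented purely for illustration, and the helper names are not from these notes.

```python
from collections import Counter

# Made-up qualitative data: the sales region of each order (illustrative only).
regions = ["North", "South", "North", "East", "North",
           "South", "West", "North", "East", "South"]


def frequency_distribution(values):
    """Count how many times each distinct value occurs."""
    return Counter(values)


def text_bar_chart(freq):
    """Return one line per category; the bar length equals the frequency."""
    width = max(len(str(category)) for category in freq)
    lines = []
    for category, count in freq.most_common():
        lines.append(f"{str(category).ljust(width)} | {'#' * count} {count}")
    return "\n".join(lines)


freq = frequency_distribution(regions)
print(text_bar_chart(freq))
```

Because each bar is drawn with one `#` per occurrence, its length is directly proportional to the frequency of that value.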
17 17 14 16 15 24 12 20 17 17 13 21 15
14 14 20 21 9 15 22 19 27 19.
Calculate the:
a) range,
b) inter-quartile range,
c) sample variance and sample standard
deviation.
27 1 9.6957 1 x 94.09
Total 23 374.87
c) Sample variance = average of squared deviations
$$\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1} = \frac{374.87}{23 - 1} = 17.04 \text{ (squared £)}$$
But variance is in squared units!
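The exercise can be checked with a short Python sketch using only the standard library. It assumes the 23 values listed in the exercise are the full sample. The notes do not say which quartile convention they use, so the (n + 1)/4 position rule (the default in `statistics.quantiles`) is an assumption here. Taking the square root of the variance gives the standard deviation, which is back in the original units (£).

```python
import math
import statistics

# The 23 call-cost values (in £) from the exercise.
costs = [17, 17, 14, 16, 15, 24, 12, 20, 17, 17, 13, 21, 15,
         14, 14, 20, 21, 9, 15, 22, 19, 27, 19]

n = len(costs)
mean = sum(costs) / n

# a) Range: largest value minus smallest value.
value_range = max(costs) - min(costs)

# b) Inter-quartile range, assuming the (n + 1)/4 position convention
#    (the "exclusive" default of statistics.quantiles).
q1, _median, q3 = statistics.quantiles(costs, n=4)
iqr = q3 - q1

# c) Sample variance: sum of squared deviations divided by (n - 1),
#    then the standard deviation as its square root.
sum_sq_dev = sum((x - mean) ** 2 for x in costs)
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.4f}")
print(f"range = {value_range}")
print(f"IQR = {iqr}")
print(f"sum of squared deviations = {sum_sq_dev:.2f}")
print(f"sample variance = {variance:.2f} (squared £)")
print(f"sample standard deviation = {std_dev:.2f} (£)")
```

The sketch reproduces the total of 374.87 and the variance of 17.04. The mean and standard deviation it gives (about £17.30 and £4.13) appear to correspond to the male row of the phone call costs table.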
Symmetrical distribution.
The normal distribution is a bell-shaped
distribution, symmetrical about its mean,
where: mean = median = mode.
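These two properties can be illustrated with Python's standard-library `statistics.NormalDist`. The parameters below (mean 100, standard deviation 15) are arbitrary illustrative values, not taken from these notes.

```python
import math
from statistics import NormalDist

# Illustrative parameters only; not taken from the notes.
dist = NormalDist(mu=100, sigma=15)

# For a normal distribution the three measures of centre coincide.
print(dist.mean, dist.median, dist.mode)

# Symmetry: the density is the same at equal distances either side of the mean.
left = dist.pdf(100 - 20)
right = dist.pdf(100 + 20)
print(math.isclose(left, right))

# Half of the distribution lies below the mean.
print(dist.cdf(dist.mean))
```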
p > 0.05: Do not reject H0
p < 0.05: Reject H0 at 5% sig. level
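The decision rule can be written as a small Python function. The function name is hypothetical. The notes only cover p > 0.05 and p < 0.05, so treating a p-value exactly equal to 0.05 as "do not reject" is an assumption made here.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule at significance level alpha (5% by default)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return f"Reject H0 at the {alpha:.0%} significance level"
    # Assumption: p exactly equal to alpha is treated as "do not reject".
    return "Do not reject H0"


print(decide(0.03))
print(decide(0.20))
```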
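$\sum x = 14 + 5 + 15 + \dots + 21 = 396$
Mean: $\bar{x} = \frac{\sum x}{n} = \frac{396}{23} = 17.2174$, i.e. £17.22
Sample variance $= \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1} = \frac{1876.094}{22} = 85.277$ (£ squared)
Standard deviation $= \sqrt{85.277}$ = £9.23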
xi     fi    Deviation from mean     Frequency multiplied
             (xi - 17.2174),         by deviation squared
             correct to 4dp
5      1     -12.2174                1 x 149.265
6      1     -11.2174                1 x 125.830
7      2     -10.2174                2 x 104.395
:      :     :                       :
33     2     15.7826                 2 x 249.091
Total  23                            1876.094
Phone call costs    Mean      Standard deviation
Male                £17.30    £4.13
Female              £17.22    £9.23
The average monthly call costs (approx £17)
are similar for males and females.
However, there is more than twice as much
variation in the costs of phone calls made by
females than males.
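The "more than twice" comparison follows from the ratio of the two standard deviations in the table (the subscripted symbols are introduced here only for this check):

$$\frac{s_{\text{female}}}{s_{\text{male}}} = \frac{9.23}{4.13} \approx 2.2$$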
In a boxplot the middle half of the values in a
distribution is represented by a box, i.e. the
‘spread’ of the box is the inter-quartile range.
Background reading.
❖ But
is susceptible to missing values and
skewed distributions of the data.
[Scatterplot: vertical axis "% change in production"]
variable and is called the
independent variable (x).
From the scatterplot, as
production increases, the %
change in production also
e.g. CUSTDET1:
Does total dining (dependent variable) depend
upon:
➢ 1 independent variable.
If so, simple linear regression (SLR).
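A simple linear regression with one independent variable can be sketched with the usual least-squares formulas. The CUSTDET1 data is not reproduced in these notes, so both columns below are made up, and the independent variable name `visits` is purely hypothetical.

```python
def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x (one independent variable)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: both columns are hypothetical, not taken from CUSTDET1.
visits = [1, 2, 3, 4, 5]              # hypothetical independent variable (x)
total_dining = [35, 55, 80, 95, 125]  # hypothetical dependent variable (y)

intercept, slope = simple_linear_regression(visits, total_dining)
print(f"total_dining = {intercept:.2f} + {slope:.2f} * visits")
```

With more than one independent variable the same question leads to multiple regression, which this sketch does not cover.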
❖ And, what general criteria are used when
making this selection?
Answer: 5% significance level.
output.