Lecture 3: More of Chapter 2
Homework Assignment:
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Chapter 2: Data
What is Data? Attributes
●Object is also known as record, point, case,
sample,
Types of Attributes:
Ratio = true zero exists, division makes sense
Types of Attributes:
● Some examples:
–Nominal
Examples: ID numbers, eye color, zip codes
–Ordinal
Examples: rankings (e.g., taste of potato
chips on a scale from 1-10), grades, height in
{tall, medium, short}
–Interval
Examples: calendar dates, temperatures in
Celsius or Fahrenheit, GRE score
–Ratio
Examples: temperature in Kelvin, length,
time, counts
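To see why division only makes sense for ratio attributes, the following sketch compares the same two temperatures on an interval scale (Celsius) and a ratio scale (Kelvin). The values 10 and 20 are made up for illustration and do not come from the lecture data.

```r
# Two illustrative temperatures (made-up values, not from the lecture data)
celsius <- c(10, 20)
kelvin  <- celsius + 273.15   # Kelvin has a true zero

# Differences agree on both scales, so subtraction is meaningful for interval data
diff_celsius <- celsius[2] - celsius[1]
diff_kelvin  <- kelvin[2] - kelvin[1]

# Ratios disagree: 20 degrees Celsius is not "twice as hot" as 10 degrees Celsius
ratio_celsius <- celsius[2] / celsius[1]
ratio_kelvin  <- kelvin[2] / kelvin[1]

print(c(diff_celsius, diff_kelvin))
print(c(ratio_celsius, ratio_kelvin))
```

Only the Kelvin ratio has a physical interpretation, because only Kelvin measures from a true zero.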
Properties of Attribute Values
Discrete vs. Continuous (P. 28)
●Continuous Attribute
–Has real numbers as attribute values
–Can be measured as accurately as instruments allow
–Examples: temperature, height, or weight
–Practically, real values can only be measured and represented using a finite number of digits
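The point about a finite number of digits also applies inside R itself. A minimal sketch, using only base R, of how finite precision shows up when comparing real values:

```r
# 0.1, 0.2 and 0.3 have no exact binary representation
exact_equal  <- (0.1 + 0.2 == 0.3)                  # exact comparison fails
approx_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # comparison within a tolerance

# Smallest positive x for which 1 + x differs from 1 in double precision
eps <- .Machine$double.eps

print(exact_equal)
print(approx_equal)
print(eps)
```

For continuous attributes it is usually safer to compare with `all.equal()` than with `==`.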
In class exercise #3:
Classify the following attributes as binary,
discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative
(interval or ratio). Some cases may have more
than one interpretation, so briefly indicate your
reasoning if you think there may be some
ambiguity.
a) Number of telephones in your house
b) Size of French Fries (Medium or Large or X-Large)
c) Ownership of a cell phone
d) Number of local phone calls you made in a
month
e) Length of longest phone call
f) Length of your foot
g) Price of your textbook
h) Zip code
Types of Data in R
●In R,
●For example, the IP address in the first
column of
www.stats202.com/stats202log.txt is a
factor
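> # stringsAsFactors=T is needed in R 4.0.0 and later, where text columns are no longer read as factors by default
> data<-read.csv("stats202log.txt",
sep=" ",header=F,stringsAsFactors=T)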
> data[,1]
[1] 69.224.117.122 69.224.117.122 69.224.117.122 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164
…
…
[1901] 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11
[1911] 65.57.245.11 67.164.82.184 67.164.82.184 67.164.82.184 171.66.214.36 171.66.214.36 171.66.214.36 65.57.245.11 65.57.245.11 65.57.245.11
[1921] 65.57.245.11 65.57.245.11
73 Levels: 128.12.159.131 128.12.159.164 132.79.14.16 171.64.102.169 171.64.102.98 171.66.214.36 196.209.251.3 202.160.180.150 202.160.180.57 ... 89.100.163.185
> is.factor(data[,1])
[1] TRUE
> data[,1]+10
[1] NA NA NA NA NA NA NA NA …
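The same behavior can be reproduced without the log file. The sketch below uses a few hypothetical IP addresses as a stand-in for the first column; the addresses and variable names are assumptions made for illustration.

```r
# Hypothetical IP addresses standing in for the first column of the log
ips <- factor(c("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"))

ip_levels <- levels(ips)   # the distinct values
ip_counts <- table(ips)    # counting is meaningful for a nominal attribute

# Arithmetic is not meaningful: R returns NA for every element (with a warning)
shifted <- suppressWarnings(ips + 10)

print(ip_levels)
print(ip_counts)
print(shifted)
```

Counting and testing for equality are the operations that make sense for a nominal attribute, which is why R stores it as a factor.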
> data[,8]
[1] 2867 4583 2295 2867 4583 2295 1379 2294 4432 7134 2296 2297 3219968 1379 2294 4432 7134 2293 2297 2294
…
[1901] 2294 4432 7134 2294 4432 7134 2294 2867 4583 2295 2294 4432 7134 2294 4432 7134 2294 2294 2294 2294
[1921] 2294 2294
Levels: - 1135151 122880 1379 1510 2290 2293 2294 2295 2296 2297 2309 238 241 246 248 250 2725487 280535 2867 3072 3219968 4432 4583 626 7134 7482
> is.factor(data[,8])
[1] TRUE
> is.numeric(data[,8])
[1] FALSE
> data<-read.csv("stats202log.txt",
sep=" ",header=F, na.strings = "-")
> is.factor(data[,8])
[1] FALSE
> is.numeric(data[,8])
[1] TRUE
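If a numeric column has already been read as a factor, it can also be repaired after the fact. A minimal sketch, assuming a small hypothetical vector of byte counts with "-" marking a missing value; note that converting the factor directly returns the internal level codes, not the numbers.

```r
# Hypothetical byte counts with "-" for a missing value, as in column 8
bytes_raw <- factor(c("2867", "-", "4583", "2867"))

# Pitfall: this returns the internal level codes, not the byte counts
codes <- as.numeric(bytes_raw)

# Converting through character gives the real values; "-" becomes NA
values <- suppressWarnings(as.numeric(as.character(bytes_raw)))

print(codes)
print(values)
print(mean(values, na.rm = TRUE))
```

Reading the file with `na.strings = "-"` avoids the problem at the source, so this conversion is only a fallback.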
> zip_codes<-
as.factor(c("94550","00123","43614"))
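One reason to store zip codes as a factor rather than as numbers is sketched below: numeric conversion silently drops leading zeros. The variable names other than `zip_codes` are introduced here for illustration.

```r
zip_strings <- c("94550", "00123", "43614")

# Treating zip codes as numbers loses the leading zeros
zip_numbers <- as.numeric(zip_strings)

# As a factor, each code is kept as a label
zip_codes <- as.factor(zip_strings)
preserved <- as.character(zip_codes)

print(zip_numbers)
print(zip_codes)
```

Zip codes are nominal, so nothing is lost by giving up arithmetic on them.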
Types of Data in Excel
Working with Data in R
Creating Data:
> aa<-c(1,10,12)
> aa
[1] 1 10 12
> aa+10
[1] 11 20 22
> length(aa)
[1] 3
> bb<-c(2,6,79)
> my_data_set<-
data.frame(attributeA=aa,attributeB=bb)
> my_data_set
  attributeA attributeB
1          1          2
2         10          6
3         12         79
Indexing Data:
> my_data_set[,1]
[1] 1 10 12
> my_data_set[1,]
  attributeA attributeB
1          1          2
> my_data_set[3,2]
[1] 79
> my_data_set[1:2,]
  attributeA attributeB
1          1          2
2         10          6
> my_data_set[c(1,3),]
  attributeA attributeB
1          1          2
3         12         79
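Besides numeric positions, columns can be selected by name and rows by a logical condition. A short sketch using the same data frame; the threshold of 5 is an arbitrary choice for illustration.

```r
my_data_set <- data.frame(attributeA = c(1, 10, 12), attributeB = c(2, 6, 79))

# Select a column by name, two equivalent ways
col_by_name   <- my_data_set$attributeA
col_by_string <- my_data_set[, "attributeB"]

# Keep only the rows where attributeA exceeds 5 (arbitrary threshold)
big_rows <- my_data_set[my_data_set$attributeA > 5, ]

print(col_by_name)
print(col_by_string)
print(big_rows)
```

Selecting by name keeps working if the column order changes, which positional indexing does not.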
Arithmetic:
> aa/bb
[1] 0.5000000 1.6666667 0.1518987
Summary Statistics:
> mean(my_data_set[,1])
[1] 7.666667
> median(my_data_set[,1])
[1] 10
> sqrt(var(my_data_set[,1]))
[1] 5.859465
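The expression `sqrt(var(...))` is the sample standard deviation. The sketch below checks that it matches the built-in `sd()` and a by-hand calculation that divides by n - 1.

```r
aa <- c(1, 10, 12)

sd_via_var <- sqrt(var(aa))   # as computed above
sd_direct  <- sd(aa)          # built-in shortcut

# By hand: sum of squared deviations divided by n - 1, then the square root
sd_manual <- sqrt(sum((aa - mean(aa))^2) / (length(aa) - 1))

print(c(sd_via_var, sd_direct, sd_manual))
```

All three values agree because `var()` and `sd()` both use the n - 1 denominator.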
Writing Data:
> write.csv(my_data_set,"my_data_set_file.csv")
Help!:
> ?write.csv
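By default `write.csv` also writes the row names as an extra first column. A minimal round-trip sketch that suppresses them and reads the file back; it writes to a temporary file (an assumption made so the example leaves no file behind) rather than to my_data_set_file.csv.

```r
my_data_set <- data.frame(attributeA = c(1, 10, 12), attributeB = c(2, 6, 79))

# Temporary path used only for this illustration
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row labels 1, 2, 3 out of the file
write.csv(my_data_set, out_file, row.names = FALSE)

# Read the file back and confirm the columns survived
read_back <- read.csv(out_file)
print(read_back)
```

Without `row.names = FALSE`, reading the file back would produce an additional column holding the old row labels.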
Working with Data in Excel
Reading in Data:
Deleting a Column:
(right click)
Arithmetic:
Sampling (P.47)
●The simple random sample is the most
common and basic type of sample
●In a simple random sample every item has the
same probability of inclusion and every sample
of the fixed size has the same probability of
selection
●It is the standard “names out of a hat” approach
●Other sampling schemes exist (stratified sampling, cluster sampling, Latin hypercube sampling)
Sampling in Excel:
●The function rand() is useful.
●Sorting is done in Excel by selecting “Sort”
from the “Data” menu
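The Excel recipe of attaching a random number to each row, sorting by it, and keeping the top rows can be mirrored in R to see why it yields a simple random sample without replacement. This is a sketch of the idea only; the item ids and the sample size of 5 are hypothetical.

```r
set.seed(1)                  # fixed seed so the sketch is repeatable

items <- 1:20                # hypothetical item ids, one per spreadsheet row

# Step 1: one uniform random number per row, like rand() in a new column
random_key <- runif(length(items))

# Step 2: sort the rows by the random column
shuffled <- items[order(random_key)]

# Step 3: keep the first 5 rows as the sample
my_sample <- shuffled[1:5]
print(my_sample)
```

Because every ordering of the random keys is equally likely, every set of 5 items has the same chance of ending up at the top.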
Sampling in R:
●The function sample() is useful.
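A minimal sketch of `sample()` with and without replacement. The population of the integers 1 to 100 and the sample size of 10 are assumptions chosen for illustration.

```r
set.seed(202)                # fixed seed so the sketch is repeatable

population <- 1:100          # hypothetical population

# Simple random sample: no item can be drawn twice
without_repl <- sample(population, 10)

# Sampling with replacement: the same item may appear more than once
with_repl <- sample(population, 10, replace = TRUE)

print(without_repl)
print(with_repl)
```

The default is `replace = FALSE`, which corresponds to drawing names out of a hat.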
In class exercise #4:
Explain how to use R to draw a sample of 10
observations with replacement from the first
quantitative attribute in the data set
www.stats202.com/stats202log.txt.
Answer:
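> # draw 10 row indices with replacement (the log has 1922 rows)
> sam<-sample(seq(1,1922),10,replace=T)
> # take those rows of V7, the first quantitative column
> my_sample<-data$V7[sam]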
In class exercise #5:
If you do the sampling in the previous exercise
repeatedly, roughly how far is the mean of the
sample from the mean of the whole column on
average?
Answer: about 26
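> real_mean<-mean(data$V7)   # mean of the whole column
> store_diff<-rep(0,10000)   # one slot per repetition
>
> for (k in 1:10000){
+ sam<-sample(seq(1,1922),10,replace=T)
+ my_sample<-data$V7[sam]
+ store_diff[k]<-abs(mean(my_sample)-real_mean)   # distance for this sample
+ }
> mean(store_diff)   # average distance over all repetitions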
In class exercise #6:
If you change the sample size from 10 to 100, how
does your answer to the previous question
change?
> real_mean<-mean(data$V7)
> store_diff<-rep(0,10000)
>
> for (k in 1:10000){
+ sam<-sample(seq(1,1922),100,replace=T)
+ my_sample<-data$V7[sam]
+ store_diff[k]<-abs(mean(my_sample)-real_mean)
+ }
> mean(store_diff)
The square root sampling relationship:
●When you take samples, the differences between the sample values and the value using the entire data set scale as one over the square root of the sample size for many statistics such as the mean. For example, a sample 100 times larger gives differences about 10 times smaller.
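The relationship can be checked by simulation: going from samples of 10 to samples of 100 should shrink the average error of the sample mean by a factor of about sqrt(10), roughly 3.16. The sketch below uses synthetic normal data as a stand-in for the log file column, so the population, its parameters, and the helper `mean_abs_error` are all assumptions made for illustration.

```r
set.seed(202)   # fixed seed so the sketch is repeatable

# Synthetic stand-in population (NOT the stats202log.txt column)
population <- rnorm(2000, mean = 300, sd = 80)
real_mean  <- mean(population)

# Average distance between the sample mean and the full-data mean
# for samples of size n drawn with replacement (hypothetical helper)
mean_abs_error <- function(n, reps = 5000) {
  diffs <- replicate(reps, {
    my_sample <- sample(population, n, replace = TRUE)
    abs(mean(my_sample) - real_mean)
  })
  mean(diffs)
}

err10  <- mean_abs_error(10)
err100 <- mean_abs_error(100)
ratio  <- err10 / err100   # expected to be near sqrt(10)

print(c(err10, err100, ratio, sqrt(10)))
```

The ratio will not equal sqrt(10) exactly because it is itself estimated from a finite number of repetitions.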
●Only the size of the sample matters, and not the size of the whole data set (the population), since this relationship …
Sampling (P.47)
●Sampling can be tricky or ineffective when
the data has a more complex structure than
simply independent observations.
●For example, here is a “sample” of words
from a song. Most of the information is lost.
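The effect is easy to reproduce. The sketch below samples words from a made-up sentence, which is a stand-in and not the song from the slide: each sampled word is genuine, but the order and context that carry the meaning are gone.

```r
set.seed(7)   # fixed seed so the sketch is repeatable

# Made-up sentence used as a stand-in for structured text
sentence <- "the quick brown fox jumps over the lazy dog"

# Split into words and draw 4 of them at random
words       <- strsplit(sentence, " ")[[1]]
word_sample <- sample(words, 4)

print(word_sample)
```

Text, time series, and networks all depend on relationships between observations in this way, so sampling individual observations discards exactly the part that matters.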