Lecture 1 - Data Representation: Dr. Salman Saeed
How are decisions made?
Cases
Variables
Right-hand bat      Left-hand bat         Right-hand bat
Leg-break googly    Right-arm off-break   Right-arm off-break
6ft 0in             6ft 2in               6ft 4in
1405                1510                  1176
97                  17                    10
36 years, 1 month   36 years, 6 months    35 years, 9 months
16-Jul-2020 Lecture # 01 GEOL 703 – Applied Geo-statistics Dr. Salman Saeed, NIUIP, UET Peshawar
Variables & Cases
Case Case
Characteristics Characteristics
Variables: characteristics whose values change across cases (variation)
Constants: characteristics whose values do not vary across cases (no variation)
Levels of Measurement
Level      Type          Difference  Order  Similar Intervals
Nominal    Categorical   ✓
Ordinal    Categorical   ✓           ✓
Interval   Quantitative  ✓           ✓      ✓
• Interval level: numbers are used to measure observations whose categories:
– are mutually exclusive
– are exhaustive
– have some explicit relationship among them
– have a known and exact relationship between them
• A common and constant unit of measurement has been established between the categories
• Examples: temperature (in °C or °F), IQ
Levels of Measurement
Level      Type          Difference  Order  Similar Intervals  Meaningful Zero Point
Nominal    Categorical   ✓
Ordinal    Categorical   ✓           ✓
Interval   Quantitative  ✓           ✓      ✓
Ratio      Quantitative  ✓           ✓      ✓                  ✓
• Ratio level: numbers are used to measure observations whose categories:
– are mutually exclusive
– are exhaustive
– have some explicit relationship among them
– have a known and exact relationship between them
– have a meaningful zero point, so the numbers originate from a specified point
• Examples: weight, area, speed, length
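The difference a meaningful zero point makes can be checked with a little arithmetic. The sketch below is only an illustration. It assumes temperature in degrees Celsius as the interval variable and length in metres as the ratio variable, and the function name and sample values are made up for the example.

```python
def celsius_to_fahrenheit(c):
    """Convert Celsius to Fahrenheit (changes both the unit and the zero point)."""
    return c * 9 / 5 + 32

# Interval level: differences are meaningful, but ratios depend on the scale.
t1_c, t2_c = 10.0, 20.0
ratio_c = t2_c / t1_c  # 2.0 on the Celsius scale
ratio_f = celsius_to_fahrenheit(t2_c) / celsius_to_fahrenheit(t1_c)  # 68 / 50

# Ratio level: zero means "no length", so ratios survive a change of unit.
len1_m, len2_m = 2.0, 4.0
ratio_m = len2_m / len1_m                    # 2.0 in metres
ratio_cm = (len2_m * 100) / (len1_m * 100)   # 2.0 in centimetres

print(ratio_c, ratio_f)   # the two temperature ratios disagree
print(ratio_m, ratio_cm)  # the two length ratios agree
```

Because the temperature ratio changes with the scale, "twice as hot" is not a meaningful statement for °C or °F, while "twice as long" is meaningful in any unit of length.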
Levels of Measurement and Statistical Methods
• Sometimes the distinctions between levels of measurement get blurred
• An ordinal variable with 10 or more categories may be used as an interval variable if its categories are labelled with numbers
• Example: ratings of a player
• Similarly, interval variables are sometimes treated as ratio variables
Data Presentation
[Diagram: Study → Data → Variables and Cases]
Data Matrix
Variables
Age Weight Bowling Style Average
Player 1 23 65 RAF 45.76
Player 2 27 75 LAF 42.54
Player 3 25 81 LAM 22.84
Player 4 34 69 LAS 38.78
Player 5 33 73 RAS 27.4
Player 6 24 72 RAM 20.3
Player 7 24 85 RAF 44.36
Player 8 20 71 LAF 29.42
Player 9 19 71 LAM 21.09
Player 10 27 79 LAS 42.24
Player 11 28 69 RAS 45.25
Player 12 33 79 RAM 44.16
……. - - - -
……. - - - -
……. - - - -
Player 400 23 81 RAM 45.98
What kind of variables do we have in each column?
Observation: the value of a single variable for a single case, i.e. one cell of the data matrix
Age Weight Bowling Style Average
Player 13 31 76 RAM 23.40
Player 14 32 74 RAF 32.40
Player 15 32 66 LAF 25.00
Player 16 29 77 RAM 49.20
Player 17 29 80 RAF 36.70
Player 18 37 77 LAF 45.90
Player 19 33 85 LAM 45.40
Player 20 28 85 LAS 44.60
Player 21 24 80 RAS 30.30
Player 22 37 74 RAM 32.60
Player 23 25 67 RAF 51.10
Player 24 18 78 LAF 41.00
Player 25 35 75 LAM 28.80
Player 26 23 66 LAS 42.80
Player 27 24 75 RAS 28.40
Player 28 24 69 RAM 42.90
Player 29 34 66 RAM 34.80
Player 30 18 83 RAF 43.30
Player 31 18 81 LAF 25.20
Player 32 30 73 RAM 38.80
Player 33 38 78 RAF 28.50
Player 34 19 72 LAF 45.80
Player 35 18 72 LAM 28.40
Player 36 25 72 LAS 27.40
Player 37 24 75 RAS 50.00
Player 38 31 - RAM 52.00
Player 39 32 83 RAF 36.60
Player 40 32 81 - 27.70
Player 41 27 80 LAM 43.40
Player 42 - 72 LAS 47.30
Player 43 29 71 RAS 53.30
Player 44 20 81 RAM 31.40
Player 45 36 72 RAM 40.50
Player 46 23 69 RAF 28.40
Player 47 24 84 LAF 42.00
Player 48 38 67 RAM 32.90
Cases with missing values (shown as -) may be eliminated in subsequent analysis
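Eliminating cases with missing values can be sketched in a few lines. The example below assumes that "-" in the data matrix marks a missing value, stored here as `None`. The dictionary keys and the helper name `is_complete` are illustrative choices, and dropping the whole case is only one possible way to handle missing data.

```python
# Players 38 to 42 from the data matrix; None stands for a missing value ("-").
cases = [
    {"player": 38, "age": 31, "weight": None, "style": "RAM", "average": 52.00},
    {"player": 39, "age": 32, "weight": 83, "style": "RAF", "average": 36.60},
    {"player": 40, "age": 32, "weight": 81, "style": None, "average": 27.70},
    {"player": 41, "age": 27, "weight": 80, "style": "LAM", "average": 43.40},
    {"player": 42, "age": None, "weight": 72, "style": "LAS", "average": 47.30},
]

def is_complete(case):
    """True when every variable of the case has an observed value."""
    return all(value is not None for value in case.values())

# Keep only the cases without any missing value.
complete_cases = [case for case in cases if is_complete(case)]
kept_players = [case["player"] for case in complete_cases]
print(kept_players)  # [39, 41]
```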
Data Matrix
Age Weight Runs Average
Player 1 35 80 4868 17.77
Player 40 23 76 4950 21.06
Player 41 27 81 2423 19.69
Player 42 19 78 3224 18.73
Player 43 28 69 3712 43.20
Player 44 20 76 4193 23.25
Player 45 21 71 4302 42.88
Player 46 26 72 4242 23.42
Player 47 25 77 4031 32.56
Player 48 28 72 4153 17.73
Player 49 21 78 4522 37.81
Player 50 37 85 4377 37.53
Player 51 30 84 1445 38.94
Player 52 36 71 1912 18.29
Player 53 24 69 4873 41.22
Player 54 24 85 4484 18.88
Player 55 20 67 1175 28.13
Player 56 36 70 5297 18.09
Player 57 27 67 3281 29.52
Player 58 19 65 3647 37.57
Player 59 31 73 3163 20.78
Player 60 27 76 3408 36.79
Player 61 28 69 4833 38.25
Player 62 37 83 4972 44.62
Player 63 21 81 3981 34.77
Player 64 24 84 1240 35.17
Player 65 24 81 3833 19.55
Player 66 24 85 4308 21.34
Player 67 22 72 3440 40.24
Player 68 28 84 1910 36.71
Player 69 32 68 3453 40.14
Player 70 21 85 5130 28.86
Player 71 21 77 1728 21.71
Player 72 24 74 3823 33.97
Player 73 36 70 3950 33.79
Player 74 17 77 3813 38.51
Player 75 35 72 1590 33.22
Player 76 32 83 4015 31.66
Player 77 20 82 3612 43.33
Player 78 28 78 4474 24.92
Player 79 27 66 4528 19.95
Player 80 36 76 3853 17.93
Player 81 24 79 5376 40.34
Player 82 25 78 4557 19.09
Player 83 37 73 5375 18.64
• A huge data matrix gives very little information or insight
• The information is hidden in the data
Frequency Table
• Frequency Tables show how values are distributed over the cases
• First we list all the possible values of the variable; in the next column we count how many cases have each value
Bowling Style Frequency
Left Arm Fast 66
Left Arm Medium 119
Right Arm Fast 140
Right Arm Medium 25
Other 50
Total 400
• The frequencies sum to 400, the number of cases, so we don't have any missing data for this variable
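The two steps above (list the values, then count the cases) can be sketched with Python's standard `collections.Counter`. The raw 400 cases are not listed in these slides, so this illustration assumes only the bowling styles of Players 1 to 12 from the data matrix. The variable names are illustrative.

```python
from collections import Counter

# Bowling styles of Players 1 to 12; None would mark a missing value.
styles = ["RAF", "LAF", "LAM", "LAS", "RAS", "RAM",
          "RAF", "LAF", "LAM", "LAS", "RAS", "RAM"]

# Count how many cases take each value, ignoring missing entries.
frequency = Counter(s for s in styles if s is not None)
total = sum(frequency.values())
missing = len(styles) - total  # 0 means no missing data for this variable

for value in sorted(frequency):
    print(f"{value:<6}{frequency[value]:>3}")
print(f"{'Total':<6}{total:>3}")
```

Comparing the total with the number of cases is the same check the slide makes with the sum of 400.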
• We get percentages by dividing each frequency by the total and multiplying by 100
• The cumulative percentage of a row is the sum of the percentages of all rows above it plus its own percentage
• The first row has nothing above it, so its cumulative percentage equals its own percentage (16.5)
• The next value is the sum of the values above (16.5) plus its own (29.75): 16.5 + 29.75 = 46.25
• Similarly, the next value is 16.5 + 29.75 + 35 = 81.25
• Notice that the last value will always be 100
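The percentage and cumulative percentage columns can be reproduced from the bowling style frequencies. This is a minimal sketch using the frequencies from the table and the standard `itertools.accumulate`. The variable names are illustrative.

```python
from itertools import accumulate

# Frequencies from the bowling style frequency table.
frequencies = {
    "Left Arm Fast": 66,
    "Left Arm Medium": 119,
    "Right Arm Fast": 140,
    "Right Arm Medium": 25,
    "Other": 50,
}

total = sum(frequencies.values())

# Percentage = frequency / total * 100.
percentages = [100 * f / total for f in frequencies.values()]

# Cumulative percentage = running sum of the percentages.
cumulative = list(accumulate(percentages))

for style, pct, cum in zip(frequencies, percentages, cumulative):
    print(f"{style:<18}{frequencies[style]:>5}{pct:>8.2f}{cum:>8.2f}")
```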
• Frequency tables work well with categorical data, i.e. ordinal or nominal
• Let's see how one would look with quantitative data
• We know the first step in building a frequency table is to list all possible values of the variable
• The problem is clear: the list is very long and each value has a very low frequency compared to the total, so the frequency table won't give us the insights we are looking for
• The previous example was discrete data.
• Now let’s try the same with continuous data
• This problem is solved by converting quantitative data into categorical data using intervals
• For this, we divide the total range of values into a manageable number of categories
• In this way we also reduce the number of rows in our frequency table
• Quantitative variables can always be re-coded into categorical variables, but the reverse is not possible
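Re-coding a quantitative variable into intervals can be sketched as follows. The example assumes the ages of Players 1 to 12 from the data matrix and an interval width of 5 years. Both the width and the helper name `interval_label` are illustrative choices, not values given in the lecture.

```python
from collections import Counter

# Ages of Players 1 to 12 from the data matrix.
ages = [23, 27, 25, 34, 33, 24, 24, 20, 19, 27, 28, 33]
WIDTH = 5  # assumed interval width in years

def interval_label(value, width):
    """Map an integer value to the label of the interval that contains it."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"

# Frequency table over intervals instead of individual values.
table = Counter(interval_label(age, WIDTH) for age in ages)

for label in sorted(table):
    print(f"{label:<7}{table[label]:>3}")
```

The twelve ages collapse into four rows. Once only the label "20-24" is stored, the exact age cannot be recovered, which is why the re-coding works in one direction only.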
Graphs
• Frequency tables are useful, but a column of numbers often fails to reveal the important information, especially to a reader who is not paying close attention
[Figure: chart of bowling style; legend: Left Arm Fast, Left Arm Medium, Right Arm Fast, Right Arm Medium, Other]
Bar Graph
Bowling Style Frequency Percentage Cumulative Percentage
Left Arm Fast 66 16.5 16.5
Left Arm Medium 119 29.75 46.25
Right Arm Fast 140 35 81.25
Right Arm Medium 25 6.25 87.5
Other 50 12.5 100
Total 400 100
[Figure: bar graph of Frequency (axis 0 to 160) by bowling style: Left Arm Fast, Left Arm Medium, Right Arm Fast, Right Arm Medium, Other]
Pie Chart vs Bar Graph
• It's easy to see in a pie chart that nearly half of the players are Left Arm bowlers, or that about 35% are Right Arm Fast bowlers; these proportions are hard to read from a bar graph
• In a bar graph we can read the number of Right Arm Fast bowlers, but we can't read that number from the pie chart
[Figure: pie chart and bar graph of Frequency by bowling style (Left Arm Fast, Left Arm Medium, Right Arm Fast, Right Arm Medium, Other); bar graph axis 0 to 160]
• When the number of categories is high, then pie charts
do not reveal much information or insight
• While
[Figure: pie chart and bar graph of Frequency for a variable with many categories; bar graph axis 20 to 120]
Graphs
[Diagram: graphs summarize data, which is either Categorical or Quantitative]
Dot Plots
Player Physical Height (cm)
Player 1 199
Player 2 185
Player 3 158
Player 4 164
Player 5 191
Player 6 187
Player 7 176
Player 8 194
Player 9 184
Player 10 180
[Figure: dot plot of the ten heights; x-axis: Physical Height (cm), 155 to 200]
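A dot plot places one dot per case above its value. The sketch below draws a rough text version, turned on its side, for the ten heights in the table above. The helper name `dot_plot_lines` and the use of `*` for a dot are illustrative choices.

```python
from collections import Counter

# Heights (cm) of the ten players in the table.
heights = [199, 185, 158, 164, 191, 187, 176, 194, 184, 180]

def dot_plot_lines(values):
    """Return one text line per distinct value, with one '*' per case."""
    counts = Counter(values)
    return [f"{value:>3} | {'*' * counts[value]}" for value in sorted(counts)]

lines = dot_plot_lines(heights)
print("\n".join(lines))
```

Every height in this small sample occurs once, so each line carries a single dot. Repeated values would stack more dots on the same line.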
[Figure: dot plot for 150 cases; x-axis: Physical Height (cm), 155 to 200; y-axis: number of dots, 0 to 13]
• The number of dots above a value is the frequency of that value in the data matrix
• Such a plot works well for discrete data, but imagine what it would look like for continuous data
• With continuous data, the probability that an exact value is repeated is essentially zero, so almost every value would get a single dot
Histograms
• Histograms can be used for both discrete and continuous data
• Two transformations are required to go from a dot plot to a histogram
– First, dots are replaced with bars
– Second, the width of a bar represents a range of values rather than a single value
– The height of a bar then represents the frequency of values within that range
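The step from single values to ranges can be sketched by counting how many values fall into each bar. The example assumes the ten heights from the dot plot table, bars that start at 155 cm and a width of 10 cm. The function name `histogram_counts` and these parameters are illustrative.

```python
# Heights (cm) of the ten players from the dot plot table.
heights = [199, 185, 158, 164, 191, 187, 176, 194, 184, 180]

def histogram_counts(values, start, width, n_bins):
    """Count the values in each half-open range [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

counts = histogram_counts(heights, start=155, width=10, n_bins=5)

# Print each bar as its range followed by one '#' per case.
for i, count in enumerate(counts):
    lower = 155 + i * 10
    print(f"{lower}-{lower + 10}: {'#' * count}")
```

Each bar covers a 10 cm range and its length is the number of players whose height falls in that range.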
Histograms vs Frequency Charts
[Figure: Histogram for 150 cases; underlying continuous scale; x-axis: Physical Height (cm), 155 to 200; y-axis: 0 to 13]
[Figure: Frequency Chart for 150 cases; underlying discrete scale; x-axis: Physical Height (cm), 155 to 200; y-axis: 0 to 13]
Histogram bin variation
[Figure: Histogram, bin size = 1; x-axis: Physical Height (cm), 155 to 200; y-axis: 0 to 13]
[Figure: Histogram, bin size = 1; x-axis: Physical Height (cm), ticks 155 to 173; y-axis: 0 to 35]
[Figure: Histogram, bin size = 2.5; x-axis: Physical Height (cm), 155 to 197.5; y-axis: 0 to 35]
[Figure: Histogram, bin size = 5; x-axis: Physical Height (cm), 155 to 195; y-axis: 0 to 60]
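The effect of the bin size can be explored numerically. The 150 heights behind the figures are not listed in the slides, so this sketch assumes the ten heights from the dot plot table, an axis from 155 to 200 cm and the bin sizes 1, 2.5 and 5 named in the figure titles. The function name `bin_counts` is illustrative.

```python
import math

# Heights (cm) of the ten players from the dot plot table.
heights = [199, 185, 158, 164, 191, 187, 176, 194, 184, 180]
START, END = 155, 200  # assumed axis range, read from the figure ticks

def bin_counts(values, start, end, width):
    """Count the values falling in consecutive bins of the given width."""
    n_bins = math.ceil((end - start) / width)
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# For each bin size: (number of bins, non-empty bins, tallest bar).
summary = {}
for width in (1, 2.5, 5):
    counts = bin_counts(heights, START, END, width)
    summary[width] = (len(counts), sum(1 for c in counts if c > 0), max(counts))
    print(width, summary[width])
```

Wider bins give fewer bars and never lower the tallest one, which matches the growing vertical axes in the figures as the bin size increases.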
Shape of the Histogram
[Figure: histogram; x-axis: Physical Height (cm), 155 to 200; y-axis: 0 to 13]