Assignment 1

The document presents an assignment by Madamanchi Chandana that includes statistical analysis and calculations on various datasets, including box plots, five-figure summaries, and frequency distributions. It covers topics such as median, mean, standard deviation, and skewness for different groups of mice, as well as murder rates across states in the U.S. Additionally, R code is provided for data analysis and visualization.

DA241 Assignment

Madamanchi Chandana

220150007

1) q1 = 33, q2 = 54, q3 = 73
a) In a box plot, each quartile interval contains 25% of the data, so the interval containing the middle 50% of the data runs from q1 to q3, i.e. 33 to 73.
b) The median is the middle value of the data, so median = 54. The integer immediately above the median, 55, is greater than approximately 50% of the data.
c) 25% of the data is greater than q3 = 73, so the integer immediately below q3, namely 72, is a value such that approximately 25% of the data is greater than it.
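The quartile facts used above can be checked numerically in R. This is a quick sketch with simulated data (not part of the original assignment): roughly half of all observations fall strictly between the first and third quartiles.

```r
# Simulate data and verify that about 50% of observations
# lie between the first and third quartiles.
set.seed(1)
x <- rnorm(10000)
q <- quantile(x, c(0.25, 0.75))
frac_middle <- mean(x > q[1] & x < q[2])
frac_middle  # close to 0.5
```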
2)
a) The most common outcome is 2 hrs.
b) % of people watching TV for no more than 2 hrs = % of people watching 0, 1, or 2 hrs
= 6.25% + 25.5% + 26% = 57.75%
3)
a) five-number summaries (min, q1, median, q3, max):
i) normal mice: 14, 104, 124.5, 260.25, 655
ii) alloxan diabetic mice: 13, 76.25, 139.50, 251, 499
iii) insulin-treated mice: 18, 45, 82, 132, 465
b) mean & standard deviation:
i) normal mice: 186.1 & 158.8349
ii) alloxan diabetic mice: 181.8333 & 144.8493
iii) insulin-treated mice: 112.8947 & 105.7896
c) skewness:
i) normal mice: 1.503592
ii) alloxan diabetic mice: 1.039132
iii) insulin-treated mice: 2.122482
d) Boxplot

code for 3)
normal_mice <- c(156,282,197,297,116,127,119,29,253,122,349,110,143,64,26,86,122,455,655,14)
alloxanDiabetic_mice <- c(391,46,469,86,174,133,13,499,168,62,127,276,176,146,108,276,50,73)
Insulin_treatment_mice <- c(82,100,98,150,243,68,228,131,73,18,20,100,72,133,465,40,46,34,44)
quantile(normal_mice)
quantile(alloxanDiabetic_mice)
quantile(Insulin_treatment_mice)
mean(normal_mice)
sd(normal_mice)
mean(alloxanDiabetic_mice)
sd(alloxanDiabetic_mice)
mean(Insulin_treatment_mice)
sd(Insulin_treatment_mice)
install.packages("moments")
library("moments")
skewness(normal_mice)
skewness(alloxanDiabetic_mice)
skewness(Insulin_treatment_mice)
boxplot(normal_mice, alloxanDiabetic_mice, Insulin_treatment_mice)
4)
Class interval   Frequency (f)   Relative frequency (f/20)
400 - 600         10              0.50
600 - 800          3              0.15
800 - 1000         2              0.10
1000 - 1200        3              0.15
1200 - 1400        0              0.00
1400 - 1600        0              0.00
1600 - 1800        0              0.00
1800 - 2000        1              0.05
2000 - 2200        1              0.05
(While counting frequencies, the lower limit of each class is excluded and the upper limit is included.)
a)
b)

c)
d) mean = (500*10 + 700*3 + 900*2 + 1100*3 + 1900*1 + 2100*1)/20
= 16200/20
= 810
using the formula given in the question (class midpoints weighted by frequencies).
e) n = 20, n/2 = 10
The median class is 600 - 800.
Here L = 600, n/2 = 10, F = 10, f = 3, w = 200,
so median = 600 + ((10 - 10)/3)*200
= 600
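The grouped-median formula used above, median = L + ((n/2 - F)/f) * w, can be written as a small R helper. This is an illustrative sketch; the function name grouped_median is my own, not part of the assignment. (When the cumulative frequency hits n/2 exactly, as here, either adjacent class convention gives the same answer, 600.)

```r
# Median from a grouped frequency table:
# lower = lower boundaries of the classes, freq = class frequencies,
# width = common class width. F is the cumulative frequency before
# the median class, f the frequency of the median class.
grouped_median <- function(lower, freq, width) {
  n <- sum(freq)
  cum <- cumsum(freq)
  k <- which(cum >= n / 2)[1]          # index of the median class
  F <- if (k == 1) 0 else cum[k - 1]
  lower[k] + ((n / 2 - F) / freq[k]) * width
}

lower <- seq(400, 2000, by = 200)
freq  <- c(10, 3, 2, 3, 0, 0, 0, 1, 1)
grouped_median(lower, freq, 200)  # 600, matching part e)
```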
R code:
collections <- c(2024,1810,1200,1200,1050.3,918.18,858.43,769.89,655.81,623.33,
                 600,586.85,585,565.10,556.74,495.50,475.50,460,456.89,432.50)
hist(collections, main="Histogram", xlab="gross collections in crores",
     ylab="Number of movies", col="green", border="black")
install.packages("HistogramTools")
library(HistogramTools)
PlotRelativeFrequency(hist(collections))
x <- c(500,700,900,1100,1300,1500,1700,1900,2100)
y <- c(10,3,2,3,0,0,0,1,1)
plot(x, y, type="l", main="frequency polygon",
     xlab="collection in crores", ylab="Number of movies")
5)
a) "state" "abb" "region" "population" "total"
b) the murders data frame, with the computed murder_rate column shown alongside:
   state                 abb region        population  total murder_rate
1  Alabama               AL  South            4779736    135   2.8244238
2  Alaska                AK  West              710231     19   2.6751860
3  Arizona               AZ  West             6392017    232   3.6295273
4  Arkansas              AR  South            2915918     93   3.1893901
5  California            CA  West            37253956   1257   3.3741383
6  Colorado              CO  West             5029196     65   1.2924531
7  Connecticut           CT  Northeast        3574097     97   2.7139722
8  Delaware              DE  South             897934     38   4.2319369
9  District of Columbia  DC  South             601723     99  16.4527532
10 Florida               FL  South           19687653    669   3.3980688
11 Georgia               GA  South            9920000    376   3.7903226
12 Hawaii                HI  West             1360301      7   0.5145920
13 Idaho                 ID  West             1567582     12   0.7655102
14 Illinois              IL  North Central   12830632    364   2.8369608
15 Indiana               IN  North Central    6483802    142   2.1900730
16 Iowa                  IA  North Central    3046355     21   0.6893484
17 Kansas                KS  North Central    2853118     63   2.2081106
18 Kentucky              KY  South            4339367    116   2.6732010
19 Louisiana             LA  South            4533372    351   7.7425810
20 Maine                 ME  Northeast        1328361     11   0.8280881
21 Maryland              MD  South            5773552    293   5.0748655
22 Massachusetts         MA  Northeast        6547629    118   1.8021791
23 Michigan              MI  North Central    9883640    413   4.1786225
24 Minnesota             MN  North Central    5303925     53   0.9992600
25 Mississippi           MS  South            2967297    120   4.0440846
26 Missouri              MO  North Central    5988927    321   5.3598917
27 Montana               MT  West              989415     12   1.2128379
28 Nebraska              NE  North Central    1826341     32   1.7521372
29 Nevada                NV  West             2700551     84   3.1104763
30 New Hampshire         NH  Northeast        1316470      5   0.3798036
31 New Jersey            NJ  Northeast        8791894    246   2.7980319
32 New Mexico            NM  West             2059179     67   3.2537239
33 New York              NY  Northeast       19378102    517   2.6679599
34 North Carolina        NC  South            9535483    286   2.9993237
35 North Dakota          ND  North Central     672591      4   0.5947151
36 Ohio                  OH  North Central   11536504    310   2.6871225
37 Oklahoma              OK  South            3751351    111   2.9589340
38 Oregon                OR  West             3831074     36   0.9396843
39 Pennsylvania          PA  Northeast       12702379    457   3.5977513
40 Rhode Island          RI  Northeast        1052567     16   1.5200933
41 South Carolina        SC  South            4625364    207   4.4753235
42 South Dakota          SD  North Central     814180      8   0.9825837
43 Tennessee             TN  South            6346105    219   3.4509357
44 Texas                 TX  South           25145561    805   3.2013603
45 Utah                  UT  West             2763885     22   0.7959810
46 Vermont               VT  Northeast         625741      2   0.3196211
47 Virginia              VA  South            8001024    250   3.1246001
48 Washington            WA  West             6724540     93   1.3829942
49 West Virginia         WV  South            1852994     27   1.4571013
50 Wisconsin             WI  North Central    5686986     97   1.7056487
51 Wyoming               WY  West              563626      5   0.8871131
c)

d) average murder rate = 2.779125
27 states are below the average.
e)

From the bar graph, District of Columbia and Vermont have the highest and lowest murder rates respectively.
f) median population size = 4339367
The population data contains outliers, so the median is the better measure: the mean is pulled by the outliers, whereas the median, being the middle value of the data, changes little in their presence.
g)

h)
The bar graph is in part e).
A bar graph shows the maximum and minimum murder rates.
A box plot gives the five-number summary and also shows the outliers in the data set.
The median cannot be read from a bar graph, whereas a box plot displays it directly.
i) range = 16.13313
The range is the difference between the maximum and minimum values, i.e. the whole data set lies within it, so a large range indicates large variation. Here the range of murder rates is large, so murder rates across the United States vary widely.
R code
install.packages("dslabs")
library(dslabs)
data(murders)
murders
names(murders)
install.packages("dplyr")
library(dplyr)
newdata <- dplyr::mutate(murders, murder_rate = (total*100000)/population)
newdata
install.packages("ggplot2")
library(ggplot2)
ggplot(murders) + geom_histogram(mapping=aes(x=population/100000), binwidth=10, col="black")
dplyr::summarise(newdata, mean=mean(murder_rate))
dplyr::filter(newdata, murder_rate < 2.779125)
ggplot(newdata) + geom_col(mapping=aes(x=murder_rate, y=state), fill="red")
dplyr::summarise(murders, median=median(population))
ggplot(murders) + geom_boxplot(mapping=aes(x=region, y=population))
ggplot(newdata) + geom_col(mapping=aes(x=murder_rate, y=state))
ggplot(newdata) + geom_boxplot(mapping=aes(y=murder_rate))
dplyr::summarise(newdata, range=max(murder_rate)-min(murder_rate))
6) a) The averages will be about the same, because both data sets are drawn from the same population.
b) The medians will also be about the same, for the same reason.
c) The surveyor with 1000 samples is more likely to have the tallest person, because a bigger sample tends to have a larger range.
d) The surveyor with 1000 samples is likewise more likely to have the shortest person, because a bigger sample tends to have a larger range.
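The claim in parts c) and d), that a larger sample tends to contain the more extreme values, can be illustrated with a quick R simulation. This is a sketch under my own assumed population (heights ~ Normal(170, 10)), not part of the assignment:

```r
# Repeatedly draw a sample of 100 and a sample of 1000 from the
# same population, and count how often the larger sample contains
# the taller maximum. (For iid draws the exact probability is
# 1000/1100, about 0.91.)
set.seed(42)
bigger_max <- replicate(500, {
  small <- rnorm(100,  mean = 170, sd = 10)
  big   <- rnorm(1000, mean = 170, sd = 10)
  max(big) > max(small)
})
mean(bigger_max)  # well above 0.5
```

The same simulation with min() in place of max() illustrates part d).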
7) Using the formulae for grouped mean and median:
i) first data set
Class interval   Frequency   Cumulative frequency
-0.5 - 0.5          3            3
0.5 - 1.5         185          188
1.5 - 2.5         365          553
2.5 - 3.5         269          822
3.5 - 4.5         129          951
4.5 - 5.5          36          987
5.5 - 6.5           7          994
6.5 - 7.5           5          999
7.5 - 8.5           1         1000
Mean = (3*0 + 185*1 + 365*2 + 269*3 + 129*4 + 36*5 + 7*6 + 5*7 + 1*8)/1000
= 2503/1000
= 2.503
Here n = 1000, n/2 = 500,
hence 1.5 - 2.5 is the median class.
L = 1.5, F = 188, f = 365, w = 1
median = 1.5 + ((500 - 188)/365)*1
= 2.355
ii) second data set
Class interval   Frequency   Cumulative frequency
-3.5 - -2.5         7            7
-2.5 - -1.5        49           56
-1.5 - -0.5       217          273
-0.5 - 0.5        410          683
0.5 - 1.5         239          922
1.5 - 2.5          72          994
2.5 - 3.5           5          999
3.5 - 4.5           1         1000
Mean = (7*(-3) + 49*(-2) + 217*(-1) + 410*0 + 239*1 + 72*2 + 5*3 + 1*4)/1000
= 66/1000
= 0.066
n = 1000, n/2 = 500,
hence -0.5 - 0.5 is the median class.
L = -0.5, F = 273, f = 410, w = 1
median = -0.5 + ((500 - 273)/410)*1
= 0.054
iii) third data set
Class interval   Frequency   Cumulative frequency
0.15 - 0.25         9            9
0.25 - 0.35        15           24
0.35 - 0.45        48           72
0.45 - 0.55       100          172
0.55 - 0.65       155          327
0.65 - 0.75       216          543
0.75 - 0.85       243          786
0.85 - 0.95       179          965
0.95 - 1.05        35         1000
Mean = (9*0.2 + 15*0.3 + 48*0.4 + 100*0.5 + 155*0.6 + 216*0.7 + 243*0.8 + 179*0.9 + 35*1)/1000
= 710.2/1000
= 0.7102
n = 1000, n/2 = 500,
hence 0.65 - 0.75 is the median class.
L = 0.65, F = 327, f = 216, w = 0.1
median = 0.65 + ((500 - 327)/216)*0.1
= 0.73
a) mean : i > iii > ii
b) median : i > iii > ii
c) i) mean > median
ii) mean > median
iii) median > mean
8)
i) first data set
Class interval   Frequency   mi*fi    fi*(mi - mean)^2
-0.5 - -0.4         1        -0.45     0.418
-0.4 - -0.3         5        -1.75     1.492
-0.3 - -0.2        21        -5.25     4.182
-0.2 - -0.1        49        -7.35     5.876
-0.1 - 0           87        -4.35     5.277
0 - 0.1           160         8        3.424
0.1 - 0.2         184        27.6      0.394
0.2 - 0.3         175        43.75     0.504
0.3 - 0.4         164        57.4      3.874
0.4 - 0.5          93        41.85     5.985
0.5 - 0.6          36        19.8      4.503
0.6 - 0.7          18        11.7      3.705
0.7 - 0.8           6         4.5      1.839
0.8 - 0.9           1         0.85     0.427
Total = 1000   Total = 196.3   Total = 41.904
Standard deviation (sd) = sqrt(sum(fi*(mi - mean)^2)/(n - 1))
= sqrt(41.904/999)
= 0.204
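The grouped standard deviation above can be reproduced with a short R helper over the class midpoints and frequencies. The function name grouped_sd is my own, for illustration; it implements the same formula, sqrt(sum(fi*(mi - mean)^2)/(n - 1)).

```r
# sd from a grouped table: m are class midpoints, f class frequencies.
grouped_sd <- function(mid, freq) {
  n <- sum(freq)
  m <- sum(freq * mid) / n                    # grouped mean
  sqrt(sum(freq * (mid - m)^2) / (n - 1))     # sample sd
}

# Midpoints and frequencies of the first data set:
mid  <- seq(-0.45, 0.85, by = 0.1)
freq <- c(1, 5, 21, 49, 87, 160, 184, 175, 164, 93, 36, 18, 6, 1)
grouped_sd(mid, freq)  # about 0.204, matching the table
```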
ii) second data set
Class interval   Frequency   mi*fi    fi*(mi - mean)^2
-5 - -4             7        -31.5     213.602
-4 - -3            19        -66.5     388.864
-3 - -2            33        -82.5     409.813
-2 - -1           100       -150       637.057
-1 - 0            142        -71       329.805
0 - 1             195         97.5      53.542
1 - 2             180        270        40.783
2 - 3             167        417.5     363.822
3 - 4              94        329       576.274
4 - 5              36        162       434.972
5 - 6              16         88       320.553
6 - 7               9         45.5     269.879
7 - 8               1          7.5      41.938
8 - 9               1          8.5      55.890
Total = 1000   Total = 1024   Total = 4136.798
Standard deviation (sd) = sqrt(sum(fi*(mi - mean)^2)/(n - 1))
= sqrt(4136.798/999)
= 2.034
iii) third data set
Class interval   Frequency   mi*fi    fi*(mi - mean)^2
-12 - -10           3        -33       429.411
-10 - -8           12       -108      1191.375
-8 - -6            31       -217      1966.184
-6 - -4            66       -330      2347.573
-4 - -2           121       -363      1901.308
-2 - 0            168       -168       648.025
0 - 2             212        212         0.274
2 - 4             151        453       625.939
4 - 6             126        630      2052.451
6 - 8              61        427      2222.431
8 - 10             39        351      2518.514
10 - 12            10        110      1007.212
Total = 1000   Total = 964   Total = 16910.7
Standard deviation (sd) = sqrt(sum(fi*(mi - mean)^2)/(n - 1))
= sqrt(16910.7/999)
= 4.114
Order of standard deviations: i < ii < iii.
