BAN5
dplyr commands: filter() chooses rows based on column values. | arrange() changes the order of
the rows #like sort()#. | select() chooses which columns are included. | rename() changes the names
of columns. | mutate() changes the values of columns and creates new columns. | summarise()
collapses each group into a single row #summary() / str() / head() are for inspecting data#. |
group_by() groups rows by one or more columns #a column can be modified in the call#. | ungroup()
removes grouping. | count() counts the rows for each value. | lag() finds the "previous" and lead()
the "next" value, for comparing values behind or ahead of the current one. | Join dplyr commands
with %>% | #uses data.frame() data structures#
ggplot() commands: ggplot(data, mapping = aes(x = , y = )) | geoms include line, point, bar:
geom_line(), geom_point(), geom_bar(). | add color or size. | facet_wrap() splits a plot into
panels by group. | Join ggplot commands with + | #graphs#
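A minimal sketch of how the dplyr verbs and ggplot layers above chain together. It assumes the dplyr and ggplot2 packages are installed and uses the built-in mtcars data set purely for illustration; the new column names (horsepower, hp_per_cyl, mean_mpg, cars, prev_mean) are made up for the example.

```r
library(dplyr)
library(ggplot2)

# mtcars ships with R; it stands in here for any data.frame.
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                             # keep rows by column value
  select(mpg, cyl, hp) %>%                         # keep only these columns
  rename(horsepower = hp) %>%                      # new_name = old_name
  mutate(hp_per_cyl = horsepower / cyl) %>%        # add a computed column
  group_by(cyl) %>%                                # one group per cylinder count
  summarise(mean_mpg = mean(mpg), cars = n()) %>%  # one row per group
  ungroup() %>%
  arrange(desc(mean_mpg)) %>%                      # highest mean first
  mutate(prev_mean = lag(mean_mpg))                # value from the row before

# Scatter plot colored by cylinder count, one panel per gear count.
p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  facet_wrap(~ gear)
```

Calling print(p) draws the plot; by_cyl is an ordinary data frame (tibble) that can be passed on to further commands.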
Histograms: created using the hist(x = data, main = , xlab = ) function. | Box plot: displays the
distribution of data using the five-number summary (minimum, first quartile, median, third
quartile, and maximum): boxplot(x, horizontal = , xlab = , main = ) | plot(x = , y = , pch = , col = rgb( , , , ),
main = , xlab = , ylab = ): Generic command for plotting. | add lines: abline(a = , b = , h = , v = , reg
= , coef = , untf = ) | lines(density(), lwd = , col = ) | #rgb() takes red, green, blue and alpha
values, for example rgb(0, 0, 0, 0.02)# | #graphs distribution#
browseVignettes() is used for more info about packages.
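The base plotting commands above can be tried out as in the sketch below. The data are randomly generated for illustration only, and pdf(NULL) is used here so that nothing is written to disk; drop that line to see the plots in a window.

```r
pdf(NULL)      # null graphics device: no file or window is created
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)   # made-up sample
y <- 2 * x + rnorm(200, sd = 5)       # made-up second variable

# Histogram on the density scale so the density curve fits on top of it.
h <- hist(x = x, freq = FALSE, main = "Histogram of x", xlab = "x")
lines(density(x), lwd = 2, col = "red")

# Horizontal box plot; b$stats holds the five-number summary that is drawn.
b <- boxplot(x, horizontal = TRUE, xlab = "x", main = "Box plot of x")

# Scatter plot with semi-transparent points, then reference lines.
plot(x = x, y = y, pch = 16, col = rgb(0, 0, 1, 0.3),
     main = "y against x", xlab = "x", ylab = "y")
fit <- lm(y ~ x)
abline(reg = fit)                           # fitted regression line
abline(h = mean(y), v = mean(x), lty = 2)   # horizontal and vertical lines
dev.off()
```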
#Intro/Program# Help: Use ? before a command (for example ?mean) for the R Documentation, and use
args() to see the arguments of a command.
Relationships: ! means NOT (exclude), == means equal to, >= means greater than or equal to, <= means
less than or equal to, & means AND, | means OR, $ selects a column of a table.
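A short sketch of these operators used to pick rows from a data frame. The data frame and its columns (name, age, member) are invented for the example.

```r
# Hypothetical data, for illustration only.
people <- data.frame(
  name   = c("a", "b", "c", "d"),
  age    = c(15, 30, 45, 60),
  member = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

adults      <- people[people$age >= 18, ]                     # >= on a column
mid_age     <- people[people$age >= 18 & people$age <= 50, ]  # both conditions
young_or_in <- people[people$age == 15 | people$member, ]     # either condition
non_members <- people[!people$member, ]                       # negation
```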
Import: read_delim() specifies the type of delimiter, such as "|"; skip & locale can be used. |
read_fwf(file = , col_types = , col_positions = fwf_positions(start = , end = , col_names = )) uses
start & end positions to import selected columns; column names can be given for multiple columns. |
read_csv(file = , col_names = , col_types = , skip = ) | #column types always needed# | #quotation
marks# | #sn#municipality#hhincome# | #col types formats examples: ncnn OR # | #skip gives the
number of rows to skip, for example 3 or 8#
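A hedged sketch of read_delim() with a "|" delimiter, a skipped row, and compact column types. It assumes the readr package is installed; the file is a temporary one written just for this example, and its contents are made up (only the column names sn, municipality and hhincome come from the notes above).

```r
library(readr)

# Write a small pipe-delimited file to a temporary path (made-up rows).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "exported for illustration",     # a junk first line that skip = 1 drops
  "sn|municipality|hhincome",
  "1|Alpha|52000",
  "2|Beta|61000"
), path)

# "ncn" = number, character, number: one letter per column.
households <- read_delim(path, delim = "|", skip = 1, col_types = "ncn")
```

The same col_types and skip arguments are accepted by read_csv().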
Remove rows with empty entries (NA) from data with na.omit(). | Use unique() to remove duplicate rows.
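As a small illustration, with an invented data frame that has one missing value and one duplicated row:

```r
# Hypothetical data: row 3 duplicates row 2, row 4 has a missing score.
scores <- data.frame(id = c(1, 2, 2, 3, 4), score = c(10, 20, 20, NA, 40))

no_na    <- na.omit(scores)   # drops the row containing NA
no_dupes <- unique(no_na)     # drops the repeated row
```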
Create a lookup table: Create a variable #lut#. | Fill it with a named list of values #use quotation
marks#. | Index it with the values of an existing column to add a new column.
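One possible reading of these steps, sketched with a named character vector as the lookup table. The codes, region names and the data frame are hypothetical.

```r
# Lookup table: names are the codes, values are the labels.
lut <- c("N" = "North", "S" = "South", "E" = "East")

# Hypothetical data with a code column.
sites <- data.frame(code = c("S", "N", "S", "E"), stringsAsFactors = FALSE)

# Index the lookup table by code to build the new column.
sites$region <- unname(lut[sites$code])
```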
prop.table() calculates the value of each cell in a table as a proportion of all values, or of its
row or column total: prop.table(x = data, margin = #1 = row, 2 = column, default is NULL#)
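For example, using a table of counts built from the built-in mtcars data set (chosen here only for illustration):

```r
# Counts of cars for each combination of gear and cylinder count.
counts <- table(gear = mtcars$gear, cyl = mtcars$cyl)

overall <- prop.table(counts)               # each cell / grand total
by_row  <- prop.table(counts, margin = 1)   # each cell / its row total
by_col  <- prop.table(counts, margin = 2)   # each cell / its column total
```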
Output commands: cat() concatenates its arguments and prints them; it performs much less conversion
than print(). | paste(): Takes elements from multiple vectors and concatenates them into single
strings. | #\n creates a new line#
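A brief sketch of both commands; the strings are arbitrary examples.

```r
name <- "Ann"
n <- 3

greeting <- paste("Hello", name)                    # joined with a space
ids      <- paste("item", 1:3, sep = "_")           # one string per element
joined   <- paste(c("a", "b", "c"), collapse = "-") # collapse to one string

cat("Total:", n, "\n")          # prints the pieces separated by spaces
cat("line one\nline two\n")     # \n starts a new line
```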
The function pnorm() computes probabilities from known bounding values. | The function qnorm()
does the opposite: it returns the bounding value for a known probability. | dnorm() gives the
density | pnorm() gives the distribution function | qnorm() gives the quantile function | rnorm()
generates random deviates.
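The four functions can be compared on the standard normal distribution, as in this short sketch:

```r
p_below   <- pnorm(1.96)            # probability of a value below 1.96
z         <- qnorm(0.975)           # value with 97.5% of the area below it
p_between <- pnorm(1) - pnorm(-1)   # probability between -1 and 1
peak      <- dnorm(0)               # height of the density at 0

set.seed(42)                        # makes the random draws repeatable
draws <- rnorm(5)                   # five random values
```

Here qnorm(pnorm(q)) returns q again, which is the sense in which the two functions are opposites.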
prop.test(x = , n = , conf.level = ) can be used for testing the null that the proportions (probabilities of
success) in several groups are the same, or that they equal certain given values.
table() uses cross-classifying factors to build a contingency table of the counts at each combination of
factor levels. | class = identify class. | as.numeric / as.character / as.factor = convert class. |
colnames = rename columns. | as.POSIXlt = used to convert time & date. | nrow = count rows. |
length = length of object. | write = create file. | table = Table Creation. | merge = Merge Data
Frames. | paste = Concatenate Strings. | cbind = Combine R Objects by Columns #rbind = by Rows#. |
diff = Lagged differences. | difftime = Time Intervals / Differences between 2 times. | round =
Rounding of Numbers. | scan = Read Data Values. | read.table = Reads a file in table format and
creates a data frame from it. | complete.cases = TRUE for each row without missing values #na.omit
removes the other rows#
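A few of the helpers listed above, sketched on small invented objects (all names and values are made up for the example):

```r
a <- data.frame(id = 1:3, x = c(10, 15, 25))
b <- data.frame(id = c(1L, 3L), label = c("low", "high"),
                stringsAsFactors = FALSE)

merged <- merge(a, b, by = "id")            # keeps ids found in both frames
wide   <- cbind(a, y = c(1.234, 5.678, NA)) # adds a column
colnames(wide)[3] <- "value"                # rename the third column
wide$value <- round(wide$value, 1)          # round to one decimal place

steps   <- diff(a$x)                        # differences between neighbors
keep    <- complete.cases(wide)             # logical: row has no NA
cleaned <- na.omit(wide)                    # same rows as wide[keep, ]

start <- as.POSIXlt("2024-01-01 08:00:00", tz = "UTC")
end   <- as.POSIXlt("2024-01-01 17:30:00", tz = "UTC")
gap   <- difftime(end, start, units = "hours")
```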
Calculation CI: #proportion from data# pData calc by s / n #s = successes#n = sample size# | #Standard
Error# SE calc by ((p * (1 - p)) / n)^(1/2) | #Confidence Level# CL | z calc by #use qnorm()#p calc by
(1 - CL) / 2# | #Calculated confidence interval# CI calc by (p - (z * SE), p + (z * SE))
Categorical data, conditions for inference: independence? | at least 10 successes and 10 failures. |
near normal distribution? | use S, F, hist() to check the data set.
Confidence interval: S = successes | F = failures | n = data size | P_data = proportion from data
#observed# | SE #Standard Error# is calc by ((P_data * (1 - P_data)) / n)^(1/2) | CL = confidence
level | p for qnorm() is calc by ((1 - CL) / 2) | Z uses qnorm() #abs() makes it positive# | CI =
calculated confidence interval (P_data - (Z * SE), P_data + (Z * SE))
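The confidence interval steps for a proportion can be sketched as follows. The counts (120 successes out of 200) and the 95% level are made-up inputs, not values from the notes.

```r
S  <- 120     # successes (hypothetical)
n  <- 200     # data size (hypothetical)
CL <- 0.95    # confidence level

P_data <- S / n                              # observed proportion
SE     <- sqrt(P_data * (1 - P_data) / n)    # standard error
Z      <- abs(qnorm((1 - CL) / 2))           # critical value, made positive
CI     <- c(P_data - Z * SE, P_data + Z * SE)
```

With these inputs the interval comes out at roughly 0.53 to 0.67.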
Hypothesis test: H0 = null hypothesis #status quo# | HA = alternative hypothesis #research question# |
assume H0 is TRUE until the data give reason to reject H0 | SE is calc by ((P_given * (1 -
P_given)) / n)^(1/2) #P_given = proportion assumed under H0# | Z is calc by ((P_data - P_given) /
SE) | use pnorm() to get P_value #times 2 for a two-sided HA# | if P_value is smaller than 0.05,
reject H0 | #optional# use prop.test(x = , n = , p = , conf.level = )
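A sketch of the test for one proportion, done by hand and then checked against prop.test(). The counts and the null value of 0.5 are invented; a two-sided alternative is assumed, and correct = FALSE switches off the continuity correction so that prop.test() matches the hand calculation.

```r
S       <- 120    # successes (hypothetical)
n       <- 200    # data size (hypothetical)
P_given <- 0.5    # proportion assumed under H0 (hypothetical)

P_data  <- S / n
SE      <- sqrt(P_given * (1 - P_given) / n)   # SE uses the H0 proportion
Z       <- (P_data - P_given) / SE
P_value <- 2 * pnorm(abs(Z), lower.tail = FALSE)   # two-sided tail area
reject  <- P_value < 0.05

# Same test via prop.test(); its chi-squared statistic equals Z^2 here.
check <- prop.test(x = S, n = n, p = P_given, correct = FALSE)
```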
Numerical data, conditions for inference: independence? | if the sample size is smaller than 30, is
the data near normally distributed? | use n, hist(), boxplot() to check the data set.
Single mean: t distribution | df calc by n - 1 | pop mean estimated by #sample mean# x +/- (t*) SE |
SE calc by (s / (n ^ (1/2)))
Paired: 2 sets | each value in one set is connected to a value in the other | H0 is mean diff = 0 |
HA is mean diff not = 0 | SE diff calc by (s diff / (n diff ^ (1/2))) | T_value calc by ((mean diff
- 0) / SE diff)
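The paired calculation can be sketched by hand and compared with t.test(paired = TRUE). The before and after measurements are made up, and a two-sided alternative is assumed.

```r
# Hypothetical paired measurements on the same eight subjects.
before <- c(72, 68, 75, 80, 66, 71, 77, 69)
after  <- c(70, 65, 74, 76, 66, 68, 75, 66)

d       <- before - after             # one difference per pair
n_diff  <- length(d)
SE_diff <- sd(d) / sqrt(n_diff)
T_value <- (mean(d) - 0) / SE_diff    # H0: mean difference is 0
P_value <- 2 * pt(abs(T_value), df = n_diff - 1, lower.tail = FALSE)

check <- t.test(before, after, paired = TRUE)   # same test in one call
```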
Not paired: 2 sets | each set must meet the inference conditions | mean 1 - mean 2 | difference in
pop means estimated by #diff between sample means# (x1 - x2) +/- (t*) SE | SE diff calc by (((s1 ^2
/ n1) + (s2 ^2 / n2)) ^ (1/2)) | df for t* is the smallest data set size - 1 #n1 - 1 OR n2 - 1# |
T_value calc by ((x1 - x2 - 0) / SE diff)
Confidence interval: X calc by mean() | SD calc by sd() | n is data set size | SE calc by sd / (n ^(1/2)) |
abs() used to make t positive | t uses qt(p = , df = , lower.tail = ) | CI calc by (x - (t * SE), x + (t * SE)) |
#optional# use t.test(x = , conf.level = )
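A sketch of the confidence interval for a single mean, computed step by step and then with t.test(). The sample values and the 95% level are hypothetical.

```r
x  <- c(12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5, 9.5, 11.0)  # made up
CL <- 0.95

X  <- mean(x)
SD <- sd(x)
n  <- length(x)
SE <- SD / sqrt(n)

# Lower-tail quantile is negative, so abs() turns it into the positive t.
t_star <- abs(qt(p = (1 - CL) / 2, df = n - 1, lower.tail = TRUE))
CI     <- c(X - t_star * SE, X + t_star * SE)

check <- t.test(x = x, conf.level = CL)$conf.int   # same interval
```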
Hypothesis test: X2 calc by mean() | SD2 calc by sd() | n2 is data set 2 size | df calc by smallest data
set size - 1 | SE diff calc by ((sd^2)/n + (sd2^2)/n2) ^ (1/2) | T_value calc by ((x - x2) - 0)/SE diff |
P_value calc by pt(q = , df = , lower.tail = ) and times 2 | if P_value is smaller than 0.05, reject H0 |
#optional# use t.test(x = , y = , alternative = , paired = )
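A sketch of the two-sample test by hand, using the conservative df rule from the notes (smallest data set size minus 1). Both samples are invented. Note that t.test() uses the Welch approximation for df by default, so its t statistic matches the hand calculation but its P value is usually a little smaller.

```r
# Hypothetical samples from two independent groups.
x  <- c(23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.7, 24.9)
x2 <- c(21.0, 22.5, 20.8, 23.1, 21.9, 22.2, 20.5)

n  <- length(x)
n2 <- length(x2)

SE_diff <- sqrt(sd(x)^2 / n + sd(x2)^2 / n2)
T_value <- ((mean(x) - mean(x2)) - 0) / SE_diff    # H0: no difference
df      <- min(n, n2) - 1                          # conservative df
P_value <- 2 * pt(q = abs(T_value), df = df, lower.tail = FALSE)
reject  <- P_value < 0.05

# Built-in version (Welch df), for comparison.
welch <- t.test(x = x, y = x2, alternative = "two.sided", paired = FALSE)
```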