BA Notes

The document discusses various R packages, functions and statistical tests for data analysis. It covers popular packages like dplyr and ggplot2 for data manipulation and visualization. It also covers common base R functions and statistical tests for different data types and analyses like correlation, comparison of groups, linear models and more.


Packages:

• dplyr: A powerful package for data manipulation in R. It provides verbs like filter,
select, mutate, and arrange to efficiently clean, transform, and summarize data
frames.
• pacman: A package management helper for R. Its p_load function installs a package if it is missing and then loads it, streamlining the base install.packages/library workflow.

• stringr: Offers a collection of functions for string manipulation in R. It provides tools for
cleaning, extracting, replacing, and formatting text data.

• tm: A text mining package for manipulating and analyzing text data in R. It provides
tools for cleaning, tokenization (splitting text into words), stemming/lemmatization
(reducing words to their base forms), document-term matrix creation, and more.

• ggplot2: A popular package for creating elegant and customizable statistical graphics in
R. It offers a grammar-based approach to build complex plots.

• RColorBrewer: Provides color palettes designed for scientific visualization, ensuring good readability and differentiation between colors in plots.

• wordcloud: Generates word clouds, a visual representation of word frequencies, from text data. Words with higher frequencies appear larger in the cloud.

• SnowballC: Implements Snowball stemming algorithms for various languages, reducing words to their base forms (e.g., "running" to "run").

• NLP: An R package providing basic classes and infrastructure for natural language processing; tm depends on it. More broadly, the term covers the text-analysis packages mentioned here, such as tm and SnowballC.

• plyr (superseded): The predecessor of dplyr, offering similar data manipulation functionality but generally slower and less consistent. Prefer dplyr for most data wrangling tasks.
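A minimal sketch combining the two workhorse packages above, using R's built-in mtcars data (the column choices and derived ratio are illustrative, not from the notes):

```r
library(dplyr)
library(ggplot2)

# dplyr verbs: keep 6-cylinder cars, pick columns, derive a ratio, sort
six_cyl <- mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, wt, hp) %>%
  mutate(hp_per_wt = hp / wt) %>%
  arrange(desc(hp_per_wt))

# ggplot2 grammar: data + aesthetic mappings + geom layers
ggplot(six_cyl, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```

Each dplyr verb takes a data frame and returns a data frame, which is what makes the pipe (%>%) chain work.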

Base R Functions:

• c: Combines values and vectors into a single vector.
• data.frame: Creates a data frame, a tabular structure that holds different data types in columns.
• read.delim: Reads data from a delimited file (tab-delimited by default; use read.csv for CSV) into an R data frame.
• names: Gets or sets the names of columns in a data frame.
• head: Returns the first few rows of a data frame or vector.
• tail: Returns the last few rows of a data frame or vector.
• merge: Merges two data frames based on common columns (dplyr's join functions, such as inner_join, are a popular alternative).
• t: Transposes a matrix or data frame, swapping rows and columns.
• attach: Attaches a data frame or environment to the search path so its columns can be referenced by name (generally discouraged due to potential naming conflicts; prefer referring to columns via the data frame's name).
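The base R functions above can be sketched with two tiny illustrative data frames (the column names are made up for the example):

```r
# c() builds vectors; data.frame() assembles them into tables
scores <- data.frame(id = c(1, 2, 3), score = c(85, 92, 78))
info   <- data.frame(id = c(1, 2, 3), group = c("A", "B", "A"))

merged <- merge(scores, info, by = "id")  # join on the shared 'id' column
names(merged)    # "id" "score" "group"
head(merged, 2)  # first two rows
tail(merged, 1)  # last row
t(merged)        # transpose: columns become rows
```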

Statistical Functions:

• mean(conc): Calculates the mean (average) of a vector or numeric column in a data frame.
• range(conc): Returns the minimum and maximum values in a vector or numeric column.
• quantile(conc): Computes quantiles (e.g., quartiles) of a vector or numeric column.
• sd(conc): Calculates the standard deviation of a vector or numeric column.
• var(conc): Calculates the variance of a vector or numeric column.
• skewness(conc) (moments package): Measures the asymmetry of a distribution (positive skew = right-tailed, negative skew = left-tailed).
• kurtosis(conc) (moments package): Measures the "peakedness" (tailedness) of a distribution compared to a normal distribution.
• moments: A package providing functions for statistical moments (mean, variance, skewness, kurtosis, etc.) of a vector or numeric column.
• shapiro.test: Tests for normality (whether a sample plausibly comes from a normal bell curve); a small p-value suggests non-normality.
• qqnorm/qqline (or the qqplotr package): Creates a quantile-quantile (Q-Q) plot to visually compare a distribution to a reference distribution (often normal).
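A quick sketch of these descriptive statistics using the conc column of R's built-in CO2 dataset (which appears to be the data the notes reference); skewness and kurtosis assume the moments package is installed:

```r
library(moments)  # provides skewness() and kurtosis(); not in base R

conc <- CO2$conc    # ambient CO2 concentrations from the built-in data

mean(conc)          # average
range(conc)         # minimum and maximum
quantile(conc)      # quartiles by default
sd(conc)            # standard deviation
var(conc)           # variance
skewness(conc)      # asymmetry of the distribution
kurtosis(conc)      # tailedness relative to a normal distribution
shapiro.test(conc)  # small p-value suggests non-normality
```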

Linear Model Functions:

• t.test(uptake, mu=45): Performs a one-sample t-test to compare the mean of uptake to a specified value (45 here).
• t.test(uptake, mu=45, alternative="less", conf.level=0.95): Performs a one-tailed t-test to check if the mean of uptake is less than 45 with a 95% confidence level.
• lm (Linear Model): Estimates the straight-line relationship between a continuous variable you're predicting (dependent) and one or more influencing variables (independent). Think of it as a straight ruler measuring how things change together.
• glm (Generalized Linear Model): A more flexible modeller for non-linear relationships and different data types (e.g., yes/no outcomes, counts). Imagine it as a bendy measuring tape that can handle curves and different shapes.
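The tests and models above can be run on the built-in CO2 data (its uptake and conc columns match the variable names in the notes):

```r
uptake <- CO2$uptake  # CO2 uptake rates from the built-in data

# Two-sided one-sample t-test against mu = 45
t.test(uptake, mu = 45)

# One-tailed: is the mean less than 45?
t.test(uptake, mu = 45, alternative = "less", conf.level = 0.95)

# Straight-line model: uptake as a function of concentration
fit <- lm(uptake ~ conc, data = CO2)
summary(fit)

# glm with the default gaussian family reproduces lm;
# other families (binomial, poisson) handle yes/no and count outcomes
gfit <- glm(uptake ~ conc, data = CO2)
coef(gfit)
```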

ANOVA and Multiple Comparison Functions:

• aov: Performs analysis of variance (ANOVA) to compare means between groups. It is a wrapper around lm that presents results in the classical ANOVA format.
• tapply: Applies a function to subsets of data based on a grouping variable.
• TukeyHSD: Performs Tukey's Honestly Significant Difference (HSD) test, a post-hoc multiple comparison test after ANOVA to see which groups differ significantly.
• plot(TukeyHSD(anovairis)) (assuming anovairis is an ANOVA object): Plots the Tukey HSD confidence intervals for each pairwise difference; intervals that exclude zero indicate significant differences.
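A sketch of the ANOVA workflow above on the built-in iris data (the iris example and the anovairis name follow the notes):

```r
# One-way ANOVA: does mean sepal length differ across iris species?
anovairis <- aov(Sepal.Length ~ Species, data = iris)
summary(anovairis)

# Group means via tapply, without fitting a model
tapply(iris$Sepal.Length, iris$Species, mean)

# Post-hoc pairwise comparisons with Tukey's HSD
TukeyHSD(anovairis)
plot(TukeyHSD(anovairis))  # intervals excluding 0 indicate significant pairs
```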

Correlation and Visualization Functions:

• corrplot (corrplot package): Creates a correlation matrix plot, visualizing pairwise correlations between variables.
• corrgram (corrgram package): Similar to corrplot; creates a correlogram of pairwise correlations.
• psych: A package offering various psychometric functions, including correlation analysis (e.g., corr.test).
• lattice: A package for creating trellis graphics, including correlation-style plots.
• cor.ci (psych package): Computes confidence intervals for correlation coefficients.
• splom (lattice package): Creates a scatter plot matrix, visualizing relationships between all pairs of variables.
• cor.test: Performs a correlation test to assess the statistical significance of a correlation coefficient.
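A short sketch of cor.test and splom on the numeric iris columns (the choice of columns is illustrative):

```r
# Significance test for the Pearson correlation between two measurements
cor.test(iris$Sepal.Length, iris$Petal.Length)

# Scatter plot matrix of all numeric iris columns (lattice package)
library(lattice)
splom(iris[, 1:4])
```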

Text Mining (tm) Package Functions:

• VCorpus: An in-memory (volatile) collection of text documents in tm. Think of it as a temporary workspace for your text data: fast to process, but gone when you close R. Use it for smaller datasets or initial text cleaning and analysis.
• tm_map(): This function is like a "for loop" for your corpus. It allows you to apply the
same cleaning or transformation operation (like removing punctuation) to each document
in the corpus efficiently.
• removePunctuation(), removeNumbers(), removeWords(): These functions are your
text cleaning crew. They help you get rid of unwanted elements like punctuation marks,
numbers, and common stop words (e.g., "the", "a").
• stemDocument(): This function reduces words to their base forms (e.g., "running"
becomes "run"). Think of it as decluttering your vocabulary for analysis.
• stripWhitespace(): This function removes extra spaces between words, ensuring a
cleaner representation of your text data.
• PlainTextDocument(): This function converts raw text data into a format suitable for
analysis within the tm package. It's like preparing your text for processing.
• DocumentTermMatrix(): This function creates a table summarizing the frequency of
words (terms) across all documents in your corpus. It's like a giant word count table,
revealing patterns of word usage.
• removeSparseTerms(): This function helps you deal with infrequent words in your
document-term matrix. By removing words that appear in very few documents, you can
focus on the more relevant terms and reduce noise.
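The tm pipeline above can be sketched on a two-document toy corpus; this assumes the tm and SnowballC packages are installed (stemDocument relies on SnowballC):

```r
library(tm)

# A tiny in-memory corpus run through the usual cleaning pipeline
docs   <- c("Running fast, running FAR!", "The 3 runners ran 5 miles.")
corpus <- VCorpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))   # normalize case
corpus <- tm_map(corpus, removePunctuation)              # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                  # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stemDocument)                   # "running" -> "run"
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra spaces

dtm <- DocumentTermMatrix(corpus)  # term frequencies per document
inspect(dtm)
```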

Other Functions:

• abline: Adds a straight line to an existing plot: horizontal (h=), vertical (v=), or with a given intercept and slope. Often used to emphasize specific values (e.g., mean, median) or to overlay a fitted regression line.
• wilcox.test: Performs a non-parametric Wilcoxon test: the rank-sum (Mann-Whitney) test for two independent samples, or the signed-rank test for paired data (with paired=TRUE).
• kruskal.test: Performs a non-parametric Kruskal-Wallis test for comparing medians
across multiple groups.
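A sketch of these functions on the iris data (the comparisons chosen are illustrative):

```r
# Rank-sum (Mann-Whitney) test: two independent groups, no normality assumed
wilcox.test(iris$Sepal.Length[iris$Species == "setosa"],
            iris$Sepal.Length[iris$Species == "versicolor"])

# Kruskal-Wallis test: three or more groups
kruskal.test(Sepal.Length ~ Species, data = iris)

# abline: mark the overall mean on a histogram with a vertical line
hist(iris$Sepal.Length)
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 2)
```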

Parametric vs. Non-Parametric Tests


Choosing the right statistical test depends on whether your data follows a normal distribution
(parametric) or not (non-parametric). Here's a simplified guide:

Measuring Correlation (Strength and Direction of a Relationship):

• Parametric: Use Pearson's correlation. It assesses linear relationships between two continuous variables.
• Non-parametric: Use Spearman's rank correlation. It works for both continuous and ordinal data, measuring the strength and direction of a monotonic relationship.
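A tiny made-up example showing the difference: for data that rise monotonically but not linearly, Spearman's coefficient is exactly 1 while Pearson's falls short of 1.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 8, 16, 32)  # monotonic but non-linear (doubles each step)

cor(x, y, method = "pearson")   # < 1: relationship is not a straight line
cor(x, y, method = "spearman")  # 1: ranks agree perfectly
```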

Comparing Groups (Pairwise):

• Parametric: Use a t-test. It compares the means of two independent groups assuming
normally distributed data.
• Non-parametric: Use the Mann-Whitney U test. This test compares the medians of two
independent groups without assuming normality.

Comparing Multiple Groups:

• Parametric: Use ANOVA (Analysis of Variance). It compares the means of three or more independent groups assuming normally distributed data.
• Non-parametric: Use the Kruskal-Wallis test. This test compares the medians of three or more independent groups, making no assumptions about normality.

Comparing Paired Samples:

• Parametric: Use a paired t-test. It compares the means of two related groups assuming
normally distributed data.
• Non-parametric: Use the Wilcoxon signed-rank test. This test compares the medians
of two related groups without assuming normality.

These tests provide essential tools for understanding relationships and making informed
conclusions from your data, considering its underlying characteristics.
