Session 1 Course Overview and Intro to R
Agenda
Session 1
1. Introduction: Context and Importance of Statistics
2. The Process of Statistics
3. Course Overview
• Aims, References, and Requirements
4. Data Collection Methods
5. Introduction to Exploratory Data Analysis
6. Overview of Statistical Inference
7. Introduction to R
• Basic Operations and Data Structures
• Installing and Loading Packages
0. Introduction: Context and
Importance of Statistics
“You cannot manage what you don’t measure”
- Bill Hewlett, co-founder of Hewlett-Packard (paraphrasing Lord Kelvin)
Advances in ICT
• Synergy between computing & communications
• Implications:
– More DATA collected
– More DATA stored
– More DATA accessible and distributed
Information is power!
DATA: “the new oil”, a driver of growth and change
0. Introduction: Context and
Importance of Statistics
(Figure: hyperconnectivity)
0.1. Tsunami of Data
Awash in a flood of data: “drowning in numbers”
• From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data. By 2012, data was reported to have been doubling every 40 months since the 1980s, with about 5 exabytes then being created every 2 days.
• In 2024, 625 million videos are watched on TikTok every 60 seconds. Netflix’s 278 million subscribers watch 650 million hours of content every day.
• Every minute, 231 million emails are being sent and 6.3 million searches are made.
0.2. Big Challenge is not to be
DRIP (Data Rich but Information Poor)
• Transforming data into information
0.3. Importance of Statistics
• Statistics translates data into decisions,
theories, and knowledge
• Essential skill in the 21st century as data
increases in every field
• Employees are expected to take larger roles in
drawing insights from data
• Statistics extends to all parts of the scientific
decision-making process
• Crucial in any field that intends to use data to
reach conclusions
1. Fundamentals of Statistics
Course
• Course Objective: To
understand and apply
basic statistical methods
– Master basic concepts on
probability and statistics
– Develop skills using
statistical environment R
– Interpret results from statistical methods applied with R
1. Fundamentals of Statistics Course
• Syllabus overview: 14 weeks, covering descriptive and inferential statistics
• Textbook: Basic Statistics With R (Reaching Decisions With Data) by Stephen C. Loftus (2022)
• Computing tools: R, RStudio
• Assessment: Problem sets (20%), Quizzes (50%), Final exam (30%)
1.1. Course Coverage
• Descriptive statistics (Weeks 1-4)
• Probability and probability models, including the normal distribution (Weeks 5-9)
• Sampling & estimation (Weeks 10-11)
• Hypothesis tests (Weeks 11-12)
• Regression analysis (Weeks 12-13)
• Final exam (Week 14)
1.2. Class Policies
• Attendance and
participation
requirements
• Academic integrity
and collaboration
guidelines
• Communication
channels and office
hours
Quotable Quote
“The alternative to good statistics is not "no
statistics", it's bad statistics. People who argue
against statistical reasoning often end up
backing up their arguments with whatever
numbers they have at their command, over- or
under-adjusting in their eagerness to avoid
anything systematic.” – Bill James
2. Statistics
• Science of collecting, analyzing, interpreting, and presenting data
• Framework for data-driven insights
• Process of uncovering patterns in data amid uncertainty
• Crucial for informed decision-making
• Developed from the needs of government, probability theory, and computing advances
• Essential in today's data-driven world for extracting insights
2.1. The Process of Statistics
1. Hypothesis/Questions: Formulate questions or develop
hypotheses about phenomena
2. Data Collection: Gather relevant data through experiments
or observational studies
3. Data Description: Summarize data using descriptive statistics
and visualizations
4. Statistical Inference: Draw conclusions about populations
based on sample data
5. Theories/Decisions: Develop theories or make decisions
based on statistical results
• This process is cyclical, continuously seeking better
solutions and decisions
2.1.1. Realities about Statistics
• The man in the street distrusts statistics and
despises [his image of] statisticians, those who
diligently collect irrelevant facts and figures
and use them to manipulate society.
“There are three kinds of lies: lies, damned lies, and statistics”
– Mark Twain
2.1.2. Florence Nightingale on Statistics
• “...the most important science in the whole
world: for upon it depends the practical
application of every other science and of
every art: the one science essential to all
political and social administration, all
education, all organization based on
experience, for it only gives results of our
experience.”
• “To understand God's thoughts, we must
study statistics, for these are the measures
of His purpose.”
Purposes of Using Statistics
(Statistics, not Stat-is-eeks)
• USE DATA to
– Describe
– Explain
– Predict
– Make Decisions
The statistical cycle: the world before analysis → data collection → data organization & preliminary analysis → data interpretation → the world after analysis
Data is the foundation of any statistical analysis.
(Figure: Data analytics)
2.1.3. The Role of Statistics in
Decision-Making
2.1.4. Examples of Statistics in
Everyday Life
• Weather forecasts
• Sports performance metrics
• Political polls
• Health and fitness tracking
• Product ratings and reviews
• Financial planning and investment
2.1.5. Overview of Statistics Application in
This Course
Throughout this course, you will learn to apply statistical concepts
and techniques to real-world problems. Here's an overview of how
we will approach this:
1. Data Collection and Presentation:
– Design surveys and experiments
– Collect and organize data
– Create effective visual representations (charts, graphs)
2. Descriptive Statistics:
– Calculate and interpret measures of central tendency and variability
– Analyze distributions and identify outliers
– Examine relationships between variables
3. Probability and Sampling:
– Apply basic probability concepts
– Understand sampling distributions
2.1.5. Overview of Statistics Application in
This Course
4. Inferential Statistics:
– Construct and interpret confidence intervals
– Perform hypothesis tests
– Analyze relationships using correlation and regression
5. Practical Applications:
– Work with real-world datasets
– Use statistical software (R) for analysis
– Interpret results in context of business and management
scenarios
– Develop critical thinking skills for data-driven decision making
By the end of this course, you will be equipped with the statistical tools and knowledge to analyze data, draw meaningful conclusions, and make informed decisions in various professional settings.
2.2. Basic Statistical Concepts
• Population vs Sample: All items of interest vs
a subset (or portion)
• Parameter vs Statistic: Population summary
measure vs Sample summary measure
2.2. Basic Statistical
Concepts
• Data: the bedrock of Statistics
– Data are facts or figures about an object or phenomenon that can be used for reporting, calculations, planning, or analysis. We can observe, describe, measure, and count objects around us. We can also record what we find about the weather, the environment, and the economy. This is what data is.
– Data can be anything from numbers and text to images and videos.
– When you keep track of how many books you read each month or how many friends you have, you are already collecting data.
Descriptive statistics
2.2.1. Descriptive Statistics
• Collect data
– e.g., Survey
• Present data
– e.g., Tables and graphs
• Characterize data
– e.g., Sample mean = (Σ Xᵢ) / n
Examples
• Market basket analysis
• Discovers associations between products
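The sample-mean formula above can be sketched in R with a small hypothetical dataset:

```r
# Hypothetical sample of n = 6 household sizes
x <- c(2, 4, 3, 5, 4, 6)

# Sample mean computed from the formula: (sum of the x_i) / n
n <- length(x)
sum(x) / n   # 4

# R's built-in equivalent
mean(x)      # 4
```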
2.2.2. Inferential Statistics
• Estimation
– e.g.: Estimate the population proportion living in poverty using the sample proportion
• Hypothesis testing
– e.g.: Test the claim that the population poverty incidence is 15 percent (when the sample proportion is 15.5%)
Inference is drawing conclusions and/or making decisions concerning a population based on sample results.
2.2.2.1. Poverty Monitoring in PH
(Figure) Income poverty rates and GDP per capita, 2000 to 2023: proportion of the population living below the national poverty line (%) plotted against GDP per capita (current USD).
• The proportion of the Filipinos in poverty rose to 18.1% in 2021, up from 16.7% in
2018, thus partially reversing earlier gains in poverty reduction from 2006 to 2018
and highlighting the pandemic's impact on vulnerable populations (PSA 2022).
• Recently released data shows that the poverty incidence has reduced to 15.5% in
2023.
• The government is targeting the poverty rate to fall to nine percent by 2028.
2.3. Why We Need Data
• Measure performance
• Evaluate standards
• Input for studies
• Support decision making
2.4. Components of Data
• Objects (observational units)
– rows in a data frame
• Variables (characteristics measured)
– columns in a data frame
• Scales (measurement schemes)
2.5. Types of Variables
Variables are either Quantitative (Numerical) or Categorical.
• Quantitative, Discrete: countable values, often whole numbers. Examples: number of children in a family (1, 2, 3, etc.), number of cars in a parking lot, number of books in a library.
• Quantitative, Continuous: values measured on a continuous scale.
• Categorical: groups or categories without a numerical order. Examples: Sex (Male, Female), Blood type (A, B, AB, O), Eye color (Brown, Blue, Green).

2.6. Sources of Data
• Primary: data collection
• Secondary: data compilation (print or electronic)
2.6.1. Traditional Data Sources
Official statistics are typically sourced from surveys (censuses and sample surveys) and administrative data; new data sources such as big data are also being experimented with to fill “data gaps”.
A survey is a systematic method for gathering
information from a target population of
interest for purposes of producing quantitative
descriptors of the attributes of the population (
Groves et al. 2009).
2.6.1. Traditional Data Sources
a) A census is a survey whose data collection involves a complete enumeration of the population. Although a census can provide reliable baseline data on the structure and key characteristics of the target population, its scope and range make it costly to conduct. Furthermore, the large number of entities enumerated in a field operation predisposes censuses to data collection errors. Errors also arise from differences in understanding of concepts, definitions, and instructions among both field enumerators and respondents. As with other data sources, changes in census methodologies can also make it challenging to compare data from one census to another.
2.6.1. Traditional Data Sources
Data Source: Census
Advantages:
• Complete enumeration
• Source of statistics for the entire population
• Provides basis for area and/or list frames of sample surveys
Disadvantages:
• Costly
• Requires robust staffing
• Restricted periodicity
• Longer lag time in producing results
2.6.1. Traditional Data Sources
– Often, data are collected in a survey only from a subset of the
population, and in this case, the survey is called a sample
survey. When sample surveys use probability sampling for
selecting respondents, each respondent represents a certain
number of entities in the larger aggregate of entities to which
the respondent belongs. Consequently, we can make
statistically valid conclusions (called inferences) about the
entire population, particularly the attributes of the population
using information from the sample survey.
2.6.1. Traditional Data Sources
Survey statistics can be used as estimates of the corresponding
descriptors of the population parameters. For instance, in a
labour force survey (LFS), the proportion of sample respondents
who have a job can be an estimate of the share of the entire
population who have a job.
2.6.1. Traditional Data Sources
Precision of estimators is measured by the Standard Error (SE).
• How large a sample would be necessary to estimate the true proportion within ±3%, with 95% confidence? (Use the conservative value p = 0.5 in the Standard Error.)
Solution: the standard error for estimating p is SE = sqrt(p(1 − p)/n); setting the margin of error e = z·SE and solving for n gives n = z² p(1 − p) / e².
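That sample-size computation can be sketched in R (the ±3%, 95% confidence, p = 0.5 setup from the slide):

```r
e    <- 0.03                       # desired margin of error (+/- 3%)
conf <- 0.95                       # confidence level
z    <- qnorm(1 - (1 - conf) / 2)  # z = 1.96 for 95% confidence
p    <- 0.5                        # conservative guess: maximizes p * (1 - p)

# n = z^2 * p * (1 - p) / e^2, rounded up to a whole respondent
n <- ceiling(z^2 * p * (1 - p) / e^2)
n  # 1068
```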
2.6.1. Traditional Data Sources
Data Source: Sample Survey
Advantages:
• Relatively easy to administer
• Cost-effective
• Wider scope
Disadvantages:
• Non-responses
• Sampling errors could be large, especially for sub-national disaggregates
• Response biases, coverage biases, and other non-sampling errors
• Need for adequately trained statistical human resources
2.6.1. Traditional Data Sources
– Administrative data are data holdings collected typically by
line ministries and local governments for the purposes of
administering taxes, benefits or services. Processes in an
administrative system typically involve registration, transaction
and record keeping which derive data as by-products.
Examples include health, pension and employment data in a
social security system; income or expenditure records of tax
authorities; data on registered unemployment, active labour
market programs, social benefits from a social protection
program; labour inspection records (pertaining to
occupational injuries). Administrative data, unlike surveys that
are designed for statistical purposes, are often mere by-
products of their original purpose of registration, transaction,
and record keeping in the administration of taxes, benefits or
44
2.6.1. Traditional Data Sources
Data Source: Administrative data
Advantages:
• Full count of clients of the administrative system
• Better data coverage and availability
• Low-cost data collection
• Reduced response burden on data suppliers
• Timely statistical outputs
• Up-to-date and more frequent, often longitudinal, data
Disadvantages:
• Not designed for statistical purposes
• Units may not satisfy data user needs or may use a definition of unit incompatible with other data sources
• Scope of system may be too narrow by design, or too broad, including groups not of interest
• Needs strong coordination among NSOs, other government agencies, and other data owners
• Confidentiality issues
• Missing data
• Different time periods
2.6.2. Big Data
• Big data is not readily defined (the term was first used in the mid-1990s in lunch conversations in Silicon Valley).
• It refers to digital data by-products (“exhaust”) from electronic gadgets, internet search/social media, sensors, and tracking devices.
• Such data are increasing due to the increased capacity to collect, store, retrieve, use, and re-use data.
• While big data has no single definition, these digital footprints have 3 Vs (Gartner, 2001): volume, velocity, and variety, plus two extra Vs (veracity and value) for 5 Vs in all.
2.6.2. Big Data
Data Source: Big data
Advantages:
• Large volume of data
• Wide variety of data types
• Timely (in fact, near real-time) data
• Improves accuracy and granularity of statistics
Disadvantages:
• Data privacy and security
• Accessibility
• Challenges in technological infrastructure
• Requires new skill sets of human resources for management and analytics
• Coverage and representativeness
2.6.2.1. Examples of Big Data Analytics
(Figures: case examples, including official estimates vs. small area estimation (SAE) EBLUP estimates)
2.6.2.2. Utilizing Big Data in Business
• Predictive Modeling, Association Rules, and Collaborative Filtering: Amazon uses its customer database to inform clients that “customers who bought Product A also bought Product B, and Product C …”
• Sentiment Analysis: Social media data, such as tweets on Twitter, are scrutinized for the “polarity” (i.e., positive, negative, or neutral) of sentiments on a product.
• Text Analysis: In a Japanese call center, agents input “what customers say” and instructions on “what to say” are then shown to agents on their workstations.
• Frontier technologies of Industry 4.0 are changing business models (the rise of the Platform Economy) and making more use of data, especially big data.
(Source: https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/)
2.6.3. Data Collection Methods
• Observational Studies:
– Passive data collection
– Less expensive, fewer ethical concerns
– Cannot establish causality
– Examples: Surveys, existing data analysis
• Designed Experiments
– Active data collection
– More expensive, more ethical concerns
– Can establish causality
– Examples: Clinical trials, A/B testing
• Sampling Techniques
– Simple Random Sampling (SRS)
– Other methods: Stratified, Cluster, etc.
2.7. Survey Research
• Choose response mode
• Identify categories
• Formulate clear questions
• Pilot test the survey
2.7.1. Survey Response Modes
• Personal interview
• Telephone interview
• Mail survey
• Online survey
2.7.2. Survey Questionnaires
• Use clear, unambiguous language
• Avoid leading questions
• Use universally accepted definitions
• Cover all possible response options
• Have a “cover” letter
– State survey goals and purpose
– Explain importance of response
– Assure anonymity
– Offer incentives for participation
2.7.3. Reasons for Sampling
• Less time consuming than a census
• Lower cost
• More practical for large populations
2.7.4. Types of Sampling Methods
• Probability samples
– Simple random:
• Equal chance of selection for each unit
• Can use random number table or generator
• Selection with or without replacement
– Systematic
• Select every kth unit from population
• k = population size / sample size
• Random start point
Example: N = 64, n = 8, so k = 8; choose a random start within the first group of 8, then select every 8th unit.
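A minimal sketch of both probability schemes in base R, using the N = 64, n = 8 (so k = 8) example from the slide:

```r
set.seed(123)                  # for reproducibility
N <- 64; n <- 8

# Simple random sampling: every unit has an equal chance of selection
srs <- sample(1:N, size = n, replace = FALSE)

# Systematic sampling: random start in the first group, then every k-th unit
k     <- N / n                 # k = 8
start <- sample(1:k, 1)        # random start point
sys   <- seq(from = start, by = k, length.out = n)

length(srs)          # 8
all(diff(sys) == k)  # TRUE: every k-th unit selected
```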
2.7.4. Types of Sampling Methods
• Probability samples (cont’d)
– Stratified
• Divide population into groups (strata)
• Take a random sample from each stratum
• Combine samples
– Cluster
• Divide population into clusters (e.g., a population divided into 4 clusters)
• Randomly select clusters
• Sample all units in selected clusters
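The two designs can be sketched in base R with a hypothetical frame of 40 units in 4 groups:

```r
set.seed(42)
# Hypothetical sampling frame: 40 units, each tagged with one of 4 groups
frame <- data.frame(id = 1:40, group = rep(1:4, each = 10))

# Stratified: draw a random sample of 2 units from EACH stratum, then combine
strat <- do.call(rbind, lapply(split(frame, frame$group),
                               function(s) s[sample(nrow(s), 2), ]))

# Cluster: randomly select 2 whole clusters and keep ALL of their units
chosen <- sample(unique(frame$group), 2)
clust  <- frame[frame$group %in% chosen, ]

nrow(strat)  # 8  (2 units from each of 4 strata)
nrow(clust)  # 20 (all 10 units in each of the 2 selected clusters)
```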
2.7.4. Types of Sampling Methods
• Non-Probability samples
– Convenience Sampling: Example: A researcher stands outside
a mall and surveys the first 100 people who walk by. The
sample is selected based on the convenience of accessibility.
– Judgmental or Purposive Sampling: Example: A researcher
selects experts in a particular field to interview about their
opinions on a new technology, based on their expertise.
– Snowball Sampling: Example: A researcher studying a rare
medical condition starts by interviewing a few patients and
then asks them to refer others who have the condition. The
sample grows through referrals from participants.
2.7.4. Types of Sampling Methods
• Non-Probability samples (cont’d)
– Quota Sampling: Example: A researcher sets a quota to
interview 50% males and 50% females for a survey, but within
each group, participants are selected non-randomly (such as
through convenience sampling).
– Self-Selection Sampling: Example: People are invited to
participate in an online survey, and only those who are
interested and willing choose to respond. There is no control
over who decides to participate.
2.7.4.1. Examples of Survey Designs
Source: Jeff Pitblado, Associate Director, Statistical Software at StataCorp LP. 2009 Canadian Stata Users
Group Meeting. Available at http://www.stata.com/meeting/canada09/ca09_pitblado_handout.pdf
2.7.5. Advantages of Sampling (over Census)
• Simple random: Easy to implement
• Systematic: Simple procedure
• Stratified: Ensures representation across groups
• Cluster: Cost-effective for large populations
• Non-probability : Cost-effective and quick for
exploratory research or when a random sample is
not feasible.
2.7.6. Disadvantages of Sampling
• Simple random: May not represent subgroups
well
• Systematic: Potential for bias with cyclical data
• Stratified: Requires prior knowledge of
population
• Cluster: Lower precision, larger samples needed
• Non-probability : Potential for bias and lack of
generalizability to the larger population.
2.7.7. Evaluating Survey Quality
• What is the purpose of the survey?
• Is the survey based on a probability sample?
• Total survey error divided into
– Sampling error : always exists (for probability
surveys)
– Non-sampling error
• Coverage error – appropriate frame
• Nonresponse error – follow up
• Measurement error – good questions elicit good
responses
2.7.8. Types of Survey Errors
• Coverage error: units excluded from the frame.
• Nonresponse error: follow up on non-responses.
• Measurement error: bad question!
• Sampling error: chance differences from sample to sample.
2.8. The Role of Probability in Statistics
• Foundation for understanding uncertainty
and variability in data
• Data as realizations of chance processes
(e.g., sample averages)
• Essential for making inferences about
populations from samples
2.8.1. Probability as the Basis for Statistics
• Provides mathematical framework for
quantifying uncertainty
• Allows modeling of random phenomena in real-
world situations
• Underpins key statistical concepts (e.g.,
distributions, hypothesis testing)
2.8.2. Understanding Variability Through
Probability
Introduction to R and RStudio
1. Introduction: Computing Resources
R:
• Many different datasets (and other “objects”) available at the same time
• Datasets can be of any dimension
• Functions can be modified
• Experience is interactive: you program until you get exactly what you want
• One-stop shopping: almost every analytical tool you can think of is available
• R is free and will continue to exist. Nothing can make it go away, and its price will never increase.
Commercial Packages:
• One dataset available at a given time
• Datasets are rectangular
• Functions are proprietary
• Experience is passive: you choose an analysis and they give you everything they think you need
• Tend to have limited scope, forcing you to learn additional programs; extra options cost more and/or require you to learn a different language (e.g., SPSS Macros)
• They cost money. There is no guarantee they will continue to exist, but if they do, you can bet that their prices will always increase.
1. Introduction: Computing Resources
CAVEAT:
• “Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.” --Francois Pinard
2. R Basics
• To use R, we first discuss its capabilities, then describe how to install it, then illustrate some basic commands and how to obtain help.
• Various ways of communicating with R
– Interactively: (through console)
– Batch Processing: (through scripts)
– Point and Click: (through “add ons” Rcmdr,
rattle, deducer)
2.1. What is R?
• R is a statistical programming environment for
performing standard & specialized statistical tools
– “environment” : intended to characterize R as a fully planned
and coherent system, rather than an incremental accretion of
very specific and inflexible tools, as is frequently the case with
other data analysis software
• R is a free, open-source statistical package based on the S language developed at Bell Labs (later commercially released by MathSoft as S-Plus).
• Although R is a programming language, i.e., writing computer code to complete tasks is needed, there are Graphical User Interface (GUI) add-ons like R Commander, which allow users to “point and click”.
2.1. What is R?
• Initially developed by Robert Gentleman and Ross Ihaka of the University of Auckland; now maintained by the “R core development team”
– Since 1997: international R-core team
~20 people & 1000s of code writers
and statisticians happy to share their
libraries
– About 2 million R users globally : forums, mailing lists, blogs
• Cross platform compatibility: Windows, MacOS, Linux
• Very powerful for writing programs.
– Many statistical functions are already built in.
– Contributed packages expand the functionality to cutting
edge research.
2.1. What is R?
Advantages:
o Fast and free.
o State of the art: statistical researchers provide their methods as R packages. SPSS and SAS are years behind R!
o Second only to MATLAB for graphics.
o Mx, WinBUGS, and other programs use or will use R.
o Active user community.
o Excellent for simulation, programming, computer-intensive analyses, etc.
o Forces you to think about your analysis.
o Interfaces with database storage software (SQL).
Disadvantages:
o Not user-friendly at the start: steep learning curve, minimal GUI.
o No commercial support; figuring out correct methods or how to use a function on your own can be frustrating.
o Easy to make mistakes and not know.
o Working with large datasets is limited by RAM.
o Data prep & cleaning can be messier & more mistake-prone in R vs. SPSS or SAS.
o Some users complain about hostility on the R listserv.
2.1. What is R?
• As of 7 Sept 2024, there are 21,229 add-on packages
(http://cran.r-project.org/src/contrib/PACKAGES.html)
– This is an enormous advantage - new techniques available
without delay, and they can be performed using the R
language you already know.
– Allows you to build a customized statistical program
suited to your own needs.
– Downside = as the number of packages grows, it is
becoming difficult to choose the best package for your
needs, & QC is an issue.
2.2. Installation
• R home page: http://www.r-project.org/
2. Or click Start ► Programs ► R ► R x64 3.2.1 (for 64-bit machines)
2.3.1. Start-Up Windows
(Screenshot: the R Console, with its menu bar and toolbar)
2.3.3. Changing GUI Preferences
• Click on Edit ► GUI Preferences
2.3.4. Buttons
Button Functions
• Open : Opens R file.
• Load Workspace
• Save: Saves the current data.
• Copy
• Paste
• Copy and Paste
• Stop current computation
• Print
2.3.5. Opening a Script Window
• Click on
File ► New Script
gives you a script window.
2.3.5. Opening a Script Window
R scripts
◦A text file containing commands that you
would enter on the command line of R
◦To place a comment in a R script, use a hash
mark (#) at the beginning of the line
2.3.6. Assignments and Operations
• Arithmetic and Mathematical Operations:
+, -, *, /, ^ are the standard arithmetic operators.
Modulo (remainder): %%
sqrt, exp, log, log10, sin, cos, tan, …
• Functions:
– Almost everything in R is done through functions.
Here we only refer to numeric and character
functions that are commonly used in creating or
recoding variables.
– Note that while the examples here apply
functions to individual variables, many can be
applied to vectors and matrices as well.
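For example, the numeric functions listed below apply element-wise to whole vectors, not just single values:

```r
x <- c(-5.99, 3.475, 100)

abs(x)                 # element-wise absolute value
round(x, digits = 2)   # element-wise rounding
sqrt(abs(x))           # functions compose element-wise too

trunc(5.99)   # 5
log10(100)    # 2
```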
2.3.6. Assignments and Operations
• Numeric Functions:
Function Description
abs(x) absolute value
sqrt(x) square root
trunc(x) trunc(5.99) is 5
round(x, digits=n) round(3.475, digits=2) is 3.48
log(x) natural logarithm
log10(x) common logarithm
exp(x) e^x
• Character Functions:
Function Description
substr(x, start=n1, Extract or replace substrings in a character vector.
stop=n2) x <- "abcdef"
substr(x, 2, 4) is "bcd"
substr(x, 2, 4) <- "22222" is "a222ef"
2.3.6. Assignments and Operations
Probability Functions:
• Notations (prefix + distribution name)
Probability density function: d
(Cumulative) distribution function: p
Quantile function: q
Random generation from the distribution: r
• Examples:
– Normal distribution:
• dnorm(x, mean=0, sd=1, log = FALSE)
• pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
• qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
• rnorm(n, mean=0, sd=1)
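These four functions fit together; for the normal distribution, pnorm() and qnorm() are inverses of each other:

```r
pnorm(1.96)          # P(Z <= 1.96), about 0.975 for the standard normal
qnorm(0.975)         # the 97.5th percentile, about 1.96
dnorm(0)             # density height at 0: 1 / sqrt(2 * pi), about 0.3989

qnorm(pnorm(1.3))    # 1.3: qnorm() undoes pnorm()

set.seed(1)
r <- rnorm(1000, mean = 5, sd = 2)  # 1000 random draws
mean(r)              # close to 5
```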
2.3.6. Assignments and Operations
Statistical Functions :
Excel R
NORMDIST pnorm(7.2,mean=5,sd=2)
NORMINV qnorm(0.9,mean=5,sd=2)
LOGNORMDIST plnorm(7.2,meanlog=5,sdlog=2)
LOGINV qlnorm(0.9,meanlog=5,sdlog=2)
GAMMADIST pgamma(31, shape=3, scale =5)
GAMMAINV qgamma(0.95, shape=3, scale =5)
GAMMALN lgamma(4)
WEIBULL pweibull(6, shape=3, scale =5)
BINOMDIST pbinom(2,size=20,p=0.3)
POISSON ppois(2, lambda =3)
2.3.6. Assignments and Operations
Other Useful Functions
Function Description
seq(from , to, by) generate a sequence
indices <- seq(1,10,2)
#indices is c(1, 3, 5, 7, 9)
2.3.6. Assignments and Operations
• Matrix Arithmetic:
* is element-wise multiplication
%*% is matrix multiplication
• Assignment:
To assign a value to a variable, use “<-” or the equals (=) character.
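A short illustration of the difference between * and %*%, plus both assignment forms:

```r
A <- matrix(1:4, nrow = 2)   # 2x2 matrix, filled column by column
I <- diag(2)                 # 2x2 identity matrix

A * I     # element-wise: off-diagonal entries become 0
A %*% I   # matrix multiplication: A %*% I equals A

# Both assignment forms bind a value to a name
x <- 10
y = 10
x == y    # TRUE
```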
2.3.6. Assignments and Operations
• Objects can be used in other calculations.
To print object just enter name of object.
• Restrictions for the name of an object:
Object names cannot contain ‘strange’ symbols like !, +, -, #.
A dot (.) and an underscore (_) are allowed, as is a name starting with a dot.
Object names can contain a number but cannot start with a number.
R is case sensitive: X and x are two different objects, as are temp and temP.
2.3.6. Assignments and Operations
NOTE: R is case-sensitive (y ≠ Y)
2.3.6. Assignments and Operations
We can evaluate truth or falsity of expressions:
2>1
1>2&2>1
generate sequences (and perform operations on
them)
3*(1:5)
We can do matrix operations
a <- 1:3
b <- 3:5
a*b
a%*%b
a%*%t(b)
2.3.7. R Objects and Indexing Techniques
The most basic data structure in R is a vector,
a sequence of values of the same type
(numeric, integer, character, logical, complex).
2.3.7. R Objects and Indexing Techniques
• Example :
n1 <- 25
n1
typeof(n1)
v1 <- 1:5
v1
is.vector(v1)
v2 <- c("t", "o", "o", "t", "s")
v2
is.vector(v2)
v2 <- c(FALSE, TRUE, TRUE)
2.3.7. R Objects and Indexing Techniques
It can be helpful to coerce objects, i.e., change an R data object from one type to another, e.g., a character vector to logical, a matrix to a data frame, (double-precision) numeric to integer, etc.
(coerce1 <- c(1, "a", TRUE) )
typeof(coerce1)
(coerce2 <- c(5))
typeof(coerce2)
(coerce3 <- as.integer(5)) #coerce numeric to integer
typeof(coerce3)
(coerce4 <- c("1", "2", "3") )
typeof(coerce4)
(coerce5 <- as.numeric(c("1", "2", "3"))) #coerce to numeric
2.3.7. R Objects and Indexing Techniques
• Accessing elements of a vector, matrix, data frame or list is achieved through a process called indexing. Indexing may be done by
– a vector of positive integers: to indicate inclusion
– a vector of negative integers: to indicate exclusion
– a vector of logical values: to indicate which are in and which are out
– a vector of names: if the object has a names attribute
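The four indexing modes above, sketched on a small named vector:

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

v[c(1, 3)]                       # positive integers: include positions 1 and 3
v[-2]                            # negative integers: exclude position 2
v[c(TRUE, FALSE, TRUE, FALSE)]   # logical values: keep the TRUE positions
v[c("b", "d")]                   # names: uses the names attribute

v[v > 25]                        # common idiom: logical index from a condition
```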
2.3.7. R Objects and Indexing Techniques
• Example : producing a random sample of values between
one and five, twenty times and determining which
elements are equal to 1
x <- sample(1:5, 20, rep=T)
x
x == 1
ones <- (x == 1) # parentheses unnecessary
• Suppose we now want to replace the ones appearing in
the sample with zeros and store the values greater than 1
into an object called y
x[ones] <- 0
x
others <- (x > 1) # parentheses unnecessary
y <- x[others]
2.3.7. R Objects and Indexing Techniques
• The following command queries the x vector and
reports the position of each element that is greater
than 1
which(x > 1)
• Example : creating a matrix and a data frame, and
accessing elements
value <- rnorm(6)
dim(value) <- c(2,3)
value # notice we now have a matrix
dim(value) <- NULL
value # converted back to a vector
2.3.7. R Objects and Indexing Techniques
• Other than the use of the dim function, we could
use matrix
matrix(value,2,3)
matrix(value,2,3,byrow=T) # to fill by rows
• Use the rbind function to bind a row onto an
already existing matrix
value <- matrix(rnorm(6),2,3,byrow=T)
value2 <- rbind(value,c(1,1,2))
value2
• To bind a column onto an already existing matrix,
the cbind function can be used
value3 <- cbind(value,c(1,1,2))
2.3.7. R Objects and Indexing Techniques
• The function data.frame converts a matrix or
collection of vectors into a data frame
value3 <- data.frame(value3)
value3
• Row and column names are already assigned to a
data frame but they may be changed using the
names and row.names functions. To view the row
and column names of a data frame:
names(value3)
row.names(value3)
• Alternative labels can be assigned:
names(value3) <- c("C1","C2","C3","C4")
109
2.3.7. R Objects and Indexing Techniques
• Data frames can be indexed by either column
value3 <- data.frame(value3)
value3
value3[, "C1"] <- 0
value3
• Or by row (assign row labels first so that "R1" exists):
row.names(value3) <- c("R1","R2")
value3["R1", ] <- 0
value3
value3[] <- 1:8 # fill all 8 cells, column by column
value3
110
2.3.7. R Objects and Indexing Techniques
EXERCISE 1: How do we access
(a) the first two rows of the matrix/data frame?
(b) the first two columns of the matrix/data frame?
(c) the elements with a value greater than five
(ensuring that a vector is produced)?
111
2.3.7. R Objects and Indexing Techniques
Solutions to EXERCISE 1: How do we access
(a) the first two rows of the matrix/data frame?
value3[1:2,]
(b) the first two columns of the matrix/data frame?
value3[,1:2]
(c) the elements with a value greater than five
(ensuring that a vector is produced)?
as.vector(value3[value3>5])
112
2.3.7. R Objects and Indexing Techniques
EXERCISE 2: Execute the following code and think
about why it works the way it does:
a <- 1:3
# vectors have variables of one type
c(1, 2, "three")
# shorter arguments are recycled
(1:3) * 2
(1:4) * c(1, 2)
# warning! (why?)
(1:4) * (1:3)
113
2.3.7. R Objects and Indexing Techniques
• Lists can be created using the list function. Like data
frames, they can incorporate a mixture of modes into
one list, and each component can be of a different
length or size.
L1 <- list(x = sample(1:5, 20, rep=T), y = rep(letters[1:5], 4), z =
rpois(20, 1))
L1
• The first component can be accessed in several ways:
L1[["x"]]
L1$x
L1[[1]]
• What about
L1[1] # this is a sublist
114
2.3.7. R Objects and Indexing Techniques
Each element of a vector, matrix, data frame and
list can be given a name. This can be done by
passing named arguments to the c() function or
later with the names function. Such names can be
helpful in giving meaning to your variables.
For example compare the vector
x <- c("red", "green", "blue")
with
capColor = c(huey = "red", duey = "blue", louie = "green")
115
2.3.7. R Objects and Indexing Techniques
As pointed out earlier, elements of a vector,
matrix, data frame and list can be selected or
replaced using the square bracket operator [ ]
which accepts either a vector of names, index
numbers, or a logical.
In the case of a logical, the index is recycled if it is
shorter than the indexed vector.
In the case of numerical indices, negative indices
omit, instead of select, elements.
Negative and positive indices are not allowed in
the same index vector.
116
2.3.7. R Objects and Indexing Techniques
You can repeat a name or an index number, which
results in multiple instances of the same value.
For example:
capColor["louie"]
names(capColor)[capColor == "blue"]
x <- c(4, 7, 6, 5, 2, 8)
x[c(1, 1, 2)] # a repeated index gives repeated values
I <- x < 6
J <- x > 7
x[I | J]
x[c(TRUE, FALSE)] # logical index recycled over x
x[c(-1, -2)] # negative indices omit elements
117
2.3.7. Workspace
• Objects that you create during an R
session are held in memory; the
collection of objects that you currently
have is called the workspace.
• This workspace is not saved on disk
unless you tell R to do so. This means
that your objects are lost when you close
R without saving them, or worse, when R
or your system crashes during a session.
118
2.3.7. Workspace
• When you close the RGui or the R
console window, the system will ask if
you want to save the workspace image.
• If you choose to save the workspace image,
then all the objects in your current R
session are saved in a file .RData. This is a
binary file located in the working
directory of R, which is by default the
installation directory of R.
119
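The workspace can also be saved and restored explicitly from the console, without waiting for the exit prompt. A minimal sketch, assuming the code runs at top level of a session (the temporary file path is illustrative; any file name such as "mywork.RData" works):

```r
# Save every object in the global environment to a file
ws_file <- tempfile(fileext = ".RData")
x <- 42
save.image(ws_file)   # write the whole workspace to disk

rm(x)                 # x is now gone from the workspace
load(ws_file)         # restore the saved objects
x                     # 42 again
```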
2.3.8. Seeking Help
• R has a very good help system built in.
• If you know which function you want help
with simply use ?_______ with the
function in the blank.
?hist
args(hist)
• If you don’t know which function to use, then
use help.search("_______").
help.search("histogram")
120
2.3.8. Seeking Help
Obtaining Html help
• We can do a search with the Menu bar:
Help ► Html help
• The Html help pages also include a search engine.
121
2.3.8. Seeking Help
Tutorials
• Each of the following tutorials is in PDF format.
– P. Kuhnert & B. Venables,
An Introduction to R: Software for Statistical Modeling & Computing
– J.H. Maindonald, Using R for Data Analysis and Graphics
– B. Muenchen, R for SAS and SPSS Users
– B. Muenchen, R for Stata Users
– Getting Started in R~Stata
– UCLA’s Data Analysis Using R
– W.J. Owen, The R Guide
– D. Rossiter,
Introduction to the R Project for Statistical Computing for Use at the ITC
123
2.3.9. Quitting
Three Ways of Quitting an R Session
1. Enter in Command Window:
quit()
2. Click on
File ► Exit
3. Click on Close button (X at upper right hand
corner of R console window).
124
2.4.1. Datasets
• R comes with a number of sample datasets
that you can experiment with. Type
data()
to see the available datasets. The result will
depend on which packages you have
loaded. Type
help(datasetname)
for details on a sample dataset. For example,
help("iris")
provides info on the Iris Dataset
125
2.4.2. Packages / Libraries / Add-ons
• One of the strengths of R is that the system can easily
be extended.
• The system allows you to write new functions and package those
functions in a so-called "R package" (or "R library").
• The R package may also contain other R objects, for example data
sets or documentation.
• R packages/libraries are bundles of code that add new functions to
R so we can do new things. We have basic packages (installed with R
but not loaded by default) and contributed (or third-party) packages
(that need to be downloaded, installed and loaded separately).
• There is a lively R user community and many R
packages have been written and made available on
CRAN for other users.
• Just a few examples, there are packages for portfolio optimization,
drawing maps, exporting objects to html, time series analysis, spatial
statistics and the list goes on and on.
126
2.4.1. Datasets
• Suppose we would like to work with the Aids2 dataset in
the package / library MASS (Modern Applied Statistics with
S)
data("Aids2", package = "MASS")
• If the search on the dataset was successful, the command
above attaches the data object to the R global environment.
The command
ls()
lists the names of all objects currently stored
in the global environment, and, as the result
of the previous command, a variable named
Aids2 is available for further manipulation.
Now try
print(Aids2)
127
2.4.2. Packages / Libraries / Add-ons
• Contributed packages can be found at:
• CRAN: cran.r-project.org
• Crantastic: crantastic.org
• Github: github.com/trending/R
128
2.4.2. Packages / Libraries / Add-ons
• Hadoop with R (RHadoop)
https://github.com/RevolutionAnalytics/RHadoop/wiki
131
2.4.2. Packages / Libraries / Add-ons
For Visualizing Data
• ggplot2 - R's famous package for making beautiful graphics. ggplot2 lets you use the
grammar of graphics to build layered, customizable plots.
• ggvis - Interactive, web based graphics built with the grammar of graphics.
• rgl - Interactive 3D visualizations with R
• shiny - Build interactive web apps from R that can be hosted on a website
• htmlwidgets - A fast way to build interactive (javascript based) visualizations with R. Packages
that implement htmlwidgets include:
• leaflet (maps)
• dygraphs (time series)
• DT (tables)
• DiagrammeR (diagrams)
• networkD3 (network graphs)
• threejs (3D scatterplots and globes).
• googleVis - Lets you use Google Chart tools to visualize data in R. Google Chart tools grew out
of Gapminder's Trendalyzer, the graphing software Hans Rosling made famous in his TED talk.
132
2.4.2. Packages / Libraries / Add-ons
For Modeling Data
• tidymodels - A collection of packages for modeling and machine learning using tidyverse
principles. This collection includes rsample, parsnip, recipes, broom, and many other general
and specialized packages listed here.
• car - car's Anova function is popular for making type II and type III Anova tables.
• mgcv - Generalized Additive Models
• lme4/nlme - Linear and Non-linear mixed effects models
• randomForest - Random forest methods from machine learning
• multcomp - Tools for multiple comparison testing
• vcd - Visualization tools and tests for categorical data
• glmnet - Lasso and elastic-net regression methods with cross validation
• survival - Tools for survival analysis
• caret - Tools for training regression and classification models
133
2.4.2. Packages / Libraries / Add-ons
• When you download R, a number of
packages are downloaded as well.
• To use a function in an R package, that package has
to be attached to the system.
• When you start R, not all of the downloaded packages
are attached; only seven packages are attached to
the system by default.
• You can use the function search to see a
list of packages that are currently attached
to the system; this list is also called the
search path.
search()
134
2.4.2.1. Attaching Packages
• To attach another package to the system
you can use the menu or the library
function.
Via the menu:
• Select the `Packages' menu and select `Load
package...'; a list of available packages on your
system will be displayed. Select one and click `OK';
the package is now attached to your current R
session.
Via the library function:
library()
library(MASS)
drivers
135
2.4.2.2. Installing Packages
• IMPORTANT TO NOTE:
Before you download a
new package, make sure to
run R as administrator
– Right click on the shortcut
– Choose “Run as
administrator”:
136
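After starting R with sufficient privileges, installation itself is a single console command. A minimal sketch using MASS (bundled with most R installations, so the download step is usually skipped; any CRAN package name works):

```r
# Install a package from CRAN only if it is not already available
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")   # requires internet access
}

# Attach the package and use one of its datasets
library(MASS)
head(Aids2)
```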
2.4.2.2. Installing Packages
137
2.4.2.2. Installing Packages
139
TIP
• Download and install the package pacman:
install.packages("pacman")
• Then use pacman to install (if needed) and load
specific packages in one call:
pacman::p_load(pacman, dplyr, GGally, ggplot2,
ggthemes, ggvis, httr, lubridate, plotly, rio,
rmarkdown, shiny, stringr, tidyr, Rcmdr)
library(datasets) # attach the built-in datasets package
p_unload(dplyr, tidyr, stringr) # unload specific packages
p_unload(all) # unload all add-on packages
detach("package:datasets", unload = TRUE)
cat("\014") # clear the console
140
Summary and Key Points
NEXT DISCUSSIONS
• Summary Measures
• Tables and Visuals
• Implementing in R
142