
Session 1 Course Overview and Intro to R

The document outlines the DAT003M Fundamentals of Statistics course, taught by Dr. Jose Ramon G. Albert, focusing on the importance of statistics in decision-making and data analysis. It covers topics such as data collection methods, exploratory data analysis, and statistical inference, with an introduction to the R programming language for practical applications. The course aims to equip students with essential statistical skills and knowledge to analyze data and make informed decisions.

DAT003M Fundamentals of Statistics

Session 1: Course Overview and Intro to R

Jose Ramon G. Albert, Ph.D.


Professorial Lecturer
Mathematics Department
De La Salle University
Senior Research Fellow
Philippine Institute for Development Studies
Email: [email protected]; [email protected]
About the lecturer
Jose Ramon “Toots” Albert, Ph.D.

Dr. Jose Ramon G. Albert is a professional statistician and senior research
fellow of the Philippine Institute for Development Studies (PIDS). He
finished B.S. Applied Math with a concentration in Statistics summa cum
laude from De La Salle University in 1988, and obtained an MS in Statistics
(1989) and a PhD in Statistics (1993) from the State University of New York
at Stony Brook.
• A former National Statistician of the Philippines (as the Secretary
General of the now defunct National Statistical Coordination Board),
he has served as consultant to various government agencies and private
firms in the PH and to international organizations.
• He has worked in about 30 countries spanning South-East Asia, South
Asia, East Asia, the Pacific, the Americas and the Caribbean, the
Middle East, and Africa on poverty measurement and analysis,
econometric methods, and survey data analysis.
• He has written on poverty, social protection, education, gender and
social inclusion, ICT, innovation, climate disasters and other
development issues.
Quotable Quotes
“Things get done only if the data we gather can
inform and inspire those in a position to make a
difference.”
MIKE SCHMOKER

“The pandemic has certainly shone a light on how
important data and analytics are … strong data
and analytics strategies that create insights in
real-time will continue to power smart decisions”
MIKE CAPONE

“Statistics are like bikinis. What they reveal is
suggestive, but what they conceal is vital.”
AARON LEVENSTEIN

3
Agenda
Session 1
1. Introduction: Context and Importance of Statistics
2. The Process of Statistics
3. Course Overview
• Aims, References, and Requirements
4. Data Collection Methods
5. Introduction to Exploratory Data Analysis
6. Overview of Statistical Inference
7. Introduction to R
• Basic Operations and Data Structures
• Installing and Loading Packages

4
0. Introduction: Context and
Importance of Statistics
“You cannot manage what you don’t measure”
- Bill Hewlett, co-founder of Hewlett-Packard (paraphrasing Lord Kelvin)
Advances in ICT
• synergy between computing & communications
• Implications:
– More DATA collected
– More DATA stored
– More DATA accessible and distributed
Information is power!
DATA: “the new oil”, a driver of growth and change
0. Introduction: Context and
Importance of Statistics
 hyperconnectivity

6
0.1. Tsunami of Data
 Awash in a flood of data! “drowning in numbers”
 From the beginning of recorded time until 2003, we
created 5 billion gigabytes (5 exabytes) of data. By 2012,
data was reported to have doubled every 40 months since
the 1980s, with about 5 exabytes then being created
every 2 days.
 In 2024, 625 million videos are watched on TikTok every
60 seconds. Netflix’s 278 million subscribers watch 650
million hours of content every day.
 Every minute, 231 million emails are being sent and
6.3 million searches are made.
0.2. Big Challenge is not to be
DRIP (Data Rich but Information Poor)
• Transforming RAW DATA into meaningful INFORMATION
 improve efficiency in business and (development) management
 make “predictions”
8
0.3. Importance of Statistics
• Statistics translates data into decisions,
theories, and knowledge
• Essential skill in the 21st century as data
increases in every field
• Employees are expected to take larger roles in
drawing insights from data
• Statistics extends to all parts of the scientific
decision-making process
• Crucial in any field that intends to use data to
reach conclusions
9
1. Fundamentals of Statistics
Course
• Course Objective: To
understand and apply
basic statistical methods
– Master basic concepts on
probability and statistics
– Develop skills using
statistical environment R
– Interpret results from use of
statistical methods with R

10
1. Fundamentals of Statistics Course
• Syllabus overview: 14 weeks, covering descriptive and
inferential statistics
• Textbook: Basic Statistics With R (Reaching Decisions
With Data) by Stephen C. Loftus (2022)
• Computing tools: R, RStudio
• Assessment: Problem sets (20%), Quizzes (50%), Final
exam (30%)
11
1.1. Course Coverage
• Descriptive statistics (Weeks 1-4)
• Probability and probability models, the Normal
distribution (Weeks 5-9)
• Sampling & estimation (Weeks 10-11)
• Hypothesis tests (Weeks 11-12)
• Regression analysis (Weeks 12-13)
• Final exam (Week 14)

12
1.2. Class Policies
• Attendance and
participation
requirements
• Academic integrity
and collaboration
guidelines
• Communication
channels and office
hours
13
Quotable Quote
“The alternative to good statistics is not "no
statistics", it's bad statistics. People who argue
against statistical reasoning often end up
backing up their arguments with whatever
numbers they have at their command, over- or
under-adjusting in their eagerness to avoid
anything systematic.” – Bill James

14
2. Statistics
• science of collecting, analyzing, interpreting, and
presenting data
• framework for data-driven insights
• process of uncovering patterns in data amid uncertainty
– Crucial for informed decision-making
– Developed from needs of government, probability
theory, and computing advances
– Essential in today's data-driven world for extracting
insights
15
2.1. The Process of Statistics
1. Hypothesis/Questions: Formulate questions or develop
hypotheses about phenomena
2. Data Collection: Gather relevant data through experiments
or observational studies
3. Data Description: Summarize data using descriptive statistics
and visualizations
4. Statistical Inference: Draw conclusions about populations
based on sample data
5. Theories/Decisions: Develop theories or make decisions
based on statistical results
• This process is cyclical, continuously seeking better
solutions and decisions
16
2.1.1. Realities about Statistics
• The man in the street distrusts statistics and
despises [his image of] statisticians, those who
diligently collect irrelevant facts and figures
and use them to manipulate society.

“There are three kinds of lies: lies, damned lies, and statistics”
– Mark Twain

• One cannot go about without statistics.

“Statistics are like bikinis. What they reveal is suggestive, but
what they conceal is vital.” – Aaron Levenstein

17
2.1.2. Florence Nightingale on Statistics
• “...the most important science in the whole
world: for upon it depends the practical
application of every other science and of
every art: the one science essential to all
political and social administration, all
education, all organization based on
experience, for it only gives results of our
experience.”
• “To understand God's thoughts, we must
study statistics, for these are the measures
of His purpose.”

18
Purposes of Using Statistics (Statistics, not Stat-is-eeks)
• USE DATA to
– Describe
– Explain
– Predict
– Make Decisions
• The process: the world before analysis → data collection →
data organization & preliminary analysis → data
interpretation → the world after analysis

19
 Data is the foundation of any statistical analysis.
Data analytics
[Cartoon: “Come on! It can‘t go wrong every time...”]

20
2.1.3. The Role of Statistics in
Decision-Making

• Properly transforms data into information and insight
• Quantifies uncertainty and risk
• Enables evidence-based decisions
• Facilitates comparison of options
• Aids in reliable forecasting

21
2.1.4. Examples of Statistics in
Everyday Life
• Weather forecasts
• Sports performance metrics
• Political polls
• Health and fitness tracking
• Product ratings and reviews
• Financial planning and investment

22
2.1.5. Overview of Statistics Application in
This Course
Throughout this course, you will learn to apply statistical concepts
and techniques to real-world problems. Here's an overview of how
we will approach this:
1. Data Collection and Presentation:
– Design surveys and experiments
– Collect and organize data
– Create effective visual representations (charts, graphs)
2. Descriptive Statistics:
– Calculate and interpret measures of central tendency and variability
– Analyze distributions and identify outliers
– Examine relationships between variables
3. Probability and Sampling:
– Apply basic probability concepts
23
– Understand sampling distributions
2.1.5. Overview of Statistics Application in
This Course
4. Inferential Statistics:
– Construct and interpret confidence intervals
– Perform hypothesis tests
– Analyze relationships using correlation and regression
5. Practical Applications:
– Work with real-world datasets
– Use statistical software (R) for analysis
– Interpret results in context of business and management
scenarios
– Develop critical thinking skills for data-driven decision making
By the end of this course, you will be equipped with the statistical
tools and knowledge to analyze data, draw meaningful
conclusions, and make informed decisions in various professional
settings.
24
2.2. Basic Statistical Concepts
• Population vs Sample: All items of interest vs
a subset (or portion)
• Parameter vs Statistic: Population summary
measure vs Sample summary measure

25
2.2. Basic Statistical Concepts
• Data: the bedrock of Statistics
– Data are facts or figures about an object or
phenomenon that can be used for reporting,
calculations, planning, or analysis. We can observe,
describe, measure, and count objects around us. We
can also record what we find about the weather, the
environment, and the economy. This is what data is.
– Data can be anything from numbers and text to
images and videos.
– When you keep track of how many books you read
each month or how many friends you have, you are
collecting data.
26
2.2. Basic Statistical Concepts
• Descriptive statistics: collecting and describing data
• Inferential statistics: drawing conclusions and/or
making decisions concerning a population based only
on sample data

27
2.2.1. Descriptive Statistics
• Collect data
– e.g., Survey
• Present data
– e.g., Tables and graphs
• Characterize data
– e.g., Sample mean = (Σ xᵢ) / n
Examples
• Market basket analysis
• Discovers associations between products

28
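The descriptive steps above can be sketched in a few lines of R; the data vector here is made up for illustration:

```r
# Hypothetical data: number of items bought by 8 customers
x <- c(4, 2, 5, 3, 4, 6, 2, 3)

# Characterize the data: sample mean = (sum of x_i) / n
n <- length(x)
mean_x <- sum(x) / n
mean_x            # same value as the built-in mean(x)

# Present the data: a simple frequency table
table(x)
```

`mean(x)` is the idiomatic form; spelling it out as `sum(x) / n` mirrors the formula on the slide.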
2.2.2. Inferential Statistics
• Estimation
– e.g.: Estimate the population proportion living in
poverty using the sample proportion
• Hypothesis testing
– e.g.: Test the claim that the population poverty
incidence is 15 percent (when the sample
proportion is 15.5%)
Inference is drawing conclusions and/or making decisions
concerning a population based on sample results.
29
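The estimation and testing example above can be sketched in R with hypothetical survey counts (155 poor out of a sample of 1,000 — invented numbers chosen to match the 15.5% on the slide); `prop.test()` is in base R:

```r
# Hypothetical sample: 155 of 1,000 respondents are classified as poor
poor <- 155
n    <- 1000

# Estimation: the sample proportion estimates the population proportion
p_hat <- poor / n          # 0.155, i.e., 15.5%

# Hypothesis testing: is the population poverty incidence 15 percent?
prop.test(poor, n, p = 0.15)
```

A large p-value from `prop.test()` means a sample proportion of 15.5% is consistent with a true incidence of 15%.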
2.2.2.1. Poverty Monitoring in PH
Income poverty rates and GDP per capita: 2000 to 2023
[Chart: Proportion of Population Living below the National Poverty Line,
Total (%) and Gross Domestic Product per Capita (current $), 2001-2023]

• The proportion of Filipinos in poverty rose to 18.1% in 2021, up from 16.7% in
2018, thus partially reversing earlier gains in poverty reduction from 2006 to 2018
and highlighting the pandemic's impact on vulnerable populations (PSA 2022).
• Recently released data show that the poverty incidence has declined to 15.5% in
2023.
• The government is targeting for the poverty rate to fall to nine percent by 2028.
30
2.3. Why We Need Data
• Measure performance
• Evaluate standards
• Input for studies
• Support decision making

31
2.4. Components of Data
• Objects (observational units)
– rows in a data frame
• Variables (characteristics measured)
– columns in a data frame
• Scales (measurement schemes)
32
2.5. Types of Variables

Variables are either Quantitative (Numerical) or Categorical;
quantitative variables are either Discrete or Continuous.

• Example of Discrete Variables: Number of children in a family
(1, 2, 3, etc.), number of cars in a parking lot, number of books
in a library.
Note: Discrete data represent countable values, often whole numbers.

• Example of Continuous Variables: Height (170.5 cm),
Temperature (36.8°C), or Weight (65.2 kg).
Note: Continuous data represent measurements and can take any
value within a given range, including fractions and decimals.

• Example of Categorical Variables: Sex (Male, Female), Blood
type (A, B, AB, O), or Eye color (Brown, Blue, Green).
Note: Categorical data represent groups or categories without a
numerical order.
33
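These variable types map directly onto R's data types; a minimal sketch with invented values:

```r
# Discrete quantitative: countable whole numbers
children  <- c(1, 2, 3, 2)

# Continuous quantitative: any value within a range
height_cm <- c(170.5, 162.3, 181.0)

# Categorical: groups with no numerical order, stored as a factor
blood <- factor(c("A", "O", "AB", "B", "O"),
                levels = c("A", "B", "AB", "O"))

is.numeric(children)    # TRUE
is.numeric(height_cm)   # TRUE
is.factor(blood)        # TRUE
levels(blood)           # the possible categories
```

Storing categorical data as a factor (rather than plain text) tells R's modeling and plotting functions to treat the values as categories.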
2.6. Data Sources

• Primary sources (Data Collection): Observation,
Experimentation, Surveys (e.g., market polls,
government surveys)
• Secondary sources (Data Compilation): Print or
Electronic Records, Big Data
34
2.6.1. Traditional Data Sources
 Official statistics are typically sourced from
surveys (censuses and sample surveys)
and administrative data; new data sources
such as big data are also being
experimented with to fill in “data gaps”.
 A survey is a systematic method for gathering
information from a target population of
interest for purposes of producing quantitative
descriptors of the attributes of the population
(Groves et al. 2009).
35
2.6.1. Traditional Data Sources
a) A census is a survey with data collection involving a complete
enumeration of the population. Although a census can
provide a reliable baseline data on the structure and key
characteristics of the target population, its scope and range
yield a high cost for its conduct. Furthermore, the large
number of entities enumerated in a field operation
predisposes censuses to data collection errors. Errors also
arise from differences in understanding of concepts,
definitions, and instructions of both field enumerators and
respondents. As with other data sources, changes in census
methodologies can also make it challenging to compare data
from one census to another.

36
2.6.1. Traditional Data Sources
Data Source Advantages Disadvantages
Census • Complete enumeration • Costly
• Source of statistics for • Robust staffing
entire population • Restricted periodicity
• Provides basis for area • Longer lag time in
and/or list frames of producing results
sample surveys

37
2.6.1. Traditional Data Sources
– Often, data are collected in a survey only from a subset of the
population, and in this case, the survey is called a sample
survey. When sample surveys use probability sampling for
selecting respondents, each respondent represents a certain
number of entities in the larger aggregate of entities to which
the respondent belongs. Consequently, we can make
statistically valid conclusions (called inferences) about the
entire population, particularly the attributes of the population
using information from the sample survey.

38
2.6.1. Traditional Data Sources
Survey statistics can be used as estimates of the corresponding
descriptors of the population parameters. For instance, in a
labour force survey (LFS), the proportion of sample respondents
who have a job can be an estimate of the share of the entire
population who have a job.

Since sample surveys only involve a fraction of the total
population, they are a more cost-effective means of collecting
data. Sampling works: all the blood of a patient is never
extracted in a hospital for a blood test; a blood sample will do!

Further, because sample surveys are administered in a more
controlled manner, they can include more detailed inquiries on
the characteristics that can vary considerably across time (such
as employment conditions, wages).
39
2.6.1. Traditional Data Sources
Sample surveys attract several varieties of errors that affect survey
data quality. There are two sources of survey error: sampling error
and non-sampling error.
Sampling error arises because survey statistics are based on a sample rather
than a complete enumeration, while non-sampling error is bias in survey
estimates that is not traceable to features of the sampling process. Ultimately,
the success in the conduct of a sample survey depends on the percentage of
responses and the quality of the survey responses. A measure of sampling
error, called the standard error, is associated with the standard deviation of
the estimator.
40
2.6.1. Traditional Data Sources
Precision of estimators is measured by the Standard Error.
• How large a sample would be necessary to estimate the true
proportion within ±3%, with 95% confidence?
(Use the conservative value p = 0.5.)

Solution:
The standard error (SE) for estimating p is SE = sqrt(p(1-p)/n),
which is at most 1/(2 sqrt(n)), with the maximum at p = 0.5.

Control the Margin of Error: 0.03 = 1.96 SE ≈ 2 × 1/(2 sqrt(n)) = 1/sqrt(n)

So use n = (1/0.03)^2 ≈ 1,111

Round up to 1,200
41
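The calculation above can be reproduced in R; the simple n = (1/margin)^2 formula comes from taking p = 0.5 and rounding 1.96 up to 2:

```r
# Conservative sample size for estimating a proportion within +/- 3%
# at 95% confidence, taking p = 0.5 (worst case) and 1.96 ~ 2, so that
# margin = 2 * sqrt(0.25 / n) = 1 / sqrt(n)  =>  n = (1 / margin)^2
margin <- 0.03
n_approx <- ceiling((1 / margin)^2)
n_approx    # 1112; the slide then rounds the target up to 1,200

# The exact version with z = 1.96 gives a slightly smaller n
n_exact <- ceiling(1.96^2 * 0.5 * 0.5 / margin^2)
n_exact     # 1068
```

Rounding the target up (to 1,200 on the slide) also builds in a cushion for anticipated non-response.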
2.6.1. Traditional Data Sources
Non-sampling error includes issues on the respondents’ ability to recall
information being asked in a survey question, the honesty of a respondent in
providing a survey response (rather than a “desirable response”) to an
interviewer, and the motivation of respondents (including fear) to provide
answers to the set of survey questions. Survey respondents may also find
responding to surveys time consuming if a questionnaire is lengthy. Response
burden may also arise if the content of one or more questions in a survey are
psychologically invasive and thus lead to emotional stress. As with censuses,
comparability over time is also a challenge for sample surveys because
estimates of key variables may require similar designs and methods that are
highly unlikely to be perfectly replicated. Furthermore, there is a need for
adequately trained staff to manage and administer the sample survey with
minimal deviation from protocols.

42
2.6.1. Traditional Data Sources
Data Source Advantages Disadvantages
Sample • Relatively easy to • Non-responses
administer • Sampling errors could
Survey • Cost-effective be large especially for
• Wider scope sub-national
disaggregates
• Response biases,
coverage biases and
other non-sampling
errors
• Need for an
adequately trained
statistical human
resources

43
2.6.1. Traditional Data Sources
– Administrative data are data holdings collected typically by
line ministries and local governments for the purposes of
administering taxes, benefits or services. Processes in an
administrative system typically involve registration, transaction
and record keeping which derive data as by-products.
Examples include health, pension and employment data in a
social security system; income or expenditure records of tax
authorities; data on registered unemployment, active labour
market programs, social benefits from a social protection
program; labour inspection records (pertaining to
occupational injuries). Administrative data, unlike surveys that
are designed for statistical purposes, are often mere by-products
of their original purpose of registration, transaction, and record
keeping in the administration of taxes, benefits or services.
44
2.6.1. Traditional Data Sources
Data Source Advantages Disadvantages
• Full count of clients of • Not designed for statistical
Administrative administrative system purposes
data • Better data coverage and • Units may not satisfy data
availability user needs or may not use
• Low-cost data collection a definition of unit which is
• Reduced response burden compatible with other data
to data suppliers source/s
• Timely statistical outputs • Scope of system may be
• Up-to-date and more too narrow by design, or
frequent, often too broad to include
longitudinal, data groups not of interest
• Needs strong coordination
among NSOs, other
government agencies, and
other data owners.
• Confidentiality issue
• Missing data
• Different time periods

45
2.6.2. Big Data

o Big data is not readily defined (the term was
first used in the mid-1990s in lunch
conversations @ Silicon Valley);
o it refers to digital data by-products (exhaust)
from electronic gadgets, internet
search/social media, sensors and tracking
devices;
o the data are increasing due to increased
capacity to collect, store, retrieve, use and
re-use data.
 While big data has no single definition, these
digital footprints have 3Vs (Gartner, 2001),
plus two extra V’s = 5V’s: the 3Vs plus
veracity and value.
46
2.6.2. Big Data
Data Source Advantages Disadvantages
Big data • Large volume of data • Data privacy and
• Wide variety of data security
types • Accessibility
• Timely (in fact, near • Challenges in
real time) data technological
• Improves accuracy and infrastructure
granularity of statistics • Requires new skill sets
of human resources for
management and
analytics
• Coverage and
representativeness

47
2.6.2.1. Examples of Big Data Analytics

 2008: Google established a near-real-time flu
tracker that monitors Google searches for
the term “flu”
Health Surveillance: Google Flu Trends (J. Ginsberg et al., Nature, 2009)

48
2.6.2.1. Examples of Big Data Analytics

• ASTONISHING re Google Flu Trends:
 Google statistics on flu incidence are aggregates
with a delay of one day, while official statistics
from the US Centers for Disease Control take a
week to put together based on administrative
reports from hospitals.
 The flu tracker is quick, accurate and cheap, while
official statistics are not as timely, and have huge
costs.

49
2.6.2.1. Examples of Big Data Analytics

 Pulse Laboratory in Jakarta examined Twitter data on “rice”
against the actual price of rice (Letouze, 2012)
 Tracking Population Movements with Digital Traces from
Mobile Phone Usage, e.g. video below from Geneva

50
2.6.2.1. Examples of Big Data Analytics

 “Distribution of Twitter users’ destinations and origins
(Philippines)” by R Roldan, UPINSTAT
 Trip purpose groupings of latent topics from LDA

51
2.6.2.1. Examples of Big Data Analytics

 “Philippine Geotagged Tweets Map and Earth’s City Lights
(Philippines)” by EF Legara, AIM
 Google Mobility Data in PH compared to baseline pre-COVID
movements: 2020-2021
 NASA’s Earth’s city light capture (right, zoomed in on the
Philippines)
52
2.6.2.1. Examples of Big Data Analytics
Improving Small Area Estimates of Poverty in PH with Satellite Imagery by ADB and
PSA

 Using innovative data sources
 Conventional small area estimates require further validation

Official SAE (EBLUP) Estimates, LANAO DEL SUR:
                      2006   2009   2012   2015
Poverty Incidence     38.6   48.7   67.3   66.3
Coeff. of Variation   18.4   15.5    8.0   4.82

53
2.6.2.2.Utilizing Big Data in Business
 Predictive Modeling, Association Rules and
Collaborative Filtering: Amazon uses its
customer database to inform clients that
“customers who bought Product A also bought
Product B, and Product C …”
 Sentiment Analysis: Social media data, such as
tweets on Twitter, are scrutinized in terms of
the “polarity” (i.e., positive, negative, or
neutral) of sentiments on a product.
 Text Analysis: In a Japanese call center, agents
input “what customers say” and instructions
are then given to call center agents on
workstations on “what to say”.
Frontier tech of Industry 4.0 is changing business
models (the rise of the Platform Economy) and
making more use of data, esp. big data.
(Brand rankings: https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/)
54
2.6.3. Data Collection Methods
• Observational Studies:
– Passive data collection
– Less expensive, fewer ethical concerns
– Cannot establish causality
– Examples: Surveys, existing data analysis
• Designed Experiments
– Active data collection
– More expensive, more ethical concerns
– Can establish causality
– Examples: Clinical trials, A/B testing
• Sampling Techniques
– Simple Random Sampling (SRS)
– Other methods: Stratified, Cluster, etc.

55
2.7. Survey Research
• Choose response mode
• Identify categories
• Formulate clear questions
• Pilot test the survey

56
2.7.1. Survey Response Modes
• Personal interview
• Telephone interview
• Mail survey
• Online survey

57
2.7.2. Survey Questionnaires
• Use clear, unambiguous language
• Avoid leading questions
• Use universally accepted definitions
• Cover all possible response options
• Have a “cover” letter
– State survey goals and purpose
– Explain importance of response
– Assure anonymity
– Offer incentives for participation
58
2.7.3. Reasons for Sampling
• Less time consuming than a census
• Lower cost
• More practical for large populations

59
2.7.4. Types of Sampling Methods
• Probability samples
– Simple random:
• Equal chance of selection for each unit
• Can use random number table or generator
• Selection with or without replacement
– Systematic
• Select every kth unit from population
• k = population size / sample size
• Random start point
Example: N = 64, n = 8, so k = 8; take a random start
within the first group of 8, then every 8th unit.
60
2.7.4. Types of Sampling Methods
• Probability samples (cont’d)
– Stratified
• Divide population into groups (strata)
• Take random sample from each stratum
• Combine samples
– Cluster
• Divide population into clusters (e.g., a population
divided into 4 clusters)
• Randomly select clusters
• Sample all units in selected clusters

61
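The probability sampling schemes above can be sketched in R with a toy population of N = 64 units (all numbers here are illustrative, matching the N = 64, n = 8, k = 8 example):

```r
set.seed(1)                      # for reproducibility
N <- 64
pop <- 1:N

# Simple random sampling: every unit has an equal chance
srs <- sample(pop, size = 8)

# Systematic sampling: random start, then every k-th unit
k <- N / 8                       # k = 8
start <- sample(1:k, 1)
sys <- seq(from = start, by = k, length.out = 8)

# Stratified sampling: a random sample from each of 4 strata of 16
strata <- split(pop, rep(1:4, each = 16))
strat <- unlist(lapply(strata, sample, size = 2))

# Cluster sampling: randomly pick 2 of 4 clusters, keep all their units
clusters <- split(pop, rep(1:4, each = 16))
clus <- unlist(clusters[sample(1:4, 2)])
```

`sample()` without `replace = TRUE` draws without replacement, which is the usual survey setting.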
2.7.4. Types of Sampling Methods
• Non-Probability samples
– Convenience Sampling: Example: A researcher stands outside
a mall and surveys the first 100 people who walk by. The
sample is selected based on the convenience of accessibility.
– Judgmental or Purposive Sampling: Example: A researcher
selects experts in a particular field to interview about their
opinions on a new technology, based on their expertise.
– Snowball Sampling: Example: A researcher studying a rare
medical condition starts by interviewing a few patients and
then asks them to refer others who have the condition. The
sample grows through referrals from participants.

62
2.7.4. Types of Sampling Methods
• Non-Probability samples (cont’d)
– Quota Sampling: Example: A researcher sets a quota to
interview 50% males and 50% females for a survey, but within
each group, participants are selected non-randomly (such as
through convenience sampling).
– Self-Selection Sampling: Example: People are invited to
participate in an online survey, and only those who are
interested and willing choose to respond. There is no control
over who decides to participate.

63
2.7.4.1. Examples of Survey Designs

Source: Jeff Pitblado, Associate Director, Statistical Software at StataCorp LP. 2009 Canadian Stata Users
Group Meeting. Available at http://www.stata.com/meeting/canada09/ca09_pitblado_handout.pdf

64
2.7.5. Advantages of Sampling (over Census)
• Simple random: Easy to implement
• Systematic: Simple procedure
• Stratified: Ensures representation across groups
• Cluster: Cost-effective for large populations
• Non-probability : Cost-effective and quick for
exploratory research or when a random sample is
not feasible.

65
2.7.6. Disadvantages of Sampling
• Simple random: May not represent subgroups
well
• Systematic: Potential for bias with cyclical data
• Stratified: Requires prior knowledge of
population
• Cluster: Lower precision, larger samples needed
• Non-probability : Potential for bias and lack of
generalizability to the larger population.

66
2.7.7. Evaluating Survey Quality
• What is the purpose of the survey?
• Is the survey based on a probability sample?
• Total survey error divided into
– Sampling error : always exists (for probability
surveys)
– Non-sampling error
• Coverage error – appropriate frame
• Nonresponse error – follow up
• Measurement error – good questions elicit good
responses

67
2.7.8. Types of Survey Errors
• Coverage error: excluded from frame
• Nonresponse error: follow up on non-responses
• Measurement error: bad question!
• Sampling error: chance differences from sample
to sample
68
2.8. The Role of Probability in Statistics
• Foundation for understanding uncertainty
and variability in data
• Data as realizations of chance processes
(e.g., sample averages)
• Essential for making inferences about
populations from samples

69
2.8.1. Probability as the Basis for Statistics
• Provides mathematical framework for
quantifying uncertainty
• Allows modeling of random phenomena in real-
world situations
• Underpins key statistical concepts (e.g.,
distributions, hypothesis testing)

70
2.8.2. Understanding Variability Through
Probability

• Probability explains why different samples
yield different results
• Probability helps quantify the likelihood of
various outcomes
• Crucial for assessing reliability of statistical
estimates
71
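A quick simulation makes this concrete: draw many samples from one population and watch the sample means vary (the population and sample sizes below are invented for illustration):

```r
set.seed(42)
# An artificial skewed population with true mean 10
population <- rexp(100000, rate = 1/10)

# 1,000 different samples of size 50 give 1,000 different means
sample_means <- replicate(1000, mean(sample(population, size = 50)))

mean(sample_means)   # close to the population mean of 10
sd(sample_means)     # the standard error, roughly 10 / sqrt(50)
```

A histogram of `sample_means` (e.g., `hist(sample_means)`) shows the sampling distribution that connects sample statistics to population parameters.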
2.8.3. From Probability to Statistical
Inference
• Sampling distributions
connect sample statistics
to population parameters
• Enables calculation of
confidence intervals and
p-values
• Forms the basis for
decision-making under
uncertainty

72
Introduction to R and RStudio

73
1. Introduction: Computing Resources
R:
• Many different datasets (and other “objects”) available at the
same time
• Datasets can be of any dimension
• Functions can be modified
• Experience is interactive: you program until you get exactly
what you want
• One-stop shopping: almost every analytical tool you can think
of is available
• R is free and will continue to exist. Nothing can make it go
away; its price will never increase.

Commercial Packages:
• One dataset available at a given time
• Datasets are rectangular
• Functions are proprietary
• Experience is passive: you choose an analysis and they give
you everything they think you need
• Tend to have limited scope, forcing you to learn additional
programs; extra options cost more and/or require you to
learn a different language (e.g., SPSS Macros)
• They cost money. There is no guarantee they will continue to
exist, but if they do, you can bet that their prices will always
increase.
74
1. Introduction: Computing Resources

CAVEAT:
CAVEAT:
• “Using R is a bit akin to smoking.
The beginning is difficult, one
may get headaches and even gag
the first few times. But in the
long run, it becomes pleasurable
and even addictive. Yet, deep
down, for those willing to be
honest, there is something not
fully healthy in it.” --Francois
Pinard
75
2. R Basics
• To enable us to use R, we first discuss its
capabilities, then describe how to install
it, then illustrate some basic commands
and how to obtain help.
• Various ways of communicating with R
– Interactively: (through console)
– Batch Processing: (through scripts)
– Point and Click: (through “add ons” Rcmdr,
rattle, deducer)

76
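A first interactive session might look like the following, typed line by line at the console prompt (the values are arbitrary):

```r
2 + 3 * 4          # arithmetic with the usual precedence: 14
x <- c(1, 5, 9)    # <- assigns; c() combines values into a vector
x / 2              # operations are vectorized: 0.5 2.5 4.5
length(x)          # number of elements: 3

help(mean)         # open the help page for a function (also: ?mean)
```

The same lines saved in a `.R` file and run with `source()` is the batch-processing mode mentioned above.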
2.1. What is R?
• R is a statistical programming environment for
performing standard & specialized statistical tools
– “environment”: intended to characterize R as a fully planned
and coherent system, rather than an incremental accretion of
very specific and inflexible tools, as is frequently the case with
other data analysis software
• R is a free open-source statistical package based on
the S language developed at Bell Labs (later
commercially released by MathSoft as S-Plus).
• Although R is a programming language, i.e., generating
computer code to complete tasks is needed, there are
Graphical User Interface (GUI) add-ons like R
Commander, which allow users to “point and click”.
77
2.1. What is R?
• Initially developed by Robert Gentleman and Ross Ihaka of the University of Auckland; now maintained by the “R core development team”
– Since 1997: international R-core team
~20 people & 1000s of code writers
and statisticians happy to share their
libraries
– About 2 million R users globally : forums, mailing lists, blogs
• Cross platform compatibility: Windows, MacOS, Linux
• Very powerful for writing programs.
– Many statistical functions are already built in.
– Contributed packages expand the functionality to cutting
edge research.
78
2.1. What is R?
Advantages:
o Fast and free.
o State of the art: statistical researchers provide their methods as R packages. SPSS and SAS are years behind R!
o 2nd only to MATLAB for graphics.
o Mx, WinBUGS, and other programs use or will use R.
o Active user community.
o Excellent for simulation, programming, computer-intensive analyses, etc.
o Forces you to think about your analysis.
o Interfaces with database storage software (SQL).

Disadvantages:
o Not user friendly at the start: steep learning curve, minimal GUI.
o No commercial support; figuring out correct methods or how to use a function on your own can be frustrating.
o Easy to make mistakes and not know.
o Working with large datasets is limited by RAM.
o Data prep & cleaning can be messier & more mistake-prone in R vs. SPSS or SAS.
o Some users complain about hostility on the R listserv.
79
2.1. What is R?
• As of 7 Sept 2024, there are 21,229 add-on packages
(http://cran.r-project.org/src/contrib/PACKAGES.html)
– This is an enormous advantage - new techniques available
without delay, and they can be performed using the R
language you already know.
– Allows you to build a customized statistical program
suited to your own needs.
– Downside = as the number of packages grows, it is
becoming difficult to choose the best package for your
needs, & QC is an issue.
80
2.2. Installation
• R home page: http://www.r-project.org/
Other Important Sites:
• R Archive: http://cran.r-project.org/
• R FAQ (frequently asked questions about R):
http://cran.r-project.org/doc/FAQ/R-FAQ.html
• R manuals:
http://cran.r-project.org/manuals.html
• R seek engine: http://www.rseek.org
81
2.3. Starting Up
1. Look for the R shortcut.
2. Or click Start ► Programs ► R ► Rx64 3.2.1 (for 64-bit machines)
82
2.3.1. Start-Up Windows
R Console:
INTERACTIVE COMMAND WINDOW: commands are typed here
83
2.3.1. Start Up Windows
 R command window (console)
◦ Used for entering commands, data manipulations, analyses,
graphing
◦ Output: results of analyses, queries, etc. are written here
◦ Toggle through previous commands by using the up and
down arrow keys
 The R workspace
◦ Current working environment
 Most functionality is provided through built-in and user-created functions; all data objects are kept in memory during an interactive session.
◦ Basic functions are available by default.
◦ Other functions are contained in packages.
84
2.3.2. Menu & Tool Bars
MENU BAR:
TOOL BAR:
85
2.3.3. Changing GUI Preferences
• Click on Edit ► GUI Preferences
86
2.3.4. Buttons
Button Functions
• Open : Opens R file.
• Load Workspace
• Save: Saves the current data.
• Copy
• Paste
• Copy and Paste
• Stop current computation
• Print
87
2.3.5. Opening a Script Window
• Clicking File ► New Script gives you a script window.
88
2.3.5. Opening a Script Window
R scripts
◦ A text file containing commands that you would enter on the command line of R
◦ To place a comment in an R script, use a hash mark (#) at the beginning of the line
89
2.3.6. Assignments and Operations
• Arithmetic and Mathematical Operations:
+, -, *, /, ^ are the standard arithmetic operators.
Modulo: %%
sqrt, exp, log, log10, sin, cos, tan, …
• Functions:
– Almost everything in R is done through functions.
Here we only refer to numeric and character
functions that are commonly used in creating or
recoding variables.
– Note that while the examples here apply
functions to individual variables, many can be
applied to vectors and matrices as well.
90
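A quick illustration of these operators and functions at the console (outputs shown as comments):

```r
# standard arithmetic operators
7 + 3      # 10
7 %% 3     # 1 (modulo: the remainder of 7/3)
2 ^ 4      # 16
# built-in mathematical functions
sqrt(16)   # 4
log10(100) # 2
exp(0)     # 1
```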
2.3.6. Assignments and Operations
• Numeric Functions:
Function Description
abs(x) absolute value
sqrt(x) square root
trunc(x) trunc(5.99) is 5
round(x, digits=n) round(3.475, digits=2) is 3.48
log(x) natural logarithm
log10(x) common logarithm
exp(x) e^x
• Character Functions:
Function Description
substr(x, start=n1, Extract or replace substrings in a character vector.
stop=n2) x <- "abcdef"
substr(x, 2, 4) is "bcd"
substr(x, 2, 4) <- "22222" is "a222ef"
91
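The numeric and character functions above in action (outputs shown as comments):

```r
abs(-3)                     # 3
trunc(5.99)                 # 5
round(3.14159, digits = 2)  # 3.14
x <- "abcdef"
substr(x, 2, 4)             # "bcd"
substr(x, 2, 4) <- "222"    # replacement form: x is now "a222ef"
x
```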
2.3.6. Assignments and Operations
Probability Functions :
• Notation (function prefixes):
Probability density function: d
Cumulative distribution function: p
Quantile function: q
Random generation for distribution: r
• Examples:
– Normal distribution:
• dnorm(x, mean=0, sd=1, log = FALSE)
• pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
• qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
• rnorm(n, mean=0, sd=1)
92
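Evaluating the four normal-distribution functions shows how the prefixes fit together:

```r
dnorm(0)     # 0.3989423: density of N(0,1) at 0
pnorm(1.96)  # 0.9750021: P(Z <= 1.96)
qnorm(0.975) # 1.959964: the 97.5th percentile (inverse of pnorm)
set.seed(1)  # make the random draws reproducible
rnorm(3, mean = 5, sd = 2)  # three random draws from N(5, 2)
```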
2.3.6. Assignments and Operations
Statistical Functions :
Excel R
NORMDIST pnorm(7.2,mean=5,sd=2)
NORMINV qnorm(0.9,mean=5,sd=2)
LOGNORMDIST plnorm(7.2,meanlog=5,sdlog=2)
LOGINV qlnorm(0.9,meanlog=5,sdlog=2)
GAMMADIST pgamma(31, shape=3, scale =5)
GAMMAINV qgamma(0.95, shape=3, scale =5)
GAMMALN lgamma(4)
WEIBULL pweibull(6, shape=3, scale =5)
BINOMDIST pbinom(2,size=20,p=0.3)
POISSON ppois(2, lambda =3)
93
2.3.6. Assignments and Operations
Other Useful Functions
Function Description
seq(from, to, by)   generate a sequence
                    indices <- seq(1,10,2)
                    # indices is c(1, 3, 5, 7, 9)
rep(x, ntimes)      repeat x n times
                    y <- rep(1:3, 2)
                    # y is c(1, 2, 3, 1, 2, 3)
cut(x, n)           divide a continuous variable into a factor with n levels
                    y <- cut(x, 5)
94
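For example:

```r
seq(2, 10, by = 2)   # 2 4 6 8 10
rep(c("a", "b"), 3)  # "a" "b" "a" "b" "a" "b"
x <- c(1.2, 3.8, 2.5, 9.1, 6.4)
bins <- cut(x, 3)    # a factor with 3 equal-width levels
table(bins)          # how many values fall in each level
```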
2.3.6. Assignments and Operations
• Matrix Arithmetic.
 * is element wise multiplication
 %*% is matrix multiplication
• Assignment
 To assign a value to a variable use the “<-” or equals (=) operator
95
2.3.6. Assignments and Operations
• Objects can be used in other calculations.
To print object just enter name of object.
• Restrictions for name of object:
 Object names cannot contain `strange' symbols like !,
+, -, #.
 A dot (.) and an underscore (_) are allowed; a name may also start with a dot.
 Object names can contain a number but cannot start
with a number.
 R is case sensitive, X and x are two different objects,
as well as temp and temP.

96
2.3.6. Assignments and Operations
The assignment operator <-
x <- 25
assigns the value 25 to the variable x
y <- 3*x
assigns the value of 3 times x (75 in this case) to the variable y
r <- 4
area.circle <- pi*r^2
area.circle
97
NOTE: R is case-sensitive (y ≠ Y)
2.3.6. Assignments and Operations
We can evaluate truth or falsity of expressions:
2>1
1>2&2>1
generate sequences (and perform operations on
them)
3*(1:5)
We can do matrix operations
a <- 1:3
b <- 3:5
a*b
a%*%b
a%*%t(b)
98
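Running the matrix operations above makes the distinction concrete:

```r
a <- 1:3
b <- 3:5
a * b       # element-wise: 3 8 15
a %*% b     # inner product: a 1x1 matrix containing 26
a %*% t(b)  # outer product: a 3x3 matrix
```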
2.3.7. R Objects and Indexing Techniques
The most basic data structure in R is a vector, a one-dimensional array / sequence of values of the same type (numeric, integer, character, logical, complex).

All basic operations in R work element-wise on vectors, where the shortest argument is “recycled” if necessary. This goes for arithmetic operations (addition, subtraction, …), comparison operators (==, <=, …), logical operators (&, |, !, …) and basic math functions.
99
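Recycling in action: the shorter vector is repeated to match the longer one.

```r
(1:6) + c(10, 20)  # 11 22 13 24 15 26 -- c(10, 20) is recycled three times
(1:3) ^ 2          # 1 4 9 -- the scalar 2 is recycled for each element
c(1, 2, 3) == c(1, 9, 3)  # TRUE FALSE TRUE
```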
2.3.7. R Objects and Indexing Techniques
Another R object is a matrix, a 2-d (rectangular) array where the values (as in a vector) must be of the same data class/type. A further generalization is an array, identical to a matrix but with 3+ dimensions.
Still another useful R object is a data frame, which can have vectors of multiple types, e.g., one column may be string/character, another integer/numeric, etc. As in a matrix, the columns of a data frame must all have the same length. A data frame is the closest analogue to a spreadsheet. Unlike other objects, a data frame has special functions.
101
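A small data frame with one column per type (the column names are illustrative):

```r
df <- data.frame(
  name  = c("ana", "ben", "cris"),  # character column
  score = c(88, 92, 79),            # numeric column
  pass  = c(TRUE, TRUE, FALSE)      # logical column
)
str(df)   # each column keeps its own type
nrow(df)  # 3 -- all columns have the same length
```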
2.3.7. R Objects and Indexing Techniques
Another R object is a list, an ordered collection of elements. We can also have a list of lists of lists… A list is the most flexible of R objects: it can have any class, length or structure.
102
2.3.7. R Objects and Indexing Techniques
• Example :
n1 <- 25
n1
typeof(n1)
v1 <- 1:5
v1
is.vector(v1)
v2 <- c("t", "o", "o", "t", "s")
v2
is.vector(v2)
v2 <- c(FALSE, TRUE, TRUE)
103
2.3.7. R Objects and Indexing Techniques
It can be helpful to coerce objects, i.e., change an R data object from one type to another, e.g., character vector to logical, matrix to data frame, (double-precision) numeric to integer, etc.
(coerce1 <- c(1, "a", TRUE) )
typeof(coerce1)
(coerce2 <- c(5))
typeof(coerce2)
(coerce3 <- as.integer(5)) #coerce numeric to integer
typeof(coerce3)
(coerce4 <- c("1", "2", "3") )
typeof(coerce4)
(coerce5 <- as.numeric(c("1", "2", "3"))) #coerce to numeric
104
2.3.7. R Objects and Indexing Techniques
• Accessing elements of a vector, matrix,
data frame or list is achieved through a
process called indexing. Indexing may be
done by
– a vector of positive integers: to indicate
inclusion
– a vector of negative integers: to indicate
exclusion
– a vector of logical values: to indicate which
are in and which are out
– a vector of names: if the object has a names attribute
105
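The four indexing styles on a small named vector:

```r
v <- c(a = 10, b = 20, c = 30, d = 40)
v[c(1, 3)]      # positive integers: elements 1 and 3 (10, 30)
v[-1]           # negative integer: everything except element 1
v[v > 15]       # logical: elements where the condition is TRUE (20, 30, 40)
v[c("b", "d")]  # names: elements "b" and "d"
```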
2.3.7. R Objects and Indexing Techniques
• Example : producing a random sample of values between
one and five, twenty times and determining which
elements are equal to 1
x <- sample(1:5, 20, rep=T)
x
x == 1
ones <- (x == 1) # parentheses unnecessary
• Suppose we now want to replace the ones appearing in
the sample with zeros and store the values greater than 1
into an object called y
x[ones] <- 0
x
others <- (x > 1) # parentheses unnecessary
y <- x[others]
106
2.3.7. R Objects and Indexing Techniques
• The following command queries the x vector and
reports the position of each element that is greater
than 1
which(x > 1)
• Example : creating a matrix and a data frame, and
accessing elements
value <- rnorm(6)
dim(value) <- c(2,3)
value # notice we now have a matrix
dim(value) <- NULL
value # converted back to a vector
107
2.3.7. R Objects and Indexing Techniques
• Other than the use of the dim function, we could use the matrix function
matrix(value,2,3)
matrix(value,2,3,byrow=T) # to fill by rows
• Use the rbind function to bind a row onto an
already existing matrix
value <- matrix(rnorm(6),2,3,byrow=T)
value2 <- rbind(value,c(1,1,2))
value2
• To bind a column onto an already existing matrix,
the cbind function can be used
value3 <- cbind(value,c(1,1,2))
108
2.3.7. R Objects and Indexing Techniques
• The function data.frame converts a matrix or
collection of vectors into a data frame
value3 <- data.frame(value3)
value3
• Row and column names are already assigned to a
data frame but they may be changed using the
names and row.names functions. To view the row
and column names of a data frame:
names(value3)
row.names(value3)
• Alternative labels can be assigned:
names(value3) <- c("C1","C2","C3","C4")
109
2.3.7. R Objects and Indexing Techniques
• Data frames can be indexed by either column
value3 <- data.frame(value3)
value3
value3[, "C1"] <- 0
value3
• Or by row :
value3["R1", ] <- 0
value3
value3[] <- 1:12
value3
110
2.3.7. R Objects and Indexing Techniques
EXERCISE 1: How do we access
(a) the first two rows of the matrix/data frame?
(b) the first two columns of the matrix/data frame?
(c) the elements with a value greater than five (and ensure that we have a vector produced)?
111
2.3.7. R Objects and Indexing Techniques
Solutions to EXERCISE 1: How do we access
(a) the first two rows of the matrix/data frame?
value3[1:2,]
(b) the first two columns of the matrix/data frame?
value3[,1:2]
(c) the elements with a value greater than five (and
ensure that we have a vector produced?)
as.vector(value3[value3>5])
112
2.3.7. R Objects and Indexing Techniques
EXERCISE 2: Execute the following code and think
about why it works the way it does:
a <- 1:3
# vectors have variables of one type
c(1, 2, "three")
# shorter arguments are recycled
(1:3) * 2* (1:4) * c(1, 2)
# warning! (why?)
(1:4) * (1:3)

113
2.3.7. R Objects and Indexing Techniques
• Lists can be created using the list function. Like data frames, they can incorporate a mixture of modes into one list, and each component can be of a different length or size.
L1 <- list(x = sample(1:5, 20, rep=T), y = rep(letters[1:5], 4), z =
rpois(20, 1))
L1
• The first component can be accessed in several ways:
L1[["x"]]
L1$x
L1[[1]]
• What about
L1[1] # this is a sublist
114
2.3.7. R Objects and Indexing Techniques
Each element of a vector, matrix, data frame and
list can be given a name. This can be done by
passing named arguments to the c() function or
later with the names function. Such names can be helpful in giving meaning to your variables.
For example compare the vector
x <- c("red", "green", "blue")
with
capColor = c(huey = "red", duey = "blue", louie = "green")

115
2.3.7. R Objects and Indexing Techniques
As pointed out earlier, elements of a vector,
matrix, data frame and list can be selected or
replaced using the square bracket operator [ ]
which accepts either a vector of names, index
numbers, or a logical.
In the case of a logical, the index is recycled if it is
shorter than the indexed vector.
In the case of numerical indices, negative indices omit, instead of select, elements.
Negative and positive indices are not allowed in
the same index vector.
116
2.3.7. R Objects and Indexing Techniques
You can repeat a name or an index number, which results in multiple instances of the same value.
For example, with the capColor vector defined earlier:
capColor["louie"]
names(capColor)[capColor == "blue"]
x <- c(4, 7, 6, 5, 2, 8)
I <- x < 6
J <- x > 7
x[I | J]
x[c(TRUE, FALSE)]
x[c(-1, -2)]
117
2.3.7. Workspace
• Objects that you create during an R session are held in memory; the collection of objects that you currently have is called the workspace.
• This workspace is not saved on disk unless you tell R to do so. This means that your objects are lost when you close R without saving them, or worse, when R or your system crashes during a session.
118
2.3.7. Workspace
• When you close the RGui or the R
console window, the system will ask if
you want to save the workspace image.
• If you select to save the workspace image
then all the objects in your current R
session are saved in a file .RData. This is a
binary file located in the working
directory of R, which is by default the
installation directory of R.

119
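The workspace can also be saved and restored explicitly; the filename below is only an illustration:

```r
x <- 42
save.image("mywork.RData")  # save every object in the workspace (hypothetical filename)
rm(list = ls())             # clear the workspace
load("mywork.RData")        # restore the saved objects
ls()                        # "x" is back
```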
2.3.8. Seeking Help
• R has a very good help system built in.
• If you know which function you want help
with simply use ?_______ with the
function in the blank.
?hist
args(hist)
• If you don’t know which function to use, then
use help.search(“_______”).
help.search("histogram")
120
2.3.8. Seeking Help
Obtaining Html help
• We can do a search with the Menu bar:
Help ► Html help
• If you want to use a search engine, try the R seek engine: http://www.rseek.org
121
2.3.8. Seeking Help
Tutorials
• Each of the following tutorials is in PDF format.
– P. Kuhnert & B. Venables, An Introduction to R: Software for Statistical Modeling & Computing
– J.H. Maindonald, Using R for Data Analysis and Graphics
– B. Muenchen, R for SAS and SPSS Users
– B. Muenchen, R for Stata Users
– Getting Started in R~Stata
– UCLA’s Data Analysis Using R
– W.J. Owen, The R Guide
– D. Rossiter, Introduction to the R Project for Statistical Computing for Use at the ITC
– W.N. Venables & D. M. Smith, An Introduction to R
122
2.3.8. Seeking Help
Tutorials (cont’d)
– R time series tutorial
– R Concepts and Data Types presentation by Deepayan Sarkar
– Interpreting Output From lm()
– The R Wiki
– An Introduction to R
– Import / Export Manual
– R Reference Cards
– Anthony Damico’s Two Minute YouTube Tutorials in Using R
123
2.3.9. Quitting
Three ways of quitting an R session:
1. Enter in the command window:
quit()
2. Click on
File ► Exit
3. Click on Close button (X at upper right hand
corner of R console window).

124
2.4.1. Datasets
• R comes with a number of sample datasets
that you can experiment with. Type
data( )
to see the available datasets. The result will
depend on which packages you have
loaded. Type
help(datasetname)
for details on a sample dataset. For ex.,
help("iris")
provides info on the Iris Dataset
125
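For example, the built-in iris data can be inspected immediately:

```r
data()        # list the datasets available in loaded packages
dim(iris)     # 150 5: 150 flowers, 5 variables
head(iris, 3) # first three rows
help("iris")  # documentation for the dataset
```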
2.4.2. Packages / Libraries / Add-ons
• One of the strengths of R is that the system can easily
be extended.
• The system allows you to write new functions and package those
functions in a so called `R package' (or `R library').
• The R package may also contain other R objects, for example data
sets or documentation.
• R packages/libraries are bundles of codes that add new functions to
R so we can do new things. We have basic packages (installed with R
but not loaded by default) and contributed (or third party) packages
(that need to be downloaded, installed and loaded separately)
• There is a lively R user community and many R
packages have been written and made available on
CRAN for other users.
• Just a few examples, there are packages for portfolio optimization,
drawing maps, exporting objects to html, time series analysis, spatial
statistics and the list goes on and on.
126
2.4.1. Datasets
• Suppose we would like to work with the Aids2 dataset in
the package / library MASS (Modern Applied Statistics with
S)
data("Aids2", package = "MASS")
• If the search for the dataset was successful, the command above attaches the data object to the R global environment. The command
ls()
lists the names of all objects currently stored
in the global environment, and, as the result
of the previous command, a variable named
Aids2 is available for further manipulation.
Now try
print(Aids2)
127
2.4.2. Packages / Libraries / Add-ons
• Contributed packages can be found at:
• CRAN
• Crantastic: crantastic.org
• GitHub: github.com/trending/r
128
2.4.2. Packages / Libraries / Add-ons
129
2.4.2. Packages / Libraries / Add-ons
• hadoop with R (RHadoop)
https://github.com/RevolutionAnalytics/RHadoop/wiki

For Loading Data
• DBI - The standard for communication between R and relational database management systems. Packages that connect R to databases depend on the DBI package.
• odbc - Use any ODBC driver with the odbc package to connect R to your database. Note: RStudio
professional products come with professional drivers for some of the most popular databases.
• RMySQL, RPostgreSQL, RSQLite - If you'd like to read in data from a database, these packages are a good place to start. Choose the package that fits your type of database.
• XLConnect, xlsx - These packages help you read and write Microsoft Excel files from R. You can also just export your spreadsheets from Excel as .csv's.
• foreign - Want to read a SAS data set into R? Or an SPSS data set? Foreign provides functions that
help you load data files from other programs into R.
• haven - Enables R to read and write data from SAS, SPSS, and Stata.
• httr – for working with website data
• rio – for importing and exporting data
• markdown – for creating interactive notebooks or rich documents for sharing your information
• pacman – for package management
130
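Base R alone handles the most common case, delimited text; a self-contained sketch (the SAS/SPSS readers in foreign and haven follow the same read-style pattern):

```r
# read a small CSV from an in-memory text connection
csv_text <- "id,score\n1,88\n2,92"
df <- read.csv(textConnection(csv_text))
df  # a 2-row data frame with columns id and score
# for a real file: df <- read.csv("scores.csv")  -- filename is illustrative
```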
2.4.2. Packages / Libraries / Add-ons
For Manipulating Data
• tidyverse - An opinionated collection of R packages designed for data science that
share an underlying design philosophy, grammar, and data structures. This
collection includes all the packages in this section, plus many more for data import,
tidying, and visualization listed here.
• dplyr - Essential shortcuts for subsetting, summarizing, rearranging, and joining
together data sets. dplyr is our go to package for fast data manipulation.
• tidyr - Tools for changing the layout of your data sets. Use the gather and
spread functions to convert your data into the tidy format, the layout R likes
best.
• stringr - Easy to learn tools for regular expressions and character strings.
• lubridate - Tools that make working with dates and times easier.
131
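A minimal dplyr sketch, assuming the package is installed (the mtcars data ships with R):

```r
library(dplyr)
mtcars %>%
  filter(cyl == 4) %>%             # keep only 4-cylinder cars
  summarise(mean_mpg = mean(mpg),  # average fuel economy
            n_cars   = n())        # how many cars remain
```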
2.4.2. Packages / Libraries / Add-ons
For Visualizing Data
• ggplot2 - R's famous package for making beautiful graphics. ggplot2 lets you use the
grammar of graphics to build layered, customizable plots.
• ggvis - Interactive, web based graphics built with the grammar of graphics.
• rgl - Interactive 3D visualizations with R
• shiny – creates interactive apps that can be hosted on a website
• htmlwidgets - A fast way to build interactive (javascript based) visualizations with R. Packages
that implement htmlwidgets include:
• leaflet (maps)
• dygraphs (time series)
• DT (tables)
• DiagrammeR (diagrams)
• networkD3 (network graphs)
• threejs (3D scatterplots and globes)
• googleVis - Lets you use Google Chart tools to visualize data in R. Google Chart tools used to be called Gapminder, the graphing software Hans Rosling made famous in his TED talk.
132
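A minimal ggplot2 sketch, assuming the package is installed (the iris data ships with R):

```r
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +                           # scatterplot layer
  labs(title = "Iris sepal measurements")  # grammar of graphics: add layers with +
```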
2.4.2. Packages / Libraries / Add-ons
For Modeling Data
• tidymodels - A collection of packages for modeling and machine learning using tidyverse
principles. This collection includes rsample, parsnip, recipes, broom, and many other general
and specialized packages listed here.
• car - car's Anova function is popular for making type II and type III Anova tables.
• mgcv - Generalized Additive Models
• lme4/nlme - Linear and Non-linear mixed effects models
• randomForest - Random forest methods from machine learning
• multcomp - Tools for multiple comparison testing
• vcd - Visualization tools and tests for categorical data
• glmnet - Lasso and elastic-net regression methods with cross validation
• survival - Tools for survival analysis
• caret - Tools for training regression and classification models

133
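As one sketch, the survival package (a recommended package bundled with R) fits Kaplan-Meier curves from its own lung dataset:

```r
library(survival)
fit <- survfit(Surv(time, status) ~ sex, data = lung)  # Kaplan-Meier curves by sex
summary(fit)$table  # events and median survival time per group
plot(fit)           # draw the two survival curves
```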
2.4.2. Packages / Libraries / Add-ons
• When you download R, a number of packages are downloaded as well.
• To use a function in an R package, that package has to be attached to the system.
• When you start R, not all of the downloaded packages are attached; only seven packages are attached to the system by default.
• You can use the function search to see a list of packages that are currently attached to the system; this list is also called the search path.
search( )
134
2.4.2.1. Attaching Packages
• To attach another package to the system you can use the menu or the library function.
Via the menu:
• Select the `Packages' menu and select `Load package...'; a list of available packages on your system will be displayed. Select one and click `OK'; the package is now attached to your current R session.
Via the library function:
library()
library(MASS)
drivers
135
2.4.2.2. Installing Packages
• IMPORTANT TO NOTE:
Before you download a
new package, make sure to
run R as administrator
– Right click on the shortcut
– Choose “Run as
administrator”:
136
2.4.2.2. Installing Packages
• Suppose we want to install a package called Rcmdr:
– Choose Rcmdr in the Packages ► Install packages menu
– Or alternatively run the command:
install.packages("Rcmdr")
137
2.4.2.2. Installing Packages
• Alternatively, after downloading RStudio
https://www.rstudio.com/products/rstudio/download/
– Load RStudio, then go to the “Packages” tab of RStudio and click on “Install Packages”.
– The first time you do this you’ll be prompted to choose a CRAN mirror. R will download all necessary files from the server you select here. Choose any mirror site.
138
2.4.2.2. Installing Packages
– To install Rcmdr, start typing “Rcmdr” until you see it appear in a list. Select the first option (or finish typing Rcmdr), ensure that “Install dependencies” is checked, and click “Install”.
139
TIP
• Download and install the package pacman. How?
• Then run this command to install and load specific packages with pacman:
pacman::p_load(pacman, dplyr, GGally, ggplot2,
ggthemes, ggvis, httr, lubridate, plotly, rio,
rmarkdown, shiny, stringr, tidyr, Rcmdr)
library(datasets)
p_unload(dplyr, tidyr, stringr)
p_unload(all)
detach("package:datasets", unload= TRUE)
cat("\014")
140
Summary and Key Points
• Statistics is crucial for informed decision-making in the data-driven world
• The statistical process involves: Questions, Data Collection, Description, Inference, and Decisions
• Data collection methods include observational studies and designed experiments
• Descriptive statistics helps understand data characteristics
• Statistical inference allows drawing conclusions about populations from samples
• R provides a powerful environment for statistical computing and data analysis
141
END OF DISCUSSIONS FOR SESSION 1

NEXT DISCUSSIONS
• Summary Measures
• Tables and Visuals
• Implementing in R
142