W1 Class Overview and R Basics
Programming
Instructor
David Li
Course Logistics
• Basic information
• Requirements
• Goal
CS636 Data Analytics with R Programming
• Textbooks
– R Programming for Data Science, by Roger D. Peng
– Using R for Introductory Statistics, by John Verzani, 2014, ISBN 1466590734
– Advanced R, by Hadley Wickham, ISBN 9781466586963
• Website
– https://njit.instructure.com/courses/10227
Requirements
• Homework & computing lab exercise (10%)
• Quiz (20%)
• Term Project (10%)
• Midterm (20%)
• Final (40%)
You should sign the attendance sheet at the end of each class. An extra
bonus based on attendance will be determined later.
Homework (5%)
• Homework assignments
– Try to do it independently. Discussion is allowed, but copying is
forbidden.
• Homework Grading Policy
– There may be several homework assignments, but only one of yours
(the worst one) will be picked for grading. Namely, if you miss one
assignment, you get 0.
• Late homework policy
– 25% penalty per late day
– Not accepted more than 3 days late
Lab exercise (5%)
• Have a lab session every week
• Lab exercises
– Focus on R computing exercises
– 3 students per group. Please find your group mates as quickly as
possible.
– Some answers may be selected for discussion by the end of lab
session.
– Lab exercise grade is based on the attendance sheet.
Two Term Projects (10%)
• Submit code and report to summarize what you have
done and results you obtained.
• Prepare for presentation and demo.
• 1 to 4 students per group. It can be the same as your lab group.
• More details to be announced soon
• Cheating/copying is strictly prohibited. I will report it to the
Dean and you will get an F in this course.
• If you think your group members are not contributing,
talk to me.
Quiz (20%)
What is R?
• Statistical computer language similar to S-plus
• Interpreted language (like Matlab)
• Has many built-in (statistical) functions
• Easy to build your own functions
• Good graphic displays
• Extensive help files
Strengths
Weaknesses
Not as commonly used by non-statisticians
Not a compiled language: the interpreter can be very slow, but R lets you
call your own C/C++ code
R packages
A sample job opening
When to use R?
• When
– The task requires standalone computing or analysis on individual servers.
– Great for exploratory work: it's handy for almost any type of data
analysis because of the huge number of packages and necessary
tools to get up and running quickly
– R can even be part of a big data solution.
How to use/learn R?
• How
– (optional) Install and use the RStudio IDE
– (optional) Install Jupyter with the R kernel
– Get started with R (basic syntax)
– Learn to use the popular packages
• dplyr, plyr and reshape2 for data manipulation
• stringr for string operation
• ggplot2 for data visualization
• …
– Practice (a lot), including on real projects
Install RStudio
• An integrated development environment (IDE) available for R
– a nice editor with syntax highlighting
– there is an R object viewer
– there are a number of other nice features that are integrated
• How to install
– https://www.youtube.com/watch?v=9-RrkJQQYqY
Install Jupyter with R kernel
1. Install R and RStudio
2. Download and install the latest Anaconda at
https://www.anaconda.com/download/
3. On Windows, add your R bin path and Anaconda3 Scripts path to your
environment variable "Path"
– On my computer the R bin path is C:\Program Files\R\R-3.5.1\bin
– The Anaconda3 Scripts path is C:\ProgramData\Anaconda3\Scripts. The paths on your
computer may vary.
– How to set the path and environment variables in Windows
https://www.computerhope.com/issues/ch000549.htm
– Install the R kernel for Jupyter (PLEASE DO THIS STEP IN THE R CONSOLE, not in RStudio or
RGui)
https://irkernel.github.io/installation/
https://stackoverflow.com/questions/44056164/jupyter-client-has-to-be-installed-but-jupyter-kernelspec-version-exited-wit
– Then you can start "Jupyter Notebook" from the start menu.
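The kernel installation step above can be sketched in R as follows. This is a minimal sketch, assuming the IRkernel package from CRAN and a Jupyter installation already reachable through "Path"; the variable name has_irkernel is only illustrative. The install commands are wrapped in interactive() so that they run only when typed or sourced in an interactive R console.

```r
# Check whether the IRkernel package is already installed (assumed CRAN package name).
has_irkernel <- requireNamespace("IRkernel", quietly = TRUE)

if (interactive()) {
  if (!has_irkernel) {
    install.packages("IRkernel")  # downloads from CRAN, needs network access
  }
  # Register the R kernel with Jupyter for the current user.
  # This step fails if the jupyter executable is not on the Path.
  IRkernel::installspec()
}
```

If installspec() complains that jupyter-client has to be installed, the Path setting from step 3 is the first thing to check.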
Starting and stopping R
• Starting
– Windows: Double click on the R icon
– Unix/Linux: type R (or the appropriate path on your
machine)
• Stopping
– Type q()
– q() is a function call
– Everything that happens in R is a function call
– Typing q without parentheses merely prints the content of the function
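A short sketch of the point about q and q(): the name alone refers to a function object, and the parentheses are what trigger the call. The variable names below are only illustrative, and q() itself is deliberately not called, since it would end the session.

```r
# q is an ordinary function object; typing its name prints its definition.
q

q_is_function <- is.function(q)
q_class <- class(q)

# Even arithmetic operators are functions and can be called by name.
three <- `+`(1, 2)

print(q_is_function)
print(q_class)
print(three)
```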
Writing R code
• Can input lines one at a time into R
• Can write many lines of code in any of your favorite text editors
(including RStudio) and run them all at once
– Simply paste the commands into R
– Use the function source("path/yourscript") to run in batch mode the
code saved in the file "yourscript" (use source("path/yourscript", echo=TRUE)
to have the commands echoed)
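To illustrate running a saved script, here is a minimal sketch that writes two lines to a temporary file and runs them with source(). The file and the variable names x and y are made up for the example; in practice the path would point to your own script.

```r
# Create a small throwaway script (stands in for "path/yourscript").
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 3 + 5",
             "y <- sqrt(x + 1)"),
           script_path)

# Run the whole file; echo = TRUE prints each command as it is evaluated.
source(script_path, echo = TRUE)

# Objects created by the script are now available in the workspace.
print(x)
print(y)

# Clean up the temporary file.
unlink(script_path)
```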
R as a Calculator
> log2(32)
[1] 5
> sqrt(2)
[1] 1.414214
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> plot(sin(seq(0, 2 * pi, length = 100)))
[Figure: sin(seq(0, 2 * pi, length = 100)) plotted against Index; the y-axis runs from -1.0 to 1.0]
Recalling Previous Commands
• In the history window, you can copy earlier commands
or paste them into the console window
Language layout
• Three types of statement
– expression: it is evaluated, printed, and the value is lost (3+5)
– assignment: passes the value to a variable but the result is not
printed automatically (out<-3+5)
– comment: (#This is a comment)
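The three statement types can be seen side by side in a few lines. This is a small sketch reusing the slide's 3+5 example; out2 is an extra name introduced only to show that wrapping an assignment in parentheses makes R print the value as well.

```r
# This is a comment: R ignores everything after the # sign.

3 + 5            # expression: evaluated and printed, then the value is lost

out <- 3 + 5     # assignment: the value is stored, nothing is printed
out              # typing the name is itself an expression, so the value prints

(out2 <- 3 + 5)  # parentheses around an assignment store and print in one step
```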
Naming conventions
• Names may use Roman letters, digits, underscores, and ‘.’; digits and
underscores cannot come first, and a leading ‘.’ cannot be followed by a digit
• Avoid using system names: c, q, s, t, C, D, F, I, T, diff, mean, pi,
range, rank, tree, var
• These rules hold for variables, data, and functions
• Variable names are case sensitive
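One way to check the naming rules is base R's make.names(), which rewrites any string that is not a syntactically valid name; a candidate that comes back unchanged is valid. The candidate names and the weight variables below are made up for illustration.

```r
# Hypothetical candidate names: the first three are valid, the last two are not.
candidates <- c("my.var", "my_var", ".hidden", "2fast", "_x")

# make.names() leaves valid names alone and alters invalid ones.
is_valid <- make.names(candidates) == candidates
print(data.frame(candidates, is_valid))

# Names are case sensitive: these are two different variables.
weight <- 70
Weight <- 82
same_value <- identical(weight, Weight)
print(same_value)
```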
Arithmetic operations and functions
• Most operations in R are similar to Excel and calculators
• Basic: +(add), -(subtract), *(multiply), /(divide)
• Exponentiation: ^
• Remainder or modulo operator: %%
• Matrix multiplication: %*%
• sin(x), cos(x), cosh(x), tan(x), tanh(x), acos(x), acosh(x), asin(x),
asinh(x), atan(x), atan2(y, x), atanh(x)
• abs(x), ceiling(x), floor(x)
• exp(x), log(x, base=exp(1)), log10(x), sqrt(x), trunc(x) (the next integer
closer to zero)
• max(), min(), mean(), median()
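A few of these operators and functions in action. This is a minimal sketch with arbitrary example values; the variable names are only illustrative.

```r
remainder <- 7 %% 3            # modulo: remainder after dividing 7 by 3
power <- 2^10                  # exponentiation

m <- matrix(1:4, nrow = 2)     # 2 x 2 matrix, filled column by column
m_product <- m %*% m           # matrix multiplication
m_elementwise <- m * m         # plain * multiplies element by element

# trunc() moves toward zero, floor() moves down, ceiling() moves up.
rounding <- c(trunc(-2.7), floor(-2.7), ceiling(-2.7))

log_base2 <- log(8, base = 2)  # log() uses base e unless a base is given
angle <- atan2(1, 1)           # two-argument arctangent, in radians

print(remainder)
print(power)
print(m_product)
print(rounding)
print(log_base2)
print(angle)
```

The difference between %*% and * is a common source of mistakes, so it is worth comparing m_product and m_elementwise directly.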
Defining new variables
Use functions on a vector
• Most functions work on vectors exactly as we would want
them to
> sum(whales)
> length(whales)
> mean(whales)
– Other examples: sort(), min(), max(), range(), diff(), cumsum()
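The vector whales is not defined anywhere in these notes, so the sketch below assumes a small numeric vector with made-up values purely for illustration. It shows how the listed functions each take the whole vector at once.

```r
# Hypothetical data: the values are invented for this example.
whales <- c(74, 122, 235, 111, 292)

total <- sum(whales)         # adds up all elements
n_obs <- length(whales)      # number of elements
average <- mean(whales)      # same as total / n_obs

sorted <- sort(whales)       # ascending order
extremes <- range(whales)    # c(min, max)
changes <- diff(whales)      # differences between consecutive elements
running <- cumsum(whales)    # running total

print(total)
print(average)
print(sorted)
print(changes)
print(running)
```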
Functions that create vectors
• Simple sequences
> 1:10
> rev(1:10)
> 10:1
> c(1:10, 10:1)
> library(MASS) # to have fractions()
> fractions(1/(2:10))
• Arithmetic sequence
– a+(n-1)*h: how to generate 1, 3, 5, 7, 9?
> a=1; h=2; n=5
> a+h*(0:(n-1))
OR
> seq(1,9,by=2)
> seq(1,9,length=5)
• Repeated numbers
>rep(1,10)
>rep(1:2, c(10,15))
– getting help: ?rep or help(rep)
– help.search("keyword") or ??keyword
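The three ways of generating 1, 3, 5, 7, 9 can be checked against each other, along with the two forms of rep(). This is a small sketch using only base R; the variable names are illustrative, and length.out is the full name of the argument that the slide abbreviates as length.

```r
a <- 1; h <- 2; n <- 5

by_formula <- a + h * (0:(n - 1))        # a + (k - 1) * h for k = 1..n
by_step <- seq(1, 9, by = 2)             # fixed step size
by_length <- seq(1, 9, length.out = 5)   # fixed number of points

print(by_formula)
print(by_step)
print(by_length)

ones <- rep(1, 10)                 # the value 1 repeated 10 times
grouped <- rep(1:2, c(3, 2))       # 1 three times, then 2 twice
print(grouped)

down_and_up <- c(10:1, rev(10:1))  # counting down, then back up
```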
Next week
• More data structures and R packages
• Homework 1
• Please find your lab group mates and sit together. I expect 13
groups of 3, for 39 students in total.