Data Science Productivity Tools
The textbook for the Data Science course series is freely available online.
Learning Objectives
• How to leverage the many useful features provided by RStudio
• How to use Unix/Linux to manage your file system
• How to start a repository on GitHub
• How to perform version control with git
Course Overview
Section 1: Installing Software
You will learn how to install R, RStudio, and git (plus Git Bash for Windows users), create a GitHub account,
create a GitHub repository, and connect RStudio to your GitHub account.
Section 2: Unix
You will learn the basics of managing your filesystem from the terminal with Unix commands such as mv
and rm.
Section 3: Reproducible Reports
You will learn to create data science reports using R Markdown and the knitr package.
Section 4: Git and GitHub
You will learn to use git and GitHub from the command line to clone and create repositories.
Section 5: Advanced Unix
You will learn other Unix commands, including arguments, getting help, pipes, and wildcards that are helpful
in data science.
• Unix shell
• Git and GitHub
• R markdown
Section 1 Overview
The Installing Software section walks you through the steps to download and install R, RStudio, git (and
git bash on Windows machines), create a GitHub account, and connect RStudio to GitHub.
There is a graded comprehension check at the end of the section.
If you get stuck, we encourage you to search the discussion boards for the answer to your issue or ask us for
help!
Installing Software
Key points
• RStudio has many useful features as an R editor, including the ability to test code easily as we write
scripts and several autocomplete features.
• Keyboard shortcuts:
• RStudio provides a way to keep all the components of a data analysis project organized into one folder
and to keep track of information about this project.
• To start a project, click on File > New Project > New Directory > New Project, then decide the location
of files and give a name to the project, e.g. “my-first-project”. This will then generate an .Rproj file
called my-first-project.Rproj in the folder associated with the project, which you can double-click
to start where you last left off.
• The project name will appear in the upper left corner or the upper right corner, depending on your
operating system. When you start an RStudio session with no project, it will display “Project: (None)”.
Key points
• Git is a version control system, tracking changes and coordinating the editing of code.
• GitHub is a hosting system for code, which can help with your career profile.
• Git is most effectively used with Unix, but it can also interface with RStudio.
Installing Git
• If you have a Windows machine, you will need to install Git and Git Bash.
• If you have a Mac, you will only need to install Git (which may already be installed on your system).
The textbook for this section is available here (for Windows) or here (for Mac).
Install on Windows
Install on Mac
1. Open the terminal, either from the utility folder or using Cmd+space, and check if you already have
Git installed by typing git --version in the command line.
2. If you already have Git installed, you will be shown the version number after executing the above. If
you do not have Git installed already, you will be prompted to do so.
GitHub
• Sign up for a GitHub account, with a name that is professional, short, and easy to remember
• Connect to RStudio: Global Options > Git/SVN, enter the path to the git executable
• To avoid typing our GitHub password every time, we create an SSH/RSA key automatically through
RStudio with the Create RSA Key button.
GitHub Repositories
2. Select the code that will NOT install the popular graphing and data manipulation packages ggplot2
and dplyr in R.
A. install.packages(c("ggplot2","dplyr"))
B. install.packages("tidyverse")
C. install.packages(c("dplyr","ggplot2")
D. install.packages("ggplot2") install.packages("dplyr")
3. Which of the following is not true about installing packages? Select ALL that apply.
A. To install a new package, the install.packages() function can be used
B. To install a new package, the drop-down menu Tools > Install packages can be used
C. Installed packages will remain installed even if you upgrade R
D. Installing a package by building from GitHub will give you the exact same version as on CRAN
5. Which of the following statements about keeping organized with RStudio projects is not correct?
A. To start a new project, click on File > New Project > New directory > New project > {choose a
file directory and project name}
B. You must always start a project in a new directory.
C. RStudio provides a way to keep all components of a data analysis project organized into one folder
and to keep track of information about this project.
D. Creating a new R project will produce an .Rproj file associated with the project.
6. What can you change in the global options? Select ALL that apply.
7. What does the term “pull” mean in the context of using Git in RStudio?
8. What does the term “push” mean in the context of using Git in RStudio?
9. What does the term “commit” mean in the context of using Git in RStudio?
10. Did you create a GitHub account? Enter your GitHub username below.
1965Eric
Section 2 Overview
The Unix section discusses the basics of managing your filesystem from the terminal with Unix commands
such as mv and rm.
There is a two-part graded comprehension check at the end of the section.
Below, you will find a summary of Unix commands that will be covered in this section and the Advanced
Unix section. The examples here refer to this hypothetical file system
Useful Unix Commands
Absolute path vs. relative path
A full path specifies the location of a file from the root directory. It is independent of your present directory,
and must begin with either a “/” or a “~”. In this example, the full path to our “project-1” file is:
/home/projects/project-1
A relative path is the path relative to your present working directory. If our present working directory is the
“projects” folder, then the relative path to our “project-1” file is simply:
project-1
Path shortcuts
1. Your current working directory is ~/projects and you want to move to the figs directory in the project-1
folder
2. Your current working directory is ~/projects and you want to move to the reports folder in the docs
directory
3. Your current working directory is ~/projects/project-1/figs and you want to move to the project-2
folder in the projects directory.
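The three moves above can be tried safely in a throwaway directory tree. This is a minimal sketch, not part of the course material: it assumes the hypothetical file system keeps docs and projects side by side under the home directory, and the temporary directory stored in sandbox (a made-up name) stands in for ~.

```shell
# Build a throwaway copy of the hypothetical file system (the layout is an assumption).
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/projects/project-1/figs" \
         "$sandbox/projects/project-2" \
         "$sandbox/docs/reports"

# 1. From projects, a relative path reaches figs inside project-1.
cd "$sandbox/projects"
cd project-1/figs
move_1="$(pwd)"

# 2. From projects, ".." steps up one level before descending into docs/reports.
cd "$sandbox/projects"
cd ../docs/reports
move_2="$(pwd)"

# 3. From figs, "../.." climbs two levels back to projects, then enters project-2.
cd "$sandbox/projects/project-1/figs"
cd ../../project-2
move_3="$(pwd)"

echo "$move_1"
echo "$move_2"
echo "$move_3"
```

Printing pwd after each move is a quick way to confirm where a relative path actually led.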
The Terminal
Code:
echo "hello world"
The Filesystem
• We refer to all the files, folders, and programs (executables) on your computer as the filesystem.
• Your filesystem is organized as a series of nested folders each containing files, folders, and executables.
(see the visualization above)
• In Unix, folders are referred to as directories and directories that are inside other directories are often
referred to as subdirectories.
• The home directory is where all your stuff is kept. There is a hierarchical nature to the file system.
• Note for Windows Users: The typical R installation will make your Documents directory your
home directory in R. This will likely be different from your home directory in Git Bash. Generally,
when we discuss home directories, we refer to the Unix home directory which for Windows, in this
book, is the Git Bash Unix directory.
Working Directory
Unix Commands
• Auto-complete paths, commands and file names with the “Tab” key.
Code
Code
mv path-to-file path-to-destination-directory
rm filename-1 filename-2 filename-3
Code
less cv.tex
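The mv and rm patterns above can be practiced without touching real files. The sketch below is illustrative only; apart from cv.tex, the file and directory names are made up, and everything happens inside a temporary directory.

```shell
# Work in a throwaway directory so nothing real is moved or deleted.
demo="$(mktemp -d)"
cd "$demo"
mkdir reports
echo "draft" > cv.tex
echo "notes" > notes-1.txt
echo "notes" > notes-2.txt

# With a directory as the destination, mv moves the file into it.
mv cv.tex reports
# With a new file name as the destination, mv renames the file.
mv reports/cv.tex reports/resume.tex

# rm accepts several file names at once; removed files do not go to a trash folder.
rm notes-1.txt notes-2.txt

ls reports
```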
• Ideally, files (code, data, output) should be structured and self-contained
• In a project, we prefer using relative paths (paths relative to the default working directory) instead of
the full path so that code can run smoothly on other people’s computers.
• It is good practice to write a README.txt file to introduce the file structure to facilitate collaboration
and for your future reference.
Code
1. It is important to know which directory, or folder, you’re in when you are working from the command
line in Unix. Which line of code will tell you the current working directory?
A. cd
B. pwd
C. rm
D. echo
2. You can’t use your computer’s mouse in a terminal. How can you see a line of code that you executed
previously?
A. Type pwd
B. Type echo
C. Use the up arrow
D. Press the enter key
3. Assume a student types pwd and gets the following output printed to the screen: /Users/student/Documents.
mkdir projects
cd projects
What will be printed to the screen if the student types pwd after executing the two lines of code shown
above?
A. /Users/student/Documents
B. /Users/student/Documents/projects
C. /Users/student
D. cd: projects: No such file or directory
4. Which of the following statements does NOT correctly describe the utility of a command in Unix?
A. The q key exits the viewer when you use less to view a file.
B. The command ls lists files in the current directory.
C. The command mkdir makes a new directory and moves into it.
D. The mv command can move a file and change the name of a file.
5. The following is the full path to your homework assignment file called “assignment.txt”:
/Users/student/Documents/projects/homeworks/assignment.txt
Which line of code will allow you to move the assignment.txt file from the homeworks directory into the
parent directory projects?
A. mv assignment.txt
B. mv assignment.txt .
C. mv assignment.txt ..
D. mv assignment.txt /projects
6. You want to move a file called assignment.txt into your projects directory. However, there is
already a file called “assignment.txt” in the projects directory.
What happens when you execute the “move” (mv) command to move the file into the new directory?
A. The moved “assignment.txt” file replaces the old “assignment.txt” file that was in the “projects”
directory with no warning.
B. An error message warns you that you are about to overwrite an existing file and asks if you want
to proceed.
C. An error message tells you that a file already exists with that name and asks you to rename the
new file.
D. The moved “assignment.txt” file is automatically renamed “assignment.txt (copy)” after it is moved
into the “projects” directory.
8. Suppose you want to delete your project directory at ./myproject. The directory is not empty - there
are still files inside of it.
Which command should you use?
A. rmdir myproject
B. rmdir ./myproject
C. rm -r myproject
D. rm ./myproject
9. The source() function reads a script from a url or file and evaluates it. Check ?source in the R
console for more information.
Suppose you have an R script at ~/myproject/R/plotfig.R and getwd() shows ~/myproject/result, and
you are running your R script with source('~/myproject/R/plotfig.R').
Which R function should you write in plotfig.R in order to correctly produce a plot in ~/myproject/result/fig/barplot.pn
A. ggsave('fig/barplot.png'), because this is the relative path to the current working directory.
B. ggsave('../result/fig/barplot.png'), because this is the relative path to the source file (“plotfig.R”).
C. ggsave('result/fig/barplot.png'), because this is the relative path to the project directory.
D. ggsave('barplot.png'), because this is the file name.
10. Which of the following statements about the terminal are not correct? Select ALL that apply.
11. Which of the following statements about the filesystem is not correct?
A. The home directory is where the system files that come with your computer exist.
B. The name of the home directory is likely the same as the username on the system.
C. File systems on Windows and Mac are different in some ways.
D. Root directory is the directory that contains all directories.
12. Which of the following meanings for options following less are not correct? (Hint: use man less to
check.)
13. Which of the following statements is incorrect about preparation for a data science project? Select
ALL that apply.
Section 3 Overview
The Reproducible Reports section guides you through how to create data science reports using R Markdown
and the knitr package.
There is a graded comprehension check at the end of the section.
We will use this example GitHub repository throughout.
• The final output is usually a report, textual descriptions and figures, and tables.
• The aim is to generate a reproducible report in R markdown and knitr.
• Features of R Markdown: code and text can be combined in the same document, and figures and tables
are automatically added to the file.
R Markdown
• R Markdown is a format for literate programming documents. Literate programming weaves instructions,
documentation, and detailed comments in between machine-executable code, producing a document
that describes the program in the way that is best for human understanding.
• Start an R Markdown document by clicking on File > New File > R Markdown.
• The output could be HTML, PDF, or Microsoft Word, which can be changed in the output field of the header,
e.g. pdf_document / html_document
Code
knitr
Code
output: html_document
output: pdf_document
output: word_document
output: github_document
2. You have a vector of student heights called heights. You want to generate a histogram of these heights
in a final report, but you don’t want the code to show up in the final report. You want to name the R
chunk “histogram” so that you can easily find the chunk later.
A.
B.
C.
```{r, echo=FALSE}
hist(heights)
```
D.
```{r histogram, echo=FALSE}
hist(heights)
```
---
title: "Final Grade Distribution"
output: pdf_document
---
```{r, echo=FALSE}
load(file="my_data.Rmd")
summary(grades)
```
Select the statement that describes the file report generated by the R markdown code above.
A. A PDF document called “Final Grade Distribution” that prints a summary of the “grades” object.
The code to load the file and produce the summary will not be included in the final report.
B. A PDF document called “Final Grade Distribution” that prints a summary of the “grades” object.
The code to load the file and produce the summary will be included in the final report.
C. An HTML document called “Final Grade Distribution” that prints a summary of the “grades”
object. The code to load the file and produce the summary will not be included in the final report.
D. A PDF document called “Final Grade Distribution” that is empty because the argument echo=FALSE
was used.
4. The user specifies the output file format of the final report when using R Markdown.
Which of the following file types is NOT an option for the final output?
A. .rmd
B. .pdf
C. .doc
D. .html
```{r, echo=F}
n <- nrow(mtcars)
```
6. What is the final value from these three sequential Rmd code chunks?
```{r, eval=FALSE}
a <- 2
```
```{r, include=FALSE}
print("Hello World!")
a <- 5
```
```{r, echo=FALSE}
a <- a+1
print(a)
```
A. 2
B. 3
C. 6
D. 5
Section 4 Overview
In this section on Git and GitHub, you will learn to clone and create version-controlled GitHub repositories
using the command line.
There is a graded comprehension check at the end of the section.
• Codecademy.
• GitHub Guides.
• Try Git tutorial.
• Happy Git and GitHub for the useR.
Key points
• Next, we will learn how to use Git and GitHub in the command line.
• Reasons to use Git and GitHub:
1. Version-control: Permits us to keep track of changes we made to code, to revert back to previous
versions of files, to test ideas using new branches and decide if we want to merge to the original.
2. Collaboration: On a centralized repo, multiple people may make changes to the code and keep
versions synced. A pull request allows anyone to suggest changes to your code.
3. Sharing code
• To effectively permit version control and collaboration, files move across four different areas: Working
Directory, Staging Area, Local Repository, and Upstream Repository.
• Start your Git journey with either cloning an existing repo, or initializing a new one
• Recap: there are four stages: working directory, staging area, local repository, and upstream repository
• Cloning an existing upstream repository (copy the repo URL from the clone button and type git clone
<url>) leaves all three local stages identical to the upstream remote.
Figure 2: Git stages
• The working directory is the same as the working directory in RStudio. When we edit files we only
change the files in this place.
• git status: tells how the files in the working directory are related to the files in other stages
• Edits in the working directory are not tracked by the version control system by default; we add a file to the
staging area with the git add command
• git commit: commits files from the staging area to the local repository; we need to add a message stating
what we are doing, with git commit -m "something"
• git log: shows the history of changes committed to the local repository
• git push: moves commits from the local repository to the upstream repository, only if you have
permission (e.g. if it is yours)
• git fetch: updates the local repository to match the upstream repository (from upstream to local)
• git merge: syncs the working directory and staging area with the updated local repository
• To change everything in one shot (from upstream to working dir), use git pull (equivalent to
combining git fetch + git merge)
Code
pwd
mkdir git-example
cd git-example
git clone https://github.com/rairizarry/murders.git
cd murders
ls
git status
echo "test" >> new-file.txt
echo "temporary" >> tmp.txt
git add new-file.txt
git status
git commit -m "adding a new file"
git status
echo "adding a second line" >> new-file.txt
git commit -m "minor change to new-file" new-file.txt
git status
git add
git log new-file.txt
git push
git fetch
git merge
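The difference between git fetch and git merge is easier to see offline. The following is a rough sketch under stated assumptions: a bare repository on the local disk plays the role of the upstream GitHub repository, the directories first and second are made-up clones, and the commit helper with its example identity is hypothetical and exists only so the sketch runs without any global git configuration.

```shell
# A bare repository on disk stands in for the upstream GitHub repo (assumption).
work="$(mktemp -d)"
git init --quiet --bare "$work/upstream.git"

# Hypothetical helper: commit with an inline identity so no global config is needed.
commit() { git -c user.name="Example User" -c user.email="user@example.com" commit --quiet "$@"; }

# First clone: add a file, commit it, and push it upstream.
git clone --quiet "$work/upstream.git" "$work/first" 2>/dev/null
cd "$work/first"
echo "test" >> new-file.txt
git add new-file.txt
commit -m "adding a new file"
git push --quiet origin HEAD

# Second clone: starts out identical to the upstream.
git clone --quiet "$work/upstream.git" "$work/second"

# The first clone pushes one more change.
echo "adding a second line" >> new-file.txt
commit -m "minor change to new-file" new-file.txt
git push --quiet origin HEAD

# In the second clone, fetch updates the local repository only ...
cd "$work/second"
git fetch --quiet
lines_after_fetch="$(wc -l < new-file.txt | tr -d ' ')"
# ... and merge then brings the working directory up to date.
git merge --quiet
lines_after_merge="$(wc -l < new-file.txt | tr -d ' ')"

echo "lines after fetch: $lines_after_fetch, after merge: $lines_after_merge"
```

Replacing the fetch and merge pair with a single git pull in the second clone should leave the working directory in the same final state.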
• Recap: two ways to get started, one is cloning an existing repository, the other is initializing our own
• Create our own project on our own machine (independent of Git)
• Create an upstream repo on Github, copy repo’s url
• Make a local git repository: On the local machine, in the project directory, use git init. Now git
starts tracking everything in the local repo.
• Now we need to start moving files into our local repo and connect local repo to the upstream remote
by git remote add origin <url>
• Note: The first time you push to a new repository, you may also need to use these git push options:
git push --set-upstream origin master. If you need to run these arguments but forget to do so,
you will get an error with a reminder.
Code
cd ~/projects/murders
git init
git add README.txt
git commit -m "First commit. Adding README.txt file just to get started"
git remote add origin "https://github.com/rairizarry/murders.git"
git push # you may need to add these arguments the first time: --set-upstream origin master
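The same sequence can be rehearsed without a GitHub account. This is a minimal sketch, assuming a bare repository on the local disk as a stand-in for the upstream remote; the README text and the example identity passed with -c are made up, and the branch name is read from git rather than assumed to be master.

```shell
# A bare repository on disk plays the role of the upstream GitHub repo (assumption).
work="$(mktemp -d)"
git init --quiet --bare "$work/upstream.git"

# Create a project directory and turn it into a local repository.
mkdir "$work/murders"
cd "$work/murders"
git init --quiet
echo "Analysis of US murders data" > README.txt

# Stage and commit; the identity is set inline so the sketch runs on a fresh machine.
git add README.txt
git -c user.name="Example User" -c user.email="user@example.com" \
    commit --quiet -m "First commit. Adding README.txt file just to get started"

# Connect the local repository to the upstream and push the current branch.
git remote add origin "$work/upstream.git"
branch="$(git symbolic-ref --short HEAD)"
git push --quiet --set-upstream origin "$branch"

# The upstream now holds the same single commit as the local repository.
git --git-dir="$work/upstream.git" log --oneline
```

After the first push with --set-upstream, a plain git push from this directory is enough for later commits.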
1. Which statement describes reasons why we recommend using git and Github when working on data
analysis projects?
A. Git and Github facilitate fast, high-throughput analysis of large data sets.
B. Git and Github allow easy version control, collaboration, and resource sharing.
C. Git and Github have graphical interfaces that make it easy to learn to code in R.
D. Git and GitHub are good for long-term storage of private data.
• Clone the contents of a git repo at the following URL into that directory: https://github.com/user123/repo123.git,
and
• List the contents of the cloned repo.
A.
mkdir project-clone
git add https://github.com/user123/repo123.git
ls
B.
mkdir project-clone
git clone https://github.com/user123/repo123.git
ls
C.
mkdir project-clone
cd project-clone
git clone https://github.com/user123/repo123.git
ls
D.
mkdir project-clone
cd project-clone
git clone https://github.com/user123/repo123.git
less
3. You have successfully cloned a Github repository onto your local system.
The cloned repository contains a file called “heights.txt” that lists the heights of students in a class. One
student was missing from the dataset, so you add that student’s height using the following command:
echo "165" >> heights.txt
Next you enter the command git status to check the status of the Github repository.
What message is returned and what does it mean?
A. modified: heights.txt, no changes added to commit. This message means that the
heights.txt file was modified, but the changes have not been staged or committed to the local
repository.
B. modified: heights.txt, no changes added to commit. This message means that the
heights.txt file was modified and staged, but not yet committed.
C. 1 file changed. This message means that the heights.txt file was modified, staged, committed,
and pushed to the upstream repository.
D. modified: heights.txt. This message means that the heights.txt file was modified, staged, and
committed.
4. You cloned your own repository and modified a file within it on your local system.
Next, you executed the following series of commands to include the modified file in the upstream repository,
but it didn’t work. Here is the code you typed:
What is preventing the modified file from being added to the upstream repository?
A. The wrong option is being used to add a descriptive message to the commit.
B. git push should be used instead of git pull.
C. git commit should come before git add.
D. The git pull command line needs to include the file name.
5. You have a directory of scripts and data files on your computer that you want to share with collaborators
using GitHub. You create a new repository on your GitHub account called “repo123” that has
the following URL: https://github.com/user123/repo123.git
Which of the following sequences of commands will convert the directory on your computer to a GitHub
directory and create and add a descriptive “read me” file to the new repository?
A.
git init
git add README.txt
git commit -m "First commit. Adding README file."
git remote add origin 'https://github.com/user123/repo123.git'
git push
B.
C.
D.
6. You have made a local change to a file in your R project, which is associated with a GitHub repository.
You add your changes and push, but you receive a message:
Everything up-to-date
Which of the following commands did you forget to do?
A. git pull
B. git merge
C. git add
D. git fetch
E. git commit
F. git push
G. git rebase
7. Suppose you previously cloned a repository with git clone. Running git status shows:
On branch master
Your branch is up to date with 'origin/master'.
However, you know that there are some changes in the upstream repository.
How will you sync these changes with one command?
A. git fetch
B. git pull
C. git merge origin/master
D. git merge upstream/master
E. git push
Section 5 Overview
In Section 5, you will learn additional useful Unix commands, including arguments, getting help, pipes, and
wildcards that are all helpful in data science.
There is a two-part graded comprehension check at the end of the section.
• Arguments typically are defined using a dash (-) or two dashes (--) followed by a letter or a word.
• -r: recursive. For example, rm -r <directory-name>: remove all files, subdirectories, files in
subdirectories, subdirectories in subdirectories, etc.
• Combine arguments: rm -rf directory-name (-f forces removal without asking for confirmation)
• ls -a: Shows all files in the directory, including hidden files (e.g. the .git directory created by git
init) (a for all).
• ls -l: Returns more information about the files (l for long).
• ls -t: Shows files in chronological order, most recently modified first.
• ls -r: Reverses the order in which files are shown.
• ls -lart: Shows more information for all files in reverse chronological order, oldest first.
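The effect of each ls argument is easiest to see on files with known modification times. The sketch below is illustrative only: the file names are made up, and touch -t is used to assign explicit timestamps so that the ordering is predictable.

```shell
demo="$(mktemp -d)"
cd "$demo"

# touch -t sets an explicit modification time (format: YYYYMMDDhhmm).
touch -t 202001010000 oldest.txt
touch -t 202101010000 middle.txt
touch -t 202201010000 newest.txt
# Names starting with a dot are hidden from a plain ls.
touch .hidden

ls          # visible files only
ls -a       # also lists .hidden, plus . and ..
ls -t       # sorted by modification time, newest first
ls -rt      # the same order reversed, oldest first
ls -lart    # long format, all files, oldest first
```

Because -lart puts the most recently modified file at the bottom, the newest entry stays visible at the end of a long listing.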
Advanced Unix: Getting Help and Pipes
• Getting Help: Use man + command name to get help (e.g. man ls). Note that it is not available for
Git Bash. For Git Bash, you can use command --help (e.g. ls --help).
• Pipes: A pipe sends the output of one command to the command after the pipe. Similar to the pipe %>% in R.
For example, man ls | less (and its equivalent in Git Bash: ls --help | less). Also useful when
listing directories with many files (e.g. ls -lart | less).
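A pipe can be tried on a small directory before using it on long output. This is a minimal sketch with made-up file names; head, grep, and wc stand in for less here only because they do not need an interactive terminal.

```shell
demo="$(mktemp -d)"
cd "$demo"
touch a.txt b.txt c.txt d.txt e.txt

# ls prints one name per line when its output goes into a pipe;
# head -n 2 keeps only the first two of those lines.
ls | head -n 2

# Pipes can be chained: keep names containing "txt", then count the lines.
ls | grep "txt" | wc -l
```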
• * means any number of any combination of characters. Specifically, to list all html files: ls *.html
and to remove all html files in a directory: rm *.html.
• ? means any single character. For example, to erase all files in the form file-001.html with the
numbers going from 1 to 999: rm file-???.html.
• Combined wild cards: rm file-001.* to remove all files of the name file-001 regardless of suffix.
• Warning: Combining rm with the * wild card can be dangerous. There are combinations
of these commands that will erase your entire file system without asking you for confirmation.
Make sure you understand how it works before using this wild card with the rm
command.
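One way to respect the warning above is to preview what a pattern matches before handing it to rm. The sketch below is an illustration in a temporary directory with made-up file names; nothing outside that directory is touched.

```shell
demo="$(mktemp -d)"
cd "$demo"
touch file-001.html file-002.html file-001.txt notes.html index.htm

# * matches any run of characters: every name ending in .html
ls *.html
# ? matches exactly one character: "file-", any three characters, ".html"
ls file-???.html
# Combined: everything named file-001, regardless of suffix
ls file-001.*

# Previewing with echo shows exactly what the pattern expands to before deleting.
echo rm file-???.html
rm file-???.html
ls
```

The shell expands the pattern before rm runs, so the echo line prints the very command that the next line executes.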
• In Unix, variables are distinguished from other entities by adding a $ in front. For example, the home
directory is stored in $HOME.
• See home directory: echo $HOME
• See them all: env
• See what shell is being used: echo $SHELL (most common shell is bash)
• Change environment variables: (Don’t actually run this command though!) export PATH=/usr/bin/
(note that there are no spaces around the = sign)
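Variables can be explored safely as long as changes to PATH stay inside a subshell. The following is a rough sketch: COURSE_NAME is a hypothetical variable and /nonexistent is a deliberately empty placeholder path, both chosen only for illustration.

```shell
# Read existing variables by prefixing the name with $.
echo "$HOME"
echo "$SHELL"

# Assign without spaces around =; COURSE_NAME is a hypothetical variable.
COURSE_NAME="productivity-tools"
echo "$COURSE_NAME"

# export makes the variable visible to programs started from this shell.
export COURSE_NAME
sh -c 'echo "child process sees: $COURSE_NAME"'

# Parentheses start a subshell, so this PATH change disappears afterwards.
# Inside it, ls can no longer be found because /nonexistent holds no programs.
( PATH="/nonexistent"; ls ) 2>/dev/null
echo "exit status inside subshell: $?"

# Back in the original shell, PATH is intact and ls works again.
ls / > /dev/null && echo "ls still works here"
```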
• In Unix, all programs are files. They are called executables. So, ls, mv, and git are all files.
• To find where these program files are, use which. For example, which git would return /usr/bin/git.
• Type ls /usr/bin to see several executable files. There are other directories that hold program files
(e.g. the Applications directory on a Mac or the Program Files directory in Windows).
• Type echo $PATH to see a list of directories separated by “:”.
• Type the full path to run user-created executables (e.g. ./my-ls).
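• Each line printed by ls -l starts with a string describing the file: the first character is - for a regular
file or d for a directory, and an x later in the string marks the file as executable.
• This string also indicates the permissions of the file: is it readable? writable? executable? Can other
users on the system read the file? Can other users on the system edit the file? Can other users execute
the file if it is executable?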
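The permission string can be watched changing on a small script. This is a minimal sketch, assuming a temporary directory that allows execution; my-ls is the example name used above, and its contents here are made up.

```shell
demo="$(mktemp -d)"
cd "$demo"

# Write a tiny script; the contents of my-ls are invented for this illustration.
printf '#!/bin/sh\nls -l "$@"\n' > my-ls

# A freshly created file is not executable: no x in the permission string.
ls -l my-ls
# chmod +x adds execute permission.
chmod +x my-ls
ls -l my-ls

# The current directory is normally not in $PATH, so the path ./my-ls is required.
./my-ls
# which reports where an executable found through $PATH lives.
which ls
```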
A. A list of all files (names, sizes, and other information) arranged in chronological order with the most
recently modified files at the top of the list.
B. A list of visible files (names, sizes, and other information) arranged in chronological order with the
oldest files at the top of the list.
C. A list of all files (names only) arranged in chronological order with the oldest files at the top of the
list.
D. A list of visible files (names only) arranged in chronological order with the most recent files at the
top of the list.
2. What happens when you remove a directory using the command rm -r?
3. By default, the head command in Unix displays the first 10 lines of a specified file. You can change
the number of lines using an argument that indicates the numeric value of the desired number of lines.
Which of the following commands displays only the first 6 lines of a manual for the ls command?
A. man ls -6 | head
B. head | man ls -6
C. head -6 | man ls
D. man ls | head -6
A. ls data*
B. ls data*.txt
C. ls *.txt
D. ls data?.txt
A. rm D*
B. rm D*.txt
C. ls D*
D. ls D*.txt
6. Imagine you have multiple text files in the following directory: /Users/student/Documents/project.
mkdir data
mv *.txt data
cd data
What will be printed to the screen if you enter the ls command after executing the three lines of code shown
above?
- [ ] A. /Users/student/Documents/project/data
- [X] B. The file names that were moved from the “project” directory into the “data” directory.
- [ ] C. Nothing. You haven’t added anything to the new “data” directory yet.
- [ ] D. The file names that remain in the “project” directory.
8. Many systems operate using the Unix shell and command language, bash. Each time you start using
bash, it executes the commands contained in a “dot” file. Your “dot” file may be called something like
“.bash_profile” or “.bash_rc”.
A. ls -a
B. ls bash*
C. head *bash*
D. ls -l
9. Your colleague was editing his “dot” files when something went wrong. He first noticed there was an
issue when he tried to execute the following line of code:
ls
He received the following error:
-bash: ls: command not found
What could have happened to cause this error?
A. He is trying to execute ls which is a bash command, but his system isn’t running bash as a shell.
B. The command ls doesn’t exist. He should be using the command ll.
C. He forgot to specify a file name to be listed. The command ls * should work.
D. He changed the information contained in $PATH. Now the system cannot find the executable file
for ls.
10. The bash profile in your home directory contains information that the bash shell runs each time you
use it. You can customize the information in your bash profile to tell your system to do different things.
For example, you can make an “alias”, which acts like a keyboard shortcut.
Which line of code, when added to your bash profile, will let you print “seetop” to view the name, size, and
file type of the 10 most recently added visible files?
11. The commands in the pipeline
$ cat result.txt | grep "Harvard edX" | tee file2.txt | wc -l
perform which of the following actions?
A. From result.txt, select lines containing “Harvard edX”, store them into file2.txt, and print all unique
lines from result.txt.
B. From result.txt, select lines containing “Harvard edX”, and store them into file2.txt.
C. From result.txt, select lines containing “Harvard edX”, store them into file2.txt, and print the total
number of lines which were written to file2.txt.
D. From result.txt, select lines containing “Harvard edX”, store them into file2.txt, and print the
number of times “Harvard edX” appears.
D. To reset the current HEAD to the specified state
E. To download objects and refs from another repository
13. Which of the following statements is wrong about Advanced Unix Executables, Permissions, and File
Types?
A. In Unix, all programs are files/executables except for commands like ls, mv, and git.
B. which git allows a user to find the path to git.
C. When users create executable files themselves, they cannot be run just by typing the command -
the full path must be typed instead.
D. ls -l can be used to inspect the permissions of each file.
14. Which of the following commands correctly copies all files which are named as file-???.r (e.g
file-abc.r, file-qwe.r, file-123.r) into the directory named your_directory?
A. cp file-???.r ./your_directory
B. cp file-*.r ./your directory
C. cp file-[a-z].r ./your_directory
D. cp file-???.* ./your_directory