RStudio For R Statistical Computing Cookbook - Sample Chapter
This book will help you to set up your own data analysis project in RStudio, acquire data from different
data sources, and manipulate and clean data for analysis and visualization purposes. You'll get hands-on
with various data visualization methods using ggplot2 and create interactive and multi-dimensional
visualizations with D3.js. You'll also learn to create reports from your analytical application with the full
range of static and dynamic reporting tools available in RStudio to effectively communicate results and
even transform them into interactive web applications.
Andrea Cirillo
RStudio is a useful and powerful tool for statistical analysis that harnesses the power of R for
computational statistics, visualization, and data science, in an integrated development environment.
Preface
Why should you read RStudio for R Statistical Computing Cookbook?
Well, even though there are plenty of books and blog posts about R and RStudio out there, this
cookbook can be an unbeatable companion on your journey from being an average R and
RStudio user to becoming an advanced and effective R programmer.
I have collected more than 50 recipes here, covering the full spectrum of data analysis
activities, from data acquisition and treatment to results reporting.
All of them come from my direct experience as an auditor and data analyst and from
knowledge sharing with the really dynamic and always growing R community.
I took great care selecting and highlighting those packages and practices that have proven
to be the best for a given particular task, sometimes choosing between different packages
designed for the same purpose.
You can therefore be sure that what you will learn here is the cutting edge of the R language
and will place you on the right track toward R mastery.
Chapter 5, Power Programming with R, discusses how to write efficient R code, making use of
R's object-oriented systems and advanced tools for code performance evaluation.
Chapter 6, Domain-specific Applications, shows you how to apply the R language to a wide
range of problems related to different domains, from financial portfolio optimization to
e-commerce fraud detection.
Chapter 7, Developing Static Reports, helps you discover the reporting tools available within
the RStudio IDE and how to make the most of them to produce static reports for sharing
results of your work.
Chapter 8, Dynamic Reporting and Web Application Development, presents recipes designed
to make use of the latest features introduced in RStudio, from Shiny web applications with
dynamic UIs to RStudio add-ins.
Introduction
The American statistician W. Edwards Deming once said:
"Without data you are just another man with an opinion."
I think this great quote is enough to highlight the importance of the data acquisition phase
of every data analysis project. This phase is exactly where we are going to start from. This
chapter will give you tools for scraping the Web, accessing data via web APIs, and quickly
importing nearly every kind of file you will probably have to work with, thanks to the magic
of the rio package.
All the recipes in this book are based on the great and popular packages developed and
maintained by the members of the R community.
The data format: This is the format in which data is made available
The data license: This is to check whether there is any license covering data
utilization/distribution or whether there is any need for ethics/privacy considerations
After covering these points for each set of data, you will have a clear vision of future data
acquisition activities. This will let you plan ahead the activities needed to clearly define
resources, steps, and expected results.
Getting ready
Data statically exposed on web pages is actually part of the web page's code. Getting it from
the Web into our R environment requires us to read that code and find exactly where the data is.
Dealing with complex web pages can become a really challenging task, but luckily,
SelectorGadget was developed to help you with this job. SelectorGadget is a bookmarklet,
developed by Andrew Cantino and Kyle Maxwell, that lets you easily figure out the CSS selector
of your data on the web page you are looking at. Basically, the CSS selector can be seen as
the address of your data on the web page, and you will need it within the R code that you are
going to write to scrape your data from the Web (refer to the next paragraph).
The CSS selector is the token that is used within the CSS code to identify
elements of the HTML code based on their name.
CSS selectors are used within the CSS code to identify which elements are to
be styled using a given piece of CSS code. For instance, the following script
will align all elements (CSS selector *) with 0 margin and 0 padding:
* {
margin: 0;
padding: 0;
}
SelectorGadget can currently be used only with the Chrome browser, so you will need to install
the browser before carrying on with this recipe. You can download and install the latest version
of Chrome from https://www.google.com/chrome/.
SelectorGadget is available both as a Chrome extension and as a bookmarklet; to use the
bookmarklet, navigate to the following URL while already on the page showing the data you need:
:javascript:(function(){
var%20s=document.createElement('div');
s.innerHTML='Loading'
;s.style.color='black';
s.style.padding='20px';
s.style.position='fixed';
s.style.zIndex='9999';
s.style.fontSize='3.0em';
s.style.border='2px%20solid%20black';
s.style.right='40px';
s.style.top='40px';
s.setAttribute('class','selector_gadget_loading');
s.style.background='white';
document.body.appendChild(s);
This long URL shows that SelectorGadget is provided as JavaScript; you can make this out
from the :javascript: token at the very beginning.
We can further analyze the URL by decomposing it into three main parts, which are as follows:
Creation on the page of a new element of the div class with the
document.createElement('div') statement
The .js file is where SelectorGadget's core functionality is actually defined and the place
from which it is loaded to make it available to users.
That being said, I'm not suggesting that you try to use this link to employ SelectorGadget
for your web scraping purposes, but I would rather suggest that you look for the Chrome
extension or at the official SelectorGadget page, http://selectorgadget.com. Once
you find the link on the official page, save it as a bookmark so that it is easily available
when you need it.
The other tool we are going to use in this recipe is the rvest package, which offers great web
scraping functionalities within the R environment.
To make it available, you first have to install it and load it into the global environment by
running the following:
install.packages("rvest")
library(rvest)
How to do it...
1. Run SelectorGadget. To do so, after navigating to the web page you are interested
in, activate SelectorGadget by running the Chrome extension or clicking on the
bookmark that we previously saved.
In both cases, after activating the gadget, a Loading message will appear, and
then, you will find a bar on the bottom-right corner of your web browser, as shown in
the following screenshot:
You are now ready to select the data you are interested in.
When you are done with this fine-tuning process, SelectorGadget will have correctly
identified a proper selector, and you can move on to the next step.
3. Find your data location on the page. To do this, all you have to do is copy the CSS
selector that you will find in the bar at the bottom-right corner:
This piece of text will be all you need in order to scrape the web page from R.
4. The next step is to read data from the Web with the rvest package. The rvest
package by Hadley Wickham is one of the most comprehensive packages for
web scraping activities in R. Take a look at the There's more... section for further
information on package objectives and functionalities.
For now, it is enough to know that the rvest package lets you download HTML code
and read the data stored within the code easily.
Now, we need to import the HTML code from the web page. First of all, we need to
define an object storing all the HTML code of the web page you are looking at:
page_source <- read_html('https://en.wikipedia.org/wiki/R_(programming_language)')
This code leverages the read_html() function, which retrieves the source code that
resides at the given URL directly from the Web.
version_block <- html_nodes(page_source, ".wikitable th , .wikitable td")
As you can imagine, this code extracts all the content of the selected nodes, including
HTML tags.
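If the nodes you select form a whole HTML table, rvest can also parse them straight into a data frame with the html_table() function. This is not the route taken in this recipe, but a minimal sketch using the same page_source object would be:
# html_table() returns one data frame per selected table;
# fill = TRUE pads rows with missing cells, which is common in Wikipedia tables
wiki_tables <- html_table(html_nodes(page_source, ".wikitable"), fill = TRUE)
version_table <- wiki_tables[[1]]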
The HTML language
HyperText Markup Language (HTML) is a markup language that is used to
define the format of web pages.
The basic idea behind HTML is to structure the web page into a format with a
head and body, each of which contains a variable number of tags, which can
be considered as subcomponents of the structure.
The head is used to store information and components that will not be
seen by the user but will affect the web page's behavior, for instance,
a Google Analytics script used for tracking page visits. The body contains
all the content that will be shown to the reader.
Since the HTML code is composed of a nested structure, it is common to
compare this structure to a tree, and here, different components are also
referred to as nodes.
Printing out the version_block object, you will obtain a result similar to
the following:
print(version_block)
{xml_nodeset (45)}
[1] <th>Release</th>
[2] <th>Date</th>
[3] <th>Description</th>
[4] <th>0.16</th>
[5] <td/>
[6] <td>This is the last <a href="/wiki/Alpha_test" title="Alpha
test" class="mw-redirect">alp ...
[7] <th>0.49</th>
[8] <td style="white-space:nowrap;">1997-04-23</td>
[9] <td>This is the oldest available <a href="/wiki/Source_code"
title="Source code">source</a ...
[10] <th>0.60</th>
[11] <td>1997-12-05</td>
[12] <td>R becomes an official part of the <a href="/wiki/GNU_
Project" title="GNU Project">GNU ...
[13] <th>1.0</th>
[14] <td>2000-02-29</td>
[15] <td>Considered by its developers stable enough for production
use.<sup id="cite_ref-35" cl ...
[16] <th>1.4</th>
[17] <td>2001-12-19</td>
[18] <td>S4 methods are introduced and the first version for <a
href="/wiki/Mac_OS_X" title="Ma ...
[19] <th>2.0</th>
[20] <td>2004-10-04</td>
This result is not exactly what you are looking for if you are going to work with this
data. However, you don't have to worry about that since we are going to give your text
a better shape in the very next step.
6. In order to obtain a readable and actionable format, we need one more step:
extracting text from HTML tags.
This can be done using the html_text() function, which will result in a list
containing all the text present within the HTML tags:
content <- html_text(version_block)
The final result will be a perfectly workable chunk of text containing the data needed
for our analysis:
[1] "Release"
[2] "Date"
[3] "Description"
[4] "0.16"
[5] ""
[6] "This is the last alpha version developed primarily by
Ihaka and Gentleman. Much of the basic functionality from the
\"White Book\" (see S history) was implemented. The mailing lists
commenced on April 1, 1997."
[7] "0.49"
[8] "1997-04-23"
[9] "This is the oldest available source release, and compiles
on a limited number of Unix-like platforms. CRAN is started on
this date, with 3 mirrors that initially hosted 12 packages. Alpha
versions of R for Microsoft Windows and Mac OS are made available
shortly after this version."
[31] "2.14"
[32] "2011-10-31"
[33] "Added mandatory namespaces for
packages. Added a new parallel package."
[34] "2.15"
[35] "2012-03-30"
[36] "New load balancing functions. Improved
serialization speed for long vectors."
[37] "3.0"
[38] "2013-04-03"
[39] "Support for numeric index values
231 and larger on 64 bit systems."
[40] "3.1"
[41] "2014-04-10"
[42] ""
[43] "3.2"
[44] "2015-04-16"
[45] ""
There's more...
The following are a few useful resources that will help you get the most out of this recipe:
A useful list of HTML tags, to show you how HTML files are structured and how to
identify code that you need to get from these files, is provided at
http://www.w3schools.com/tags/tag_code.asp
The blog post from the RStudio guys introducing the rvest package and highlighting
some package functionalities can be found at
http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
A typical use case for API data contains data regarding web and mobile applications, for
instance, Google Analytics data or data regarding social networking activities.
The successful web application If This Then That (IFTTT), for instance, lets you link together
different applications, making them share data with each other and building powerful and
customizable workflows:
This useful job is done by leveraging the application's API (if you don't know IFTTT, just
navigate to https://ifttt.com, and I will see you there).
Using R, it is possible to authenticate and get data from every API that adheres to the OAuth
1 and OAuth 2 standards, which are nowadays the most popular standards (even though
opinions about these protocols are changing; refer to the popular post by the OAuth
specification editor Eran Hammer at
http://hueniverse.com/2012/07/26/oauth-2-0-and-the-road-to-hell/). Moreover, specific
packages have been developed for a lot of APIs.
This recipe shows how to access custom APIs and leverage packages developed for
specific APIs.
In the There's more... section, suggestions are given on how to develop custom functions
for frequently used APIs.
Getting ready
The httr package, once again a product of our benefactor Hadley Wickham, provides
a complete set of functionalities for sending and receiving data through the HTTP protocol
on the Web. Take a look at the quick-start guide hosted on GitHub to get a feeling of httr's
functionalities (https://github.com/hadley/httr).
Among those functionalities, functions for dealing with APIs are provided as well.
Both OAuth 1.0 and OAuth 2.0 interfaces are implemented, making this package really useful
when working with APIs.
Let's look at how to get data from the GitHub API. By changing small sections, I will point out
how you can apply it to whatever API you are interested in.
Let's now actually install the httr package:
install.packages("httr")
library(httr)
How to do it...
1. The first step to connect with the API is to define the API endpoint. Specifications for
the endpoint are usually given within the API documentation. For instance, GitHub
gives this kind of information at http://developer.github.com/v3/oauth/.
In order to set the endpoint information, we are going to use the oauth_endpoint()
function, which requires us to set the following arguments:
request: This is the URL that is required for the initial unauthenticated
token. This is deprecated for OAuth 2.0, so you can leave it NULL in this case.
authorize: This is the URL where it is possible to gain authorization for the
given client.
access: This is the URL where the exchange for an authenticated token
is made.
base_url: This is the API base URL on which the other URLs (that is, the URLs
passed to the previous arguments) are built.
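Putting this together for GitHub, the endpoint object referred to later in this recipe as github_api can be defined along the following lines; the authorize and access values below are the publicly documented GitHub OAuth URLs, so treat them as an assumption to be checked against the API documentation:
github_api <- oauth_endpoint(request   = NULL,  # not needed for OAuth 2.0
                             authorize = "https://github.com/login/oauth/authorize",
                             access    = "https://github.com/login/oauth/access_token",
                             base_url  = "https://github.com/login/oauth")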
2. Create an application to get a key and secret token. Moving on with our GitHub
example, in order to create an application, you will have to navigate to
https://github.com/settings/applications/new (assuming that you are already
authenticated on GitHub).
Be aware that no particular URL is needed as the homepage URL, but a specific URL
is required as the authorization callback URL.
This is the URL that the API will redirect to after the method invocation is done.
As you would expect, since we want to establish a connection from GitHub to our
local PC, you will have to redirect the API to your machine, setting the Authorization
callback URL to http://localhost:1410.
After creating your application, you can get back to your R session to establish a
connection with it and get your data.
3. After getting back to your R session, you now have to set your OAuth credentials
through the oauth_app() and oauth2.0_token() functions and establish a
connection with the API, as shown in the following code snippet:
app <- oauth_app("your_app_name",
                 key    = "your_app_key",
                 secret = "your_app_secret")
API_token <- oauth2.0_token(github_api, app)
4. This is where you actually use the API to get data from your web-based software.
Continuing on with our GitHub-based example, let's request some information about
API rate limits:
request <- GET("https://api.github.com/rate_limit", config(token = API_token))
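GET() returns a response object; to actually look at the rate-limit figures, you can parse its body with httr's content() function. A minimal sketch:
# parse the JSON body of the response into an R list and inspect it
rate_limits <- content(request)
rate_limits$resources$core  # e.g. limit, remaining calls, and reset time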
How it works...
Be aware that this step will be required for both OAuth 1.0 and OAuth 2.0 APIs, as the
difference between them is only the absence of a request URL, as we noted earlier.
Endpoints for popular APIs
The httr package comes with a set of endpoints that are already
implemented for popular APIs, and specifically for the following websites:
LinkedIn
Twitter
Vimeo
Google
Facebook
GitHub
For these APIs, you can substitute the call to oauth_endpoint() with a
call to the oauth_endpoints() function, for instance:
oauth_endpoints("github")
There's more...
You can also write custom functions to handle APIs. When frequently dealing with a particular
API, it can be useful to define a set of custom functions in order to make it easier to interact
with.
Authentication
Getting content from the API
Posting content to the API
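The authentication helper used by the following snippets, api_auth(), is not shown in this excerpt; the version below is only a hypothetical sketch, assuming the API accepts HTTP basic authentication via httr's authenticate() function:
# hypothetical helper: builds an authentication configuration for the API,
# pairing a user name (here, the path argument used below) with a password
api_auth <- function(path, password){
  authenticate(path, password, type = "basic")
}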
You can get the content from the API through the GET() function of the httr package:
api_get <- function(path = "api_path", password){
  auth <- api_auth(path, password)
  request <- GET("https://api.com", path = path, auth)
}
Posting content will be done in a similar way through the POST function:
api_post <- function(post_body, path = "api_path", password){
  auth <- api_auth(path, password)
  stopifnot(is.list(post_body))
  body_json <- jsonlite::toJSON(post_body)
  request <- POST("https://api.application.com", path = path,
                  body = body_json, auth)
}
Getting ready
First of all, we have to install our great twitteR package by running the following code:
install.packages("twitteR")
library(twitteR)
How to do it...
1. As seen with the general procedure, in order to access the Twitter API, you will need
to create a new application. This link (assuming you are already logged in to Twitter)
will do the job: https://apps.twitter.com/app/new.
Feel free to give whatever name, description, and website to your app that you want.
The callback URL can be also left blank.
After creating the app, you will have access to an API key and an API secret, namely
Consumer Key and Consumer Secret, in the Keys and Access Tokens tab in your
app settings.
Below the section containing these tokens, you will find a section called Your Access
Token. These tokens are required in order to let the app perform actions on your
account's behalf. For instance, you might want to send direct messages to all new
followers and could therefore write an app to do that automatically.
Keep a note of these tokens as well, since you will need them to set up your
connection within R.
2. Then, we will get access to the API from R. In order to authenticate your app and use
it to retrieve data from Twitter, you will just need to run a line of code, specifically, the
setup_twitter_oauth() function, by passing the following arguments:
setup_twitter_oauth(consumer_key    = "consumer_key",
                    consumer_secret = "consumer_secret",
                    access_token    = "access_token",
                    access_secret   = "access_secret")
retryOnRateLimit: This is the number that defines how many times the query
should be retried when the API rate limit is reached
tweet_list will be a list of the first 450 tweets resulting from the given query.
Be aware that since n is the maximum number of tweets retrievable, you may retrieve
a smaller number of tweets if, for the given query, the number of results is smaller
than n.
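The call producing tweet_list is not shown in this excerpt; a minimal sketch, assuming the same query used later in this recipe and the 450-tweet limit discussed above, would be:
tweet_list <- searchTwitter("data science R", n = 450)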
Each element of the list will show the following attributes:
text
favorited
favoriteCount
replyToSN
created
truncated
replyToSID
id
replyToUID
statusSource
screenName
retweetCount
isRetweet
retweeted
longitude
latitude
In order to let you work on this data more easily, a specific function is provided to
transform this list into a more convenient data.frame, namely, the twListToDF()
function.
After this, we can run the following line of code:
tweet_df <- twListToDF(tweet_list)
This will result in a tweet_df object that has the following structure:
> str(tweet_df)
'data.frame': 20 obs. of 16 variables:
 $ text         : chr "95% off Applied Data Science with R" ...
 $ favorited    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ favoriteCount: num 0 2 0 2 0 0 0 0 0 1 ...
 $ replyToSN    : logi NA NA NA NA NA NA ...
 $ created      : POSIXct, format: "2015-10-16 09:03:32" "2015-10-15 17:40:33" "2015-10-15 11:33:37" "2015-10-15 05:17:59" ...
 $ truncated    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSID   : logi NA NA NA NA NA NA ...
 $ id           : chr "654945762384740352" "654713487097135104" "654621142179819520" "654526612688375808" ...
 $ replyToUID   : logi NA NA NA NA NA NA ...
Referring you to the data visualization chapters for more advanced techniques, we will
now quickly visualize the retweet distribution of our tweets, leveraging the base R
hist() function:
hist(tweet_df$retweetCount)
This code will result in a histogram with the number of retweets on the x axis and the
frequency of those numbers on the y axis:
There's more...
As stated in the official Twitter documentation, particularly at
https://dev.twitter.com/rest/public/rate-limits, there is a limit to the number of tweets you
can retrieve within a certain period of time, and this limit is set to 450 every 15 minutes.
However, what if you are engaged in a really demanding job and you want to base your work on a
significant number of tweets? Should you set the n argument of searchTwitter() to 450
and wait for 15 everlasting minutes? Not quite: the twitteR package provides a convenient
way to overcome this limit through the register_db_backend(), register_sqlite_backend(),
and register_mysql_backend() functions. These functions allow you to
create a connection with the named type of database, passing the database name, path,
username, and password as arguments, as you can see in the following example:
register_mysql_backend("db_name", "host","user","password")
You can now leverage the search_twitter_and_store function, which stores the
search results in the connected database. The main feature of this function is the
retryOnRateLimit argument, which lets you specify the number of tries to be performed by
the code once the API limit is reached. Setting this limit to a convenient level will likely let you
pass the 15-minutes interval:
tweets_db = search_twitter_and_store("data science R",
retryOnRateLimit = 20)
Retrieving stored data will now just require you to run the following code:
from_db = load_tweets_db()
Getting ready
This recipe will mainly be based on functions from the Rfacebook package. Therefore, we need
to install and load this package in our environment:
install.packages("Rfacebook")
library(Rfacebook)
How to do it...
1. In order to leverage an API's functionalities, we first have to create an application
in our Facebook profile. Navigating to the following URL will let you create an app
(assuming you are already logged in to Facebook): https://developers.facebook.com.
After skipping the quick start (the button on the upper-right corner), you can see the
settings of your app and take note of app_id and app_secret, which you will
need in order to establish a connection with the app.
2. After installing and loading the Rfacebook package, you will easily be able to
establish a connection by running the fbOAuth() function as follows:
fb_connection <- fbOAuth(app_id     = "your_app_id",
                         app_secret = "your_app_secret")
fb_connection
Running the last line of code will result in a console prompt, as shown in the following
lines of code:
copy and paste into site URL on Facebook App Settings:
http://localhost:1410/ When done press any key to continue
Following this prompt, you will have to copy the URL and go to your Facebook
app settings.
Once there, you will have to select the Settings tab and create a new platform
through the + Add Platform control. In the form, which will prompt you after clicking
this control, you should find a field named Site Url. In this field, you will have to paste
the copied URL.
Close the process by clicking on the Save Changes button.
At this point, a browser window will open up and ask you to allow access permission
from the app to your profile. After allowing this permission, the R console will print out
the following code snippet:
Authentication complete
Authentication successful.
3. To test our API connection, we are going to search Facebook for posts related to data
science with R and save the results within data.frame for further analysis.
Among other useful functions, Rfacebook provides the searchPages() function,
which as you would expect, allows you to search the social network for pages
mentioning a given string.
Unlike the searchTwitter() function, this function will not let you specify a
lot of arguments:
token: This is the valid OAuth token created with the fbOAuth() function
To search for data science with R, you will have to run the following line of code:
pages <- searchPages('data science with R', fb_connection)
This will result in a data.frame storing all the pages retrieved, along with the data
concerning them.
As seen for the twitteR package, we can take a quick look at the like distribution,
leveraging the base R hist() function:
hist(pages$likes)
Refer to the data visualization section for further recipes on data visualization.
Getting ready
As a preliminary step, we are going to install and load the RGoogleAnalytics package:
install.packages("RGoogeAnalytics")
library(RGoogleAnalytics)
How to do it...
1. The first step that is required to get data from Google Analytics is to create a Google
Analytics application.
This can be easily done at https://console.developers.google.com/apis (assuming that you
are already logged in to Google Analytics).
After creating a new project, you will see a dashboard with a left menu containing,
among others, the APIs & auth section, with the APIs subsection.
After selecting this section, you will see a list of available APIs, and among these,
at the bottom-left corner of the page, there will be the Advertising APIs with the
Analytics API within it:
After enabling the API, you will have to go back to the APIs & auth section and select
the Credentials subsection.
In this section, you will have to add an OAuth client ID, select Other, and assign a
name to your app:
After doing that and selecting the Create button, you will be prompted with a window
showing your app ID and secret. Take note of them, as you will need them to access
the analytics API from R.
At this point, a browser window will open up and ask you to allow access permission
from the app to your Google Analytics account.
After you allow access, the R console will print out the following:
Authentication complete
3. This last step basically requires you to shape a proper query and submit it through
the connection established in the previous paragraphs. A Google Analytics query can
be easily built, leveraging the powerful Google Query explorer which can be found at
https://ga-dev-tools.appspot.com/query-explorer/.
This web tool lets you experiment with query parameters and define your query before
submitting the request from your code.
The basic fields that are mandatory in order to execute a query are as follows:
The view ID: This is a unique identifier associated with your Google Analytics
property. This ID will automatically show up within Google Query Explorer.
Start-date and end-date: These are the start and end dates in the form
YYYY-MM-DD, for example, 2012-05-12.
Metrics: This refers to the ratios and numbers computed from the data
related to visits within the date range. You can find the metrics code in
Google Query Explorer.
If you are going to further elaborate your data within your data project, you will
probably find it useful to add a date dimension ("ga:date") in order to split your
data by date.
Having defined your arguments, you will just have to pack them in a list using the
Init() function, build a query using the QueryBuilder() function, and submit it
with the GetReportData() function:
query_parameters <- Init(start.date = "2015-01-01",
                         end.date   = "2015-06-30",
                         metrics    = "ga:sessions,ga:bounceRate",
                         dimensions = "ga:date",
                         table.id   = "ga:33093633")
ga_query <- QueryBuilder(query_parameters)
ga_df <- GetReportData(ga_query, ga_token)
A first representation of this data could be a simple plot, showing the bounce rate for
each day from the start date to the end date:
plot(ga_df)
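For a more readable view, you can plot the bounce rate as a proper time series; the following is just a sketch, assuming the query above returns columns named date (as YYYYMMDD strings) and bounceRate:
# convert the date strings and draw the bounce rate as a line chart
ga_df$date <- as.Date(ga_df$date, format = "%Y%m%d")
plot(ga_df$date, ga_df$bounceRate, type = "l",
     xlab = "date", ylab = "bounce rate (%)")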
There's more...
Google Analytics is a complete and always-growing set of tools for performing web analytics
tasks. If you are facing a project involving the use of this platform, I would definitely
suggest that you take the time to go through the official tutorials from Google at
https://analyticsacademy.withgoogle.com.
This complete set of tutorials will introduce you to the fundamental logic and assumptions of
the platform, giving you a solid foundation for any subsequent analysis.
Getting ready
As you would expect, we first need to install and load the rio package:
install.packages("rio")
library(rio)
In the following example, we are going to import our well-known world_gdp_data dataset
from a local .csv file.
How to do it...
1. The first step is to import the dataset using the import() function:
messy_gdp <- import("world_gdp_data.csv")
How it works...
We first import the dataset using the import() function. To understand the structure of
the import() function, we can leverage a useful behavior of the R console: typing a
function name without parentheses and running the command will print the function's
definition.
Running import on the R console will produce the following output:
function (file, format, setclass, ...)
{
    if (missing(format))
        fmt <- get_ext(file)
    else fmt <- tolower(format)
    if (grepl("^http.*://", file)) {
        temp_file <- tempfile(fileext = fmt)
        on.exit(unlink(temp_file))
        curl_download(file, temp_file, mode = "wb")
        file <- temp_file
    }
    x <- switch(fmt,
        r = dget(file = file),
        tsv = import.delim(file = file, sep = "\t", ...),
        txt = import.delim(file = file, sep = "\t", ...),
        fwf = import.fwf(file = file, ...),
        rds = readRDS(file = file, ...),
        csv = import.delim(file = file, sep = ",", ...),
        csv2 = import.delim(file = file, sep = ";", dec = ",", ...),
        psv = import.delim(file = file, sep = "|", ...),
        rdata = import.rdata(file = file, ...),
        dta = import.dta(file = file, ...),
        dbf = read.dbf(file = file, ...),
        dif = read.DIF(file = file, ...),
        sav = import.sav(file = file, ...),
        por = read_por(path = file),
        sas7bdat = read_sas(b7dat = file, ...),
        xpt = read.xport(file = file),
        mtp = read.mtp(file = file, ...),
        syd = read.systat(file = file, to.data.frame = TRUE),
        json = fromJSON(txt = file, ...),
        rec = read.epiinfo(file = file, ...),
        arff = read.arff(file = file),
        xls = read_excel(path = file, ...),
        xlsx = import.xlsx(file = file, ...),
        fortran = import.fortran(file = file, ...),
        zip = import.zip(file = file, ...),
        tar = import.tar(file = file, ...),
        ods = import.ods(file = file, ...),
        xml = import.xml(file = file, ...),
        clipboard = import.clipboard(...),
        gnumeric = stop(stop_for_import(fmt)),
        jpg = stop(stop_for_import(fmt)),
        png = stop(stop_for_import(fmt)),
        bmp = stop(stop_for_import(fmt)),
        tiff = stop(stop_for_import(fmt)),
        sss = stop(stop_for_import(fmt)),
        sdmx = stop(stop_for_import(fmt)),
        matlab = stop(stop_for_import(fmt)),
        gexf = stop(stop_for_import(fmt)),
        npy = stop(stop_for_import(fmt)),
        stop("Unrecognized file format"))
    if (missing(setclass)) {
        return(set_class(x))
    }
    else {
        a <- list(...)
        if ("data.table" %in% names(a) && isTRUE(a[["data.table"]]))
            setclass <- "data.table"
        return(set_class(x, class = setclass))
    }
}
As you can see, the first task performed by the import() function is to call the get_ext()
function, which basically retrieves the extension from the filename.
Once the file format is clear, the import() function looks for the right sub-import function
to be used and returns the result of this function.
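As the missing(format) branch above shows, you can also bypass the extension-based detection by passing the format argument explicitly. A minimal sketch, assuming a hypothetical copy of our dataset saved with a generic .txt extension:
# the file name is illustrative; format = "csv" forces the CSV reader
messy_gdp <- import("world_gdp_data.txt", format = "csv")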
Next, we visualize the result with the RStudio viewer. One of the most powerful RStudio tools
is the data viewer, which lets you get a spreadsheet-like view of your data.frame objects.
With RStudio 0.99, this tool got even more powerful, removing the previous 1,000-row limit
and adding the ability to filter and sort your data.
When using this viewer, you should be aware that all filtering and ordering activities will not
affect the original data.frame object you are visualizing.
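You can also open the viewer directly from code; a quick sketch, assuming the messy_gdp object imported above:
View(messy_gdp)  # opens the spreadsheet-like data viewer in RStudio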
There's more...
As fully illustrated within the rio vignette (which can be found at
https://cran.r-project.org/web/packages/rio/vignettes/rio.html), a wide range of formats
is supported for import and export.
The vignette lists, for each format, whether import, export, or both are available; the
supported formats include JSON (.json), YAML (.yml), Stata (.dta), Excel (.xls and .xlsx),
R syntax (.R), SAS (.sas7bdat), Minitab (.mtp), Epiinfo (.rec), Systat (.syd), and
Google Sheets, among others.
Since rio is still a growing package, I strongly suggest that you follow its development on its
GitHub repository, where you will easily find out when new formats are added, at
https://github.com/leeper/rio.
Getting ready
First of all, we need to install and make the rio package available by running the
following code:
install.packages("rio")
library(rio)
In the following example, we are going to import the world_gdp_data dataset from a local
.csv file. This dataset is provided within the RStudio project related to this book, in the
data folder.
You can download it by authenticating your account at http://packtpub.com.
How to do it...
1. The first step is to convert the file from the .csv format to the .json format:
convert("world_gdp_data.csv", "world_gdp_data.json")
This will create a new file without removing the original one.
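Under the hood, convert() is a thin wrapper that imports the source file and re-exports it in the target format; an equivalent two-step sketch for this conversion, assuming the same file names, would be:
export(import("world_gdp_data.csv"), "world_gdp_data.json")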
2. The next step is to remove the original file:
file.remove("world_gdp_data.csv")
There's more...
As fully illustrated within the rio vignette (which you can find at
https://cran.r-project.org/web/packages/rio/vignettes/rio.html), a wide range of formats
is supported for import and export.
The vignette lists, for each format, whether import, export, or both are available; the
supported formats include JSON (.json), YAML (.yml), Stata (.dta), Excel (.xls and .xlsx),
R syntax (.r), SAS (.sas7bdat), Minitab (.mtp), Epiinfo (.rec), Systat (.syd), and
Google Sheets, among others.
Since rio is still a growing package, I strongly suggest that you follow its development
on its GitHub repository, where you will easily find out when new formats are added,
at https://github.com/leeper/rio.