Data Collection & Scraping
Introduction
RESTful API
Streaming API
Semi-structured HTML
Pablo Barberá
Outline
Streaming API:
Connect to the stream of tweets as they are being published
Examples: random sample of all tweets, tweets that mention a
keyword, tweets from a set of users...
R library: streamR
More: dev.twitter.com/docs/api/1.1
Authentication
Code: 02_analysis_twitter_nyu.R
"anthlittle"
"theumpires"
drewconway, cdsamii, p_barbera, griverorz, j_a_tucker
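library(igraph)
# build a graph object from the adjacency matrix and plot it
# (graph_from_adjacency_matrix() is the current name of graph.adjacency())
network <- graph_from_adjacency_matrix(adjMatrix)
plot(network)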
[Figure: network plot with nodes eminedeniz, saadgulzar, AriasEric, cdsamii, JonHaidt, griverorz, therriaultphd, drewconway, j_a_tucker, LaineStrutton, o_garcia_ponce, p_barbera, pfernandezvz, SMaPP_NYU, DrewDim, Elad663, LindseyCormack, patricionavia, oleacesar, Camila_Vergara]
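The slides pass adjMatrix to igraph without showing how it is built. A minimal sketch in base R, assuming a hypothetical edge list of who follows whom (the edges below are invented for illustration; only the screen names come from the slides):

```r
# Hypothetical follower edge list: each row means "from" follows "to".
edges <- data.frame(
  from = c("p_barbera", "p_barbera", "j_a_tucker", "cdsamii"),
  to   = c("j_a_tucker", "drewconway", "p_barbera", "j_a_tucker"),
  stringsAsFactors = FALSE
)

# One row and one column per user, filled with zeros.
users <- sort(unique(c(edges$from, edges$to)))
adjMatrix <- matrix(0, nrow = length(users), ncol = length(users),
                    dimnames = list(users, users))

# A two-column character matrix indexes cells by (row name, column name).
adjMatrix[cbind(edges$from, edges$to)] <- 1

# Column sums count how many users in the list follow each account.
in_degree <- colSums(adjMatrix)
```

The resulting matrix is directed: rows are followers, columns are the accounts they follow.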
cdsamii: 9, p_barbera: 10, griverorz: 5, j_a_tucker: 12
[Figure: network plot with nodes Camila_Vergara, patricionavia, LaineStrutton, SMaPP_NYU, o_garcia_ponce, oleacesar, griverorz, zeitzoff, AriasEric, p_barbera, Elad663, saadgulzar, eminedeniz, j_a_tucker, pfernandezvz, drewconway, DrewDim, cdsamii, therriaultphd, LindseyCormack, JonHaidt]
## [[1]]
## [1] "Jen_at_APSA: Is the Republican attack on political science self-defeati
Code: 03_tweets_search.R
Limitations:
Not all tweets are indexed or made available via search.
Does not contain user metadata
Limited to a few thousand most recent tweets
Old tweets are not available.
Streaming API
Anatomy of a tweet
{ "created_at":"Wed Nov 07 04:16:18 +0000 2012",
  "id":266031293945503744,
  "id_str":"266031293945503744",
  "text":"Four more years. http://t.co/bAJE6Vom",
  "source":"web",
  "user":
    { "id":813286,
      "id_str":"813286",
      "name":"Barack Obama",
      "screen_name":"BarackObama",
      "location":"Washington, DC",
      "url":"http://www.barackobama.com",
      "description":"This account is run by #Obama2012 campaign staff. Tweets from the President are signed -bo.",
      "protected":false,
      "followers_count":23487605,
      "friends_count":670339,
      "listed_count":182313,
      "created_at":"Mon Mar 05 22:08:25 +0000 2007",
      "utc_offset":-18000,
      "time_zone":"Eastern Time (US & Canada)",
      "geo_enabled":false,
      "verified":true,
      "statuses_count":7972,
      "lang":"en" },
  "geo":null,
  "coordinates":null,
  "place":null,
  "retweet_count":816600 }
Tweet information / User information / Geographic information
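The created_at fields above use a fixed text layout (weekday, month, day, time, offset, year). A minimal sketch of turning them into R date-times, assuming the offset is always +0000 as in this tweet; parse_tweet_time is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: parse "Wed Nov 07 04:16:18 +0000 2012" into POSIXct.
# Assumes the offset field is +0000 (UTC), as in the example tweet.
parse_tweet_time <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)[[1]]
  # parts: weekday, month abbreviation, day, time, offset, year
  month <- match(parts[2], month.abb)  # month.abb is always English
  iso <- sprintf("%s-%02d-%s %s", parts[6], month, parts[3], parts[4])
  as.POSIXct(iso, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

tweeted <- parse_tweet_time("Wed Nov 07 04:16:18 +0000 2012")
joined  <- parse_tweet_time("Mon Mar 05 22:08:25 +0000 2007")

# Age of the account, in days, when the tweet was posted.
account_age_days <- as.numeric(difftime(tweeted, joined, units = "days"))
```

Matching the month against month.abb avoids depending on the session locale, which the %b format code would.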
library(streamR)
# my_oauth: credentials object created in the authentication step
load("my_oauth")
# capturing 3 minutes (180 seconds) of tweets mentioning obama or biden
filterStream(file.name = "tweets_keyword.json", track = c("obama", "biden"),
             timeout = 180, oauth = my_oauth)
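Once tweets are captured, a quick sanity check is to count how many mention each tracked keyword. A minimal sketch in base R, using invented tweet texts in place of the captured file; case-insensitive substring matching is an assumption here and only approximates how the API matches track terms:

```r
# Hypothetical tweet texts standing in for the captured stream.
texts <- c("Obama wins!",
           "Biden speaks tonight",
           "obama and biden on stage",
           "unrelated tweet")
keywords <- c("obama", "biden")

# Number of tweets containing each keyword, ignoring case.
mentions <- sapply(keywords, function(k) {
  sum(grepl(k, texts, ignore.case = TRUE))
})
```

A tweet that mentions both keywords is counted once under each.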
But remember...
Code: 06_scraping_election_georgia.R
library(XML)
url <- "http://results.cec.gov.ge/index.html"
# parse every <table> on the page into a list of data frames
table <- readHTMLTable(url, stringsAsFactors = FALSE)
# How to know which table to extract? Run 'str(table)' and look for the table
# of interest. Alternatively, search the HTML source for the table ID.
table <- table$table36
table[1:6, 2:6]
##           V2         V3            V4        V5          V6
## 1          1          4             5         9          10
## 2 222(0.56%)  51(0.13%) 13229(33.44%) 16(0.04%)  413(1.04%)
## 3 380(0.54%)  92(0.13%) 16728(23.62%) 44(0.06%)  939(1.33%)
## 4  358(0.4%) 123(0.14%) 22539(25.06%) 56(0.06%) 1047(1.16%)
## 5  82(0.31%)   27(0.1%)  9991(37.67%) 64(0.24%)   344(1.3%)
## 6 164(0.26%)  56(0.09%) 19778(31.06%) 92(0.14%) 1052(1.65%)
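Each scraped cell packs a vote count and a percentage into one string, such as 222(0.56%). A minimal sketch of splitting the two with regular expressions in base R; the input values are copied from the output above, and the variable names are illustrative:

```r
# Cells as scraped: "count(share%)".
cells <- c("222(0.56%)", "51(0.13%)", "13229(33.44%)", "358(0.4%)")

# Vote count: everything before the opening parenthesis.
votes <- as.numeric(sub("\\(.*$", "", cells))

# Vote share: the number between "(" and "%)".
share <- as.numeric(sub("^.*\\((.*)%\\)$", "\\1", cells))
```

The same two calls can be applied column by column to the whole table once the header row is dropped.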
[Figure: two bar plots, table(last.digit(results$party_41)) and table(last.digit(results$party_5)); y-axis from 0 to 400]
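The slides call last.digit() without showing its definition. A minimal sketch of one plausible definition, assuming the inputs are non-negative whole vote counts; the five counts below are the V4 values from the scraped table:

```r
# Assumed definition: the slides use last.digit() but do not show it.
last.digit <- function(x) as.integer(x) %% 10L

votes <- c(13229, 16728, 22539, 9991, 19778)

# Fix the levels to 0-9 so digits that never occur still appear with count 0.
digits <- factor(last.digit(votes), levels = 0:9)
counts <- table(digits)
```

With the full set of precinct results, counts can be passed to barplot() to produce a figure like the ones on the slide.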
URL: www.ipaidabribe.com
Code: 07_scraping_india_bribes.R
## all urls
urls <- paste0("http://www.ipaidabribe.com/reports/paid?page=", 0:50)
## empty list
data <- list()
## looping over urls...
for (i in seq_along(urls)) {
    # extracting information
    data[[i]] <- extract.bribes(urls[i])
    # waiting one second between hits
    Sys.sleep(1)
    cat(i, "done!\n")
}
## transforming it into a data.frame
data <- data.frame(do.call(rbind, data), stringsAsFactors = FALSE)
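In the loop above, a single failed download stops the whole scrape. A minimal sketch of one way to skip failed pages instead, using tryCatch; fake_extract is a hypothetical stand-in for extract.bribes() that fails on one page, and the example.com URLs are placeholders:

```r
# Hypothetical stand-in for extract.bribes(): two reports per page,
# with page 2 failing to simulate a network error.
fake_extract <- function(url) {
  if (grepl("page=2$", url)) stop("simulated download failure")
  data.frame(url = url, amount = c("500", "1200"), stringsAsFactors = FALSE)
}

urls <- paste0("http://example.com/reports?page=", 0:3)

# Return NULL for a page that errors instead of aborting the loop.
pages <- lapply(urls, function(u) {
  tryCatch(fake_extract(u), error = function(e) NULL)
  # a real scraper would also pause here, e.g. Sys.sleep(1)
})

# Drop failed pages, then stack the rest into one data frame.
pages <- Filter(Negate(is.null), pages)
bribes <- do.call(rbind, pages)
```

Keeping a record of which URLs returned NULL makes it possible to retry them later.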
[Output: report counts by department (Railway Police; Police; Airports; Stamps and Registration; Passport; Customs, Excise and Service Tax; values 406, 36, 13, 11, 9, 7) and by city (Bangalore; Pune; Gurgaon; values 157, 151, 23, 12)]
summary(as.numeric(data$amounts))
##  Median    NA's
##     800       1
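The NA in the summary suggests that at least one scraped amount could not be converted to a number. A minimal sketch of cleaning such values before summarising, using invented raw strings; the thousands separator and the "unknown" entry are assumptions about what the scraped text might contain:

```r
# Hypothetical raw values as they might come out of the scraper.
amounts <- c("500", "1,200", "800", "unknown")

# Remove thousands separators; anything still non-numeric becomes NA.
clean <- suppressWarnings(as.numeric(gsub(",", "", amounts)))

n_missing <- sum(is.na(clean))
med <- median(clean, na.rm = TRUE)
```

Counting the NAs explicitly shows how many reports were dropped from the summary.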
References
Jackman, Simon. 2006. "Data from the Web into R." The Political Methodologist, 14(2).
Hanretty, Chris. Scraping the Web for Arts and Humanities. LINK
Leipzig, Jeremy and Xiao-Yi Li. Data Mashups in R. O'Reilly.
Russell, Matthew. Mining the Social Web. O'Reilly.
R libraries: scrapeR, XML, twitteR (check vignettes for examples)
Python libraries: BeautifulSoup, tweepy
Alex Hanna's Tworkshops LINK