Session 2: Wednesday, 1-3 p.m.

Harvesting Twitter data using streamR

There are tens of thousands of software packages that extend R, often tailoring it to very specific purposes. If you are a nuclear physicist studying, say, neutrons in space, R will have your back: there is probably a package that does exactly the statistics you need.

One such specialized package is streamR. It is designed specifically to let you download Twitter data, and then load it into R.

Here is how you load a package in R:

library(streamR)

Try that on your computer.

Didn’t work? Presumably, this is because you haven’t installed the package yet.

So let’s install it:

install.packages("streamR")   # remember to use double quotes

Does it look like it worked? Good. Now try loading it again.

library(streamR)   # remember to use no quotes
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson

Above is what your R console should show – confirmation that the package and its dependencies were loaded.

1. Get squared away with Twitter

Before we do any harvesting, we need to (a) have a Twitter account and (b) get friendly with the interface that Twitter provides for developers: its “API” (application programming interface).

In preparation for this workshop, I asked you to register as a developer with Twitter and do this:

  • Go to apps.twitter.com and create a new “app”. (The first screen requests two URLs. For the first one, enter whatever you like. Leave the second one blank.)
  • Once you have created the app, go to “Keys and Access Tokens” and make a note of the:
    • Consumer Key (API Key), and
    • Consumer Secret (API Secret).

2. Start harvesting

Just kidding, we’re not quite ready. There’s one more thing: we first need to load one more package, and chances are you’ll need to install it (see above).

library(ROAuth)

(After installing ROAuth, don’t forget to run the library(ROAuth) command again.)

Once you have streamR and ROAuth installed and loaded, copy, paste, adapt and run this code.

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <-     "YOUR CONSUMER KEY"
consumerSecret <-  "YOUR CONSUMER SECRET"
my_oauth <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret, 
                             requestURL=requestURL,
                             accessURL=accessURL, authURL=authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

The last command will open a browser window that asks you to authorize the app and then shows a PIN. Enter the PIN into R. After that, you should be ready to use the commands in the streamR package.
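The handshake only needs to be done once per machine. A common convenience – my own suggestion, not part of the workshop materials, and using only base R functions – is to save the authenticated my_oauth object to disk so you can skip the handshake in later sessions:

# Save the authenticated credentials once:
save(my_oauth, file = "my_oauth.Rdata")

# In a later session, reload them instead of redoing the handshake:
load("my_oauth.Rdata")   # restores the my_oauth object

The file name "my_oauth.Rdata" is just an example; any name will do.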

Use a command like this:

tweets <- try(filterStream("filename.json",
                           locations = c(-180, -90, 180, 90),
                           timeout = 45, 
                           oauth = my_oauth))

"filename.json" is the name of the file in which your downloaded tweets will be stored. You can call it alfred, mildred, thomas, or whatever you wish, but be sure to end the name in .json and to enclose it in double quotes.

locations is a bounding box of longitude-latitude value pairs. In this case, the values cover the whole globe. In other words, it just downloads all geolocated tweets. If you wanted to download all tweets that your account receives, even if they aren’t geotagged, then set locations to NULL.

timeout stands for the time that R will be collecting tweets, in seconds. If set to 0, it will continue indefinitely.

If you should be looking to study specific lexical items, then use the track argument (consult the streamR documentation for details).

To restrict your search to just tweets in English, use language = "en" as an argument.
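For instance, the track and language arguments can be combined in one call. In this sketch, the keyword "coffee", the file name, and the 60-second timeout are all placeholder choices of my own:

# Collect English-language tweets mentioning "coffee" for 60 seconds:
tweets <- try(filterStream("coffee_tweets.json",
                           track = "coffee",
                           language = "en",
                           timeout = 60,
                           oauth = my_oauth))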

For the following exercises, be sure to have the streamR documentation open.

Exercise 1

Use the filterStream() command provided above and adjust the collection time to 2 min 30 sec, then run it.

Exercise 2

Design another filterStream() command, this time tracking the key phrases “dumb ass” and “big ass”. Since you are looking for more than one phrase, you will need to provide them as a vector, which should have this format:

c("", "")

Run this call as well, restricting it to 2 minutes 30 seconds, just like in Exercise 1.

Exercise 3

Design and run a third call. This time, restrict the search to the Republic of Ireland.
Remember that Google Maps and the Twitter API work with opposite sequencing of latitude and longitude values…
Include a track argument of your own choosing as well.
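Once any of these calls has finished, the tweets sit in a .json file on disk. To actually look at them in R, streamR provides parseTweets(), which reads such a file into a data frame. The file name below is just the one from the example call further up; substitute whatever name you used:

# Read the downloaded tweets into a data frame for analysis:
tweets.df <- parseTweets("filename.json")

# Inspect what we got: the number of tweets and the available fields.
nrow(tweets.df)
names(tweets.df)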