Program for today

  1. Harvesting Twitter data using custom searches
  2. Searching for strings in the data
  3. Coding data
  4. Mapping findings

0. Preliminaries

  • Update R to v. 3.3.1 (here)
  • Update RStudio to v. 0.99.903 (here)
  • Install these R packages: streamR, ROAuth, dplyr (a one-line command that installs all three is given at the end of this section).
  • Make sure you have a Google account, and that you’re able to log into drive.google.com.
  • Go to apps.twitter.com and create a new “app”. (The first screen requests two URLs. For the first one, enter whatever you like. Leave the second one blank.)
  • Once you have created the app, go to “Keys and Access Tokens” and make a note of the:
    • Consumer Key (API Key), and
    • Consumer Secret (API Secret).
  • Say hello to Pablo Barbera, the kind soul who wrote streamR:

    Here is his website.
  • Finally, some housekeeping:
    • start RStudio,
    • initiate a “Project” for today’s work (why not call it “socialmediaworkshop_1”),
    • open, name, and save a working script file for yourself, and
    • remember the local folder where the project will be housed – that’s also where you’ll come back to deal with your tweets.
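
One more bit of housekeeping: if any of the three packages listed above are missing from your machine, this one-liner (run it once from the console) installs them all:

install.packages(c("streamR", "ROAuth", "dplyr"))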

1. Harvesting Twitter data using custom searches

First, load two R packages.

library(ROAuth)
library(streamR)

Once streamR and ROAuth are loaded, you can copy, paste, adapt and run this code, which initiates your working session with streamR:

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <-     "YOUR CONSUMER KEY"
consumerSecret <-  "YOUR CONSUMER SECRET"
my_oauth <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret, 
                             requestURL=requestURL,
                             accessURL=accessURL, authURL=authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

The last command opens a browser window that asks you to authorize the app and then displays a PIN. Enter that PIN back into R (at the RStudio console).
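
If the handshake succeeds, you can optionally save the credential object to disk, so that future sessions can skip the browser-and-PIN step. A small sketch (the file name is just a suggestion):

# save the authenticated token for reuse in later sessions
save(my_oauth, file = "my_oauth.Rdata")
# in a later session, restore it with:
# load("my_oauth.Rdata")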

Now you should be ready to use the commands in the streamR package.

Try a command like this:

tweets <- try(filterStream("filename.json",
                           locations = c(-180, -90, 180, 90),
                           timeout = 45, 
                           oauth = my_oauth))

filename is the name you assign to the file in which your downloaded tweets will be stored. You can call it alfred or ethel, or whatever you wish, but be sure to end the name in .json. And enclose the name in “”.

locations is a bounding box given as longitude-latitude pairs (southwest corner first, then northeast corner). In this case, the values cover the whole globe; in other words, the call simply downloads all geotagged tweets. If you don't want to restrict by location at all (for instance, because you are tracking keywords instead; the stream needs at least one filter), set locations to NULL.

timeout stands for the time that R will be collecting tweets, in seconds. If set to 0, it will continue indefinitely.

If you should be looking to study specific lexical items, then use the track argument (consult the streamR documentation for details).

To restrict your search to just tweets in English, add language = "en" as an argument. And to search for two or three specific languages at once (English and Spanish, anyone?), supply the language codes as a character vector, like this:

language = c("en", "es")

Here is the full documentation for request parameters to the Twitter API.

For the following exercises, be sure to have the streamR documentation handy (link).

Exercise 1.1

Use the filterStream() command provided above and adjust the collection time to 2 min 30 sec, then run it.

Exercise 1.2

Design another filterStream() command, this time, tracking specifically the key phrases Doritos and Pringles. Since you are looking for more than one phrase, you will need to provide these as a vector, so use this format:

c("", "")

Run this call as well, restricting it to 2:30 just like in exercise 1.1.

Exercise 1.3

Design and run a third call. This time, restrict the search to Texas. In order to specify an appropriate bounding box, you may want to figure out an easy way to obtain latitude-longitude pairs for a given location on an online map: here is a discussion of a couple of different ways.

Note that Google Maps and the Twitter API work with opposite sequencing of latitude and longitude values…
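
A quick illustration (the coordinates and the variable name my_box are made up): if an online map reports your two corners as latitude-longitude pairs, you have to flip each pair and list the southwest corner before the northeast one.

# the map says:  southwest corner 40.0, -80.0   northeast corner 42.0, -75.0  (lat, lon)
# the locations argument wants:   sw_lon, sw_lat, ne_lon, ne_lat
my_box <- c(-80.0, 40.0, -75.0, 42.0)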


2. Searching for strings in the data

2.1 Reading tweets into R

Use parseTweets() (part of the streamR package) to read your .json file of captured tweets into R as a data frame. Note that the data frame is an additional data type, alongside the numeric, character, and logical (Boolean) types.

First, try to get some information about parseTweets() using the R help function. Call help simply by starting a line with a ?:

?parseTweets

The help text tells you that

  1. the function’s first argument must be the .json file containing your captured tweets – in this case, given as a local filename. Since it is a filename, it must be enclosed in “”, unlike a data frame that is already in R’s working memory, which we would reference by its bare variable name; and

  2. there are two additional arguments that the function can take: simplify and verbose.

So, use the following call to read in your .json file (assuming you called it alfred.json), but add the argument verbose with value TRUE.

tweets <- parseTweets('alfred.json')

Your tweets are now accessible in R’s working memory, ready for you to work with.

2.2 Start wrangling

By this colorful term, we mean, roughly, “cleaning and reorganizing the data frame according to the demands of our analysis.”

To get a sense of what your data frame looks like, use these two functions on it (first one, then the other): head(), str().

If you want a better view of just the column names, use this command: colnames(). How many columns do we have in total?
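
Applied to the data frame we created above, the calls look like this (ncol() is an extra that answers the “how many columns” question directly):

head(tweets)      # first six rows
str(tweets)       # structure: column names, types, and sample values
colnames(tweets)  # just the column names
ncol(tweets)      # total number of columns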

First of all, you will find that there are far more columns in the data frame than you are interested in. I would suggest we focus on the following:

  1. The text of the tweet,

  2. The column that states whether a tweet was a retweet or not,

  3. The two columns with geolocation info: latitude and longitude.

  4. For a simpler version of the geo information, the country code.

Make a note of the numbers of the five columns that we are going to keep. The best way to get that info is to use the colnames() command. I get the numbers 1, 8, 31, 37, 38.

So, let’s make a version of the data that contains only those columns. For this purpose, we will use the dplyr package, which was written by the RStudio guru Hadley Wickham.

library(dplyr)
# See https://rpubs.com/justmarkham/dplyr-tutorial for a dplyr tutorial
# Or http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

First step: define a vector with the numbers of the columns that we decided we will need later.

needed = c(1, 8, 31, 37, 38)

Second step: use the select() function, which is part of dplyr, to make a new data frame with only those five columns. Like this: select(tweets, needed).

When you run that code, the new data frame just gets output to the console… which is no help. So we actually need to assign the output of the last command to a variable name. If we re-use the variable name tweets, then our old data frame will be overwritten with the new one, which can be a good thing.

tweets = select(tweets, needed)

Next, let us get rid of all the tweets that are, in fact, retweets. We only want original ones!

tweets = filter(tweets, retweeted == "FALSE")
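
As a quick sanity check (plain base R, nothing streamR-specific), you can see how many tweets remain after dropping the retweets:

nrow(tweets)  # number of original, non-retweet tweets left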

2.3 Some simple corpus searches

The basic text search function in R is grep().

Here is how you find all tweets that mention one specific term (grep() returns the indices of the matching entries):

grep("doritos", tweets$text, ignore.case = TRUE)

If we wanted to turn this into a study, we could compare the number of times Doritos are mentioned to the number of times another snack is mentioned. In order to obtain that numeric value, we can use the length() function, which simply counts and outputs the number of entries in the return object.

c( length(grep("doritos", tweets$text, ignore.case = TRUE)),
   length(grep("pringles", tweets$text, ignore.case = TRUE)) )

Such searching becomes powerful when you have a large collection of tweets and start using regular expressions (“regex”) in your searches. See also ?regex.

Here is a helpful blog post that gives a nice and compact explanation of grep(), and the code includes a straightforward, usable application of regex.
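
As a tiny taste of regex, here is a single search that covers both of our running example terms at once; the pattern simply joins them with the alternation operator |:

# count tweets mentioning either term, case-insensitively
length(grep("doritos|pringles", tweets$text, ignore.case = TRUE))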

Exercise 2.1

Conduct a grep() search that tells you how many times the word Doritos was spelled with an upper-case first letter, and how many times without one.


3. Coding data

It is sometimes a good idea to code the data further. For example, for each time that Doritos is found, you may want to include a tag for whether any other foods were mentioned in the same tweet or not. Or whether it was talked about positively, neutrally, or negatively.

Such coding can be done by a script or by hand. (Don’t underestimate the power of the latter.)
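
As a minimal sketch of what scripted coding can look like (the column name other_snack and the extra search terms are purely illustrative), here is one way to tag each tweet for whether a second snack is mentioned:

# TRUE if the tweet also mentions one of these other snacks
tweets$other_snack <- grepl("pringles|cheetos", tweets$text, ignore.case = TRUE)
table(tweets$other_snack)  # quick overview of the new tag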

I mention this here because this is the point in the workflow where such coding belongs. But since we have a separate section on topic analysis and sentiment analysis planned, I’ll skip it for now and instead move on to mapping!


4. Mapping findings

There are powerful ways of making analytical maps in R. I mean stuff like this.

There are also easier ways that don’t require R. Let me suggest one that is new, free, and can take us a good part of the way: Google Fusion Tables.

For a straightforward initial project, let’s try to make a map that has a dot for each mention of Doritos and another for each mention of Pringles within the Southern U.S.

Here is a manual on how to make maps with Fusion. Warning: Fusion has a “new look” that is in beta; I would suggest we stick with the “classic” look, specifically because heatmaps are as yet available only there.

Before we jump over to the manual, we need to decide what information we want included in our map. Some things to think about:

  • We’ll probably want a graphic distinction between “Doritos” and “Pringles” tweets,
  • We’ll have to decide whether we want to retain the text of the tweet,
  • And whether we want to show any user information (gender?).
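
Whatever we decide on, Fusion works from an uploaded table (e.g. a CSV file), so the last step in R is to write the relevant data out to disk. A minimal sketch (the snack column, the search terms, and the file name are just suggestions):

# label each tweet by the snack it mentions (NA if neither term appears)
tweets$snack <- ifelse(grepl("doritos",  tweets$text, ignore.case = TRUE), "Doritos",
                ifelse(grepl("pringles", tweets$text, ignore.case = TRUE), "Pringles", NA))

# write a CSV that can be uploaded to Fusion
write.csv(tweets, "snack_tweets.csv", row.names = FALSE)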