Session 3: Friday, 10 a.m. - 12 p.m.


1. Data wrangling

1.1 Reading tweets (in .json format) into R

Use parseTweets() (part of the streamR package) to read your .json file of captured tweets into R, as a data frame. Note that the data frame is an additional data type, alongside the numeric, character, and logical (Boolean) types.

First, try to get some information about parseTweets() using R's help system. You call it up simply by prefixing the function name with a ?:

?parseTweets

The help text tells you that

  1. the function’s first argument must be the .json file containing your captured tweets – in this case, in the form of a local filename. A filename must always be given in quotation marks (“”), unlike a data frame already in R’s working memory, which we reference by its bare variable name; and

  2. there are two additional arguments that the function can take: simplify and verbose.

So, use the following call to read in your .json file, but add the argument verbose with value TRUE.

tweets <- parseTweets('tweets2.json')

Your tweets are now accessible in R’s working memory, ready for you to work with.

1.2 Start wrangling

By this colorful term, we mean, roughly, “cleaning and reorganizing the data frame according to the demands of your analysis.”

To get a sense of what your data frame looks like, use these two functions on it (first one, then the other): head(), str().

If you want a better view of just the column names use this command: colnames(). How many columns do we have in total?
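If you want to try these inspection functions on something small first, here is a minimal sketch using a toy data frame (the column names are made up for illustration; on your real data, apply the same functions to tweets):

```r
# Toy data frame standing in for the tweets (hypothetical columns)
toy <- data.frame(text      = c("hello", "world"),
                  retweeted = c(FALSE, TRUE),
                  lat       = c(52.5, NA))

head(toy)       # prints the first rows
str(toy)        # prints the structure: column names, types, sample values
colnames(toy)   # just the column names
ncol(toy)       # total number of columns
```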

First of all, you will find that there are far more columns in the data frame than you are interested in. I would suggest we focus on the following:

  1. The text of the tweet,

  2. The column that states whether a tweet was a retweet or not,

  3. The two columns with geolocation info: latitude and longitude.

  4. For a simpler version of the geo information, the country code.

Make a note of the numbers of the five columns that we are going to keep (item 3 on the list covers two columns). The best way to get that info is using the colnames() command. I get the numbers 1, 8, 31, 37, 38.
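If you would rather find those positions programmatically, which() can do it. A minimal sketch with made-up column names (on the real data frame, put your own column names in the %in% vector):

```r
# Toy data frame with hypothetical column names
toy <- data.frame(text = "x", retweeted = FALSE, lat = 1, lon = 2)

# Positions of the columns whose names are in the vector
which(colnames(toy) %in% c("text", "lat", "lon"))  # 1 3 4
```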

So, let’s make a version of the data that contains only those columns. For this purpose, we will use the dplyr package, which was written by our man Hadley Wickham.

library(dplyr)
# See https://rpubs.com/justmarkham/dplyr-tutorial for a dplyr tutorial
# Or http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

First step: define a vector with the numbers of the columns that we decided we will need later.

needed <- c(1, 8, 31, 37, 38)

Second step: use the select() function, which is part of dplyr, to make a new data frame with only those five columns. Like this: select(tweets, needed). (Recent versions of dplyr will warn that using an external vector in selections is ambiguous and recommend writing select(tweets, all_of(needed)) instead.)

When you run that code, the new data frame just gets output to the console… which is no help. So we actually need to assign the output of the last command to a variable name. If we re-use the variable name tweets, then our old data frame will be overwritten with the new one, which can be a good thing.

tweets <- select(tweets, needed)
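For comparison, base R can do the same column selection with bracket indexing. A sketch on a toy data frame (column names are made up):

```r
# Toy stand-in for the tweets data frame (hypothetical columns)
toy <- data.frame(text = c("a", "b"), retweeted = c(FALSE, TRUE),
                  lat = c(1, 2), lon = c(3, 4), country = c("US", "GB"))

needed <- c(1, 2)        # positions of the columns to keep
kept   <- toy[, needed]  # base-R equivalent of dplyr's select()
colnames(kept)           # "text" "retweeted"
```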

Next, let us get rid of all the tweets that are, in fact, retweets. We only want original ones!

tweets <- filter(tweets, retweeted == FALSE)
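The same row filtering can be sketched in base R with logical subsetting (toy data again, so it runs on its own):

```r
# Toy data frame: one original tweet, one retweet
toy <- data.frame(text      = c("an original tweet", "RT of something"),
                  retweeted = c(FALSE, TRUE))

originals <- toy[toy$retweeted == FALSE, ]  # keep only the non-retweets
nrow(originals)  # 1
```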

1.3 Some simple corpus searches

The basic text search function in R is grep(). If you have ever done corpus-linguistic work, you will find grep() to be closest to a concordance-type search in a corpus.

Here is how you find all mentions of one specific term:

grep("obama", tweets$text, ignore.case = TRUE)

If we wanted to turn this into a study, we could compare the number of times Obama is mentioned to the number of times another politician is mentioned. In order to obtain that numeric value, we can use the length() function: grep() returns a vector of the positions of the matching texts, and length() simply counts its entries.

c( length(grep("obama", tweets$text, ignore.case = TRUE)),
   length(grep("biden", tweets$text, ignore.case = TRUE)) )
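Because grep() returns the positions of the matches, you can check the whole pipeline on a toy vector of tweet texts (the counts below apply only to this toy data, not to your corpus):

```r
# Three made-up tweet texts
texts <- c("Obama speaks today", "obama 2012 campaign", "Biden visits Ohio")

idx <- grep("obama", texts, ignore.case = TRUE)
idx          # 1 2  (positions of the matching texts)
length(idx)  # 2

# Compare mention counts for two search terms
c(length(grep("obama", texts, ignore.case = TRUE)),
  length(grep("biden", texts, ignore.case = TRUE)))  # 2 1
```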

Exercise 1

Conduct a grep() search that tells you how many times the word Obama was spelled with an upper-case first letter, and how many times without one.

Exercise 2

Write a function that counts the number of mentions for Clinton, Trump, and Sanders in your corpus. (Hint: think about how they are likely to actually be referred to.)


2. Statistics


3. Visualization