Program for today

  1. Customize grab code for current project
  2. Set up timed collection using shell script
  3. Scrub, search, analyze

1. Customizing grab code for current project

Load packages:

library(ROAuth)
library(streamR)
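
If these packages are not yet installed, install them once first:

install.packages(c("ROAuth", "streamR"))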

Once these are loaded, you can copy, paste, adapt and run this code.

# OAuth endpoints for the Twitter API
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

# your own Twitter app's credentials go here
consumerKey <-     "YOUR CONSUMER KEY"
consumerSecret <-  "YOUR CONSUMER SECRET"

my_oauth <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=requestURL,
                             accessURL=accessURL, authURL=authURL)

# the handshake sends you to an authorization URL; enter the PIN it returns back in the R console
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
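
The skeleton query below loads these credentials from a file called my_oauth, so once the handshake succeeds, save the object under that name:

save(my_oauth, file = "my_oauth")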

Now let us formulate our HEB-specific searches. I recommend at least one search for mentions of HEB with keywords, and another search that just tracks HEB’s Twitter ID. For the first of these, we use the function filterStream() with specifications in the track = argument; for the second, its follow argument (see the sketch just below).
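
Here is a minimal sketch of the second kind of search. Note that follow = wants the account’s numeric user ID, not the @handle; the ID below is a placeholder, so look up @HEB’s actual numeric ID first.

load("my_oauth")
heb_id <- "12345678"                             # placeholder: replace with @HEB's real numeric ID
tweets <- try(filterStream("heb_account.json",   # placeholder file name
                           follow = heb_id,
                           timeout = 900,
                           oauth = my_oauth))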

Iterative queries

Remember: in the track = argument,

  • comma stands for OR, and
  • a plain space stands for AND.

The first few items in track = should be different versions of the client name, such as heb, h-e-b, and @HEB for mentions of HEB’s Twitter user handle (track matching is case-insensitive, so these cover the upper- and lower-case variants too). For example, track = "heb delivery, h-e-b delivery" matches tweets that contain both heb and delivery, or both h-e-b and delivery.

Here, again, is the skeleton of the query:

load("my_oauth")
tweets <- try(filterStream("filename.json",
                           locations = c(-180, -90, 180, 90),
                           timeout = 900,        # i.e. 15 minutes
                           track = "",
                           oauth = my_oauth))

filename is the name you assign to the file in which your downloaded tweets will be stored. You can call it alfred or ethel, or whatever you wish, but be sure to end the name in .json and to enclose it in quotation marks.

locations is a bounding box of longitude-latitude value pairs (southwest corner first, then northeast corner). In this case the values cover the whole globe, which makes the search return only geolocated tweets. If you’d like to focus on Texas, here are values for a bounding box that covers the state: c(-106.777407, 27.121082, -94.320961, 36.441398). NB: this box is pretty wide and takes in big chunks of Mexico, Oklahoma, and Arkansas; that is the price of fitting the Panhandle into a rectangle. If you don’t care about geolocation of tweets at all, set locations to NULL.

timeout stands for the time that R will be collecting tweets, in seconds. If set to 0, it will continue indefinitely.

If you should be looking to study specific lexical items, then use the track argument (consult the streamR documentation for details).

To restrict your search to just tweets in English, use language = "en" as an argument. And to search for two or three specific languages (English and Spanish, anyone?), use this format:

c("", "")

Here is the full documentation for request parameters to the Twitter API.

Searches I ran, and how they worked out

heb_search_7.json
track = "(heb, h-e-b) delivery"

heb_search_8.json
track = "heb, h-e-b delivery"

heb_search_9.json
track = "heb, h-e-b Curbside"

heb_search_10.json
track = "HEBtoyou"

heb_search_11.json
track = "heb instacart"

heb_search_12_kroger.json
track = "kroger"

2. Set up timed collection using shell script

For purposes of reference, here is the shell script that runs the R collection script over and over:

#!/bin/bash

# Run the R collection script 240 times in a row.
# With a 15-minute timeout per run (as in the skeleton above),
# that is 240 x 15 minutes, i.e. roughly 60 hours of collection.
for i in {1..240}; do
    Rscript collect_evol2016.R
    echo $i    # print the iteration number to show progress
done
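
The contents of collect_evol2016.R are not shown here; a minimal sketch, assuming the script simply wraps the filterStream() skeleton from section 1 (the file name and track terms are placeholders):

# collect_evol2016.R -- assumed contents; adapt to your own project
library(ROAuth)
library(streamR)

load("my_oauth")                                  # credentials saved after the handshake

tweets <- try(filterStream("heb_search.json",     # placeholder file name
                           locations = c(-180, -90, 180, 90),
                           timeout = 900,         # 15 minutes per run
                           track = "heb, h-e-b",  # placeholder search terms
                           oauth = my_oauth))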

3. Scrub, search, analyze

Let us install, then load a number of packages that we’ll probably need.

# install.packages(c("tidytext", "ggplot2", "tm", "wordcloud", "dplyr")

library(tidytext)
library(ggplot2)
library(tm)
library(tm)
library(wordcloud)
library(dplyr)

3.1 Read tweets into R

Use parseTweets() (part of streamR) to read your .json files of captured tweets into R as data.frames. Read each file in under a different name, like so:

tweets1 <- parseTweets('heb_search_10.json')
tweets2 <- parseTweets('heb_search_11.json')
tweets3 <- parseTweets('heb_search_12_kroger.json')

These tweets are now accessible in R’s working memory, ready for you to work with.
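
As a quick sanity check, you can count how many tweets each search produced; if you want to analyze them jointly, the data frames can also be stacked into one:

sapply(list(tweets1, tweets2, tweets3), nrow)     # number of tweets per file
all_tweets <- rbind(tweets1, tweets2, tweets3)    # optional: one combined data.frame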

3.2 Get rid of URLs and stopwords

Do this for each set of tweets (tweets1, tweets2, …).

df <- tweets3

df$text <- gsub(' http[^[:blank:]]+', '', df$text)

tidy_tw <- df %>% unnest_tokens(word, text)

tw_stop <- data.frame(word = c('amp', 'gt', 't.c', 'rt', 'https', 't.co', '___',
                               '1', '2', '3', '4', '5', '6', '7', '8', '9',
                               "i'm", '15', '30', '45', '00', '10'),
                      lexicon = 'whatevs',
                      stringsAsFactors = FALSE)   # keep word as character so anti_join() matches cleanly

data("stop_words")

tidy_tw <- tidy_tw %>%
  anti_join(tw_stop)
tidy_tw <- tidy_tw %>%
  anti_join(stop_words)

# This will output a record of the words that will be in the cloud.
print(tidy_tw %>% count(word, sort = TRUE)) 

3.3 Wordcloud!

For this example, I used my sample set of tweets harvested on 10/19/2016 using track = "kroger".

fig <- tidy_tw %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, 
                 colors = brewer.pal(8,'Dark2')))
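
Since ggplot2 is already loaded, a bar chart of the most frequent words is another quick way to look at the same counts; a sketch:

tidy_tw %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "word count")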