Load packages:
library(ROAuth)
library(streamR)
Once these are loaded, you can copy, paste, adapt and run this code.
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "YOUR CONSUMER KEY"
consumerSecret <- "YOUR CONSUMER SECRET"
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
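The handshake only needs to be completed once. Since the query skeleton later in this document begins with load("my_oauth"), you will want to save the credentials object to disk after the handshake succeeds:

```r
# Save the credentials so future sessions can load("my_oauth")
# instead of repeating the handshake.
save(my_oauth, file = "my_oauth")
```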
Now let us formulate our HEB-specific searches. I recommend at least one search for mentions of HEB alongside keywords, and another search that just tracks HEB’s Twitter ID. For the first of these, we use the function filterStream() with our search terms supplied in the track = argument. Remember: the first few items in track = should be different versions of the client name, such as heb and h-e-b (matching covers all combinations of upper- and lower-case letters), and @HEB for mentions of HEB’s Twitter user handle.
Here, again, is the skeleton of the query:
load("my_oauth")
tweets <- try(filterStream("filename.json",
                           locations = c(-180, -90, 180, 90),
                           timeout = 900, # i.e. 15 minutes
                           track = "",
                           oauth = my_oauth))
- filename.json is the name you assign to the file in which your downloaded tweets will be stored. You can call it alfred or ethel, or whatever you wish, but be sure to end the name in .json, and enclose the name in "".
- locations is a bounding box of latitude-longitude value pairs. In this case, the values cover the whole globe, which makes the search choose only geolocated tweets. In case you’d like to focus on Texas, here are values for a bounding box that covers Texas: c(-106.777407, 27.121082, -94.320961, 36.441398). NB: This box is pretty wide and has big chunks of Mexico, OK and AR in it. That’s the consequence of including the Panhandle in a rectangular area. If you don’t care about geolocation of tweets, then set locations to NULL.
- timeout stands for the time that R will be collecting tweets, in seconds. If set to 0, it will continue indefinitely.
- If you are looking to study specific lexical items, use the track argument (consult the streamR documentation for details).
- To restrict your search to just tweets in English, add language = "en" as an argument. To search for two or three specific languages at once (English and Spanish, anyone?), use the format language = c("en", "es").
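Putting these pieces together, a filled-in version of the skeleton might look like the following. The file name and search terms here are illustrative choices, not prescribed ones:

```r
library(streamR)

load("my_oauth")  # credentials saved after the handshake

# Illustrative call: HEB name variants, English and Spanish,
# Texas bounding box, 15-minute collection window.
# Note: the streaming API treats track and locations as OR,
# not AND, so this collects tweets matching either filter.
tweets <- try(filterStream("heb_texas.json",
                           track = "heb, h-e-b, @HEB",
                           language = c("en", "es"),
                           locations = c(-106.777407, 27.121082,
                                         -94.320961, 36.441398),
                           timeout = 900,
                           oauth = my_oauth))
```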
Here is the full documentation for request parameters to the Twitter API.
- heb_search_7.json: track = "(heb, h-e-b) delivery"
- heb_search_8.json: track = "heb, h-e-b delivery"
- heb_search_9.json: track = "heb, h-e-b Curbside"
- heb_search_10.json: track = "HEBtoyou"
- heb_search_11.json: track = "heb instacart"
- heb_search_12_kroger.json: track = "kroger"
For purposes of reference, here is the shell script that reruns the collection script 240 times in a row:
#!/bin/bash
# Run the R collection script repeatedly, printing the iteration number.
for i in {1..240}; do
    Rscript collect_evol2016.R
    echo $i
done
Let us install, then load, a number of packages that we’ll probably need.
# install.packages(c("tidytext", "ggplot2", "tm", "wordcloud", "dplyr"))
library(tidytext)
library(ggplot2)
library(tm)
library(wordcloud)
library(dplyr)
Use parseTweets() (part of streamR) to read your .json file of captured tweets into R as a data frame. Read in each file under a different name, like so:
tweets1 <- parseTweets('heb_search_10.json')
tweets2 <- parseTweets('heb_search_11.json')
tweets3 <- parseTweets('heb_search_12_kroger.json')
These tweets are now accessible in R’s working memory, ready for you to work with.
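If you later want to treat several of these sets as a single corpus, the data frames returned by parseTweets() share the same columns, so they can be stacked with rbind(). A minimal sketch, with toy data frames standing in for the parsed tweets (the real ones come from parseTweets() and have many more columns):

```r
# Toy stand-ins for parseTweets() output; rbind() works the
# same way on the real data frames.
tweets1 <- data.frame(text = c("heb curbside is fast", "love h-e-b"))
tweets2 <- data.frame(text = c("kroger delivery today"))

all_tweets <- rbind(tweets1, tweets2)
nrow(all_tweets)  # 3
```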
Do this for each set of tweets (tweets1, tweets2, …).
df <- tweets3
# Strip URLs from the tweet text.
df$text <- gsub(' http[^[:blank:]]+', '', df$text)
# One row per word.
tidy_tw <- df %>% unnest_tokens(word, text)
# Custom stop words: Twitter artifacts, stray digits, and the like.
tw_stop <- data.frame(word = c('amp', 'gt', 't.c', 'rt', 'https', 't.co', '___',
                               '1', '2', '3', '4', '5', '6', '7', '8', '9',
                               "i'm", '15', '30', '45', '00', '10'),
                      lexicon = 'whatevs')
data("stop_words")
tidy_tw <- tidy_tw %>%
    anti_join(tw_stop, by = "word")
tidy_tw <- tidy_tw %>%
    anti_join(stop_words, by = "word")
# This will output a record of the words that will be in the cloud.
print(tidy_tw %>% count(word, sort = TRUE))
For this example, I used my sample set of tweets harvested on 10/19/2016 using track = "kroger".
fig <- tidy_tw %>%
    count(word) %>%
    with(wordcloud(word, n, max.words = 100,
                   colors = brewer.pal(8, 'Dark2')))
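If you want to keep the cloud as an image file rather than just view it in the plot window, you can wrap the same call in a graphics device using base R’s png() and dev.off(). This assumes tidy_tw exists from the cleaning step above; the output file name is arbitrary:

```r
# Write the word cloud to a PNG file instead of the screen.
png("heb_wordcloud.png", width = 800, height = 800)
tidy_tw %>%
    count(word) %>%
    with(wordcloud(word, n, max.words = 100,
                   colors = brewer.pal(8, 'Dark2')))
dev.off()
```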