Collecting Twitter data using keywords with R

In the last post, I explained how to collect tweets from a specific Twitter account using Twitter’s search API. In this guide, the aim is to collect tweets that contain specific keywords with R using the Twitter streaming API. Again, this is relatively simple provided that the code is entered and executed correctly. There are some differences from the previous guide, though: additional packages are required because the data is gathered in near real-time.

First of all, if you have not done so already, you will need to create a Twitter developer application:

  • Go to https://dev.twitter.com/ and click “My Apps”.
  • Log in with your Twitter credentials.
  • Click on the “New App” icon and fill out the details (the name of the App needs to be unique).
  • Agree to the Twitter development agreement and click submit.

Once this has been done, it is time to set up the Twitter collection stream (I am using RStudio on an iMac, but the code should also work on Windows machines).

  • Open R, or RStudio.
  • Create a new R Script.
  • Set your working directory.
setwd("~/Folder/SubFolder") #change to your folder destination

Next, you will need to install the packages required for collecting near real-time tweets and load them into the workspace.

install.packages(c("RCurl", "streamR", "ROAuth", "RJSONIO", "stringr")) #package names must be passed as a single character vector
library(RCurl)
library(streamR) 
library(ROAuth)
library(RJSONIO)
library(stringr)
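
If some of these packages are already on your machine, you may prefer to install only the ones that are missing. The following is a minimal sketch of that idea rather than part of the original workflow; the helper names missing_packages and install_and_load are made up for illustration.

```r
# Hypothetical helper: return the packages in `required` that are not yet installed.
missing_packages <- function(required) {
  installed <- rownames(installed.packages())
  setdiff(required, installed)
}

# Hypothetical helper: install whatever is missing, then attach every package.
install_and_load <- function(required) {
  to_install <- missing_packages(required)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(required, library, character.only = TRUE))
}

# Example call (uncomment to run; needs an internet connection if anything is missing):
# install_and_load(c("RCurl", "streamR", "ROAuth", "RJSONIO", "stringr"))
```

Running install_and_load() at the top of a script should have the same effect as the install.packages() and library() lines above, without re-downloading packages you already have.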

Following this, you will need to authenticate the connection with the Twitter streaming API. This is different from the Twitter search API, which was used in the previous guide via the twitteR package. Use the code below and add the Consumer Key and Consumer Secret from the Twitter application you made earlier. Important: when you run the code, your web browser will open and you will have to click the authorise button, then paste the authorisation PIN shown into R and press Enter to establish the connection.

token <- "https://api.twitter.com/oauth/request_token"
access <- "https://api.twitter.com/oauth/access_token"
authorize <- "https://api.twitter.com/oauth/authorize"
consumerkey <- "Paste-Consumer-Key" #retrieve this from dev.twitter.com under keys and tokens
consumersecret <- "Paste-Consumer-Secret" #retrieve this from dev.twitter.com under keys and tokens
oauth <- OAuthFactory$new(consumerKey = consumerkey,
                          consumerSecret = consumersecret,
                          requestURL = token,
                          accessURL = access,
                          authURL = authorize)

oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl") )
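
If you would rather not paste your keys directly into a script (for example, because you share your code), one option is to read them from environment variables. This is only a sketch under my own assumptions: the helper read_credential and the variable names TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET are hypothetical, and you would need to set those variables yourself, e.g. in your .Renviron file.

```r
# Hypothetical helper: read a credential from an environment variable
# and stop with a clear error if it has not been set.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# Example (the variable names are assumptions, not something Twitter defines):
# consumerkey <- read_credential("TWITTER_CONSUMER_KEY")
# consumersecret <- read_credential("TWITTER_CONSUMER_SECRET")
```

The resulting consumerkey and consumersecret values can then be passed to OAuthFactory$new() exactly as in the code above.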

You should now have successfully connected to Twitter’s streaming API. To make life easier, I recommend saving these details so you do not have to keep re-entering them in the future. You can do this with the following code.

save(oauth, file = "oauth.Rdata")
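
In later sessions it can be handy to check that the saved file is actually there before trying to load it. Here is a small sketch of that check; the helper load_saved_oauth is my own invention, and it assumes the file was saved with the name used above.

```r
# Hypothetical helper: load the saved credentials if the file exists.
# Returns TRUE when the file was loaded and FALSE otherwise.
load_saved_oauth <- function(path = "oauth.Rdata", envir = globalenv()) {
  if (!file.exists(path)) {
    message("No saved credentials at ", path, "; run the handshake first.")
    return(FALSE)
  }
  load(path, envir = envir)  # restores the object under its saved name ("oauth")
  TRUE
}

# Example call in a new session (uncomment to run):
# load_saved_oauth()
```

If the function returns FALSE, you would need to repeat the authentication step and save the file again.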

Now let’s begin collecting tweets with streamR. Go ahead and create a new R script; we will load the saved “oauth” file exactly as you would when starting a new session. For this example, I will be collecting tweets pertaining to “Brexit” from across the world.

library(streamR)
load("oauth.Rdata")
filterStream(file.name = "brexit_tweets.json", #this saves tweets into a .json file
             track = c("Brexit", "Article 50"), #collects tweets that include these keywords
             language = "en", #collect tweets in a specific language
             timeout = 10800, #number is in seconds (3 hours)
             oauth = oauth) #uses the saved "oauth" object as your credentials

brexit_tweets.df <- parseTweets("brexit_tweets.json", simplify = FALSE) #creates data frame of tweets
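
Once the data frame exists, a quick sanity check is to count how many tweets mention each keyword. The sketch below uses a tiny stand-in data frame so that it runs on its own; the helper count_keyword_hits and the example tweets are made up for illustration, and it assumes the tweet text is held in a column called text. It uses simple case-insensitive substring matching, which may not coincide exactly with how the streaming API matched the tweets.

```r
# Hypothetical helper: count how many texts contain each keyword (case-insensitive).
count_keyword_hits <- function(texts, keywords) {
  sapply(keywords, function(k) {
    sum(grepl(tolower(k), tolower(texts), fixed = TRUE))
  })
}

# Stand-in data for illustration only; real data would come from parseTweets().
example_tweets <- data.frame(
  text = c("Brexit talks resume",
           "Article 50 has been triggered",
           "Nothing relevant here"),
  stringsAsFactors = FALSE
)

hits <- count_keyword_hits(example_tweets$text, c("Brexit", "Article 50"))
print(hits)
```

With your own data, you would call count_keyword_hits(brexit_tweets.df$text, c("Brexit", "Article 50")) instead.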

That’s it. By following this guide, you should have successfully collected near real-time tweets by searching for particular keywords. The code has been checked and is in working order, so let me know if it does not work for you.