Monday, August 22, 2016

Twitter's Favorite Films

If you were on Twitter at all last week, you probably couldn't help but notice a flurry of "Fav7" hashtags trending, including #Fav7Films, #Fav7Books, and #Fav7TVShows, where people posted lists of their seven favorite things in each category.



I thought it would be fun to scrape the data to see what Twitter's favorite films are, and how they compare with the top-rated films on IMDb and Rotten Tomatoes. Here are the results.

Twitter's Top 25 Films
  1. The Dark Knight (9.0 IMDb, 94% RT)
  2. Pulp Fiction (8.9 IMDb, 94% RT)
  3. The Empire Strikes Back (8.8 IMDb, 94% RT)
  4. Goodfellas (8.7 IMDb, 96% RT)
  5. The Shawshank Redemption (9.3 IMDb, 91% RT)
  6. Fight Club (8.8 IMDb, 79% RT)
  7. The Godfather (9.2 IMDb, 99% RT)
  8. Back to the Future (8.5 IMDb, 96% RT)
  9. Inception (8.8 IMDb, 86% RT)
  10. Jurassic Park (8.1 IMDb, 93% RT)
  11. Forrest Gump (8.8 IMDb, 72% RT)
  12. The Big Lebowski (8.2 IMDb, 81% RT)
  13. Jaws (8.0 IMDb, 97% RT)
  14. Star Wars (8.7 IMDb, 93% RT)
  15. Raiders of the Lost Ark (8.5 IMDb, 94% RT)
  16. The Princess Bride (8.1 IMDb, 97% RT)
  17. Blade Runner (8.2 IMDb, 89% RT)
  18. Alien (8.5 IMDb, 97% RT)
  19. The Departed (8.5 IMDb, 91% RT)
  20. The Matrix (8.7 IMDb, 87% RT)
  21. Interstellar (8.6 IMDb, 71% RT)
  22. Aliens (8.4 IMDb, 98% RT)
  23. Good Will Hunting (8.3 IMDb, 97% RT)
  24. The Shining (8.4 IMDb, 88% RT)
  25. Die Hard (8.2 IMDb, 92% RT)

A few observations:

  • Fewer than half of the films in Twitter's top 25 are also in IMDb's top 25.
  • The Godfather (1972) is the oldest film on the list, while Interstellar (2014) is the newest.
  • Harrison Ford has starred in the most (4) of the top 25 films, while Steven Spielberg has directed the most (3).
  • Only three sequels appear in the top 25. For two of those, the original film also appears in the top 25.
  • Action/adventure and science fiction films dominate the list.
  • As popular as comic book films are right now, only one film based on a comic book character is in the top 25 (although it did take the top spot).

The source code

Start by loading the required libraries, including twitteR for accessing the Twitter API, and setting up authentication. You'll need to sign up for free on Twitter Developers to get your own authentication keys and tokens. (If you've never done this before, see Bogdan Rau's Collecting Tweets Using R and the Twitter Search API for a more detailed guide.)

library(dplyr)
library(purrr)
library(twitteR)

# Download cacert file for Windows use.
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

consumer_key <- 'your key'
consumer_secret <- 'your secret'
access_token <- 'your access token'
access_secret <- 'your access secret'
setup_twitter_oauth(consumer_key,
                    consumer_secret,
                    access_token,
                    access_secret)
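If you would rather not paste real keys into a script, one option is to read them from environment variables. The following is only a sketch: read_credential is a hypothetical helper, and the variable names in the comments are assumptions, not anything twitteR requires.

```r
# Hypothetical helper: read one credential from an environment variable
# and stop with a clear message if it has not been set.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# Example usage (the variable names are assumptions; use any names you like):
# consumer_key    <- read_credential("TWITTER_CONSUMER_KEY")
# consumer_secret <- read_credential("TWITTER_CONSUMER_SECRET")
# access_token    <- read_credential("TWITTER_ACCESS_TOKEN")
# access_secret   <- read_credential("TWITTER_ACCESS_SECRET")
```

The values returned this way can be passed to setup_twitter_oauth exactly as in the code above.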

Next, query the Twitter search API for the "#Fav7Films" hashtag, and initialize a data frame with tweets.

requests <- 1 # keep count of how many requests are sent
num_tweets <- 3000 # number of tweets to fetch per request
delay <- 62.0 # seconds to wait between requests so the API doesn't block us

fav_film_tweets <- searchTwitter("#Fav7Films\n", n=num_tweets)
Sys.sleep(delay) # be nice to the API
fav_film_df <- tbl_df(map_df(fav_film_tweets, as.data.frame))

fav_film_all <- fav_film_df[fav_film_df$isRetweet == FALSE, ]

Now we want to keep searching in a loop, until we've downloaded all the tweets we're interested in. To do that, we'll keep looping as long as the API returns as many tweets as we told it to. Once it returns fewer tweets, we know it ran out.

while(nrow(fav_film_df) == num_tweets) {
    max_id <- fav_film_df$id[num_tweets]
    requests <- requests + 1
    fav_film_tweets <- searchTwitter("#Fav7Films\n", n=num_tweets, maxID=max_id)
    fav_film_df <- tbl_df(map_df(fav_film_tweets, as.data.frame))
    fav_film_all <- rbind(fav_film_all, fav_film_df[fav_film_df$isRetweet == FALSE, ])

    Sys.sleep(delay) # be nice to the API
}

Note that I added the maxID=max_id parameter to the request. This tells the search API to return only tweets no newer than the oldest tweet in the previous set, so each request picks up where the last one left off. Also note the delay in the loop. Twitter limits its search API to 15 requests every 15 minutes, so the delay is there to avoid being blocked.
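To see the paging logic without touching the network, here is a minimal sketch that runs the same kind of loop against a made-up pool of tweet ids. fake_search is a hypothetical stand-in for searchTwitter, and the small integer ids are assumptions for illustration only. It treats the maximum id as inclusive, the way Twitter's max_id parameter works, so the boundary tweet of each page comes back twice.

```r
# Pretend tweet ids, newest first. Real ids are much larger.
all_ids <- 25:1

# Hypothetical stand-in for searchTwitter(): returns up to n ids that are
# less than or equal to max_id, newest first.
fake_search <- function(n, max_id = Inf) {
  head(all_ids[all_ids <= max_id], n)
}

page_size <- 10
page <- fake_search(page_size)
collected <- page
requests <- 1

# Keep paging for as long as the last page came back full.
while (length(page) == page_size) {
  max_id <- page[length(page)]   # oldest id seen so far
  page <- fake_search(page_size, max_id = max_id)
  collected <- c(collected, page)
  requests <- requests + 1
}

# Because max_id is inclusive, each boundary id was fetched twice.
collected <- unique(collected)
```

In this toy run the loop sends three requests and ends with all 25 ids. In the real script, the repeated boundary tweets get dropped later, when duplicate users are removed.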

That will take a while, but once it's done we'll have over 100,000 tweets, so it's worth saving them to avoid going through all that again. I just saved the whole data frame to an R data file.

save(fav_film_all, file="Fav7FilmTweets.Rda")

You can download that file from GitHub at Fav7FilmTweets.Rda if you want to follow along from this point, or if you want to do your own analysis on this data set. Just use load("Fav7FilmTweets.Rda") to load the data frame from the file.
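If you haven't used save and load before, this small round trip shows how they behave. It is a sketch on a toy data frame written to a temporary file, not the real data set. The point to notice is that load restores the object under the name it was saved with, and returns that name.

```r
# Toy stand-in for the tweet data frame.
toy <- data.frame(
  screenName = c("alice", "bob"),
  text       = c("jaws", "alien"),
  stringsAsFactors = FALSE
)

# Save to a temporary file, then remove the object from the workspace.
path <- tempfile(fileext = ".Rda")
save(toy, file = path)
rm(toy)

# load() recreates 'toy' and returns the names of the objects it restored.
restored_names <- load(path)
```

After this runs, toy exists again with its two rows, and restored_names holds the string "toy".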

Next, we want to remove any retweets and keep only one tweet per user.

fav_film_all <- fav_film_all[fav_film_all$isRetweet == FALSE, ]
fav_film_all <- fav_film_all[!duplicated(fav_film_all$screenName), ]
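Here is what those two filters do on a toy data frame. The rows are invented for illustration. Note that duplicated marks every occurrence after the first, so the first row for each user is the one that survives.

```r
toy <- data.frame(
  screenName = c("alice", "bob", "alice", "carol"),
  isRetweet  = c(FALSE, TRUE, FALSE, FALSE),
  text       = c("first list", "RT someone", "second list", "only list"),
  stringsAsFactors = FALSE
)

# Drop retweets, then keep only the first remaining tweet from each user.
toy <- toy[toy$isRetweet == FALSE, ]
toy <- toy[!duplicated(toy$screenName), ]
```

Bob's retweet and Alice's second list are dropped, leaving one row each for alice and carol.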

Now we can start parsing the lists of film titles from the tweets. Most people formatted their titles on separate lines, so we'll assume that format. Any tweets that don't use that format will just fall to the bottom of the list of films once we rank them.

# remove the hashtag, ignoring case
fav_film_all$text <- gsub("#fav7films", "", fav_film_all$text, ignore.case=TRUE)

# remove list markers such as "1.", plus any ")" and "-" characters
fav_film_all$text <- gsub("\\d\\.|\\)|-", "", fav_film_all$text)

# convert to common case for all tweets
fav_film_all$text <- tolower(fav_film_all$text)

# trim any whitespace left over from earlier steps
fav_film_all$text <- trimws(fav_film_all$text)
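To make the effect of these steps concrete, here they are applied to a single made-up tweet. clean_tweet is a hypothetical wrapper around the same four calls, and the sample text is an assumption about what a typical tweet looked like.

```r
# Hypothetical wrapper around the four cleaning steps above.
clean_tweet <- function(text) {
  text <- gsub("#fav7films", "", text, ignore.case = TRUE)
  text <- gsub("\\d\\.|\\)|-", "", text)
  text <- tolower(text)
  trimws(text)
}

sample_tweet <- "#Fav7Films\n1. The Dark Knight\n2. Pulp Fiction\n3. Jaws"
cleaned <- clean_tweet(sample_tweet)
```

The result is "the dark knight\n pulp fiction\n jaws". trimws only trims the ends of the whole tweet, so the space that followed each list number is still there on every line after the first.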

At this point, we should have a bunch of lists of seven movie titles. What we want to do next is separate them all out into one large list of titles, count how many times each title appears, then sort the list. We'll also remove "A" and "The" from the beginning of any titles that include them, since many people included them, but many didn't.

# split every tweet into lines and flatten them into one character vector
titles <- unlist(strsplit(fav_film_all$text, split="\n"))

# remove leading 'a' and 'the' from titles
titles <- gsub("^a ", "", titles)
titles <- gsub("^the ", "", titles)

# remove empty titles
titles <- titles[titles != ""]

ranked_titles <- sort(table(titles), decreasing=TRUE)
top_25 <- head(ranked_titles, 25)
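The splitting and counting can be tried on a few invented tweets. This sketch follows the steps above with one addition that is my assumption rather than part of the original script: it trims whitespace from each title after splitting, so that " jaws" and "jaws" are counted together.

```r
# Three made-up, already cleaned tweets.
texts <- c(
  "the dark knight\n pulp fiction\n jaws",
  "dark knight\njaws\nthe godfather",
  "jaws\n\na beautiful mind"
)

titles <- unlist(strsplit(texts, split = "\n"))

# Assumed extra step: trim each title individually.
titles <- trimws(titles)

# Remove leading 'a' and 'the', then drop empty titles.
titles <- gsub("^a ", "", titles)
titles <- gsub("^the ", "", titles)
titles <- titles[titles != ""]

ranked <- sort(table(titles), decreasing = TRUE)
top_2 <- head(ranked, 2)
```

Here "jaws" comes out on top with three mentions, followed by "dark knight" with two.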

That's the final list. There are a lot of other conditioning steps that we could have taken, like looking for common abbreviations or misspellings, but I think this gets us pretty close to an accurate list.

You can view the full R source code that I used to gather and analyze tweets for this project in my Fav7 GitHub repository. Feel free to fork that and use it to analyze other Twitter favorites, and leave me a comment if you do, or if you have any questions.
