Peeling back The Onion

In this post I’ll programmatically find links to articles from The Onion, scrape them for content, and clean them up into a tidy format. I chose The Onion because, while it isn’t real news, the site does a great job of approximating the tone and cadence of real news stories. In the next post, I’ll use the monkeylearn text processing package to hand these articles to the MonkeyLearn API and then compare the classifications that MonkeyLearn generates with each URL’s subdomain to get an imperfect measure of the classifier’s accuracy.

I use the article’s subdomain to roughly approximate the true classification of stories. These subdomains should correspond more or less with the sections delimited by the top nav: Politics, Sports, Local, etc. That is, articles about politics will generally have a URL beginning with https://politics.theonion.com.
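
As a rough sketch of what that looks like in code, the subdomain can be pulled out of a URL with a regular expression. The helper name get_subdomain() is made up for illustration, and the pattern assumes every article link follows the https://<section>.theonion.com/... shape.

```r
library(stringr)

# Hypothetical helper: capture whatever sits between "https://" and ".theonion.com"
get_subdomain <- function(url) {
  # str_match() returns a matrix; column 2 holds the first capture group
  str_match(url, "^https?://([^./]+)\\.theonion\\.com")[, 2]
}

urls <- c(
  "https://politics.theonion.com/cia-realizes-its-been-using-black-highlighters-all-thes-1819568147",
  "https://www.google.com/robots.txt"
)

get_subdomain(urls)
```

URLs that don’t match the pattern come back as NA, which makes off-site links easy to spot and filter out.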


Get Data

To my knowledge, The Onion doesn’t have a public API, so web scraping seemed like the best way to go about getting article content.

library(dobtools)              # devtools::install_github("aedobbyn/dobtools")
library(tidyverse)
library(stringr)
library(monkeylearn)
library(rvest)

Let’s quickly check that web scraping is allowed by The Onion using the rOpenSci package robotstxt. Its paths_allowed() function will consult the robots.txt file found at the root of a domain (e.g., https://www.google.com/robots.txt) to check whether certain web crawlers, spiders, etc. are discouraged from accessing certain paths on that domain [1].

robotstxt::paths_allowed(
  paths = "/",
  domain = "theonion.com",
  bot = "*"
)
## [1] TRUE

Cool, we’re good to go.
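
To get a feel for what that check is doing without hitting the network, here is a small sketch that parses a made-up set of rules. The rules below are invented for illustration and are not The Onion’s actual robots.txt; the sketch assumes robotstxt() accepts the file contents through its text argument and exposes a check() method, as described in the package documentation.

```r
library(robotstxt)

# Made-up rules for illustration; this is NOT The Onion's real robots.txt
sample_rules <- "User-agent: *\nDisallow: /private/\n"

# Supplying `text` parses the rules directly instead of downloading them
rt <- robotstxt(text = sample_rules)

# Ask about one path at a time for any bot ("*")
root_allowed    <- rt$check(paths = "/", bot = "*")
private_allowed <- rt$check(paths = "/private/page", bot = "*")

root_allowed
private_allowed
```

Under these sample rules, the root path should be reported as allowed and anything under /private/ as disallowed.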

First, let’s take a bite-sized example link that we know exists, for an article entitled CIA Realizes It’s Been Using Black Highlighters All These Years. From its URL, we can scrape its content using the rvest package.

We can be pretty confident that our content will always be associated with the same combination of HTML tag and CSS class name. Finding that combination is as simple as using the Chrome inspector or the SelectorGadget extension when examining an article in the browser. In this case, article content is always wrapped in p tags, so that’s what we’ll pull out of the HTML using rvest::html_nodes().

example_page <- "https://politics.theonion.com/cia-realizes-its-been-using-black-highlighters-all-thes-1819568147"

(example_content <- example_page %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text())
##  [1] "LANGLEY, VA—A report released Tuesday by the CIA's Office of the Inspector General revealed that the CIA has mistakenly obscured hundreds of thousands of pages of critical intelligence information with black highlighters."                                                                                                                   
##  [2] "According to the report, sections of the documents—  \"almost invariably the most crucial passages\"—are marred by an indelible black ink that renders the lines impossible to read, due to a top-secret highlighting policy that began at the agency's inception in 1947."                                                                      
##  [3] "CIA Director Porter Goss has ordered further internal investigation."                                                                                                                                                                                                                                                                            
##  [4] "\"Why did it go on for this long, and this far?\" said Goss in a press conference called shortly after the report's release. \"I'm as frustrated as anyone. You can't read a single thing that's been highlighted. Had I been there to advise [former CIA director] Allen Dulles, I would have suggested the traditional yellow color—or pink.\""
##  [5] "Goss added: \"There was probably some really, really important information in these documents.\""                                                                                                                                                                                                                                                
##  [6] "Advertisement"                                                                                                                                                                                                                                                                                                                                   
##  [7] ""                                                                                                                                                                                                                                                                                                                                                
##  [8] "When asked by a reporter if the black ink was meant to intentionally obscure, Goss countered, \"Good God, why?\""                                                                                                                                                                                                                                
##  [9] "Goss lamented the fact that the public will probably never know the particulars of such historic events as the Cold War, the civil-rights movement, or the growth of the international drug trade."                                                                                                                                              
## [10] "\"I'm sure the CIA played major roles in all these things,\" Goss said. \"But now we'll never know for sure.\""                                                                                                                                                                                                                                  
## [11] "Advertisement"                                                                                                                                                                                                                                                                                                                                   
## [12] ""                                                                                                                                                                                                                                                                                                                                                
## [13] "In addition to clouding the historical record, the use of the black highlighters, also known as \"permanent markers,\" may have encumbered or even prevented critical operations. CIA scholar Matthew Franks was forced to abandon work on a book about the Bay Of Pigs invasion after declassified documents proved nearly impossible to read." 
## [14] "\"With all the highlighting in the documents I unearthed in the National Archives, it's really no wonder that the invasion failed,\" Franks said. \"I don't see how the field operatives and commandos were expected to decipher their orders.\""                                                                                                
## [15] "The inspector general's report cited in particular the damage black highlighting did to documents concerning the assassination of John F. Kennedy, thousands of pages of which \"are completely highlighted, from top to bottom margin.\""                                                                                                       
## [16] "Advertisement"                                                                                                                                                                                                                                                                                                                                   
## [17] ""                                                                                                                                                                                                                                                                                                                                                
## [18] "\"It is unclear exactly why CIA bureaucrats sometimes chose to emphasize entire documents,\" the report read. \"Perhaps the documents were extremely important in every detail, or the agents, not unlike college freshmen, were overwhelmed by the reading material and got a little carried away.\""                                           
## [19] "Also unclear is why black highlighters were chosen in the first place. Some blame it on the closed, elite culture of the CIA itself. A former CIA officer speaking on the condition of anonymity said highlighting documents with black pens was a common and universal practice."                                                               
## [20] "\"It seemed counterintuitive, but the higher-ups didn't know what they were doing,\" the ex-officer said. \"I was once ordered to feed documents into a copying machine in order to make backups of some very important top-secret records, but it turned out to be some sort of device that cut the paper to shreds.\""                         
## [21] "Advertisement"                                                                                                                                                                                                                                                                                                                                   
## [22] ""
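
The scraped vector still contains "Advertisement" placeholders and empty strings. One possible way to tidy it up is sketched below; clean_paragraphs() is a hypothetical helper, and it assumes the only junk elements are exactly those two values.

```r
library(stringr)

# Hypothetical helper: drop ad placeholders and empty strings, then join the rest
clean_paragraphs <- function(paragraphs) {
  paragraphs <- str_squish(paragraphs)  # trim ends and collapse repeated spaces
  keep <- paragraphs != "" & paragraphs != "Advertisement"
  str_c(paragraphs[keep], collapse = " ")
}

# A few elements copied from the output above
sample_content <- c(
  "CIA Director Porter Goss has ordered further internal investigation.",
  "Advertisement",
  "",
  "Goss added: \"There was probably some really, really important information in these documents.\""
)

clean_paragraphs(sample_content)
```

This returns a single string per article, which is a convenient shape for one row of a data frame. The str_squish() step also takes care of stray double spaces like the one in the second paragraph of the output.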

Now that we can do that for a single URL, let’s write that into a function that we can apply to any arbitrary link.

If the link doesn’t exist, we’ll want to replace the resulting NULL element with an NA using a quick helper function. Since we’ll be mapping this function over a vector of links, we’ll also pause between requests for sleep_time seconds (1 by default) to avoid 429 (too many requests) errors 😱.

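# Scrape the paragraph text of a single article, then pause
get_text <- function(l, sleep_time = 1) {
  out <- l %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text() %>%
    dobtools::replace_x(NA_character_)  # replace a NULL result with NA

  Sys.sleep(sleep_time)  # wait between requests to avoid 429s

  return(out)
}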
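
One caveat when mapping over many links: read_html() generally throws an error for a URL it can’t reach, and a single error will stop the whole map. A possible guard, sketched here, is to wrap the scraper in purrr::possibly() so a failed link yields NA instead. To keep the example runnable offline, it uses a made-up stand-in, fake_get_text(), in place of a real request, and the links are placeholders.

```r
library(purrr)

# Hypothetical stand-in for get_text(): errors on a bad link, like read_html() would
fake_get_text <- function(l) {
  if (!grepl("^https://", l)) stop("could not open ", l)
  c("First paragraph.", "Second paragraph.")
}

# possibly() returns a wrapped function that yields `otherwise` on error
safe_get_text <- possibly(fake_get_text, otherwise = NA_character_)

links <- c("https://politics.theonion.com/some-article", "not-a-link")

texts <- map(links, safe_get_text)
texts
```

The same wrapper could be applied to the real get_text(), so that one dead link doesn’t throw away the results already collected.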

  1. The robotstxt vignette and robotstxt.org provide some more detail here. Thanks to Maëlle for the suggestion!