Peeling back The Onion
- 2018/03/25
- 21 min read
In this post I’ll programmatically find The Onion article links, scrape them for content, and clean them up into a tidy format. I chose The Onion because, while not real news, the site does a great job of approximating the tone and cadence of real news stories. In the next post, I’ll use the monkeylearn text processing package to hand these to the MonkeyLearn API and then compare the classifications that MonkeyLearn generates with each URL’s subdomain to get an imperfect measure of the classifier’s accuracy.
I use the article’s subdomain to roughly approximate the true classification of stories. These should correspond more or less with the sections delimited by the top nav: Politics, Sports, Local, etc. That is, articles about politics will generally have a URL beginning with https://politics.theonion.com.
Get Data
To my knowledge, The Onion doesn’t have a public API, so web scraping seemed like the best way to go about getting article content.
library(dobtools) # devtools::install_github("aedobbyn/dobtools")
library(tidyverse)
library(stringr)
library(monkeylearn)
library(rvest)
Let’s quickly check that web scraping is allowed by The Onion using the rOpenSci package robotstxt. Its paths_allowed() function will consult the robots.txt file found at the root of a domain (e.g., https://www.google.com/robots.txt) to check whether certain web crawlers, spiders, etc. are discouraged from accessing certain paths on that domain[1].
robotstxt::paths_allowed(
  domain = "theonion.com",
  path = "/",
  bot = "*"
)
## [1] TRUE
Cool, we’re good to go.
First, let’s take a bite-sized example link that we know exists, for an article entitled “CIA Realizes It’s Been Using Black Highlighters All These Years.” From its URL, we can scrape its content using the rvest package.
We can be pretty confident that our content will always be associated with the same combination of HTML tag and CSS class name. Finding that combination is as simple as using the Chrome inspector or the SelectorGadget extension when examining an article in the browser. In this case, article content is always wrapped in p tags, so that’s the selector we’ll pass to rvest::html_nodes().
example_page <- "https://politics.theonion.com/cia-realizes-its-been-using-black-highlighters-all-thes-1819568147"

(example_content <- example_page %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text())
## [1] "LANGLEY, VA—A report released Tuesday by the CIA's Office of the Inspector General revealed that the CIA has mistakenly obscured hundreds of thousands of pages of critical intelligence information with black highlighters."
## [2] "According to the report, sections of the documents— \"almost invariably the most crucial passages\"—are marred by an indelible black ink that renders the lines impossible to read, due to a top-secret highlighting policy that began at the agency's inception in 1947."
## [3] "CIA Director Porter Goss has ordered further internal investigation."
## [4] "\"Why did it go on for this long, and this far?\" said Goss in a press conference called shortly after the report's release. \"I'm as frustrated as anyone. You can't read a single thing that's been highlighted. Had I been there to advise [former CIA director] Allen Dulles, I would have suggested the traditional yellow color—or pink.\""
## [5] "Goss added: \"There was probably some really, really important information in these documents.\""
## [6] "Advertisement"
## [7] ""
## [8] "When asked by a reporter if the black ink was meant to intentionally obscure, Goss countered, \"Good God, why?\""
## [9] "Goss lamented the fact that the public will probably never know the particulars of such historic events as the Cold War, the civil-rights movement, or the growth of the international drug trade."
## [10] "\"I'm sure the CIA played major roles in all these things,\" Goss said. \"But now we'll never know for sure.\""
## [11] "Advertisement"
## [12] ""
## [13] "In addition to clouding the historical record, the use of the black highlighters, also known as \"permanent markers,\" may have encumbered or even prevented critical operations. CIA scholar Matthew Franks was forced to abandon work on a book about the Bay Of Pigs invasion after declassified documents proved nearly impossible to read."
## [14] "\"With all the highlighting in the documents I unearthed in the National Archives, it's really no wonder that the invasion failed,\" Franks said. \"I don't see how the field operatives and commandos were expected to decipher their orders.\""
## [15] "The inspector general's report cited in particular the damage black highlighting did to documents concerning the assassination of John F. Kennedy, thousands of pages of which \"are completely highlighted, from top to bottom margin.\""
## [16] "Advertisement"
## [17] ""
## [18] "\"It is unclear exactly why CIA bureaucrats sometimes chose to emphasize entire documents,\" the report read. \"Perhaps the documents were extremely important in every detail, or the agents, not unlike college freshmen, were overwhelmed by the reading material and got a little carried away.\""
## [19] "Also unclear is why black highlighters were chosen in the first place. Some blame it on the closed, elite culture of the CIA itself. A former CIA officer speaking on the condition of anonymity said highlighting documents with black pens was a common and universal practice."
## [20] "\"It seemed counterintuitive, but the higher-ups didn't know what they were doing,\" the ex-officer said. \"I was once ordered to feed documents into a copying machine in order to make backups of some very important top-secret records, but it turned out to be some sort of device that cut the paper to shreds.\""
## [21] "Advertisement"
## [22] ""
Now that we can do that for a single URL, let’s wrap it in a function that we can apply to any arbitrary link.
If the link doesn’t exist, we’ll want to replace the resulting NULL element with an NA using a quick helper function. Since we’ll be mapping this function over a vector of links, we’ll also add a sleep_time between requests, 1 second by default, to avoid 429 (too many requests) errors 😱.
get_text <- function(l, sleep_time = 1) {
  out <- l %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text() %>%
    dobtools::replace_x(NA_character_) # swap empty results in for NA
  Sys.sleep(sleep_time) # pause between requests to avoid 429s
  return(out)
}
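I won’t show replace_x() itself here, but a minimal sketch of what a helper like it presumably does, assuming it just swaps NULL or zero-length results for a replacement value, might look like this (that’s an assumption on my part; see dobtools for the real implementation):
# Hypothetical stand-in for dobtools::replace_x(): if the scrape came back
# NULL or empty, substitute the replacement value; otherwise pass x through.
replace_x <- function(x, replacement = NA_character_) {
  if (is.null(x) || length(x) == 0) {
    replacement
  } else {
    x
  }
}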
Get Links
Let’s write a way to collect a bunch of links so we don’t have to go find them by hand. A given article links to several other articles elsewhere on the page, in the “You may also like” and “Recommended Stories” sections.
We can write a function that takes an input URL and returns the other article links that appear on that page (and keeps doing this recursively if we wanted to).
All the links we’re looking for will be of the form
https://<subdomain>.theonion.com/<some_characters>
or in other words
<url_prefix><subdomain>.<url_stem></[a-z-0-9]+>
Using those URL components, we can define some global variables.
url_main <- "https://www.theonion.com/"
url_prefix <- "https://"
url_stem <- "theonion.com"
url_stem_reg <- "theonion\\.com" # the regex equivalent of url_stem
subdomains <- c("www", "politics", "local", "sports", "entertainment", "opinion")
Each of the subdomains (www, politics, local, sports, entertainment, opinion) comes from the top nav on url_main, plus the usual www. The subdomain should tell us which section of the newspaper the article appeared in.
In the get_links() function we’ll make below, we take an input url and return all the links it contains to other articles.
We’ll use our usual read_html() %>% html_nodes() pipeline with the CSS class .js_entry-title. That gives us some not-so-beautiful soup that looks like:
"https://www.theonion.com/" %>%
read_html() %>%
html_nodes(".js_entry-title") %>%
as.character()
## [1] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/2001-a-space-odyssey-celebrates-50th-anniversary-1825058954\" class=\"js_entry-link\">‘2001: A Space Odyssey’ Celebrates 50th Anniversary<em></em></a></h1>\n"
## [2] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://sports.theonion.com/pga-officials-break-up-crowd-of-rowdy-fans-committing-c-1825056921\" class=\"js_entry-link\">PGA Officials Break Up Crowd Of Rowdy Fans Committing Commodities Fraud In Augusta National Parking Lot<em></em></a></h1>\n"
## [3] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://sports.theonion.com/kobe-bryant-creates-foundation-to-help-children-struggl-1825050570\" class=\"js_entry-link\">Kobe Bryant Creates Foundation To Help Children Struggling With Severe Narcissism<em></em></a></h1>\n"
## [4] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/study-finds-eating-doctor-after-birth-can-provide-essen-1825048366\" class=\"js_entry-link\">Study Finds Eating Doctor After Birth Can Provide Essential Nutrients To New Mothers<em></em></a></h1>\n"
## [5] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/kitchenaid-unveils-spring-loaded-toaster-that-allows-ra-1825046205\" class=\"js_entry-link\">KitchenAid Unveils Spring-Loaded Toaster That Allows Rad High Schoolers To Grab Breakfast In Midair While Leaving House<em></em></a></h1>\n"
## [6] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://local.theonion.com/kid-putting-pencils-between-knuckles-about-to-fuck-some-1825046023\" class=\"js_entry-link\">Kid Putting Pencils Between Knuckles About To Fuck Someone Up<em></em></a></h1>\n"
## [7] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://local.theonion.com/exercising-woman-really-starting-to-feel-the-burn-of-li-1825044413\" class=\"js_entry-link\">Exercising Woman Really Starting To Feel The Burn Of Lifelong Injury Developing<em></em></a></h1>\n"
## [8] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/retired-pope-benedict-pledges-to-donate-soul-for-eccles-1825044076\" class=\"js_entry-link\">Retired Pope Benedict Pledges To Donate Soul For Ecclesiastic Research<em></em></a></h1>\n"
## [9] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://local.theonion.com/you-can-hold-snake-owner-reports-1825043965\" class=\"js_entry-link\">You Can Hold Snake, Owner Reports<em></em></a></h1>\n"
## [10] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/u-s-marshals-arrest-designers-of-water-slide-that-deca-1825043891\" class=\"js_entry-link\">U.S. Marshals Arrest Designers Of Water Slide That Decapitated Rider<em></em></a></h1>\n"
## [11] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://sports.theonion.com/jack-nicholson-banned-from-sitting-courtside-after-spil-1825027400\" class=\"js_entry-link\">Jack Nicholson Banned From Sitting Courtside After Spilling Tupperware Full Of Homemade Chili<em></em></a></h1>\n"
## [12] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/mueller-tells-trump-he-s-not-under-criminal-investigati-1825026195\" class=\"js_entry-link\">Mueller Tells Trump He’s Not Under Criminal Investigation<em></em></a></h1>\n"
## [13] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/black-father-gives-son-the-talk-about-holding-literally-1825024938\" class=\"js_entry-link\">Black Father Gives Son The Talk About Holding Literally Any Object<em></em></a></h1>\n"
## [14] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/report-this-not-a-gun-1825021641\" class=\"js_entry-link\">Report: This Not A Gun<em></em></a></h1>\n"
## [15] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/cows-go-extinct-1825018199\" class=\"js_entry-link\">Cows Go Extinct<em></em></a></h1>\n"
## [16] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/does-the-younger-generation-of-call-girls-respect-tiger-1825017831\" class=\"js_entry-link\">Does The Younger Generation Of Call Girls Respect Tiger Woods?<em></em></a></h1>\n"
## [17] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/fuming-rachel-maddow-spends-entire-show-just-pointing-w-1825017260\" class=\"js_entry-link\">Fuming Rachel Maddow Spends Entire Show Just Pointing Wildly At Picture Of Putin<em></em></a></h1>\n"
## [18] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/ice-agents-feeling-a-little-hurt-that-trump-doesn-t-thi-1825016656\" class=\"js_entry-link\">ICE Agents Feeling A Little Hurt That Trump Doesn’t Think They’re Doing Enough To Terrorize Hispanics<em></em></a></h1>\n"
## [19] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://sports.theonion.com/who-has-the-best-dick-in-baseball-1825016290\" class=\"js_entry-link\">Who Has The Best Dick In Baseball?<em></em></a></h1>\n"
## [20] "<h1 class=\"headline entry-title js_entry-title\"><a onclick=\"\" href=\"https://www.theonion.com/how-trade-wars-work-1825012677\" class=\"js_entry-link\">How Trade Wars Work<em></em></a></h1>\n"
which we save as links_text.
Now we can search through this text to find links. We know these are links to other articles because they contain the url_stem “theonion.com”.
We get a link_base by looking for “theonion.com” followed by characters in our regex character set [a-z-0-9]. Once we hit a character that’s not in the set, such as a space or a quotation mark, we’ll know we’ve reached the end of the link we’re looking for.
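As a quick illustration (the href string below is made up for the example), extracting a link_base from a chunk of that soup looks like:
# Toy example: pull the link_base out of a hypothetical href attribute
str_extract(
  "href=\"https://sports.theonion.com/some-headline-1825016290\"",
  "theonion\\.com/[a-z-0-9]+"
)
## [1] "theonion.com/some-headline-1825016290"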
Then we find the subdomains that come directly before link_base; there will always be only one of these per link, so we can use str_extract() (which only extracts the first instance of a pattern) instead of str_extract_all(). Worth keeping in mind here that subd refers to the actual link’s subdomain rather than the “referring” subdomain. This is important because if we scrape https://local.theonion.com/ we may very well turn up links to https://entertainment.theonion.com/. Here the referring subdomain would be local, but the URL’s subdomain would be entertainment.
Finally, we recompose the whole thing into a link.
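Continuing the made-up example, that recomposition is just string concatenation:
str_c("https://", "sports", ".", "theonion.com/some-headline-1825016290")
## [1] "https://sports.theonion.com/some-headline-1825016290"
Putting all of that together: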
get_links <- function(url = url_main, subds = subdomains) {
  # Grab every headline node on the page
  links_raw <- url %>%
    xml2::read_html() %>%
    html_nodes(".js_entry-title")

  links_text <- links_raw %>%
    as.character()

  # Extract each link_base: the url_stem followed by the article slug
  reg <- paste0(url_stem_reg, "/[a-z-0-9]+")
  links <- tibble(
    link_base = links_text %>% str_extract_all(reg) %>% unlist()
  )

  # Find each link's own subdomain (not the referring page's)
  subds_found <- tibble(
    subd = links_text %>% map(str_extract, subds) %>% unlist()
  ) %>%
    drop_na()

  # Recompose full links and record where we found them
  links <- links %>%
    bind_cols(subds_found) %>%
    mutate(
      link = str_c(url_prefix, subd, ".", link_base),
      referring_url = url,
      referring_subd = referring_url %>%
        str_extract(subdomains %>% str_c(collapse = "|"))
    ) %>%
    distinct() %>%
    select(-link_base)

  return(links)
}
Now a bit of error handling in case a link we ask for doesn’t exist. We’ll return a missing value for that link instead of erroring out.
try_get_links <- possibly(get_links, otherwise = NA_character_)
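As a quick sanity check (this URL is deliberately fake), the wrapped version should now fail gracefully:
# Assuming the page 404s, this should return NA rather than an error
try_get_links("https://www.theonion.com/this-page-does-not-exist")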
Let’s get our dataframe of links.
(links <- try_get_links(url_main))
## # A tibble: 20 x 4
## subd link referring_url referring_subd
## <chr> <chr> <chr> <chr>
## 1 www https://www.theonion.com/2001-a… https://www.the… www
## 2 sports https://sports.theonion.com/pga… https://www.the… www
## 3 sports https://sports.theonion.com/kob… https://www.the… www
## 4 www https://www.theonion.com/study-… https://www.the… www
## 5 www https://www.theonion.com/kitche… https://www.the… www
## 6 local https://local.theonion.com/kid-… https://www.the… www
## 7 local https://local.theonion.com/exer… https://www.the… www
## 8 www https://www.theonion.com/retire… https://www.the… www
## 9 local https://local.theonion.com/you-… https://www.the… www
## 10 www https://www.theonion.com/u-s-ma… https://www.the… www
## 11 sports https://sports.theonion.com/jac… https://www.the… www
## 12 www https://www.theonion.com/muelle… https://www.the… www
## 13 www https://www.theonion.com/black-… https://www.the… www
## 14 www https://www.theonion.com/report… https://www.the… www
## 15 www https://www.theonion.com/cows-g… https://www.the… www
## 16 www https://www.theonion.com/does-t… https://www.the… www
## 17 www https://www.theonion.com/fuming… https://www.the… www
## 18 www https://www.theonion.com/ice-ag… https://www.the… www
## 19 sports https://sports.theonion.com/who… https://www.the… www
## 20 www https://www.theonion.com/how-tr… https://www.the… www
Now we can scrape all 20 of these freshly harvested links for their content. But instead of using just those links, we can seed ourselves with our url_stem, theonion.com, preceded by each of the subdomains. This way we can loop through subdomains and use get_links() to gather links from all of them.
(full_subd_urls <- str_c(url_prefix, subdomains, ".", url_stem))
## [1] "https://www.theonion.com"
## [2] "https://politics.theonion.com"
## [3] "https://local.theonion.com"
## [4] "https://sports.theonion.com"
## [5] "https://entertainment.theonion.com"
## [6] "https://opinion.theonion.com"
First, a quick check to make sure we’re allowed to scrape the front page of all these subdomains.
full_subd_urls %>% robotstxt::paths_allowed()
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Okay, now let’s get all of their links.
all_links_list <- full_subd_urls %>%
  map(try_get_links)
We can see that our error handling paid off when we asked for https://opinion.theonion.com, which in fact 404s in the browser because the page doesn’t exist. We return an NA for that link (list element [[6]]), which we’ll get rid of later.
all_links_list
## [[1]]
## # A tibble: 20 x 2
## subd link
## <chr> <chr>
## 1 www https://www.theonion.com/we-interview-some-guy-who-hated-marc…
## 2 www https://www.theonion.com/nra-calls-for-more-common-sense-gun-…
## 3 www https://www.theonion.com/nra-says-parkland-students-should-be…
## 4 www https://www.theonion.com/jonathan-safran-foer-guesses-it-s-ti…
## 5 local https://local.theonion.com/it-kind-of-pathetic-how-excited-3-…
## 6 www https://www.theonion.com/the-week-in-pictures-week-of-march-2…
## 7 politics https://politics.theonion.com/stormy-daniels-60-minutes-inter…
## 8 local https://local.theonion.com/man-assumed-celebrity-sighting-wou…
## 9 www https://www.theonion.com/male-birth-control-pill-shows-early-…
## 10 www https://www.theonion.com/fda-deems-genetically-modified-salmo…
## 11 www https://www.theonion.com/apple-recalls-thousands-of-earbuds-t…
## 12 www https://www.theonion.com/stormy-daniels-60-minutes-interview-…
## 13 politics https://politics.theonion.com/john-bolton-warns-war-with-nort…
## 14 www https://www.theonion.com/yosemite-national-park-completes-con…
## 15 politics https://politics.theonion.com/psychopath-joins-fourth-straigh…
## 16 local https://local.theonion.com/friends-trying-on-each-other-s-gla…
## 17 www https://www.theonion.com/sound-off-1824019596
## 18 www https://www.theonion.com/christ-sues-catholic-church-for-unli…
## 19 sports https://sports.theonion.com/is-it-time-for-the-ncaa-to-start-…
## 20 www https://www.theonion.com/u-s-military-announces-plan-to-conso…
##
## [[2]]
## # A tibble: 20 x 2
## subd link
## <chr> <chr>
## 1 politics https://politics.theonion.com/stormy-daniels-60-minutes-inter…
## 2 politics https://politics.theonion.com/john-bolton-warns-war-with-nort…
## 3 politics https://politics.theonion.com/psychopath-joins-fourth-straigh…
## 4 politics https://politics.theonion.com/you-are-the-jewel-of-my-collect…
## 5 politics https://politics.theonion.com/grumblethor-the-mischievous-ple…
## 6 politics https://politics.theonion.com/surrendering-trump-boys-solemnl…
## 7 politics https://politics.theonion.com/key-2018-election-primaries-to-…
## 8 politics https://politics.theonion.com/andrew-mccabe-spending-few-days…
## 9 politics https://politics.theonion.com/rick-perry-apologizes-for-tryin…
## 10 politics https://politics.theonion.com/donald-trump-jr-divorce-leaves-…
## 11 politics https://politics.theonion.com/subpoenaed-trump-organization-f…
## 12 politics https://politics.theonion.com/exhausted-mueller-trying-to-fin…
## 13 politics https://politics.theonion.com/border-wall-prototype-clearly-d…
## 14 politics https://politics.theonion.com/mike-pompeo-startled-after-seei…
## 15 politics https://politics.theonion.com/gina-haspel-recalls-having-to-t…
## 16 politics https://politics.theonion.com/rex-tillerson-shoots-mike-pompe…
## 17 politics https://politics.theonion.com/secretary-of-state-fired-after-…
## 18 politics https://politics.theonion.com/morale-low-at-state-department-…
## 19 politics https://politics.theonion.com/betsy-devos-argues-issue-of-gun…
## 20 politics https://politics.theonion.com/wilbur-ross-shakes-self-awake-a…
##
## [[3]]
## # A tibble: 20 x 2
## subd link
## <chr> <chr>
## 1 local https://local.theonion.com/it-kind-of-pathetic-how-excited-3-yea…
## 2 local https://local.theonion.com/man-assumed-celebrity-sighting-would-…
## 3 local https://local.theonion.com/friends-trying-on-each-other-s-glasse…
## 4 local https://local.theonion.com/employee-leaving-company-unsure-how-t…
## 5 local https://local.theonion.com/coffee-shop-customer-asks-if-guy-at-n…
## 6 local https://local.theonion.com/classically-trained-actor-can-talk-on…
## 7 local https://local.theonion.com/man-constantly-blaming-his-problems-o…
## 8 local https://local.theonion.com/adorable-23-year-old-yelling-about-ec…
## 9 local https://local.theonion.com/old-mans-son-also-old-man-1823955548
## 10 local https://local.theonion.com/bride-has-to-admit-it-d-be-pretty-exc…
## 11 local https://local.theonion.com/johnny-rockets-customer-terrified-aft…
## 12 local https://local.theonion.com/fingerprints-on-bathroom-stall-hopefu…
## 13 local https://local.theonion.com/freak-totally-has-the-hots-for-you-po…
## 14 local https://local.theonion.com/friend-who-listened-to-podcast-on-wat…
## 15 local https://local.theonion.com/dad-recommends-hotel-10-miles-away-fr…
## 16 local https://local.theonion.com/hacker-just-going-to-fix-a-few-annoyi…
## 17 local https://local.theonion.com/completely-unfair-that-man-ended-up-o…
## 18 local https://local.theonion.com/tulip-popping-up-in-middle-of-march-m…
## 19 local https://local.theonion.com/vagina-has-five-oclock-shadow-1823838…
## 20 local https://local.theonion.com/doll-real-estate-agent-glosses-over-g…
##
## [[4]]
## # A tibble: 20 x 2
## subd link
## <chr> <chr>
## 1 sports https://sports.theonion.com/is-it-time-for-the-ncaa-to-start-pa…
## 2 sports https://sports.theonion.com/will-missing-the-ncaa-tournament-af…
## 3 sports https://sports.theonion.com/nfl-sues-ea-to-end-production-of-un…
## 4 sports https://sports.theonion.com/why-is-march-madness-the-only-time-…
## 5 sports https://sports.theonion.com/hank-s-upset-that-the-office-reject…
## 6 sports https://sports.theonion.com/which-ncaa-tournament-team-will-str…
## 7 sports https://sports.theonion.com/gregg-popovich-berates-spurs-for-mi…
## 8 sports https://sports.theonion.com/james-harden-credits-his-nba-succes…
## 9 sports https://sports.theonion.com/how-much-can-hank-pad-out-this-segm…
## 10 sports https://sports.theonion.com/does-americas-poor-showing-at-the-o…
## 11 sports https://sports.theonion.com/is-lindsey-vonn-the-greatest-skier-…
## 12 sports https://sports.theonion.com/eagles-fans-finally-sober-enough-to…
## 13 sports https://sports.theonion.com/spectators-bombarded-with-gamma-rad…
## 14 sports https://sports.theonion.com/u-s-wins-gold-in-couples-snow-eatin…
## 15 sports https://sports.theonion.com/olympic-figure-skating-inspires-tho…
## 16 sports https://sports.theonion.com/uphill-skiing-competition-enters-6t…
## 17 www https://www.theonion.com/snowy-mountain-in-pyeongchang-figures-…
## 18 www https://www.theonion.com/chloe-kim-recalls-growing-up-under-par…
## 19 www https://www.theonion.com/nation-praying-for-super-nasty-luge-ac…
## 20 www https://www.theonion.com/olympic-drug-testing-official-left-hor…
##
## [[5]]
## # A tibble: 20 x 2
## subd link
## <chr> <chr>
## 1 entertainment https://entertainment.theonion.com/damning-evidence-show…
## 2 entertainment https://entertainment.theonion.com/paul-giamatti-cuts-ba…
## 3 entertainment https://entertainment.theonion.com/molly-hatchet-posts-s…
## 4 entertainment https://entertainment.theonion.com/audience-left-wonderi…
## 5 entertainment https://entertainment.theonion.com/negative-review-of-a-…
## 6 entertainment https://entertainment.theonion.com/netflix-executive-uns…
## 7 entertainment https://entertainment.theonion.com/leonardo-dicaprio-ner…
## 8 entertainment https://entertainment.theonion.com/diversity-was-the-rea…
## 9 entertainment https://entertainment.theonion.com/hungover-guillermo-de…
## 10 entertainment https://entertainment.theonion.com/unclear-if-shirtless-…
## 11 entertainment https://entertainment.theonion.com/phantom-thread-wins-a…
## 12 entertainment https://entertainment.theonion.com/banjo-wielding-matt-d…
## 13 entertainment https://entertainment.theonion.com/oscars-audience-shrug…
## 14 entertainment https://entertainment.theonion.com/guillermo-del-toro-in…
## 15 entertainment https://entertainment.theonion.com/red-carpet-organizers…
## 16 entertainment https://entertainment.theonion.com/perverted-creep-keeps…
## 17 entertainment https://entertainment.theonion.com/academy-honors-retiri…
## 18 entertainment https://entertainment.theonion.com/the-onion-s-2018-osca…
## 19 entertainment https://entertainment.theonion.com/sci-fi-film-presents-…
## 20 entertainment https://entertainment.theonion.com/justin-timberlake-pul…
##
## [[6]]
## [1] NA
Let’s rowbind these all together to take them from list to dataframe.
all_links_list <- all_links_list[!is.na(all_links_list)]
all_links <- all_links_list %>% reduce(bind_rows)
How many links can we pull from now?
nrow(all_links)
## [1] 100
Cool, so that’s 80 more than we’d get from the front page alone. Let’s take a look at a few of them.
all_links %>% sample_n(10) %>% kable()
Get Content
Now that we have some URLs, we can extract the text from all of those pages. We’ll map get_text() over each of the links to get a list of each page’s content.
all_texts <- all_links$link %>% map(get_text)
Let’s take a look at a sample of what we’ve got.
all_texts[!is.na(all_texts)] %>% sample(5)
## [[1]]
## [1] "WASHINGTON—Tears welling in their eyes as they faced each other while standing at attention, the Trump boys, Donald Jr. and Eric, exchanged a solemn salute before defiantly leaping from a first-story White House window. “It’s been an honor to serve with you, Don,” said a stoic Eric Trump who opened the window in the State Dining Room in preparation for the brothers’ last great act of glorious rebellion. “Don’t cry, Don, today we are going to bravely escape to heaven. I’ll meet at the pearly gates. Just look for someone who looks exactly like me yelling your name.” White House tour sources confirmed hearing the Trump boys screaming incoherently after the brothers leaped out the opening, fell 18 inches, and became entangled in coniferous shrubs."
##
## [[2]]
## [1] "NEW YORK—With players, coaches, and executives around the league admitting that the sudden finish had taken them completely by surprise, sources confirmed that the Major League Baseball season ended Thursday night over 200 days earlier than expected after new rules designed to make games take less time sped them up way too much. “We were anticipating that our new guidelines would reduce the amount of time in a nine-inning game, but we absolutely weren’t expecting that the average game would take only one minute and three seconds to play,” said MLB commissioner Rob Manfred, adding that rules designed to limit mound visits and cut down on commercial breaks ended up accelerating game times to such a degree that the first half of the season wrapped up at around 4:48 p.m. yesterday afternoon. “We definitely did see the games go much faster than last year, especially during the three-hour second half of the season. But then after the Milwaukee Brewers beat the Houston Astros 7-6 in a thrilling four-minute, 12-inning game to clinch the World Series title in six, we realized the season was over. The league would like to congratulate Mike Trout on his MVP-winning Triple Crown season, as well as the Brewers’ Chase Anderson on his surprise Cy Young campaign. We’ll see you all in 2019, I guess.” Sources confirmed that among those disappointed by the results of the season were the New York Yankees, who barely finished above .500 after slugger Giancarlo Stanton missed over 70 games while he was in the bathroom."
##
## [[3]]
## [1] "WASHINGTON—In the hours after subpoenaing the Trump Organization for a wide-ranging batch of files possibly germane to the investigation, sources confirmed Thursday that Special Counsel Robert Mueller was already exhausted trying to find Russia-related documents amid thousands of harassment lawsuits. “Oh my god, how many of these could there possibly be?” said a visibly weary Mueller, shoving aside another stack of papers containing only a single email exchange between a Trump Organization employee and a Russian businessman amongst dozens of out-of-court settlements of sexual misconduct suits filed against Trump. “I thought I was finally done with all of the lawsuits women have filed against him after the reams of documents concerning the Jill Harth sexual harassment case, but no—Here’s another stack from the [Summer] Zervos defamation suit. If I keep having to rifle through tens of thousands of pages of victim statements regarding suggestive remarks, corroborating eyewitness accounts of unwelcome contact, and lists of times that Trump slandered his accusers, all to find one damn thing I can use in my investigation, this is going to take forever.” Mueller admitted he was concerned about being thrown completely off the trail after going through eight straight file boxes of sexual harassment complaints filed by Trump Organization employees alone."
##
## [[4]]
## [1] "JUNEAU, AK—Saying it was clear the parents never intended to have such a large brood, sources confirmed Wednesday that the Greene family has way too many daughters for them not to have been trying for a son. “Obviously, after Jess and Katie, they started to get desperate for a boy, otherwise they wouldn’t have had Ashley,” said family friend Lisa Contreras, who noted that the Greenes showed no signs of stopping even though they were both nearing 40 and had daughters in daycare, elementary school, and middle school. “I thought for sure they’d be done once Sophia was born, but then a year and a half later, along came Charlie. For everyone’s sake, I hope the fifth time’s the charm.” Sources later confirmed that the Greenes had posted a photo of pink balloons on Facebook to announce their latest pregnancy."
##
## [[5]]
## [1] "WASHINGTON—In the latest shakeup to their defense of Special Counsel Robert Mueller’s Russia probe, President Trump’s legal team reportedly welcomed Wednesday a guy who never missed an episode of Ally McBeal back in the day. “We’re excited today to be joined by Ron Farkus, an Ohio man who tuned in every week between 1997 and 2002 to watch Fox’s beloved Calista Flockhart vehicle about the eccentric and oversexed Boston law firm Cage and Fish,” said top Trump lawyer Ty Cobb, adding that the 43-year-old Farkus has years of valuable experience owning the legal comedy-drama television series’ DVD box set, watching every episode “a bunch of times,” and even memorizing several of Flockhart’s famous quips from the show. “We look forward to working with Mr. Farkus, whose expertise in the antics of Ally, Richard, and Elaine both in and out of the courthouse will be of great use to us, as will his intimate knowledge of the ensemble’s ever-shifting love triangles, their outlandish courtroom battles, and the recurring dancing babies.” At press time, Farkus had been let go and replaced by a man who’s seen every episode of USA’s Suits, moments before that individual lost his position to some dude who used to watch the CBS legal drama JAG pretty regularly."
Now that we’ve got the text in list form, we’ll clean it up a bit.
The only obvious things we’ll need to remove are paragraphs that are empty strings or are just the alt text "Advertisement". We can write a little clean_texts() function and add to it later if we notice anything else that snuck in.
clean_texts <- function(t) {
  out <- t %>%
    str_replace_all("Advertisement", "")
  out[out != ""] # drop paragraphs that are now empty strings
}
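One caveat: str_replace_all() will also blank out "Advertisement" wherever it appears, including inside an article’s body. If that ever mattered, a stricter variant (just a sketch, with a hypothetical name) could drop only the paragraphs that are exactly the placeholder:
# Drop only paragraphs that are exactly "" or "Advertisement",
# leaving in-article occurrences of the word untouched
clean_texts_strict <- function(t) {
  t[!t %in% c("", "Advertisement")]
}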
A page’s content can have multiple paragraphs. Right now, each of those is its own element of a vector, but since we want to treat each page’s content as a single text for classification, we’ll collapse everything on a single page into one long string by mapping str_c(collapse = "") over each element of our list.
Finally, we’ll want to relate each text to its URL and that URL’s subdomain. In reshape_texts() we do that by nudging them side-by-side using dplyr::bind_cols() before dropping all of the texts that contained no content at all (probably because the page only contained a video).
reshape_texts <- function(tbl, links_df = all_links) {
  out <- tbl %>%
    map(str_c, collapse = "") %>%
    unlist() %>%
    as_tibble() %>%
    rename(
      content = value
    ) %>%
    bind_cols(links_df) %>%
    drop_na()
  return(out)
}
Let’s clean and reshape our newly scraped text in one fell swoop.
all_texts_clean <- all_texts %>%
  map(clean_texts) %>%
  reshape_texts()
all_texts_clean
## # A tibble: 60 x 4
## content subd link referring_url
## <chr> <chr> <chr> <chr>
## 1 Roughly two million Americans joined the M… www https:… https://www.…
## 2 DALLAS—Contemplating how pivoting away fro… www https:… https://www.…
## 3 FAIRFAX, VA—In response to the March For O… www https:… https://www.…
## 4 FAIRFAX, VA—Reminding them to appreciate t… www https:… https://www.…
## 5 BROOKLYN, NY—Saying that if it were going … www https:… https://www.…
## 6 ATHENS, OH—Noting the “sad fucking glimmer… local https:… https://www.…
## 7 AKRON, OH—After unexpectedly running into … local https:… https://www.…
## 8 A form of once-daily male birth control ap… www https:… https://www.…
## 9 WASHINGTON—Following months of analysis in… www https:… https://www.…
## 10 Despite threats of legal action from the W… www https:… https://www.…
## # ... with 50 more rows
I’ll define a quick helper that will let me show a sample of the data in table format without it getting to be too much. It’s been added to my dobtools package, as it’ll be used in the next post. By default it samples three rows and takes just characters 1 through last_char (50 by default) of a specified character column (here, content).
dobtools::trim_content
## function (df, col = content, last_char = 50, sample_some = 3)
## {
## assertthat::assert_that(deparse(substitute(col)) %in% names(df),
## msg = "col supplied must be present in df")
## assertthat::assert_that(is.null(sample_some) || is.numeric(sample_some),
## msg = "sample_some must be NULL or numeric")
## if (!is.null(sample_some)) {
## df <- df %>% dplyr::sample_n(sample_some)
## }
## q_col <- rlang::enquo(col)
## out <- df %>% dplyr::rowwise() %>% dplyr::mutate_at(dplyr::vars(!(!q_col)),
## substr, 1, last_char)
## return(out)
## }
## <environment: namespace:dobtools>
And now to show a random, trimmed 3:
all_texts_clean %>%
  trim_content() %>%
  kable()
| content | subd | link | referring_url |
|---|---|---|---|
| NEW YORK—Shifting creative gears to pursue what he | entertainment | https://entertainment.theonion.com/paul-giamatti-cuts-back-on-acting-to-focus-on-signature-1823828872 | https://entertainment.theonion.com |
| PARK CITY, UT—Admitting they felt utterly bewilder | entertainment | https://entertainment.theonion.com/audience-left-wondering-what-happened-after-action-film-1823699285 | https://entertainment.theonion.com |
| SANTA MONICA, CA—Alarmed by the red vinyl seats, c | local | https://local.theonion.com/johnny-rockets-customer-terrified-after-evidently-falli-1823950513 | https://local.theonion.com |
Now we’ve got one row per text (link) so we’re ready to feed this to MonkeyLearn. 🍌
On to part two!
1. The robotstxt vignette and robotstxt.org provide some more detail here. Thanks to Maëlle for the suggestion!