Monkeys are like Onions
- 2018/03/25
- 14 min read
This is part two of a series on scraping content from the satirical news site The Onion and feeding that content to the newly spruced-up monkeylearn package. Part one deals with the scraping and munging of the data itself. In this chunk of work, we’ll go about classifying that data and getting a very unscientific measure of how “well” the classifier performed.
MonkeyLearn Background
I’ve spent a really fun chunk of time in the last month or so developing the rOpenSci text processing package monkeylearn along with the fantastic research software engineer Maëlle Salmon. The package is an R interface to the MonkeyLearn API, a cloud platform for text classification and keyword extraction. The service allows you to train your own classifier or use one of MonkeyLearn's pre-trained extraction or classification “modules.” Those modules are specialized to classify or extract certain information from text, and the type of results they return depends on the classifier or extractor the user supplies.
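Under the hood, these are just authenticated POSTs to MonkeyLearn's REST API. As a rough illustration of what the package wraps for you (the endpoint path and the "data" payload field below are my assumptions from the API docs, not code lifted from the package):

# A sketch of the raw API call the package wraps. The endpoint path and
# the "data" payload field are assumptions; see MonkeyLearn's API docs
# for the exact shape.
library(httr)

resp <- POST(
  "https://api.monkeylearn.com/v3/classifiers/cl_hS9wMk9y/classify/",
  add_headers(Authorization = paste("Token", Sys.getenv("MONKEYLEARN_KEY"))),
  body = list(data = list("Some text to classify")),
  encode = "json"
)
content(resp)

The package takes care of this plumbing for you, plus batching inputs and shaping the response into a dataframe.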
I use the package in my work at Earlybird, which is the reason I first started contributing to it, but I wanted to test it out on some more fun data. Since there is a pre-trained “News” classifier available from MonkeyLearn, I was interested in how that classifier would handle satirical news from the humor site The Onion.
Refresher
library(dobtools) # devtools::install_github("aedobbyn/dobtools")
library(tidyverse)
library(stringr)
library(monkeylearn)
library(rvest)
In part one, we defined our subdomains
subdomains <- c("www", "politics", "local", "sports", "entertainment", "opinion")
and saved our freshly rvest-ed texts in an object all_texts_clean. We’ll use that as our jumping-off point here.
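(For orientation, part one’s scraping step boiled down to something like the sketch below; the selector and URL here are placeholders rather than the real thing, so see part one for the actual code.)

# A compressed, approximate sketch of part one's scraping step.
# The "p" selector and the URL are placeholders; the real code is in part one.
get_article_text <- function(url) {
  read_html(url) %>%
    html_nodes("p") %>%
    html_text() %>%
    str_c(collapse = " ")
}

# e.g., get_article_text("https://local.theonion.com/some-article-slug")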
I’ll use our same trim_content helper from before to show a manageable amount of a dataframe with a lot of text in kable format. By default it samples three rows and takes just the first 50 characters of the column supplied (here, the column content).
dobtools::trim_content
## function (df, col = content, last_char = 50, sample_some = 3)
## {
## assertthat::assert_that(deparse(substitute(col)) %in% names(df),
## msg = "col supplied must be present in df")
## assertthat::assert_that(is.null(sample_some) || is.numeric(sample_some),
## msg = "sample_some must be NULL or numeric")
## if (!is.null(sample_some)) {
## df <- df %>% dplyr::sample_n(sample_some)
## }
## q_col <- rlang::enquo(col)
## out <- df %>% dplyr::rowwise() %>% dplyr::mutate_at(dplyr::vars(!(!q_col)),
## substr, 1, last_char)
## return(out)
## }
## <environment: namespace:dobtools>
all_texts_clean %>%
trim_content() %>%
kable()
content | subd | link | referring_url | referring_subd |
---|---|---|---|---|
NEW YORK—Saying that he was able to draw upon a li | local | https://local.theonion.com/classically-trained-actor-can-talk-on-cue-1823995491 | https://local.theonion.com | local |
NEW YORK—Shifting creative gears to pursue what he | entertainment | https://entertainment.theonion.com/paul-giamatti-cuts-back-on-acting-to-focus-on-signature-1823828872 | https://entertainment.theonion.com | entertainment |
LOS GATOS, CA—Acknowledging that the former presid | entertainment | https://entertainment.theonion.com/netflix-executive-unsure-how-to-tell-barack-obama-his-s-1823651466 | https://entertainment.theonion.com | entertainment |
Today I [Monkey]Learned
We can peruse the pre-trained MonkeyLearn modules (keyword extractors and text classifiers) at https://app.monkeylearn.com/main/explore/ or use monkeylearn::monkeylearn_classifiers() to get a list of pre-trained classifiers and their descriptions.
A sample:
monkeylearn_classifiers() %>%
select(classifier_id, name, description) %>%
sample_n(5) %>%
kable()
classifier_id | name | description |
---|---|---|
cl_mcHq5Mxu | Generic Sentiment | Classify texts in English according to their sentiment: positive, neutral or negative. |
cl_u9PRHNzf | Tweets Sentiment (Spanish) | Classifying tweets in Spanish (from Spain) according to their sentiment: positive, neutral or negative. |
cl_hT2jYBHR | Ofertas de trabajo | Classifies job postings in Spanish according to its topic. |
cl_rtdVEb8p | Telcos - Sentiment analysis (Facebook) | Classify Facebook posts and comments in Telcos pages according to its sentiment. |
cl_uEzzFRHh | Telcos - Needs Help Detection (Facebook) | Classify Facebook posts and comments in Telcos pages to detect if a user needs help from a customer support representative or not. |
The two classifiers that look the most promising are the News classifier (ID cl_hS9wMk9y) and the Events classifier (ID cl_4omNGduL).
monkeylearn_classifiers() %>%
select(classifier_id, name, description) %>%
filter(name %in% c("Events Classifier", "News Classifier")) %>%
kable()
classifier_id | name | description |
---|---|---|
cl_4omNGduL | Events Classifier | Classify events according to their description. |
cl_hS9wMk9y | News Classifier | Classify news by topics like Sports, Politics, Business and more. |
First we’ll test the News one on a sample text.
text_sample <- all_texts_clean %>%
sample_n(1)
Let’s take a look at this link’s content.
text_sample %>%
select(content) %>%
kable()
content |
---|
NEW YORK—Shifting creative gears to pursue what he called “his other great passion in life,” casual men’s fashion, Paul Giamatti announced Friday that he would be cutting back on acting to launch a signature line of shapeless khakis and rumpled polos. “Over the years, I’ve heard from so many fans who wanted to dress like my characters that I figured, why not? I’ll try designing loose-fitting, haphazard looks that evoke the classic ‘Giamatti’ brand,” the 50-year-old Brooklyn native told fashion reporters, confirming that his not-quite-ready-to-wear line would range in size from large to extra large and be sold exclusively at Kohl’s. “We’re also featuring faded golf hats, shabby corduroy sport coats, loafers, and unnecessarily long button-downs. And if you were wondering how the pants are cut, don’t worry—there will be ample roominess around the crotch.” The official release of the Giamatti line will occur later this month with a runway show whenever East Windsor, New Jersey’s historic Bowling & Recreation Center can fit it in. |
Cool. Now a bit on the monkey_ functions (which you can see more examples of in the vignette).
The monkey_classify() function takes either a vector of texts or a dataframe and a named column. It’ll always return a dataframe relating your input to your output. You can set the number of texts you want batched to the API per request (up to 200) with the texts_per_req parameter. (This is much, much faster than going one-by-one.) Here we’ve got only a single text, so we’ll have just one batch. What does our output look like?
text_sample %>%
monkey_classify(col = content, classifier_id = "cl_hS9wMk9y") %>%
trim_content(sample_some = 2) %>%
kable()
## Processing batch 1 of 1 batches: texts 1 to 1
content | category_id | probability | label |
---|---|---|---|
NEW YORK—Shifting creative gears to pursue what he | 53873822 | 0.268 | Fashion & Style |
NEW YORK—Shifting creative gears to pursue what he | 53873812 | 0.403 | Living |
We get multiple classification labels for each input content, along with a unique category_id and probability. For now, we only really care about the label.
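As an aside, monkey_classify() is just as happy with a bare character vector as with a dataframe and column; a quick sketch with two invented texts:

# monkey_classify() also accepts a plain character vector (no col needed).
# These two texts are invented for illustration.
some_texts <- c("An article about the baseball playoffs.",
                "A piece on congressional budget negotiations.")

some_texts %>%
  monkey_classify(classifier_id = "cl_hS9wMk9y",
                  texts_per_req = 2)  # send both texts in a single batch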
Scaling Up
Let’s run the News and Events classifiers on all_texts_clean. The .keep_all option allows us to retain all columns from the input dataframe, so that monkey_classify() acts like a mutate, tacking on a new result column to our input. If set to FALSE, it would return only the input col and its corresponding output res.
all_texts_classified_news_nested <- all_texts_clean %>%
monkey_classify(
col = content,
classifier_id = "cl_hS9wMk9y"
unnest = FALSE,
.keep_all = TRUE,
verbose = FALSE)
all_texts_classified_events_nested <- all_texts_clean %>%
monkey_classify(
col = content,
classifier_id = "cl_4omNGduL",
unnest = FALSE,
.keep_all = TRUE,
verbose = FALSE)
I set unnest to FALSE in order to be able to check that the two classifiers returned data of the same dimensions. Each should have one row per input, even if there are 0 or multiple classifications for that input.
assertthat::are_equal(dim(all_texts_classified_news_nested),
dim(all_texts_classified_events_nested))
## [1] TRUE
Taking a look at just the output of the News classifier, we see we have a nested list-column called res, the result of our round-trip to MonkeyLearn.
(news_samp <- all_texts_classified_news_nested %>%
trim_content() %>%
select(content, res, everything()))
## Source: local data frame [3 x 6]
## Groups: <by row>
##
## # A tibble: 3 x 6
## content res subd link referring_url referring_subd
## <chr> <list> <chr> <chr> <chr> <chr>
## 1 LOS ANGELES… <data.… ente… https://enter… https://entert… entertainment
## 2 BURBANK, CA… <data.… ente… https://enter… https://entert… entertainment
## 3 WASHINGTON—… <data.… poli… https://polit… https://politi… politics
If we unnest res, we’ll be able to see all our labels as we did in our text_sample above.
news_samp %>%
unnest() %>%
select(content, label, category_id, probability, everything()) %>%
kable()
Now we’ll join our News and Events results on the common column, content, to end up with a label_news and a label_events column for each input.
classified <- all_texts_classified_news_nested %>% unnest() %>%
left_join(all_texts_classified_events_nested %>% unnest(), by = "content")
names(classified) <- names(classified) %>%
str_replace_all("\\.x", "_news") %>%
str_replace_all("\\.y", "_events")
After a bit of cleaning up, we’re left with a dataframe, classified, relating each input to its News and Events classifications along with its subdomain and information about the link that pointed us to that article.
(classified <- classified %>%
rename(link = link_news,
subd = subd_news,
referring_url = referring_url_news,
referring_subd = referring_subd_news) %>%
select(subd, content, link, label_news, label_events,
referring_url, referring_subd,
probability_news, probability_events))
## # A tibble: 220 x 9
## subd content link label_news label_events referring_url
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 polit… WASHINGTON—In an ef… http… Science &… Entertainme… https://poli…
## 2 polit… WASHINGTON—In an ef… http… Science &… Sports https://poli…
## 3 polit… WASHINGTON—In an ef… http… Science &… Soccer https://poli…
## 4 polit… WASHINGTON—In an ef… http… Energy & … Entertainme… https://poli…
## 5 polit… WASHINGTON—In an ef… http… Energy & … Sports https://poli…
## 6 polit… WASHINGTON—In an ef… http… Energy & … Soccer https://poli…
## 7 polit… WASHINGTON—In the l… http… Arts & Cu… Entertainme… https://poli…
## 8 polit… WASHINGTON—In the l… http… Arts & Cu… Sports https://poli…
## 9 polit… WASHINGTON—In the l… http… Arts & Cu… Fighting & … https://poli…
## 10 polit… WASHINGTON—In the l… http… TV & Radio Entertainme… https://poli…
## # ... with 210 more rows, and 3 more variables: referring_subd <chr>,
## # probability_news <dbl>, probability_events <dbl>
To show a sample:
classified %>%
trim_content() %>%
kable()
subd | content | link | label_news | label_events | referring_url | referring_subd | probability_news | probability_events |
---|---|---|---|---|---|---|---|---|
entertainment | NEW YORK—Shifting creative gears to pursue what he | https://entertainment.theonion.com/paul-giamatti-cuts-back-on-acting-to-focus-on-signature-1823828872 | Fashion & Style | Beauty & Fashion | https://entertainment.theonion.com | entertainment | 0.268 | 0.272 |
politics | WASHINGTON—In an effort to guarantee Americans the | https://politics.theonion.com/epa-rolls-back-emissions-standards-to-increase-consumer-1824218100 | Energy & Environment | Soccer | https://politics.theonion.com | politics | 0.406 | 0.102 |
local | JUNEAU, AK—Saying it was clear the parents never i | https://local.theonion.com/family-has-way-too-many-daughters-for-them-not-to-have-1824313914 | World | Soccer | https://local.theonion.com | local | 0.211 | 0.114 |
Evaluate
Now we can ask how accurate our MonkeyLearn classifications were. For instance, how often are texts that appear under the politics subdomain tagged with the Politics label?
What are the possible classifications for our News classifier?
(news_labels <- classified$label_news %>% unique())
## [1] "Science & Technology" "Energy & Environment"
## [3] "Arts & Culture" "TV & Radio"
## [5] "World" "Asia & Pacific"
## [7] "Music" "U.S."
## [9] "Living" "Fashion & Style"
## [11] "Books" "Business"
## [13] "Companies" "Family & Relationships"
## [15] "Home" "Food"
## [17] "Career" "Sports"
## [19] "Baseball" "Football"
## [21] "Basketball" "Movies"
And Events?
(events_labels <- classified$label_events %>% unique())
## [1] "Entertainment" "Sports"
## [3] "Soccer" "Fighting & Martial Arts"
## [5] "Movements & Ideologies" "Politics"
## [7] "Arts & Culture" "Music"
## [9] "Charity & Non-Profits" "Football"
## [11] "Family & Children" "Food & Drink"
## [13] "Baseball" "Beauty & Fashion"
## [15] "Film" "Dance & Theatre"
Now we have to square these labels with our subdomains, i.e., www, politics, local, sports, entertainment, opinion, in order to be able to evaluate how well the classifications matched up.
We’ll collapse certain classifications into others and remove some that we can’t clearly align with one or another subdomain.
Of course, all of this is purely subjective; you could easily argue that “Food” belongs in entertainment rather than local, and there’s no way for us to know whether, for example, articles published at the sports subdomain really are always about sports.
entertainment <-
c("Entertainment",
"Arts & Culture", "Books", "Film", "Music", "TV & Radio")
local <-
c("Local",
"Food", "Living")
politics <-
c("Politics",
"Middle East", "Movements & Ideologies", "U.S.", "World")
sports <-
c("Sports",
"Baseball", "Cycling", "Football", "Soccer")
Now we’ll create a dplyr::case_when() switch statement inside a mutate() to assign a collapsed_label column.
classified_relabeled <- classified %>%
drop_na(label_news, label_events) %>%
rowwise() %>%
mutate(
collapsed_label = case_when(
label_news %in% entertainment || label_events %in% entertainment ~ "entertainment",
label_news %in% local || label_events %in% local ~ "local",
label_news %in% politics || label_events %in% politics ~ "politics",
label_news %in% sports || label_events %in% sports ~ "sports"
)
) %>%
replace_na(list(collapsed_label = "www")) %>% # If none of the above, we give it www
select(subd, collapsed_label, content, everything())
classified_relabeled$collapsed_label <- as.character(classified_relabeled$collapsed_label)
We drop any rows that weren’t classified into one of the entertainment, local, politics, or sports label buckets, and then determine whether the MonkeyLearn classification lined up with the Onion subdomain. If the MonkeyLearn label is the same as the subdomain, we give match a TRUE, and a FALSE otherwise.
classified_relabeled_confusion <-
classified_relabeled %>%
select(subd, collapsed_label) %>%
drop_na(collapsed_label) %>%
rowwise() %>%
mutate(
match = case_when(
identical(collapsed_label, subd) ~ TRUE,
!(identical(collapsed_label, subd)) ~ FALSE
)
)
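(For what it’s worth, since this is just an equality test, the rowwise() case_when() above could be collapsed into a plain vectorized comparison:)

# Equivalent result without rowwise() or case_when():
# comparing two character columns gives the logical match column directly
classified_relabeled %>%
  select(subd, collapsed_label) %>%
  mutate(match = collapsed_label == subd)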
Now we can calculate how “well” the classifier did. First, our overall accuracy compared to chance:
n_subs <- length(levels(factor(classified_relabeled$collapsed_label)))
overall_accuracy <- classified_relabeled_confusion %>%
ungroup() %>%
summarise(
overall_accuracy = sum(match == TRUE) / nrow(.),
chance = 1 / n_subs
)
overall_accuracy %>% kable()
overall_accuracy | chance |
---|---|
0.4272727 | 0.2 |
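(A shortcut worth noting: since match is logical, its mean is the proportion of TRUEs, so the same accuracy falls out of a one-line summarise:)

# Same overall accuracy, using the fact that the mean of a logical
# vector is the proportion of TRUEs
classified_relabeled_confusion %>%
  ungroup() %>%
  summarise(overall_accuracy = mean(match))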
Next, we can break this down by subdomain.
accuracy_per_subd <- classified_relabeled_confusion %>%
ungroup() %>%
group_by(subd) %>%
add_count() %>%
rename(n_total = n) %>%
group_by(collapsed_label, subd) %>%
add_count(sort = TRUE) %>%
rename(n_correct = n) %>%
ungroup() %>%
distinct(subd, collapsed_label, n_total, n_correct, match) %>%
arrange(subd, desc(n_total), desc(n_correct))
accuracy_per_subd %>% kable()
subd | collapsed_label | match | n_total | n_correct |
---|---|---|---|---|
entertainment | entertainment | TRUE | 58 | 44 |
entertainment | politics | FALSE | 58 | 4 |
entertainment | sports | FALSE | 58 | 4 |
entertainment | local | FALSE | 58 | 3 |
entertainment | www | FALSE | 58 | 3 |
local | entertainment | FALSE | 62 | 34 |
local | local | TRUE | 62 | 12 |
local | sports | FALSE | 62 | 10 |
local | politics | FALSE | 62 | 4 |
local | www | FALSE | 62 | 2 |
politics | entertainment | FALSE | 76 | 40 |
politics | politics | TRUE | 76 | 22 |
politics | sports | FALSE | 76 | 12 |
politics | local | FALSE | 76 | 2 |
sports | sports | TRUE | 24 | 16 |
sports | entertainment | FALSE | 24 | 8 |
Let’s find the percent correct for each subdomain by taking just the matches and dividing the number of “correct” matches by the total.
perc_correct <- accuracy_per_subd %>%
filter(match == TRUE) %>%
mutate(
perc = n_correct / n_total
)
perc_correct %>% kable()
subd | collapsed_label | match | n_total | n_correct | perc |
---|---|---|---|---|---|
entertainment | entertainment | TRUE | 58 | 44 | 0.7586207 |
local | local | TRUE | 62 | 12 | 0.1935484 |
politics | politics | TRUE | 76 | 22 | 0.2894737 |
sports | sports | TRUE | 24 | 16 | 0.6666667 |
Now we can plot these percentages as compared to chance:
ggplot(perc_correct) +
geom_bar(aes(collapsed_label %>% map_chr(dobtools::cap_it), perc),
width = 0.5,
stat = "identity",
position = "dodge",
colour = "#212529", fill = "white") +
geom_hline(aes(yintercept = overall_accuracy$chance),
linetype = "dotted") +
ggtitle("The Onion articles, Correctly Classified") +
labs(x = "Subdomain", y = "Percent Correctly MonkeyLearn Classified",
subtitle = "Using the MonkeyLearn API") +
theme_minimal(base_family = "Source Sans Pro")
It makes sense that Local articles have low accuracy, as these were given the most amorphous tags of Local, Food, and Living.
When MonkeyLearn went wrong, where did it go wrong?
We’ll plot the full spread, broken into quadrants by subdomain.
accuracy_per_subd %>%
mutate(
perc = n_correct / n_total
) %>%
ggplot() +
geom_bar(aes(collapsed_label %>%
map_chr(dobtools::cap_it), perc,
colour = factor(match)),
stat = "identity",
width = 0.5,
fill = "white") +
geom_hline(aes(yintercept = overall_accuracy$chance),
linetype = "dotted") +
facet_wrap(~ subd %>% map_chr(dobtools::cap_it)) +
scale_colour_manual(values = c("#27265f", "#e83e8c"), name = "A match?") +
ggtitle("The Onion articles, Full Breakdown by Subdomain") +
labs(x = "Classification", y = "Percent assigned this Classification",
subtitle = "When incorrect, what classification were they assigned?",
colour = "A match?", fill = "A match?") +
theme_minimal(base_family = "Source Sans Pro") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The classifications are best for Sports and Entertainment articles, while entertainment is over-assigned to Local and Politics articles. Of course, our method of segmenting MonkeyLearn classifications into subdomains was artificial from the get-go, so all of this is easily challengeable.
Closing Thoughts
In this toy example we saw that the MonkeyLearn API is flexible enough to classify texts it has never seen before. Of course, we’re not measuring its accuracy rigorously, but our rough proxy for validating how well it classified texts into the correct category suggests that it at least performed above chance. The monkeylearn package interface makes it trivial to feed in a dataframe of texts and receive a nicely formatted dataframe in return.
Future Directions
Thinking of next places to take this, we might consider examining the network relationships between referring subdomains and content subdomains, or doing a text analysis of the words that tend to crop up most frequently in articles classified with a certain label (as compared to how frequently they appear in articles assigned a different label). Another direction would be to train a fake news detector (using a MonkeyLearn module or rolling our own, maybe Naïve Bayes to start) on The Onion and real news articles of similar sizes.
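If we went the roll-our-own route, a toy sketch of the Naïve Bayes idea might look something like this (the headlines and labels below are invented, and the presence/absence factor features are just one simple way to make e1071::naiveBayes() treat the predictors as categorical):

# A toy sketch only: Bernoulli-style Naive Bayes on invented headlines.
# A real detector would train on scraped Onion articles plus real news.
library(tidyverse)
library(tidytext)
library(e1071)

headlines <- tribble(
  ~text,                                                     ~source,
  "Nation's dogs vow to keep barking at absolutely nothing", "onion",
  "Area man unsure why he walked into this room",            "onion",
  "Senate passes budget resolution after lengthy debate",    "real",
  "City council approves funding for new bridge repairs",    "real"
)

# Presence/absence document-term features, as factors so that
# naiveBayes() treats them as categorical rather than Gaussian
features <- headlines %>%
  mutate(doc = row_number()) %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  spread(word, n, fill = 0) %>%
  select(-doc) %>%
  mutate_all(funs(factor(. > 0, levels = c(FALSE, TRUE))))

model <- naiveBayes(x = features, y = factor(headlines$source), laplace = 1)
predict(model, features)  # in-sample only; a real test would hold data out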