Scraping Together a Recipe, Episode I

The Internet is full of amazing content. Like these names of actual recipes. The methodology for getting them follows.

| Recipe Name |
|---|
| Sea-Purb Seafood Pasta |
| Tuna Salad for Grown-ups |
| Easy Ham Balls |
| No Ordinary Meatloaf |
| CindyD’s Somewhat Southern Fried Chicken |
| Crust for Two |
| Butterbeer III |

This is a snapshot-in-time look at where I am with a data analysis project related to building daily menus. In the Food for Thought series we’ve built up and tweaked menus algorithmically such that they meet minimum daily nutritional requirements. Just because they pass those benchmarks, though, doesn’t mean that they’re appetizing.

The idea with scraping recipes is to reduce the “Eat, Pray, Barf” factor by looking at how real recipes are structured and sussing out general patterns or rules in them. For instance, maybe we’d learn that dairy usually makes up more than 1/3 of a dish, or that pork and apples tend to go well together.

I thought allrecipes.com would be likely to live up to its name and provide a good amount of data to work with. After a bit of poking at a few recipes to discern whether there was a pattern in how Allrecipes structures its URLs, I found that all the recipe URLs followed this basic structure: http://allrecipes.com/recipe/<ID>/<NAME-OF-RECIPE>/. Omitting the <NAME-OF-RECIPE> portion seemed to be fine in all cases; http://allrecipes.com/recipe/<ID> would redirect you to http://allrecipes.com/recipe/<ID>/<NAME-OF-RECIPE>/.
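If you want to sanity-check that redirect yourself, one quick way (assuming you have the httr package installed and a live connection) is to request one of the short URLs and inspect where it lands:

library(httr)

resp <- HEAD("http://allrecipes.com/recipe/244940")
resp$url           # the full URL we were redirected to
status_code(resp)  # 200 if the page exists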

I couldn’t figure out much of a pattern behind IDs save that they were always all digits and usually appeared to be between 10000 and 200000. (There’s probably some pattern I’m missing here, but this was good enough to start off with.)
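Before building anything, let’s load the packages we’ll lean on throughout: xml2 and rvest for the scraping itself, stringr and purrr for string-munging and safe iteration, dplyr and tibble for tidying, and knitr for printing tables.

library(xml2)      # read_html()
library(rvest)     # html_nodes(), html_text()
library(stringr)   # str_c(), str_replace_all(), str_split()
library(purrr)     # possibly(), map(), as_vector()
library(dplyr)     # mutate(), bind_rows()
library(tibble)    # tibble(), as_tibble()
library(knitr)     # kable()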

So we know our base URL is going to be "http://allrecipes.com/recipe/".

base_url <- "http://allrecipes.com/recipe/"

Then we need to attach IDs to it. For instance:

grab_urls <- function(base_url, id) {
  id <- as.character(id)             # allow numeric IDs
  recipe_url <- str_c(base_url, id)  # paste each ID onto the base URL
  return(recipe_url)
}

(urls <- grab_urls(base_url, 244940:244950))
##  [1] "http://allrecipes.com/recipe/244940"
##  [2] "http://allrecipes.com/recipe/244941"
##  [3] "http://allrecipes.com/recipe/244942"
##  [4] "http://allrecipes.com/recipe/244943"
##  [5] "http://allrecipes.com/recipe/244944"
##  [6] "http://allrecipes.com/recipe/244945"
##  [7] "http://allrecipes.com/recipe/244946"
##  [8] "http://allrecipes.com/recipe/244947"
##  [9] "http://allrecipes.com/recipe/244948"
## [10] "http://allrecipes.com/recipe/244949"
## [11] "http://allrecipes.com/recipe/244950"

Now that we’ve got URLs to scrape, we’ll need to do the actual scraping.

Since we’re appending some random numbers to the end of our base URL, there’s a good chance some of those pages won’t exist. We want a helper function that will read the HTML of a page if it exists and, if it doesn’t, tell us so without erroring out and exiting our loop. purrr::possibly() will let us do that. It provides a sort of try-catch setup: we try read_url(), and if we can’t read the page, we return “Bad URL” and go on to the next URL.

read_url <- function(url) {
  read_html(url)
}
try_read <- possibly(read_url, otherwise = "Bad URL", quiet = TRUE)

For example,

try_read("foo")
## [1] "Bad URL"

read_html() from the xml2 package will return us the raw HTML for a given page. We’re only interested in the recipe portion of that, so using the Chrome inspector or the SelectorGadget Chrome extension we can figure out the CSS selectors for just the content we’re after.

The recipe’s name gets the CSS class .recipe-summary__h1 and the content gets .checkList__line. So, we’ll pluck everything tagged with those two classes using html_nodes() and extract text we can use with html_text().

get_recipe_name <- function(page) {
  recipe_name <- page %>% 
    html_nodes(".recipe-summary__h1") %>% 
    html_text() 
  return(recipe_name)
}

Let’s test that out on our fourth URL.

urls[4] %>% try_read() %>% get_recipe_name()
## [1] "Banana, Orange, and Ginger Smoothie"

We’ll need a couple of extra steps when it comes to recipe content in order to pare out the stray garbage left over, like \n newlines and bits of boilerplate text.

get_recipe_content <- function(page) {
  recipe <- page %>% 
    html_nodes(".checkList__line") %>% 
    html_text() %>% 
    str_replace_all("ADVERTISEMENT", "") %>% 
    str_replace_all("\n", "") %>% 
    str_replace_all("\r", "") %>% 
    str_replace_all("Add all ingredients to list", "")
  return(recipe)
}

And the content:

tibble(recipe_content = urls[4] %>% try_read() %>% get_recipe_content()) %>% 
  kable()
| recipe_content |
|---|
| 1 orange, peeled |
| 1/2 banana |
| 3 ice cubes |
| 2 teaspoons honey |
| 1/2 teaspoon grated fresh ginger root, or to taste |
| 1/2 cup plain yogurt |

Cool, so we’ve got three functions now: one for reading the content from a URL and turning it into a page, and two for taking that page and grabbing the parts of it that we want. We’ll use those functions in get_recipes(), which will take a vector of URLs and return us a list of recipes. We also include a parameter for how long to wait between requests (sleep), so as to avoid getting booted from allrecipes.com, and one for whether we want the “Bad URL”s included in our results list (append_bad_URLs). If verbose is TRUE, we’ll get a message with the count of 404s we hit and the number of duped recipes.

Note on dupes

Dupes come up because multiple IDs can point to the same recipe, which means that two different URLs can resolve to the same page. I figured there were two routes we could take to decide whether a recipe is a dupe: one, go off of the recipe name alone, or two, go off of the recipe name and its content. By going off of the name, we don’t go through the trouble of pulling in duped recipe content when we think we’ve got a dupe; we just skip the page. Going off of content, and checking whether the recipe content already exists in our list so far, would be safer (we’d only skip recipes that we definitely already have) but slower, because we’d have to run both get_recipe_name() and get_recipe_content(). I went with the faster way: in get_recipes() we just check the name of the recipe we’re on against all the recipe names in our list with if (!recipe_name %in% names(out)).
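For reference, the safer content-based check might look something like the sketch below. (is_duped_content() is a hypothetical helper for illustration; get_recipes() below doesn’t use it.)

# Hypothetical content-based dupe check: TRUE if an identical
# ingredient vector already appears anywhere in our results list
is_duped_content <- function(recipe_content, out) {
  any(map_lgl(out, ~ identical(unname(.x), unname(recipe_content))))
}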

Let’s define a quick helper function for collapsing extra whitespace:

remove_whitespace <- function(str) {
  str <- str %>% str_split(" ") %>% as_vector()  # split into individual words
  str <- str[!str == ""]                         # drop the empty strings left by runs of spaces
  str <- str_c(str, collapse = " ")              # stitch back together, single-spaced
  return(str)
}
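For example:

remove_whitespace("1/2    cup  plain    yogurt")
## [1] "1/2 cup plain yogurt"

And now the main scraping function: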
get_recipes <- function(urls, sleep = 5, verbose = TRUE, append_bad_URLs = TRUE) {
  bad_url_counter <- 0
  duped_recipe_counter <- 0
  
  out <- NULL       # In this case we don't know how long our list will be 
  
  for (url in urls) {
    Sys.sleep(sleep)    # Sleep in between requests to avoid 429 (too many requests)
    recipe_page <- try_read(url)
  
    if (!inherits(recipe_page, c("xml_document", "xml_node"))) { 
      recipe_list <- recipe_page    # If we've got a bad URL, recipe_list will be "Bad URL" because of the otherwise clause
      bad_url_counter <- bad_url_counter + 1
      bad_url_counter <- bad_url_counter + 1
      
      if (append_bad_URLs == TRUE) { out <- append(out, recipe_list) }

    } else {
      recipe_name <- get_recipe_name(recipe_page)
      
      if (!recipe_name %in% names(out)) {
        
        if (verbose == TRUE) { message(recipe_name) }
      
        recipe <- recipe_page %>% 
          get_recipe_content() %>% 
          map(remove_whitespace) %>% as_vector()
        
        recipe_list <- list(tmp_name = recipe) %>% as_tibble()  
        names(recipe_list) <- recipe_name
        
        out <- append(out, recipe_list)
        
      } else {
        duped_recipe_counter <- duped_recipe_counter + 1
        if (verbose == TRUE) {
          message("Skipping recipe we already have")
        }
      }
    }
  }
  if (verbose == TRUE) { 
    message(paste0("Number bad URLs: ", bad_url_counter))
    message(paste0("Number duped recipes: ", duped_recipe_counter))
  }
  
  return(out)
}

Let’s give it a shot with a couple URLs.

(a_couple_recipes <- get_recipes(urls[4:5]))
## Banana, Orange, and Ginger Smoothie
## Alabama-Style White Barbecue Sauce
## Number bad URLs: 0
## Number duped recipes: 0
## $`Banana, Orange, and Ginger Smoothie`
## [1] "1 orange, peeled"                                  
## [2] "1/2 banana"                                        
## [3] "3 ice cubes"                                       
## [4] "2 teaspoons honey"                                 
## [5] "1/2 teaspoon grated fresh ginger root, or to taste"
## [6] "1/2 cup plain yogurt"                              
## 
## $`Alabama-Style White Barbecue Sauce`
## [1] "2 cups mayonnaise"                          
## [2] "1/2 cup apple cider vinegar"                
## [3] "1/4 cup prepared extra-hot horseradish"     
## [4] "2 tablespoons fresh lemon juice"            
## [5] "1 1/2 teaspoons freshly ground black pepper"
## [6] "2 teaspoons prepared yellow mustard"        
## [7] "1 teaspoon kosher salt"                     
## [8] "1/2 teaspoon cayenne pepper"                
## [9] "1/4 teaspoon garlic powder"

Now we’ve got a list of named recipes with one element per ingredient. The next step is tidying. We want to put this list of recipes into dataframe format with one observation per row and one variable per column. Our rows will contain items in the recipe content, each of which we’ll associate with the recipe’s name.

dfize <- function(lst, remove_bad_urls = TRUE) {

  df <- NULL
  if (remove_bad_urls == TRUE) {
    lst <- lst[!lst == "Bad URL"]    # drop the "Bad URL" placeholders
  }

  for (i in seq_along(lst)) {
    this_df <- lst[i] %>% as_tibble()    # one-column tibble named after the recipe
    recipe_name <- names(lst[i])
    names(this_df) <- "ingredients"
    this_df <- this_df %>% 
      mutate(recipe_name = recipe_name)  # attach the recipe name to every ingredient row
    df <- df %>% bind_rows(this_df)
  }
  return(df)
}
a_couple_recipes_df <- dfize(a_couple_recipes)
a_couple_recipes_df %>% kable(format = "html")
| ingredients | recipe_name |
|---|---|
| 1 orange, peeled | Banana, Orange, and Ginger Smoothie |
| 1/2 banana | Banana, Orange, and Ginger Smoothie |
| 3 ice cubes | Banana, Orange, and Ginger Smoothie |
| 2 teaspoons honey | Banana, Orange, and Ginger Smoothie |
| 1/2 teaspoon grated fresh ginger root, or to taste | Banana, Orange, and Ginger Smoothie |
| 1/2 cup plain yogurt | Banana, Orange, and Ginger Smoothie |
| 2 cups mayonnaise | Alabama-Style White Barbecue Sauce |
| 1/2 cup apple cider vinegar | Alabama-Style White Barbecue Sauce |
| 1/4 cup prepared extra-hot horseradish | Alabama-Style White Barbecue Sauce |
| 2 tablespoons fresh lemon juice | Alabama-Style White Barbecue Sauce |
| 1 1/2 teaspoons freshly ground black pepper | Alabama-Style White Barbecue Sauce |
| 2 teaspoons prepared yellow mustard | Alabama-Style White Barbecue Sauce |
| 1 teaspoon kosher salt | Alabama-Style White Barbecue Sauce |
| 1/2 teaspoon cayenne pepper | Alabama-Style White Barbecue Sauce |
| 1/4 teaspoon garlic powder | Alabama-Style White Barbecue Sauce |

Great, so we’ve got a tidy dataframe that we can start to get some useful data out of. Next up, we’ll extract the relevant portions, units, and ingredient names from the beautiful soup of these recipe ingredients.
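As a very rough preview of that next step (a sketch only, using a naive regex that the real extraction will need to improve on), we might peel the leading quantity off of each ingredient like so:

# Naive first pass: grab the leading digits/fractions as the "amount"
# and treat whatever is left over as the ingredient description
a_couple_recipes_df %>% 
  mutate(amount = str_extract(ingredients, "^[0-9][0-9/ ]*"),
         ingredient = str_trim(str_remove(ingredients, "^[0-9][0-9/ ]*")))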