Scraping Together a Recipe, Episode I
- 2018/02/25
- 8 min read
The Internet is full of amazing content, like these names of actual recipes. The methodology for getting these follows below.
| Recipe Name |
|---|
| Sea-Purb Seafood Pasta |
| Tuna Salad for Grown-ups |
| Easy Ham Balls |
| No Ordinary Meatloaf |
| CindyD’s Somewhat Southern Fried Chicken |
| Crust for Two |
| Butterbeer III |
This is a snapshot-in-time look at where I am with a data analysis project related to building daily menus. In the Food for Thought series we’ve built up and tweaked menus algorithmically such that they meet minimum daily nutritional requirements. Just because they pass those benchmarks, though, doesn’t mean that they’re appetizing.
The idea with scraping recipes is to reduce the “Eat, Pray, Barf” factor by looking at how real menus are structured and sussing out general patterns or rules in them. For instance, maybe we could learn that usually more than 1/3 of a dish should be dairy, or that pork and apples tend to go well together.
I thought allrecipes.com would be likely to live up to its name and provide a good amount of data to work with. After a bit of poking around a few recipes to discern whether there was a pattern in how Allrecipes structures its URLs, I found that all the recipe URLs followed this basic structure: `http://allrecipes.com/recipe/<ID>/<NAME-OF-RECIPE>/`. Omitting the `<NAME-OF-RECIPE>` parameter seemed to be fine in all cases; `http://allrecipes.com/recipe/<ID>` would redirect you to `http://allrecipes.com/recipe/<ID>/<NAME-OF-RECIPE>/`.
I couldn’t figure out much of a pattern behind the `ID`s save that they were always all digits and usually appeared to be between 10000 and 200000. (There’s probably some pattern I’m missing here, but this was good enough to start off with.)
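If you want to sanity-check an ID before scraping a whole range, one option is a quick status-code probe. This is just a sketch and an assumption on my part: the post doesn’t use `httr`, and `id_exists()` is a made-up helper name:

```r
library(httr)     # assumed here; not used in the post itself
library(stringr)

# Hypothetical helper: does a recipe ID resolve to a real page? httr
# follows redirects by default, so a live ID should come back as a 200.
id_exists <- function(id) {
  resp <- HEAD(str_c("http://allrecipes.com/recipe/", id))
  status_code(resp) == 200
}

id_exists(244940)  # expect TRUE if this recipe is still live
```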
So we know our base URL is going to be `"http://allrecipes.com/recipe/"`.
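(One housekeeping note: the post never shows its setup chunk, but everything below leans on a few packages. This is my best reconstruction of the likely imports.)

```r
library(tidyverse)  # stringr, purrr, dplyr, tibble: str_c(), possibly(), map(), etc.
library(rvest)      # html_nodes(), html_text(); re-exports xml2's read_html()
library(knitr)      # kable()
```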
```r
base_url <- "http://allrecipes.com/recipe/"
```
Then we need to attach IDs to it, so, for instance:
```r
grab_urls <- function(base_url, id) {
  id <- as.character(id)
  recipe_url <- str_c(base_url, id)  # str_c() is vectorized over id
  return(recipe_url)
}
```
```r
(urls <- grab_urls(base_url, 244940:244950))
```

```
## [1] "http://allrecipes.com/recipe/244940"
## [2] "http://allrecipes.com/recipe/244941"
## [3] "http://allrecipes.com/recipe/244942"
## [4] "http://allrecipes.com/recipe/244943"
## [5] "http://allrecipes.com/recipe/244944"
## [6] "http://allrecipes.com/recipe/244945"
## [7] "http://allrecipes.com/recipe/244946"
## [8] "http://allrecipes.com/recipe/244947"
## [9] "http://allrecipes.com/recipe/244948"
## [10] "http://allrecipes.com/recipe/244949"
## [11] "http://allrecipes.com/recipe/244950"
```
Now that we’ve got URLs to scrape, we’ll need to do the actual scraping.
Since we’re appending some random numbers to the end of our base URL, there’s a good chance some of those pages won’t exist. We want a helper function that can try to read the HTML of a page if it exists and, if the page doesn’t exist, tell us so without erroring out and exiting our loop. `purrr::possibly()` will let us do that. It provides a sort of try-catch setup: we try to `read_url()`, but if we can’t, we return “Bad URL” and go on to the next URL.
```r
read_url <- function(url) {
  page <- read_html(url)
  return(page)
}

try_read <- possibly(read_url, otherwise = "Bad URL", quiet = TRUE)
```
For example,
try_read("foo")
## [1] "Bad URL"
`read_html()` from the `xml2` package will return us the raw HTML for a given page. We’re only interested in the recipe portion of that, so using the Chrome inspector or the SelectorGadget Chrome extension we can figure out which CSS selectors mark the content itself.
The recipe’s name gets the CSS class `.recipe-summary__h1` and the content gets `.checkList__line`. So, we’ll pluck everything tagged with those two classes using `html_nodes()` and return text we can use with `html_text()`.
```r
get_recipe_name <- function(page) {
  recipe_name <- page %>%
    html_nodes(".recipe-summary__h1") %>%
    html_text()
  return(recipe_name)
}
```
Let’s test that out on our fourth URL.
```r
urls[4] %>% try_read() %>% get_recipe_name()
```

```
## [1] "Banana, Orange, and Ginger Smoothie"
```
We’ll need a couple of extra steps when it comes to the recipe content in order to pare out all the stray garbage left over, like `\n` newlines.
```r
get_recipe_content <- function(page) {
  recipe <- page %>%
    html_nodes(".checkList__line") %>%
    html_text() %>%
    str_replace_all("ADVERTISEMENT", "") %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_replace_all("Add all ingredients to list", "")
  return(recipe)
}
```
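As an aside, those four `str_replace_all()` calls could also be collapsed into a single pass by alternating the patterns with `|`. A sketch of the equivalent, with `clean_content()` as a made-up name; the post itself keeps the chained calls:

```r
# Equivalent one-pass cleanup: "\n" and "\r" match newlines and carriage
# returns, and the literal junk strings are alternated into one pattern.
clean_content <- function(text) {
  str_replace_all(text, "ADVERTISEMENT|\n|\r|Add all ingredients to list", "")
}
```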
And the content:
```r
tibble(
  recipe_content = urls[4] %>%
    try_read() %>%
    get_recipe_content()) %>%
  kable()
```
| recipe_content |
|---|
| 1 orange, peeled |
| 1/2 banana |
| 3 ice cubes |
| 2 teaspoons honey |
| 1/2 teaspoon grated fresh ginger root, or to taste |
| 1/2 cup plain yogurt |
Cool, so we’ve got three functions now: one for reading the content from a URL and turning it into a `page`, and two for taking that `page` and grabbing the parts of it that we want. We’ll use those functions in `get_recipes()`, which will take a vector of URLs and return us a list of recipes. We also include parameters for how long to wait in between requests (`sleep`), so as to avoid getting booted from allrecipes.com, and for whether we want the “Bad URL”s included in our results list or not (`append_bad_URLs`). If `verbose` is TRUE, we’ll get a message with the number of 404s we hit and the number of duped recipes.
Note on dupes
Dupes come up because multiple IDs can point to the same recipe, which means that two different URLs could resolve to the same page. I figured there were two routes we could go to see whether a recipe is a dupe or not: one, go just off of the recipe name, or two, go off of the recipe name and content. By going off of the name, we don’t go through the trouble of pulling in duped recipe content if we think we’ve got a dupe; we just skip it. Going off of content and checking whether the recipe content already exists in our list so far would be safer (we’d only skip recipes that we definitely already have), but slower, because we’d have to run both `get_recipe_name()` and `get_recipe_content()`. I went with the faster way; in `get_recipes()` we just check the recipe name we’re on against all the recipe names in our list with `if (!recipe_name %in% names(out))`.
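For reference, the content-based route not taken might look something like the sketch below. `is_content_dupe()` is a hypothetical helper name, not part of the post’s code, and the whitespace cleanup is elided for brevity:

```r
# Hypothetical sketch of the slower, content-based dupe check: pull the
# candidate page's ingredients and see whether an identical vector is
# already sitting in our results list.
is_content_dupe <- function(recipe_page, out) {
  content <- get_recipe_content(recipe_page)
  any(map_lgl(out, ~ identical(.x, content)))  # FALSE when out is still NULL
}
```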
Let’s define a quick helper function for removing whitespace:
```r
# Split on spaces, drop the empty strings, and stitch the pieces back
# together with single spaces.
remove_whitespace <- function(str) {
  str <- str %>% str_split(" ") %>% as_vector()
  str <- str[!str == ""]
  str <- str_c(str, collapse = " ")
  return(str)
}
```
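Just to illustrate what that buys us (this example call isn’t in the original post):

```r
remove_whitespace("1 1/2    teaspoons   freshly ground black pepper")
```

```
## [1] "1 1/2 teaspoons freshly ground black pepper"
```

Now, the main function: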
```r
get_recipes <- function(urls, sleep = 5, verbose = TRUE, append_bad_URLs = TRUE) {
  bad_url_counter <- 0
  duped_recipe_counter <- 0

  out <- NULL  # In this case we don't know how long our list will be

  for (url in urls) {
    Sys.sleep(sleep)  # Sleep in between requests to avoid 429 (too many requests)
    recipe_page <- try_read(url)

    if (!inherits(recipe_page, c("xml_document", "xml_node"))) {
      # If we've got a bad URL, recipe_page will be the string "Bad URL"
      # (from possibly()'s otherwise clause) rather than an xml document
      recipe_list <- recipe_page
      bad_url_counter <- bad_url_counter + 1
      if (append_bad_URLs == TRUE) { out <- append(out, recipe_list) }

    } else {
      recipe_name <- get_recipe_name(recipe_page)

      if (!recipe_name %in% names(out)) {
        if (verbose == TRUE) { message(recipe_name) }
        recipe <- recipe_page %>%
          get_recipe_content() %>%
          map(remove_whitespace) %>%
          as_vector()
        recipe_list <- list(tmp_name = recipe) %>% as_tibble()
        names(recipe_list) <- recipe_name
        out <- append(out, recipe_list)

      } else {
        duped_recipe_counter <- duped_recipe_counter + 1
        if (verbose == TRUE) {
          message("Skipping recipe we already have")
        }
      }
    }
  }

  if (verbose == TRUE) {
    message(paste0("Number bad URLs: ", bad_url_counter))
    message(paste0("Number duped recipes: ", duped_recipe_counter))
  }

  return(out)
}
```
Let’s give it a shot with a couple URLs.
```r
(a_couple_recipes <- get_recipes(urls[4:5]))
```

```
## Banana, Orange, and Ginger Smoothie
## Alabama-Style White Barbecue Sauce
## Number bad URLs: 0
## Number duped recipes: 0
## $`Banana, Orange, and Ginger Smoothie`
## [1] "1 orange, peeled"
## [2] "1/2 banana"
## [3] "3 ice cubes"
## [4] "2 teaspoons honey"
## [5] "1/2 teaspoon grated fresh ginger root, or to taste"
## [6] "1/2 cup plain yogurt"
##
## $`Alabama-Style White Barbecue Sauce`
## [1] "2 cups mayonnaise"
## [2] "1/2 cup apple cider vinegar"
## [3] "1/4 cup prepared extra-hot horseradish"
## [4] "2 tablespoons fresh lemon juice"
## [5] "1 1/2 teaspoons freshly ground black pepper"
## [6] "2 teaspoons prepared yellow mustard"
## [7] "1 teaspoon kosher salt"
## [8] "1/2 teaspoon cayenne pepper"
## [9] "1/4 teaspoon garlic powder"
```
Now we’ve got a list of named recipes with one element per ingredient. The next step is tidying. We want to put this list of recipes into dataframe format, with one observation per row and one variable per column. Our rows will contain items of recipe content, each of which we’ll associate with the recipe’s name.
```r
dfize <- function(lst, remove_bad_urls = TRUE) {
  df <- NULL
  if (remove_bad_urls == TRUE) {
    lst <- lst[!lst == "Bad URL"]
  }
  for (i in seq_along(lst)) {
    this_df <- lst[i] %>% as_tibble()
    recipe_name <- names(lst[i])
    names(this_df) <- "ingredients"
    this_df <- this_df %>%
      mutate(recipe_name = recipe_name)
    df <- df %>% bind_rows(this_df)
  }
  return(df)
}
```
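As an aside, the same reshaping can probably be done more compactly with `tibble::enframe()` and `tidyr::unnest()`, both loaded with the tidyverse. This is a sketch, not the post’s approach, and `dfize2()` is a made-up name; note that it puts `recipe_name` in the first column rather than the second:

```r
# Hypothetical compact alternative: enframe the named list into a
# two-column tibble, then unnest each ingredient vector into its own rows.
dfize2 <- function(lst) {
  lst[!map_lgl(lst, ~ identical(.x, "Bad URL"))] %>%  # drop the bad URLs
    enframe(name = "recipe_name", value = "ingredients") %>%
    unnest(ingredients)
}
```

Either way, back to the post’s `dfize()`: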
```r
a_couple_recipes_df <- dfize(a_couple_recipes)
a_couple_recipes_df %>% kable(format = "html")
```
| ingredients | recipe_name |
|---|---|
| 1 orange, peeled | Banana, Orange, and Ginger Smoothie |
| 1/2 banana | Banana, Orange, and Ginger Smoothie |
| 3 ice cubes | Banana, Orange, and Ginger Smoothie |
| 2 teaspoons honey | Banana, Orange, and Ginger Smoothie |
| 1/2 teaspoon grated fresh ginger root, or to taste | Banana, Orange, and Ginger Smoothie |
| 1/2 cup plain yogurt | Banana, Orange, and Ginger Smoothie |
| 2 cups mayonnaise | Alabama-Style White Barbecue Sauce |
| 1/2 cup apple cider vinegar | Alabama-Style White Barbecue Sauce |
| 1/4 cup prepared extra-hot horseradish | Alabama-Style White Barbecue Sauce |
| 2 tablespoons fresh lemon juice | Alabama-Style White Barbecue Sauce |
| 1 1/2 teaspoons freshly ground black pepper | Alabama-Style White Barbecue Sauce |
| 2 teaspoons prepared yellow mustard | Alabama-Style White Barbecue Sauce |
| 1 teaspoon kosher salt | Alabama-Style White Barbecue Sauce |
| 1/2 teaspoon cayenne pepper | Alabama-Style White Barbecue Sauce |
| 1/4 teaspoon garlic powder | Alabama-Style White Barbecue Sauce |
Great, so we’ve got a tidy dataframe that we can start to get some useful data out of. Next up, we’ll extract the relevant portions, units, and ingredient names from the beautiful soup of these recipe ingredients.