Scraping Together a Recipe, Episode II
- 2018/25/02
- 12 min read
One of the goals here is to see what portion of a menu tends to be devoted to, say, meat or spices or a word that appears in the recipe name, etc. In order to answer that, we’ll need to extract portion names and portion sizes from the text. That would be pretty simple with a fixed list of portion names (“gram”, “lb”) if portion sizes were always just a single number.
But, as it happens, portion sizes don’t usually consist of just one number. There are a few hurdles:
- Complex fractions: "2 1/3 cups of flour" should become "2.3333 cups of flour"
- Multiples of the same item: "4 (12oz) bottles of beer" should become "48 oz of beer"
- Ranges: "6-7 tomatoes" should become "6.5 tomatoes"
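As a quick sanity check on the arithmetic behind those three cases, here is a minimal base R sketch. It only restates the targets above as plain calculations; it is not the parsing code itself.

```r
# Complex fraction: whole number plus fraction
complex_fraction <- 2 + 1/3        # "2 1/3 cups"

# Multiple: count times the size of a single item
multiple <- 4 * 12                 # "4 (12oz) bottles"

# Range: midpoint of the two endpoints
range_midpoint <- mean(c(6, 7))    # "6-7 tomatoes"

print(round(complex_fraction, 4))
print(multiple)
print(range_midpoint)
```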
Here is a fake recipe to illustrate some of those cases. It’s an artificially “hard” recipe to straighten out because I’ve built it up line by line as I needed to test functions against fake data.
some_recipes_tester <- list(ingredients = vector()) %>% as_tibble()
some_recipes_tester[1, ] <- "1.2 ounces or maybe pounds of something with a decimal"
some_recipes_tester[2, ] <- "3 (14 ounce) cans o' beef broth"
some_recipes_tester[3, ] <- "around 4 or 5 eels"
some_recipes_tester[4, ] <- "5-6 cans spam"
some_recipes_tester[5, ] <- "11 - 46 tbsp of sugar"
some_recipes_tester[6, ] <- "1/3 to 1/2 of a ham"
some_recipes_tester[7, ] <- "5 1/2 pounds of apples"
some_recipes_tester[8, ] <- "4g cinnamon"
some_recipes_tester[9, ] <- "about 17 fluid ounces of wine"
some_recipes_tester[10, ] <- "4-5 cans of 1/2 caf coffee"
some_recipes_tester[11, ] <- "3 7oz figs with 1/3 rind"
some_recipes_tester %>% kable(format = "html")
ingredients |
---|
1.2 ounces or maybe pounds of something with a decimal |
3 (14 ounce) cans o’ beef broth |
around 4 or 5 eels |
5-6 cans spam |
11 - 46 tbsp of sugar |
1/3 to 1/2 of a ham |
5 1/2 pounds of apples |
4g cinnamon |
about 17 fluid ounces of wine |
4-5 cans of 1/2 caf coffee |
3 7oz figs with 1/3 rind |
Rather than reaching for something as sophisticated as a conditional random field to get around these problems, I started off by writing a few rules of thumb.
We’ll worry first about how we find and extract numbers and next about how we’ll add, multiply, or average them as necessary.
Extracting Numbers
We’ll need a few regexes to extract our numbers.
portions_reg will match any number, even if it contains a decimal or a slash, which will be important for capturing complex fractions.
multiplier_reg covers all cases of numbers that might need to be multiplied in the Allrecipes data, because these are always separated by " (", whereas multiplier_reg_looser is a more loosely defined case matching numbers separated by just " ".
# Match any number, even if it has a decimal or slash in it
portions_reg <- "[[:digit:]]+\\.*[[:digit:]]*+\\/*[[:digit:]]*"
# Match numbers separated by " (" as in "3 (5 ounce) cans of broth" for multiplying
multiplier_reg <- "[[:digit:]]+ \\(+[[:digit:]]"
# Match numbers separated by " "
multiplier_reg_looser <- "[0-9]+ +[0-9]"
Now the multiplier_reg
regexes will allow us to detect that we’ve got something that needs to be multiplied, like "4 (12 oz) hams"
or a fraction like "1 2/3 pound of butter"
. If we do, then we’ll multiply or add those numbers as appropriate. The only_mult_after_paren
parameter is something I put in that is specific to Allrecipes. On Allrecipes, it seems that if we do have multiples, they’ll always be of the form “number_of_things (quantity_of_single_thing)”. There are always parentheses around quantity_of_single_thing. If we’re only using Allrecipes data, that gives us some more security that we’re only multiplying quantities that actually should be multiplied. If we want to make this extensible in the future we’d want to set only_mult_after_paren
to FALSE to account for cases like “7 4oz cans of broth”.
We use str_extract_all() to check that our regexes are grabbing the parts of a string that we’ll need to do computation on.
str_extract_all("3 1/4 lb patties", portions_reg)
## [[1]]
## [1] "3" "1/4"
And check that multiplier_reg picks up a multiple:
str_extract_all("3 (4 pound patties) for grilling", multiplier_reg)
## [[1]]
## [1] "3 (4"
We’ll clean that up by passing it to portions_reg
to just grab the numbers:
str_extract_all("3 (4 pound patties) for grilling", multiplier_reg) %>% str_extract_all(portions_reg)
## [[1]]
## [1] "3" "4"
Finally, let’s make sure that our stricter multiplier regex doesn’t want to multiply something that shouldn’t be multiplied.
str_extract_all("3 or 4 lb patties", multiplier_reg)
## [[1]]
## character(0)
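To see how the two multiplier patterns differ, here is a small sketch using base R's grepl() in place of str_detect(), so that it runs without any packages. The looser pattern is written with a plain space between the numbers, and the test strings are taken from examples in this post.

```r
multiplier_reg <- "[[:digit:]]+ \\(+[[:digit:]]"
multiplier_reg_looser <- "[0-9]+ +[0-9]"

examples <- c(
  "3 (14 ounce) cans o' beef broth",  # Allrecipes-style multiple
  "7 4oz cans of broth",              # multiple without parentheses
  "4 1/2 steaks",                     # complex fraction
  "3 or 4 lb patties"                 # a range, not a multiple
)

strict_hits <- grepl(multiplier_reg, examples)
loose_hits <- grepl(multiplier_reg_looser, examples)

print(data.frame(examples, strict_hits, loose_hits))
```

The strict pattern only fires on the parenthesised form, while the looser one fires on bare adjacent numbers, which is also how a complex fraction such as "4 1/2" gets noticed.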
Okay, now to the multiplying and adding.
First, let’s consider complex fractions. Off the bat, we know we’ll need a way to turn a single fraction into a decimal form. We keep the results as character strings for now and turn them into numeric later down the pipe.
frac_to_dec <- function(e) {
if (length(e) == 0) { # If NA because there are no numbers, make the portion 0
out <- 0
} else {
out <- parse(text = e) %>% eval() %>% as.character()
}
return(out)
}
eval(), which is what does the work inside frac_to_dec(), only returns the value of the last string in a vector rather than one value per element, so as a workaround I put it into a helper that will turn all fractions into decimal strings:
map_frac_to_dec <- function(e) {
  # map_chr() already visits every element, so no explicit loop is needed
  out <- e %>% map_chr(frac_to_dec)
  return(out)
}
For example:
map_frac_to_dec(c("1/2", "1/8", "1/3"))
## [1] "0.5" "0.125" "0.333333333333333"
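The reason a helper is needed can be checked directly: evaluating a parsed vector of strings gives back only the final value. The sketch below shows that, along with one possible alternative that avoids eval() by splitting on the slash. The name frac_to_num is made up for this illustration and is not part of the original code.

```r
# eval() on several parsed expressions returns only the last result
last_only <- eval(parse(text = c("1/2", "1/8", "1/3")))

# Hypothetical alternative: split "a/b" and divide, no eval() involved
frac_to_num <- function(x) {
  parts <- as.numeric(strsplit(x, "/", fixed = TRUE)[[1]])
  if (length(parts) == 2) parts[1] / parts[2] else parts[1]
}

all_values <- vapply(c("1/2", "1/8", "3"), frac_to_num, numeric(1),
                     USE.NAMES = FALSE)

print(last_only)   # a single number, one third
print(all_values)  # 0.5, 0.125 and 3
```

Splitting by hand also means that text scraped from a web page is never executed as R code, which is a reasonable precaution when the input is not under our control.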
Cool, so for a given ingredient we’ll need to look for numbers that occur next to other numbers, and then add complex fractions and multiply multiples.
If we’ve got two numbers next to each other and the second number evaluates to a decimal less than 1, we’ve got a complex fraction. For example, if we extract the numbers from "4 1/2 loaves of bread" and turn all fractions among them into decimals, we end up with "4" and "0.5". We know 0.5 is less than 1, so we’ve got a complex fraction on our hands. We need to add 4 + 0.5 to end up with 4.5 loaves of bread.
It’s true that the function below doesn’t address the issue of having both a complex fraction and multiples in a recipe. That would look like "3 (2 1/4 inch) blocks of cheese." I haven’t run into that issue too much but it certainly could use a workaround.
multiply_or_add_portions <- function(e) {
if (length(e) == 0) {
e <- 0
} else if (length(e) > 1) {
if (e[2] < 1) { # If our second element is a fraction, we know this is a complex fraction so we add the two
e <- e[1:2] %>% reduce(`+`)
} else { # Otherwise, we multiply them
e <- e[1:2] %>% reduce(`*`)
}
}
return(e)
}
multiply_or_add_portions(c(4, 0.5))
## [1] 4.5
multiply_or_add_portions(c(4, 5))
## [1] 20
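One possible workaround for a multiple that wraps a complex fraction, such as "3 (2 1/4 inch) blocks of cheese", is to look at a third number as well. This is only a sketch: multiply_and_add_portions is a hypothetical name, and the rule that a trailing value below 1 belongs to the number before it is an assumption that will not hold for every ingredient line.

```r
multiply_and_add_portions <- function(e) {
  if (length(e) == 0) {
    return(0)
  }
  if (length(e) >= 3 && e[2] >= 1 && e[3] < 1) {
    # count * (whole + fraction), e.g. 3 * (2 + 0.25)
    return(e[1] * (e[2] + e[3]))
  }
  if (length(e) >= 2) {
    # Same rule as multiply_or_add_portions: add a fraction, multiply otherwise
    return(if (e[2] < 1) e[1] + e[2] else e[1] * e[2])
  }
  e
}

cheese <- multiply_and_add_portions(c(3, 2, 0.25))
loaves <- multiply_and_add_portions(c(4, 0.5))
hams <- multiply_and_add_portions(c(4, 5))

print(c(cheese, loaves, hams))  # 6.75, 4.5 and 20
```

Note that this rule would misfire on the test line "3 7oz figs with 1/3 rind", where the 1/3 has nothing to do with the 7, so a sturdier fix would also need to check that the numbers sit next to each other in the text.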
The next function, get_mult_add_portion(), will allow us to add a new column to our dataframe called mult_add_portion
. If we’ve done any multiplying or adding of numbers, we’ll have a value greater than 0 there, and 0 otherwise.
get_mult_add_portion <- function(e, only_mult_after_paren = FALSE) {
  has_paren_mult <- str_detect(e, multiplier_reg)        # e.g. "3 (14 ounce)"
  has_loose_mult <- str_detect(e, multiplier_reg_looser) # e.g. "7 4oz" or "4 1/2"
  # The strict form always counts; the looser form only counts if we
  # don't insist on a parenthesis
  if (has_paren_mult | (has_loose_mult & only_mult_after_paren == FALSE)) {
    out <- e %>% str_extract_all(portions_reg) %>%
      map(map_frac_to_dec) %>%
      map(as.numeric) %>%
      map_dbl(multiply_or_add_portions) %>%
      round(digits = 2)
  } else {
    out <- 0
  }
  return(out)
}
get_mult_add_portion("4 1/2 steaks")
## [1] 4.5
get_mult_add_portion("4 (5 lb melons)")
## [1] 20
Ranges
Finally, let’s deal with ranges. If two numbers are separated by a "to", an "or", or a "-", like “4-5 teaspoons of sugar”, we know that this is a range. We’ll take the average of those two numbers.
We’ll add a new column to our dataframe called range_portion
for the result of any range calculations. If we don’t have a range, just like mult_add_portion
, we set this value to 0.
to_reg <- "([0-9])(( to ))(([0-9]))"
or_reg <- "([0-9])(( or ))(([0-9]))"
dash_reg_1 <- "([0-9])((-))(([0-9]))"
dash_reg_2 <- "([0-9])(( - ))(([0-9]))"
First, a couple of helpers. determine_if_range() tells us whether an ingredient matches any of the range patterns above.
determine_if_range <- function(ingredients) {
if (str_detect(ingredients, pattern = to_reg) |
str_detect(ingredients, pattern = or_reg) |
str_detect(ingredients, pattern = dash_reg_1) |
str_detect(ingredients, pattern = dash_reg_2)) {
contains_range <- TRUE
} else {
contains_range <- FALSE
}
return(contains_range)
}
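Since the four patterns are only ever used together, they could also be collapsed into a single alternation. The sketch below uses base R's grepl(), which is vectorised, so it can check a whole column at once; any_range_reg is a name introduced here for illustration.

```r
to_reg <- "([0-9])(( to ))(([0-9]))"
or_reg <- "([0-9])(( or ))(([0-9]))"
dash_reg_1 <- "([0-9])((-))(([0-9]))"
dash_reg_2 <- "([0-9])(( - ))(([0-9]))"

# One pattern that matches if any of the four does
any_range_reg <- paste(c(to_reg, or_reg, dash_reg_1, dash_reg_2), collapse = "|")

lines_to_check <- c(
  "5-6 cans spam",
  "11 - 46 tbsp of sugar",
  "around 4 or 5 eels",
  "1/3 to 1/2 of a ham",
  "1.2 ounces or maybe pounds of something with a decimal"
)

is_range <- grepl(any_range_reg, lines_to_check)
print(is_range)
```

The last line contains an "or" but no number after it, so it is correctly left out.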
And, we’ll want to be able to get the mean of the first two elements in a numeric vector.
get_portion_means <- function(e) {
if (length(e) == 0) {
e <- 0 # NA to 0
} else if (length(e) > 1) {
e <- mean(e[1:2])
}
return(e)
}
get_ranges <- function(e) {
if (determine_if_range(e) == TRUE) {
out <- str_extract_all(e, portions_reg) %>%
map(str_split, pattern = " to ", simplify = FALSE) %>%
map(str_split, pattern = " - ", simplify = FALSE) %>%
map(str_split, pattern = "-", simplify = FALSE) %>%
map(map_frac_to_dec) %>%
map(as.numeric) %>%
map_dbl(get_portion_means) %>% round(digits = 2)
} else {
out <- 0
}
return(out)
}
Let’s make sure we get the average.
get_ranges("7 to 21 peaches")
## [1] 14
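Because get_portion_means() only looks at the first two numbers, any later number in the line is ignored. Here is a small self-contained check of that behaviour, with the helper copied from above and the numbers written out by hand rather than extracted by regex.

```r
get_portion_means <- function(e) {
  if (length(e) == 0) {
    e <- 0
  } else if (length(e) > 1) {
    e <- mean(e[1:2])
  }
  return(e)
}

# "4-5 cans of 1/2 caf coffee" yields the numbers 4, 5 and 0.5
coffee <- get_portion_means(c(4, 5, 0.5))

# "1/3 to 1/2 of a ham" yields 1/3 and 1/2
ham <- round(get_portion_means(c(1/3, 1/2)), digits = 2)

print(coffee)  # 4.5
print(ham)     # 0.42
```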
At the end of the day, we want to end up with a single number describing how much of our recipe item we want. So, let’s put all that together into one function. Either range_portion
or mult_add_portion
will always be 0, so we add them together to get our final portion size. If we neither need to get a range nor multiply or add numbers, we’ll just take whatever the first number is in there.
get_portion_values <- function(df, only_mult_after_paren = FALSE, round_digits = 2) {
df <- df %>%
mutate(
range_portion = map_dbl(ingredients, get_ranges),
mult_add_portion = map_dbl(ingredients, get_mult_add_portion, only_mult_after_paren = only_mult_after_paren),
portion = ifelse(range_portion == 0 & mult_add_portion == 0,
str_extract_all(ingredients, portions_reg) %>%
map(map_frac_to_dec) %>%
map(as.numeric) %>%
map_dbl(first),
range_portion + mult_add_portion) # Otherwise, take either the range or the multiplied value
) %>%
map_at(round, digits = round_digits,
.at = c("range_portion", "mult_add_portion", "portion")) %>%
as_tibble()
return(df)
}
Let’s see what that looks like in practice.
some_recipes_tester %>% get_portion_values() %>% kable(format = "html")
ingredients | range_portion | mult_add_portion | portion |
---|---|---|---|
1.2 ounces or maybe pounds of something with a decimal | 0.00 | 0.0 | 1.20 |
3 (14 ounce) cans o’ beef broth | 0.00 | 42.0 | 42.00 |
around 4 or 5 eels | 4.50 | 0.0 | 4.50 |
5-6 cans spam | 5.50 | 0.0 | 5.50 |
11 - 46 tbsp of sugar | 28.50 | 0.0 | 28.50 |
1/3 to 1/2 of a ham | 0.42 | 0.0 | 0.42 |
5 1/2 pounds of apples | 0.00 | 5.5 | 5.50 |
4g cinnamon | 0.00 | 0.0 | 4.00 |
about 17 fluid ounces of wine | 0.00 | 0.0 | 17.00 |
4-5 cans of 1/2 caf coffee | 4.50 | 0.0 | 4.50 |
3 7oz figs with 1/3 rind | 0.00 | 21.0 | 21.00 |
Looks pretty solid.
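The way the three columns combine can be shown on a few rows of the table above. In this sketch the values are typed in by hand, and first_number stands in for the first number that portions_reg would extract; it is not a column in the real dataframe.

```r
toy <- data.frame(
  ingredients = c(
    "1.2 ounces or maybe pounds of something with a decimal",
    "3 (14 ounce) cans o' beef broth",
    "around 4 or 5 eels"
  ),
  first_number = c(1.2, 3, 4),
  range_portion = c(0, 0, 4.5),
  mult_add_portion = c(0, 42, 0)
)

# Fall back to the first number only when neither rule produced a value
toy$portion <- ifelse(
  toy$range_portion == 0 & toy$mult_add_portion == 0,
  toy$first_number,
  toy$range_portion + toy$mult_add_portion
)

print(toy$portion)  # 1.2, 42 and 4.5
```

Adding the two columns relies on at most one of them being non-zero for any given line. If a line ever triggered both rules, the sum would no longer be a meaningful portion size.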
Extracting Measurement Units
Now onto easier waters: portion names. You can check out /scripts/scrape/import/get_measurement_types.R
if you’re interested in the steps I took to find some usual portion names and create an abbreviation dictionary, abbrev_dict
. What we also do there is create measures_collapsed
which is a single vector of all portion names separated by “|” so we can find all the portion names that might occur in a given item.
measures_collapsed
## [1] "[[:digit:]]oz |[[:digit:]]pt |[[:digit:]]lb |[[:digit:]]kg |[[:digit:]]g |[[:digit:]]l |[[:digit:]]dl |[[:digit:]]ml |[[:digit:]]tbsp |[[:digit:]]tsp |[[:digit:]]fluid oz |[[:digit:]]gal |[[:digit:]]qt |[[:digit:]]cup |[[:digit:]] oz |[[:digit:]] pt |[[:digit:]] lb |[[:digit:]] kg |[[:digit:]] g |[[:digit:]] l |[[:digit:]] dl |[[:digit:]] ml |[[:digit:]] tbsp |[[:digit:]] tsp |[[:digit:]] fluid oz |[[:digit:]] gal |[[:digit:]] qt |[[:digit:]] cup |ounce|pint|pound|kilogram|gram|liter|deciliter|milliliter|tablespoon|teaspoon|fluid ounce|gallon|quart|cup"
Then if there are multiple portions that match, we’ll take the last one.
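The real pattern is built in the script mentioned above, which is not shown here. As a rough idea of how such a pattern can be assembled and used, here is a sketch with a three-unit toy dictionary; the vectors abbrevs and full_names and the function last_unit are assumptions made for illustration only.

```r
abbrevs <- c("oz", "lb", "g")
full_names <- c("ounce", "pound", "gram")

measures_sketch <- paste(
  c(paste0("[[:digit:]]", abbrevs, " "),   # e.g. "4g "
    paste0("[[:digit:]] ", abbrevs, " "),  # e.g. "4 g "
    full_names),                           # e.g. "ounce" inside "ounces"
  collapse = "|"
)

last_unit <- function(x) {
  hits <- regmatches(x, gregexpr(measures_sketch, x))[[1]]
  if (length(hits) == 0) return("")
  # Drop digits and spaces, then keep the last unit found
  gsub("[^a-z]", "", hits[length(hits)])
}

print(last_unit("4g cinnamon"))
print(last_unit("1.2 ounces or maybe pounds of something with a decimal"))
print(last_unit("around 4 or 5 eels"))
```

Requiring a digit in front of the abbreviations keeps a short unit like "g" from matching the last letter of an ordinary word.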
We’ll also add approximate
to our dataframe which is just a boolean value indicating whether this item is exact or approximate.
approximate_raw <- c("about", "around", "as desired", "as needed", "optional", "or so", "to taste")
approximate <- approximate_raw %>%
str_c(collapse = "|")
If the item contains one of the approximate phrases (about, around, as desired, as needed, optional, or so, to taste) then we give it a TRUE.
str_detect("8 or so cloves of garlic", approximate)
## [1] TRUE
str_detect("8 cloves of garlic", approximate)
## [1] FALSE
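One thing to watch for is that a bare alternation also matches inside longer stretches of text, so "or so" would fire on "or some". A possible refinement, sketched here with base R's grepl() rather than str_detect(), is to wrap the alternation in word boundaries. The name approximate_bounded and the first test sentence are made up for this example.

```r
approximate_raw <- c("about", "around", "as desired", "as needed",
                     "optional", "or so", "to taste")
approximate <- paste(approximate_raw, collapse = "|")

# Same alternation, but each phrase must start and end on a word boundary
approximate_bounded <- paste0("\\b(", approximate, ")\\b")

tricky <- "2 cups flour or some cornmeal"

plain_hit <- grepl(approximate, tricky)                       # false positive
bounded_hit <- grepl(approximate_bounded, tricky, perl = TRUE)
garlic_hit <- grepl(approximate_bounded, "8 or so cloves of garlic", perl = TRUE)

print(c(plain_hit, bounded_hit, garlic_hit))
```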
We’ll turn any empty match results into empty strings with this helper.
nix_nas <- function(x) {
if (length(x) == 0) {
x <- ""
}
x
}
get_portion_text <- function(df) {
df <- df %>%
mutate(
raw_portion_num = str_extract_all(ingredients, portions_reg, simplify = FALSE) %>% # Extract the raw portion numbers,
map_chr(str_c, collapse = ", ", default = ""), # separating by comma if multiple
portion_name = str_extract_all(ingredients, measures_collapsed) %>%
map(nix_nas) %>%
str_extract_all("[a-z]+") %>%
map(nix_nas) %>% # Get rid of numbers
map_chr(last), # If there are multiple arguments that match, grab the last one (rather than solution below of comma-separating them)
# map_chr(str_c, collapse = ", ", default = ""), # If there are multiple arguments that match, separate them with a ,
approximate = str_detect(ingredients, approximate)
)
return(df)
}
Last thing for us for now on this subject (though there’s a lot more to do here!) will be to add abbreviations. This will let us standardize things like "ounces"
and "oz"
which actually refer to the same thing.
All add_abbrevs()
will do is let us mutate our dataframe with a new column for the abbreviation of our portion size, if we’ve got a recognized portion size.
add_abbrevs <- function(df) {
out <- vector(length = nrow(df))
for (i in seq_along(out)) {
if (df$portion_name[i] %in% abbrev_dict$name) {
out[i] <- abbrev_dict[which(abbrev_dict$name == df$portion_name[i]), ]$key
} else {
out[i] <- df$portion_name[i]
}
}
out <- df %>% bind_cols(list(portion_abbrev = out) %>% as_tibble())
return(out)
}
tibble(ingredients = "10 pounds salt, or to taste") %>%
get_portion_text() %>% add_abbrevs() %>% kable(format = "html")
ingredients | raw_portion_num | portion_name | approximate | portion_abbrev |
---|---|---|---|---|
10 pounds salt, or to taste | 10 | pound | TRUE | lb |
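The same lookup can also be done without a loop. The sketch below uses base R's match() on a toy abbrev_dict with the name and key columns that add_abbrevs() expects; the dictionary contents and the portion_name vector are stand-ins made up for this example.

```r
abbrev_dict <- data.frame(
  name = c("ounce", "pound"),
  key = c("oz", "lb"),
  stringsAsFactors = FALSE
)

portion_name <- c("pound", "ounce", "", "tbsp")

# Position of each portion name in the dictionary, NA if it is not there
idx <- match(portion_name, abbrev_dict$name)

# Use the abbreviation when we have one, otherwise keep the original name
portion_abbrev <- ifelse(is.na(idx), portion_name, abbrev_dict$key[idx])

print(portion_abbrev)
```

Names that are already abbreviations, like "tbsp", pass through unchanged, which matches what the loop version does in its else branch.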
All together now. Get the portion text and values. If we only want our best guess as to the portion size, that is, portion, we’ll chuck range_portion and mult_add_portion by setting pare_portion_info to TRUE.
get_portions <- function(df, add_abbrevs = FALSE, pare_portion_info = FALSE) {
df %<>% get_portion_text()
if (add_abbrevs == TRUE) {
df %<>% add_abbrevs()
}
df %<>% get_portion_values()
if (pare_portion_info == TRUE) {
df %<>% select(-range_portion, -mult_add_portion)
}
return(df)
}
some_recipes_tester %>% get_portions(pare_portion_info = TRUE) %>% add_abbrevs() %>% kable(format = "html")
ingredients | raw_portion_num | portion_name | approximate | portion | portion_abbrev |
---|---|---|---|---|---|
1.2 ounces or maybe pounds of something with a decimal | 1.2 | pound | FALSE | 1.20 | lb |
3 (14 ounce) cans o’ beef broth | 3, 14 | ounce | FALSE | 42.00 | oz |
around 4 or 5 eels | 4, 5 | | TRUE | 4.50 | |
5-6 cans spam | 5, 6 | | FALSE | 5.50 | |
11 - 46 tbsp of sugar | 11, 46 | tbsp | FALSE | 28.50 | tbsp |
1/3 to 1/2 of a ham | 1/3, 1/2 | | FALSE | 0.42 | |
5 1/2 pounds of apples | 5, 1/2 | pound | FALSE | 5.50 | lb |
4g cinnamon | 4 | g | FALSE | 4.00 | g |
about 17 fluid ounces of wine | 17 | ounce | TRUE | 17.00 | oz |
4-5 cans of 1/2 caf coffee | 4, 5, 1/2 | | FALSE | 4.50 | |
3 7oz figs with 1/3 rind | 3, 7, 1/3 | oz | FALSE | 21.00 | oz |
We’ve got some units! Next step will be to convert all units into grams, so that we have them all in a standardized format.