# Scraping Together a Recipe, Episode II

- 2018/25/02
- 12 min read

One of the goals here is to see what portion of a menu tends to be devoted to, say, meat or spices or a word that appears in the recipe name, etc. To answer that, we’ll need to extract portion names and portion sizes from the text. That would be pretty simple with a fixed list of portion names (“gram”, “lb”) if portion sizes were always just a single number.

But, as it happens, portion sizes don’t usually consist of just one number. There are a few hurdles:

- Complex fractions: `2 1/3 cups` of flour should become `2.3333` cups of flour
- Multiple items of the same item: `4 (12oz)` bottles of beer should become `48` oz of beer
- Ranges: `6-7` tomatoes should become `6.5` tomatoes
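The arithmetic behind those three target values is simple once the numbers have been isolated. A quick check in plain R (no parsing yet, and the variable names are only for illustration):

```r
# Complex fraction: whole number plus the fraction
complex_fraction <- 2 + 1/3       # "2 1/3 cups"

# Multiple: count times the size of a single item
multiple <- 4 * 12                # "4 (12oz) bottles"

# Range: midpoint of the two endpoints
range_midpoint <- mean(c(6, 7))   # "6-7 tomatoes"

round(complex_fraction, 4)
multiple
range_midpoint
```

The hard part, and what the rest of this post deals with, is deciding from the text which of these three operations applies.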

Here is a fake recipe to illustrate some of those cases. It’s an artificially “hard” recipe to straighten out because I’ve built it up line by line as I needed to test functions against fake data.

```
some_recipes_tester <- list(ingredients = vector()) %>% as_tibble()
some_recipes_tester[1, ] <- "1.2 ounces or maybe pounds of something with a decimal"
some_recipes_tester[2, ] <- "3 (14 ounce) cans o' beef broth"
some_recipes_tester[3, ] <- "around 4 or 5 eels"
some_recipes_tester[4, ] <- "5-6 cans spam"
some_recipes_tester[5, ] <- "11 - 46 tbsp of sugar"
some_recipes_tester[6, ] <- "1/3 to 1/2 of a ham"
some_recipes_tester[7, ] <- "5 1/2 pounds of apples"
some_recipes_tester[8, ] <- "4g cinnamon"
some_recipes_tester[9, ] <- "about 17 fluid ounces of wine"
some_recipes_tester[10, ] <- "4-5 cans of 1/2 caf coffee"
some_recipes_tester[11, ] <- "3 7oz figs with 1/3 rind"
```

`some_recipes_tester %>% kable(format = "html")`

ingredients |
---|
1.2 ounces or maybe pounds of something with a decimal |
3 (14 ounce) cans o’ beef broth |
around 4 or 5 eels |
5-6 cans spam |
11 - 46 tbsp of sugar |
1/3 to 1/2 of a ham |
5 1/2 pounds of apples |
4g cinnamon |
about 17 fluid ounces of wine |
4-5 cans of 1/2 caf coffee |
3 7oz figs with 1/3 rind |

Rather than reach for something conditional-random-field-level smart to get around these problems, I started off by writing a few rules of thumb.

We’ll worry first about how we find and extract numbers and next about how we’ll add, multiply, or average them as necessary.

**Extracting Numbers**

We’ll need a few regexes to extract our numbers.

`portions_reg` will match any number, even if it contains a decimal point or a slash, which will be important for capturing complex fractions. `multiplier_reg` covers all cases of numbers that might need to be multiplied in the Allrecipes data, because these are always separated by `" ("`, whereas `multiplier_reg_looser` is a more loosely defined case matching numbers separated just by `" "`.

```
# Match any number, even if it has a decimal or slash in it
portions_reg <- "[[:digit:]]+\\.*[[:digit:]]*+\\/*[[:digit:]]*"
# Match numbers separated by " (" as in "3 (5 ounce) cans of broth" for multiplying
multiplier_reg <- "[[:digit:]]+ \\(+[[:digit:]]"
# Match numbers separated by " " (a plain space needs no escaping; "\ " is not a valid escape in an R string)
multiplier_reg_looser <- "[0-9]+ +[0-9]"
```

Now the `multiplier_reg` regexes will allow us to detect that we’ve got something that needs to be multiplied, like `"4 (12 oz) hams"`, or a fraction like `"1 2/3 pound of butter"`. If we do, then we’ll multiply or add those numbers as appropriate. The `only_mult_after_paren` parameter is something I put in that is specific to Allrecipes. On Allrecipes, it seems that if we do have multiples, they’ll always be of the form “*number_of_things* (*quantity_of_single_thing*)”. There are always parentheses around *quantity_of_single_thing*. If we’re only using Allrecipes data, that gives us some more security that we’re only multiplying quantities that actually should be multiplied. If we want to make this extensible in the future we’d want to set `only_mult_after_paren` to `FALSE` to account for cases like “7 4oz cans of broth”.

We use `str_extract_all()` to check that our regexes are grabbing the parts of a string that we’ll need to do computation on.

`str_extract_all("3 1/4 lb patties", portions_reg)`

```
## [[1]]
## [1] "3" "1/4"
```

And check that `multiplier_reg` picks out the numbers that need multiplying:

`str_extract_all("3 (4 pound patties) for grilling", multiplier_reg)`

```
## [[1]]
## [1] "3 (4"
```

We’ll clean that up by passing it to `portions_reg` to just grab the numbers:

`str_extract_all("3 (4 pound patties) for grilling", multiplier_reg) %>% str_extract_all(portions_reg)`

```
## [[1]]
## [1] "3" "4"
```

Finally, let’s make sure that our stricter multiplier regex doesn’t want to multiply something that shouldn’t be multiplied.

`str_extract_all("3 or 4 lb patties", multiplier_reg)`

```
## [[1]]
## character(0)
```
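The looser regex isn’t exercised above, so here is a small check of what it grabs. This is a sketch in base R rather than stringr, so it runs without any packages, and `first_match()` is a hypothetical helper that isn’t part of the original code.

```r
# Same pattern as multiplier_reg_looser: digits, one or more spaces, then a digit
looser <- "[0-9]+ +[0-9]"

# Hypothetical helper: return the first match, or character(0) if there is none
first_match <- function(x, pattern) regmatches(x, regexpr(pattern, x))

first_match("7 4oz cans of broth", looser)  # a multiple without parentheses
first_match("4 1/2 steaks", looser)         # a complex fraction
first_match("3 or 4 lb patties", looser)    # a range, so nothing should match
```

The first two calls should return `"7 4"` and `"4 1"`, and the last one an empty character vector. Note that the looser pattern can’t tell a multiple from a complex fraction on its own; that distinction is made later, once the numbers are converted to decimals.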

Okay, now to the multiplying and adding.

First, let’s consider complex fractions. Off the bat, we know we’ll need a way to turn a single fraction into a decimal form. We keep them `as.character` for now and turn them into numeric later down the pipe.

```
frac_to_dec <- function(e) {
  if (length(e) == 0) { # If NA because there are no numbers, make the portion 0
    out <- 0
  } else {
    out <- parse(text = e) %>% eval() %>% as.character()
  }
  return(out)
}
```

`eval()`, which is what does the work inside `frac_to_dec()`, only returns the value of the last expression in a vector rather than one value per element, so as a workaround I put it into a helper that will turn all fractions into decimal strings:

```
map_frac_to_dec <- function(e) {
  # map_chr() already applies frac_to_dec() to every element, so no explicit loop is needed
  out <- e %>% map_chr(frac_to_dec)
  return(out)
}
```

For example:

`map_frac_to_dec(c("1/2", "1/8", "1/3"))`

`## [1] "0.5" "0.125" "0.333333333333333"`
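If evaluating parsed text feels risky on scraped input, the same conversion can be done by splitting on the slash and dividing. This is an alternative sketch, not what the code in this post uses: `frac_to_num()` is a hypothetical name, it is base R only, and it returns numerics directly rather than strings.

```r
# Hypothetical alternative to frac_to_dec(): no parse()/eval(), returns a numeric
frac_to_num <- function(x) {
  parts <- as.numeric(strsplit(x, "/", fixed = TRUE)[[1]])
  if (length(parts) == 2) {
    parts[1] / parts[2]  # "1/3" -> numerator / denominator
  } else {
    parts[1]             # "3" or "1.2" -> already a plain number
  }
}

# Apply to each element; vapply() guarantees one numeric per input string
vapply(c("1/2", "1/8", "3", "1.2"), frac_to_num, numeric(1), USE.NAMES = FALSE)
```

That call should give `0.5`, `0.125`, `3` and `1.2`. The tradeoff is that this version only understands a single `a/b` form, whereas `eval()` will evaluate whatever expression it is handed.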

Cool, so for a given ingredient we’ll need to look for numbers that occur next to other numbers, and then add complex fractions and multiply multiples.

If we’ve got two numbers next to each other and the second number evaluates to a decimal less than 1, we’ve got a complex fraction. For example, if we extract the numbers from `"4 1/2 loaves of bread"` and turn all fractions among them into decimals, we end up with `"4"` and `"0.5"`. We know `0.5` is less than `1`, so we’ve got a complex fraction on our hands. We need to add `4 + 0.5` to end up with `4.5` loaves of bread.

It’s true that the function below doesn’t address the issue of having both a complex fraction and multiples in a recipe. That would look like `"3 (2 1/4 inch) blocks of cheese"`. I haven’t run into that issue too much but it certainly could use a workaround.

```
multiply_or_add_portions <- function(e) {
  if (length(e) == 0) {
    e <- 0
  } else if (length(e) > 1) {
    # If our second element is a fraction, we know this is a complex fraction so we add the two
    if (e[2] < 1) {
      e <- e[1:2] %>% reduce(`+`)
    } else { # Otherwise, we multiply them
      e <- e[1:2] %>% reduce(`*`)
    }
  }
  return(e)
}
```

`multiply_or_add_portions(c(4, 0.5))`

`## [1] 4.5`

`multiply_or_add_portions(c(4, 5))`

`## [1] 20`
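One possible workaround for the mixed case mentioned earlier, a complex fraction inside a multiple, is to fold every sub-1 value into the number just before it and only then multiply. This is a rough sketch under that assumption, not something the pipeline in this post uses; `combine_portions()` is a hypothetical name and the code is base R only.

```r
# Hypothetical variant of multiply_or_add_portions() that handles "3 (2 1/4 inch)"
combine_portions <- function(e) {
  if (length(e) == 0) return(0)
  merged <- numeric(0)
  for (x in e) {
    if (x < 1 && length(merged) > 0) {
      # A value below 1 is assumed to be the fractional part of the previous number
      merged[length(merged)] <- merged[length(merged)] + x
    } else {
      merged <- c(merged, x)
    }
  }
  # Multiply at most the first two merged numbers, as the original function does
  prod(merged[seq_len(min(2, length(merged)))])
}

combine_portions(c(3, 2, 0.25))  # 3 * (2 + 0.25)
combine_portions(c(4, 0.5))      # 4 + 0.5
combine_portions(c(4, 5))        # 4 * 5
```

These should come out to `6.75`, `4.5` and `20`. The same caveat applies as before: any fraction trailing a number gets absorbed, so `"3 7oz figs with 1/3 rind"` would be read as 3 times 7 1/3.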

This function will allow us to add a new column to our dataframe called `mult_add_portion`. If we’ve done any multiplying or adding of numbers, we’ll have a value greater than 0 there, and 0 otherwise.

```
get_mult_add_portion <- function(e, only_mult_after_paren = FALSE) {
  has_paren_mult <- str_detect(e, multiplier_reg)
  has_loose_mult <- str_detect(e, multiplier_reg_looser)
  # If we only trust the " (" form, ignore the looser match. Note that complex
  # fractions like "4 1/2" are only caught by the looser regex.
  if (only_mult_after_paren == TRUE) {
    needs_combining <- has_paren_mult
  } else {
    needs_combining <- has_paren_mult | has_loose_mult
  }
  if (needs_combining) {
    out <- e %>% str_extract_all(portions_reg) %>%
      map(map_frac_to_dec) %>%
      map(as.numeric) %>%
      map_dbl(multiply_or_add_portions) %>%
      round(digits = 2)
  } else {
    out <- 0
  }
  return(out)
}
```

`get_mult_add_portion("4 1/2 steaks") `

`## [1] 4.5`

`get_mult_add_portion("4 (5 lb melons)") `

`## [1] 20`

**Ranges**

Finally, let’s deal with ranges. If two numbers are separated by a `"to"`, an `"or"`, or a `"-"`, like “4-5 teaspoons of sugar”, we know that this is a range. We’ll take the average of those two numbers.

We’ll add a new column to our dataframe called `range_portion` for the result of any range calculations. If we don’t have a range, just like `mult_add_portion`, we set this value to 0.

```
to_reg <- "([0-9])(( to ))(([0-9]))"
or_reg <- "([0-9])(( or ))(([0-9]))"
dash_reg_1 <- "([0-9])((-))(([0-9]))"
dash_reg_2 <- "([0-9])(( - ))(([0-9]))"
```

First, a couple of helpers. `determine_if_range()` tells us whether any of these range patterns occurs in an ingredient.

```
determine_if_range <- function(ingredients) {
  if (str_detect(ingredients, pattern = to_reg) |
    str_detect(ingredients, pattern = or_reg) |
    str_detect(ingredients, pattern = dash_reg_1) |
    str_detect(ingredients, pattern = dash_reg_2)) {
    contains_range <- TRUE
  } else {
    contains_range <- FALSE
  }
  return(contains_range)
}
```
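Since the four patterns differ only in the separator, they can also be folded into a single alternation and tested against a whole vector at once. A base-R sketch of that idea follows; `range_reg` and `is_range` are hypothetical names that don’t appear elsewhere in this post.

```r
# One alternation covering " to ", " or ", " - " and "-" between two digits
range_reg <- "[0-9]( to | or | - |-)[0-9]"

examples <- c(
  "1/3 to 1/2 of a ham",
  "around 4 or 5 eels",
  "5-6 cans spam",
  "11 - 46 tbsp of sugar",
  "1.2 ounces or maybe pounds of something with a decimal"
)

# grepl() is vectorized, so no if/else is needed per ingredient
is_range <- grepl(range_reg, examples)
is_range
```

The first four strings should be flagged and the last one should not, because its “or” doesn’t sit between two digits.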

And, we’ll want to be able to get the mean of the first two elements in a numeric vector.

```
get_portion_means <- function(e) {
  if (length(e) == 0) {
    e <- 0 # NA to 0
  } else if (length(e) > 1) {
    e <- mean(e[1:2])
  }
  return(e)
}
```

```
get_ranges <- function(e) {
  if (determine_if_range(e) == TRUE) {
    out <- str_extract_all(e, portions_reg) %>%
      map(str_split, pattern = " to ", simplify = FALSE) %>%
      map(str_split, pattern = " - ", simplify = FALSE) %>%
      map(str_split, pattern = "-", simplify = FALSE) %>%
      map(map_frac_to_dec) %>%
      map(as.numeric) %>%
      map_dbl(get_portion_means) %>% round(digits = 2)
  } else {
    out <- 0
  }
  return(out)
}
```

Let’s make sure we get the average.

`get_ranges("7 to 21 peaches")`

`## [1] 14`

At the end of the day, we want to end up with a single number describing how much of our recipe item we want. So, let’s put all that together into one function. For the ingredients here, either `range_portion` or `mult_add_portion` will be 0, so we add them together to get our final portion size. If we need neither to get a range nor to multiply or add numbers, we’ll just take whatever the first number is in there.

```
get_portion_values <- function(df, only_mult_after_paren = FALSE, round_digits = 2) {
  df <- df %>%
    mutate(
      range_portion = map_dbl(ingredients, get_ranges),
      mult_add_portion = map_dbl(ingredients, get_mult_add_portion, only_mult_after_paren = only_mult_after_paren),
      portion = ifelse(range_portion == 0 & mult_add_portion == 0,
        str_extract_all(ingredients, portions_reg) %>%
          map(map_frac_to_dec) %>%
          map(as.numeric) %>%
          map_dbl(first),
        range_portion + mult_add_portion) # Otherwise, take either the range or the multiplied value
    ) %>%
    map_at(round, digits = round_digits,
      .at = c("range_portion", "mult_add_portion", "portion")) %>%
    as_tibble()
  return(df)
}
```
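The selection rule inside `get_portion_values()` is easier to see in isolation, with plain numbers standing in for the three intermediate results. A minimal base-R sketch, where `pick_portion()` and its argument names are hypothetical:

```r
# Hypothetical stand-alone version of the ifelse() inside get_portion_values()
pick_portion <- function(range_portion, mult_add_portion, first_number) {
  ifelse(range_portion == 0 & mult_add_portion == 0,
    first_number,                      # nothing to combine: take the first number
    range_portion + mult_add_portion)  # one of the two is 0, so the sum is the other
}

# "3 (14 ounce) cans", "around 4 or 5 eels", "about 17 fluid ounces"
pick_portion(
  range_portion    = c(0, 4.5, 0),
  mult_add_portion = c(42, 0, 0),
  first_number     = c(3, 4, 17)
)
```

This should return `42`, `4.5` and `17`. Adding the two columns only works because 0 is used as the “not applicable” marker; an ingredient with both a range and a multiple would need a different rule.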

Let’s see what that looks like in practice.

`some_recipes_tester %>% get_portion_values() %>% kable(format = "html")`

ingredients | range_portion | mult_add_portion | portion |
---|---|---|---|
1.2 ounces or maybe pounds of something with a decimal | 0.00 | 0.0 | 1.20 |
3 (14 ounce) cans o’ beef broth | 0.00 | 42.0 | 42.00 |
around 4 or 5 eels | 4.50 | 0.0 | 4.50 |
5-6 cans spam | 5.50 | 0.0 | 5.50 |
11 - 46 tbsp of sugar | 28.50 | 0.0 | 28.50 |
1/3 to 1/2 of a ham | 0.42 | 0.0 | 0.42 |
5 1/2 pounds of apples | 0.00 | 5.5 | 5.50 |
4g cinnamon | 0.00 | 0.0 | 4.00 |
about 17 fluid ounces of wine | 0.00 | 0.0 | 17.00 |
4-5 cans of 1/2 caf coffee | 4.50 | 0.0 | 4.50 |
3 7oz figs with 1/3 rind | 0.00 | 21.0 | 21.00 |

Looks pretty solid.

**Extracting Measurement Units**

Now onto easier waters: portion names. You can check out `/scripts/scrape/import/get_measurement_types.R` if you’re interested in the steps I took to find some usual portion names and create an abbreviation dictionary, `abbrev_dict`. What we also do there is create `measures_collapsed`, which is a single string of all portion names separated by “|” so we can find all the portion names that might occur in a given item.

`measures_collapsed`

`## [1] "[[:digit:]]oz |[[:digit:]]pt |[[:digit:]]lb |[[:digit:]]kg |[[:digit:]]g |[[:digit:]]l |[[:digit:]]dl |[[:digit:]]ml |[[:digit:]]tbsp |[[:digit:]]tsp |[[:digit:]]fluid oz |[[:digit:]]gal |[[:digit:]]qt |[[:digit:]]cup |[[:digit:]] oz |[[:digit:]] pt |[[:digit:]] lb |[[:digit:]] kg |[[:digit:]] g |[[:digit:]] l |[[:digit:]] dl |[[:digit:]] ml |[[:digit:]] tbsp |[[:digit:]] tsp |[[:digit:]] fluid oz |[[:digit:]] gal |[[:digit:]] qt |[[:digit:]] cup |ounce|pint|pound|kilogram|gram|liter|deciliter|milliliter|tablespoon|teaspoon|fluid ounce|gallon|quart|cup"`

Then if there are multiple portions that match, we’ll take the last one.
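To make the “take the last one” rule concrete, here is a sketch with a deliberately tiny pattern. `units_reg` is a hypothetical three-unit stand-in for `measures_collapsed`, `last_unit()` is a hypothetical helper, and the code uses base R instead of stringr.

```r
# Hypothetical, heavily truncated stand-in for measures_collapsed
units_reg <- "ounce|pound|cup"

# Extract every unit mentioned and keep only the last; "" if there is none
last_unit <- function(x) {
  matches <- regmatches(x, gregexpr(units_reg, x))[[1]]
  if (length(matches) == 0) "" else matches[length(matches)]
}

last_unit("1.2 ounces or maybe pounds of something with a decimal")
last_unit("3 (14 ounce) cans o' beef broth")
last_unit("around 4 or 5 eels")
```

The three calls should return `"pound"`, `"ounce"` and `""`. Picking the last match is a heuristic: it does the right thing for “3 (14 ounce) cans” but is only a guess for “ounces or maybe pounds”.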

We’ll also add `approximate` to our dataframe, which is just a boolean value indicating whether this item is exact or approximate.

```
approximate_raw <- c("about", "around", "as desired", "as needed", "optional", "or so", "to taste")
approximate <- approximate_raw %>%
str_c(collapse = "|")
```

If the item contains one of the `approximate` terms (about, around, as desired, as needed, optional, or so, to taste) then we give it a TRUE.

`str_detect("8 or so cloves of garlic", approximate)`

`## [1] TRUE`

`str_detect("8 cloves of garlic", approximate)`

`## [1] FALSE`
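One thing to watch with a plain alternation is that it matches inside other words: “or so” also occurs in “for some”. A possible refinement, sketched here in base R with the hypothetical name `approximate_bounded`, is to wrap the alternation in word boundaries.

```r
approximate_raw <- c("about", "around", "as desired", "as needed", "optional", "or so", "to taste")

# Plain alternation, built the same way as `approximate` above
approximate_plain <- paste(approximate_raw, collapse = "|")
# Hypothetical stricter variant: the phrase must start and end on a word boundary
approximate_bounded <- paste0("\\b(", approximate_plain, ")\\b")

items <- c("8 or so cloves of garlic", "flour for some dusting", "8 cloves of garlic")

grepl(approximate_plain, items)                 # the second item is a false positive
grepl(approximate_bounded, items, perl = TRUE)  # only the first item matches
```

The plain pattern should flag the first two items, the bounded one only the first.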

We’ll turn any empty matches (our stand-in for NAs) into empty strings with this helper.

```
nix_nas <- function(x) {
  if (length(x) == 0) {
    x <- ""
  }
  x
}
```

```
get_portion_text <- function(df) {
  df <- df %>%
    mutate(
      raw_portion_num = str_extract_all(ingredients, portions_reg, simplify = FALSE) %>% # Extract the raw portion numbers,
        map_chr(str_c, collapse = ", ", default = ""), # separating by comma if multiple
      portion_name = str_extract_all(ingredients, measures_collapsed) %>%
        map(nix_nas) %>%
        str_extract_all("[a-z]+") %>%
        map(nix_nas) %>% # Get rid of numbers
        map_chr(last), # If there are multiple arguments that match, grab the last one (rather than solution below of comma-separating them)
      # map_chr(str_c, collapse = ", ", default = ""), # If there are multiple arguments that match, separate them with a ,
      approximate = str_detect(ingredients, approximate)
    )
  return(df)
}
```

Last thing for us for now on this subject (though there’s a lot more to do here!) will be to add abbreviations. This will let us standardize things like `"ounces"` and `"oz"`, which actually refer to the same thing.

All `add_abbrevs()` will do is let us mutate our dataframe with a new column for the abbreviation of our portion size, if we’ve got a recognized portion size.

```
add_abbrevs <- function(df) {
  out <- vector(length = nrow(df))
  for (i in seq_along(out)) {
    if (df$portion_name[i] %in% abbrev_dict$name) {
      out[i] <- abbrev_dict[which(abbrev_dict$name == df$portion_name[i]), ]$key
    } else {
      out[i] <- df$portion_name[i]
    }
  }
  out <- df %>% bind_cols(list(portion_abbrev = out) %>% as_tibble())
  return(out)
}
```

```
tibble(ingredients = "10 pounds salt, or to taste") %>%
get_portion_text() %>% add_abbrevs() %>% kable(format = "html")
```

ingredients | raw_portion_num | portion_name | approximate | portion_abbrev |
---|---|---|---|---|
10 pounds salt, or to taste | 10 | pound | TRUE | lb |
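The same lookup can be written without a loop by indexing a named vector. This is only a sketch: `abbrev_lookup` is a hypothetical three-entry stand-in for `abbrev_dict` (whose real contents live in the script mentioned above), and `abbreviate_units()` is a hypothetical name.

```r
# Hypothetical subset of the dictionary: names are long forms, values are abbreviations
abbrev_lookup <- c(ounce = "oz", pound = "lb", tablespoon = "tbsp")

abbreviate_units <- function(portion_name) {
  # Indexing by name returns NA for anything not in the dictionary (including "")
  out <- unname(abbrev_lookup[portion_name])
  # Fall back to the original value when there is no known abbreviation
  ifelse(is.na(out), portion_name, out)
}

abbreviate_units(c("pound", "oz", "", "ounce"))
```

This should return `"lb"`, `"oz"`, `""` and `"oz"`: recognized names are abbreviated, and anything else, including units that are already abbreviated, passes through unchanged.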

All together now. Get the portion text and values. If we only want our best guess as to the portion size, that is, the final `portion` column, we’ll chuck `range_portion` and `mult_add_portion`.

```
get_portions <- function(df, add_abbrevs = FALSE, pare_portion_info = FALSE) {
  df %<>% get_portion_text()
  if (add_abbrevs == TRUE) {
    df %<>% add_abbrevs()
  }
  df %<>% get_portion_values()
  if (pare_portion_info == TRUE) {
    df %<>% select(-range_portion, -mult_add_portion)
  }
  return(df)
}
```

`some_recipes_tester %>% get_portions(pare_portion_info = TRUE) %>% add_abbrevs() %>% kable(format = "html")`

ingredients | raw_portion_num | portion_name | approximate | portion | portion_abbrev |
---|---|---|---|---|---|
1.2 ounces or maybe pounds of something with a decimal | 1.2 | pound | FALSE | 1.20 | lb |
3 (14 ounce) cans o’ beef broth | 3, 14 | ounce | FALSE | 42.00 | oz |
around 4 or 5 eels | 4, 5 | | TRUE | 4.50 | |
5-6 cans spam | 5, 6 | | FALSE | 5.50 | |
11 - 46 tbsp of sugar | 11, 46 | tbsp | FALSE | 28.50 | tbsp |
1/3 to 1/2 of a ham | 1/3, 1/2 | | FALSE | 0.42 | |
5 1/2 pounds of apples | 5, 1/2 | pound | FALSE | 5.50 | lb |
4g cinnamon | 4 | g | FALSE | 4.00 | g |
about 17 fluid ounces of wine | 17 | ounce | TRUE | 17.00 | oz |
4-5 cans of 1/2 caf coffee | 4, 5, 1/2 | | FALSE | 4.50 | |
3 7oz figs with 1/3 rind | 3, 7, 1/3 | oz | FALSE | 21.00 | oz |

We’ve got some units! Next step will be to convert all units into grams, so that we have them all in a standardized format.