Chopping ingredients

Wrestling with nested and untidy string data

Martine Wauben https://github.com/MHWauben
2020-08-30

Tidy Tuesday: 18 August 2020

This week’s data can be found here. It contains data on episodes of Chopped, a TV show that challenges chefs to cook with particular ingredients.


chopped <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-25/chopped.tsv')

In this post, I will do some data wrangling: unnesting columns that are really multiple rows in one, and then cleaning strings using some basic heuristics so that I can group them more efficiently.

Unnesting data

Say I am interested in analysing the sorts of ingredients that participants are given. Unfortunately, in this dataset the episode ingredients are listed episode-by-episode, rather than with a row per ingredient. In “tidyverse language”, this can be described as nested data: a single column really contains several rows’ worth of data.
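
For example, a quick peek at the appetizer column shows each cell holding a comma-separated string of ingredients (a small check of my own; output not shown):

chopped %>%
  dplyr::select(series_episode, appetizer) %>%
  head(3)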

I iterate over appetizer, entree, and dessert ingredients to create a ‘long’ version of this data, with a row per ingredient as I want it.


unnest_course <- function(course_var){
  course <- dplyr::enquo(course_var)  # capture the course column name so it can be spliced into select()
  chopped %>%
    dplyr::select(season, series_episode, episode_rating, ingredient = !!course) %>%
    dplyr::mutate(ingredient = strsplit(tolower(ingredient), ", ")) %>% # split the comma-separated string into a list-column
    tidyr::unnest(ingredient) %>%                                       # one row per ingredient
    dplyr::mutate(course = course_var)                                  # record which course the ingredient came from
}

ingredients <- purrr::map_dfr(c('appetizer', 'entree', 'dessert'), unnest_course)
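
As a quick sanity check that the reshaping worked (my own addition; output omitted), we can count rows per course:

dplyr::count(ingredients, course)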

Unique ingredients

Now, part of the charm of Chopped is that the ingredients are often one-offs: to keep the show fresh for regular viewers, most ingredients appear just once. Prepare yourself for the most meta histogram ever!


library(ggplot2) # plotting functions below are used unprefixed

ingredients %>%
  dplyr::group_by(ingredient) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::group_by(count) %>%
  dplyr::summarise(count_count = n()) %>%
  ggplot(aes(x = reorder(count, count_count), y = count_count, fill = -count_count))+
  geom_col()+ # bars of pre-computed counts (equivalent to geom_histogram(stat = 'identity'))
  coord_flip()+
  labs(title = 'How often do ingredients tend to appear?',
       x = 'Number of times an ingredient appears',
       y = 'Number of ingredients that appear this number of times')+
  theme_minimal()+
  theme(legend.position = "none")

Hence, ideally we’d group ingredients together. For example, there are plenty of rum-related ingredients, but ‘rum’ as an ingredient only appears once!


ingredients %>%
  dplyr::filter(grepl('(^| )rum( |$)', ingredient)) %>%
  dplyr::arrange(ingredient)

# A tibble: 11 x 5
   season series_episode episode_rating ingredient           course   
    <dbl>          <dbl>          <dbl> <chr>                <chr>    
 1     10            117           NA   coconut rum          dessert  
 2     27            333            8.3 rum                  dessert  
 3     27            329            7.7 rum & cola           dessert  
 4     31            375            7.9 rum butter ice cream dessert  
 5     29            361            8.4 rum raisin chèvre    dessert  
 6     25            308            8.9 rum raisin ice cream appetizer
 7     38            476           NA   rum runners          dessert  
 8     11            133            8.5 spiced rum           dessert  
 9     14            162            8.8 spiced rum           dessert  
10     30            367            7.4 spiced rum           dessert  
11     35            435            8.8 spiced rum           dessert  

Can we group these strings in such a way that we can collapse some of them together with reasonable certainty? We will try a couple of heuristics, and then combine them to decide on the best grouping.

Heuristic one: single-word ingredients

For a lot of ingredients, there will be a one-word version, and then a multi-word ‘cheffy’ version. Take octopus, for example:


ingredients %>%
  dplyr::filter(grepl('(^| )octopus( |$)', ingredient)) %>%
  dplyr::arrange(ingredient)

# A tibble: 11 x 5
   season series_episode episode_rating ingredient        course   
    <dbl>          <dbl>          <dbl> <chr>             <chr>    
 1      1              1            9.2 baby octopus      appetizer
 2      8             91            8.5 baby octopus      appetizer
 3     35            434            8.7 baby octopus      appetizer
 4     36            456            8.6 baby octopus      appetizer
 5     31            387            8.1 baby octopus      entree   
 6      5             62            8.6 octopus           appetizer
 7     11            123            8.6 octopus           appetizer
 8     21            256            8.7 octopus           appetizer
 9     10            117           NA   octopus           entree   
10     39            492           NA   octopus balls     entree   
11     45            559           NA   octopus corn dogs entree   

Baby octopus and octopus balls are still octopus… So I’d like to count them as one. To start, I create a vector of single-word ingredients that appear in this dataset.


single_words <- ingredients %>%
  dplyr::filter(grepl("^[a-z'-]*$", ingredient)) %>%
  dplyr::select(ingredient) %>%
  dplyr::filter(nchar(ingredient) > 3) %>% # remove odd ones like 'red'
  unique(.) %>%
  pull()
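
It can be worth eyeballing this vector before using it (a quick check of my own; output not shown):

length(single_words)
head(sort(single_words))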

Then, for each ingredient in the complete list, I check whether one of these single words matches. If it does, I extract the matched word into a new column.


ingredients$single_ingredients <- NA
for(word in single_words){
  ingredients <- ingredients %>%
    dplyr::mutate(single_ingredients = ifelse(!is.na(single_ingredients), # if an earlier word already matched, keep it
                                              single_ingredients,
                                              stringr::str_extract(ingredient, paste0('(^| )', word, '($| )'))))
}
# strip the leading/trailing space captured by the regex
ingredients$single_ingredients <- gsub('(^ )|( $)', '', ingredients$single_ingredients)

ingredients %>%
  dplyr::group_by(single_ingredients) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::arrange(-count)

# A tibble: 580 x 2
   single_ingredients count
   <chr>              <int>
 1 <NA>                3434
 2 duck                  50
 3 potatoes              48
 4 coconut               47
 5 bacon                 44
 6 mushrooms             43
 7 beer                  42
 8 turkey                41
 9 ginger                35
10 watermelon            35
# ... with 570 more rows

Given that our full ingredients list was 6788 rows long, this has allowed us to group about half of the ingredients already!
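
As a quick sanity check (my own addition, not part of the original output), we can compute the share of rows that now have a single-word group:

mean(!is.na(ingredients$single_ingredients))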

Heuristic two: bigrams

Could we do this trick again, with two-word matches? Two-word phrases are known as ‘bigrams’; the tidytext package, designed for natural language processing, makes working with them straightforward. It also has built-in stopword lists, which we can use to remove bigrams that include words like ‘of’ or ‘and’.


bigrams <- ingredients %>%
  dplyr::filter(is.na(single_ingredients)) %>% # only ingredients not yet matched
  dplyr::select(ingredient) %>%
  dplyr::mutate(ingredient = stringr::str_extract_all(ingredient, pattern = '[a-z]+ [a-z]+')) %>%
  tidyr::unnest(ingredient) %>%
  tidyr::separate(ingredient, c("word1", "word2"), sep = " ") %>%
  dplyr::filter(!word1 %in% tidytext::stop_words$word) %>% # drop bigrams containing stopwords
  dplyr::filter(!word2 %in% tidytext::stop_words$word) %>%
  tidyr::unite(bigram, word1, word2, sep = ' ') %>%
  unique(.) %>%
  pull(bigram)

Now we use the same loop to go through the bigram ingredients.


ingredients$bigram_ingredients <- NA
for(bigram in bigrams){
  ingredients <- ingredients %>%
    dplyr::mutate(bigram_ingredients = ifelse(is.na(bigram_ingredients), # if we already have a bigram ingredient, keep it
                                              stringr::str_extract(ingredient, bigram),
                                              bigram_ingredients))
}

ingredients %>%
  dplyr::group_by(bigram_ingredients) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::arrange(-count)

# A tibble: 1,788 x 2
   bigram_ingredients count
   <chr>              <int>
 1 <NA>                3402
 2 ice cream             55
 3 sweet potato          27
 4 peanut butter         26
 5 white chocolate       19
 6 cream cheese          18
 7 brussels sprouts      17
 8 bok choy              15
 9 potato chips          15
10 blood orange          14
# ... with 1,778 more rows

This covers roughly the same number of rows as the single-word heuristic did, but not necessarily the same ones.
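
To see how much the two heuristics overlap (a quick sketch of my own, not part of the original analysis), we can cross-tabulate which rows each one matched:

ingredients %>%
  dplyr::count(matched_single = !is.na(single_ingredients),
               matched_bigram = !is.na(bigram_ingredients))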

Combine heuristics

Since the single-word heuristic groups more ingredients together, it takes precedence. Wherever it didn’t match, we fill in the bigram match instead.


ingr_grouped <- ingredients %>%
  dplyr::mutate(ingr_group = ifelse(!is.na(single_ingredients), single_ingredients, bigram_ingredients))

ingr_grouped %>%
  dplyr::group_by(ingr_group) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::arrange(-count)

# A tibble: 2,359 x 2
   ingr_group count
   <chr>      <int>
 1 <NA>         207
 2 duck          50
 3 potatoes      48
 4 coconut       47
 5 bacon         44
 6 ice cream     43
 7 mushrooms     43
 8 beer          42
 9 turkey        41
10 ginger        35
# ... with 2,349 more rows

This has helped us label nearly every row! Judging from the ingredient group values we get back, most of the top groups come from the single-word heuristic; however, “ice cream” apparently shows up a lot on this show!
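
As an aside, the same fallback logic can be written more compactly with dplyr::coalesce(), which returns the first non-missing value across its arguments (a sketch equivalent to the ifelse() above, assuming both columns are character):

ingr_grouped <- ingredients %>%
  dplyr::mutate(ingr_group = dplyr::coalesce(single_ingredients, bigram_ingredients))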

However, the real question is: how many groups are now bigger than they were before?


ingr_grouped %>%
  dplyr::filter(!is.na(ingr_group)) %>%
  dplyr::group_by(ingr_group) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::group_by(count) %>%
  dplyr::summarise(count_count = n()) %>%
  ggplot(aes(x = count, y = count_count, fill = -count_count))+
  geom_col()+
  scale_x_reverse()+
  coord_flip()+
  labs(title = 'After grouping: how often do ingredients tend to appear?',
       subtitle = 'Did our heuristics for grouping "cheffy" ingredients work?',
       x = 'Number of times an ingredient appears',
       y = 'Number of ingredients that appear this number of times')+
  theme_minimal()+
  theme(legend.position = "none")

Although there is still a significant number of items appearing just once, the tail is longer: there are now ingredient groups that appear 20+ times, which was not the case before!
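
To put a number on that tail (a quick check of my own, not shown in the original post), we can count how many ingredient groups now appear at least 20 times:

ingr_grouped %>%
  dplyr::filter(!is.na(ingr_group)) %>%
  dplyr::count(ingr_group) %>%
  dplyr::filter(n >= 20) %>%
  nrow()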