US Spending on kids

Using timeseries clustering to detect groups of government spending without domain knowledge

Martine Wauben https://github.com/MHWauben
2020-09-20

Tidy Tuesday: 20 September 2020

This week’s data can be found here. This data contains US Govt spending data, on various types of expenditure related to children.


kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')

This dataset is in ‘long’ format: in other words, there are not separate columns for different types of expenditure. Instead, the ‘variable’ column specifies the type of expenditure. Note: the tidykids package comes with a data dictionary to find out what each of the variables stand for.

To help with visualisation, therefore, we will randomly sample some variables and states to visualise. After doing this, this kind of data structure is helpful for ggplot:


vars_to_vis <- sample(unique(kids$variable),5)
states_to_vis <- sample(unique(kids$state), 5)
kids %>%
  tidyr::pivot_longer(cols = raw:inf_adj_perchild,
                      names_to = 'value_type',
                      values_to = 'value') %>%
  dplyr::filter(variable %in% vars_to_vis & state %in% states_to_vis) %>%
  ggplot(aes(x = year, y = value, colour = variable))+
  geom_line()+
  facet_wrap(state~value_type, scales = "free_y", ncol = 3)+
  theme_minimal()+
  theme(legend.position = "bottom")

Now, quite clearly some of the timeseries are more similar than others. These timeseries are likely to correlated in several ways:

Can we cluster these timeseries together, to find out which timeseries are in similar domains?

Timeseries clustering

For clustering, we need to transform the data into a matrix where each column is a timepoint (ie. a year in this case). This means that each row ID is a state-variable combination.

Moreover, the original data has three versions of each value: the raw value, the inflation-adjusted value, and the inflation-adjusted value per child in that state in that year. We should not mix these up into a single matrix, so three matrices will be required.

So, to create the numeric matrices we’ll need for clustering, we need to do some data prep.

Crucially, we are not interested in clustering together the big states or the big sources of expenditure; rather, we’re interested in the shape of the timeseries, ie. correlated peaks and troughs. Hence we need to index all the timeseries based on their first available value.


# Transform data to column-per-year
ts_wide <- kids %>%
  # Remove 0 rows to avoid infinities
  dplyr::filter(raw > 0) %>%
  dplyr::group_by(state, variable) %>%
  # For each state and variable, index by the first non-zero value
  dplyr::mutate(raw = raw / dplyr::first(raw),
                inf_adj = inf_adj / dplyr::first(inf_adj),
                inf_adj_perchild = inf_adj_perchild / dplyr::first(inf_adj_perchild)) %>%
  dplyr::ungroup() %>%
  # Create single ID column
  dplyr::mutate(ts_id = paste(state, variable, sep = '_')) %>%
  # Make sure the years will be pivoted in the right order
  dplyr::arrange(year) %>%
  tidyr::pivot_wider(id_cols = ts_id,
                     names_from = year,
                     values_from = raw:inf_adj_perchild,
                     # Fill empty values (with 0 budget) with 0
                     values_fill = 0) %>%
  dplyr::arrange(ts_id)

# Save the rownames and available years
ts_ids <- ts_wide$ts_id
ts_years <- stringr::str_extract(colnames(ts_wide), '[0-9]+')

# Separate dataframes for each metric
raw <- dplyr::select(ts_wide, ts_id, tidyselect::starts_with('raw'))
inf_adj <- dplyr::select(ts_wide, ts_id, tidyselect::matches('inf_adj_[0-9]'))
inf_adj_perchild <- dplyr::select(ts_wide, ts_id, tidyselect::matches('perchild'))

# Create timeseries matrices for each metric
ts_matrices <- list(raw = raw[,2:ncol(raw)],
                    inf_adj = inf_adj[,2:ncol(inf_adj)],
                    inf_adj_perchild = inf_adj_perchild[,2:ncol(inf_adj_perchild)])
ts_matrices <- purrr::map(ts_matrices, as.matrix)

Now that we have the timeseries prepared, we can start clustering!

We will use the K-Medoids (aka PAM) algorithm to do so, because then we can visualise the selected timeseries as actual datapoints rather than imputed ones.

This algorithm has one crucial drawback: it cannot decide, by itself, how many clusters to look for. Therefore, we need to do some hyperparameter tuning. We give a range of acceptable numbers of clusters, and choose the final number of clusters on the basis of the Davies-Bouldin score of the resulting clusters.


cluster_select <- function(data, try_clusters){
  clusterings <- lapply(try_clusters, function(x) cluster::pam(data, x))
  db_values <- sapply(seq_along(clusterings), function(x) 
    clusterCrit::intCriteria(data, as.integer(clusterings[[x]]$clustering),
                c("Davies_Bouldin")))
  plot <- ggplot(data.frame(Clusters = try_clusters, db = unlist(db_values)),
         aes(Clusters, db)) +
    geom_line(size = 1) +
    geom_point(size = 3) +
    theme_bw()
  
  num_clusters_index <- which.min(unlist(db_values))
  num_clusters <- try_clusters[num_clusters_index]
  
  return(list(plot = plot,
              clusterings = clusterings,
              num_clusters_index = num_clusters_index,
              num_clusters = num_clusters))
}

Now that we have a function to select the number of clusters, let’s run it on each of the actual data metrics!


try_clusters <- seq(8, 16, 2)

ts_clusters <- purrr::map(ts_matrices, cluster_select, try_clusters)

For the raw values, we selected 10 clusters; for the inflation-adjusted ones, we chose 10, and for the inflation-adjusted figures per child, we chose 10.

However, to properly understand what these clusters mean, we need to plot them. First, we have to assign the cluster assignments back onto the data.


prepare_data <- function(metric){
  data <- get(metric)
  raw_data <- ts_matrices[[metric]]
  clusterings <- ts_clusters[[metric]]$clusterings
  num_clusters_index <- ts_clusters[[metric]]$num_clusters_index
  data_plot_t <- data.frame(class = as.factor(clusterings[[num_clusters_index]]$clustering),
                            data) %>%
    tidyr::pivot_longer(cols = starts_with(metric),
                        names_to = 'variable',
                        values_to = 'value') %>%
    dplyr::mutate(measure = gsub('[a-zA-Z ]*_', '', ts_id),
                  Time = as.integer(gsub(paste0(metric, '_'), '', variable)),
                  ID = rep(1:nrow(raw_data), ncol(raw_data)))
  return(data_plot_t)
}

metrics <- c('raw', 'inf_adj', 'inf_adj_perchild')
data_plot <- purrr::map(metrics, prepare_data) %>%
  setNames(metrics)

With the data prepared, we have to also save the medoids of each cluster separately, so we can clearly plot what trend the cluster predicts.


prepare_centers <- function(metric){
  clusterings <- ts_clusters[[metric]]$clusterings
  num_clusters_index <- ts_clusters[[metric]]$num_clusters_index
  num_clusters <- ts_clusters[[metric]]$num_clusters
  centers_t <- data.frame(clusterings[[num_clusters_index]]$medoids,
                          class = 1:num_clusters) %>%
    tidyr::pivot_longer(cols = starts_with(metric),
                        names_to = 'variable',
                        values_to = 'value') %>%
    dplyr::arrange(variable, class) %>%
    dplyr::mutate(Time = as.integer(gsub(paste0(metric, '_'), '', variable)),
                  ID = value)
  return(centers_t)
}

centers <- purrr::map(metrics, prepare_centers) %>%
  setNames(metrics)

Now that we have both the data and the centers saved, we can plot them on top of each other. We create lines for each state’s timeseries for each type of expenditure, and colour those lines by the type of expenditure. After, we top the cluster medoids on top so we can see what the overarching pattern is that the PAM algorithm picked up on.


for(i in metrics){
  print(
    ggplot(data_plot[[i]], aes(Time, value, group = ts_id)) +
      facet_wrap(~class, ncol = 2, scales = "free_y") +
      geom_line(aes(colour = measure), alpha = 0.65)+
      scale_color_manual(values = pals::cols25(n = 25)) +
      geom_line(data = centers[[i]], aes(Time, value, group= NULL), 
                color = "firebrick1", alpha = 0.8, size = 1.2) +
      theme_bw()+
      labs(title = 'Different types of child-related US expenditure',
           subtitle = paste0('Values are ', i, '. Heavy red line indicates cluster medoids.'),
           x = 'Year',
           y = 'Value')+
      theme(legend.text = element_text(size = 6))
  )
}

Clearly, the trends are very similar for raw and inflation-adjusted figures. However, transforming the figure to be per child results in slightly different groupings.

However, the powerful thing here is that some clusters are just one colour - for example unemployment has a very characteristic shape, with a spike around the 2008-2010 recession - and some clusters have grouped together different colours. These types of spending, therefore, are correlated: their spending goes up and down in similar ways. Hence, without knowing anything about the content of these spending types, we are able to group them by domain or political leaning.

To see more clearly, we can count how many timeseries from each measure appear in each cluster. Some clusters are more ‘pure’, containing only one or two types of spending; other clusters are more mixed, and may start to reflect how different states made different decisions. This table is based on the raw (not inflation-adjusted) data.


knitr::kable(data_plot[[1]] %>%
    dplyr::group_by(measure, class) %>%
    dplyr::summarise(number = dplyr::n()) %>%
    tidyr::pivot_wider(id_cols = measure,
                       names_from = 'class',
                       values_from = 'number',
                       values_fill = 0))
measure 1 2 4 5 6 3 7 9 8 10
addCC 1020 0 0 0 0 0 0 0 0 0
CHIP 0 80 460 80 400 0 0 0 0 0
CTC 0 620 0 0 40 360 0 0 0 0
edservs 0 240 100 20 140 400 100 0 0 0
edsubs 0 140 340 140 300 40 0 40 0 0
fedEITC 0 540 0 0 140 340 0 0 0 0
fedSSI 0 460 40 0 60 460 0 0 0 0
HCD 0 300 180 100 300 140 0 0 0 0
HeadStartPriv 0 580 0 0 40 380 20 0 0 0
health 0 300 80 20 540 80 0 0 0 0
highered 0 440 20 0 220 340 0 0 0 0
lib 0 520 0 0 80 400 20 0 0 0
othercashserv 0 380 20 0 240 380 0 0 0 0
parkrec 0 600 0 0 80 340 0 0 0 0
pell 0 40 220 560 200 0 0 0 0 0
PK12ed 0 660 20 0 40 300 0 0 0 0
pubhealth 0 280 80 20 220 380 40 0 0 0
SNAP 0 100 240 60 540 80 0 0 0 0
socsec 0 80 0 0 0 940 0 0 0 0
stateEITC 180 20 120 80 40 60 20 20 0 0
TANFbasic 0 0 0 0 0 0 1020 0 0 0
unemp 0 0 0 0 20 120 20 0 860 0
wcomp 40 140 0 40 20 280 340 0 0 20

To check this fully automated work, you can now go back to the tidykids codebook and see if we agree with the domain groupings the algorithm has made!