Rstats and Recipes: Astronauts

Tidy Tuesday: 14 July 2020

This week’s data is available from the Tidy Tuesday repository on GitHub; the code below reads it directly from there.


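library(magrittr)  # provides the %>% pipe used throughout
library(ggraph)    # graph plotting; also attaches ggplot2

astronauts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-14/astronauts.csv') %>%
  dplyr::arrange(year_of_mission, mission_title)
head(astronauts)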


# A tibble: 6 x 24
     id number nationwide_numb~ name  original_name sex  
  <dbl>  <dbl>            <dbl> <chr> <chr>         <chr>
1     1      1                1 Gaga~ ГАГАРИН Юрий~ male 
2     2      2                2 Tito~ ТИТОВ Герман~ male 
3     3      3                1 Glen~ Glenn, John ~ male 
4     5      4                2 Carp~ Carpenter, M~ male 
5    10      7                3 Schi~ Schirra, Wal~ male 
6     6      5                2 Niko~ НИКОЛАЕВ Анд~ male 
# ... with 18 more variables: year_of_birth <dbl>, nationality <chr>,
#   military_civilian <chr>, selection <chr>,
#   year_of_selection <dbl>, mission_number <dbl>,
#   total_number_of_missions <dbl>, occupation <chr>,
#   year_of_mission <dbl>, mission_title <chr>, ascend_shuttle <chr>,
#   in_orbit <chr>, descend_shuttle <chr>, hours_mission <dbl>,
#   total_hrs_sum <dbl>, field21 <dbl>, eva_hrs_mission <dbl>,
#   total_eva_hrs <dbl>

I want to make a graph showing each flight as an arc from the start year to the end year. This can be done with the ggraph package.

Unfortunately, the data only contains the year of ascent. For the descending flight, it only contains the shuttle name. As a stand-in, I take the earliest year in which each shuttle appears as an ascent vehicle, since that is the earliest year it could have descended. The result is the ‘node list’: shuttle names and years.


node_list <- unique(dplyr::select(astronauts, year_of_mission, ascend_shuttle)) %>%
  dplyr::filter(!is.na(ascend_shuttle)) %>%
  dplyr::group_by(ascend_shuttle) %>%
  dplyr::summarise(year_of_mission = min(year_of_mission)) %>%
  dplyr::arrange(year_of_mission, ascend_shuttle) %>%
  dplyr::transmute(id = dplyr::row_number(),
                   year_of_mission, ascend_shuttle)
head(node_list)


# A tibble: 6 x 3
     id year_of_mission ascend_shuttle 
  <int>           <dbl> <chr>          
1     1            1961 Vostok 1       
2     2            1961 Vostok 2       
3     3            1962 MA-6           
4     4            1962 Mercury-Atlas 7
5     5            1962 Mercury-Atlas 8
6     6            1962 Vostok 3

With the node list prepared, we can create a row for each space flight we’d like to show. These are called the ‘edges’. I also remove some rows:

Some astronauts returned on shuttles that never appear as an ascent shuttle, and so are missing from the node list.
Some flights were aborted or ended in explosions.

These rows find no matching ID in the node list, so I remove edges with no ‘to’ ID (which would have been provided by the node list).


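# First join: match the ascent shuttle to its node. Columns that clash with
# astronauts get suffixes (id.x / year_of_mission.x from astronauts,
# id.y / year_of_mission.y from the node list).
# Second join: match the descent shuttle to its node. These columns arrive
# unsuffixed (id, year_of_mission).
edge_list <- dplyr::left_join(astronauts, node_list, by = "ascend_shuttle") %>%
  dplyr::left_join(node_list, by = c("descend_shuttle" = "ascend_shuttle")) %>%
  dplyr::transmute(from = id.y,
                   to = id,
                   from_year = year_of_mission.y,
                   to_year = year_of_mission,
                   ascend_shuttle,
                   descend_shuttle,
                   asc_year = year_of_mission.x,
                   # was year_of_mission.y, which is the ascent node's year
                   desc_year = year_of_mission) %>%
  # One row per distinct flight, counting the astronauts on it
  dplyr::group_by_all() %>%
  dplyr::summarise(num_astronauts = dplyr::n()) %>%
  dplyr::ungroup() %>%
  # Drop flights whose descent shuttle is not in the node list
  dplyr::filter(!is.na(to))
head(edge_list)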


# A tibble: 6 x 9
   from    to from_year to_year ascend_shuttle descend_shuttle
  <int> <int>     <dbl>   <dbl> <chr>          <chr>          
1     1     6      1961    1962 Vostok 1       Vostok 3       
2     2     2      1961    1961 Vostok 2       Vostok 2       
3     3     3      1962    1962 MA-6           MA-6           
4     4     4      1962    1962 Mercury-Atlas~ Mercury-Atlas 7
5     5     5      1962    1962 Mercury-Atlas~ Mercury-Atlas 8
6     6     6      1962    1962 Vostok 3       Vostok 3       
# ... with 3 more variables: asc_year <dbl>, desc_year <dbl>,
#   num_astronauts <int>
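
The suffixed column names in the joins can be hard to follow, so here is a minimal sketch on made-up data. The `crew` and `nodes` tables and the shuttle names "Alpha", "Beta" and "Gamma" are hypothetical; only the join pattern mirrors the code above.

```r
library(dplyr)

# Hypothetical toy data (names and values are made up for illustration)
crew <- tibble(
  id              = c(101, 102, 103),
  year_of_mission = c(1990, 1990, 1991),
  ascend_shuttle  = c("Alpha", "Alpha", "Beta"),
  descend_shuttle = c("Alpha", "Beta", "Gamma")  # "Gamma" never ascends
)
nodes <- tibble(
  id              = 1:2,
  year_of_mission = c(1990, 1991),
  ascend_shuttle  = c("Alpha", "Beta")
)

joined <- crew %>%
  left_join(nodes, by = "ascend_shuttle") %>%
  left_join(nodes, by = c("descend_shuttle" = "ascend_shuttle"))

# id.x / year_of_mission.x : original crew columns
# id.y / year_of_mission.y : node matched on the ascent shuttle
# id   / year_of_mission   : node matched on the descent shuttle
names(joined)

# The row descending on "Gamma" finds no node, so its descent id is NA
joined %>% filter(is.na(id))
```

The second join produces unsuffixed columns because, by that point, the earlier clashes have already been renamed to `.x` and `.y`. The final filter shows the kind of row that the `!is.na(to)` step discards.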

To visualise the graph, we have to convert the data into a network visualisation table (tbl_graph), and then activate the ‘edges’ dataframe (so we can make use of the columns in the edge_list dataframe). This is a directed graph, because astronauts always travel into the future!


graph_data <- tidygraph::tbl_graph(nodes = node_list, edges = edge_list, directed = TRUE) %>%
  tidygraph::activate(edges)
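
To check what ends up inside a tbl_graph, it can help to build a very small one and pull each table back out. This is a sketch on hypothetical data (three made-up shuttles); it relies on tidygraph treating integer `from` and `to` values as row positions in the node table, which is why the node list's `id` column was built with `row_number()`.

```r
library(tidygraph)

# Hypothetical toy node and edge tables
toy_nodes <- data.frame(id = 1:3,
                        ascend_shuttle = c("Alpha", "Beta", "Gamma"))
toy_edges <- data.frame(from = c(1L, 2L),
                        to = c(2L, 3L),
                        num_astronauts = c(3L, 1L))

toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)

# activate() chooses which table later verbs (and as_tibble) act on
edge_tbl <- as_tibble(activate(toy_graph, edges))
node_tbl <- as_tibble(activate(toy_graph, nodes))

edge_tbl  # keeps the extra edge column num_astronauts
node_tbl
```

Extra edge columns such as `num_astronauts` travel with the graph, which is what makes them available as aesthetics when plotting.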

The ggraph package allows you to visualise a tbl_graph using the familiar ggplot syntax. A linear layout results in all nodes being set on a line (in our case: a timeline).

To make all the arcs point in a single direction, you use the fold = TRUE argument in geom_edge_arc. I also increase the line width of the arc according to the number of astronauts that went on that particular flight.

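The height of each arc grows with the distance between its two nodes along the line, so a longer gap between ascent and return produces a taller arc. The nodes are evenly spaced in node-list order rather than positioned by year, so the height is only a rough guide to the duration of the flight.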


graph_data %>%
  ggraph(layout = "linear") +
  geom_edge_point(aes(colour = asc_year), size = 0.5)+
  geom_edge_arc(aes(width = num_astronauts),
                alpha = 0.8,
                fold = TRUE)
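
To see where the linear layout places each node, one option is to inspect the layout table directly with ggraph's `create_layout()`. The sketch below uses hypothetical toy data with deliberately uneven years.

```r
library(tidygraph)
library(ggraph)

# Hypothetical toy graph: three nodes with uneven gaps between years
toy_nodes <- data.frame(year_of_mission = c(1961, 1962, 1990))
toy_edges <- data.frame(from = c(1L, 1L), to = c(2L, 3L))
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)

# The layout is a data frame with one row per node and its x/y position
node_positions <- create_layout(toy_graph, layout = "linear")
node_positions[, c("x", "y", "year_of_mission")]
```

With the default settings the nodes sit one unit apart in node order, whatever their `year_of_mission`. An arc therefore spans a number of nodes, not a number of calendar years.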

We can then use ggplot and ggraph functions to change the visual appearance of the chart, to make it more appealing and easier to read.


graph_data %>%
  ggraph(layout = "linear") +
  geom_edge_point(aes(colour = asc_year), size = 0.5)+
  geom_edge_arc(aes(width = num_astronauts),
                alpha = 0.8,
                fold = TRUE)+
  scale_edge_width(range = c(0.1,1.5))+
  theme_void()+ 
  labs(title = 'Ascent and return of astronauts',
       subtitle = 'Colour denotes year (lighter is more recent); line width denotes number of astronauts on journey')+
  theme(legend.position="none")