Detecting outliers in plant threat profiles with isolation forests
This week’s data can be found here. It describes different categories of threats to endangered plants, and the conservation actions taken for them. I explicitly renamed the columns to make them clearer.
library(magrittr) # for the %>% pipe; most other functions are namespaced
library(ggplot2)  # ggplot(), geom_tile() etc. are used unqualified below

plants <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/plants.csv') %>%
  dplyr::rename('Threat: Agriculture & Aquaculture' = threat_AA,
                'Threat: Biological Resource Use' = threat_BRU,
                'Threat: Commercial Development' = threat_RCD,
                'Threat: Invasive Species' = threat_ISGD,
                'Threat: Energy Production & Mining' = threat_EPM,
                'Threat: Climate Change' = threat_CC,
                'Threat: Human Intrusions' = threat_HID,
                'Threat: Pollution' = threat_P,
                'Threat: Transportation Corridor' = threat_TS,
                'Threat: Natural Systems Modifications' = threat_NSM,
                'Threat: Geological Events' = threat_GE,
                'Threat unknown' = threat_NA)
For this blog post, I will use isolation forests to detect outliers in this data.
I would expect different types of threat to cluster into groups. Are there any plants that are threatened in unique ways, or in unique combinations of ways? Isolation forests find data points that are easily separated from the rest of the data, and so give an indication of how distinct each data point is from the others. This lets you quantify ‘outlier-ness’.
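As a quick toy illustration of the idea (my own example, using the same isotree package we fit the real model with below), a point far away from a tight cluster gets a much higher score than the clustered points:

# Toy data: 100 points in a tight cluster, plus one point far away
set.seed(42)
toy <- data.frame(x = c(rnorm(100), 10),
                  y = c(rnorm(100), 10))
toy_forest <- isotree::isolation.forest(toy, ntrees = 150, nthreads = 1)
toy_scores <- predict(toy_forest, toy)
# The isolated 101st point should score well above any clustered point
toy_scores[101] > max(toy_scores[1:100])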
First, we select only the relevant data (i.e. the threat columns). The data is already on a common scale, since every column is binary (0/1). We’ll have to be careful not to re-sort the rows, so that we can match the raw data (which contains only the threat columns) back to the name lookup for each plant.
raw_data <- dplyr::select_at(plants, dplyr::vars(tidyselect::starts_with('Threat')))
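Since matching scores back to plants relies purely on row order, a quick sanity check (my own addition) is cheap insurance:

# raw_data must have one row per plant, in the original order,
# and every value should be a 0/1 flag (this assumes no missing values)
stopifnot(nrow(raw_data) == nrow(plants))
stopifnot(all(unlist(raw_data) %in% c(0, 1)))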
We then fit an isolation forest to the data. The isotree package can handle dataframes as input, so I don’t need to convert the raw data.
first_trees <- isotree::isolation.forest(raw_data, ntrees = 150, nthreads = 1)
# Apply isolation forest to data to acquire outlier scores
plants$outlier_score <- predict(first_trees, raw_data)
What do the outlier scores look like? We would expect most of the scores to be fairly low, with only a small number of high-scoring outliers.
hist(plants$outlier_score)
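The forest is randomised, so exact values will vary between runs, but a quick numeric summary (a small addition of mine) tells the same story as the histogram:

# Most scores sit in the bulk of the distribution; the top percentile
# is where the interesting outliers live
summary(plants$outlier_score)
quantile(plants$outlier_score, probs = c(0.50, 0.90, 0.99))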

Which plants have the highest outlier scores?
plants %>%
  dplyr::arrange(-outlier_score) %>%
  head() %>%
  dplyr::select(1:5)
# A tibble: 6 x 5
binomial_name country continent group year_last_seen
<chr> <chr> <chr> <chr> <chr>
1 Myriophyllum axilli~ Madagasc~ Africa Flowering~ 1920-1939
2 Vanvoorstia bennett~ Australia Oceania Algae Before 1900
3 Diplazium laffanian~ Bermuda North Amer~ Ferns and~ 1900-1919
4 Cynorkis marojejyen~ Madagasc~ Africa Flowering~ 1940-1959
5 Musa zaifui China Asia Flowering~ 2000-2020
6 Thelethylax isalens~ Madagasc~ Africa Flowering~ 1900-1919
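To see what makes the top plant unusual, we can pull out its recorded threats (a quick check I’ve added, reusing the renamed columns):

# Which threats are flagged for the highest-scoring plant?
plants %>%
  dplyr::arrange(-outlier_score) %>%
  head(1) %>%
  dplyr::select_at(dplyr::vars(tidyselect::starts_with('Threat'))) %>%
  tidyr::pivot_longer(cols = dplyr::everything(),
                      names_to = 'threat',
                      values_to = 'present') %>%
  dplyr::filter(present == 1)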
Which column is the most correlated with outlier-ness?
# Select the outlier score and all the threat columns
cor_cols <- c('outlier_score', colnames(plants)[grepl('Threat', colnames(plants))])

plants %>%
  dplyr::select_at(cor_cols) %>%
  dplyr::rename('Outlier score' = outlier_score) %>%
  cor() %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  tidyr::pivot_longer(cols = 2:14,
                      names_to = 'var_2',
                      values_to = 'correlation') %>%
  ggplot(aes(x = rowname, y = var_2, fill = correlation)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(x = '',
       y = '')
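The heatmap gives the full picture, but to answer the question directly we can also just sort the correlations with the outlier score (a small addition of mine):

# Correlation of each threat column with the outlier score, sorted
threat_cols <- colnames(plants)[grepl('Threat', colnames(plants))]
sort(cor(plants[threat_cols], plants$outlier_score)[, 1], decreasing = TRUE)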

To visualise a multidimensional dataset, we can first project the data onto its first two principal components, which are calculated to preserve as much of the variance as possible. However, isolation forests are a tree-based approach, so they may separate points in non-linear ways that principal components cannot easily capture. Still, we’d expect outlier scores to be higher towards the outside of the ‘data cloud’.
pca_fit <- prcomp(raw_data, scale. = TRUE)
pca_cols <- pca_fit$x[, 1:2]
pca_data <- as.data.frame(cbind(plants, pca_cols))
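It is worth checking how much variance the first two components actually preserve; if the proportion is low, the two-dimensional picture is only a rough sketch of the full threat space:

# Proportion of total variance captured by the first two components
summary(pca_fit)$importance['Proportion of Variance', 1:2]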
Now that we have calculated the principal components, we can visualise how the outlier scores are distributed across the two main axes of the data.
ggplot(pca_data, aes(x = PC1, y = PC2, colour = outlier_score)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_minimal() +
  labs(colour = 'Outlier score')

Indeed, most of the high-scoring points sit near the edge of the two-PC plot. However, there are a few right in the middle of the cloud. The tree-based approach paid off: it has found some interesting points that a linear approach would not have flagged!
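If you want to chase those central points up, one rough way (the score and distance cutoffs below are eyeballed from the plot, not principled) is to filter for high scores near the PCA origin:

# High-scoring plants that sit near the centre of the PCA cloud;
# the 0.6 and 1.0 thresholds are arbitrary, chosen by eye
pca_data %>%
  dplyr::filter(outlier_score > 0.6, sqrt(PC1^2 + PC2^2) < 1) %>%
  dplyr::select(binomial_name, outlier_score)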