Rstats and Recipes: Clustering with minimum and maximum size

I was lucky to get a place on Techionista’s programme to help women get into tech roles, in collaboration with Microsoft. As part of Microsoft’s Professional Programme in Data Science, we had to tackle real business problems using data science techniques.

The challenge we were tackling is improving how event visitors can travel to and from events more efficiently, so there is less clogging of transport links. Our idea: provide a coach service that allows visitors to ‘carpool’ together. This way, the event can start the minute they get on the bus, and we may reduce the number of cars on the road 30-fold!

However, this leaves one problem: where do the buses pick up people?

This is a clustering problem: which groups of visitors live close to each other? However, most clustering algorithms do not allow constraints on the maximum size of clusters. Buses do have a maximum capacity! So we set out to design an algorithm that finds the most optimal clusters, where each individual cluster has a minimum of 30 individuals, and a maximum of 59. By finding these groups, we can optimise pick-up locations!

Loading and assessing data

This demonstration is based on the scenario that an event is taking part in Amsterdam, and visitors are to be allocated to bus seats to be taken to the event.

We acquired a theoretical dataset for visitors to a particular event, and where in the Netherlands they may be ordinarily based.


      lat             lon            number       
 Min.   :50.77   Min.   :3.401   Min.   :  1.000  
 1st Qu.:52.07   1st Qu.:4.751   1st Qu.:  1.000  
 Median :52.30   Median :4.920   Median :  2.000  
 Mean   :52.22   Mean   :5.045   Mean   :  2.028  
 3rd Qu.:52.38   3rd Qu.:5.255   3rd Qu.:  2.000  
 Max.   :53.45   Max.   :7.193   Max.   :193.000

Note: The data is structured such that there are some longitude / latitude combinations with multiple people, so we should duplicate those rows so seats are appropriately allocated.

National visitor distribution

Let’s have a look where the visitors come from.

We see there are many visitors from near Amsterdam, who are unlikely to want to make use of the bus transport there. What does the distribution of distance-from-Amsterdam look like?

Indeed, it looks like nearby people are overrepresented: however, they are unlikely to make use of a coach service. Thus, we want to remove these from our analysis. As a rule of thumb, we remove all visitors that live within a 35km radius of central Amsterdam.

Clustering algorithm

We use repeated hierarchical clustering. This algorithm creates a dendrogram, combining those datapoints that are closest together one at a time. Therefore, one may cut this tree at any height to create a set number of clusters.

For the clustering, we use a random sample of 20% of the data: this is not only an estimate of likely demand for this service, but also cuts down on running time of the algorithm itself.

We create repeated hierarchical clusterings, cut at different heights, to arrive at a dendrogram where the maximum cluster size is 59 (the number of seats per bus). The clusters with a reasonable number of visitors in it are saved, and the clustering algorithm is run again on the remaining visitors. This is repeated until all visitors are allocated a bus.


[1] "Number of clusters in this round:  865"
[1] "Round  1  is over"
[1] "Number of clusters in this round:  518"
[1] "Round  2  is over"
[1] "Number of clusters in this round:  366"
[1] "Round  3  is over"
[1] "Number of clusters in this round:  227"
[1] "Round  4  is over"
[1] "Number of clusters in this round:  128"
[1] "Round  5  is over"
[1] "Number of clusters in this round:  62"
[1] "Round  6  is over"
[1] "Number of clusters in this round:  21"
[1] "Round  7  is over"
[1] "Number of clusters in this round:  5"
[1] "Round  8  is over"
[1] "Number of clusters in this round:  2"
[1] "Round  9  is over"
[1] "Total running time was:  29.5325260162354"

Results

We first see how many people were allocated per bus, and how many buses are needed for this particular event.


# A tibble: 128 x 2
   clustnum count
   <chr>    <int>
 1 1 - 2       36
 2 1 - 9       54
 3 10 - 1      43
 4 10 - 6      40
 5 100 - 1     45
 6 100 - 3     35
 7 101 - 4     42
 8 11 - 5      47
 9 11 - 6      39
10 11 - 7      33
# ... with 118 more rows


[1] 128

We can plot the groups of visitors, distributed across the country.

This plot shows the center of each cluster, which would be the optimal place for a pick-uppoint for that group.