Using learning-to-rank to precisely locate where to deliver packages
Models adapted from information retrieval deal well with noisy GPS input and can leverage map information.
For delivery drivers, finding the doorstep where a package should be dropped off can be surprisingly hard. House numbers can be obscured by foliage, or they might be missing entirely; some neighborhoods use haphazard numbering systems that make house numbers hard to guess; and complexes of multiple buildings sometimes share a single street address.
Having the correct latitude-longitude coordinates of the customer’s doorstep would make delivery more efficient, but that information is hard to come by. When a driver confirms a delivery, our app records the current GPS location, which could be anywhere between the customer’s door and the Prime delivery van parked down the road. Plus, in city “canyons”, where line of sight to GPS satellites is severely limited, the GPS measurement error can be substantial.
In a paper I’m presenting at the European Conference on Machine Learning (ECML), I adapt an idea from information retrieval — learning-to-rank — to the problem of predicting the coordinates of a delivery location from past GPS data.
In experiments, I compared the new method to two other approaches to the problem — centroid computation and kernel density estimation (KDE) — and found that the new method significantly outperformed its predecessors. On delivery data from New York state, the error for the new method was far less than that for KDE, the best-performing baseline.
In the information retrieval context, learning-to-rank is a way to learn from pairwise preference data. If a search engine presents a ranked list of results, clicking only the third search result implicitly provides two pairwise preferences: the user prefers the third search result to the first result, and the user also prefers the third result to the second result. That provides two labeled preference pairs that can help train a ranking model to improve future search results for other queries.
Similarly, I train a ranking model to select the best point from a set of candidate locations for a particular address. A single labeling click on the best location implies a preference order over (nearly) all pairs of candidate locations; the candidate closest to the labeled location is preferred. Thus, the volume of training pairs produced per labeling click is much greater than in information retrieval. The base binary classifier model takes pairs of points as input and is trained to prefer the one closer to the labeled point.
There’s a difference between the popular learning-to-rank methods in information retrieval and my adaptation of them, however. In the search engine scenario, the algorithm may be sorting through tens of thousands of documents or products to produce a ranking. Although the model is trained by pairwise comparisons, at inference time it doesn’t have enough time to compare each candidate document against all the others. Instead, it works as a regression model to score each candidate independently, and the final ranking is simply sorted by score.
In the geospatial case, however, we typically compute offline, and we usually have fewer than 100 sufficiently distinct candidate locations to consider for each delivery address. This makes it practical to compare each candidate location against all the others at inference time: we select the one that wins the most pairwise comparisons. Experiments showed that this produced better results than regression models trained from pairwise data, such as RankNet.
The primary machine learning model I used was a random forest — an ensemble of, say, 30 decision trees learned from the training data. Each decision tree performs a series of evaluations on a select set of data attributes (features) to produce a score. The average of all the trees’ scores is the model’s overall score for a given input item, which indicates a preference for one or the other of two candidate points.
To generate candidate locations, we first thin out the many GPS delivery locations reported in the past: only a point or two will represent a tight cluster of points. Then we add potential candidates along the faces of nearby buildings.
The feature vector for each candidate includes features based on the density of past GPS measurements in its vicinity, as well as features based on nearby map data. These include things like the distance to the nearest street, the distance to the nearest parking lot, the distance to the nearest building, etc. Among other things, these types of features help the model not to select locations in the middle of a street or parking lot, which a centroid or KDE model might easily do (as in the figure above). There are also context features that help, such as the number of past deliveries and the number of building outlines nearby.
The ability to leverage a variety of informative features explains the ranking model’s advantage over the baselines. Not all areas have reliable map data, but the many addresses on which my model’s performance is strongest probably do. Since the baseline models can’t exploit that information, the gap in performance is greater.
Technically, the first baseline I consider is a mediod method, which selects the candidate location closest to the centroid of the past GPS measurements. Centroid, mediod, and geometric-median methods all make the faux pas of selecting a low-density point in the middle of a multimodal distribution.
The KDE method avoids this by finding dense clusters of past GPS points, but often the ground truth doorway is at the edge of the cluster, not in the middle. Thus, we really need a supervised machine learning method that can leverage many different signals, including information from the online map. The method is somewhat resilient to places in which the map is incomplete — for instance, missing building outlines or even roads.
I also compared the new model against an oracle that always selects the candidate location closest to the true location of the doorway; unless you have an omniscient candidate generator or generate all points, you won’t always have a candidate that’s super close to the ground truth label.
Below are plots showing the performance of the novel ranking model (GeoRank) against the mediod and KDE methods, the oracle, and, as an upper bound, random selection. They are shown for two different sets of data, one for New York state and one for Washington state. The y-axis is loss, so lower is better; the x-axis reveals the whole distribution of losses (instead of a single scalar average per method).
As can be seen, the GeoRank model significantly outperforms the baselines and compares favorably against the oracle. By putting this work into practice, we have improved delivery efficiency substantially, with benefits for both Amazon customers and delivery drivers for Amazon Last Mile. Please read the paper for more detail.