Abstract: The human long-term memory known as Episodic Memory stores sequences of events (episodes in time and space) experienced by an individual and subsequently allows access to them, in whole or in part. This ability provided by our Episodic Memory system is important, among other functions, for our localisation in space over time. Inspired by these capabilities, we investigate the use of spatio-temporal Machine Learning methods for the Global Localisation problem of Autonomous Vehicles.
State-of-the-art Global Localisation methods rely heavily on the Global Positioning System (GPS) infrastructure and other expensive sensors, although there is an increasing demand for alternative global localisation methods in GPS-denied environments. One such alternative is known as appearance-based global localisation, which associates images of places with their corresponding positions. This is very appealing given the great number of geotagged photos publicly available and the ubiquity of devices fitted with ultra-high-resolution cameras, motion sensors and multicore processors. Appearance-based global localisation can be devised as a topological or a metric solution, depending on whether it is modelled as a classification or a regression problem, respectively. Common topological approaches to the global localisation problem often involve solutions in the spatial dimension and, less frequently, in the temporal dimension, but not both simultaneously.
We propose an integrated spatio-temporal solution based on an ensemble of kNN classifiers, where each classifier uses Dynamic Time Warping (DTW) and the Hamming distance to compare binary features extracted from sequences of images. Each base learner is fed with its own set of binary features extracted from the images. The solution is designed to solve the global localisation problem in two phases: mapping and localisation. During mapping, it is trained with a sequence of images and associated locations that represent episodes experienced by an autonomous robot. During localisation, it receives subsequences of images of the same environment and compares them to its previously experienced episodes, trying to recollect the most similar experience in time and space at once. The system then outputs the positions where it believes these images were captured.
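As a minimal sketch of this retrieval scheme, a single base learner can be written as DTW over sequences of binary descriptors, with the Hamming distance as the local cost; the function names and the 1-NN simplification here are illustrative assumptions, not the thesis's implementation:

```python
import numpy as np

def hamming(a, b):
    # Hamming distance between two binary feature vectors
    return int(np.count_nonzero(a != b))

def dtw_distance(seq_a, seq_b):
    """DTW alignment cost between two sequences of binary descriptors,
    using the Hamming distance as the local (per-frame) cost."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = hamming(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_localise(query_seq, episodes, labels):
    """1-NN over mapped episodes: return the place label of the episode
    whose image sequence is closest to the query under DTW."""
    dists = [dtw_distance(query_seq, ep) for ep in episodes]
    return labels[int(np.argmin(dists))]
```

In the ensemble, each base learner would run this comparison on its own binary feature set, with the final position decided by voting across learners.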
Although the method is fast to train, its runtime scales linearly with the number of training samples, since the Hamming distance must be computed between each test sample and every training sample. Often, while building a map, one collects highly correlated and redundant data around the environment of interest, typically due to high-frequency sensors or repeated trajectories. If not treated appropriately during the mapping phase, this extra data would impose an undesired burden on memory and runtime performance at test time. To tackle this problem, we employ clustering algorithms to compress the network's memory after mapping. For large-scale environments, we combine the clustering algorithms with a multi-hashing data structure, seeking the best compromise between classification accuracy, runtime performance and memory usage.
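The compression step can be illustrated with a greedy leader-style clustering over the stored binary features; this is a simple stand-in for the clustering algorithms mentioned above (the function name and the Hamming-radius criterion are assumptions for illustration):

```python
import numpy as np

def compress_memory(features, poses, radius):
    """Keep one representative binary feature per cluster, discarding
    any sample that lies within `radius` Hamming distance of an
    already-kept representative (greedy leader clustering)."""
    kept_feats, kept_poses = [], []
    for f, p in zip(features, poses):
        if all(np.count_nonzero(f != k) > radius for k in kept_feats):
            kept_feats.append(f)
            kept_poses.append(p)
    return kept_feats, kept_poses
```

Redundant samples collected along repeated trajectories collapse onto a single representative, shrinking both the memory footprint and the number of Hamming comparisons at test time; a multi-hashing index over the kept representatives would then accelerate lookup further.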
So far, this encompasses only the topological part of the global localisation problem, which is not precise enough for autonomous car operation. Rather than just recognising places and outputting an associated pose, a global localisation system should regress a pose given a current image of a place. However, regressing poses directly for city-scale scenes is infeasible, at least at sub-decimetric precision. Our approach to this problem is as follows: first, a live image from the camera is given to the aforementioned localisation system, which returns the most similar image-pose pair from a topological database built during the mapping phase. Then, given the live and mapped images, an odometry system outputs the relative pose between them. To solve the odometry problem, we trained a Convolutional Neural Network (CNN) following the Siamese architecture design, which takes as input two images separated in time and space and outputs a 6D pose vector representing the relative pose between the input images. Together, both systems solve the global localisation problem using topological and metric information to achieve average sub-decimetric precision.
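The final metric estimate combines the absolute pose retrieved from the topological map with the relative pose regressed by the Siamese network. A sketch of this composition, simplified to SE(2) poses [x, y, yaw] instead of the full 6D vector (the function name and the planar simplification are ours):

```python
import numpy as np

def compose_pose(map_pose, rel_pose):
    """Compose the absolute pose retrieved from the topological map with
    the relative pose regressed by the odometry network, expressing the
    relative translation in the map pose's frame."""
    x, y, th = map_pose
    dx, dy, dth = rel_pose
    gx = x + dx * np.cos(th) - dy * np.sin(th)
    gy = y + dx * np.sin(th) + dy * np.cos(th)
    yaw = (th + dth + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return np.array([gx, gy, yaw])
```

For example, a mapped pose facing "north" (yaw = pi/2) with a regressed forward offset of 1 m yields a global estimate 1 m north of the mapped position.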