i3dLoc: Image-to-range Cross-domain Localization

We present a method for localizing a single camera with respect to a point cloud map in indoor and outdoor scenes. The problem is challenging because correspondences of local invariant features are inconsistent across the domains between image and 3D. The problem is even more challenging as the method must handle various environmental conditions such as illumination, weather, and seasonal changes. Our method can match equirectangular images to the 3D range projections by extracting cross-domain symmetric place descriptors. Our key insight is to retain condition-invariant 3D geometry features from limited data samples while eliminating the condition-related features by a designed Generative Adversarial Network. Based on such features, we further design a spherical convolution network to learn viewpoint-invariant symmetric place descriptors. We evaluate our method on extensive self-collected datasets, which involve Long-term(variant appearance conditions), Large-scale (up to 2km structure/unstructured environment), and Multistory (four-floor confined space). Our method surpasses other current state-of-the-arts by achieving around 3 times higher place retrievals to inconsistent environments, and above 3 times accuracy on online localization. To highlight our method’s generalization capabilities, we also evaluate the recognition across different datasets. With one single trained model, i3dLoc can achieve reliable visual localization under random conditions and viewpoints.

The framework of the proposed i3dLoc. It is the combination of two generator networks between 2D imagery domain and range prediction domain, a classification module estimating the environmental conditions, and a discriminator module distinguishing the generated range predictions from the real LiDAR projections.

Mobile robots and self-driving cars have entered our daily life in the recent years with the development of High-Definition maps-based accurate localization. Cameras have the huge potential to provide low-cost, compact and self-contained visual localization against point cloud maps. However, visual methods are inherently limited by inconsistent environmental conditions in the real world, e.g., illumination, weather, season and viewpoint differences. Whereas, accurate matching can be challenging to perform on point cloud data due to sensor sparsity with no sufficient texture feature guarantees. Transitional geometry-based methods implicitly assume a static environment, such as stable lighting conditions, sunny weather, and fixed seasonal attributes. Recent learning-based visual localization methods are either constrained under limit environments(structure road) or only fit for limited viewpoints (forwards or backwards on the street). Current image-to-range localization methods are difficult to leverage in real-world applications, or can hardly address the above issues simultaneously.

The major contributions of i3dLoc are:

We put forth a new end-to-end large-scale visual localization method with the assistance of offline 3D maps, providing reliable 3D localization.
We introduce a Generative adversarial Network (GAN) based cross-domain transfer learning network to extract condition-invariant features while eliminating the condition-related factors.
We design an innovative symmetric feature learning network performing on spherical convolution networks, where intrinsic characteristics of a spherical harmonica naturally help place descriptor matching under variant viewpoints.
We design an evaluation framework (includes Long-term, Large-scale and Multistory datasets), which can analyze the visual localization performance under significant environmental appearance changes, casual viewpoints and also the generalization ability for unseen datasets.

Cross Domain Image Transfer

To generate constant geometry features from visual inputs under different environmental conditions, we construct a cross domain transfer network between 2D imagery and 3D range projections. Before we introduce the details, we will introduce the visual features from the point view of information entropy. Naturally, the condition- related feature (illumination, weather, seasons) and invariant features (geometry) in the image domain are tightly coupled. The above figure shows the relationship between geometry and conditional features. H and I are the joint entropy and the mutual entropies conditioned on the given data samples x in the image domain. Our domain transfer network can extract the geometry features for place recognition.

Key Results

The training and evaluation dataset includes:

Long-term Dataset: we create 15 long-term dataset by generating trajectories in 2020\06~2021\01 with variant season, weather changes. Distance for each trajectory is around 200~400m. Trajectories 1~10 for training, and 11~15 for evaluation.
Large-scale Dataset: we create a large-scale outdoor dataset with 8 trajectories by traversing 1.5~2km routes under structured/unstructured outdoor environments. Trajectories 1~6 for training procedure, and trajectories 7~8 for evaluation.
Multistory Dataset: we create a indoor dataset by traversing 8 trajectories within a multi-floor area under daytime/nighttime. The average distance for indoor routines is 100~150m. We use trajectories 1~6 for network training, and 7~8 for evaluation.

The dataset are collected with a LiDAR (VLP16) device, an 360 camera and an inertial measurement unit (IMU). We gathered the indoor and outdoor datasets via recording the raw data from above sensors. Since lacking the ground truth position in indoor or other GPS-denied environments, we use the LiDAR odometry outputs as the ground truth estimation.

Publications

Peng Yin, Lingyun Xu, Sebastian Scherer “i3dLoc: Image-to-range Cross-domain Localization Robust to Inconsistent Environmental Conditions”, Robotics: Science and Systems 2021.

Contact

Peng Yin: (pyin2 [at] andrew [dot] cmu [dot] edu)
Sebastian Scherer: (basti [at] cmu [dot] edu)