- Natural sciences
- Artificial intelligence not elsewhere classified
- Computer vision
Machine perception for autonomous driving integrates multiple sensors to overcome sensing limitations: cameras for full scene coverage; radar and LiDAR for distance sensing and for coping with adverse weather or poor visibility; radar and, in the future, LiDAR for measuring longitudinal motion; thermal cameras for night vision and object distinction. Similar multi-sensor perception systems are integrated into drones, ships, and automated guided vehicles in factories.
Real-time road user detection and tracking in today's driver assistance systems rely on sophisticated deep learning, executed independently per sensor and integrated into a smart sensor. Such smart sensors output geometric primitives (e.g., bounding boxes) and high-level object descriptors (e.g., object class). However, this late fusion paradigm forces individual smart sensors to decide early on what information (not) to transmit to the fusion center and is far from optimal. Early fusion instead fuses unprocessed data directly by jointly processing radar, LiDAR, and camera data in a deep neural network (DNN). Early fusion can outperform late fusion because it can exploit weak evidence, which never reaches the fusion center in the case of late fusion. Nevertheless, early fusion is unsuitable for autonomous driving: it handles all data together and is therefore non-scalable, since the data rate grows rapidly with each additional sensor. Moreover, it lacks flexibility, requiring complete retraining when adding, removing, or upgrading sensors, and it is not robust to sensor failures.
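To make the contrast concrete, the minimal PyTorch sketch below places the two paradigms side by side. The module names, the toy six-number detection format, and the tiny fusion and backbone heads are illustrative assumptions, not the proposal's architecture.

```python
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Each smart sensor runs its own DNN; only compact high-level detections
    (toy: 4 box coordinates + 1 score + 1 class) reach the fusion center."""

    def __init__(self, cam_net: nn.Module, radar_net: nn.Module, lidar_net: nn.Module):
        super().__init__()
        self.cam_net, self.radar_net, self.lidar_net = cam_net, radar_net, lidar_net
        # The fusion center only ever sees 3 x 6 numbers per object,
        # so weak evidence discarded by a smart sensor is lost for good.
        self.fusion_head = nn.Linear(3 * 6, 6)

    def forward(self, cam, radar, lidar):
        dets = [self.cam_net(cam), self.radar_net(radar), self.lidar_net(lidar)]
        return self.fusion_head(torch.cat(dets, dim=-1))


class EarlyFusion(nn.Module):
    """One joint DNN processes all raw data: weak evidence survives, but the
    input grows with every added sensor and any sensor change forces a full
    retrain."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 6),
        )

    def forward(self, cam, radar, lidar):
        # Assumes all modalities were projected onto a common spatial grid.
        return self.backbone(torch.cat([cam, radar, lidar], dim=1))
```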
The proposal instead puts forth cooperative sensor processing for fusion in multimodal sensor systems. This innovative paradigm does not refer to multi-agent cooperative perception but to enhancing local sensor processing in DNN-equipped smart sensors with a limited amount of high-level context information from other sensors or the fusion center. The context includes (1) preliminary candidate objects found by the other sensors; (2) the confidence the other sensor has in them; (3) the sensing conditions under which the other sensor operates (e.g., camera reliability depends on light conditions); and (4) the scene context.
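A minimal sketch of such a context-conditioned smart sensor is given below. The four context inputs mirror items (1)-(4) above; all names, dimensions, and the simple concatenation scheme are illustrative assumptions rather than the proposed design.

```python
import torch
import torch.nn as nn


class CooperativeSmartSensor(nn.Module):
    """A smart sensor whose local DNN is conditioned on a small, high-level
    context vector received from a peer sensor or the fusion center."""

    def __init__(self, feat_dim: int = 64, ctx_dim: int = 4 + 1 + 3 + 8):
        super().__init__()
        # Local backbone on the sensor's own raw data (e.g., a camera image).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Context encoder for: (1) candidate object (toy 4-d box), (2) peer
        # confidence, (3) peer sensing conditions, (4) scene context.
        self.ctx_encoder = nn.Sequential(nn.Linear(ctx_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(2 * feat_dim, 6)  # toy detection output

    def forward(self, raw, candidate_box, peer_conf, peer_conditions, scene_ctx):
        ctx = torch.cat([candidate_box, peer_conf, peer_conditions, scene_ctx], dim=-1)
        feats = torch.cat([self.backbone(raw), self.ctx_encoder(ctx)], dim=-1)
        return self.head(feats)


# Toy usage: one camera frame plus context from a radar peer.
sensor = CooperativeSmartSensor()
out = sensor(
    raw=torch.randn(1, 3, 64, 64),
    candidate_box=torch.randn(1, 4),   # (1) preliminary candidate object
    peer_conf=torch.rand(1, 1),        # (2) peer's confidence in it
    peer_conditions=torch.rand(1, 3),  # (3) peer's sensing conditions
    scene_ctx=torch.randn(1, 8),       # (4) scene context
)
```

Only a few dozen context values cross the sensor boundary, so the bandwidth and retraining drawbacks of early fusion do not apply.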
Items (3) and (4) are novel and important: they allow one sensor to assess how much to trust the context the other sensor provides. For instance, if a radar asserts with high confidence that pedestrians are present but the camera does not see them, the camera can learn that some pedestrians are not visible in specific scene locations (e.g., behind the corner of a building). Then, instead of outputting with high confidence that no road user is present, the camera outputs that it cannot decide, thereby learning the scene context. In another example, Doppler radar has difficulty distinguishing an empty scene from an immobile pedestrian. When a camera detects a person, it can easily offer additional context to the radar on whether the pedestrian is in motion. This context is highly valuable for radar, enabling it to learn new features to handle such situations more effectively.
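The trust mechanism behind the first example can be illustrated with a hand-coded toy rule; in the proposal this behaviour would be learned by the DNN, and the scores, reliability weight, and thresholds below are purely illustrative assumptions.

```python
def camera_decision(local_score: float, radar_score: float,
                    radar_reliability: float, undecided_band: float = 0.3) -> str:
    """Combine the camera's own evidence with a radar assertion, weighted by
    how much the radar can be trusted under the current sensing conditions."""
    fused = local_score + radar_reliability * (radar_score - local_score)
    if abs(fused - 0.5) < undecided_band / 2:
        # Peer context conflicts with local evidence: report "cannot decide"
        # instead of a confident (and possibly wrong) "no road user present".
        return "undecided"
    return "pedestrian" if fused >= 0.5 else "no pedestrian"


# Camera sees nothing (0.05), but a moderately reliable radar is very
# confident (0.95): the camera withholds a confident "no pedestrian".
print(camera_decision(local_score=0.05, radar_score=0.95, radar_reliability=0.6))
# -> "undecided"
```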