<aside>
💡 Most automated driving systems comprise a diverse sensor set, including several cameras, radars, and LiDARs, ensuring complete 360° coverage in both near and far regions. Unlike radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable spatial reasoning about other agents and structures for optimal path planning. The 3D space is typically simplified to the BEV space by omitting the less relevant Z-coordinate, which corresponds to the height dimension.
</aside>
The most basic approach for obtaining a BEV from camera data is Inverse Perspective Mapping (IPM): a homography that re-projects the image onto an assumed flat ground plane. It is too simplistic for most scenes and produces severe distortions for anything rising above the ground (vehicles, pedestrians, poles). A minimal sketch is given below.
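A minimal IPM sketch, assuming a flat ground plane; the camera parameters, grid extents, and coordinate conventions here (K, R, t, resolution, x_max, y_min) are hypothetical placeholders, not taken from the source. It builds the ground-plane homography from the intrinsics and extrinsics and warps the perspective image into a metric BEV grid with OpenCV:

```python
import numpy as np
import cv2

def ipm(img, K, R, t, res=0.05, bev_hw=(600, 400), x_max=30.0, y_min=-10.0):
    """Warp a perspective image onto a flat ground plane (z = 0).

    K      : 3x3 camera intrinsics
    R, t   : rotation / translation mapping ground coordinates to camera coordinates
    res    : BEV resolution in metres per pixel
    bev_hw : (rows, cols) of the output BEV image
    x_max  : forward distance (m) mapped to the top BEV row
    y_min  : lateral offset (m) mapped to the left-most BEV column
    """
    # A ground point (X, Y, 0, 1) projects to the image via K @ [r1 r2 t].
    H_ground_to_img = K @ np.column_stack((R[:, 0], R[:, 1], t))

    # BEV pixel (u, v) -> ground metres: X decreases down the rows,
    # Y increases along the columns (ego-centric, X forward / Y left).
    A = np.array([[0.0, -res, x_max],
                  [res,  0.0, y_min],
                  [0.0,  0.0, 1.0]])

    # Full homography from BEV pixels to image pixels.
    H_bev_to_img = H_ground_to_img @ A

    # WARP_INVERSE_MAP: the matrix maps destination (BEV) pixels to source (image) pixels.
    rows, cols = bev_hw
    return cv2.warpPerspective(img, H_bev_to_img, (cols, rows),
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```

Everything above the ground plane violates the z = 0 assumption, which is exactly where IPM's severe distortion comes from: tall objects get smeared away from the camera.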
Advantages of BEV: used for sensor fusion and path planning
- It represents the scene at metric scale, so distances can be read off directly for distance-sensitive tasks such as collision avoidance.
- It explicitly captures occluded regions, allowing subsequent tasks such as path planning and control to handle the ambiguity associated with those regions.
- It is free of perspective distortion, which simplifies reasoning about lane geometry and road markings.
Advantages of PV: used for segmentation and tracking
- PV retains the height dimension, which makes it extremely useful when the car drives through regions with overhanging structures that a height-collapsed BEV cannot represent.
- It allows inference of road rules through the detection of traffic lights and signs; overall, it is very useful for scene understanding.
‼️ There is a disconnect, though: generating a BEV requires depth, which cannot be obtained directly from a PV image.
- Solved using depth estimation networks on PV images or additional range sensors such as LiDAR
Thus BEV fusion becomes the main challenge: how should the information from cameras and LiDAR be combined to generate the BEV representation?
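As a rough illustration of what such a fusion can look like (not a specific published architecture; the tensor shapes, channel counts, and layer choices are assumptions), camera-derived BEV features and a LiDAR BEV grid rasterized to the same resolution can simply be concatenated along the channel dimension and mixed with a small convolutional head:

```python
import torch
import torch.nn as nn

class SimpleBEVFusion(nn.Module):
    """Concatenate camera-BEV and LiDAR-BEV features on a shared grid and mix them."""
    def __init__(self, cam_ch=64, lidar_ch=32, out_ch=128):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # cam_bev:   (B, cam_ch,   H, W)  camera features lifted to the BEV grid
        # lidar_bev: (B, lidar_ch, H, W)  LiDAR points rasterized onto the same grid
        return self.mix(torch.cat([cam_bev, lidar_bev], dim=1))

# Hypothetical 100 m x 100 m grid at 0.5 m/cell -> 200 x 200 cells
fused = SimpleBEVFusion()(torch.randn(1, 64, 200, 200), torch.randn(1, 32, 200, 200))
```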
Techniques
- Multi-stage approaches, where depth estimates are combined with intermediate PV outputs to generate the required BEV representation
- Produces sub-optimal results due to scale inconsistencies in the depth estimates and the sparsity of range-based sensors
- End-to-end learning of the PV-to-BEV transformation (a minimal sketch follows this list)
- Examples: VPN, Cam2BEV, LSS, BEVFormer
- Two types:
- Early fusion : data from every camera is fused and processed together to give the desired output
- Detection of large objects becomes easier
- Predictions in regions where one camera overlaps another become trivial, as no post-processing is needed to reconcile contradictory detections
- Increased computational and memory cost.
- Late fusion: data from every camera is processed to generate a separate output, and the outputs are then fused together
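A toy sketch of the camera-to-BEV transform itself, in the spirit of depth-based lifting with late fusion across cameras; the shapes, grid parameters, and per-camera processing are assumptions for illustration, not any paper's exact method. Each camera's PV features are unprojected with an estimated per-pixel depth and the camera's intrinsics/extrinsics, scattered into a shared ego-centric BEV grid, and the per-camera grids are then combined:

```python
import torch

def lift_camera_to_bev(feat, depth, K, T_cam_to_ego,
                       grid=200, res=0.5, x_min=-50.0, y_min=-50.0):
    """Scatter PV features into a BEV grid using per-pixel depth (toy lift-and-splat).

    feat  : (C, H, W)  perspective-view feature map
    depth : (H, W)     estimated metric depth per feature-map pixel
    K     : (3, 3)     intrinsics of the (downsampled) feature map
    T_cam_to_ego : (4, 4) camera-to-ego transform
    """
    C, H, W = feat.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().reshape(3, -1)

    # Unproject pixels to 3D camera coordinates, then to ego coordinates.
    cam_pts = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_pts = torch.cat([cam_pts, torch.ones(1, cam_pts.shape[1])], dim=0)
    ego_pts = (T_cam_to_ego @ cam_pts)[:3]                      # (3, H*W)

    # Convert ego (X, Y) to BEV cell indices and keep points inside the grid.
    ix = ((ego_pts[0] - x_min) / res).long()
    iy = ((ego_pts[1] - y_min) / res).long()
    valid = (ix >= 0) & (ix < grid) & (iy >= 0) & (iy < grid)

    # Accumulate features of all pixels that land in the same BEV cell.
    bev_flat = torch.zeros(C, grid * grid)
    flat = ix[valid] * grid + iy[valid]
    bev_flat.index_add_(1, flat, feat.reshape(C, -1)[:, valid])
    return bev_flat.view(C, grid, grid)

# Late fusion: lift each camera separately, then average the per-camera BEV grids.
# Early fusion would instead combine the camera features before a single transform.
```

End-to-end methods replace the explicit depth estimate with a learned view transform, e.g. fully connected layers in VPN or attention in BEVFormer.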
Perception Tasks
- 3D object detection: BEV > PV, as BEV offers better spatial and geometric reasoning and addresses scale inconsistency and object occlusion
- BEV segmentation:
- Semantic, instance, and panoptic variants
- Characterized by grid size and grid resolution, which together determine the real-world area covered (see the example below)
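As a concrete (hypothetical) example of the grid parameterization: a 200 × 200 grid at 0.5 m per cell covers a 100 m × 100 m area around the ego vehicle, and metric ego-frame coordinates map to cell indices as follows.

```python
# Hypothetical BEV grid: 200 x 200 cells at 0.5 m/cell -> 100 m x 100 m coverage,
# centred on the ego vehicle (x forward, y left).
GRID_CELLS = 200
RESOLUTION = 0.5                        # metres per cell
EXTENT = GRID_CELLS * RESOLUTION        # 100.0 m covered along each axis

def ego_to_cell(x, y, cells=GRID_CELLS, res=RESOLUTION):
    """Map a metric ego-frame point to (row, col) indices of the BEV grid."""
    row = int((EXTENT / 2 - x) / res)   # forward distance grows towards row 0
    col = int((y + EXTENT / 2) / res)   # lateral offset grows with the column
    if 0 <= row < cells and 0 <= col < cells:
        return row, col
    return None                         # point falls outside the covered area

print(ego_to_cell(10.0, -3.5))          # an object 10 m ahead, 3.5 m to the right
```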
Network Architectures