<aside> 💡 Most automated driving systems comprise a diverse sensor set, including several cameras, Radars, and LiDARs, ensuring complete 360° coverage of near and far regions. Unlike Radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable spatial reasoning about other agents and structures for optimal path planning. The 3D space is typically simplified to the bird's-eye-view (BEV) space by omitting the less relevant Z-coordinate, which corresponds to the height dimension.

</aside>

The most basic approach for obtaining BEV from camera data is Inverse Perspective Mapping (IPM) - it assumes a flat ground plane, which is too simplistic and severely distorts anything above the ground, such as vehicles and pedestrians.
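A minimal sketch of the IPM idea, under the flat-ground assumption: for a pinhole camera with known intrinsics and pose, points on the ground plane (Z = 0) relate to image pixels through a 3×3 homography, whose inverse maps pixels back to ground coordinates. The intrinsics, pitch, and height values below are illustrative, not from any specific system.

```python
import numpy as np

def ground_to_image_homography(K, pitch, height):
    """H maps homogeneous ground-plane coords (X, Y, 1) to pixels.

    World frame: X right, Y forward, Z up; ground plane is Z = 0.
    The camera sits at `height` metres, pitched down by `pitch` radians.
    """
    s, c = np.sin(pitch), np.cos(pitch)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, -s, -c],
                  [0.0, c, -s]])                 # world -> camera rotation
    t = -R @ np.array([0.0, 0.0, height])        # world -> camera translation
    # For Z = 0 points only the first two columns of R matter.
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

# Illustrative calibration (assumed values).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
H = ground_to_image_homography(K, pitch=np.deg2rad(10.0), height=1.5)
H_inv = np.linalg.inv(H)          # pixel -> ground plane: the IPM step

# Map an image pixel back to a ground-plane point. This is only valid if
# the pixel really shows the road surface -- hence the severe distortion
# of raised objects like cars and pedestrians.
u, v = 320.0, 400.0
g = H_inv @ np.array([u, v, 1.0])
X, Y = g[0] / g[2], g[1] / g[2]   # metres right of / ahead of the camera
```

A full IPM warps every BEV grid cell through `H` and samples the camera image; the snippet shows only the per-pixel mapping at its core.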

Advantages of BEV : well suited for sensor fusion and path planning

Advantages of PV : well suited for segmentation and tracking

‼️ There is a disconnect, though: generating BEV requires depth estimation, which cannot be obtained directly from the PV image.
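To make the disconnect concrete, here is a minimal sketch of why depth is the missing ingredient: once a per-pixel depth estimate is available (from a network, stereo, etc.), a PV pixel can be back-projected into 3D and flattened into BEV by dropping the height axis. The intrinsics and the `lift_to_bev` helper are illustrative assumptions.

```python
import numpy as np

# Illustrative pinhole intrinsics (assumed values).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

def lift_to_bev(u, v, depth):
    """Back-project a pixel to camera coords; keep (x, z) as BEV coords."""
    ray = K_inv @ np.array([u, v, 1.0])   # viewing ray through the pixel
    xyz = ray * depth                     # scale by the estimated depth
    return xyz[0], xyz[2]                 # lateral x, forward z (drop height y)

# Without `depth` the pixel only constrains the ray, not the point on it.
x, z = lift_to_bev(480.0, 240.0, 10.0)
# x = (480 - 320) / 800 * 10 = 2.0 m to the right, z = 10.0 m ahead
```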

Thus BEV fusion becomes the main challenge: how should the information from cameras and LiDAR be combined to generate the BEV representation?
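For the LiDAR side of that fusion, the conversion to BEV is straightforward because the points are already 3D: drop the height coordinate and rasterize into a grid. A minimal sketch (grid extents, resolution, and the simple point-count feature are illustrative choices; real systems use richer per-cell features):

```python
import numpy as np

def lidar_to_bev_grid(points, x_range=(-50.0, 50.0), y_range=(0.0, 100.0),
                      resolution=0.5):
    """Rasterize LiDAR points of shape (N, 3) into a BEV occupancy grid.

    The height (z) coordinate is dropped; each cell simply counts the
    points that fall inside it.
    """
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    grid = np.zeros((ny, nx), dtype=np.int32)
    ix = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    iy = ((points[:, 1] - y_range[0]) / resolution).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    np.add.at(grid, (iy[keep], ix[keep]), 1)   # accumulate duplicates too
    return grid

pts = np.array([[0.0, 10.0, -1.5],    # 10 m ahead, on the ground
                [0.2, 10.1, 1.0],     # same cell, higher up (height ignored)
                [-20.0, 40.0, 0.0]])
grid = lidar_to_bev_grid(pts)
# cell for x = 0, y = 10: ix = 100, iy = 20 -> grid[20, 100] == 2
```

Camera features lifted into the same grid (via depth or learned view transforms) can then be fused cell-by-cell with such a LiDAR grid, which is the design question the techniques below address.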

Techniques

Perception Tasks

Network Architectures