Abstract
Surrounding-vehicle trajectory prediction is a crucial component of autonomous driving. Current trajectory prediction research relies primarily on publicly available datasets that have already been processed by perception pipelines, rather than on raw sensor perception information. With the growing emphasis on visual perception, integrating visual perception with trajectory prediction will be essential for the practical deployment of prediction algorithms. This paper proposes a multimodal vehicle trajectory prediction model based on visual perception information (VP-MTP). First, a vehicle detection network obtains the position coordinates of vehicles in consecutive bird’s eye view (BEV) frames. The discrete position coordinates are then assembled into complete vehicle historical trajectories by a processing block comprising affine coordinate transformation, vehicle tracking, and trajectory smoothing (ATS). To address the high computational complexity of the standard Transformer, the input sequence is decomposed along the time dimension. In addition, layer normalization positions are adjusted, convolutional feed-forward layers are introduced, and hierarchical encoding is employed to improve feature extraction capability and encoding efficiency, yielding a hierarchical Transformer encoder based on convolutional feed-forward layers with time-decomposition attention (HT-CTA). Because clustering-based multimodal training strategies require substantial manual effort and adapt poorly to complex scenarios, learnable anchor embedding features are introduced as model parameters to construct a multimodal trajectory decoder.
Finally, experiments on the Waymo motion and nuScenes datasets demonstrate that, compared with existing baseline models, VP-MTP achieves average improvements of 12.4% and 9.9% in minimum Average Displacement Error (minADE) and minimum Final Displacement Error (minFDE) on the Waymo dataset, and 9.3% and 10.0% on the nuScenes dataset, respectively. The model thus delivers higher prediction accuracy and strong multimodality, enabling multimodal trajectory prediction directly from raw visual perception information.
