Vision-language models (VLMs) have become central to automated traffic analysis, but they are often computationally expensive and struggle with fine-grained spatio-temporal understanding. The STER-VLM framework addresses these limitations by decomposing captions so that spatial and temporal information are processed separately, enabling more precise interpretation of traffic scenes. It pairs temporal frame selection with best-view filtering to capture sufficient temporal context without excessive computational load, and it uses reference-driven understanding to capture fine-grained motion and dynamic context in traffic environments. Curated visual and textual prompt techniques further enhance the model's semantic richness and interpretative capability.
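As a rough illustration of how these pieces could fit together, the sketch below shows best-view frame selection followed by decomposed spatial/temporal captioning. Everything here is an assumption for illustration: the function names, the `subject_area` quality heuristic, and the prompts are hypothetical stand-ins, not STER-VLM's actual implementation.

```python
# Minimal sketch, assuming a per-frame visibility score and a caller-supplied
# VLM callable. Not the paper's API; illustrative only.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Frame:
    view_id: str          # which camera captured the frame
    timestamp: float      # seconds from clip start
    subject_area: float   # assumed quality cue, e.g. pedestrian bbox area


def best_view_frames(frames: list[Frame], num_frames: int = 8) -> list[Frame]:
    """Best-view filtering: keep the camera whose frames score highest on
    average, then subsample that view evenly over time."""
    by_view: dict[str, list[Frame]] = {}
    for f in frames:
        by_view.setdefault(f.view_id, []).append(f)
    # Assumed heuristic: the view with the largest mean visible-subject area.
    best = max(by_view.values(),
               key=lambda fs: sum(f.subject_area for f in fs) / len(fs))
    best.sort(key=lambda f: f.timestamp)
    step = max(1, len(best) // num_frames)
    return best[::step][:num_frames]


def describe_scene(frames: list[Frame],
                   vlm: Callable[[list[Frame], str], str]) -> dict[str, str]:
    """Decomposed captioning: query static layout and motion with separate
    prompts, returning both parts for downstream merging."""
    selected = best_view_frames(frames)
    spatial = vlm(selected[:1], "Describe the static scene layout.")
    temporal = vlm(selected, "Describe how the pedestrian and vehicle move.")
    return {"spatial": spatial, "temporal": temporal}
```

A real pipeline would derive the quality score from a detector and pass actual image tensors to the model; here `subject_area` simply stands in for that signal so the selection and decomposition logic stay visible.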
Experimental validation on the WTS and BDD datasets shows that STER-VLM substantially improves semantic understanding and traffic scene interpretation over existing methods, and the framework achieved a competitive test score of 55.655 in AI City Challenge 2025 Track 2, highlighting its potential for real-world deployment. By balancing computational efficiency with improved accuracy, STER-VLM offers a promising approach to automated traffic analysis, with downstream benefits for traffic management, safety monitoring, and urban planning. Future work may extend the framework to other domains that require spatio-temporal reasoning and further optimize resource use.