Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs

Source: arXiv AI Papers

Vision-Language-Action (VLA) models are increasingly recognized for their potential in robotic control. This study evaluates five distinct VLA models, spanning state-of-the-art releases and newer architectures, across both edge devices and datacenter GPUs. Using the LIBERO benchmark, the researchers measure system-level metrics alongside task accuracy, revealing how architectural design choices shape overall performance.
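To make the methodology concrete, the following is a minimal sketch of the kind of harness such an evaluation implies: timing each policy inference so that system-level metrics (latency, throughput) can be reported alongside task success. All names here (`run_policy`, `benchmark`, the episode format) are illustrative assumptions, not the authors' actual code or the LIBERO API.

```python
import time
import statistics

def run_policy(observation):
    # Illustrative stand-in for a VLA model forward pass; a real harness
    # would call the deployed model on the target device here.
    time.sleep(0.001)
    return {"action": [0.0] * 7, "success": True}

def benchmark(episodes, warmup=3):
    """Return (mean latency in s, throughput in inferences/s, success rate)."""
    for obs in episodes[:warmup]:      # warm caches / lazy init before timing
        run_policy(obs)
    latencies, successes = [], 0
    for obs in episodes:
        t0 = time.perf_counter()
        result = run_policy(obs)
        latencies.append(time.perf_counter() - t0)
        successes += result["success"]
    mean_lat = statistics.mean(latencies)
    return mean_lat, 1.0 / mean_lat, successes / len(episodes)

if __name__ == "__main__":
    lat, thr, acc = benchmark([None] * 20)
    print(f"latency={lat*1e3:.2f} ms  throughput={thr:.1f}/s  success={acc:.0%}")
```

Running the same harness on an edge device and a datacenter GPU yields directly comparable latency/throughput numbers for the accuracy-versus-throughput comparisons the paper discusses.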

A notable finding is that power-constrained edge devices show non-linear performance degradation, yet some configurations match or even exceed older datacenter GPUs. This challenges the assumption that high-end hardware is required for robotic inference. Moreover, the results indicate that high throughput need not come at a substantial cost in accuracy, suggesting that careful selection and optimization of VLA models can yield effective solutions under a wide range of operational constraints.

👉 Read the original: arXiv AI Papers