Recent progress in multimodal Parameter-Efficient Fine-Tuning (PEFT) has improved downstream task performance, particularly in few-shot retrieval. However, existing methods often optimize for task-specific gains without addressing the structural alignment of multimodal embedding spaces, leaving modality-specific representations isolated and limiting cross-modal generalization. SPANER addresses this limitation with a shared prompt mechanism that acts as a conceptual anchor, pulling semantically related inputs from different modalities toward the same region of a unified embedding space. The design is modality-agnostic and extensible: new modalities such as audio can be incorporated without modifying the core architecture.

Extensive experiments on vision-language and audio-visual benchmarks show that SPANER achieves competitive few-shot retrieval results while maintaining high semantic coherence across modalities. Its emphasis on aligning embedding structure, rather than solely tuning adapter weights, marks a shift toward scalable multimodal learning: a shared semantic space supports better cross-modal understanding and retrieval in applications that must integrate diverse data types, and it points to a way of extending multimodal systems with minimal architectural changes.
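To make the shared-prompt idea more concrete, here is a minimal PyTorch sketch of how a single bank of learnable prompt tokens could be reused in front of every frozen modality encoder and paired with a contrastive alignment objective. The class name `SharedPromptAnchor`, the dimensions, and the InfoNCE-style loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedPromptAnchor(nn.Module):
    """Illustrative sketch: one set of learnable prompt tokens shared by all
    modality encoders, so every modality is conditioned on the same
    'conceptual anchor' before projection into a joint embedding space.
    Hypothetical names and shapes; not the actual SPANER code."""

    def __init__(self, num_prompts: int = 8, dim: int = 512):
        super().__init__()
        # A single prompt bank, reused verbatim for vision, text, audio, ...
        self.shared_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) produced by a frozen modality encoder
        batch = tokens.shape[0]
        prompts = self.shared_prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the shared prompts; lightweight tuned layers downstream can
        # then attend over [prompts; tokens].
        return torch.cat([prompts, tokens], dim=1)


def alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between pooled embeddings of two modalities,
    pulling paired inputs toward the same region of the shared space."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.shape[0], device=z_a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Under this reading, the prompt bank is the only piece of state shared across modalities, so adding a new modality such as audio would amount to placing the same `SharedPromptAnchor` in front of the new encoder and including its embeddings in the alignment loss, without touching the rest of the architecture.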
👉 Read the original: arXiv AI Papers