Automatic Speech Recognition (ASR) systems typically struggle to preserve syntactic and semantic integrity when transcribing long audio segments, which hurts downstream tasks such as Named Entity Recognition (NER), capitalization, and punctuation. To address this, the authors introduce a method that distills contextual knowledge from a large language model (LLaMA) into the Whisper ASR system. The approach combines two strategies: token-level distillation that uses optimal transport to align the differing sequence lengths and hidden dimensions of the two models, and a representation loss that pulls Whisper's sentence embeddings toward LLaMA's, blending syntactic and semantic information into the acoustic model (a rough sketch of these losses appears below).

The method is evaluated on the Spoken Wikipedia dataset, chosen for its long recordings and rich entity content, and yields notable reductions in Word Error Rate (WER) along with gains in NER accuracy, capitalization, and punctuation. The authors also propose new NER metrics tailored to this setting, underscoring the importance of semantic awareness in ASR evaluation.

The work highlights the value of integrating linguistic context into speech transcription and suggests that future ASR models can benefit from leveraging large language models to improve transcription quality, especially for long-form speech. Such advances could benefit automated transcription services, voice assistants, and information extraction from speech, and the proposed metrics lay groundwork for further exploration of semantics-aware ASR.
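To make the two losses concrete, here is a minimal PyTorch sketch under stated assumptions: the tensor names (`whisper_hidden`, `llama_hidden`), the linear projection, the cosine cost, and the entropic Sinkhorn solver are illustrative choices, not the paper's confirmed implementation.

```python
# Minimal sketch of the two distillation losses described above.
# Assumptions: PyTorch, hypothetical tensor names, cosine cost, Sinkhorn OT.
import torch
import torch.nn.functional as F


def sinkhorn_plan(cost, n_iters=50, eps=0.1):
    """Entropic-regularized optimal transport plan via Sinkhorn iterations."""
    K = torch.exp(-cost / eps)                       # (T_w, T_l) kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    b = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v + 1e-9)
        v = b / (K.T @ u + 1e-9)
    return u[:, None] * K * v[None, :]               # transport plan (T_w, T_l)


def distillation_loss(whisper_hidden, llama_hidden, proj, alpha=1.0, beta=1.0):
    """Token-level OT distillation plus a sentence-embedding representation loss.

    whisper_hidden: (T_w, d_w) decoder states from Whisper (student)
    llama_hidden:   (T_l, d_l) hidden states from LLaMA (frozen teacher)
    proj:           nn.Linear(d_w, d_l) mapping Whisper dims onto LLaMA dims
    """
    student = proj(whisper_hidden)                   # (T_w, d_l)

    # Token-level loss: cosine-distance cost matrix; the OT plan aligns the
    # two sequences even though their lengths differ.
    cost = 1.0 - F.cosine_similarity(
        student[:, None, :], llama_hidden[None, :, :], dim=-1)
    plan = sinkhorn_plan(cost.detach())              # alignment, no gradient
    token_loss = (plan * cost).sum()

    # Sentence-level loss: mean-pooled embeddings pushed together.
    sent_loss = 1.0 - F.cosine_similarity(
        student.mean(dim=0), llama_hidden.mean(dim=0), dim=0)

    return alpha * token_loss + beta * sent_loss
```

One plausible way to use this: `whisper_hidden` comes from the Whisper decoder and `llama_hidden` from a frozen LLaMA pass over the reference transcript, with the combined loss added to the usual cross-entropy training objective.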
👉 Read the original: arXiv AI Papers