Large Language Model inference is heavily limited by memory bandwidth, with frequent accesses to the key-value (KV) cache dominating data movement. Although attention sparsity can reduce some memory traffic, the entire KV cache must remain accessible because the relevance of past tokens changes over time, which keeps pressure high on both bandwidth and capacity. Modern AI hardware now pairs high-bandwidth memory (HBM) with high-speed off-package DRAM such as LPDDR5X over fast interconnects like NVLink, making heterogeneous memory systems a practical answer to these challenges. This research investigates how to dynamically place the KV cache across heterogeneous memory so as to maximize aggregate bandwidth utilization while respecting capacity constraints. Instead of proposing a fixed scheduling policy, the authors mathematically formulate the placement problem and derive a theoretical upper bound, indicating significant headroom for runtime optimization. The approach is novel in that it is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems specifically for LLM inference. The findings highlight the importance of intelligent memory management for alleviating bandwidth bottlenecks and improving inference efficiency. Future work could focus on developing practical scheduling algorithms that approach the theoretical upper bound. This study provides a foundation for optimizing memory usage in AI hardware, potentially enhancing the performance and scalability of LLM inference tasks.
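To make the bandwidth-aggregation idea concrete, here is a minimal sketch (not taken from the paper) of the intuition behind such an upper bound: if KV-cache reads on HBM and off-package DRAM proceed in parallel, total read time is the maximum of the per-channel times, so splitting the cache in proportion to channel bandwidth, capped by HBM capacity, bounds the achievable effective bandwidth. The function name, parameters, and example numbers below are hypothetical placeholders, not the authors' formulation.

```python
def best_split(kv_bytes, hbm_bw, dram_bw, hbm_capacity):
    """Return (hbm_fraction, effective_bandwidth) for reading kv_bytes.

    Assumes reads on both channels overlap fully, so total time is the max
    of the per-channel times; it is minimized by a bandwidth-proportional
    split, capped by the available HBM capacity.
    """
    # Bandwidth-proportional split keeps both channels busy for the same time.
    ideal_fraction = hbm_bw / (hbm_bw + dram_bw)
    hbm_fraction = min(ideal_fraction, hbm_capacity / kv_bytes)

    hbm_time = hbm_fraction * kv_bytes / hbm_bw
    dram_time = (1.0 - hbm_fraction) * kv_bytes / dram_bw
    effective_bw = kv_bytes / max(hbm_time, dram_time)
    return hbm_fraction, effective_bw

# Illustrative (made-up) numbers: 3 TB/s HBM, 0.5 TB/s LPDDR5X, 80 GB HBM
# budget for a 200 GB KV cache.
frac, bw = best_split(kv_bytes=200e9, hbm_bw=3e12, dram_bw=0.5e12,
                      hbm_capacity=80e9)
print(f"Place {frac:.1%} of the KV cache in HBM -> ~{bw / 1e12:.2f} TB/s effective")
```

With these assumed numbers the split is capacity-limited (only 40% of the cache fits in HBM), so the effective bandwidth falls well short of the 3.5 TB/s aggregate; when capacity is not binding, the bandwidth-proportional split reaches the sum of the two channel bandwidths, which is the kind of ceiling a formal upper bound would characterize.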
👉 Read the original: arXiv AI Papers