The key-value (KV) cache in large language models (LLMs) now demands substantial memory capacity, as its size grows proportionally with context length. Compute Express Link (CXL) memory has recently emerged as a promising way to secure memory capacity. However, deploying CXL memory in a GPU-based LLM inference platform raises performance and scalability challenges due to its limited bandwidth. This paper proposes OASIS, an outlier-aware KV cache clustering scheme for scaling LLM inference in CXL memory systems. Our method is based on the observation that, when made aware of outliers, clustering trades off performance against accuracy more effectively than previous quantization- or selection-based approaches. Our evaluation shows that OASIS yields a 3.6× speedup over the case without clustering while preserving accuracy with just 5% of the full KV cache.
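To make the general idea concrete, below is a minimal Python sketch of outlier-aware KV cache clustering. This is not the paper's actual algorithm: the function names (cluster_kv_cache, approx_attention), the magnitude-based outlier test, the plain k-means step, and all parameter values are illustrative assumptions. The intent it illustrates is that a small fraction of outlier entries is kept exactly while the remaining entries are summarized by cluster centroids, so each attention step reads far fewer vectors from slow CXL memory.

    import numpy as np

    def cluster_kv_cache(keys, values, n_clusters=32, outlier_frac=0.05,
                         n_iters=10, seed=0):
        """Split KV entries into outliers (kept exactly) and inliers
        (summarized by k-means centroids). All heuristics here are
        illustrative assumptions, not the paper's method."""
        rng = np.random.default_rng(seed)
        # Treat the largest-magnitude keys as outliers and keep them as-is,
        # since such entries tend to dominate attention scores.
        norms = np.linalg.norm(keys, axis=1)
        n_out = max(1, int(outlier_frac * len(keys)))
        out_idx = np.argsort(norms)[-n_out:]
        in_mask = np.ones(len(keys), dtype=bool)
        in_mask[out_idx] = False
        k_in, v_in = keys[in_mask], values[in_mask]

        # Plain Lloyd's k-means over the inlier keys.
        centroids = k_in[rng.choice(len(k_in), n_clusters, replace=False)].copy()
        for _ in range(n_iters):
            dists = ((k_in[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            for c in range(n_clusters):
                members = assign == c
                if members.any():
                    centroids[c] = k_in[members].mean(axis=0)

        # One representative value per non-empty cluster.
        sizes = np.bincount(assign, minlength=n_clusters)
        keep = sizes > 0
        v_cent = np.stack([v_in[assign == c].mean(axis=0)
                           for c in range(n_clusters) if keep[c]])
        return (centroids[keep], v_cent, sizes[keep]), \
               (keys[out_idx], values[out_idx])

    def approx_attention(query, clustered, outliers):
        """Attention over centroids (weighted by cluster size) plus exact
        outlier entries, instead of the full KV cache."""
        kc, vc, sizes = clustered
        ko, vo = outliers
        k = np.concatenate([kc, ko])
        v = np.concatenate([vc, vo])
        # Each centroid stands in for `size` original tokens in the softmax.
        counts = np.concatenate([sizes, np.ones(len(ko))])
        scores = k @ query / np.sqrt(len(query))
        w = counts * np.exp(scores - scores.max())
        w /= w.sum()
        return w @ v

In a hypothetical deployment, the cache would be clustered once per layer and each decode step would call approx_attention; weighting each centroid by its cluster size in the softmax keeps the approximation closer to full attention than treating a centroid as a single token.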
This work was supported in part by the Korea Collaborative & High-tech Initiative for Prospective Semiconductor Research (K-CHIPS) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Grant RS-2025-02305531 and in part by the Ministry of Science and ICT (MSIT) through the Information Technology Research Center (ITRC) Support Program under Grant IITP-2025-2020-0-01461 supervised by the IITP.