Paper review
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
오서영
2025. 11. 17. 13:47
Original paper: https://nips.cc/virtual/2024/poster/96936
Abstract (excerpt): "LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach..."
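To see why the KV cache dominates memory at long context, here is a minimal back-of-the-envelope sketch. It assumes hypothetical LLaMA-7B-like dimensions (32 layers, 32 heads, head dim 128) and compares an fp16 cache against a 2-bit quantized cache; all numbers are illustrative assumptions, not figures from the paper.

```python
# Rough KV cache size vs. model weights for a hypothetical LLaMA-7B-like config.
# fp16 baseline vs. a 2-bit quantized cache (illustrative only).

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_elem=2.0):
    # Keys and values: 2 tensors per layer, each of shape [n_heads, seq_len, head_dim].
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3
weights_fp16_gib = 7e9 * 2 / GIB  # ~7B parameters at 2 bytes each

for ctx in (4_096, 128_000, 1_000_000, 10_000_000):
    fp16_gib = kv_cache_bytes(ctx) / GIB
    int2_gib = kv_cache_bytes(ctx, bytes_per_elem=0.25) / GIB  # 2 bits = 0.25 B/elem
    print(f"ctx={ctx:>10,}  KV fp16 ≈ {fp16_gib:8.1f} GiB  "
          f"KV 2-bit ≈ {int2_gib:8.1f} GiB  (weights fp16 ≈ {weights_fp16_gib:.1f} GiB)")
```

Under these assumptions the fp16 KV cache already exceeds the model weights past roughly 32K tokens and grows linearly with context length, which is the motivation for pushing the cache itself to very low bit widths.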
Presentation materials: