kv-cache - In-Depth Technical Topic Overview

LMCache / LMCache

LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

fast amd cuda inference pytorch speed rocm kv-cache llm vllm

Updated Jun 5, 2026
Python

HDT3213 / godis

A Redis server and distributed cluster implemented in Go.

go redis golang cluster redis-server redis-cluster godis kv-cache

Updated Sep 14, 2025
Go

Zefan-Cai / KVCache-Factory

Unified KV Cache Compression Methods for Auto-Regressive Models

kv-cache llm kv-cache-compression

Updated Jan 4, 2025
Python

NVIDIA / kvpress

LLM KV cache compression made easy

python transformers inference pytorch kv-cache large-language-models llm long-context kv-cache-compression

Updated Jun 4, 2026
Python

harleyszhang / llm_note

LLM notes covering model inference, transformer model structure, and LLM framework code analysis.

cuda-programming transformer-models kv-cache llm vllm llm-inference triton-kernels

Updated May 10, 2026
Python

therealoliver / Deepdive-llama3-from-scratch

Implement Llama 3 inference step by step: grasp the core concepts, work through the derivations, and write the code.

Updated Feb 24, 2025
Jupyter Notebook

raymin0223 / mixture_of_recursions

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)

router early-exiting adaptive-computation kv-cache llm recursive-transformers

Updated Sep 26, 2025
Python

Anbeeld / beellama.cpp

DFlash & TurboQuant in llama.cpp with up to 3x faster generation and 7.5x more KV cache in same VRAM

inference quantization kv-cache llm llm-serving llama-cpp ggml llm-inference speculative-decoding dflash turboquant

Updated Jun 5, 2026
C++

FMInference / H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

sparsity high-throughput heavy-hitters kv-cache gpt-3 large-language-models

Updated Aug 1, 2024
Python

Zefan-Cai / Awesome-LLM-KV-Cache

Awesome-LLM-KV-Cache: A curated list of 📙 awesome LLM KV cache papers with code.

kv-cache llm kv-cache-quantization kv-cache-compression

Updated Mar 3, 2025

thu-nics / C2C

[ICLR'26] The official code implementation for "Cache-to-Cache: Direct Semantic Communication Between Large Language Models"

multi-agent kv-cache llm

Updated Mar 13, 2026
Python

quantumaikr / quant.cpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

embeddable transformer pure-c quantization delta-compression kv-cache llm llm-inference gguf turboquant

Updated Apr 26, 2026
C

Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.

metal optimization quantization m2 m3 m1 memory-optimization kv-cache apple-silicon llm generative-ai llama-cpp

Updated May 21, 2025
Python

jjiantong / Awesome-KV-Cache-Optimization

[ACL 2026] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization

machine-learning ai system computer-architecture neural-language-processing mlsys kv-cache serving-ml llm llm-serving llm-inference

Updated Apr 21, 2026
Python

psmarter / mini-infer

LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving

machine-learning cuda inference pytorch transformer triton moe quantization language-model inference-engine kv-cache tensor-parallelism llm speculative-decoding pagedattention continuous-batching

Updated Apr 24, 2026
Python

QuanjianSong / FashionChameleon

Official PyTorch code for the paper "FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization"

streaming real-time fashion interactive rewards dmd sft teacher-forcing kv-cache video-diffusion-models video-customization garment-switch self-forcing

Updated May 31, 2026
Python

huawei-csl / KVarN

KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

quantization kv-cache llm long-context vllm llm-inference agentic-ai

Updated Jun 4, 2026
Python

NVIDIA-Merlin / HierarchicalKV

HierarchicalKV is part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. Its key capability is storing key-value feature embeddings in the high-bandwidth memory (HBM) of GPUs and in host memory. It can also be used as generic key-value storage.

gpu cuda recommender-system hashtable key-value-store kv-cache dynamic-embedding embedding-storage

Updated May 22, 2026
Cuda

Dynamis-Labs / spectralquant

SpectralQuant: Calibrated Eigenbasis Rotation and Water-Filled Bit Allocation for KV-Cache Compression

machine-learning compression pytorch transformer quantization research-paper spectral-analysis kv-cache large-language-models llm-inference

Updated May 15, 2026
Python

alibaba / tair-kvcache

Alibaba Cloud's high-performance KVCache system for LLM inference, with components for global cache management, inference simulation (HiSim), and more.

simulator kv-cache llm kvcache hisim

Updated Jun 5, 2026
C++

Here are 349 public repositories matching this topic...