LLM inference optimization
Scaling large language model systems drives up serving costs if memory use and runtime design are ignored.
Stacking transformer layers improves accuracy but also increases memory usage and the cost per call. Decoder-only models process tokens in two stages during inference: a prefill stage that is compute-bound on the GPU and a decoding stage that is memory-bound. The decoding stage typically dominates end-to-end latency.
Batching improves throughput but enlarges the KV cache and overall memory overhead. Therefore, parallelism, cache management, and runtime controls must be used to keep GPUs busy.
This guide describes optimization techniques such as paged attention, I/O-aware attention, quantization, and KV reduction, as well as infrastructure patterns for scaling safely while measuring performance gains.
Quick Take
- Use metrics to prove throughput and latency gains.
- Focus on memory-bound decoding work to reduce real latency.
- Balance batching with the KV cache size to avoid GPU OOMs.
- Use KV quantization and reduction to reduce model memory.
- Use pipeline and tensor parallelism to scale across GPUs.
Prefill vs. Decoding
Prefill is the first phase of inference, during which the model processes the entire input prompt. At this stage, the user text is tokenized and passed through all layers of the transformer. The model computes internal representations of the tokens and generates attention states, which are stored in the key-value (KV) cache. Prefill is computationally intensive because the entire sequence of input tokens is processed at once. However, it is only performed once for each request.
Decoding is the second phase, during which the model generates a response token by token. Using the context and KV cache formed during prefill, the model predicts the next token, adds it to the sequence, and repeats this process until a response is generated or a length limit is reached. Unlike prefill, decoding is performed multiple times. Although each step requires less computation, the overall response latency depends on the decoding speed.
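The two phases can be sketched as a simple generation loop. This is an illustrative sketch, not a real engine API: `model` is a hypothetical callable that returns logits for the last position plus an updated cache, and `toy_model` is a stand-in used only to make the loop runnable.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id):
    # Prefill: the whole prompt is processed in one pass, building the KV cache.
    logits, cache = model(prompt_ids, cache=None)
    out = list(prompt_ids)
    # Decode: one token per step, reusing (and extending) the cache.
    for _ in range(max_new_tokens):
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        out.append(next_id)
        if next_id == eos_id:
            break
        logits, cache = model([next_id], cache=cache)
    return out

# Toy stand-in model: "predicts" the next integer mod 10; its cache is just token history.
def toy_model(ids, cache=None):
    cache = (cache or []) + list(ids)
    logits = [0.0] * 10
    logits[(cache[-1] + 1) % 10] = 1.0
    return logits, cache

print(generate(toy_model, [1, 2], max_new_tokens=3, eos_id=9))  # [1, 2, 3, 4, 5]
```

Note how prefill sees the whole prompt once, while each decode step passes only one token and relies on the cache for everything earlier.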
| Characteristic | Prefill | Decoding |
| --- | --- | --- |
| Main role | Analysis and encoding of the input prompt | Response generation |
| What is processed | All prompt tokens at once | One new token at a time |
| Goal | Building a contextual representation | Predicting the next token |
| Memory usage | Key-value cache is created | Uses the previously created cache |
| Computational load | High for long prompts | Lower per step, but many steps |
| Speed | Depends on the prompt length | Depends on the response length |
| Role in LLM operation | Forms the understanding of context | Produces the response text |
KV cache: memory size and formulas
The KV cache (key-value cache) is a mechanism that stores intermediate key and value representations from the attention mechanism. Thanks to this, the model does not recompute attention states for the entire token sequence at each generation step: it reuses the stored keys and values, which speeds up the decoding phase.
However, the KV cache consumes a lot of memory, especially for large models and long contexts. Therefore, understanding how to estimate its size and control memory usage is essential for stable LLM deployment.
Calculating the total KV cache size
The size of the KV cache depends on several model parameters: the number of transformer layers, the number of attention heads, the dimensionality of each head, the context length, and the numeric representation type. For each token, the model stores keys (K) and values (V) for each layer and each attention head. The total memory size can be estimated with the formula:
KV cache memory ≈ 2 × L × H × D × T × bytes_per_value
where
- L is the number of model layers,
- H is the number of attention heads,
- D is the dimension of one head,
- T is the number of tokens in the context,
- bytes_per_value is the number of bytes per numeric value.
The multiplier of 2 appears because both keys and values are stored. From this formula, it is clear that the size of the KV cache grows linearly with the context length and the number of model layers. In large LLMs with contexts of tens of thousands of tokens, the KV cache can occupy several gigabytes of GPU memory, making it one of the main factors limiting inference scaling.
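Plugging the formula into code makes the scaling concrete. The configuration below is illustrative (a 7B-class model with 32 layers, 32 heads of dimension 128, FP16 values), not taken from any specific model card:

```python
def kv_cache_bytes(layers, heads, head_dim, tokens, bytes_per_value=2):
    """Estimate KV cache size: 2 (keys and values) x L x H x D x T x bytes_per_value."""
    return 2 * layers * heads * head_dim * tokens * bytes_per_value

# Illustrative 7B-class configuration with a 4096-token context and FP16 (2-byte) values.
size = kv_cache_bytes(layers=32, heads=32, head_dim=128, tokens=4096)
print(f"{size / 2**30:.2f} GiB per sequence")  # 2.00 GiB per sequence
```

At 2 GiB per 4K-token sequence, a batch of a few dozen such requests already exceeds the memory of most single GPUs, which is why the controls below matter.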
Controls to avoid OOM
To avoid out-of-memory (OOM) errors and maintain high throughput, combine several memory controls:
- Limit the maximum context length, since it directly affects the cache size.
- Use lower-precision numeric formats, such as FP16 or BF16, which reduce memory usage compared to FP32.
- Use a paged or segmented KV cache to manage memory and share it between multiple requests.
- Control batch sizes and schedule requests to avoid running many long generations at once.
- Use optimized engines that support KV cache reuse, offloading to CPU memory, or cache compression, which help serve longer contexts without overloading the GPU.
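As a minimal sketch of batch control, a scheduler can admit requests only while their combined worst-case KV footprint fits a token budget. The class name and bookkeeping here are hypothetical; production engines such as vLLM additionally page, preempt, and recompute rather than reserving the worst case up front:

```python
from collections import deque

class KVBudgetScheduler:
    """Admit queued requests only while their combined KV footprint fits a token budget."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens  # total KV token budget across the running batch
        self.in_flight = 0            # tokens currently reserved by running requests
        self.queue = deque()

    def submit(self, request_id, prompt_len, max_new_tokens):
        # Reserve the worst case: the prompt plus the full generation budget.
        self.queue.append((request_id, prompt_len + max_new_tokens))

    def admit(self):
        """Move queued requests into the running batch while the budget allows."""
        admitted = []
        while self.queue and self.in_flight + self.queue[0][1] <= self.max_tokens:
            request_id, tokens = self.queue.popleft()
            self.in_flight += tokens
            admitted.append(request_id)
        return admitted

    def finish(self, tokens):
        # A request completed; release its reserved tokens.
        self.in_flight -= tokens
```

For example, with a 1000-token budget, two requests reserving 500 and 400 tokens are admitted together, while a third reserving 300 waits until one of them finishes.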
Fusion and reordering for speed: FlashAttention in practice
FlashAttention is an optimized algorithm for computing self-attention in transformer models that reduces memory usage and speeds up computation. The standard implementation of attention materializes large intermediate matrices during the calculation. With long contexts, these matrices can strain GPU memory and cause slowdowns or even out-of-memory (OOM) errors.
FlashAttention solves this problem: instead of storing large intermediate matrices, the algorithm processes data in blocks (tiles) in fast on-chip GPU memory (SRAM). This reduces the number of global memory accesses and makes better use of GPU memory bandwidth.
A key property of FlashAttention is that it computes exact attention, not an approximation. The algorithm fuses several stages:
- calculation of the product of queries and keys;
- application of softmax;
- multiplication by values in a single optimized pass.
This avoids storing the full attention matrix in memory.
FlashAttention is useful in scenarios where long contexts or large batch sizes are used. It is used during LLM inference and training to increase GPU throughput and reduce processing latency.
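The blockwise idea can be illustrated outside the fused GPU kernel with an online-softmax computation in NumPy: each key/value tile updates running maxima and normalizers, so the full attention matrix is never materialized. This is a sketch of the algorithm under illustrative shapes, not the actual kernel:

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Exact attention computed tile by tile over keys/values, keeping only
    running softmax statistics instead of the full (queries x keys) score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)  # running max score per query (for stability)
    l = np.zeros(q.shape[0])          # running softmax denominator per query
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                      # scores for this key tile only
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)              # rescale previously accumulated sums
        p = np.exp(s - m_new[:, None])
        out = out * correction[:, None] + p @ vb
        l = l * correction + p.sum(axis=-1)
        m = m_new
    return out / l[:, None]
```

The result matches a naive softmax-attention computation, but peak memory per tile is proportional to the block size rather than to the full sequence length.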
Quantization for lower weights and faster inference
Quantization is a method for optimizing large language models that reduces the size of weights and speeds up inference by using lower-precision numbers. By default, model weights are stored in FP16 or FP32 format, which requires significant memory and computational resources. During quantization, these values are converted to more compact representations, such as 8- or 4-bit integers, while preserving most of the model parameter information.
The main advantage of quantization is that the model can be deployed on resource-constrained hardware without significant loss of quality. This is important for running LLM locally, inferring on edge devices, or processing a large number of requests in cloud services.
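As a minimal sketch of the idea, symmetric per-tensor INT8 quantization maps weights in [-max|w|, max|w|] onto [-127, 127] with a single scale factor. Real systems typically use per-channel or group-wise scales and calibrated ranges; the helper names here are illustrative:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate FP32 tensor from INT8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(8, 8).astype(np.float32)
q, scale = quantize_int8(w)
# INT8 storage is 1 byte per weight vs. 2 for FP16: roughly a 50% memory saving.
max_error = np.abs(dequantize(q, scale) - w).max()
```

The reconstruction error per weight is bounded by about half the scale step, which is why moderate bit widths preserve most of the model's behavior.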
8-bit vs. 4-bit
| Feature | 8-bit quantization (INT8) | 4-bit quantization (INT4) |
| --- | --- | --- |
| Memory savings | ~50% reduction vs. FP16 | ~75% reduction vs. FP16 |
| Inference speed | Faster than FP16 | Fastest, especially on low-resource devices |
| Accuracy | Minimal loss | Some quality degradation possible |
| Use case | Production systems, stable performance | Edge devices, experiments, very large models |
| Compatibility | Widely supported on modern GPUs | Limited support; may need specialized kernels |
| Trade-off | Balanced speed, memory, and accuracy | Maximizes efficiency at the cost of some precision |
Model service strategies that maximize throughput
In today's query processing systems, maximizing throughput is a critical performance factor. Throughput determines how many requests a system can process per unit of time, and without the right serving strategies in place, it can quickly become a bottleneck.
Service strategies
| Strategy | How it increases throughput |
| --- | --- |
| KV cache optimization | Reuses computed token representations to avoid redundant calculations during decoding |
| Model quantization (INT8/INT4) | Reduces memory usage and speeds up computation while maintaining acceptable accuracy |
| Batch processing & parallelism | Combines multiple requests for simultaneous processing and leverages multiple GPUs/threads |
| FlashAttention | Computes attention efficiently in blocks, reducing memory footprint and speeding up calculations |
| Dynamic memory management & offloading | Moves data between GPU and CPU as needed, preventing OOM while maintaining performance |
| Load monitoring & balancing | Prioritizes long or "hot" requests and distributes load across GPUs/nodes |
| System-level caching & precomputation | Stores frequent query results for instant retrieval, avoiding repeated computation |
Implementing these strategies allows systems to use resources efficiently, reduce processing latency, and ensure stable operation under high load. This is important for deploying LLMs in production environments or services with many concurrent users.
FAQ
What is the difference between prefill and decode phases?
Prefilling forms a context by processing all incoming tokens at once, and decoding generates a response one token at a time using that context.
How do you calculate the KV cache size for a model?
KV cache size is estimated as 2 × number of layers × number of heads × head dimension × number of tokens × bytes per value (the factor of 2 accounts for keys and values).
What techniques help handle long contexts without OOM?
Using FlashAttention, quantization, dynamic memory offloading, and context length limits helps handle long contexts without OOM.
What is paged attention and why use it?
Paged attention is a method of calculating attention in blocks to reduce memory usage for long sequences without losing accuracy.
How does the implementation of attention affect decoding throughput?
Optimized attention implementations such as FlashAttention increase decoding throughput by reducing memory traffic and speeding up computation per token.
When should you use FlashAttention?
FlashAttention should be used when long contexts or large batches make standard attention slow or memory expensive.
How do you choose between 8-bit and 4-bit quantization?
The choice between 8-bit and 4-bit quantization depends on the trade-off between model accuracy, memory savings, and inference speed.