LLM inference optimization
Scaling large language model systems drives up serving costs if memory use and runtime design are ignored.
Stacking transformer layers improves accuracy but also increases memory usage and the cost per call. Decoder-only models process tokens in two stages during inference: a prefill stage that is compute-bound on the GPU and a decoding stage that is memory-bound. The decoding stage typically dominates end-to-end latency.
Batching improves throughput but enlarges the KV cache and overall memory overhead. Therefore, parallelism, cache management, and runtime controls must be used to keep GPUs busy.
This guide describes optimization techniques such as paged attention, I/O-aware attention, quantization, and KV reduction, as well as infrastructure patterns for scaling safely while measuring performance gains.
Quick Take
- Use metrics to prove throughput and latency gains.
- Focus on memory-bound decoding work to reduce real latency.
- Balance batching with the KV cache size to avoid GPU OOMs.
- Use KV quantization and reduction to reduce model memory.
- Use pipeline and tensor parallelism to scale across GPUs.
Prefill vs. Decoding
Prefill is the first phase of inference, during which the model processes the entire input prompt. At this stage, the user text is tokenized and passed through all layers of the transformer. The model computes internal representations of the tokens and generates attention states, which are stored in the key-value (KV) cache. Prefill is computationally intensive because the entire sequence of input tokens is processed at once. However, it is only performed once for each request.
Decoding is the second phase, during which the model generates a response token by token. Using the context and KV cache formed during prefill, the model predicts the next token, adds it to the sequence, and repeats this process until a response is generated or a length limit is reached. Unlike prefill, decoding is performed multiple times. Although each step requires less computation, the overall response latency depends on the decoding speed.
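The two phases can be sketched as a simple generation loop. This is an illustrative sketch, not a real engine API: `model` is a hypothetical callable that returns logits for the last position plus an updated cache, and `toy_model` is a stand-in used only to make the loop runnable.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id):
    # Prefill: the whole prompt is processed in one pass, building the KV cache.
    logits, cache = model(prompt_ids, cache=None)
    out = list(prompt_ids)
    # Decode: one token per step, reusing (and extending) the cache.
    for _ in range(max_new_tokens):
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        out.append(next_id)
        if next_id == eos_id:
            break
        logits, cache = model([next_id], cache=cache)
    return out

# Toy stand-in model: "predicts" the next integer mod 10; its cache is just token history.
def toy_model(ids, cache=None):
    cache = (cache or []) + list(ids)
    logits = [0.0] * 10
    logits[(cache[-1] + 1) % 10] = 1.0
    return logits, cache

print(generate(toy_model, [1, 2], max_new_tokens=3, eos_id=9))  # [1, 2, 3, 4, 5]
```

Note how prefill sees the whole prompt once, while each decode step passes only one token and relies on the cache for everything earlier.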
| Characteristic | Prefill | Decoding |
| --- | --- | --- |
| Main role | Analysis and encoding of the input prompt | Response generation |
| What is processed | All prompt tokens at once | One new token at a time |
| Goal | Building a contextual representation | Predicting the next token |
| Memory usage | Key-value cache is created | Uses the previously created cache |
| Computational load | High for long prompts | Lower per step, but many steps |
| Speed | Depends on the prompt length | Depends on the response length |
| Role in LLM operation | Forms the understanding of context | Produces the response text |
KV cache: memory size and formulas
The KV cache (key-value cache) is a mechanism that stores intermediate key and value representations from the attention mechanism. Thanks to this, the model does not recompute attention states for the entire token sequence at each generation step: it reuses the stored keys and values, which speeds up the decoding phase.
However, the KV cache consumes a lot of memory, especially for large models and long contexts. Therefore, understanding how to estimate its size and control memory usage is essential for stable LLM deployment.
Calculating the total KV cache size
The size of the KV cache depends on several model parameters: the number of transformer layers, the number of attention heads, the dimensionality of each head, the context length, and the numeric representation type. For each token, the model stores keys (K) and values (V) for each layer and each attention head. The total memory size can be estimated with the formula:
KV cache memory ≈ 2 × L × H × D × T × bytes_per_value
where
- L is the number of model layers,
- H is the number of attention heads,
- D is the dimension of one head,
- T is the number of tokens in the context,
- bytes_per_value is the number of bytes per numeric value.
The multiplier of 2 appears because both keys and values are stored. From this formula, it is clear that the size of the KV cache grows linearly with the context length and the number of model layers. In large LLMs with contexts of tens of thousands of tokens, the KV cache can occupy several gigabytes of GPU memory, making it one of the main factors limiting inference scaling.
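Plugging the formula into code makes the scaling concrete. The configuration below is illustrative (a 7B-class model with 32 layers, 32 heads of dimension 128, FP16 values), not taken from any specific model card:

```python
def kv_cache_bytes(layers, heads, head_dim, tokens, bytes_per_value=2):
    """Estimate KV cache size: 2 (keys and values) x L x H x D x T x bytes_per_value."""
    return 2 * layers * heads * head_dim * tokens * bytes_per_value

# Illustrative 7B-class configuration with a 4096-token context and FP16 (2-byte) values.
size = kv_cache_bytes(layers=32, heads=32, head_dim=128, tokens=4096)
print(f"{size / 2**30:.2f} GiB per sequence")  # 2.00 GiB per sequence
```

At 2 GiB per 4K-token sequence, a batch of a few dozen such requests already exceeds the memory of most single GPUs, which is why the controls below matter.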
Controls to avoid OOM
To avoid out-of-memory (OOM) errors and maintain high throughput, combine several memory controls:
- Limit the maximum context length, since it directly affects the cache size.
- Use lower-precision numeric formats, such as FP16 or BF16, which reduce memory usage compared to FP32.
- Use a paged or segmented KV cache to manage memory and share it between multiple requests.
- Control batch sizes and schedule requests to avoid running many long generations at once.
- Use optimized engines that support KV cache reuse, offloading to CPU memory, or cache compression, which help serve longer contexts without overloading the GPU.
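As a minimal sketch of batch control, a scheduler can admit requests only while their combined worst-case KV footprint fits a token budget. The class name and bookkeeping here are hypothetical; production engines such as vLLM additionally page, preempt, and recompute rather than reserving the worst case up front:

```python
from collections import deque

class KVBudgetScheduler:
    """Admit queued requests only while their combined KV footprint fits a token budget."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens  # total KV token budget across the running batch
        self.in_flight = 0            # tokens currently reserved by running requests
        self.queue = deque()

    def submit(self, request_id, prompt_len, max_new_tokens):
        # Reserve the worst case: the prompt plus the full generation budget.
        self.queue.append((request_id, prompt_len + max_new_tokens))

    def admit(self):
        """Move queued requests into the running batch while the budget allows."""
        admitted = []
        while self.queue and self.in_flight + self.queue[0][1] <= self.max_tokens:
            request_id, tokens = self.queue.popleft()
            self.in_flight += tokens
            admitted.append(request_id)
        return admitted

    def finish(self, tokens):
        # A request completed; release its reserved tokens.
        self.in_flight -= tokens
```

For example, with a 1000-token budget, two requests reserving 500 and 400 tokens are admitted together, while a third reserving 300 waits until one of them finishes.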
Fusion and reordering for speed: FlashAttention in practice
FlashAttention is an optimized algorithm for computing self-attention in transformer models that reduces memory usage and speeds up computation. The standard implementation of attention materializes large intermediate matrices during the calculation. With long contexts, these matrices can strain GPU memory and cause slowdowns or even out-of-memory (OOM) errors.
FlashAttention solves this problem: instead of storing large intermediate matrices, the algorithm processes data in blocks (tiles) in fast on-chip GPU memory (SRAM). This reduces the number of global memory accesses and makes better use of GPU memory bandwidth.
A key property of FlashAttention is that it computes exact attention, not an approximation. The algorithm fuses several stages:
- calculation of the product of queries and keys;
- application of softmax;
- multiplication by values in a single optimized pass.
This avoids storing the full attention matrix in memory.
FlashAttention is useful in scenarios where long contexts or large batch sizes are used. It is used during LLM inference and training to increase GPU throughput and reduce processing latency.
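The blockwise idea can be illustrated outside the fused GPU kernel with an online-softmax computation in NumPy: each key/value tile updates running maxima and normalizers, so the full attention matrix is never materialized. This is a sketch of the algorithm under illustrative shapes, not the actual kernel:

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Exact attention computed tile by tile over keys/values, keeping only
    running softmax statistics instead of the full (queries x keys) score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)  # running max score per query (for stability)
    l = np.zeros(q.shape[0])          # running softmax denominator per query
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                      # scores for this key tile only
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)              # rescale previously accumulated sums
        p = np.exp(s - m_new[:, None])
        out = out * correction[:, None] + p @ vb
        l = l * correction + p.sum(axis=-1)
        m = m_new
    return out / l[:, None]
```

The result matches a naive softmax-attention computation, but peak memory per tile is proportional to the block size rather than to the full sequence length.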
Quantization for lower weights and faster inference
Quantization is a method for optimizing large language models that reduces the size of weights and speeds up inference by using lower-precision numbers. By default, model weights are stored in FP16 or FP32 format, which requires significant memory and computational resources. During quantization, these values are converted to more compact representations, such as 8- or 4-bit integers, while preserving most of the model parameter information.
The main advantage of quantization is that the model can be deployed on resource-constrained hardware without significant loss of quality. This is important for running LLM locally, inferring on edge devices, or processing a large number of requests in cloud services.
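As a minimal sketch of the idea, symmetric per-tensor INT8 quantization maps weights in [-max|w|, max|w|] onto [-127, 127] with a single scale factor. Real systems typically use per-channel or group-wise scales and calibrated ranges; the helper names here are illustrative:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate FP32 tensor from INT8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(8, 8).astype(np.float32)
q, scale = quantize_int8(w)
# INT8 storage is 1 byte per weight vs. 2 for FP16: roughly a 50% memory saving.
max_error = np.abs(dequantize(q, scale) - w).max()
```

The reconstruction error per weight is bounded by about half the scale step, which is why moderate bit widths preserve most of the model's behavior.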
8-bit vs. 4-bit
| Feature | 8-bit quantization (INT8) | 4-bit quantization (INT4) |
| --- | --- | --- |
| Memory savings | ~50% reduction vs. FP16 | ~75% reduction vs. FP16 |
| Inference speed | Faster than FP16 | Fastest, especially on low-resource devices |
| Accuracy | Minimal loss | Some quality degradation possible |
| Use case | Production systems, stable performance | Edge devices, experiments, very large models |
| Compatibility | Widely supported on modern GPUs | Limited support; may need specialized kernels |
| Trade-off | Balanced speed, memory, and accuracy | Maximizes efficiency at the cost of some precision |
Model service strategies that maximize throughput
In today's query processing systems, maximizing throughput is a critical performance factor. Throughput determines how many requests a system can process per unit of time, and without the right serving strategies in place, it can quickly become a bottleneck.
Service strategies
| Strategy | How it increases throughput |
| --- | --- |
| KV cache optimization | Reuses computed token representations to avoid redundant calculations during decoding |
| Model quantization (INT8/INT4) | Reduces memory usage and speeds up computation while maintaining acceptable accuracy |
| Batch processing & parallelism | Combines multiple requests for simultaneous processing and leverages multiple GPUs/threads |
| FlashAttention | Computes attention efficiently in blocks, reducing memory footprint and speeding up calculations |
| Dynamic memory management & offloading | Moves data between GPU and CPU as needed, preventing OOM while maintaining performance |
| Load monitoring & balancing | Prioritizes long or "hot" requests and distributes load across GPUs/nodes |
| System-level caching & precomputation | Stores frequent query results for instant retrieval, avoiding repeated computation |
Implementing these strategies allows systems to use resources efficiently, reduce processing latency, and ensure stable operation under high load. This is important for deploying LLMs in production environments or services with many concurrent users.
FAQ
What is the difference between prefill and decode phases?
Prefilling forms a context by processing all incoming tokens at once, and decoding generates a response one token at a time using that context.
How do you calculate the KV cache size for a model?
KV cache size is estimated as 2 × number of layers × number of heads × head dimension × number of tokens × bytes per value (the factor of 2 accounts for keys and values).
What techniques help handle long contexts without OOM?
Using FlashAttention, quantization, dynamic memory offloading, and context length limits helps handle long contexts without OOM.
What is paged attention and why use it?
Paged attention is a method of calculating attention in blocks to reduce memory usage for long sequences without losing accuracy.
How does the implementation of attention affect decoding throughput?
Optimized attention implementations such as FlashAttention increase decoding throughput by reducing memory traffic and speeding up computation per token.
When should you use FlashAttention?
FlashAttention should be used when long contexts or large batches make standard attention slow or memory expensive.
How do you choose between 8-bit and 4-bit quantization?
The choice between 8-bit and 4-bit quantization depends on the trade-off between model accuracy, memory savings, and inference speed.