A Practical Guide to LLM Inference Math: From Theory to Hardware
Large Language Models (LLMs) have become foundational in modern AI, but running them efficiently requires a deep understanding of the interplay between model architecture and hardware capabilities. Simply choosing the most powerful GPU isn't always the most cost-effective solution. The key is to know whether your workload is compute-bound or memory-bound.
This guide walks through the essential math to profile an LLM for inference, helping you select the right hardware and optimize its performance. We will apply these principles to a real-world example: running the Qwen/Qwen3-VL-32B-Instruct model on the powerful NVIDIA RTX PRO 6000 Blackwell Edition workstation GPU.
This article is inspired by the mathematical approach detailed in the Baseten blog post, "A guide to LLM inference and performance."
Step 1: Understanding Your Hardware's Capabilities
The first step is to analyze the key specifications of our GPU. These numbers define the theoretical limits of our hardware. For the NVIDIA RTX PRO 6000 Blackwell Edition, the critical specs are:
- GPU Memory (VRAM): 96 GB GDDR7 with ECC [1]
- GPU Memory Bandwidth: 1790 GB/s [1]
- FP16/BF16 Compute Performance: 126.0 TFLOPS [1]
 
These three metrics—capacity, speed, and raw power—are the pillars of our analysis.
Step 2: Calculating the GPU's Operational Intensity (Ops:Byte Ratio)
A GPU's operational intensity, or ops:byte ratio, tells us how many computations it can perform for every byte of data it moves from VRAM. This is a crucial, hardware-specific ratio that reveals the balance between computation and memory access.
The formula is straightforward:
ops:byte Ratio = Compute Bandwidth (FLOPS) / Memory Bandwidth (Bytes/s)
Let's calculate it for our RTX PRO 6000:
- Compute: 126.0 TFLOPS = 126,000,000,000,000 FLOPS
 - Memory: 1790 GB/s = 1,790,000,000,000 Bytes/s
 
ops_to_byte_ratio = 126,000,000,000,000 / 1,790,000,000,000
                  = 70.39 ops/byte
This means for our hardware to be fully utilized, our application must perform approximately 70.39 floating-point operations for every single byte it fetches from VRAM.
- If our model performs fewer operations per byte, we are memory-bound.
 - If our model requires more operations per byte, we are compute-bound.
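To make this easy to recompute for other GPUs, here is a minimal Python sketch of the same arithmetic, using the spec numbers quoted above:

```python
# Hardware ops:byte ratio for the RTX PRO 6000 Blackwell specs quoted above.
PEAK_FP16_FLOPS = 126.0e12     # 126.0 TFLOPS
MEMORY_BANDWIDTH = 1790.0e9    # 1790 GB/s

ops_to_byte_ratio = PEAK_FP16_FLOPS / MEMORY_BANDWIDTH
print(f"GPU ops:byte ratio ≈ {ops_to_byte_ratio:.2f} ops/byte")  # -> ~70.39
```

Swap in another card's peak FP16 throughput and memory bandwidth to find its crossover point.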
 
Step 3: Calculating the Model's Arithmetic Intensity
Next, we need to calculate the arithmetic intensity of our model. For Transformers, the most demanding part of inference is the attention mechanism.
We'll use the parameters for the Qwen/Qwen3-VL-32B-Instruct model [2]:
- Sequence Length (N): 4096
 - Model Dimension (d_model): 5120
 - Number of Attention Heads (n_heads): 40
 - Dimension per Head (d_head): 128
 
Let's calculate the arithmetic intensity using the simplified roofline-model approach: the numerator counts the FLOPs of the two big matrix multiplications in attention (Q*K^T and scores*V, each roughly 2 * N^2 * d_head FLOPs per head), and the denominator counts the bytes moved for the N x N attention-score matrix in FP16 as it is written and re-read around the softmax:
Arithmetic Intensity = (4 * N^2 * d_head) / (8 * N^2)
                     = d_head / 2
                     = 128 / 2
                     = 64.0 ops/byte
Under this simplification, the model's arithmetic intensity is 64 operations per byte.
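Here is the same simplification as a short Python sketch, with the assumed accounting spelled out in the comments (the numerator counts the two attention matmuls, the denominator the FP16 traffic for the score matrix):

```python
# Attention arithmetic intensity under the simplified treatment above.
N = 4096          # sequence length
D_HEAD = 128      # dimension per attention head

# FLOPs per head: Q*K^T and scores*V are each ~2 * N^2 * d_head FLOPs
# (a multiply and an add per element), so ~4 * N^2 * d_head in total.
flops = 4 * N**2 * D_HEAD

# Bytes per head: the N x N score matrix in FP16 (2 bytes per value) is
# written and re-read around the softmax, roughly four passes -> ~8 * N^2 bytes.
bytes_moved = 8 * N**2

arithmetic_intensity = flops / bytes_moved    # algebraically, d_head / 2
print(f"Attention arithmetic intensity ≈ {arithmetic_intensity:.1f} ops/byte")  # -> 64.0
```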
Step 4: Identifying the Bottleneck
Now we compare the two ratios:
- GPU Ops:Byte Ratio: 70.39 ops/byte
 - Model Arithmetic Intensity: 64.0 ops/byte
 
Since 64.0 < 70.39, our workload is memory-bound, although the margin at this sequence length is fairly narrow. Being memory-bound is the norm in LLM inference, and it is especially pronounced during autoregressive decoding, where generating each token requires streaming the full set of model weights from VRAM. In other words, memory bandwidth, not compute, is the primary limit on inference speed.
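Another way to see the same conclusion is the standard roofline formula for attainable throughput, min(peak compute, arithmetic intensity * memory bandwidth). A quick sketch with the numbers above:

```python
# Roofline model: attainable throughput = min(peak compute, intensity * bandwidth).
PEAK_FP16_FLOPS = 126.0e12     # FLOP/s
MEMORY_BANDWIDTH = 1790.0e9    # bytes/s
ARITHMETIC_INTENSITY = 64.0    # ops/byte, from Step 3

attainable = min(PEAK_FP16_FLOPS, ARITHMETIC_INTENSITY * MEMORY_BANDWIDTH)
print(f"Attainable throughput ≈ {attainable / 1e12:.1f} TFLOPS "
      f"out of a {PEAK_FP16_FLOPS / 1e12:.1f} TFLOPS peak")
# -> ~114.6 TFLOPS, so the bandwidth roof, not the compute roof, applies here.
```

At an intensity of 64 ops/byte, the bandwidth ceiling (about 114.6 TFLOPS) sits below the 126 TFLOPS compute peak, which is exactly what memory-bound means.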
Step 5: VRAM and Performance Estimation
VRAM for Model Weights
A 32-billion-parameter model in half precision (FP16) requires:
VRAM for Weights = 32 Billion Parameters * 2 Bytes/Parameter = 64 GB
VRAM for the KV Cache
The KV cache stores one key and one value vector per layer for every token, in FP16:
KV Cache per Token = 2 (K and V) * num_hidden_layers * hidden_size * 2 Bytes = 1,638,400 Bytes/token (≈ 1.64 MB)
(This is the full multi-head attention formula; models that use grouped-query attention keep fewer KV heads per layer, which shrinks the cache proportionally.)
With our 96 GB GPU, after loading the 64 GB model, we have:
Spare VRAM = 96 GB - 64 GB = 32 GB
This allows for a theoretical batch size of:
Batch Size = Spare VRAM / (KV Cache per Token * Sequence Length) ≈ 4 sequences
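The same budgeting as a sketch. Note that the layer count of 80 is an assumption implied by the 1,638,400 bytes/token figure above; check it against the model's config before relying on it:

```python
# VRAM budget: model weights, KV cache per token, and the resulting batch size.
PARAMS = 32e9
BYTES_PER_PARAM = 2            # FP16
VRAM_TOTAL = 96e9              # 96 GB
SEQ_LEN = 4096
HIDDEN_SIZE = 5120
NUM_HIDDEN_LAYERS = 80         # assumption: implied by the 1,638,400 bytes/token figure

weights_bytes = PARAMS * BYTES_PER_PARAM                    # 64 GB
kv_per_token = 2 * NUM_HIDDEN_LAYERS * HIDDEN_SIZE * 2      # K and V, FP16
spare_vram = VRAM_TOTAL - weights_bytes                     # 32 GB
batch_size = int(spare_vram // (kv_per_token * SEQ_LEN))    # whole sequences only

print(f"Weights: {weights_bytes / 1e9:.0f} GB")
print(f"KV cache: {kv_per_token / 1e6:.2f} MB per token")
print(f"Theoretical batch size: {batch_size} sequences")    # -> 4
```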
Estimating Performance
Time Per Output Token (Decoding Latency): In the memory-bound regime, generating each token requires streaming all of the model weights from VRAM once, so:
Time/Token = Model Size (Bytes) / Memory Bandwidth (Bytes/s) = 64 GB / 1790 GB/s ≈ 35.75 ms/token
This translates to a theoretical throughput of ~28 tokens/second.
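A quick sketch of the decode estimate, assuming each generated token streams the full 64 GB of FP16 weights from VRAM once:

```python
# Decode latency: every output token reads all model weights from VRAM once.
WEIGHTS_BYTES = 64e9           # 32B params * 2 bytes (FP16)
MEMORY_BANDWIDTH = 1790.0e9    # bytes/s

time_per_token = WEIGHTS_BYTES / MEMORY_BANDWIDTH
print(f"Time per output token ≈ {time_per_token * 1e3:.2f} ms")   # -> ~35.75 ms
print(f"Throughput ≈ {1 / time_per_token:.0f} tokens/s")          # -> ~28
```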
Time to First Token (Prefill Latency): Prefill is compute-bound, and a forward pass costs roughly 2 FLOPs per parameter per token, i.e. about 64 GFLOPs per prompt token for a 32B model. For a prompt of 512 tokens:
Prefill Time = (Prompt Tokens * FLOPs per Token) / GPU Compute Power = (512 * 64 GFLOPs) / 126 TFLOPS ≈ 260.06 ms
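And the matching prefill sketch, using the rough 2-FLOPs-per-parameter-per-token approximation and assuming prefill runs at the GPU's peak FP16 throughput:

```python
# Prefill latency: prompt processing is compute-bound, so divide total FLOPs
# by the GPU's peak FP16 throughput.
PROMPT_TOKENS = 512
PARAMS = 32e9
PEAK_FP16_FLOPS = 126.0e12

flops_per_token = 2 * PARAMS                   # ~2 FLOPs per parameter per token
prefill_time = PROMPT_TOKENS * flops_per_token / PEAK_FP16_FLOPS
print(f"Time to first token ≈ {prefill_time * 1e3:.2f} ms")       # -> ~260.06 ms
```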
Conclusion
Our analysis shows:
- Inference is Memory-Bound: The primary bottleneck is the 1790 GB/s memory bandwidth, not the 126.0 TFLOPS of compute.
- VRAM for Batching: The 96 GB of VRAM is a significant advantage, allowing a theoretical batch size of 4 concurrent sequences at a 4096-token context to better utilize the GPU's compute.
- Performance Expectations: We can expect a theoretical throughput of around 28 tokens/second and a prefill time of roughly 260 ms for a 512-token prompt.
 
These calculations provide a solid foundation for understanding LLM inference performance and making informed hardware decisions.
References
[1] NVIDIA RTX PRO 6000 Blackwell Workstation Edition specifications, sourced from primeLine Solutions (PNY NVIDIA RTX PRO 6000 Blackwell).
[2] Qwen/Qwen3-VL-32B-Instruct model architecture parameters, sourced from the Hugging Face model card.