Friendly disclaimer: flozi00 TechHub is a solo side project next to a full-time job. Everything you read here is my personal learning notes—no official statements, no vendor approval, and definitely room for mistakes. Please verify critical steps yourself and use all information at your own risk.

Selecting the Right GPU for Qwen3 Inference

This overview extends the calculations from A Practical Guide to LLM Inference Math and applies them to concrete hardware + model pairings. The goal is to make it obvious when the NVIDIA RTX PRO 6000, H200, or the upcoming DGX Station delivers the best efficiency for Qwen3-class workloads ranging from 4B to 32B active parameters.

Important: Every number in this playbook is a theoretical ceiling derived from vendor specs and simplified roofline math. Real deployments often land lower because kernels are imperfect, host↔device pipelines add friction, and GPUs rarely sustain 100% efficiency across an entire decode pass.

Recap: the 60-second ops:byte checklist

  1. Compute the GPU's ops:byte ratio (peak FLOPS ÷ memory bandwidth).
  2. Compute the model's arithmetic intensity (≈ d_head ÷ 2 for attention-heavy steps).
  3. If model_intensity < gpu_intensity, the workload is memory-bound; focus on bandwidth and VRAM.
  4. Estimate time per output token with model_size_bytes ÷ memory_bandwidth_bytes_per_second to derive a theoretical tokens/s ceiling for batch size 1.
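
As a compact reference, here is a minimal Python sketch of that checklist. It only encodes the simplified roofline math above; the example values (125 TFLOPS FP32, 1.792 TB/s, ~8 GB of weights, d_head = 128) are the RTX PRO 6000 and Qwen3-4B numbers used later in this article, not measurements.

```python
def ops_to_byte(peak_flops: float, mem_bw_bytes_s: float) -> float:
    """GPU ops:byte ratio: peak FLOPS divided by memory bandwidth."""
    return peak_flops / mem_bw_bytes_s


def arithmetic_intensity(d_head: int, batch: int = 1) -> float:
    """Model-side intensity (~ d_head / 2 per decode step), scaled by batch size."""
    return d_head / 2 * batch


def tokens_per_s_ceiling(weights_bytes: float, mem_bw_bytes_s: float, batch: int = 1) -> float:
    """Memory-bound decode ceiling: every output token re-reads the full weights once."""
    return mem_bw_bytes_s / weights_bytes * batch


# Example: RTX PRO 6000 (125 TFLOPS FP32, 1.792 TB/s) serving Qwen3-4B (~8 GB BF16, d_head = 128).
gpu_ratio = ops_to_byte(125e12, 1.792e12)        # ~70 ops/byte
model_int = arithmetic_intensity(d_head=128)     # ~64 ops/byte
bound = "memory-bound" if model_int < gpu_ratio else "compute-bound"
print(f"GPU {gpu_ratio:.1f} ops/byte vs model {model_int:.0f} ops/byte -> {bound}")
print(f"batch-1 ceiling ~ {tokens_per_s_ceiling(8e9, 1.792e12):.0f} tok/s")  # ~224 tok/s
```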

GPU capability snapshots

NVIDIA RTX PRO 6000 Blackwell Workstation Edition

  • 96 GB GDDR7 ECC VRAM fed by 1,792 GB/s bandwidth and ~125 TFLOPS of FP32 compute, which yields an ops:byte ratio near 70.
  • Peak board power of 600 W enables deskside deployments where noise and thermals matter and rack power isn't available.
  • Ideal when you need local fine-tuning or multi-modal prototyping (vision, audio) with up to ~64 GB of weights plus a useful KV cache budget.1

NVIDIA H200 Tensor Core GPU (SXM + NVL)

  • First Hopper GPU with 141 GB of HBM3e and 4.8 TB/s of memory bandwidth; BF16 Tensor performance is rated at 1.979 PFLOPS with sparsity, so the ops:byte ratio rockets past 400.
  • Ships with hardware MIG slicing (7 instances) and optional NVL configurations for air-cooled racks.
  • Best suited for 14B+ dense models or MoE deployments where both capacity and streaming bandwidth dominate cost.2

NVIDIA DGX Station (Grace Blackwell Ultra)

  • Desktop supercomputer that combines one Blackwell-Ultra GPU (up to 288 GB HBM3e @ 8 TB/s) with a 72-core Grace CPU and 496 GB of LPDDR5X in a coherent 784 GB memory pool.
  • NVLink-C2C delivers 900 GB/s between CPU and GPU, so large retrieval datasets can stay resident without PCIe penalties.
  • Shares the same Blackwell Ultra silicon used in HGX B300 servers, which NVIDIA rates at 36 PFLOPS BF16 / 144 PFLOPS FP4 for the 8-GPU baseboard—roughly 4.5 PFLOPS BF16 per GPU—giving us concrete compute ceilings for DGX Station workloads.34
  • Targets multi-user labs that need on-prem autonomy for iterative training, MoE routing experiments, and agent stacks before shipping them to a cluster.
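
To keep the scenario math below in one place, here is a small sketch that recomputes each card's ops:byte ratio from the spec numbers quoted above. Two assumptions: the H200 entry uses the 1.979 PFLOPS sparse BF16 rating, and the DGX Station entry uses the ≈4.5 PFLOPS per-GPU estimate derived from the HGX B300 baseboard, so treat every figure as a theoretical ceiling.

```python
# Peak spec numbers quoted in this article (theoretical ceilings, not measurements).
GPU_PROFILES = {
    # name:                 (peak FLOPS, memory bandwidth B/s, VRAM GB)
    "RTX PRO 6000":         (125e12,     1.792e12,              96),   # FP32 TFLOPS
    "H200 SXM":             (1.979e15,   4.8e12,               141),   # BF16 with sparsity
    "DGX Station (GB300)":  (4.5e15,     8.0e12,               288),   # ~1/8 of HGX B300 BF16
}

for name, (flops, bw, vram_gb) in GPU_PROFILES.items():
    print(f"{name:22s} ops:byte ~ {flops / bw:6.1f}, VRAM {vram_gb} GB")
# All three ratios (~70, ~412, ~562) sit well above the ~64 ops/byte of a Qwen3 decode
# step, so batch-1 decoding stays memory-bound on every card covered here.
```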

Qwen3 model footprints (batch size 1, BF16)

Weights assume BF16 (2 bytes per parameter). Each model reuses the per-token KV-cache formula from the inference math guide (2 * layers * hidden_size * 2 bytes); the sketch after the table reproduces those values.

| Model | Active params | Hidden size / layers | Weights (GB) | KV cache per token (MB) | Notes |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B-Instruct | ~4B | 2,560 / 36 | ~8 | 0.35 | Sliding-window ready; great for CPU offload experiments.5 |
| Qwen3-VL-8B-Instruct | ~8B | 4,096 / 36 | ~16 | 0.56 | Vision-language encoder adds ~1152-dim vision tower.6 |
| Qwen3-14B | ~14B | 5,120 / 40 | ~28 | 0.78 | 40-layer stack with 1M rope theta for 40k context.7 |
| Qwen3-32B | ~32B | 5,120 / 64 | ~64 | 1.25 | 64 decoder layers; same d_head so arithmetic intensity stays ≈64 ops/byte.8 |
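
The per-token KV-cache column is just the formula above applied to each configuration; the sketch below reproduces it (values shown in MiB, matching the table).

```python
def kv_bytes_per_token(layers: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Simplified full-hidden-size KV formula: 2 (K and V) * layers * hidden_size * 2 bytes (BF16)."""
    return 2 * layers * hidden_size * bytes_per_elem


QWEN3_CONFIGS = {
    # name:                   (hidden_size, layers, approx BF16 weights in GB)
    "Qwen3-4B-Instruct":      (2560, 36,  8),
    "Qwen3-VL-8B-Instruct":   (4096, 36, 16),
    "Qwen3-14B":              (5120, 40, 28),
    "Qwen3-32B":              (5120, 64, 64),
}

for name, (hidden, layers, weights_gb) in QWEN3_CONFIGS.items():
    kv_mib = kv_bytes_per_token(layers, hidden) / 2**20
    print(f"{name:22s} weights ~{weights_gb:3d} GB, KV/token ~ {kv_mib:.2f} MiB")
# Prints 0.35, 0.56, 0.78, 1.25 MiB per token, matching the table.
```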

Matching scenarios

1. Workstation prototyping (RTX PRO 6000)

  • Recommended models: Qwen3-4B, Qwen3-VL-8B.
  • Why: Both weights (8–16 GB) plus KV cache for 4K tokens still leave >70 GB VRAM for batching, LoRA adapters, or vision embeddings.
  • Throughput: Theoretical tokens/s = 1.792 TB/s ÷ weights. Expect ~220 tok/s (4B) or ~110 tok/s (8B) before compute saturation, so latency is dominated by memory fetch, not tensor ops.
  • Tip: Push batch size to 4 whenever the latency budget allows to move closer to the compute roofline; OBS or Whisper sidecars barely dent VRAM. The sketch below runs the headroom and tokens/s numbers.
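
A quick sanity check of those headroom and throughput claims, as a sketch. It assumes a 4K-token KV budget, the per-token values from the table, and decimal GB, so the results are intentionally rough.

```python
VRAM_GB = 96          # RTX PRO 6000
BW      = 1.792e12    # bytes/s
CONTEXT = 4096        # tokens of KV budget

for name, weights_gb, kv_mb_per_tok in [("Qwen3-4B", 8, 0.35), ("Qwen3-VL-8B", 16, 0.56)]:
    kv_gb    = CONTEXT * kv_mb_per_tok / 1000
    headroom = VRAM_GB - weights_gb - kv_gb
    tok_s    = BW / (weights_gb * 1e9)
    print(f"{name:12s} KV@4k ~ {kv_gb:.1f} GB, headroom ~ {headroom:.0f} GB, ceiling ~ {tok_s:.0f} tok/s")
# Qwen3-4B:    ~1.4 GB KV, ~87 GB free, ~224 tok/s
# Qwen3-VL-8B: ~2.3 GB KV, ~78 GB free, ~112 tok/s
```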

2. Enterprise copilots (H200 SXM)

  • Recommended models: Qwen3-14B or Qwen3-32B (both dense) at batch 1–2; the same sizing also covers comparable MoE deployments.
  • Why: 141 GB of HBM3e accommodates the 64 GB of Qwen3-32B weights plus >70 GB for long-context KV caches (the sketch after this list checks that budget). The ~412 ops:byte ratio means the model's arithmetic intensity (≈64) keeps you memory-bound, so the 4.8 TB/s feed yields ~170 tok/s on 14B and ~75 tok/s on 32B without tensor parallelism.
  • Tip: Carve the card into MIG instances when serving multiple tenants; each of the seven slices still gets roughly 18 GB of HBM3e.
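
To put a number on the long-context claim, here is a rough sketch that treats all HBM left after the weights as KV budget; that is a ceiling, since the serving framework also needs workspace.

```python
HBM_GB = 141      # H200
BW     = 4.8e12   # bytes/s

for name, weights_gb, kv_mb_per_tok in [("Qwen3-14B", 28, 0.78), ("Qwen3-32B", 64, 1.25)]:
    kv_budget_gb = HBM_GB - weights_gb
    max_tokens   = kv_budget_gb * 1000 / kv_mb_per_tok        # decimal GB -> MB
    tok_s        = BW / (weights_gb * 1e9)
    print(f"{name:10s} KV budget ~ {kv_budget_gb} GB -> ~{max_tokens:,.0f} cached tokens, "
          f"decode ceiling ~ {tok_s:.0f} tok/s")
# Qwen3-14B: ~113 GB -> ~145,000 tokens, ~171 tok/s
# Qwen3-32B:  ~77 GB ->  ~62,000 tokens,  ~75 tok/s
```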

3. Lab-scale supercomputer (DGX Station)

  • Recommended models: Any Qwen3 variant plus stacked tools (RAG, VLM agents) thanks to the 784 GB coherent pool.
  • Why: 288 GB of on-package HBM3e means you can pin two dense models or a dense+MoE pair simultaneously while the Grace CPU handles data preprocessing at 396 GB/s. NVLink-C2C eliminates PCIe resharding when streaming documents from RAM into KV caches.
  • Throughput: The HGX B300 spec (36 PFLOPS BF16 across eight GPUs, ≈4.5 PFLOPS per GPU) paired with the 8 TB/s bandwidth keeps the roofline memory-bound, so plan for ~125 tok/s on a 32B dense model and scale roughly linearly with batch size until the compute roofline (≈4.5 PFLOPS) takes over; the crossover sketch after this list makes that cutoff concrete.4
  • Tip: Use MIG (7 slices) to dedicate small partitions to telemetry or guardrail models without interrupting the main VLM job.
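
That crossover point can be estimated with a short sketch. The ≈2 FLOPs per parameter per decoded token is a common rule of thumb, not a figure from this article, so read the result as an order-of-magnitude estimate.

```python
BW          = 8.0e12    # DGX Station HBM3e bandwidth (bytes/s)
PEAK_FLOPS  = 4.5e15    # ~per-GPU BF16 share of the HGX B300 rating
WEIGHTS_B   = 64e9      # Qwen3-32B BF16 weights (bytes)
PARAMS      = 32e9

flops_per_token = 2 * PARAMS                     # rule-of-thumb decode cost per token
mem_ceiling_b1  = BW / WEIGHTS_B                 # ~125 tok/s per sequence
compute_ceiling = PEAK_FLOPS / flops_per_token   # aggregate tok/s before compute saturates

print(f"memory ceiling  ~ {mem_ceiling_b1:.0f} tok/s per sequence")
print(f"compute ceiling ~ {compute_ceiling:,.0f} tok/s aggregate")
print(f"memory-bound until batch ~ {compute_ceiling / mem_ceiling_b1:.0f}")
# ~125 tok/s, ~70,000 tok/s, crossover around batch ~560, so decode stays memory-bound
# for any realistic interactive batch size on this card.
```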

Quick pairing matrix

| Scenario | Model | GPU | Est. tokens/s (batch 1) | Primary bottleneck | Notes |
| --- | --- | --- | --- | --- | --- |
| Edge copilots | Qwen3-4B | RTX PRO 6000 | ~220 | Memory BW | Plenty of VRAM left for RAG embeddings. |
| Vision agent demos | Qwen3-VL-8B | RTX PRO 6000 | ~110 | Memory BW | Vision tower benefits from 96 GB VRAM for image batches. |
| Customer support copilots | Qwen3-14B | H200 | ~170 | Memory BW | MIG lets you mirror prod topology in dev. |
| Technical assistant / codegen | Qwen3-32B | H200 | ~75 | Memory BW | Requires tensor parallelism if batching >2. |
| Multi-agent sandbox | Qwen3-32B + tools | DGX Station | ~125 | Memory BW | 784 GB pool hosts RAG corpora in-memory. |
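
The tokens/s column is simply bandwidth ÷ weights at batch 1; here is a minimal sketch that regenerates it from the same spec numbers used throughout this article.

```python
ROWS = [
    # model,        weights (GB), GPU,            bandwidth (bytes/s)
    ("Qwen3-4B",     8, "RTX PRO 6000", 1.792e12),
    ("Qwen3-VL-8B", 16, "RTX PRO 6000", 1.792e12),
    ("Qwen3-14B",   28, "H200",         4.8e12),
    ("Qwen3-32B",   64, "H200",         4.8e12),
    ("Qwen3-32B",   64, "DGX Station",  8.0e12),
]

for model, weights_gb, gpu, bw in ROWS:
    print(f"{model:12s} on {gpu:12s} -> ~{bw / (weights_gb * 1e9):.0f} tok/s (batch 1)")
# ~224, ~112, ~171, ~75, ~125; the matrix rounds these to ~220/~110/~170/~75/~125.
```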

Interactive calculator (beta)

Use the planner below to stress-test context windows, batch sizes, and precision assumptions against each GPU profile.

Note: Tokens per second plateaus once the workload hits the compute roofline; the calculator now compares both limits and reports the stricter one.

LLM inference planner (beta)

Adjust model, GPU, context, and precision settings to estimate VRAM usage, roofline balance, and theoretical tokens per second for batch-1 workloads. The readout below shows one example configuration.

Capacity check

  • Model weights: 8 GB
  • KV cache per token: 0.37 MB
  • KV budget (context × batch): 0.57 GB
  • Total VRAM needed: 8.57 GB
  • Headroom on GPU: 87.43 GB

Roofline alignment

  • GPU ops:byte: 70.31 ops/byte
  • Base intensity (d_head/2): 64 ops/byte
  • Effective intensity (× batch): 64 ops/byte
  • Gap: 6.31 ops/byte

Memory-bound: scaling batch size raises effective intensity because more tokens share the same weight fetch per decode step.

Latency snapshot

  • Tokens per second: 224 tok/s
  • Throughput limit: Memory
  • Prefill latency: 4,571.43 ms

Tokens/s picks the lower of the memory-bound ceiling (`bandwidth ÷ weights × batch`) and the compute ceiling (`FLOPS ÷ ops/token`). Prefill latency uses the single-sequence value because prompts stream tokens sequentially.
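
For reference, here is a stand-alone sketch of the planner's math as described above. It is my own reconstruction, not the widget's source code, and it assumes ≈2 FLOPs per parameter per token for the compute ceiling; the example call uses Qwen3-4B on the RTX PRO 6000 with an illustrative 1,024-token context.

```python
def plan(weights_gb, layers, hidden, d_head, context, batch,
         gpu_flops, gpu_bw, gpu_vram_gb, bytes_per_elem=2):
    """Capacity check, roofline alignment, and latency snapshot for a decode workload."""
    kv_per_tok = 2 * layers * hidden * bytes_per_elem           # bytes, same formula as above
    kv_gb      = kv_per_tok * context * batch / 1e9
    total_gb   = weights_gb + kv_gb

    params     = weights_gb * 1e9 / bytes_per_elem
    mem_tok_s  = gpu_bw / (weights_gb * 1e9) * batch            # memory-bound ceiling
    cmp_tok_s  = gpu_flops / (2 * params)                       # compute ceiling (~2 FLOPs/param/token)

    return {
        "kv_gb": round(kv_gb, 2),
        "total_gb": round(total_gb, 2),
        "headroom_gb": round(gpu_vram_gb - total_gb, 2),
        "gpu_ops_per_byte": round(gpu_flops / gpu_bw, 2),
        "effective_intensity": d_head / 2 * batch,
        "tokens_per_s": round(min(mem_tok_s, cmp_tok_s)),
        "limit": "memory" if mem_tok_s <= cmp_tok_s else "compute",
        "prefill_ms": round(weights_gb * 1e9 / gpu_bw * context * 1e3, 2),  # single-sequence prefill
    }


print(plan(weights_gb=8, layers=36, hidden=2560, d_head=128, context=1024, batch=1,
           gpu_flops=125e12, gpu_bw=1.792e12, gpu_vram_gb=96))
```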

Footnotes

  1. NVIDIA RTX PRO 6000 Blackwell Workstation Edition specifications, NVIDIA.

  2. NVIDIA H200 Tensor Core GPU specifications, NVIDIA.

  3. NVIDIA DGX Station (Grace Blackwell Ultra) specifications, NVIDIA.

  4. NVIDIA HGX Platform and Blackwell Ultra specifications (HGX B300), NVIDIA.

  5. Qwen/Qwen3-4B-Instruct-2507 model card, Hugging Face.

  6. Qwen/Qwen3-VL-8B-Instruct model card, Hugging Face.

  7. Qwen/Qwen3-14B model card, Hugging Face.

  8. Qwen/Qwen3-32B model card, Hugging Face.