Demystifying GPU Sizing for LLMs
The Context
Large open-source (OSS) language models are springing up everywhere – Mistral, Llama, surprise releases like Kimi K2 and, this month, OpenAI's OSS models. Each of them promises open-source autonomy and fine-tuned intelligence, but there's a very practical question lurking in the background: how much hardware does it take to run them? The answer isn't just about having "some GPUs"; it hinges on understanding the interplay between model size, memory bandwidth and inference throughput. One of my hobby projects is a GPU infrastructure estimator that addresses exactly this question. Initially built as an unwieldy spreadsheet, it has evolved into a simple web tool designed to answer a surprisingly common question: "Can I run the latest open-source models on-prem for tens or hundreds of thousands of euros?"
What the tool does
At its core, the estimator calculates three things: memory requirements, compute throughput, and latency. To make the experience approachable, it includes a small library of popular models like Mistral 7B and Llama 2 7B, as well as the recently released Kimi K2 and OpenAI OSS models. Each model entry stores basic facts – number of parameters, "active" parameters (for mixture-of-experts models), precision in bytes per parameter, hidden dimension and number of layers[2]. You can also freely enter your own LLM config. On the hardware side, the tool ships with a catalogue of GPUs ranging from the modest A10 to H100 NVL cards, complete with their compute performance, VRAM size, memory bandwidth and approximate cost. Users can tweak all of these values or enter their own hardware specifications.
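For illustration, here is a minimal TypeScript sketch of what such entries could look like; the interface and field names (ModelSpec, GpuSpec and friends) are my own and not taken from the app's actual constants file, while the Mistral 7B values mirror the example used later in this post.

```ts
// Illustrative shapes for the model and GPU catalogues (not the app's real code).
interface ModelSpec {
  name: string;
  totalParamsB: number;   // total parameters, in billions
  activeParamsB: number;  // "active" parameters for MoE models (equals total for dense models)
  bytesPerParam: number;  // precision: 2 for FP16, 1 for INT8, 0.5 for 4-bit, ...
  hiddenDim: number;      // hidden dimension H
  numLayers: number;      // number of transformer layers L
}

interface GpuSpec {
  name: string;
  vramGib: number;        // VRAM size
  tflops16: number;       // 16-bit TFLOPS
  tops8: number;          // 8-bit TOPS
  bandwidthTBs: number;   // memory bandwidth in TB/s
  priceEur: number;       // approximate unit cost
}

const MISTRAL_7B: ModelSpec = {
  name: "Mistral 7B",
  totalParamsB: 7,
  activeParamsB: 7,
  bytesPerParam: 2,
  hiddenDim: 4096,
  numLayers: 32,
};
```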
Since it's a hobby app, the usual disclaimers apply; the LLM provider's own deployment recommendations remain the first reference to study before making a decision.
Why sizing is non‑trivial
The memory footprint of a model is not just its weights. During inference a transformer maintains a key-value (KV) cache for each token in the context, and the whole model is kept in GPU memory to speed up access. The tool follows a straightforward formula: the weight memory in gibibytes is vram_model_gib = P × 10⁹ × B / 1024³, where P is the total parameter count in billions and B is the number of bytes per parameter. The KV cache per token is kv_cache_per_token_gib = 2 × B × H × L / 1024³, where the factor 2 accounts for the K and V matrices, H is the hidden dimension and L is the number of layers. The total KV cache footprint is this per-token value multiplied by the batch size (queries per second) and the sequence length (input tokens + output tokens). Finally, a small fragmentation factor (≈5 %) is applied to get the total VRAM requirement.
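As a rough sketch of these memory formulas, assuming the simplifications described above (batch size treated as queries per second, ~5 % fragmentation), they might look like this in TypeScript; the function names are illustrative, not the app's actual code:

```ts
const GIB = 1024 ** 3;

// Weight memory: P billion parameters × B bytes each, converted to GiB.
function vramModelGib(paramsB: number, bytesPerParam: number): number {
  return (paramsB * 1e9 * bytesPerParam) / GIB;
}

// KV cache per token: 2 (K and V) × bytes per value × hidden dimension × layers.
function kvCachePerTokenGib(bytesPerParam: number, hiddenDim: number, numLayers: number): number {
  return (2 * bytesPerParam * hiddenDim * numLayers) / GIB;
}

// Total VRAM: weights + KV cache for every in-flight token, plus ~5 % fragmentation.
function totalVramGib(
  paramsB: number,
  bytesPerParam: number,
  hiddenDim: number,
  numLayers: number,
  batchSize: number,   // treated as queries per second
  seqLen: number,      // input tokens + output tokens
  fragmentation = 1.05,
): number {
  const kvCache = kvCachePerTokenGib(bytesPerParam, hiddenDim, numLayers) * batchSize * seqLen;
  return (vramModelGib(paramsB, bytesPerParam) + kvCache) * fragmentation;
}
```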
Compute sizing is equally important. For each GPU model the estimator uses either 8-bit TOPS or 16-bit TFLOPS, depending on the chosen precision, and derives how many tokens per second a single GPU can generate from the active parameter count: tokens_per_s_per_gpu_compute = floor((TF × 10¹²) / (Pa × 10⁹ × 2)). The tool then compares the GPU counts required to satisfy the memory, throughput and latency constraints and recommends the maximum of the three. Multiply by the per-GPU price to get an estimated capital expenditure. All the formulas are listed on GitHub.
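The compute side and the resulting GPU-count recommendation could be sketched along the same lines (again with illustrative names; this version only compares the memory and compute constraints, whereas the real tool also folds in latency):

```ts
// Tokens per second from raw compute, assuming ~2 FLOPs per active parameter per token.
function tokensPerSecPerGpu(tflops: number, activeParamsB: number): number {
  return Math.floor((tflops * 1e12) / (activeParamsB * 1e9 * 2));
}

// GPUs needed: the worst of the memory and throughput constraints.
function gpusNeeded(
  totalVram: number,           // GiB required for weights + KV cache
  vramPerGpuGib: number,       // GiB available per card
  requiredTokensPerSec: number,
  gpuTokensPerSec: number,
): number {
  const byMemory = Math.ceil(totalVram / vramPerGpuGib);
  const byCompute = Math.ceil(requiredTokensPerSec / gpuTokensPerSec);
  return Math.max(byMemory, byCompute);
}
```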
To see how these formulas work in practice, consider Mistral 7B. It has ~7 billion parameters in FP16 precision (2 bytes per parameter), a hidden size of 4,096 and 32 layers. Suppose we target 150 queries per second, each with 250 input tokens and 150 generated tokens, and pick an NVIDIA H100 PCIe card (80 GB VRAM, ~989 TFLOPS, 2 TB/s bandwidth and a price around €27,300). A single H100 is enough; the estimator shows the memory, compute and latency bottlenecks along with the estimated cost. You could also run the same model on less powerful hardware through quantization, which simply reduces the number of bytes used to store each parameter. In the app you can see the effect by changing the precision field; quantized versions can then be found on Hugging Face (for instance from Unsloth), or you can create your own.
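Plugging the example's numbers into the sketches above shows why a single card suffices; the figures below are approximate and rely on the same simplifications:

```ts
const batch = 150;            // queries per second, treated as batch size
const seqLen = 250 + 150;     // input tokens + output tokens

const vram = totalVramGib(7, 2, 4096, 32, batch, seqLen);
// ≈ (13.0 GiB of weights + 29.3 GiB of KV cache) × 1.05 ≈ 44.5 GiB → fits in one 80 GB card

const gpuTps = tokensPerSecPerGpu(989, 7);   // ≈ 70,600 tokens/s of raw compute
const requiredTps = 150 * 150;               // 22,500 generated tokens/s across all queries
console.log(gpusNeeded(vram, 80, requiredTps, gpuTps)); // → 1
```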
Building this tool forced me to dig deeper into transformer internals. Before I wrote a single line of TypeScript, I spent days playing with formulas in an Excel spreadsheet and reading about KV caches, activation recomputation and bandwidth bottlenecks. I wanted to maintain a balance: capture enough detail to make the sizing credible, but not drown users in the ocean of possible configurations. Some simplifications were deliberate—for example, the app assumes 90 % of a GPU’s VRAM is usable for the KV cache and ignores routing delays. It also focuses solely on inference; it does not model redundancy, networking or containerisation overheads. These trade‑offs keep the interface clean and allow architects to make quick, high‑level decisions.
The first version of the estimator was a bit of a monster spreadsheet. It worked, but sharing it was cumbersome. Converting it into a web application built with React, TypeScript and Vite brought immediate benefits: pre-defined models and GPUs live in a constants file, the calculations run in a single function, and users can export their configuration to CSV. Most of the heavy lifting was done by Google Lab Studio's code-generation service, which scaffolded the UI in minutes. My main task afterwards was double-checking the formulas against my original Excel calculations (yes, AI makes mistakes when sizing itself :)).
Takeaways: cost isn’t trivial, but it’s achievable
Experimenting with the estimator has been eye-opening. Running cutting-edge models offline, such as Mistral 24B or Magistral, starts around €30K, while Kimi K2, with its huge ~1 TB footprint, quickly climbs toward €500K based on the GPU count recommended by its creators.
The surprise comes with the recent release of OpenAI's open-source models, which start close to €120K (1× H100) for 120B parameters. That is remarkably cheap for an LLM of this size. The trick lies in the number of bytes needed per parameter; reducing it can affect accuracy, so the next few days should tell us whether this OpenAI OSS model lives up to current expectations.
Overall, the total price tag, while not pocket change, is far from out of reach for many companies.