GenAI App Architecture Explained (Part 3: The Hardware Stack)

In Part 1 and Part 2 of this series, we explored the high-level components of a modern GenAI application, from the user interface to the RAG pipeline. Now, let's get down to the metal. What type of resources do we need to make it all run?


What Resources Do We Need?

Like any application, LLMs rely on fundamental hardware resources to function: compute, memory, storage, and network. And sometimes, they need a lot of them.

Compute

LLM architectures, particularly Transformers, are heavy consumers of compute and memory. This is inherent to their design, which must perform a massive number of calculations to predict each subsequent token.
Think about it: each new token generated is the result of a complex calculation. All tokens from the prompt are turned into mathematical representations (vectors) to understand the intricate relationships between them. Based on this analysis, the model performs billions of calculations to predict the most likely next token. This massive computational exercise is the barrier to entry in the AI race.
Hardware manufacturers like Nvidia provide the extremely powerful GPUs (Graphics Processing Units) needed to perform these calculations in parallel. On the software side, LLM creators are also innovating with architectures like Mixture of Experts (MoE), which aims for more compute-efficient scaling by only activating a subset of the model's parameters for any given token.
Ultimately, the amount of resources needed is directly connected to the use case. Some small LLMs can run on CPUs for simple tasks, but if you're looking to run an application with reasoning and a fair level of intelligence, you will quickly need powerful GPUs.
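As a rough illustration (not an exact benchmark), a commonly cited rule of thumb is that a forward pass of a dense model costs about 2 FLOPs per parameter per generated token. A few lines of Python turn that into a ballpark compute budget; the figures below are illustrative assumptions, not vendor specs:

```python
# Rule-of-thumb compute estimate for a dense (non-MoE) model:
# generating one token costs roughly 2 FLOPs per parameter.
def required_tflops(num_params: float, tokens_per_second: float) -> float:
    """Sustained TFLOP/s needed to generate tokens_per_second across all users."""
    flops_per_token = 2 * num_params
    return flops_per_token * tokens_per_second / 1e12

# Example: a 24B-parameter model serving 200 tokens/s in aggregate.
print(f"{required_tflops(24e9, 200):.1f} TFLOP/s")  # ~9.6 TFLOP/s sustained
```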

Memory (vRAM)

Compute is not the only bottleneck; memory is another critical factor. A common mistake is to only consider the memory needed to store the model's weights. For example, a 24-billion-parameter model quantized to 8 bits (INT8) will require roughly 24 GB of vRAM (Video RAM on the GPU) just to load the weights.
However, the self-attention mechanism in the Transformer architecture is also a massive consumer of memory: it keeps the keys and values of every token in the sequence so they don't have to be recomputed. This stored state is called the KV cache.
The size of this cache is not driven by the size of the weights, but by the context length (sequence length), the batch size, and the model's architecture (number of layers, attention heads, and head dimension). A long context window (e.g., 200,000 tokens) with a high batch size can consume tens or even hundreds of GiB of vRAM in addition to the memory needed for the model weights.
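To make this concrete, here is a minimal back-of-the-envelope sketch of the KV cache formula. The architecture numbers in the example (layers, KV heads, head dimension) are hypothetical placeholders; for a real model, read them from its config file.

```python
# Back-of-the-envelope KV cache estimate:
# bytes = 2 (keys + values) * layers * kv_heads * head_dim
#         * sequence_length * batch_size * bytes per value
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, batch_size: int, bytes_per_value: int = 2) -> float:
    """Return the KV cache size in GiB (FP16 cache by default)."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / 1024**3

# Hypothetical 40-layer model with 8 KV heads of dimension 128 (grouped-query attention),
# a 200,000-token context and 8 concurrent requests:
print(f"{kv_cache_gib(40, 8, 128, 200_000, 8):.0f} GiB")  # ~244 GiB, on top of the weights
```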
To help understand the memory and compute needs, I have created an application where you can enter the LLM name, modify key sizing elements, and set expectations for input tokens, output tokens, and queries per second (QPS). It will then estimate which GPU is needed, how many, the vRAM requirements, and the approximate cost.
The goal is to give a "ballpark" estimation of the hardware and cost needed to run an LLM app.
You can find this GPU infrastructure estimator for LLMs here: https://gpu-infrastructure-estimator-for-llms-542321080600.us-west1.run.app/

Storage

LLMs have significant storage needs. Again, this is related to the size of the LLM. A 70B parameter model (like Llama 3 70B) can take ~140GB of disk space. A hypothetical 1-trillion parameter model could require over 1TB of storage for its weights alone.
It’s also worth noting that the CUDA drivers and other tooling take a good amount of space. As an example, the base vLLM container available on Docker Hub (https://hub.docker.com/r/vllm/vllm-openai/tags) uses 13GB before you even add any LLM weights!
Here’s what two examples might require on top of those 13GB:
  • Qwen2-7B-Instruct: ~15GB
  • Kimi K2 1T: 1TB of storage space!!!
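A quick sketch of that disk math, assuming the weight precisions below (they are illustrative; actual repository sizes vary with file format and extra artifacts):

```python
# Disk footprint ≈ parameters * bytes per parameter, plus the serving image.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}
RUNTIME_IMAGE_GB = 13  # base vLLM container quoted above, before any weights

def weights_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for name, params, precision in [("Qwen2-7B-Instruct", 7e9, "fp16"),
                                ("Llama 3 70B", 70e9, "fp16"),
                                ("Kimi K2 (1T)", 1e12, "int8")]:
    print(f"{name}: ~{RUNTIME_IMAGE_GB + weights_gb(params, precision):.0f} GB of disk")
```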

How Do We Access These Resources?

There are multiple ways to run an LLM. You can run it locally with solutions like Ollama or in virtual machines, but a production application will most likely run as a container on Kubernetes (K8s).

An LLM is essentially a set of weights that are loaded and served by a runtime such as vLLM. All of this can be wrapped in a Docker container that needs to access the hardware we mentioned.

Given the size of the LLM weights, you will usually separate the serving runtime storage from the storage hosting the weights. You’ll often end up with a container connected to two storage volumes (called Persistent Volumes in K8s): one for the application runtime and one for the model weights.

Then you’ll need a K8s node (a "worker") to host this pod, and that node must provide access to the GPU.
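Here is a minimal sketch of that pod, expressed as a Python dictionary and dumped as a K8s manifest. The image tag, claim names and mount paths are hypothetical; adapt them to your cluster.

```python
import yaml  # pip install pyyaml

# One vLLM container, one volume for the runtime/cache, one for the model weights,
# and a GPU resource request so the scheduler places the pod on a GPU node.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "llm-server"},
    "spec": {
        "containers": [{
            "name": "vllm",
            "image": "vllm/vllm-openai:latest",
            "resources": {"limits": {"nvidia.com/gpu": 1}},
            "volumeMounts": [
                {"name": "runtime-cache", "mountPath": "/root/.cache"},
                {"name": "model-weights", "mountPath": "/models"},
            ],
        }],
        "volumes": [
            {"name": "runtime-cache",
             "persistentVolumeClaim": {"claimName": "runtime-cache-pvc"}},
            {"name": "model-weights",
             "persistentVolumeClaim": {"claimName": "model-weights-pvc"}},
        ],
    },
}
print(yaml.safe_dump(pod, sort_keys=False))
```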

Be careful with this part, because the GPU model will have a strong impact on your bill. If you use a hyperscaler (like GCP, AWS, or Azure), you will have to request a "service quota" increase to be able to deploy nodes with GPUs. Hyperscalers have created these quotas to protect you from accidental, massive bills. It’s a good thing! A single GPU can cost from a few hundred euros per month (like an Nvidia A10G) to nearly 29,000€/month for a high-end H100-powered instance!

So, K8s is the de facto platform to run LLMs. You can also use this same platform to run the other parts of the application we mentioned in the previous articles. Of course, you can also consume abstracted services (like RAG-as-a-Service or LLM APIs), but getting your hands dirty with all the components running in K8s is a great way to learn how it all works.

Scalability and Resiliency

While running your entire app in a small K8s cluster with one GPU node is fine for testing, you will certainly want more power for scalability and resiliency in your production environment.

When a model is too large for a single GPU, or you need to handle higher query volumes, you must distribute the load. This is a complex topic, but the primary strategies fall into two categories: Data Parallelism and Model Parallelism.

1. Data Parallelism (Scaling for Throughput)

This is the simplest and most common way to scale.

  • How it works: You replicate the entire model onto several different GPUs. A load balancer then distributes incoming user requests (queries) across these copies (see the sketch after this list).

  • Analogy: Think of it like opening multiple checkout lanes in a supermarket. Each lane (GPU) has a full cash register (the complete model) and processes customers (data batches) independently.

  • Best for: Increasing the Queries Per Second (QPS) for a model that already fits on a single GPU.
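A toy sketch of the idea follows. The replica URLs and endpoint path are hypothetical; in a real K8s deployment a Service or a dedicated load balancer does this job rather than client-side code.

```python
import itertools
import requests  # pip install requests

# Three identical replicas of the same model, each on its own GPU.
REPLICAS = itertools.cycle([
    "http://llm-replica-0:8000/v1/completions",
    "http://llm-replica-1:8000/v1/completions",
    "http://llm-replica-2:8000/v1/completions",
])

def complete(prompt: str) -> str:
    """Send each request to the next replica in the rotation (round-robin)."""
    resp = requests.post(next(REPLICAS),
                         json={"model": "my-model", "prompt": prompt, "max_tokens": 64})
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```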

2. Model Parallelism (Scaling for Size)

This is what you use when the model itself is too big to fit into a single GPU's memory. You split the model itself across multiple GPUs. There are two main ways to do this:

  • Tensor Parallelism (Intra-Layer):

    • How it works: This method splits a single large operation (like a massive weight matrix in a Transformer layer) across multiple GPUs. All GPUs work on the same operation at the same time.

    • Analogy: This is like two cashiers working on the same customer's shopping cart simultaneously to scan items faster.

    • Requirement: This requires an extremely fast, low-latency connection between the GPUs (like NVLink) to be efficient.

  • Pipeline Parallelism (Inter-Layer):

    • How it works: This method splits the model's layers into stages and puts each stage on a different GPU. GPU 1 handles layers 1-10, GPU 2 handles layers 11-20, and so on.

    • Analogy: This is like an assembly line. One worker (GPU 1) assembles the base, then passes it to the next worker (GPU 2) to add the next part, who then passes it to GPU 3.

    • Challenge: This can create "bubbles" where GPU 2 is idle, waiting for GPU 1 to finish. This is managed with a technique called micro-batching.

In practice, most large-scale serving systems (like vLLM or Hugging Face's Text Generation Inference) use a hybrid approach. For example, they might use Tensor Parallelism to run one very large model within a multi-GPU node, and then use Data Parallelism to replicate that entire multi-GPU node to handle more user requests.
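As an illustration, here is roughly what tensor parallelism looks like with vLLM's offline API. The model name and GPU count are assumptions for the sketch; check vLLM's documentation for your own setup.

```python
from vllm import LLM, SamplingParams

# Shard each layer's weight matrices across 4 GPUs in the same node (tensor parallelism).
# Pipeline parallelism (splitting layers into stages) can be enabled in addition.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # example model, too big for one GPU
    tensor_parallel_size=4,
)

outputs = llm.generate(["Explain the KV cache in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)

# Data parallelism then sits on top: replicate this whole multi-GPU deployment
# and load-balance user traffic across the copies.
```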

What’s next?

We have seen in this article that K8s is a great platform to run your LLM-powered app. In the next chapter, we will go hands-on and deploy a full K8s stack with a GPU node and most of the components we mentioned. Think of it as an LLM app playground as a service on Kubernetes, shortened to LLASTAK.
