GenAI App Architecture Explained (Part 3: The Hardware Stack)
In Part 1 and Part 2 of this series, we explored the high-level components of a modern GenAI application, from the user interface to the RAG pipeline. Now, let's get down to the metal. What kinds of resources do we need to make it all run?
What Resources Do We Need?
Compute
Memory (VRAM)
Storage
- Qwen2-7B-Instruct: ~15GB
- Kimi K2 (1T parameters): ~1TB of storage space!
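A quick way to sanity-check these numbers is to multiply the parameter count by the bytes per parameter for the precision you plan to use. Below is a rough back-of-the-envelope sketch (assuming fp16 weights for the 7B model and 8-bit weights for the 1T model; real checkpoints add some overhead for tokenizer files, configs, and sharding metadata):

```python
# Rough estimate of model weight size on disk or in VRAM.
# Real checkpoints are slightly larger (tokenizer, config, shard metadata),
# so treat these as lower bounds.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int4": 0.5}

def weight_size_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate size of the raw weights in gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Qwen2-7B-Instruct in fp16/bf16: roughly 14-15 GB
print(f"7B @ fp16 ≈ {weight_size_gb(7e9, 'fp16'):.0f} GB")

# A 1-trillion-parameter model (e.g. Kimi K2) at 8-bit: roughly 1 TB
print(f"1T @ fp8  ≈ {weight_size_gb(1e12, 'fp8') / 1000:.1f} TB")
```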
How Do We Access These Resources?
Scalability and Resiliency
While running your entire app in a small K8s cluster with one GPU node is fine for testing, you will certainly want more power for scalability and resiliency in your production environment.
When a model is too large for a single GPU, or you need to handle higher query volumes, you must distribute the load. This is a complex topic, but the primary strategies fall into two categories: Data Parallelism and Model Parallelism.
1. Data Parallelism (Scaling for Throughput)
This is the simplest and most common way to scale.
How it works: You replicate the entire model onto several different GPUs. A load balancer then distributes incoming user requests (queries) across these different copies.
Analogy: Think of it like opening multiple checkout lanes in a supermarket. Each lane (GPU) has a full cash register (the complete model) and processes customers (data batches) independently.
Best for: Increasing the Queries Per Second (QPS) for a model that already fits on a single GPU.
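To make the checkout-lane analogy concrete, here is a minimal sketch of data parallelism at the serving layer: a round-robin dispatcher sends each incoming query to one of several identical model replicas. The `ModelReplica` class and its `generate` method are placeholders for whatever inference engine each GPU actually runs, not a real API.

```python
import itertools

class ModelReplica:
    """Placeholder for a full copy of the model loaded on one GPU."""
    def __init__(self, gpu_id: int):
        self.gpu_id = gpu_id

    def generate(self, prompt: str) -> str:
        # In a real system this would call the inference engine on this GPU.
        return f"[gpu {self.gpu_id}] answer to: {prompt}"

# Data parallelism: N identical replicas, each holding the complete model.
replicas = [ModelReplica(gpu_id=i) for i in range(4)]

# A trivial round-robin "load balancer" that spreads queries across replicas.
next_replica = itertools.cycle(replicas)

def handle_request(prompt: str) -> str:
    return next(next_replica).generate(prompt)

for q in ["What is RAG?", "Summarize Part 2", "Explain VRAM"]:
    print(handle_request(q))
```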
2. Model Parallelism (Scaling for Size)
This is what you use when the model is too big to fit into a single GPU's memory: you split the model itself across multiple GPUs. There are two main ways to do this:
Tensor Parallelism (Intra-Layer):
How it works: This method splits a single large operation (like a massive weight matrix in a Transformer layer) across multiple GPUs. All GPUs work on the same operation at the same time.
Analogy: This is like two cashiers working on the same customer's shopping cart simultaneously to scan items faster.
Requirement: An extremely fast, low-latency interconnect between the GPUs (like NVLink) is needed for this to be efficient.
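As a toy illustration of the idea (not how real frameworks implement it), the sketch below splits one weight matrix column-wise across two "devices", lets each compute its shard of the output, and then gathers the results. The shapes and values are made up for the example, and both "devices" are simulated in a single process with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# One "layer": y = x @ W, where W is the large weight matrix we want to split.
x = rng.standard_normal((1, 8))   # a single token's activations
W = rng.standard_normal((8, 6))   # the full weight matrix

# Tensor parallelism: split W column-wise across two devices.
W_dev0, W_dev1 = np.split(W, 2, axis=1)

# Each device multiplies the same input by its own shard, at the same time.
y_dev0 = x @ W_dev0               # computed on "GPU 0"
y_dev1 = x @ W_dev1               # computed on "GPU 1"

# The partial outputs are gathered; this communication step is why a fast
# interconnect (e.g. NVLink) matters.
y = np.concatenate([y_dev0, y_dev1], axis=1)

assert np.allclose(y, x @ W)      # same result as the unsplit layer
```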
Pipeline Parallelism (Inter-Layer):
How it works: This method splits the model's layers into stages and puts each stage on a different GPU. GPU 1 handles layers 1-10, GPU 2 handles layers 11-20, and so on.
Analogy: This is like an assembly line. One worker (GPU 1) assembles the base, then passes it to the next worker (GPU 2) to add the next part, who then passes it to GPU 3.
Challenge: This can create "bubbles" where GPU 2 is idle, waiting for GPU 1 to finish. This is managed with a technique called micro-batching.
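The toy sketch below mimics the assembly line: the layers are split into two stages (as if on two GPUs) and the batch is cut into micro-batches that flow through the stages. The stage functions are stand-ins for real layer groups, and everything runs serially in one process; a real pipeline would run the stages concurrently on different devices.

```python
# Toy pipeline parallelism: layers split into stages, batch split into
# micro-batches. Here everything runs serially; in a real pipeline the
# stages run concurrently on different GPUs, and micro-batching keeps
# the later stages from sitting idle (the "bubbles").

def stage_1(x):   # e.g. layers 1-10, hosted on "GPU 1"
    return [v + 1 for v in x]

def stage_2(x):   # e.g. layers 11-20, hosted on "GPU 2"
    return [v * 2 for v in x]

stages = [stage_1, stage_2]

def pipeline_forward(batch, num_micro_batches=4):
    size = max(1, len(batch) // num_micro_batches)
    micro_batches = [batch[i:i + size] for i in range(0, len(batch), size)]
    outputs = []
    for mb in micro_batches:          # each micro-batch flows stage by stage
        for stage in stages:
            mb = stage(mb)
        outputs.extend(mb)
    return outputs

print(pipeline_forward(list(range(8))))   # [2, 4, 6, 8, 10, 12, 14, 16]
```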
In practice, most large-scale serving systems (like vLLM or Hugging Face's Text Generation Inference) use a hybrid approach. For example, they might use Tensor Parallelism to run one very large model within a multi-GPU node, and then use Data Parallelism to replicate that entire multi-GPU node to handle more user requests.
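With vLLM, for example, the intra-node part is a single argument: one replica of the model sharded across several GPUs via tensor parallelism. The sketch below uses the Qwen2-7B-Instruct model from the storage example; the data-parallel part would then be running several such replicas (e.g. multiple Kubernetes pods) behind a load balancer. Exact arguments and defaults may differ between vLLM versions.

```python
from vllm import LLM, SamplingParams

# One replica of the model, sharded across 2 GPUs with tensor parallelism.
# To scale throughput further, run several such replicas (data parallelism)
# behind a load balancer, e.g. as multiple K8s pods like this one.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```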
What’s next?
We have seen in this article that K8s is a great platform for running your LLM-powered app. In the next chapter, we will go hands-on and deploy a full K8s stack with a GPU node and most of the components we have mentioned. Think of it as an LLM app playground as a service on Kubernetes, or LLASTAK for short. We'll dive into LLASTAK next.