GenAI App Architecture Explained (Part 4: LLASTAKS — A Full LLM App Playground on Kubernetes)
A practical guide to spin up a complete, test-friendly GenAI stack (LLM + RAG + app + observability) on AWS EKS.
TL;DR
LLASTAKS (LLM App STAck on Kubernetes as a Service) is a reproducible, low-touch blueprint to deploy an end-to-end GenAI playground—not just a model endpoint. In ~one command you provision AWS EKS with a GPU node, run vLLM behind an OpenAI-compatible API, add a chatbot frontend, a RAG pipeline (ingestion → FAISS search → RAG chatbot), and wire observability (metrics/logs/traces). Use it to prototype features (e.g., function calling, LoRA fine-tuning) and to learn real-world constraints before building production systems.
Figure: Overview of LLASTAKS
Why this project?
Most “playgrounds” stop at the LLM API. Real applications need more: storage for weights, a serving layer, an app surface, retrieval, orchestration, and instrumentation. LLASTAKS gives you the whole app loop so you can:
- Experiment quickly with local + cluster deployments
- Validate architectural choices (EKS + GPU, EBS, internal DNS, service wiring)
- Benchmark UX & quality with RAG
- Measure reality with Prometheus/Grafana (and optional Loki/Tempo)
- Iterate safely (IaC + scripts) and clean up when you’re done
- Work safely: you connect to the app through port-forwarding, with no direct public Internet exposure.
I used Qwen3-8B (INT4-quantized), which is capable enough for a variety of scenarios while light enough to reduce the GPU requirement and the price. Feel free to use your own model and change the EC2 instance size to adjust the cost. With Qwen3-8B INT4, the whole platform costs roughly €1/hour.
AWS provides new accounts with $100 in credits, which gives you around 80 hours of free playground time per new account.
What you get (capabilities)
- EKS cluster with a dedicated GPU node for inference (e.g., g5.xlarge) and a CPU node for app/RAG jobs
- vLLM serving an OpenAI-compatible API at http://vllm.llasta.cluster.local:8000
- Chatbot (FastAPI backend + minimal HTML/JS) for quick, realistic testing
- RAG pipeline: data ingestion → FAISS wrapper service → RAG chatbot with tunables (top-k, context size)
- Observability: Prometheus metrics out of the box; optional Grafana Cloud dashboards; tracing with OpenTelemetry (Tempo)
- Automation: end-to-end Deploy.sh (Terraform + K8s manifests) and clean structure per component
The repository folders provide a step-by-step structure: 000-K8 deployment/, 001-Copy weights to ebs/, 002-vLLM deployment/, 003-chatbot/, 004-RAG/, 005-observability/. Each stage contains a Stage_readme.md.
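To make "OpenAI-compatible" concrete, here is a minimal Python sketch (an illustrative example, not code from the repository) that queries vLLM with the official openai client. The base URL is the in-cluster service name listed above; the model identifier is a placeholder, so check /v1/models for the exact name in your deployment.

from openai import OpenAI

# Inside the cluster, vLLM is reachable via its internal DNS name.
# From your laptop, point base_url at a port-forward instead (http://localhost:8000/v1).
client = OpenAI(
    base_url="http://vllm.llasta.cluster.local:8000/v1",
    api_key="not-needed",  # vLLM does not enforce an API key by default
)

# Placeholder model name: use whatever /v1/models returns in your setup.
response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Give me one design goal of LLASTAKS."}],
)
print(response.choices[0].message.content)

Because the API surface is the standard OpenAI one, any SDK or framework that speaks that protocol can be pointed at the cluster without code changes.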
Prerequisites
- AWS account with GPU quota for G-instances
- Local tools: awscli, kubectl, terraform, docker, python3
- Basic familiarity with Kubernetes contexts and port-forwarding
Cost note: GPU instances are not free; spin up only when needed and destroy when done.
Quickstart — from zero to working LLM
- Clone the repository and follow the Readme.md instructions.
This will:
- Provision EKS with the right node groups (GPU/CPU)
- Prepare EBS + PVCs for model weights and FAISS index
- Deploy vLLM and expose it at vllm.llasta.cluster.local:8000
- Verify the cluster and service
kubectl get nodes
kubectl -n llasta get pods,svc
- Test vLLM
kubectl -n llasta port-forward svc/vllm 8000:8000
curl -s http://localhost:8000/v1/models | jq
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-8b-…",
"messages": [{"role":"user","content":"Give me one design goal of LLASTAKS."}]
}' | jq
- Run the Chatbot
- In-cluster: kubectl apply -f 003-chatbot/chatbot.yaml, then port-forward svc/chatbot to http://localhost:8080/
- Local dev: install the dependencies from 003-chatbot/backend/requirements.txt and run the FastAPI backend
You'll find all the details in the Stage_Readme.md of each part: 000-K8 deployment/, 001-Copy weights to ebs/, 002-vLLM deployment/, 003-chatbot/
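For local development, the backend is essentially a thin proxy between the browser and the OpenAI-compatible endpoint. The snippet below is an illustrative sketch under that assumption, not the repository's actual backend: the /chat route, payload shape, and model name are invented for the example, and it expects vLLM to be port-forwarded to localhost:8000.

# Illustrative FastAPI backend: forwards a user message to vLLM and returns the reply.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # port-forwarded vLLM

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    payload = {
        "model": "qwen3-8b",  # placeholder: use the name returned by /v1/models
        "messages": [{"role": "user", "content": req.message}],
    }
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(VLLM_URL, json=payload)
        resp.raise_for_status()
    return {"reply": resp.json()["choices"][0]["message"]["content"]}

# Run locally with: uvicorn main:app --port 8080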
Stage RAG — Objective & Scope
Build a simple, reliable, low-cost RAG V1 on EKS with vLLM (Qwen3-8B INT4) for generation, FAISS for retrieval, and a reranker for quality. The goal is to understand end-to-end behavior—from ingest to retrieval—its limitations, and how the LLM’s reasoning complements retrieval.
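The query path of such a RAG V1 can be sketched as embed → FAISS search → rerank → generate. The code below is a conceptual sketch only: it assumes a local FAISS index and chunk list produced by an ingestion step (sketched further below), BGE-M3 loaded through sentence-transformers, a bge-reranker cross-encoder, and a port-forwarded vLLM; file and model names are illustrative.

# Conceptual RAG V1 query loop: embed -> FAISS search -> rerank -> generate.
# Model names, file names, and the prompt format are illustrative assumptions.
import json
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI

embedder = SentenceTransformer("BAAI/bge-m3")             # dense embeddings (CPU is fine)
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")        # cross-encoder for quality
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

index = faiss.read_index("faiss.index")                   # built at ingestion time
chunks = json.loads(Path("chunks.json").read_text())      # chunk texts aligned with index ids

def answer(question: str, top_k: int = 10, keep: int = 3) -> str:
    query_vec = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(query_vec, top_k)               # coarse retrieval
    candidates = [chunks[i] for i in ids[0]]
    scores = reranker.predict([(question, c) for c in candidates])
    best = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:keep]]
    context = "\n\n".join(best)
    resp = llm.chat.completions.create(
        model="qwen3-8b",                                  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

The top_k and keep parameters map to the tunables mentioned earlier (top-k and context size), which is where most of the quality/latency trade-off lives.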
What to do
- Deploy faiss-wrap + chatbot-RAG.
- Run ingest.py twice on the sample sets (Hard-to-read vs Easy-to-read) and reflect on outcomes.
- Ingest Easy-to-read into FAISS and test the RAG; compare with a large hosted LLM (e.g., GPT-5) to understand differences.
- Reset FAISS and ingest Build your own electric car; reflect again.
Follow 004-RAG/Stage_Readme.md to set up this part.
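For orientation, here is a deliberately simplified sketch of what an ingestion step does conceptually (chunk → embed → index). It is not the repository's ingest.py; the file names, naive paragraph chunking, and index type are assumptions.

# Simplified view of ingestion: chunk the documents, embed them, write a FAISS index.
import json
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-m3")

def ingest(text_files: list[str], out_dir: str = ".") -> None:
    chunks = []
    for path in text_files:
        text = Path(path).read_text()
        # naive paragraph chunking; see the chunking lesson further down for a better approach
        chunks.extend(p.strip() for p in text.split("\n\n") if p.strip())

    vectors = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])           # inner product == cosine on normalized vectors
    index.add(vectors)

    faiss.write_index(index, f"{out_dir}/faiss.index")
    Path(f"{out_dir}/chunks.json").write_text(json.dumps(chunks))

ingest(["easy_to_read_sample.txt"])                        # hypothetical input file

In this sketch, resetting FAISS before the next exercise would simply mean deleting (or rebuilding) faiss.index and chunks.json prior to the next ingestion run.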
005-Observability — simplified preview
Observability is a core pillar of LLASTAKS, but we’ll dive deep into it in an upcoming article. This section just gives a quick preview:
- Metrics: Prometheus collects system and app metrics (requests, latency, FAISS search time, GPU usage). You can plug it into Grafana Cloud later.
- Logs: Loki can centralize logs from all pods for debugging.
- Traces: Tempo and OpenTelemetry trace requests from chatbot → FAISS → vLLM.
The full setup, dashboards, and alerting best practices will be detailed in Part 5 — Observability and Debugging GenAI Apps.
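As a tiny code-level preview of the metrics side, here is a hedged sketch of how an app such as the chatbot could expose a request counter and a latency histogram with prometheus_client; the metric names and port are made up for the example and are not necessarily what the repo uses.

# Illustrative app-side metrics with prometheus_client; metric names are invented.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("chatbot_requests_total", "Chat requests handled")
LATENCY = Histogram("chatbot_request_seconds", "End-to-end chat latency in seconds")

def handle_chat(message: str) -> str:
    REQUESTS.inc()
    with LATENCY.time():
        # ... call the FAISS wrapper and vLLM here ...
        time.sleep(0.1)                                   # stand-in for real work
        return "reply"

if __name__ == "__main__":
    start_http_server(9000)                               # Prometheus scrapes http://<pod>:9000/metrics
    while True:
        handle_chat("ping")
        time.sleep(1)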
Lessons learned
- LLMs are GPU-hungry beasts and quantization matters: using an INT4-quantized Qwen3-8B made the model fit in GPU memory and start reliably on smaller accelerators, while keeping reasonable quality. The KV cache can still dominate; adjust --max-model-len and GPU memory utilization accordingly.
- RAG ≠ math: vector search over PDFs full of numbers underperforms on analytics questions. For structured data (e.g., transactions), prefer tool use / function calling against a SQL/Parquet store (PAL-style workflows) and let the LLM plan and format the queries.
- Chunking quality drives retrieval: page-sized chunks are too coarse. Favor semantic chunking (by section/paragraph), enrich chunks with metadata (chapter/section/type), and track sources precisely (see the sketch after this list).
- IaC keeps you sane: Terraform is a great tool to quickly spin up and clean up this platform.
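To make the chunking point concrete, here is a small sketch of section-level chunking with metadata. The heading heuristic and metadata fields are assumptions for illustration, not the implementation used in the repository.

# Sketch of semantic chunking: split on section headings and attach metadata for source tracking.
import re

def chunk_by_section(text: str, source: str) -> list[dict]:
    chunks, current_title, current_lines = [], "Introduction", []

    def flush():
        if current_lines:
            chunks.append({
                "text": "\n".join(current_lines).strip(),
                "metadata": {"source": source, "section": current_title, "type": "body"},
            })

    for line in text.splitlines():
        # crude heading detection: markdown headings or numbered section titles
        if re.match(r"^(#{1,3} |\d+(\.\d+)*\s+[A-Z])", line):
            flush()
            current_title, current_lines = line.strip(), []
        else:
            current_lines.append(line)
    flush()
    return chunks

Attaching source/section/type at ingestion time is what later lets the RAG chatbot cite precisely where an answer came from.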
The current stack
- Model server: vLLM (OpenAI-compatible API)
- Base model: Qwen3-8B (INT4 variant for memory efficiency)
- Embedding: BGE-M3 (CPU OK)
- Vector DB: FAISS (local index on PVC/EBS via wrapper service)
- Apps: Chatbot (acts as the orchestrator)
- Observability: Prometheus, Grafana Cloud, Loki, Tempo
Conclusion & Next Steps
LLASTAKS isn’t meant to be a polished production system—it’s a living playground. Its purpose is to let you experience how a real GenAI app behaves: from the model serving layer to retrieval and orchestration, all the way to the metrics that reveal where the bottlenecks lie. By deploying it yourself, you start to understand how LLMs, retrieval systems, and observability tools fit together in a practical, reproducible stack.
Running it on EKS gives you full control: every network rule, storage choice, and node behavior is visible and adjustable. You can swap any component—vLLM for another model server, FAISS for another vector DB, or Prometheus for another observability tool—and see instantly how your stack evolves. And if you don’t have GPUs handy, smaller or quantized models can make experimentation affordable, while local tools like Ollama keep things simple.
As you move forward, remember that LLASTAKS is just the foundation. Part 5 of this series will dive deeper into observability—showing how to trace requests, analyze performance, and optimize the LLM–RAG loop in real time. Until then, treat this as your sandbox: break it, fix it, and learn from it.
When you’re done exploring, destroy your EKS cluster to avoid unnecessary GPU costs, and keep snapshots only if you plan to reuse weights. And finally, if this project sparks ideas or helps you in your journey, ⭐ star the repository and share your feedback—every insight helps improve the stack for the community.