GenAI App Architecture Explained (Part 4: LLASTAKS — A Full LLM App Playground on Kubernetes)
A practical guide to spin up a complete, test-friendly GenAI stack (LLM + RAG + app + observability) on AWS EKS.
TL;DR
LLASTAKS (LLM App STAck on Kubernetes as a Service) is a reproducible, low-touch blueprint to deploy an end-to-end GenAI playground—not just a model endpoint. In ~one command you provision AWS EKS with a GPU node, run vLLM behind an OpenAI-compatible API, add a chatbot frontend, a RAG pipeline (ingestion → FAISS search → RAG chatbot), and wire observability (metrics/logs/traces). Use it to prototype features (e.g., function calling, LoRA fine-tuning) and to learn real-world constraints before building production systems.
Figure: Overview of LLASTAKS
Why this project?
Most “playgrounds” stop at the LLM API. Real applications need more: storage for weights, a serving layer, an app surface, retrieval, orchestration, and instrumentation. LLASTAKS gives you the whole app loop so you can:
- Experiment quickly with local + cluster deployments
- Validate architectural choices (EKS + GPU, EBS, internal DNS, service wiring)
- Benchmark UX & quality with RAG
- Measure reality with Prometheus/Grafana (and optional Loki/Tempo)
- Iterate safely (IaC + scripts) and clean up when you’re done
- Work safely: you connect to the app through port-forwarding, with no direct public Internet exposure.
I used Qwen3-8B (INT4-quantized), which is capable enough for a variety of scenarios while light enough to reduce the GPU requirement and the price. Feel free to use your own model and change the EC2 instance size to adjust the cost. With Qwen3-8B INT4, the whole platform costs roughly €1/hour.
AWS provides new accounts with $100 in credits, which gives you around 80 hours of free playground time per new account.
What you get (capabilities)
- EKS cluster with a dedicated GPU node for inference (e.g., g5.xlarge) and a CPU node for app/RAG jobs
- vLLM serving an OpenAI-compatible API at http://vllm.llasta.cluster.local:8000
- Chatbot (FastAPI backend + minimal HTML/JS) for quick, realistic testing
- RAG pipeline: data ingestion → FAISS wrapper service → RAG chatbot with tunables (top-k, context size)
- Observability: Prometheus metrics out of the box; optional Grafana Cloud dashboards; tracing with OpenTelemetry (Tempo)
- Automation: end-to-end Deploy.sh (Terraform + K8s manifests) and clean structure per component
The repository folders provide a step-by-step structure: 000-K8 deployment/, 001-Copy weights to ebs/, 002-vLLM deployment/, 003-chatbot/, 004-RAG/, 005-observability/. Each stage contains a Stage_readme.md.
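To make "OpenAI-compatible" concrete, here is a minimal Python sketch (an illustrative example, not code from the repository) that queries vLLM with the official openai client. The base URL is the in-cluster service name listed above; the model identifier is a placeholder, so check /v1/models for the exact name in your deployment.

from openai import OpenAI

# Inside the cluster, vLLM is reachable via its internal DNS name.
# From your laptop, point base_url at a port-forward instead (http://localhost:8000/v1).
client = OpenAI(
    base_url="http://vllm.llasta.cluster.local:8000/v1",
    api_key="not-needed",  # vLLM does not enforce an API key by default
)

# Placeholder model name: use whatever /v1/models returns in your setup.
response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Give me one design goal of LLASTAKS."}],
)
print(response.choices[0].message.content)

Because the API surface is the standard OpenAI one, any SDK or framework that speaks that protocol can be pointed at the cluster without code changes.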
Prerequisites
- AWS account with GPU quota for G-instances
- Local tools: awscli, kubectl, terraform, docker, python3
- Basic familiarity with Kubernetes contexts and port-forwarding
Cost note: GPU instances are not free; spin up only when needed and destroy when done.
Quickstart — from zero to working LLM
- Clone the repository and follow the Readme.md instructions.
This will:
- Provision EKS with the right node groups (GPU/CPU)
- Prepare EBS + PVCs for model weights and FAISS index
- Deploy vLLM and expose it at vllm.llasta.cluster.local:8000
- Verify the cluster and service
kubectl get nodes
kubectl -n llasta get pods,svc
- Test vLLM
kubectl -n llasta port-forward svc/vllm 8000:8000
curl -s http://localhost:8000/v1/models | jq
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-8b-…",
"messages": [{"role":"user","content":"Give me one design goal of LLASTAKS."}]
}' | jq
- Run the Chatbot
- In-cluster: kubectl apply -f 003-chatbot/chatbot.yaml, then port-forward svc/chatbot to http://localhost:8080/
- Local dev: install the dependencies from 003-chatbot/backend/requirements.txt and run the FastAPI backend
You'll find all the details in the Stage_Readme.md of each part: 000-K8 deployment/, 001-Copy weights to ebs/, 002-vLLM deployment/, 003-chatbot/
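For local development, the backend is essentially a thin proxy between the browser and the OpenAI-compatible endpoint. The snippet below is an illustrative sketch under that assumption, not the repository's actual backend: the /chat route, payload shape, and model name are invented for the example, and it expects vLLM to be port-forwarded to localhost:8000.

# Illustrative FastAPI backend: forwards a user message to vLLM and returns the reply.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # port-forwarded vLLM

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    payload = {
        "model": "qwen3-8b",  # placeholder: use the name returned by /v1/models
        "messages": [{"role": "user", "content": req.message}],
    }
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(VLLM_URL, json=payload)
        resp.raise_for_status()
    return {"reply": resp.json()["choices"][0]["message"]["content"]}

# Run locally with: uvicorn main:app --port 8080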
Stage RAG — Objective & Scope
Build a simple, reliable, low-cost RAG V1 on EKS with vLLM (Qwen3-8B INT4) for generation, FAISS for retrieval, and a reranker for quality. The goal is to understand end-to-end behavior—from ingest to retrieval—its limitations, and how the LLM’s reasoning complements retrieval.
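The query path of such a RAG V1 can be sketched as embed → FAISS search → rerank → generate. The code below is a conceptual sketch only: it assumes a local FAISS index and chunk list produced by an ingestion step (sketched further below), BGE-M3 loaded through sentence-transformers, a bge-reranker cross-encoder, and a port-forwarded vLLM; file and model names are illustrative.

# Conceptual RAG V1 query loop: embed -> FAISS search -> rerank -> generate.
# Model names, file names, and the prompt format are illustrative assumptions.
import json
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI

embedder = SentenceTransformer("BAAI/bge-m3")             # dense embeddings (CPU is fine)
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")        # cross-encoder for quality
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

index = faiss.read_index("faiss.index")                   # built at ingestion time
chunks = json.loads(Path("chunks.json").read_text())      # chunk texts aligned with index ids

def answer(question: str, top_k: int = 10, keep: int = 3) -> str:
    query_vec = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(query_vec, top_k)               # coarse retrieval
    candidates = [chunks[i] for i in ids[0]]
    scores = reranker.predict([(question, c) for c in candidates])
    best = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:keep]]
    context = "\n\n".join(best)
    resp = llm.chat.completions.create(
        model="qwen3-8b",                                  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

The top_k and keep parameters map to the tunables mentioned earlier (top-k and context size), which is where most of the quality/latency trade-off lives.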
What to do
- Deploy faiss-wrap + chatbot-RAG.
- Run ingest.py twice on the sample sets (Hard-to-read vs Easy-to-read) and reflect on outcomes.
- Ingest Easy-to-read into FAISS and test the RAG; compare with a large hosted LLM (e.g., GPT-5) to understand differences.
- Reset FAISS and ingest Build your own electric car; reflect again.
Follow 004-RAG/Stage_Readme.md to set up this part.
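For orientation, here is a deliberately simplified sketch of what an ingestion step does conceptually (chunk → embed → index). It is not the repository's ingest.py; the file names, naive paragraph chunking, and index type are assumptions.

# Simplified view of ingestion: chunk the documents, embed them, write a FAISS index.
import json
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-m3")

def ingest(text_files: list[str], out_dir: str = ".") -> None:
    chunks = []
    for path in text_files:
        text = Path(path).read_text()
        # naive paragraph chunking; see the chunking lesson further down for a better approach
        chunks.extend(p.strip() for p in text.split("\n\n") if p.strip())

    vectors = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])           # inner product == cosine on normalized vectors
    index.add(vectors)

    faiss.write_index(index, f"{out_dir}/faiss.index")
    Path(f"{out_dir}/chunks.json").write_text(json.dumps(chunks))

ingest(["easy_to_read_sample.txt"])                        # hypothetical input file

In this sketch, resetting FAISS before the next exercise would simply mean deleting (or rebuilding) faiss.index and chunks.json prior to the next ingestion run.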
005-Observability — simplified preview
Observability is a core pillar of LLASTAKS, but we’ll dive deep into it in an upcoming article. This section just gives a quick preview:
- Metrics: Prometheus collects system and app metrics (requests, latency, FAISS search time, GPU usage). You can plug it into Grafana Cloud later.
- Logs: Loki can centralize logs from all pods for debugging.
- Traces: Tempo and OpenTelemetry trace requests from chatbot → FAISS → vLLM.
The full setup, dashboards, and alerting best practices will be detailed in Part 5 — Observability and Debugging GenAI Apps.
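As a tiny code-level preview of the metrics side, here is a hedged sketch of how an app such as the chatbot could expose a request counter and a latency histogram with prometheus_client; the metric names and port are made up for the example and are not necessarily what the repo uses.

# Illustrative app-side metrics with prometheus_client; metric names are invented.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("chatbot_requests_total", "Chat requests handled")
LATENCY = Histogram("chatbot_request_seconds", "End-to-end chat latency in seconds")

def handle_chat(message: str) -> str:
    REQUESTS.inc()
    with LATENCY.time():
        # ... call the FAISS wrapper and vLLM here ...
        time.sleep(0.1)                                   # stand-in for real work
        return "reply"

if __name__ == "__main__":
    start_http_server(9000)                               # Prometheus scrapes http://<pod>:9000/metrics
    while True:
        handle_chat("ping")
        time.sleep(1)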
Lessons learned
- LLMs are GPU-hungry beasts and quantization matters: using an INT4-quantized Qwen3-8B made the model fit in GPU memory and start reliably on smaller accelerators, while keeping reasonable quality. The KV cache can still dominate; adjust --max-model-len and GPU memory utilization accordingly.
- RAG ≠ math: vector search over PDFs full of numbers underperforms on analytics questions. For structured data (e.g., transactions), prefer tool use / function calling against a SQL/Parquet store (PAL-style workflows) and let the LLM plan and format the queries.
- Chunking quality drives retrieval: page-sized chunks are too coarse. Favor semantic chunking (by section/paragraph), enrich chunks with metadata (chapter/section/type), and track sources precisely (see the sketch after this list).
- IaC keeps you sane: Terraform is a great tool to quickly spin up and clean up this platform.
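To make the chunking point concrete, here is a small sketch of section-level chunking with metadata. The heading heuristic and metadata fields are assumptions for illustration, not the implementation used in the repository.

# Sketch of semantic chunking: split on section headings and attach metadata for source tracking.
import re

def chunk_by_section(text: str, source: str) -> list[dict]:
    chunks, current_title, current_lines = [], "Introduction", []

    def flush():
        if current_lines:
            chunks.append({
                "text": "\n".join(current_lines).strip(),
                "metadata": {"source": source, "section": current_title, "type": "body"},
            })

    for line in text.splitlines():
        # crude heading detection: markdown headings or numbered section titles
        if re.match(r"^(#{1,3} |\d+(\.\d+)*\s+[A-Z])", line):
            flush()
            current_title, current_lines = line.strip(), []
        else:
            current_lines.append(line)
    flush()
    return chunks

Attaching source/section/type at ingestion time is what later lets the RAG chatbot cite precisely where an answer came from.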
The current stack
- Model server: vLLM (OpenAI-compatible API)
- Base model: Qwen3-8B (INT4 variant for memory efficiency)
- Embedding: BGE-M3 (CPU OK)
- Vector DB: FAISS (local index on PVC/EBS via wrapper service)
- Apps: Chatbot (acts as the orchestrator)
- Observability: Prometheus, Grafana Cloud, Loki, Tempo
Conclusion & Next Steps
LLASTAKS isn’t meant to be a polished production system—it’s a living playground. Its purpose is to let you experience how a real GenAI app behaves: from the model serving layer to retrieval and orchestration, all the way to the metrics that reveal where the bottlenecks lie. By deploying it yourself, you start to understand how LLMs, retrieval systems, and observability tools fit together in a practical, reproducible stack.
Running it on EKS gives you full control: every network rule, storage choice, and node behavior is visible and adjustable. You can swap any component—vLLM for another model server, FAISS for another vector DB, or Prometheus for another observability tool—and see instantly how your stack evolves. And if you don’t have GPUs handy, smaller or quantized models can make experimentation affordable, while local tools like Ollama keep things simple.
As you move forward, remember that LLASTAKS is just the foundation. Part 5 of this series will dive deeper into observability—showing how to trace requests, analyze performance, and optimize the LLM–RAG loop in real time. Until then, treat this as your sandbox: break it, fix it, and learn from it.
When you’re done exploring, destroy your EKS cluster to avoid unnecessary GPU costs, and keep snapshots only if you plan to reuse weights. And finally, if this project sparks ideas or helps you in your journey, ⭐ star the repository and share your feedback—every insight helps improve the stack for the community.