GenAI App Architecture Explained (Part 5: Full-Stack Observability with LLASTAKS)
From Kubernetes Health to End-to-End RAG Tracing with Grafana Cloud.
| end to end tracing |
From "It's Built" to "It Works"
In this series, we've journeyed from the high-level architecture of GenAI apps to a fully deployed RAG stack on Kubernetes with LLASTAKS. Everything so far has been about building it.
But now, the real Ops work begins.
A RAG application is a complex, distributed system. If a user says "the chatbot is slow," what does that mean? Is it the Kubernetes cluster? The FAISS search? The vLLM token generation?
Without data, you're just guessing.
In this article, we'll implement the 005-observability stage of LLASTAKS. Our goal is to get full visibility across three critical layers:
Infrastructure (Kubernetes): The health of our cluster, nodes, and pods.
Services: Key metrics for each service, such as vLLM latency and FAISS search counts.
Application: End-to-end traceability, from the user's request to the final response.
To do this, we will connect our EKS cluster to Grafana Cloud and its LGTM stack (Loki, Grafana, Tempo, Mimir).
1. The Foundation: Infrastructure Metrics (Kubernetes)
Using Terraform, LLASTAKS deploys the Grafana Alloy Agents as a DaemonSet (one per node) on our cluster. These agents immediately begin collecting and sending basic Kubernetes metrics to Mimir, which is our Prometheus-compatible metrics storage.
In just a few minutes, we get this "Kubernetes Overview" dashboard.
We can see the health of our nodes, CPU and memory usage, and pod status at a glance. If a vllm pod is in a CrashLoopBackOff state due to a GPU memory shortage, this is the first place we'll see it.
2. The Core: Service Metrics
Knowing the cluster is healthy is good. Knowing what our application services are doing is better.
The next step is to collect custom metrics for each microservice in our LLM App stack. The Alloy agents are configured to automatically "scrape" any pod that exposes a /metrics endpoint and has the correct annotation (e.g., k8s.grafana.com/scrape: "true").
We have instrumented all our services to expose these critical metrics:
vLLM: Exposes vllm_request_duration_seconds, vllm_tokens_generated_total, and vllm_queue_size.
FAISS-wrap: Exposes faiss_wrap_request_latency_seconds, faiss_wrap_search_total, and faiss_wrap_index_size.
chatbot-rag (The Orchestrator): Exposes chatbot_requests_total, chatbot_vllm_requests_total (to track calls to the LLM), and chatbot_rag_chunks_sent (to see how much context we are sending).
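To make this concrete, here is a minimal sketch of how a service like faiss-wrap can expose these metrics with the Python prometheus_client library. The metric names match the list above; the handler and port are illustrative assumptions, not the exact LLASTAKS code.

```python
# Minimal sketch (assumes the service is Python and uses prometheus_client).
# Metric names mirror the list above; the handler and port are hypothetical.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "faiss_wrap_request_latency_seconds", "Latency of /search requests in seconds"
)
SEARCH_TOTAL = Counter("faiss_wrap_search_total", "Total vector searches served")
INDEX_SIZE = Gauge("faiss_wrap_index_size", "Number of vectors in the FAISS index")


def handle_search(query: str) -> list:
    """Hypothetical /search handler wrapped with the metrics above."""
    SEARCH_TOTAL.inc()
    with REQUEST_LATENCY.time():
        # ... embed the query and run the FAISS search here ...
        return []


if __name__ == "__main__":
    # Expose /metrics on port 8000; the pod also carries the
    # k8s.grafana.com/scrape: "true" annotation so Alloy picks it up.
    start_http_server(8000)
    INDEX_SIZE.set(0)
    while True:
        time.sleep(60)
```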
By placing these key metrics on a single dashboard, we get a "single pane of glass" for our entire application.
Now, we can correlate events. If we see a latency spike on chatbot_request_duration_seconds, we can instantly compare it to vllm_request_duration_seconds and faiss_wrap_request_latency_seconds to find the culprit.
3. The "Holy Grail": Tracing a RAG Query End-to-End
Metrics tell us what is slow. Traces tell us why.
This is the most critical part of GenAI observability. Using OpenTelemetry, we instrumented the code for our chatbot-rag and faiss-wrap services. Every user request now generates a trace that is sent to Grafana Tempo.
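If you want to reproduce this, the setup boils down to a few lines per service. Here is a hedged sketch of the Python OpenTelemetry boilerplate; the OTLP endpoint and service name are assumptions, not the exact LLASTAKS configuration.

```python
# OpenTelemetry setup sketch (assumes Python services shipping traces over OTLP
# to a local Alloy agent, which forwards them to Grafana Tempo).
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "chatbot-rag"})  # one name per service
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://alloy:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# Every module can now grab a tracer and create spans.
tracer = trace.get_tracer("chatbot-rag")
```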
| chatbot-rag POST /api/chat |
| Log view for each trace |
| UX POV |
Let's look at the anatomy of a single RAG query in Grafana.
What we see is the complete "waterfall" of the request, showing how one operation leads to the next:
POST /api/chat (Parent Span): This is the total time the user experienced for the entire request.
rag.retrieve_context: This is a manual span I added, which in turn calls the next span.
POST /search (HTTP): This span shows the orchestrator's call to the FAISS service.
faiss.search: This is another manual span, which contains the following child spans.
faiss.embed_query and faiss.index_search: These spans clearly separate the time spent creating the query embedding from the time spent on the pure vector search.
vllm.generate: Back in the orchestrator, this span measures the HTTP call to vLLM.
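To make that waterfall concrete, here is a hedged sketch of the orchestrator side. Only the span names come from the trace above; the URLs, payloads, and the build_prompt helper are illustrative assumptions.

```python
# Sketch of the chatbot-rag spans (span names from the trace above; URLs,
# payloads, and build_prompt are hypothetical).
import requests
from opentelemetry import trace

tracer = trace.get_tracer("chatbot-rag")


def answer(query: str, build_prompt) -> str:
    # The parent span (POST /api/chat) comes from the HTTP framework
    # instrumentation; the spans below are the manual ones.
    with tracer.start_as_current_span("rag.retrieve_context"):
        # The HTTP client instrumentation emits the "POST /search" child span.
        resp = requests.post("http://faiss-wrap/search", json={"query": query})
        chunks = resp.json()["chunks"]

    with tracer.start_as_current_span("vllm.generate"):
        # Simplified call to the vLLM OpenAI-compatible endpoint.
        resp = requests.post(
            "http://vllm/v1/completions",
            json={"prompt": build_prompt(query, chunks)},
        )
        return resp.json()["choices"][0]["text"]
```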
The real magic, however, lies in the span attributes. By clicking on any span, I can see rich business context:
On the chatbot-rag span, I can see attributes like rag.query, rag.chunks_sent, and llm.tokens_prompt.
On the faiss-wrap span, I can see faiss.results_count and faiss.score_avg.
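Here is the matching sketch for the faiss-wrap side, showing how those child spans and attributes get recorded. The span and attribute names come from above; the embedding function and FAISS index are placeholders you would wire up yourself.

```python
# Sketch of the faiss-wrap spans and attributes (span/attribute names from the
# article; embed_fn and index stand in for the real embedder and FAISS index).
from opentelemetry import trace

tracer = trace.get_tracer("faiss-wrap")


def search(query: str, embed_fn, index, k: int = 4):
    with tracer.start_as_current_span("faiss.search") as span:
        with tracer.start_as_current_span("faiss.embed_query"):
            vector = embed_fn(query)  # (1, dim) numpy array

        with tracer.start_as_current_span("faiss.index_search"):
            scores, ids = index.search(vector, k)  # standard FAISS search API

        span.set_attribute("faiss.results_count", len(ids[0]))
        span.set_attribute("faiss.score_avg", float(scores[0].mean()))
        return ids[0], scores[0]
```

The chatbot-rag attributes (rag.query, rag.chunks_sent, llm.tokens_prompt) are set the same way, with set_attribute calls on the orchestrator's spans.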
I can now answer complex questions. "Is this request slow because we are sending too many chunks (rag.chunks_sent)?" or "Is the relevance poor because the faiss.score_avg is low?"
We are no longer just debugging performance; we are debugging the effectiveness of our RAG pipeline.
Conclusion: From Black Box to Glass Box
In this article, we added full-stack observability to our LLASTAKS deployment: Kubernetes health in Mimir, service metrics on a single dashboard, and end-to-end RAG traces in Tempo. The application went from a black box to a glass box.
Observability is not an "add-on" for GenAI applications; it is a fundamental requirement. Without it, you are flying blind.
Feel free to implement it on your own; it's a lot of fun!
You can find all the Terraform code, annotations, and OpenTelemetry instrumentation in the LLASTAKS repository.