GenAI App Architecture Explained (Part 5: Full-Stack Observability with LLASTAKS)

 From Kubernetes Health to End-to-End RAG Tracing with Grafana Cloud.


From "It's Built" to "It Works"

In this series, we've journeyed from the high-level architecture of GenAI apps (Part 1) and their reliability components (Part 2), down to the hardware that powers them (Part 3).

In Part 4, we finally got our hands dirty by deploying LLASTAKS, our complete GenAI playground on Kubernetes (EKS). Our stack is running, complete with vLLM, a FAISS vector store, and a RAG chatbot.

But now, the real Ops work begins.

A RAG application is a complex, distributed system. If a user says "the chatbot is slow," what does that mean? Is it the Kubernetes cluster? The FAISS search? The vLLM token generation?

Without data, you're just guessing.

In this article, we'll implement the 005-observability stage of LLASTAKS. Our goal is to get full visibility across three critical layers:

  1. Infrastructure (Kubernetes): The health of our cluster, nodes, and pods.

  2. Services: Key metrics for each service, such as vLLM latency and FAISS search counts.

  3. Application: End-to-end traceability, from the user's request to the final response.

To do this, we will connect our EKS cluster to Grafana Cloud and its LGTM stack (Loki, Grafana, Tempo, Mimir).


1. The Foundation: Infrastructure Metrics (Kubernetes)

Kubernetes Overview dashboard

Before blaming the AI, we must check the plumbing.

Using Terraform, LLASTAKS deploys Grafana Alloy as a DaemonSet (one agent per node) on our cluster. These agents immediately begin collecting basic Kubernetes metrics and sending them to Mimir, our Prometheus-compatible metrics store.

In just a few minutes, we get this "Kubernetes Overview" dashboard.

We can see the health of our nodes, CPU and memory usage, and pod status at a glance. If a vllm pod is in a CrashLoopBackOff state due to a GPU memory shortage, this is the first place we'll see it.


2. The Core: Service Metrics

Chatbot metrics
FAISS metrics
vLLM metrics

Knowing the cluster is healthy is good. Knowing what our application services are doing is better.

The next step is to collect custom metrics for each microservice in our LLM App stack. The Alloy agents are configured to automatically "scrape" any pod that exposes a /metrics endpoint and has the correct annotation (e.g., k8s.grafana.com/scrape: "true").

We have instrumented all our services to expose these critical metrics (a minimal instrumentation sketch follows this list):

  • vLLM: Exposes vllm_request_duration_seconds, vllm_tokens_generated_total, and vllm_queue_size.

  • FAISS-wrap: Exposes faiss_wrap_request_latency_seconds, faiss_wrap_search_total, and faiss_wrap_index_size.

  • chatbot-rag (The Orchestrator): Exposes chatbot_requests_total, chatbot_request_duration_seconds, chatbot_vllm_requests_total (to track calls to the LLM), and chatbot_rag_chunks_sent (to see how much context we are sending).
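
To make this concrete, here is a minimal sketch of what the faiss-wrap instrumentation could look like with the Python prometheus_client library. The metric types, the handler name, and the port are assumptions made for illustration; the actual LLASTAKS code may differ.

    # Illustrative sketch: exposing faiss-wrap-style metrics with prometheus_client.
    # Metric types, the handler shape, and the port are assumptions, not the real code.
    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    SEARCHES = Counter("faiss_wrap_search_total", "Total vector searches served")
    LATENCY = Histogram("faiss_wrap_request_latency_seconds", "Search request latency in seconds")
    INDEX_SIZE = Gauge("faiss_wrap_index_size", "Number of vectors currently in the FAISS index")

    def handle_search(index, query_vector, k: int = 4):
        # query_vector is expected to be a (1, d) float32 numpy array.
        SEARCHES.inc()
        with LATENCY.time():              # records the elapsed time when the block exits
            scores, ids = index.search(query_vector, k)
        INDEX_SIZE.set(index.ntotal)      # FAISS exposes the current vector count as .ntotal
        return scores, ids

    # Serve /metrics so the Alloy agent can scrape this pod (port is an assumption).
    start_http_server(9100)

With the k8s.grafana.com/scrape: "true" annotation on the pod, Alloy discovers this endpoint automatically and ships the metrics to Mimir.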

By placing these key metrics on a single dashboard, we get a "single pane of glass" for our entire application.

Now, we can correlate events. If we see a latency spike on chatbot_request_duration_seconds, we can instantly compare it to vllm_request_duration_seconds and faiss_wrap_request_latency_seconds to find the culprit.


3. The "Holy Grail": Tracing a RAG Query End-to-End

Metrics tell us what is slow. Traces tell us why.

This is the most critical part of GenAI observability. Using OpenTelemetry, we instrumented the code for our chatbot-rag and faiss-wrap services. Every user request now generates a trace that is sent to Grafana Tempo.
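
As an illustration, here is a stripped-down sketch of what the manual spans in chatbot-rag might look like with the OpenTelemetry Python SDK. The in-cluster service URLs, the JSON payload shapes, and the token-count heuristic are assumptions for the example, and the exporter setup (OTLP towards the Alloy agent, which forwards traces to Tempo) is assumed to be configured at startup.

    # Illustrative sketch of manual OpenTelemetry spans in the chatbot-rag orchestrator.
    # Assumes the tracer provider and OTLP exporter (Alloy -> Tempo) are configured at startup.
    import requests
    from opentelemetry import trace

    tracer = trace.get_tracer("chatbot-rag")

    # In-cluster service URLs (illustrative assumptions).
    FAISS_URL = "http://faiss-wrap:8080/search"
    VLLM_URL = "http://vllm:8000/v1/completions"

    def handle_chat(query: str) -> str:
        # Parent span: the total time the user experiences for POST /api/chat.
        with tracer.start_as_current_span("POST /api/chat") as root:
            root.set_attribute("rag.query", query)

            # Manual span around the retrieval step; the HTTP call inside it shows up
            # as a child "POST /search" span if the requests library is auto-instrumented.
            with tracer.start_as_current_span("rag.retrieve_context") as span:
                chunks = requests.post(FAISS_URL, json={"query": query, "k": 4}).json()["chunks"]
                span.set_attribute("rag.chunks_sent", len(chunks))

            # Manual span around the LLM call.
            with tracer.start_as_current_span("vllm.generate") as span:
                prompt = "\n".join(chunks) + "\n\nQuestion: " + query
                span.set_attribute("llm.tokens_prompt", len(prompt.split()))  # rough word-count proxy
                resp = requests.post(VLLM_URL, json={"prompt": prompt, "max_tokens": 256})
                return resp.json()["choices"][0]["text"]

In practice the parent span would typically come from the web framework's auto-instrumentation rather than being created by hand, but the nesting shown here is what produces the waterfall below.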

chatbot-rag POST /api/chat
Log view for each trace
UX POV

Let's look at the anatomy of a single RAG query in Grafana.

What we see is the complete "waterfall" of the request, showing how one operation leads to the next:

  1. POST /api/chat (Parent Span): This is the total time the user experienced for the entire request.

  2. rag.retrieve_context: This is a manual span we added; the work inside it triggers the next span.

  3. POST /search (HTTP): This span shows the orchestrator's call to the FAISS service.

  4. faiss.search: This is another manual span, which contains the following child spans.

  5. faiss.embed_query and faiss.index_search: These spans clearly separate the time spent creating the query embedding from the time spent on the pure vector search.

  6. vllm.generate: Back in the orchestrator, this span measures the HTTP call to vLLM.

The real magic, however, lies in the span attributes. By clicking on any span, we can see rich business context:

  • On the chatbot-rag span, we can see attributes like rag.query, rag.chunks_sent, and llm.tokens_prompt.

  • On the faiss-wrap span, we can see faiss.results_count and faiss.score_avg.

We can now answer complex questions: "Is this request slow because we are sending too many chunks (rag.chunks_sent)?" or "Is the relevance poor because the faiss.score_avg is low?"

We are no longer just debugging performance; we are debugging the effectiveness of our RAG pipeline.
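
For reference, here is a minimal sketch of how faiss-wrap's nested spans and attributes could be produced with the OpenTelemetry Python SDK; the embedding call (embed_model.encode), the function signature, and the way the average score is computed are assumptions for illustration, not the actual LLASTAKS implementation.

    # Illustrative sketch of the faiss-wrap search handler with nested spans.
    # The embedding model API and the score semantics are assumptions.
    import numpy as np
    from opentelemetry import trace

    tracer = trace.get_tracer("faiss-wrap")

    def search(index, embed_model, query: str, k: int = 4):
        # Parent span for the whole search operation.
        with tracer.start_as_current_span("faiss.search") as span:
            # Child span: time spent turning the query into an embedding.
            with tracer.start_as_current_span("faiss.embed_query"):
                vector = np.asarray([embed_model.encode(query)], dtype="float32")

            # Child span: time spent on the pure vector search.
            with tracer.start_as_current_span("faiss.index_search"):
                scores, ids = index.search(vector, k)

            # Business context recorded as span attributes.
            span.set_attribute("faiss.results_count", int(len(ids[0])))
            span.set_attribute("faiss.score_avg", float(scores[0].mean()))
            return ids[0].tolist(), scores[0].tolist()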


Conclusion: From Black Box to Glass Box

In Part 4, we built the LLASTAKS playground. In this post, we've lit it up.

Observability is not an "add-on" for GenAI applications; it is a fundamental requirement. Without it, you are flying blind.

Feel free to implement it on your own; it's a lot of fun!

You can find all the Terraform code, annotations, and OpenTelemetry instrumentation in the LLASTAKS GitHub repository.
