Posts

GenAI App Architecture Explained (Part 5: Full-Stack Observability with LLASTAKS)

From Kubernetes Health to End-to-End RAG Tracing with Grafana Cloud. From "It's Built" to "It Works": In this series, we've journeyed from the high-level architecture of GenAI apps (Part 1) and their reliability components (Part 2), down to the hardware that powers them (Part 3). In Part 4, we finally got our hands dirty by deploying LLASTAKS, our complete GenAI playground on Kubernetes (EKS). Our stack is running, complete with vLLM, a FAISS vector store, and a RAG chatbot. But now, the real Ops work begins. A RAG application is a complex, distributed system. If a user says "the chatbot is slow," what does that mean? Is it the Kubernetes cluster? The FAISS search? The vLLM token generation? Without data, you're just guessing. In this article, we'll implement the 005-observability stage of LLASTAKS. Our goal is to get full visibility across three critical layers: Infrastructure (Kubernetes): The health of our c...
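The excerpt only names the layers to be traced; as a rough illustration of what end-to-end tracing of the RAG path can look like (a sketch, not the article's actual 005-observability code), each stage of a request can be wrapped in an OpenTelemetry span so Grafana can show where a slow request spends its time. The span names and the retrieve_chunks / generate_answer helpers below are placeholders:

```python
# Hypothetical sketch: wrapping the two RAG stages in OpenTelemetry spans so a
# tracing backend (e.g. Grafana Cloud / Tempo) can break down request latency.
# Assumes the opentelemetry-api/sdk packages and an exporter are already configured;
# retrieve_chunks() and generate_answer() are placeholder names, not LLASTAKS functions.
from opentelemetry import trace

tracer = trace.get_tracer("rag-chatbot")

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("rag.request") as request_span:
        request_span.set_attribute("rag.question_length", len(question))

        # Stage 1: vector search in FAISS
        with tracer.start_as_current_span("rag.retrieval"):
            chunks = retrieve_chunks(question)          # placeholder

        # Stage 2: token generation via the vLLM endpoint
        with tracer.start_as_current_span("rag.generation") as gen_span:
            answer = generate_answer(question, chunks)  # placeholder
            gen_span.set_attribute("rag.num_chunks", len(chunks))

        return answer
```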

GenAI App Architecture Explained (Part 4: LLASTAKS — A Full LLM App Playground on Kubernetes)

A practical guide to spin up a complete, test-friendly GenAI stack (LLM + RAG + app + observability) on AWS EKS. TL;DR: LLASTAKS (LLM App STAck on Kubernetes as a Service) is a reproducible, low-touch blueprint to deploy an end-to-end GenAI playground — not just a model endpoint. In ~one command you provision AWS EKS with a GPU node, run vLLM behind an OpenAI-compatible API, add a chatbot frontend, a RAG pipeline (ingestion → FAISS search → RAG chatbot), and wire observability (metrics/logs/traces). Use it to prototype features (e.g., function calling, LoRA fine-tuning) and to learn real-world constraints before building production systems. Overview of LLASTAKS. Why this project? Most “playgrounds” stop at the LLM API. Real applications need more: storage for weights, a serving layer, an app surface, retrieval, orchestration, and instrumentation. LLASTAKS gives you the whole app loop so you can: Experiment quickly with local + cluster deployments; Validate architectura...
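For a sense of what "vLLM behind an OpenAI-compatible API" means in practice, here is a minimal sketch of calling the in-cluster endpoint with the standard openai client; the service URL and model name are assumptions, not the actual LLASTAKS values:

```python
# Minimal sketch (assumed values): talking to the vLLM OpenAI-compatible server
# from inside the cluster using the regular openai Python client.
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm.llastaks.svc.cluster.local:8000/v1",  # assumed in-cluster service DNS
    api_key="not-needed",  # vLLM does not require a real key unless one is configured
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # whichever model the cluster actually serves
    messages=[{"role": "user", "content": "Summarize what LLASTAKS deploys."}],
)
print(response.choices[0].message.content)
```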

GenAI App Architecture Explained (Part 3: The Hardware stack)

In Part 1 and Part 2 of this series, we explored the high-level components of a modern GenAI application, from the user interface to the RAG pipeline. Now, let's get down to the metal. What type of resources do we need to make it all run? What Resources Do We Need? Like any application, LLMs rely on fundamental hardware resources to function: compute, memory, storage, and network. And sometimes, they need a lot of them. Compute: LLM architectures, particularly Transformers, are heavy consumers of compute and memory. This is inherent to their design, which must perform massive numbers of calculations to predict subsequent tokens. Think about it: each new token generated is the result of a complex calculation. All tokens from the prompt are turned into mathematical representations (vectors) to understand the intricate relationships between them. Based on this analysis, the model performs billions of calculations to predict the most likely next toke...
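As a rough back-of-envelope illustration (not taken from the article), the memory needed just to hold a model's weights scales with parameter count and numeric precision:

```python
# Back-of-envelope sketch: GPU memory to hold a model's weights alone,
# ignoring the KV cache, activations, and serving-framework overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp32", "fp16", "int8"):
    print(f"7B model @ {precision}: ~{weight_memory_gb(7e9, precision):.0f} GB")
# Prints roughly 28 GB (fp32), 14 GB (fp16), 7 GB (int8) for the weights alone.
```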