GenAI App Architecture Explained (Part 1: The Big Picture)

 The first time you interact with an app like ChatGPT, all you see is the chat box. It feels magical — you type, it answers. But for anyone in Ops or engineering, one question immediately comes up: what’s really happening behind the scenes?

At the heart of every GenAI app lies an LLM (Large Language Model), but the LLM alone is just one piece of the puzzle. Around it sits an entire stack: orchestration layers, data pipelines, embeddings, vector databases, plugins, caches, observability tools, and guardrails. Together, they transform a “highly clever text generator” into a production-grade GenAI application.
In this post, we’ll start with the big picture: the main components and flows that make an app like ChatGPT work. In the next parts, we’ll drill down into each layer (RAG, Ops monitoring, validation, etc.), highlighting both how they work and what usually breaks in production.


Picture of what it feels like the first time we interact with an LLM:


Which reminds me of this famous cartoon:

In reality, it’s closer to something like this:

Credits to a16z for this schema, slightly modified.


It’s a bit more complicated, isn’t it?
And that’s what makes it interesting.

The first thing we see is that the LLMs (yes, there are many) are only one component of a bigger architecture that sits between the user and the models.
Incredible companies create these LLMs, which open up many new possibilities, but that’s only the beginning of the story: there are plenty of components to build around an LLM to turn it into a GenAI app.

The importance of orchestration in a GenAI app

Let’s start from the user:


The user connects through the front end, typically a simple chat interface where we can type “give me a good recipe with broccoli”. This query is sent to the orchestrator. The orchestrator can be written in Python and, basically, orchestrates all the components needed to answer the user query. We will see which components the orchestrator connects with, but let’s keep it basic for now: the only interactions orchestrated here are those with the front end and the LLM. The orchestrator adds a system prompt to the user query, which instructs the LLM what we expect from it, how to answer, etc. A simple system prompt can be “you are a very useful assistant, be sharp and a bit provocative”. So the orchestrator sends the user query (also called the user prompt) plus the system prompt. The LLM processes it and generates an output. This output may be “There’s no such thing as a good recipe with broccoli” (it’s being provocative… :)). The orchestrator receives this output and sends it to the front end for the user to read. Voila.
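To make that flow concrete, here is a minimal sketch of such an orchestrator in Python. Everything in it is illustrative: `call_llm` is a placeholder for whatever provider SDK or self-hosted endpoint you actually use, and the prompts are simply the ones from the example above.

```python
# Minimal orchestrator sketch (illustrative only).
# `call_llm` is a placeholder for your provider SDK or self-hosted endpoint.

SYSTEM_PROMPT = "You are a very useful assistant, be sharp and a bit provocative."

def call_llm(messages: list[dict]) -> str:
    """Placeholder: send the message list to the LLM and return its text output."""
    raise NotImplementedError("plug in your provider SDK here")

def handle_user_query(user_query: str) -> str:
    # The orchestrator wraps the user prompt with the system prompt...
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
    # ...sends everything to the LLM...
    answer = call_llm(messages)
    # ...and returns the output to the front end.
    return answer

# handle_user_query("give me a good recipe with broccoli")
```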

While I described a simple flow here, there are actually a number of components not shown in the above diagram that need to be taken into account: conversation history is one of them. LLMs are stateless between calls; they don’t remember previous turns. It’s the orchestrator’s job to simulate continuity by feeding the conversation history back with each request. Without this, every user message would feel like a fresh session.
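A hedged sketch of how that continuity can be simulated, reusing the `call_llm` placeholder from the previous snippet: the orchestrator keeps the full message list and replays it on every turn.

```python
# Illustrative only: the orchestrator stores every turn and replays the
# whole history on each call, which is what creates the feeling of memory.

class Conversation:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_query: str) -> str:
        self.messages.append({"role": "user", "content": user_query})
        answer = call_llm(self.messages)  # same placeholder as in the previous sketch
        self.messages.append({"role": "assistant", "content": answer})
        return answer
```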

From an architecture point of view, the orchestrator is the control tower: it decides what gets passed to the LLM, how prompts are structured, and how context is preserved.

For the sake of clarity, I will not go into other components such as user authentication, SSL management, load balancing, etc., which are common app components.


Enrich the context with the RAG components



Let’s say your company, a bank, wants to build a chatbot to boost the productivity of its financial advisors. This chatbot will know everything about the company’s products and procedures. To achieve this, the chatbot will often use a RAG mechanism (Retrieval-Augmented Generation). Let’s go into the details.

The financial advisor writes: “I want to know how to close a life insurance account for user xxx”. The orchestrator takes the query, asks the document database which chunks of which documents relate to it, and adds those chunks to the user query along with the system prompt we mentioned earlier. The LLM then processes all this information (system prompt + user prompt + chunks) and answers something like “according to document yyy, the procedure to close the life insurance of user xxx is blablabla”.
That’s the overall plan but it takes a bit of plumbing to get there and, most importantly, to get there right. As we all know, LLMs tend to hallucinate and are not deterministic so it’s important to secure all parts of the value chain to increase the reliability of the final output.
So how do we actually build this document database? It’s not a typical SQL or NoSQL database: in the NLP world, classic databases are not the right vessels for accurately finding pieces of information related to one another. The texts need to be turned into mathematical representations called vectors, which make it much easier for a computer to compute relationships between different pieces of text.
So, in this schema, we have:
  • Data pipelines: take unstructured data such as PDFs, FAQs, etc., and turn them into meaningful chunks of text.
  • Embedding model: turns these chunks into vectors, also called embeddings. The embeddings are then sent to the vector database.
  • Vector database: stores, updates, and deletes the vectors, and searches for related content through math such as cosine similarity (see the sketch just after this list).
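Here is a rough sketch of that indexing side, under strong simplifying assumptions: `embed` stands in for a real embedding model, chunking is naive fixed-size splitting, and the “vector database” is just an in-memory Python list.

```python
# Indexing-side sketch (illustrative only): chunk documents, embed the chunks,
# store the vectors. Real systems use a proper embedding model and vector DB.

def embed(text: str) -> list[float]:
    """Placeholder: call a real embedding model here (API or local)."""
    raise NotImplementedError

def chunk(document: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; production pipelines split on structure
    # (sections, paragraphs) and usually add overlap between chunks.
    return [document[i:i + size] for i in range(0, len(document), size)]

vector_store: list[tuple[list[float], str]] = []  # (embedding, chunk text)

def index_document(document: str) -> None:
    for piece in chunk(document):
        vector_store.append((embed(piece), piece))
```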

All of this data management chain can be triggered every hour, day, or week, or at specific points in time. It depends on your needs.

Once the data is in the vector database, we have this flow where the orchestrator queries the vector database with the user query, receives the most relevant chunks… and we are back to the flow I mentioned earlier with the financial advisor’s request.
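Continuing the same hypothetical sketch (same `embed`, `vector_store`, `SYSTEM_PROMPT`, and `call_llm` placeholders as above), the query-time side looks roughly like this:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Embed the user query and rank the stored chunks by similarity.
    q = embed(query)
    ranked = sorted(vector_store, key=lambda item: cosine_similarity(q, item[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]

def answer_with_rag(user_query: str) -> str:
    # Stuff the retrieved chunks into the prompt alongside the question.
    context = "\n\n".join(retrieve(user_query))
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"},
    ]
    return call_llm(messages)
```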

These components are not trivial at all. There are a lot of things to take into account to increase relevancy, security, and accuracy. The data pipeline is key to properly turning the raw data into a usable format, and the vector database can combine multiple search strategies (semantic, keyword, etc.) to increase accuracy. The orchestrator can also rewrite the user query to improve the relevance of the retrieved data. There’s quite some work to get this right, and it usually takes multiple iterations, testing, and failure cycles before the system reaches acceptable reliability.
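As one illustration of that query rewriting step, the orchestrator can make an extra LLM call before retrieval. This is a sketch with an invented prompt wording, not a recommended recipe, and it reuses the `call_llm` and `retrieve` placeholders from the earlier snippets.

```python
# Illustrative query rewriting before retrieval (prompt wording is invented).

REWRITE_PROMPT = (
    "Rewrite the following user question as a short, self-contained search query. "
    "Keep product names and identifiers intact.\n\nQuestion: {question}"
)

def rewrite_query(user_query: str) -> str:
    messages = [{"role": "user", "content": REWRITE_PROMPT.format(question=user_query)}]
    return call_llm(messages)  # same placeholder as before

# retrieve(rewrite_query("I want to know how to close a life insurance account for user xxx"))
```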

From an architecture perspective, this is often the most fragile step — pipelines break, embeddings drift, and search accuracy degrades without proper evaluations and monitoring.


Example of an overall RAG flow

Tools and actions: the legs and arms of an LLM


LLMs are well known for their ability to generate text and their reasoning power but… can they act? Yes, but it takes a bit of plumbing to make it happen. Let’s see how.

There are many reasons why we would like an LLM to take actions. For example, if you ask “what was the result of last week’s swimming world championship?”, your LLM can’t answer without taking the action of going to the Internet to find this information. One way to do it is to use an external service named SERP (Search Engine Results Page) to retrieve the information. A service accessible to an LLM is called a tool. The use of this tool is called an action.
Another example: when the user asks “calculate the number of hours before my daughter's birthday, which is the 27th of May”, it’s a tough question. First, the LLM doesn’t know today’s date unless we tell it. Second, it needs to calculate — something LLMs are notoriously bad at. The right solution would be to write the needed code and execute it (we could also use an external service dedicated to math). Where will the code run? Well, that’s another action.
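For illustration, the small script an LLM might generate for that birthday question could look like this (the year is an assumption; the current time comes from whatever machine executes the code, since the model itself doesn’t know it):

```python
from datetime import datetime

def hours_until(target: datetime) -> float:
    # Difference between the target date and "now" on the executing machine.
    return (target - datetime.now()).total_seconds() / 3600

birthday = datetime(2026, 5, 27)  # the next 27th of May (illustrative year)
print(f"About {hours_until(birthday):.0f} hours until the birthday")
```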

Screenshot of GPT-5 using a tool in the background


So, there are a lot of different actions that an LLM needs to be able to perform. Let’s see how it works.


The user asks “what was the result of last week’s swimming world championship?”. The orchestrator sends the user query + system prompt to the LLM, and also tells the LLM which tools (such as SERP) it can use. The LLM generates an answer saying “I need to send this request to the SERP tool”. The orchestrator calls the SERP API with the LLM’s query, which contains the user query, potentially rewritten to be more effective for the search. The SERP solution crawls the web and returns the results to the orchestrator. The orchestrator sends this information (and the conversation history) to the LLM. Based on this data, the LLM generates the answer “Leon Marchand won last week’s swimming world championship”. The orchestrator sends this answer to the user.
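Here is a hedged sketch of that loop. `call_llm_with_tools` and `search_serp` are placeholders, not real APIs: the first stands for a provider call that can return either a final answer or a tool request, the second for a wrapper around whatever SERP service you use.

```python
# Tool-use loop sketch (illustrative only).

def call_llm_with_tools(messages: list[dict]) -> dict:
    """Placeholder: returns {'type': 'answer', 'text': ...}
    or {'type': 'tool_call', 'tool': 'serp', 'query': ...}."""
    raise NotImplementedError

def search_serp(query: str) -> str:
    """Placeholder: call a search API and return a text summary of the results."""
    raise NotImplementedError

def answer_with_tools(user_query: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Use the 'serp' tool for fresh facts."},
        {"role": "user", "content": user_query},
    ]
    for _ in range(max_steps):
        result = call_llm_with_tools(messages)
        if result["type"] == "answer":
            return result["text"]
        # The LLM asked for a tool: run it and feed the result back.
        tool_output = search_serp(result["query"])
        messages.append({"role": "tool", "content": tool_output})
    return "Could not answer within the allowed number of tool calls."
```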
Since late 2024, a technology called MCP (Model Context Protocol) has helped standardize how LLMs interact with external tools. From an architecture and operational point of view, this reduces the plumbing required and makes integrations less brittle. 

These actions open up an infinite range of possibilities: database connections, CI/CD pipeline control, GitHub commits, connecting to calendars, emails, ERPs, etc.

Conclusion

What looks like a simple chat hides a complex, multi-layered GenAI app architecture. From orchestration to RAG pipelines to actions, every component plays a role in making outputs reliable, contextual, and actionable.
And this is just the surface. In the next post, we’ll look into the caching/LLMOps and validation components which are key for maintainability, accuracy and reliability. We’ll also drill into the hardware part to clarify how resources are consumed.

Stay tuned — this series is about turning the “then a miracle occurs” step into something concrete, with an Ops lens.

