LLM Orchestration in Practice: What Actually Works in Production
How do you actually build reliable orchestration layer so that your AI application just works
Building with Generative AI is not at all a smooth ride. Sometimes you hit the wall and get stuck when what you are trying to build is not going the way you had imagined. When you first get started, it's easy to be impressed by the well-crafted demo. But production exposes the silent cracks. The smooth working demo often breaks apart once real users start to use it, cost constraints, and latency expectations come into play.
AI systems differ from conventional software systems in many ways. Working with LLMs is less about chasing perfect prompts and more about designing resilient systems that can handle uncertainty, failure and edge cases efficiently. This blog is a reflection of those lessons and hacks that worked, and the pitfalls we stumbled into while turning raw LLMs into reliable products. An AI multi-agent system is an application that involves collaboration between multiple nodes. These agents work together to achieve common goals by sharing information, coordinating actions, and learning from each other.
What is an AI agent really?
"AI Agent" has become an umbrella term for describing any autonomous system that operates with little to no human intervention. But is that really accurate? Probably not.
In reality, AI systems are built on chained or parallel loops of tool calls. These can follow one of two approaches:
A predefined sequence of actions/tool calls.
LLM-driven decision-making, where the model controls the flow.
This is where the distinction between Workflows and Agents becomes crucial.
Workflows - A deterministic AI system where the expected outcome is known at each step. There is no autonomy in decision-making. Only the input varies at each node, not the action.
Agents - An AI system where the LLM decides how to navigate between multiple possible actions. Autonomy is key. The model dynamically chooses actions until it reaches a solution. Both the action path and the final solution are variable.
You are better off building a workflow instead of agent
Letting an LLM autonomously choose actions from a predefined set can do more harm than good. Autonomy introduces hidden stochasticity. Let’s be honest, LLMs are still quite dumb. The more control you surrender to LLM, the more chaos it can invite.
You wouldn’t trust an LLM to decide how much to bid in an auction or choose the "best" flight for your trip.
Why? Because without hard, rational boundaries, their "decisions" are just probabilistic guesses. Instead of granting full autonomy, you should:
Define strict criteria for action selection.
Spoon-feed the model the most plausible action or at least narrow its options to near-deterministic choices.
Please avoid third party agentic libraries!
The market is flooded with third-party libraries promising "easy AI agent development." Here’s the uncomfortable truth: most are over-engineered abstractions that create more problems than they solve.
Your AI application is probably having the LLM as elephant in the room which is already a pretty much major abstraction or black box to you. Adding one more surface of abstraction is a recipe for destruction.
These frameworks add unnecessary layers between your core logic and the LLM. What starts as a "quick integration" quickly becomes a black box that turns into massive tech debt:
Obfuscates tool call logic
Introduces opaque error handling
Locks you into someone else’s design choices
These agentic frameworks step by step introduce major blockers and concerns when you go to production with your agent running on one of these frameworks.
Here is what actually happens
"Just three lines of code to automate your workflow!"
→ What they don't show: 50 hidden dependencies, 3 abstraction layers, and a complete surrender of control.
Now your:
→ Tool calls route through framework-specific decorators
→ Error handling depends on their callback system
→ Scaling requires understanding their proprietary "agent state" model
Wait until you:
→ Need custom routing logic their API doesn't expose
→ Discover their "agent memory" system can't handle your scale
→ Get pinned to an old LLM version because their SDK hasn't updated
You're already dealing with the ultimate black box - the LLM itself. Why would you voluntarily wrap it in another layer of someone else's technical debt?
Structured response is all you need…
Workflows or agents are internally communicating the outcomes from one node to another. The output of one node becomes the input signal for the next one. The handoff between nodes is the signalling mechanism that determines what happens next. Ambiguity here derails everything downstream. Unstructured outputs make tool selection, task routing, and logic branching prone to errors. Structured response between nodes serves similar purpose as what json schema does to the API contracts.
Think of an AI travel assistant. When you say "Book me a trip to Bali under 20k rupees for upcoming weekend", here's what actually needs to happen under the hood:
Structured Output:
Makes the further API call to book tickets no brainer. Think of it like this - every output is a contract, every tool call should be traceable, every error has predefined recovery path.
Enforcement Techniques (Ranked by Robustness)
Pydantic Models (Gold Standard)
Validates types/ranges at runtime
Auto-generates OpenAPI specs
JSON Schema (Prompt-Level)
3. Tool Call
LLM outputs directly into function args
If you’re using models from LLM API providers like Grok, OpenRouter, or Claude, Instructor is a solid choice for generating reliable structured output. On the other hand, if you’re working with self-hosted models, Outlines tends to be the better fit.
Bonus tip:
If you don’t want to rely on scaffolding frameworks like Instructor or Outlines, you might face a tricky situation: what if your chosen LLM doesn’t support native tool-calling or structured JSON output?
This is critical and often overlooked question in production systems. In scenario’s like this you need to enforce the output constraints more rigorously on the prompt level such that it can only respond in one valid way.
Through experimentation, we found XML to be the most reliable choice for structured outputs. Its explicit opening and closing tags make it easier for the model to generate consistently and much simpler for your code to parse with confidence.
The sample system prompt is below.
With the prompt like above the LLM is configured to consistently generate responses wrapped in the XML format you defined. You need to write a parser that reliably extracts the structured parameters from this XML for downstream tool execution.
Note: As a fallback strategy, if the LLM fails to produce valid XML, the system will automatically retry by prompting the model to correct its format.
Start with intent, then navigate
Before hardcoding tool sequences or granting full autonomy, first make the model classify user intent. This acts as a decision tree root:
Map the high-level intent (e.g., "Is this a booking request, inquiry, or complaint?").
Branch logically from there - Each intent unlocks a constrained set of next-step actions.
Only then proceed with tool calls or workflows.
Why?
This approach prevents chaos by avoiding the trap of letting an LLM wander into an unbounded decision space. It strikes a balance between structure and flexibility by allowing you to define guardrails per intent. For example, in a flight booking flow, you can strictly enforce date and price rules. And importantly, it fails fast: if an intent is misclassified, you can reset early instead of wasting time debugging a broken action chain.
Have a fallback mechanism
Let’s be honest. LLMs are still kind of dumb when it comes to following instructions. I’ve lost count of how many times they’ve ignored explicit formatting rules and answered in plain text, hallucinated parameters that never existed, or returned malformed structures with missing brackets and broken nesting that instantly break parsers. This isn’t some rare edge case but it’s the default reality of working with LLMs in production.
Over time, we realized the real challenge isn’t just about writing the perfect prompt or designing a neat schema. It’s about building resilient systems that can
Detect when the model has deviated from instructions.
Automatically retry or reformat the query.
Gracefully reroute the request or escalate to human intervention when repeated failures occur.
Allow room for clarification
Directly forcing the model to output the action is not a wise strategy, especially when you’re building an AI system with a chat interface. There’s always a possibility of unrelated or ambiguous questions slipping in, and if your system blindly tries to map every response to an action, things can break fast.
This is where a clarification step becomes essential. Instead of guessing, the model (or a helper tool) should be able to ask back: ‘Do you mean X or Y?’ or ‘Can you provide the missing details?’. By giving the system permission to pause and clarify, you dramatically reduce the risk of misfires.
I’ve found that introducing a dedicated clarification questions tool is a game-changer. It acts as a safeguard between intent detection and action execution ensuring that when the user’s request is vague, incomplete, or contradictory, the system doesn’t rush ahead but instead engages in a short back-and-forth to understand the details. Only once the intent is clear does the system proceed to trigger the corresponding action.
Effective RAG > Finetune
It’s probably generally accepted at this point. We get the hype around having your own customized model but in reality, you almost never need it. Fine-tuning often does more harm than good if your dataset is noisy, imbalanced, or simply too small. What you actually need is a solid RAG (Retrieval-Augmented Generation) pipeline.
A well-designed RAG setup gives you the best of both worlds: you keep the general reasoning and language capabilities of a foundation model, while grounding its responses in your domain-specific knowledge. This not only reduces hallucinations but also makes your system far easier to update because instead of retraining a model every time your knowledge changes, you just update the underlying data source.
In my experience, investing time in designing the right retrieval strategy (chunking, embeddings, ranking, and filtering) pays off far more than jumping straight into fine-tuning. Think of it this way: fine-tuning tries to force the model to ‘remember’ your knowledge, while RAG teaches it how to look things up reliably whenever needed.
Give your model some space to think!
Well, we humans perform better when we have a scratchpad to think and jot down the plan on before jumping into execution. LLMs are no different. Forcing them to go straight from input → final answer often leads to brittle outputs. But if you give the model an intermediate space; a structured way to “think out loud” before committing it produces far more reliable results.
This is where techniques like chain-of-thought prompting, scratchpad reasoning, or even a simple hidden planning step come in. The idea is to separate reasoning from answering: let the model work through the logic first, then distill that into the final structured output or user-facing response. In production systems, this can mean capturing a reasoning trace, validating it, and only then executing the corresponding action.
We found a huge jump in both precision and accuracy of my outputs once we started giving the model this ‘scratchpad space.’ Instead of rushing straight into the final answer, the model could first reason through the problem, validate intermediate steps, and then commit to a cleaner, more reliable response. What surprised us was how much this reduced downstream errors such as malformed JSONs, hallucinated parameters, or missing fields suddenly became far less frequent.
Okay, but why does it happen?
LLMs are autoregressive models, which means they generate text token by token, with each next token being chosen based on the likelihood of following the previous ones. When you directly force them to jump to the final answer, they often default to the most statistically likely continuation even if it’s shallow, imprecise, or structurally wrong.
By contrast, when you allow them a scratchpad or reasoning space, you’re effectively guiding the model to lay out intermediate steps. This shifts the distribution of likely outputs toward more structured, logical sequences, reducing the chances of it drifting into malformed or hallucinated answers. In other words, you’re hacking the model’s probability space: instead of gambling on it hitting the perfect response in one go, you let it build the path step by step.
Pro tip: Make the model think before it answers
Want the non-reasoning model to reason first and only then produce the final action/output? Give it a private scratchpad to think out loud.
How: Ask the model to emit reasoning inside a confined XML tag, e.g. <thinking>...</thinking>
, followed by the final, machine-readable output. You then strip the <thinking>
block before returning anything to the user or downstream tools.
How does it help?
As an AI Engineer, you get visibility into the model’s reasoning. This makes debugging far easier as you now can see why the model made a certain choice.
You can define the reasoning/thinking constraint so that model always starts thinking in valid direction before producing an output.
Keep the end user engaged
All AI applications struggle to balance two north-star metrics: latency and cost. If your system takes too long to respond, users feel like it’s broken. If you throw too much compute at it, the costs spiral quickly.
One practical way to reduce the perception of latency is to stream intermediate steps back to the user while the system is working. For example, when a chain of tool calls is being executed; let’s say the model first classifies intent, then calls a retrieval step, then hits an external API. You don’t need to keep the user staring at a blank chat window. Instead, you can surface a running commentary: “Searching flight options…”, “Checking prices under ₹8,000…”, “Validating dates…”.
This trick serves two purposes. First, it reassures the user that the system is alive and making progress (instead of stalling). Second, it mirrors the way humans think aloud when solving a task, which makes the interaction feel more natural. As a result users perceive the system as faster and more intelligent, even if the underlying operations still take the same amount of time.
In my experience, adding these “micro-updates” improve the UX and it also buys you engineering flexibility. You can afford a slightly longer pipeline (say, with more rigorous validation or extra retrieval steps) without the user feeling frustrated by latency.