How AI Agents Work Under the Hood
ReAct (thought-action-observation loop) The agent alternates between generating a thought, selecting an action or tool, executing it, and incorporating the observation back into context. It repeats this loop until it has enough information to answer. This handles dynamic tasks where outcomes are uncertain. ReAct grounds the model in real feedback instead of hallucinated steps.[1]
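The loop above can be sketched in a few lines. This is a minimal illustration, not any framework's API: `fake_model` is a scripted stand-in for a real LLM call, and the `calculator[...]` action format is an assumed convention chosen to make the control flow visible.

```python
# Minimal ReAct loop sketch. `fake_model` is a hypothetical stand-in for
# an LLM call, scripted so the thought-action-observation cycle is visible.

def fake_model(context: str) -> str:
    # Pretend the model picks its next step from the transcript so far.
    if "Observation: 42" in context:
        return "Finish: The answer is 42."
    return "Action: calculator[6 * 7]"

def run_tool(action: str) -> str:
    # Hypothetical tool router: parse "calculator[expr]" and evaluate it.
    expr = action.split("[", 1)[1].rstrip("]")
    return str(eval(expr))  # demo only; never eval untrusted model output

def react_loop(question: str, max_steps: int = 5) -> str:
    context = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_model(context)
        if step.startswith("Finish:"):          # model has enough information
            return step.removeprefix("Finish:").strip()
        observation = run_tool(step.removeprefix("Action:").strip())
        context += f"\n{step}\nObservation: {observation}"  # feed result back
    return "Step budget exhausted."

print(react_loop("What is 6 * 7?"))
```

The `max_steps` cap matters in practice: without it, a confused model can loop indefinitely and burn tokens.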
Plan-and-Execute (upfront planning then execution) The model first creates a complete step-by-step plan, then executes each step in sequence with minimal replanning. It trades adaptability for predictability and lower per-step token use. Good when the task decomposes cleanly and the environment stays stable. It fails harder if the initial plan misses key details.[1]
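A rough shape of the pattern, with both model calls stubbed out: `plan` stands in for the planning call (here returning a fixed, hypothetical decomposition) and `execute_step` for the per-step executor.

```python
# Plan-and-Execute sketch. `plan` and `execute_step` stand in for two
# separate model calls: one drafts the whole plan, one runs each step.

def plan(task: str) -> list[str]:
    # Hypothetical planner output: a fixed decomposition for illustration.
    return ["fetch sales data", "compute monthly totals", "write summary"]

def execute_step(step: str, results: list[str]) -> str:
    # Hypothetical executor: each step can see prior results.
    return f"done: {step}"

def plan_and_execute(task: str) -> list[str]:
    results: list[str] = []
    for step in plan(task):                       # plan once, up front
        results.append(execute_step(step, results))  # then execute in order
    return results

print(plan_and_execute("summarize monthly sales"))
```

Note there is no feedback edge from execution back to planning; that is exactly the trade described above, and why a wrong initial plan fails hard.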
Tool Calling / Function Calling (structured tool use) The LLM outputs structured JSON or schema-compliant calls that the framework routes to predefined functions or APIs. The system parses the call, runs the tool, and injects the result back into the conversation. This gives reliable, typed interactions without free-form text parsing. Most modern models ship with native support for this.[2]
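A minimal dispatch layer might look like this. The `get_weather` tool and the JSON shape are illustrative assumptions, not any vendor's actual wire format, but the mechanics (parse, route via a registry, invoke with typed arguments) are the same.

```python
import json

# Tool-calling sketch: the "model output" is a structured JSON call that
# the framework parses, routes through a tool registry, and executes.
# The weather tool and JSON shape are illustrative, not a vendor API.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub tool; a real one would hit an API

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    call = json.loads(model_output)     # parse the structured call
    fn = TOOLS[call["name"]]            # route to the registered function
    return fn(**call["arguments"])      # typed invocation, no text parsing

model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
print(dispatch(model_output))  # result is injected back into the conversation
```

Production versions add schema validation before dispatch and return errors to the model as observations instead of raising.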
Code Execution Agents (sandboxed code running) The agent writes Python or other code, sends it to a sandboxed interpreter, and receives stdout, results, or errors. It iterates by fixing or extending the code based on output. This works well for math, data analysis, or procedural tasks. The sandbox must limit CPU time, memory, filesystem, and network access, or model-written code becomes a serious security risk.
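A bare-bones version of the execute-and-observe step, using a subprocess with a timeout. This is a sketch of the feedback loop only; a subprocess is not a real sandbox, and production systems (e.g. E2B) add memory, filesystem, and network isolation.

```python
import subprocess
import sys

# Code-execution sketch: run model-written code in a subprocess with a
# timeout, capturing stdout/stderr so the agent can iterate on errors.
# NOTE: a subprocess alone is NOT a sandbox; real systems add isolation.

def run_code(code: str, timeout: float = 5.0) -> tuple[str, str]:
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout, proc.stderr

out, err = run_code("print(sum(range(10)))")
print(out.strip())  # the agent reads this observation and decides its next step
```

On an error, `err` is non-empty and gets fed back to the model, which typically responds with a corrected version of the code.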
Multi-Agent: Supervisor pattern One coordinator agent decomposes the task, assigns subtasks to specialized worker agents via routing or tool calls, then synthesizes their outputs. Workers run independently with narrow tools and scopes. This scales better than one monolithic agent on complex jobs. The supervisor becomes a bottleneck if coordination logic grows too complex.[1]
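The routing structure can be sketched with plain functions as workers. The decomposition here is hard-coded for clarity; in a real supervisor (CrewAI, LangGraph) the coordinator is itself an LLM that decides the subtasks and routing.

```python
# Supervisor sketch: one coordinator routes subtasks to narrow workers
# and synthesizes their outputs. Worker functions are hypothetical
# stand-ins for specialized agents with their own tools and prompts.

def research_worker(subtask: str) -> str:
    return f"facts about {subtask}"

def writer_worker(subtask: str) -> str:
    return f"draft covering {subtask}"

WORKERS = {"research": research_worker, "write": writer_worker}

def supervisor(task: str) -> str:
    # Hard-coded decomposition; a real supervisor uses an LLM to decide
    # which workers to invoke and in what order.
    subtasks = [("research", task), ("write", task)]
    outputs = [WORKERS[role](sub) for role, sub in subtasks]
    return " | ".join(outputs)  # synthesis step

print(supervisor("solar panels"))
```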
Multi-Agent: Debate pattern Multiple agents take opposing or varied positions, critique each other's outputs in rounds, and a moderator or judge synthesizes the best conclusion. It surfaces flaws and improves reasoning on ambiguous or high-stakes topics. Communication overhead adds latency and tokens. Works when you can afford multiple inference passes.[3]
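The round structure looks roughly like this, with all three roles (two debaters, one judge) stubbed out; in a real system each would be a separate LLM call, which is where the latency and token cost come from.

```python
# Debate sketch: two scripted "agents" exchange critiques for a fixed
# number of rounds, then a judge synthesizes. All three roles would be
# LLM calls in a real system; they are stubs here to show the structure.

def agent(position: str, opponent_last: str) -> str:
    return f"{position}; rebutting '{opponent_last}'"

def debate(topic: str, rounds: int = 2) -> tuple[str, list[str]]:
    a_last, b_last = f"pro {topic}", f"con {topic}"   # opening positions
    transcript = [a_last, b_last]
    for _ in range(rounds):
        a_last = agent(f"pro {topic}", b_last)        # A critiques B
        b_last = agent(f"con {topic}", a_last)        # B critiques A
        transcript += [a_last, b_last]
    # Hypothetical judge: would read the full transcript and synthesize.
    return f"verdict after {rounds} rounds on {topic}", transcript

verdict, transcript = debate("tabs vs spaces")
print(verdict)
```

Each round adds two inference passes, so cost grows linearly with `rounds`, which is the overhead the paragraph above warns about.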
Multi-Agent: Pipeline pattern Agents run in fixed sequential or staged order. Each takes the prior agent's output as input and performs one specialized transformation. Simple to debug and trace. Rigid structure limits adaptation if early stages produce poor data. Common for content generation or data processing workflows.[4]
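Since each stage is one transformation of the prior stage's output, the whole pattern reduces to function composition. The three stages below are illustrative stand-ins for per-stage agents.

```python
# Pipeline sketch: strictly sequential stages, each consuming the prior
# stage's output. Plain functions stand in for per-stage agents.

def extract(text: str) -> list[str]:
    return text.split()

def transform(tokens: list[str]) -> list[str]:
    return [t.upper() for t in tokens]

def load(tokens: list[str]) -> str:
    return " ".join(tokens)

def pipeline(text, stages=(extract, transform, load)):
    data = text
    for stage in stages:   # fixed order, no backtracking to earlier stages
        data = stage(data)
    return data

print(pipeline("hello agent world"))
```

The easy debuggability comes from this shape: log `data` between stages and you can see exactly where quality degrades.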
Memory: Context window (short-term) The raw token buffer that holds the current conversation, recent thoughts, tool outputs, and system prompt. Everything in the window influences the next generation directly. Hits hard limits fast on long interactions. Most frameworks truncate or summarize when it fills.
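The truncation strategy mentioned above can be sketched as a rolling budget over recent turns. Token counting is faked as word count here; real frameworks use the model's tokenizer, and many summarize dropped turns instead of discarding them.

```python
# Context-window sketch: keep the most recent turns that fit a token
# budget, dropping the oldest. Word count fakes tokenization here.

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer

def trim_context(messages: list[str], budget: int) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):      # walk newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                       # oldest turns fall off the window
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order

history = ["old turn one", "old turn two", "latest user question"]
print(trim_context(history, budget=6))
```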
Memory: Vector store (long-term semantic) Embeddings of past interactions, documents, or facts are stored in a vector database and retrieved by similarity when the current query matches. This pulls relevant history without blowing up the context window. Retrieval quality depends on chunking and embedding model. Adds latency for the query step.
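The embed-store-retrieve cycle can be shown end to end with toy embeddings. The word-overlap "embedding" and the tiny fixed vocabulary are deliberate simplifications; a real system uses a learned embedding model and an approximate-nearest-neighbor index (Pinecone, Chroma, etc.).

```python
import math

# Vector-store sketch: toy word-overlap "embeddings" plus exact cosine
# retrieval. Real systems use learned embeddings and an ANN index.

VOCAB = ["refund", "policy", "shipping", "times", "reset", "password"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(w in words) for w in VOCAB]   # crude bag-of-words vector

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

DOCS = ["refund policy details", "shipping times overview", "reset password steps"]
INDEX = [(doc, embed(doc)) for doc in DOCS]     # "write" path: embed and store

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)                            # "read" path: embed the query
    scored = sorted(INDEX, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

print(retrieve("how do I reset my password"))
```

The retrieval-quality caveat above shows up directly here: if the chunking or embedding misses the query's vocabulary, cosine similarity returns the wrong document with full confidence.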
Memory: Conversation history (session tracking) Persistent log of user messages, agent responses, and key decisions across turns or sessions. Often stored in a database and selectively injected or summarized into the context window. Handles multi-turn coherence better than raw context alone. Requires summarization logic for very long sessions.
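A minimal shape for the log-then-inject flow, with an in-memory dict standing in for the database and a placeholder where the summarization call would go.

```python
# Conversation-history sketch: a per-session log plus a build step that
# compresses old turns before injection. The dict stands in for a
# database; the summary line stands in for a real LLM summarizer call.

SESSIONS: dict[str, list[tuple[str, str]]] = {}

def log_turn(session_id: str, role: str, text: str) -> None:
    SESSIONS.setdefault(session_id, []).append((role, text))

def build_context(session_id: str, keep_recent: int = 2) -> str:
    turns = SESSIONS.get(session_id, [])
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    parts = []
    if old:
        # Hypothetical summarizer: a real one would be an LLM call over `old`.
        parts.append(f"[summary of {len(old)} earlier turns]")
    parts += [f"{role}: {text}" for role, text in recent]
    return "\n".join(parts)

log_turn("s1", "user", "hi")
log_turn("s1", "agent", "hello")
log_turn("s1", "user", "what did I say first?")
print(build_context("s1"))
```

Only the recent turns plus the summary enter the context window, which is how this layer stays coherent across long sessions without blowing the token budget.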
Pattern Comparison
| Pattern | Best For | Weakness | Example Framework |
|---|---|---|---|
| ReAct | Dynamic tasks, tool exploration | High token use, unpredictable cost | LangChain, LlamaIndex |
| Plan-and-Execute | Structured, predictable workflows | Brittle if plan is wrong | LangGraph, AutoGen |
| Tool Calling | Reliable API/tool integration | Limited to predefined schemas | OpenAI, Anthropic |
| Code Execution | Math, data, procedural generation | Sandbox escape risk, slow loops | E2B, LangChain code |
| Supervisor (Multi) | Task routing across specialties | Coordinator bottleneck | CrewAI, LangGraph |
| Debate (Multi) | Reasoning, fact checking | High latency and cost | AutoGen, custom |
| Pipeline (Multi) | Linear multi-stage processing | No adaptation to early errors | LangGraph, CrewAI |
| Context Window | Immediate short-term state | Strict length limits | All LLMs |
| Vector Store | Long-term knowledge retrieval | Retrieval quality varies | Pinecone, Chroma, Redis |
| Conversation History | Session coherence | Needs summarization at scale | LangChain memory |
If your task has high uncertainty or changing data, start with ReAct or supervisor. Stable pipelines reward Plan-and-Execute or strict sequential agents. Memory choices matter more than the reasoning loop once sessions exceed a few thousand tokens. The real system is usually a hybrid.