Architecture terms
The reasoning patterns and loops that make an agent work.
- 🏗️ Architecture: Adaptive RAG
A RAG variant that routes queries to different retrieval strategies based on complexity — simple questions skip retrieval, hard ones get multi-hop retrieval.
- 🏗️ Architecture: Agent orchestration
The control layer that coordinates multiple agents or agent steps — routing work, managing state, enforcing hand-off rules, and resolving conflicts between specialized agents.
- 🏗️ Architecture: Agentic loop
The core control flow of an agent: observe → reason → act → observe, repeated until the goal is met or a stop condition fires.
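The observe → reason → act cycle can be sketched in a few lines. This is a minimal illustration, not any framework's API: `run_agentic_loop` and its callback parameters are hypothetical names, the "environment" is a toy counter, and `max_steps` stands in for the stop condition.

```python
def run_agentic_loop(observe, reason, act, goal_met, max_steps=10):
    """Repeat observe -> reason -> act until the goal is met or the budget runs out."""
    history = []
    for _ in range(max_steps):
        observation = observe()
        if goal_met(observation):          # goal reached: stop
            return observation, history
        action = reason(observation)       # in a real agent, an LLM call
        act(action)                        # in a real agent, a tool call
        history.append((observation, action))
    return observe(), history              # stop condition: step budget exhausted


# Toy environment (an assumption for illustration): raise a counter to 3.
state = {"counter": 0}
final, history = run_agentic_loop(
    observe=lambda: state["counter"],
    reason=lambda obs: "increment",
    act=lambda action: state.update(counter=state["counter"] + 1),
    goal_met=lambda obs: obs >= 3,
)
```

The step budget matters as much as the goal check: without it, an agent that never satisfies `goal_met` would loop forever.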
- 🏗️ Architecture: Attention mechanism
The neural-network primitive that lets a transformer model weigh the importance of every input token when generating each output token — the core innovation behind LLMs.
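As a rough sketch of the weighing step, the snippet below computes scaled dot-product attention for a single query in plain Python. The vectors are made-up toy values, and a real transformer would use learned projections, multiple heads, and batched tensor operations.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output


# Toy inputs (illustrative only): the query points the same way as the first key.
weights, output = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

Because the query aligns with the first key, the first value vector receives the larger weight in the output.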
- 🏗️ Architecture: AutoGPT
The 2023 open-source project that popularized autonomous LLM agents — wraps an LLM in a recursive plan-execute-reflect loop with persistent goals and tool use.
- 🏗️ Architecture: BabyAGI
A minimal 2023 reference implementation of an autonomous task-driven agent — three loops (task creation, prioritization, execution) in ~100 lines of Python.
- 🏗️ Architecture: Chain of Agents
An architecture where agents run in sequence, each refining or extending the previous agent's output. Used for long documents or multi-stage workflows.
- 🏗️ Architecture: Chain of thought
A prompting technique that asks the model to lay out its reasoning step-by-step before committing to an answer — improves accuracy on multi-step tasks.
- 🏗️ Architecture: Constitutional AI
An alignment technique developed by Anthropic where the model is trained to follow a written set of principles ("a constitution") rather than per-example human preferences — produces safer behavior without massive human-labeling effort.
- 🏗️ Architecture: Context engineering
The discipline of curating what information goes into an LLM's context window — selecting, ordering, and formatting the system prompt, examples, retrieved documents, and conversation history for maximum effectiveness.
- 🏗️ Architecture: Corrective RAG
A RAG variant that grades retrieved documents and triggers fallback retrieval (web search, alternative sources) when the initial retrieval scores low on relevance.
- 🏗️ Architecture: DPO
Direct Preference Optimization — a simpler alternative to RLHF that trains models directly on preference data without needing a separate reward model or reinforcement learning loop.
- 🏗️ Architecture: Embeddings
Dense numerical vector representations of text, images, or audio — used to measure semantic similarity, power search, and ground LLM outputs in your data.
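Semantic similarity between embeddings is commonly measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions. The names `toy_embeddings` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, NOT output from any real embedding model.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def most_similar(word):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    target = toy_embeddings[word]
    others = [w for w in toy_embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(target, toy_embeddings[w]))


nearest = most_similar("cat")
```

A vector search index does the same comparison at scale, usually with approximate nearest-neighbor methods rather than a full scan.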
- 🏗️ Architecture: Emergent abilities
Capabilities reported to appear abruptly once models pass a certain scale, such as chain-of-thought reasoning, in-context learning, and instruction following, while being absent or near-zero in smaller models.
- 🏗️ Architecture: Few-shot learning
A prompting technique where the LLM sees a small number of input/output examples in the prompt before being asked to perform the same task on a new input.
- 🏗️ Architecture: Fine-tuning
The process of training a pre-trained LLM on additional data to adapt it for a specific task, domain, or style — produces a specialized model derived from a general-purpose base.
- 🏗️ Architecture: Frontier model
The current generation of state-of-the-art LLMs — typically the largest models from OpenAI, Anthropic, Google, and a small number of others.
- 🏗️ Architecture: Graph of Thoughts
A reasoning structure that generalizes Tree of Thoughts to an arbitrary directed graph: intermediate thoughts can be combined, refined, or referenced from multiple branches.
- 🏗️ Architecture: In-context learning
An LLM's ability to learn a new task at inference time by reading examples in the prompt — no weight updates, just pattern-matching from context.
- 🏗️ Architecture: Inference
The process of running a trained LLM to produce outputs — the production phase, distinct from training. Inference is what you pay for when you use an LLM API.
- 🏗️ Architecture: Instruction tuning
A fine-tuning technique where a pre-trained LLM is trained on instruction-response pairs so it learns to follow natural-language commands instead of just predicting next tokens.
- 🏗️ Architecture: LoRA
Low-Rank Adaptation: a parameter-efficient fine-tuning technique that freezes the base model and trains a small set of added low-rank weight matrices, cutting trainable parameters and per-task checkpoint size by 100× or more with minimal accuracy loss.
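The savings come from simple arithmetic on trainable parameters. The sketch below compares one full weight matrix with a rank-`r` adapter pair; the 4096×4096 shape and rank 8 are illustrative assumptions, not values from any particular model, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full fine-tuning vs. a LoRA adapter.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA instead trains B (d_out x rank) and A (rank x d_in), whose product
    is added to the frozen original matrix.
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter


# Illustrative shape and rank (assumptions).
full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
reduction = full / adapter
```

For this shape the adapter trains 65,536 parameters instead of 16,777,216, a 256× reduction for that one matrix. The frozen base weights still have to be loaded and run in the forward pass, so the saving applies to what is trained and stored per task.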
- 🏗️ Architecture: Mixture of Agents
An architecture where multiple agents (often using different models) generate candidate responses, then an aggregator agent synthesizes them. Higher quality at higher cost.
- 🏗️ Architecture: Mixture of experts
A model architecture where multiple specialized expert networks share the work — a routing layer activates only a few experts per input, cutting inference cost while keeping total parameter count high.
- 🏗️ Architecture: Model distillation
A training technique that transfers knowledge from a large "teacher" model to a smaller "student" model by training the student to match the teacher's outputs — produces a faster, cheaper model that retains most of the teacher's capability.
- 🏗️ Architecture: Multi-step reasoning
The ability of an LLM or agent to chain multiple inferences together to solve a problem — answer A leads to question B, which leads to question C, and so on until the final answer.
- 🏗️ Architecture: Neural network
A computational model loosely inspired by biological neurons — layers of weighted nodes that transform inputs to outputs. LLMs are large neural networks; so are image classifiers, recommendation systems, and most modern AI.
- 🏗️ Architecture: Plan-and-execute
A canonical two-stage agent pattern: a planner LLM produces a structured multi-step plan, then an executor (often a cheaper model) carries out each step using tools.
- 🏗️ Architecture: Planning
The phase where an agent decomposes a goal into a structured sequence of sub-tasks before executing any of them.
- 🏗️ Architecture: Prompt engineering
The practice of designing, refining, and testing the text instructions sent to an LLM to maximize output quality — covers system prompts, few-shot examples, formatting, and meta-instructions.
- 🏗️ Architecture: Quantization
A technique that reduces model weights from 16-bit or 32-bit floats to smaller representations (8-bit, 4-bit, or lower), cutting memory use and inference cost by 2–8× with minimal accuracy loss.
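A minimal sketch of the idea, assuming symmetric per-tensor 8-bit quantization (one of several schemes in use): each float is mapped to an integer in [-127, 127] with a shared scale, then mapped back at inference time. The weights are made-up values and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0   # avoid dividing by zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [0.12, -0.5, 0.33, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each stored value now needs 8 bits instead of 16 or 32, and the rounding error per weight is at most half the scale. Production schemes usually quantize per channel or per block to keep that error small when weight magnitudes vary.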
- 🏗️ Architecture: ReAct agent
An agent built on the ReAct pattern: an interleaved loop of reasoning (the model thinks out loud) and acting (the model calls a tool), repeated until the goal is met.
- 🏗️ Architecture: Reasoning model
A class of LLM (o3, Claude Sonnet 4.6, Gemini 2.5 reasoning) that produces a long internal chain of thought before responding — trading latency for accuracy on hard problems.
- 🏗️ Architecture: Reflexion
An agent design pattern where the agent reflects on its previous attempts, generates a critique, and uses the critique to improve subsequent attempts — produces measurable accuracy gains on hard tasks.
- 🏗️ Architecture: RLHF
Reinforcement Learning from Human Feedback — a training technique where humans rate model outputs to teach the model which responses are preferred, dramatically improving instruction-following and safety.
- 🏗️ Architecture: Scaling laws
Empirical power-law relationships between model size, training data, and compute that predict loss and, more loosely, capability. They have been widely used to plan frontier-model training runs since 2020.
- 🏗️ Architecture: Self-consistency
A reasoning technique where the model samples multiple chain-of-thought traces for the same problem and selects the most common final answer — cheap accuracy boost on math and logic tasks.
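The selection step is a majority vote over final answers. In the sketch below, `sample_answer` is a hypothetical stand-in for "run one chain-of-thought sample and extract its final answer"; here it just replays canned strings so the example runs without a model.

```python
from collections import Counter


def self_consistent_answer(sample_answer, n_samples):
    """Sample n final answers and return the most common one with its vote share."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Canned final answers standing in for five sampled reasoning traces (illustrative).
canned = iter(["42", "41", "42", "42", "40"])
answer, agreement = self_consistent_answer(lambda: next(canned), n_samples=5)
```

The vote share doubles as a rough confidence signal: low agreement across samples suggests the problem may need more samples or a different approach.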
- 🏗️ Architecture: Self-RAG
A RAG variant where the model decides on the fly whether to retrieve, what to retrieve, and whether its own draft is grounded — emitting reflection tokens at each step.
- 🏗️ Architecture: Speculative decoding
An inference optimization where a small draft model proposes several tokens at once and the large model verifies them in parallel. The output distribution is unchanged, with reported speedups of roughly 2–4×.
- 🏗️ Architecture: Task decomposition
The agent reasoning step where a high-level goal is broken into ordered, executable sub-tasks before any tool call is made. Foundational to plan-and-execute, ReAct, and tree-of-thoughts patterns.
- 🏗️ Architecture: Temperature
The LLM sampling parameter that controls randomness: low values stay near the most likely next token, high values explore more of the distribution. The typical range is 0 to 2.
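Concretely, temperature divides the logits before the softmax. The sketch below uses made-up logits for three candidate tokens and compares a low and a high setting; `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Dividing by 0 is undefined; implementations often special-case 0 as greedy decoding.
        raise ValueError("temperature must be greater than 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)   # concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)    # spreads probability more evenly
```

At temperature 0.5 the top token takes most of the probability mass, while at 2.0 the three tokens end up much closer together.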
- 🏗️ Architecture: Test-time compute
Spending more compute at inference time — longer reasoning, more samples, search — to get higher accuracy without retraining the model.
- 🏗️ Architecture: Tokenization
The process of splitting text into subword units (tokens) that LLMs consume — a word like "tokenization" might become two or three tokens depending on the model's tokenizer.
- 🏗️ Architecture: Top-p sampling
A sampling strategy (also called nucleus sampling) that samples from the smallest set of tokens whose cumulative probability reaches p. It trims the long tail without a hard top-k cutoff.
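The filtering step can be sketched as follows, using a made-up four-token distribution. `top_p_filter` is a hypothetical helper that returns the renormalized nucleus; an actual sampler would then draw one token from it.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:               # nucleus is complete
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {token: prob / total for token, prob in kept}


# Made-up next-token distribution for illustration.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, p=0.9)
```

With p = 0.9, the three most likely tokens are kept and the unlikely "zebra" is dropped. Unlike top-k, the number of kept tokens grows or shrinks with how peaked the distribution is.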
- 🏗️ Architecture: Transformer
The neural network architecture behind nearly every modern LLM. It uses self-attention to process sequences in parallel, replacing the older RNN approach for language modeling.
- 🏗️ Architecture: Tree of thoughts
A reasoning pattern where the LLM explores multiple solution paths in parallel as a tree, evaluates partial paths, and backtracks — outperforming linear chain-of-thought on hard problems.
- 🏗️ Architecture: World model
An internal predictive representation of the environment that an agent uses to simulate the outcomes of candidate actions before acting — central to 2026 frontier-agent research.
- 🏗️ Architecture: Zero-shot learning
An LLM's ability to perform a task it has never been explicitly trained on or shown examples of — relying entirely on the model's pre-training and the prompt instructions.