🏗️

Architecture terms

The reasoning patterns and loops that make an agent work.

🏗️Architecture
Adaptive RAG
A RAG variant that routes queries to different retrieval strategies based on complexity — simple questions skip retrieval, hard ones get multi-hop retrieval.
🏗️Architecture
Agent orchestration
The control layer that coordinates multiple agents or agent steps — routing work, managing state, enforcing hand-off rules, and resolving conflicts between specialized agents.
🏗️Architecture
Agentic loop
The core control flow of an agent: observe → reason → act → observe, repeated until the goal is met or a stop condition fires.
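A minimal sketch of that loop, assuming the agent's parts are passed in as plain callables. The names `run_agent`, `observe`, `reason`, `act`, `goal_met`, and `max_steps` are illustrative and not taken from any particular framework; `max_steps` stands in for the stop condition.

```python
from typing import Callable


def run_agent(observe: Callable[[], int],
              reason: Callable[[int], str],
              act: Callable[[str], None],
              goal_met: Callable[[int], bool],
              max_steps: int = 10) -> int:
    """Repeat observe -> reason -> act until the goal is met or max_steps fires.

    Returns the number of actions taken.
    """
    for step in range(max_steps):
        observation = observe()
        if goal_met(observation):
            return step
        action = reason(observation)
        act(action)
    return max_steps  # stop condition fired before the goal was met


# Toy environment (hypothetical): a counter the agent must raise to 3.
state = {"counter": 0}


def increment(action: str) -> None:
    if action == "increment":
        state["counter"] += 1


steps = run_agent(
    observe=lambda: state["counter"],
    reason=lambda obs: "increment",   # a real agent would call an LLM here
    act=increment,
    goal_met=lambda obs: obs >= 3,
)
```

In a real agent, `reason` would be an LLM call and `act` a tool invocation; the loop shape stays the same.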
🏗️Architecture
Attention mechanism
The neural-network primitive that lets a transformer model weigh the importance of every input token when generating each output token — the core innovation behind LLMs.
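One common form is scaled dot-product attention. The sketch below is a toy single-head version over plain Python lists, with no learned projections, masking, or batching; the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query gets a weighted mix of values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

When two keys are identical, the query weighs them equally and the output is the mean of their values.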
🏗️Architecture
AutoGPT
The 2023 open-source project that popularized autonomous LLM agents — wraps an LLM in a recursive plan-execute-reflect loop with persistent goals and tool use.
🏗️Architecture
BabyAGI
A minimal 2023 reference implementation of an autonomous task-driven agent — three loops (task creation, prioritization, execution) in ~100 lines of Python.
🏗️Architecture
Chain of Agents
An architecture where agents run in sequence, each refining or extending the previous agent's output. Used for long documents or multi-stage workflows.
🏗️Architecture
Chain of thought
A prompting technique that asks the model to lay out its reasoning step-by-step before committing to an answer — improves accuracy on multi-step tasks.
🏗️Architecture
Constitutional AI
An alignment technique developed by Anthropic where the model is trained to follow a written set of principles ("a constitution") rather than per-example human preferences — produces safer behavior without massive human-labeling effort.
🏗️Architecture
Context engineering
The discipline of curating what information goes into an LLM's context window — selecting, ordering, and formatting the system prompt, examples, retrieved documents, and conversation history for maximum effectiveness.
🏗️Architecture
Corrective RAG
A RAG variant that grades retrieved documents and triggers fallback retrieval (web search, alternative sources) when the initial retrieval scores low on relevance.
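The grade-then-fall-back step can be sketched as below. `retrieve`, `grade`, and `fallback` are hypothetical callables supplied by the caller, and the 0.5 threshold is an arbitrary assumption.

```python
def corrective_retrieve(query, retrieve, grade, fallback, threshold=0.5):
    """Keep primary documents that grade as relevant; otherwise use the fallback.

    Returns (documents, source) where source is "primary" or "fallback".
    """
    docs = retrieve(query)
    relevant = [d for d in docs if grade(query, d) >= threshold]
    if relevant:
        return relevant, "primary"
    # Nothing scored high enough, so try an alternative source such as web search.
    return fallback(query), "fallback"
```

In practice `grade` is often an LLM or a small classifier; here it is left abstract.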
🏗️Architecture
DPO
Direct Preference Optimization — a simpler alternative to RLHF that trains models directly on preference data without needing a separate reward model or reinforcement learning loop.
🏗️Architecture
Embeddings
Dense numerical vector representations of text, images, or audio — used to measure semantic similarity, power search, and ground LLM outputs in your data.
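Semantic similarity between embeddings is usually measured with cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not output from any real model.
docs = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                reverse=True)
```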
🏗️Architecture
Emergent abilities
Capabilities that appear suddenly above a certain model scale — chain-of-thought reasoning, in-context learning, instruction following — and are absent or near-zero in smaller models.
🏗️Architecture
Few-shot learning
A prompting technique where the LLM sees a small number of input/output examples in the prompt before being asked to perform the same task on a new input.
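A few-shot prompt is just text, so building one needs no library. The `Input:`/`Output:` layout below is one common convention, not a required format, and `build_few_shot_prompt` is an illustrative name.

```python
def build_few_shot_prompt(examples, new_input, instruction=""):
    """Format (input, output) example pairs followed by the new input."""
    parts = [instruction] if instruction else []
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # Leave the final output blank so the model completes it.
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    examples=[("great movie", "positive"), ("waste of time", "negative")],
    new_input="surprisingly good",
    instruction="Classify the sentiment of each input.",
)
```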
🏗️Architecture
Fine-tuning
The process of training a pre-trained LLM on additional data to adapt it for a specific task, domain, or style — produces a specialized model derived from a general-purpose base.
🏗️Architecture
Frontier model
The current generation of state-of-the-art LLMs — typically the largest models from OpenAI, Anthropic, Google, and a small number of others.
🏗️Architecture
Graph of Thoughts
A reasoning structure that generalizes Tree of Thoughts to an arbitrary DAG — intermediate thoughts can be combined, refined, or referenced from multiple branches.
🏗️Architecture
In-context learning
An LLM's ability to learn a new task at inference time by reading examples in the prompt — no weight updates, just pattern-matching from context.
🏗️Architecture
Inference
The process of running a trained LLM to produce outputs — the production phase, distinct from training. Inference is what you pay for when you use an LLM API.
🏗️Architecture
Instruction tuning
A fine-tuning technique where a pre-trained LLM is trained on instruction-response pairs so it learns to follow natural-language commands instead of just predicting next tokens.
🏗️Architecture
LoRA
Low-Rank Adaptation: a parameter-efficient fine-tuning technique that freezes the base model and trains a small pair of low-rank matrices alongside each adapted weight matrix, cutting the number of trainable weights and the size of each task-specific checkpoint by 100× or more with minimal accuracy loss.
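The saving is easiest to see in a sketch. Assuming a frozen weight matrix `W` of shape d_out × d_in and two small trainable matrices `A` (rank × d_in) and `B` (d_out × rank), the adapted layer computes W·x + scale·B·(A·x). The helper names and the 4096 × 4096, rank-8 example are illustrative.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, rank):
    """Trainable weights for full fine-tuning vs. a low-rank adapter."""
    return d_out * d_in, rank * (d_in + d_out)


full, adapter = param_counts(4096, 4096, rank=8)
# full == 16_777_216, adapter == 65_536: a 256x reduction for this one matrix.
```

With `B` initialized to zeros, the adapted layer starts out identical to the frozen base layer.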
🏗️Architecture
Mixture of Agents
An architecture where multiple agents (often using different models) generate candidate responses, then an aggregator agent synthesizes them. Higher quality at higher cost.
🏗️Architecture
Mixture of experts
A model architecture where multiple specialized expert networks share the work — a routing layer activates only a few experts per input, cutting inference cost while keeping total parameter count high.
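A toy version of the routing layer, assuming top-k gating with a softmax over the selected experts only. Real implementations differ in how they normalize and load-balance; the experts here are plain functions standing in for sub-networks.

```python
import math


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in routes)


# Four hypothetical experts; only two run for this input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
y = moe_forward(2.0, experts, gate_logits=[0.0, 5.0, 5.0, -1.0], k=2)
```

All four experts count toward the parameter total, but each input pays for only `k` of them.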
🏗️Architecture
Model distillation
A training technique that transfers knowledge from a large "teacher" model to a smaller "student" model by training the student to match the teacher's outputs — produces a faster, cheaper model that retains most of the teacher's capability.
🏗️Architecture
Multi-step reasoning
The ability of an LLM or agent to chain multiple inferences together to solve a problem — answer A leads to question B, which leads to question C, and so on until the final answer.
🏗️Architecture
Neural network
A computational model loosely inspired by biological neurons — layers of weighted nodes that transform inputs to outputs. LLMs are large neural networks; so are image classifiers, recommendation systems, and most modern AI.
🏗️Architecture
Plan-and-execute
A canonical two-stage agent pattern: a planner LLM produces a structured multi-step plan, then an executor (often a cheaper model) carries out each step using tools.
🏗️Architecture
Planning
The phase where an agent decomposes a goal into a structured sequence of sub-tasks before executing any of them.
🏗️Architecture
Prompt engineering
The practice of designing, refining, and testing the text instructions sent to an LLM to maximize output quality — covers system prompts, few-shot examples, formatting, and meta-instructions.
🏗️Architecture
Quantization
A technique that reduces model weights from 16-bit or 32-bit floats to smaller representations (8-bit, 4-bit, or lower), cutting memory use and inference cost by 2–8× with minimal accuracy loss.
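A minimal sketch of symmetric 8-bit quantization with a single scale per tensor. Production schemes typically use per-channel or per-group scales and calibration data; the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half of `scale`, which is the rounding error the accuracy loss comes from.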
🏗️Architecture
ReAct agent
An agent built on the ReAct pattern: an interleaved loop of reasoning (the model thinks out loud) and acting (the model calls a tool), repeated until the goal is met.
🏗️Architecture
Reasoning model
A class of LLM (o3, Claude Sonnet 4.6, Gemini 2.5 reasoning) that produces a long internal chain of thought before responding — trading latency for accuracy on hard problems.
🏗️Architecture
Reflexion
An agent design pattern where the agent reflects on its previous attempts, generates a critique, and uses the critique to improve subsequent attempts — produces measurable accuracy gains on hard tasks.
🏗️Architecture
RLHF
Reinforcement Learning from Human Feedback: a training technique where humans rank model outputs, a reward model is fit to those rankings, and the LLM is then optimized against that reward model with reinforcement learning. It markedly improves instruction-following and safety.
🏗️Architecture
Scaling laws
Empirical power-law relationships between model size, training data, and compute that predict loss as each is scaled up. They have guided how frontier labs budget parameters, data, and compute for large training runs since 2020.
🏗️Architecture
Self-consistency
A reasoning technique where the model samples multiple chain-of-thought traces for the same problem and selects the most common final answer. It is a simple accuracy boost on math and logic tasks, paid for with one extra inference call per sample.
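The voting step reduces to a majority count over final answers. In the sketch below, `sample_fn` is a hypothetical stand-in for one sampled chain-of-thought call that returns only the final answer.

```python
from collections import Counter


def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample several answers and return (most common answer, vote share)."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples


# Canned answers standing in for five sampled reasoning traces.
canned = iter(["42", "41", "42", "42", "40"])
result = self_consistent_answer(lambda q: next(canned), "What is 6 * 7?")
```

The vote share doubles as a rough confidence signal: a narrow majority suggests the problem deserves more samples or a different approach.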
🏗️Architecture
Self-RAG
A RAG variant where the model decides on the fly whether to retrieve, what to retrieve, and whether its own draft is grounded — emitting reflection tokens at each step.
🏗️Architecture
Speculative decoding
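An inference optimization where a small draft model proposes several tokens ahead and the large model verifies them in a single parallel pass. The output distribution matches what the large model alone would produce, and reported speedups are typically 2–4×.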
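A simplified sketch of one draft-and-verify round, assuming greedy decoding, where a drafted token is accepted only if it equals the large model's own next token. The full method uses a probabilistic accept/reject rule for sampled decoding. `draft_next` and `target_next` are hypothetical callables that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft k tokens, then keep the longest run the target model agrees with.

    Always returns at least one token chosen by the target model.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks each position. A real system scores all
    #    positions in one batched forward pass; this loop is sequential.
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft token matched, so the target contributes one more token.
    accepted.append(target_next(ctx))
    return accepted


target = list("abcdef")
draft = list("abXdef")
out = speculative_step([], lambda ctx: draft[len(ctx)], lambda ctx: target[len(ctx)])
```

The result is exactly what the target model would have produced on its own; the draft only determines how many tokens are confirmed per round.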
🏗️Architecture
Task decomposition
The agent reasoning step where a high-level goal is broken into ordered, executable sub-tasks before any tool call is made. Foundational to plan-and-execute, ReAct, and tree-of-thoughts patterns.
🏗️Architecture
Temperature
The LLM sampling parameter that controls randomness: low values stay near the most-likely next token, high values explore more of the distribution. The typical range is 0–2.
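Mechanically, temperature divides the logits before the softmax. A small sketch with made-up logits shows the effect; the function name is illustrative, and a temperature of exactly 0 is usually handled separately as greedy decoding.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for 0")
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.5)  # sharper: mass piles on the top token
hot = softmax_with_temperature(logits, 2.0)   # flatter: mass spreads out
```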
🏗️Architecture
Test-time compute
Spending more compute at inference time — longer reasoning, more samples, search — to get higher accuracy without retraining the model.
🏗️Architecture
Tokenization
The process of splitting text into subword units (tokens) that LLMs consume — a word like "tokenization" might become two or three tokens depending on the model's tokenizer.
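A toy illustration using greedy longest-match against a hand-written vocabulary. Real tokenizers learn their vocabulary from data with algorithms such as byte-pair encoding, so the vocabulary and the resulting split here are assumptions for demonstration only.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


vocab = {"token", "ization", "ize"}
pieces = greedy_tokenize("tokenization", vocab)
```

Joining the pieces always reproduces the original text, which is the property a tokenizer needs for lossless decoding.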
🏗️Architecture
Top-p sampling
A sampling strategy that draws from the smallest set of tokens whose cumulative probability is at least p, trimming the long tail without a hard top-k cutoff.
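The filtering step can be sketched as follows, treating the cutoff as "cumulative probability of at least p". The token probabilities are made up, and `top_p_filter` is an illustrative name.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set reaching mass p, then renormalize.

    probs maps token -> probability; returns the same shape for the kept tokens.
    """
    kept = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
nucleus = top_p_filter(probs, p=0.9)
```

Unlike top-k, the number of surviving tokens adapts to the distribution: a confident model keeps few candidates, an uncertain one keeps many.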
🏗️Architecture
Transformer
The neural network architecture behind nearly every modern LLM. It uses self-attention to process sequences in parallel, replacing the older RNN approach for language modeling.
🏗️Architecture
Tree of thoughts
A reasoning pattern where the LLM explores multiple solution paths in parallel as a tree, evaluates partial paths, and backtracks — outperforming linear chain-of-thought on hard problems.
🏗️Architecture
World model
An internal predictive representation of the environment that an agent uses to simulate the outcomes of candidate actions before acting — central to 2026 frontier-agent research.
🏗️Architecture
Zero-shot learning
An LLM's ability to perform a task it has never been explicitly trained on or shown examples of — relying entirely on the model's pre-training and the prompt instructions.