Deployment terms
How agents get hosted, paid for, and shipped to production.
- AI agent framework
A library or toolkit for building AI agents. It provides primitives for tool calling, planning, memory, and orchestration so you do not rebuild the agent loop from scratch.
- AI drift
The phenomenon where an AI system's behavior changes over time without explicit code changes, caused by model version updates, training data shifts, or vendor-side changes.
- AI pilot
A time-boxed, scope-limited deployment of an AI agent against a real workflow to measure quality, cost, and adoption before broader rollout. The standard 2026 enterprise procurement pattern.
- Batch inference
Running model inference asynchronously over a large batch of inputs, accepting higher latency in exchange for lower cost. OpenAI/Anthropic batch APIs are typically 50% cheaper than sync calls.
- BYO key
A deployment pattern where you supply your own model API key to the agent: token costs are billed to you directly, and the agent vendor charges only for the software.
- Dense retrieval
The standard modern retrieval approach where queries and documents are encoded as dense embedding vectors and matched by similarity, as distinct from sparse retrieval (BM25, keyword search).
- Edge AI
AI that runs on the device where data is generated (phone, laptop, IoT device, vehicle, factory floor) rather than in a remote data center. Trades model size for latency, privacy, and offline operation.
- Hybrid search
A retrieval technique that combines vector (semantic) search with keyword (lexical) search, fusing the scores to get higher precision than either alone. The 2026 production-grade default for RAG.
- Inference-time compute
Spending more compute at inference (longer reasoning chains, multiple samples, search) to improve quality on hard problems; the architectural bet of 2025-2026 reasoning models.
- Local LLM
A large language model running entirely on hardware you control (your laptop, your server, or your data center) with no calls to external APIs.
- Model serving
The infrastructure layer that hosts a model and exposes inference over HTTP, covering batching, scheduling, KV-cache management, and request routing.
- No-code AI
AI tools that let non-engineers build agents, workflows, or applications via visual interfaces: drag-and-drop, prompts, or declarative configuration instead of writing code.
- On-prem
A deployment where the agent runs entirely on infrastructure the customer controls: no agent code or customer data leaves the customer's network.
- Open source agent
An agent whose source code is released under an open-source license (such as MIT, Apache, or AGPL), so you can self-host, fork, and audit it.
- Private inference
Running LLM inference inside your security perimeter (VPC, on-prem, confidential compute) so prompts and outputs never leave your control. Often a requirement in regulated industries.
- Semantic chunking
A document-splitting technique that uses embeddings to detect semantic boundaries. It produces more coherent chunks for RAG than fixed-size chunking, improving retrieval quality.
- Semantic routing
A routing technique that uses embedding similarity to send each request to the right model, agent, or workflow, instead of brittle keyword rules or expensive LLM classifiers.
- SGLang
An LLM inference and programming framework optimized for structured generation, agent workloads, and complex prompting patterns. It is competitive with vLLM on throughput and faster on JSON/grammar-constrained output.
- Small language model
A capable LLM in the 1B-13B parameter range, trained to compete with frontier models on specific tasks while running on consumer hardware or at a fraction of frontier cost.
- Streaming inference
Serving LLM outputs token-by-token as they're generated, typically over SSE or WebSocket; the default deployment pattern for any user-facing AI in 2026.
- TCO
Total cost of ownership: the all-in cost of running an agent including subscription, token spend, ops time, and integration work.
- vLLM
A high-throughput open-source LLM inference engine. It pioneered PagedAttention, which manages the KV cache like virtual memory, dramatically improving GPU utilization when serving open models.
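
To make the Hybrid search entry concrete, here is a minimal sketch of one common fusion method, reciprocal rank fusion (RRF). Note the swap: the entry describes fusing scores, while RRF fuses rank positions, which avoids having to normalize vector and keyword scores onto one scale. The document IDs, the function name, and the constant `k=60` are illustrative assumptions, not part of any particular search product.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in;
    k is a smoothing constant (60 is a commonly used default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a vector index and a keyword (BM25-style) index.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise above documents found by only one retriever, which is the behavior hybrid search is after.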
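
The Semantic routing entry can be sketched in a few lines. This is a toy under stated assumptions: the three-dimensional vectors stand in for real embeddings (which would come from an embedding model), and the route names, the `threshold`, and the `fallback` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical routes, each represented by one reference embedding.
ROUTES = {
    "billing_agent": [0.9, 0.1, 0.0],
    "code_model": [0.1, 0.9, 0.1],
    "general_chat": [0.3, 0.3, 0.9],
}

def route(query_embedding, routes=ROUTES, threshold=0.5, fallback="general_chat"):
    """Return the route most similar to the query, or the fallback if none is close."""
    best_name, best_score = max(
        ((name, cosine(query_embedding, vec)) for name, vec in routes.items()),
        key=lambda pair: pair[1],
    )
    return best_name if best_score >= threshold else fallback

# A query embedding that points mostly along the "code" direction.
choice = route([0.2, 0.8, 0.0])
print(choice)
```

In practice each route would usually hold several example utterances rather than a single vector, and the threshold would be tuned on real traffic.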
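
The TCO and Batch inference entries combine into simple arithmetic. The sketch below adds up the four cost components named in the TCO entry and applies the roughly 50% batch discount to whatever share of tokens can tolerate asynchronous processing. Every number, parameter name, and price here is a made-up assumption for illustration, not real vendor pricing.

```python
def monthly_tco(
    subscription,          # agent software fee per month
    tokens_millions,       # total tokens processed per month, in millions
    price_per_million,     # sync price per million tokens
    ops_hours,             # staff hours spent operating the agent per month
    hourly_rate,           # loaded cost of one staff hour
    integration_cost,      # one-time integration work
    amortize_months,       # months over which integration cost is spread
    batch_share=0.0,       # fraction of tokens sent through a batch API
    batch_discount=0.5,    # assumed discount on batch tokens (50%)
):
    """Estimate all-in monthly cost of running an agent."""
    sync_tokens = tokens_millions * (1 - batch_share)
    batch_tokens = tokens_millions * batch_share
    token_spend = (
        sync_tokens * price_per_million
        + batch_tokens * price_per_million * (1 - batch_discount)
    )
    ops_cost = ops_hours * hourly_rate
    integration_monthly = integration_cost / amortize_months
    return subscription + token_spend + ops_cost + integration_monthly

# Hypothetical workload: all-sync versus half the tokens moved to batch.
all_sync = monthly_tco(500, 100, 4.0, 10, 80, 6000, 12)
half_batch = monthly_tco(500, 100, 4.0, 10, 80, 6000, 12, batch_share=0.5)
print(all_sync, half_batch)
```

Under a BYO key arrangement the `token_spend` term is billed by the model provider rather than the agent vendor, but it still belongs in the total.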