Deployment terms
How agents get hosted, paid for, and shipped to production.
- 🚀 Deployment · AI agent framework
A library or toolkit for building AI agents — providing primitives for tool calling, planning, memory, and orchestration so you do not rebuild the agent loop from scratch.
- 🚀 Deployment · AI drift
The phenomenon where an AI system's behavior changes over time without explicit code changes — caused by model version updates, training data shifts, or vendor-side changes.
- 🚀 Deployment · AI pilot
A time-boxed, scope-limited deployment of an AI agent against a real workflow to measure quality, cost, and adoption before broader rollout. The standard 2026 enterprise procurement pattern.
- 🚀 Deployment · Batch inference
Running model inference asynchronously over a large batch of inputs, trading higher latency for lower cost. OpenAI and Anthropic batch APIs are typically 50% cheaper than synchronous calls.
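To make the cost trade-off in the batch inference entry concrete, here is a minimal sketch of the arithmetic. The 50% discount comes from the definition above; the request volume, tokens per request, and per-million-token price are hypothetical placeholders, not quoted vendor rates.

```python
def batch_savings(num_requests: int, tokens_per_request: int,
                  sync_price_per_mtok: float, batch_discount: float = 0.5) -> dict:
    """Compare synchronous vs. batch cost for the same workload.

    sync_price_per_mtok is a hypothetical price per million tokens.
    batch_discount=0.5 mirrors the "typically 50% cheaper" figure.
    """
    total_mtok = num_requests * tokens_per_request / 1_000_000
    sync_cost = total_mtok * sync_price_per_mtok
    batch_cost = sync_cost * (1 - batch_discount)
    return {
        "sync_cost": sync_cost,
        "batch_cost": batch_cost,
        "savings": sync_cost - batch_cost,
    }


# Hypothetical workload: 100,000 requests of 2,000 tokens each at $4 per million tokens.
estimate = batch_savings(100_000, 2_000, sync_price_per_mtok=4.0)
print(estimate)
```

The saving only materializes if the workload can tolerate results arriving later, which is why batch jobs suit offline tasks such as evaluation runs or bulk classification.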
- 🚀 Deployment · BYO key
A deployment pattern where you supply your own model API key to the agent: token costs are billed to you directly, and the agent vendor charges only for the software.
- 🚀 Deployment · Dense retrieval
The standard modern retrieval approach where queries and documents are encoded as dense embedding vectors and matched by similarity — distinct from sparse retrieval (BM25, keyword search).
- 🚀 Deployment · Edge AI
AI that runs on the device where data is generated (phone, laptop, IoT, vehicle, factory floor) rather than in a remote data center. It trades model size for lower latency, better privacy, and offline operation.
- 🚀 Deployment · Hybrid search
A retrieval technique that combines vector (semantic) search with keyword (lexical) search, fusing the scores to get higher precision than either alone. The 2026 production-grade default for RAG.
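The score fusion described in the hybrid search entry can be sketched as follows. This is a toy illustration under stated assumptions: the two-dimensional vectors are hand-written stand-ins for real embeddings, the keyword score is simple term overlap rather than BM25, and the names `hybrid_search` and `alpha` are hypothetical. Scores are min-max normalized so the two signals are comparable before a weighted sum.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query, text):
    """Fraction of query terms that appear in the text (a stand-in for BM25)."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so vector and keyword scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_search(query, query_vec, docs, alpha=0.5):
    """Rank docs by alpha * vector score + (1 - alpha) * keyword score."""
    vec = min_max([cosine(query_vec, d["vec"]) for d in docs])
    kw = min_max([keyword_score(query, d["text"]) for d in docs])
    fused = [alpha * v + (1 - alpha) * k for v, k in zip(vec, kw)]
    ranked = sorted(zip(docs, fused), key=lambda pair: pair[1], reverse=True)
    return [(d["id"], score) for d, score in ranked]


# Hypothetical corpus with hand-made vectors; a real system would call an embedding model.
docs = [
    {"id": "a", "text": "reset your password from the login page", "vec": [0.9, 0.1]},
    {"id": "b", "text": "billing invoices are emailed monthly", "vec": [0.1, 0.9]},
    {"id": "c", "text": "account recovery steps for locked users", "vec": [0.8, 0.3]},
]
results = hybrid_search("password reset", [1.0, 0.0], docs)
print(results)
```

Document "c" shares no keywords with the query but still ranks above "b" because its vector is close to the query vector, which is the case hybrid search is meant to handle. Production systems often fuse ranks instead of raw scores (for example with reciprocal rank fusion), which avoids the normalization step.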
- 🚀 Deployment · Inference-time compute
Spending more compute at inference (longer reasoning chains, multiple samples, search) to improve quality on hard problems — the architectural bet of 2025–2026 reasoning models.
- 🚀 Deployment · Local LLM
A large language model running entirely on hardware you control — your laptop, your server, or your data center — with no calls to external APIs.
- 🚀 Deployment · Model serving
The infrastructure layer that hosts a model and exposes inference over HTTP — covering batching, scheduling, KV-cache management, and request routing.
- 🚀 Deployment · No-code AI
AI tools that let non-engineers build agents, workflows, or applications via visual interfaces — drag-and-drop, prompts, or declarative configuration instead of writing code.
- 🚀 Deployment · On-prem
A deployment where the agent runs entirely on infrastructure the customer controls — no agent code or customer data leaves the customer's network.
- 🚀 Deployment · Open source agent
An agent whose source code is publicly licensed (MIT, Apache, AGPL) — you can self-host, fork, and audit.
- 🚀 Deployment · Private inference
Running LLM inference inside your security perimeter (VPC, on-prem, confidential compute) so prompts and outputs never leave your control. Often required in regulated industries.
- 🚀 Deployment · Semantic chunking
A document-splitting technique that uses embeddings to detect semantic boundaries. It typically produces more coherent chunks for RAG than fixed-size chunking, which improves retrieval quality.
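One common way to detect the semantic boundaries mentioned in the entry above is to embed consecutive sentences and start a new chunk whenever their similarity drops below a threshold. The sketch below assumes that approach; `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 0.3 threshold is an arbitrary illustrative value.

```python
import math

# Hypothetical vocabulary for the toy embedding; a real system would use an embedding model.
VOCAB = ["gpu", "memory", "cache", "invoice", "billing", "payment"]


def toy_embed(text):
    """Count vocabulary words in the text (stand-in for a real embedding)."""
    words = text.lower().replace(".", "").split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_chunks(sentences, embed, threshold=0.3):
    """Group consecutive sentences; split where adjacent similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            chunks.append([sentence])    # topic shift: start a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The GPU memory holds the cache.",
    "Cache size limits GPU throughput.",
    "Billing sends one invoice per month.",
    "Each invoice lists every payment.",
]
chunks = semantic_chunks(sentences, toy_embed)
for chunk in chunks:
    print(chunk)
```

A fixed-size splitter could cut between the second and third words of any sentence; here the split lands where the topic changes from GPU caching to billing.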
- 🚀 Deployment · Semantic routing
A routing technique that uses embedding similarity to send each request to the right model, agent, or workflow — instead of brittle keyword rules or expensive LLM classifiers.
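The routing step in the entry above can be sketched as a nearest-centroid lookup. Everything named here is an assumption for illustration: the route names, the three-dimensional reference vectors, the `min_similarity` cutoff, and the `general` fallback. In practice each reference vector would be the embedding of example utterances for that route.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


# Hypothetical routes with hand-made reference vectors.
ROUTES = {
    "billing": normalize([0.9, 0.1, 0.0]),
    "support": normalize([0.1, 0.9, 0.1]),
}


def route(query_vec, routes, min_similarity=0.5, fallback="general"):
    """Return the route whose reference vector is most similar to the query.

    Falls back when no route clears min_similarity, so unfamiliar
    requests are not forced into an unrelated handler.
    """
    q = normalize(query_vec)
    best_name, best_score = fallback, min_similarity
    for name, reference in routes.items():
        score = sum(a * b for a, b in zip(q, reference))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


print(route([0.8, 0.2, 0.0], ROUTES))
print(route([0.0, 0.0, 1.0], ROUTES))
```

The fallback threshold is the main design choice: set it too low and unrelated requests get misrouted, too high and most traffic lands in the fallback.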
- 🚀 Deployment · SGLang
An LLM inference and programming framework optimized for structured generation, agent workloads, and complex prompting patterns — competitive with vLLM on throughput and faster on JSON/grammar-constrained output.
- 🚀 Deployment · Small language model
A capable LLM in the 1B–13B parameter range, trained to compete with frontier models on specific tasks while running on consumer hardware or at a fraction of frontier cost.
- 🚀 Deployment · Streaming inference
Serving LLM outputs token-by-token as they're generated, typically over SSE or WebSocket — the default deployment pattern for any user-facing AI in 2026.
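As a rough illustration of token-by-token delivery over SSE, the sketch below frames each token as a `data:` event and reassembles the text on the client side. It runs entirely in memory with no HTTP server. The JSON payload shape and the `[DONE]` end marker are assumptions borrowed from a common vendor convention, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as a server-sent event: a 'data:' line plus a blank line."""
    for token in tokens:
        # JSON-encode the payload so tokens containing newlines stay on one line.
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream marker


def read_stream(events: Iterable[str]) -> str:
    """Client side: decode events as they arrive and rebuild the full text."""
    parts = []
    for event in events:
        payload = event.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        parts.append(json.loads(payload)["token"])
    return "".join(parts)


text = read_stream(sse_events(["Hel", "lo", " world"]))
print(text)
```

A user-facing client would render each token as it is decoded rather than waiting for the joined string, which is what makes streamed responses feel faster even when total generation time is unchanged.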
- 🚀 Deployment · TCO
Total cost of ownership — the all-in cost of running an agent including subscription, token spend, ops time, and integration work.
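The four components named in the TCO entry can be added up as in the sketch below. The field names and every figure are hypothetical, and the sketch assumes that integration work is a one-time cost while subscription, token spend, and ops time recur monthly.

```python
from dataclasses import dataclass


@dataclass
class AgentCosts:
    """Hypothetical cost inputs for one agent deployment."""
    subscription_per_month: float
    token_spend_per_month: float
    ops_hours_per_month: float
    integration_hours: float  # assumed to be a one-time cost
    hourly_rate: float


def total_cost_of_ownership(costs: AgentCosts, months: int) -> float:
    """Sum recurring costs over the period plus one-time integration work."""
    monthly = (
        costs.subscription_per_month
        + costs.token_spend_per_month
        + costs.ops_hours_per_month * costs.hourly_rate
    )
    one_time = costs.integration_hours * costs.hourly_rate
    return monthly * months + one_time


# Illustrative numbers only.
costs = AgentCosts(
    subscription_per_month=500.0,
    token_spend_per_month=300.0,
    ops_hours_per_month=10.0,
    integration_hours=80.0,
    hourly_rate=100.0,
)
first_year = total_cost_of_ownership(costs, months=12)
print(first_year)
```

In this example the subscription is only $6,000 of a $29,600 first-year total, which is the point of the term: the listed price can be a minority of the real cost.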
- 🚀 Deployment · vLLM
A high-throughput open-source LLM inference engine. It pioneered PagedAttention, which manages the KV cache like virtual memory and dramatically improves GPU utilization for serving open models.
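The virtual-memory analogy in the vLLM entry can be illustrated with a toy block allocator. This is not vLLM's implementation or API; the class and method names are hypothetical. It only shows the core idea: the cache is split into fixed-size blocks, each sequence keeps a table of the blocks it owns, and blocks are claimed on demand instead of being reserved up front for the maximum sequence length.

```python
class PagedKVCache:
    """Toy model of paged KV-cache bookkeeping (no tensors, just block accounting)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id: str):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or the sequence is new): claim a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("seq-a")  # three tokens need two blocks of size two
print(len(cache.block_tables["seq-a"]), len(cache.free_blocks))
cache.free("seq-a")
print(len(cache.free_blocks))
```

Because at most one partially filled block is wasted per sequence, more sequences fit in the same GPU memory than with contiguous per-sequence reservations.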