🚀

Deployment terms

How agents get hosted, paid for, and shipped to production.

🚀Deployment
AI agent framework
A library or toolkit for building AI agents — providing primitives for tool calling, planning, memory, and orchestration so you do not rebuild the agent loop from scratch.
🚀Deployment
AI drift
The phenomenon where an AI system's behavior changes over time without explicit code changes — caused by model version updates, training data shifts, or vendor-side changes.
🚀Deployment
AI pilot
A time-boxed, scope-limited deployment of an AI agent against a real workflow to measure quality, cost, and adoption before broader rollout. The standard 2026 enterprise procurement pattern.
🚀Deployment
Batch inference
Running model inference asynchronously over a large batch of inputs, accepting higher latency in exchange for lower cost. The OpenAI and Anthropic batch APIs are typically priced about 50% below synchronous calls.
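The cost trade-off can be sketched with simple arithmetic. The function name, the per-million-token prices, and the token volumes below are hypothetical placeholders rather than any vendor's actual rates; only the roughly 50% discount comes from the definition above.

```python
def batch_savings(input_tokens, output_tokens,
                  price_in_per_mtok, price_out_per_mtok,
                  batch_discount=0.5):
    """Return (sync_cost, batch_cost) in the same currency as the prices.

    Prices are per million tokens. batch_discount=0.5 models the
    "about 50% cheaper" figure; check your provider's current pricing.
    """
    sync_cost = (input_tokens / 1e6) * price_in_per_mtok \
        + (output_tokens / 1e6) * price_out_per_mtok
    batch_cost = sync_cost * (1 - batch_discount)
    return sync_cost, batch_cost


# Hypothetical workload: 10M input tokens, 2M output tokens,
# at made-up prices of 3.00 and 15.00 per million tokens.
sync_cost, batch_cost = batch_savings(10_000_000, 2_000_000, 3.0, 15.0)
print(f"sync: {sync_cost:.2f}  batch: {batch_cost:.2f}")
```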
🚀Deployment
BYO key
A deployment pattern where you supply your own model API key to the agent. Token costs are billed to you directly by the model provider, and the agent vendor charges only for the software.
🚀Deployment
Dense retrieval
The standard modern retrieval approach where queries and documents are encoded as dense embedding vectors and matched by similarity — distinct from sparse retrieval (BM25, keyword search).
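A minimal sketch of the matching step, assuming the embeddings already exist. The three-dimensional vectors and document IDs are hand-written toy values for illustration; a real system would get its vectors from an embedding model and usually search them with an approximate nearest-neighbor index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_search(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values, not produced by a real model).
DOC_VECS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "gpu-sizing": [0.1, 0.9, 0.2],
    "oncall-runbook": [0.0, 0.3, 0.9],
}

for doc_id, score in dense_search([0.8, 0.2, 0.1], DOC_VECS):
    print(doc_id, round(score, 3))
```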
🚀Deployment
Edge AI
AI that runs on the device where data is generated — phone, laptop, IoT, vehicle, factory floor — rather than in a remote data center. Trades model size for latency, privacy, and offline operation.
🚀Deployment
Hybrid search
A retrieval technique that combines vector (semantic) search with keyword (lexical) search, fusing the scores to get higher precision than either alone. The 2026 production-grade default for RAG.
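One common way to fuse the two result lists is reciprocal rank fusion (RRF), which combines ranks rather than raw scores and so avoids normalizing scores from different scales. The sketch below is illustrative only: the document IDs are hypothetical, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken alphabetically by doc ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that both retrievers return ("a" and "c" here) rise above documents found by only one of them.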
🚀Deployment
Inference-time compute
Spending more compute at inference (longer reasoning chains, multiple samples, search) to improve quality on hard problems — the architectural bet of 2025–2026 reasoning models.
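The "multiple samples" option can be illustrated with majority voting over several sampled answers, often called self-consistency. This is a sketch of the voting step only: the sample list is a hard-coded stand-in for answers that would come from repeated model calls.

```python
from collections import Counter


def majority_vote(samples):
    """Return (most common answer, share of samples that agree with it)."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Hypothetical final answers from five independent samples of one prompt.
samples = ["42", "42", "41", "42", "40"]
answer, agreement = majority_vote(samples)
print(answer, agreement)
```

More samples cost more tokens, so the agreement share is a useful signal for deciding when extra sampling stops paying off.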
🚀Deployment
Local LLM
A large language model running entirely on hardware you control — your laptop, your server, or your data center — with no calls to external APIs.
🚀Deployment
Model serving
The infrastructure layer that hosts a model and exposes inference over a network API, typically HTTP or gRPC. It covers batching, scheduling, KV-cache management, and request routing.
🚀Deployment
No-code AI
AI tools that let non-engineers build agents, workflows, or applications via visual interfaces — drag-and-drop, prompts, or declarative configuration instead of writing code.
🚀Deployment
On-prem
A deployment where the agent runs entirely on infrastructure the customer controls — no agent code or customer data leaves the customer's network.
🚀Deployment
Open source agent
An agent whose source code is released under an open-source license (for example MIT, Apache 2.0, or AGPL), so you can self-host, fork, and audit it.
🚀Deployment
Private inference
Running LLM inference inside your security perimeter (VPC, on-prem, confidential compute) so prompts and outputs never leave your control. Often a hard requirement in regulated industries.
🚀Deployment
Semantic chunking
A document-splitting technique that uses embeddings to detect semantic boundaries. It produces more coherent chunks for RAG than fixed-size chunking, which tends to improve retrieval quality.
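A minimal sketch of one way to detect boundaries: start a new chunk whenever the similarity between adjacent sentence embeddings drops below a threshold. The two-dimensional vectors and the 0.5 threshold are toy assumptions; real embeddings come from a model, and the threshold needs tuning per corpus.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, vectors, threshold=0.5):
    """Group consecutive sentences; split where adjacent similarity < threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


# Hypothetical sentences with hand-written toy embeddings.
sentences = [
    "Refunds take five days.",
    "Refunds go to the original card.",
    "GPUs need 24 GB of memory.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]
print(semantic_chunks(sentences, vectors))
```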
🚀Deployment
Semantic routing
A routing technique that uses embedding similarity to send each request to the right model, agent, or workflow — instead of brittle keyword rules or expensive LLM classifiers.
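A minimal sketch of the idea: each route is represented by an exemplar embedding, and a request goes to the most similar route unless nothing clears a minimum score. The route names, toy vectors, threshold, and fallback label are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query_vec, routes, min_score=0.8, fallback="general"):
    """Pick the route whose exemplar is most similar to the query.

    Returns `fallback` when no route scores above min_score.
    """
    best_name, best_score = fallback, min_score
    for name, exemplar in routes.items():
        score = cosine(query_vec, exemplar)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Toy exemplar embeddings (assumed values, one per route).
ROUTES = {"billing": [1.0, 0.0], "coding": [0.0, 1.0]}

print(route([0.9, 0.2], ROUTES))  # close to the billing exemplar
print(route([0.5, 0.5], ROUTES))  # ambiguous, falls back
```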
🚀Deployment
SGLang
An LLM inference and programming framework optimized for structured generation, agent workloads, and complex prompting patterns. It is generally competitive with vLLM on throughput and is often faster on JSON or grammar-constrained output, though results vary by model and workload.
🚀Deployment
Small language model
A capable LLM in roughly the 1B–13B parameter range, trained to approach frontier-model quality on specific tasks while running on consumer hardware or at a fraction of frontier-model cost.
🚀Deployment
Streaming inference
Serving LLM outputs token-by-token as they're generated, typically over SSE or WebSocket — the default deployment pattern for any user-facing AI in 2026.
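On the client side, an SSE stream is a sequence of text lines in which `data:` lines carry the payload and a blank line ends each event. The parser below is a simplified sketch of that framing: it ignores the `event:`, `id:`, and `retry:` fields, and the sample stream is made up rather than taken from any particular provider.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event.

    `lines` is any iterable of text lines, e.g. a decoded HTTP response body.
    Multiple data: lines in one event are joined with newlines.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (starting with ":") and other fields are skipped here.


# Hypothetical stream: two token events with a keep-alive comment between them.
stream = ["data: Hel\n", "\n", ": keep-alive\n", "data: lo\n", "\n"]
print("".join(iter_sse_data(stream)))
```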
🚀Deployment
TCO
Total cost of ownership — the all-in cost of running an agent including subscription, token spend, ops time, and integration work.
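As a rough worked example, the four components can be summed into a monthly figure. All numbers are hypothetical, and spreading the one-off integration cost evenly over a fixed number of months is an assumption of this sketch, not part of the definition.

```python
def monthly_tco(subscription, token_spend, ops_hours, hourly_rate,
                integration_cost, amortize_months):
    """Monthly total cost of ownership for one agent.

    integration_cost is a one-off amount spread evenly over amortize_months
    (an assumption; other amortization schemes are equally valid).
    """
    ops_cost = ops_hours * hourly_rate
    integration_per_month = integration_cost / amortize_months
    return subscription + token_spend + ops_cost + integration_per_month


# Hypothetical figures: 500 subscription, 1200 tokens, 10 ops hours at 90/hour,
# and a 12000 integration project amortized over 12 months.
print(monthly_tco(500, 1200, 10, 90, 12000, 12))
```

In this example the subscription is a small share of the total, which is why comparing agents on list price alone can mislead.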
🚀Deployment
vLLM
A high-throughput open-source LLM inference engine. It introduced PagedAttention, which manages the KV cache in fixed-size blocks much as an operating system pages virtual memory, substantially improving GPU memory utilization when serving open models.
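The paging idea can be illustrated with a toy block allocator. This is a conceptual sketch only, not vLLM's implementation or API: the class and method names are invented, and it tracks block IDs without storing any actual key or value tensors.

```python
class PagedKVCache:
    """Toy bookkeeping for a KV cache split into fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block IDs
        self.block_tables = {}  # seq_id -> list of physical block IDs
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block_id, slot)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or there is none yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")
cache.append_token("b")
print(cache.block_tables)
```

Because blocks are allocated on demand, a sequence never reserves memory for its maximum possible length, and blocks from finished sequences can be reused immediately.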