🚀Deploymentalso: model serving, llm serving, inference serving

Model servingdefinition and how it works in 2026

Model serving: The infrastructure layer that hosts a model and exposes inference over HTTP — covering batching, scheduling, KV-cache management, and request routing.

Model serving is the production-grade hosting of LLMs. It's the layer between "I have model weights" and "developers can call an API." A serving stack handles request batching, GPU scheduling, KV-cache management, tensor parallelism, speculative decoding, autoscaling, and observability.

In 2026, the standard self-hosted stacks are vLLM (the de-facto open-source serving engine), SGLang (structured-generation-optimized), TensorRT-LLM (NVIDIA-optimized), and a few proprietary stacks (Together, Anyscale, AWS SageMaker). Hosted serving (OpenAI, Anthropic, Google Vertex, Bedrock) is the alternative for teams that don't want to operate the infrastructure.

The build-vs-buy decision for model serving comes down to scale + privacy + cost. Below ~$50K/month in inference spend, hosted APIs almost always win on TCO. Above that, self-hosted starts to compete — especially with privacy or sovereignty requirements.

Frequently asked

When should I self-host model serving?+

When you have $50K+/month of inference, privacy/sovereignty requirements, or need custom models the hosted APIs don't support. Below that scale, hosted APIs (OpenAI, Anthropic, etc.) usually beat self-hosted on TCO.

vLLM, SGLang, or TensorRT-LLM — which to pick?+

vLLM for general-purpose serving, broadest model support. SGLang for workloads heavy on structured outputs or RAG-style multi-step generation. TensorRT-LLM for NVIDIA-optimized peak throughput when you can sacrifice flexibility.

Frequently asked

Related terms