Inference Engineering | Quan H. Nguyen

Cover of *Inference Engineering*

Available for free via Philip Kiely’s website or Baseten’s website (digital and paper copies available).

Inference is the most valuable category in AI, but inference engineering is still in its infancy.

Inference engineers work across the stack from CUDA to Kubernetes in pursuit of faster, less expensive, more reliable serving of generative AI models in production.

While the potential and impact of inference are becoming clear, the space is young. There are relatively few people working on inference, and newcomers can become experts quickly. There are opportunities to solve novel, interesting, and deeply technical problems at every level of the stack.

*Inference Engineering* is your guide to becoming an expert in inference. This book is based on the hundreds of thousands of words of documentation, blogs, and talks I’ve published on inference; interviews with dozens of experts from Baseten’s engineering team; and countless conversations with customers and builders around the world.

Chapter 1: Prerequisites

Scale and Specialization
About Your App
- AI-Native Applications
- Online versus Offline
- Consumer versus B2B
Model Selection
- Model Evaluation
- Fine-Tuning for Domain-Specific Quality
- Distillation
Measuring Latency and Throughput
- Latency Percentiles
- End-to-End Metrics

Chapter 2: Architecture

Neural Networks
- Linear Layers and Matmul
- Activation Functions
LLM Inference Mechanics
- LLM Architecture
- Transformer Blocks
- Attention
- Mixture of Experts Models
Image Generation Inference Mechanics
- Image Generation Model Architecture
- Few-Step Image Generation Models
- Video Generation
Calculating Inference Bottlenecks
- Ops:Byte Ratio and Arithmetic Intensity
- LLM Inference Bottlenecks
- Image Generation Inference Bottlenecks
Optimizing Attention

Chapter 3: Hardware

GPU Architecture
- Compute
- Memory and Caches
GPU Architecture Generations
- Hopper GPUs
- Ada Lovelace GPUs
- Blackwell GPUs
- Rubin GPUs
- Grace and Vera CPUs
Instances
- Multi-GPU Instances
- Multi-Instance GPUs
Other Datacenter Accelerator Options
Local Inference
- Desktop Inference
- Mobile Inference

Chapter 4: Software

CUDA
- CUDA Kernels for Inference
- CUDA Kernel Selection
- Reducing Memory Accesses with Kernel Fusion
Deep Learning Frameworks and Libraries
- PyTorch
- Model File Formats
- ONNX Runtime and TensorRT
- Transformers and Diffusers
Inference Engines
- vLLM
- SGLang
- TensorRT-LLM
NVIDIA Dynamo
Performance Benchmarking and Load Testing
- Performance Benchmarking Tooling
- Performance Benchmarking Tips
- Profiling Performance

Chapter 5: Techniques

Quantization
- Number Formats
- Quantization Approaches
- Measuring Quality Impact
Speculative Decoding
- Draft-Target Speculative Decoding
- Medusa
- EAGLE
- N-gram Speculation and Lookahead Decoding
Caching
- Prefix Caching and KV Cache Re-Use
- Where to Store the KV Cache
- Cache-Aware Routing
- Long Context Handling
Model Parallelism
- Tensor Parallelism for Lower Latency
- Expert Parallelism for Higher Throughput
- Multi-Node Inference
Disaggregation
- How Disaggregation Works
- When to Use Disaggregation
- Dynamic Disaggregation with NVIDIA Dynamo

Chapter 6: Modalities

Vision Language Models
- Video Processing for Vision Language Models
- Omni-Modal Models
Embedding Models
- Embedding Model Architecture
- Embedding Model Inference
ASR Models
- Single-Chunk Latency Optimization
- Long File Latency Optimization
- Diarization
TTS Models
- Streaming Real-Time Text to Speech
- Speech-to-Speech Models
Image Generation Models
- Image Generation Kernel Optimization
- One Weird Trick for Faster Image Generation
Video Generation Models
- Attention Optimization and Quantization
- Context Parallelism

Chapter 7: Production

Containerization
- Dependency Management
- NIMs
Autoscaling
- Concurrency and Batch Sizing
- Cold Starts
- Routing, Load Balancing, and Queueing
- Scale to Zero
- Independent Component Scaling
Multi-Cloud Capacity Management
- GPU Procurement
- Geo-Aware Load Balancing
- Building for Reliability
- Security and Compliance
Testing and Deployment
- Zero-Downtime Deployment
- Cost Estimation
- Observability
Client Code
- Client Latency Overhead
- Asynchronous Inference
- Streaming and Protocol Support
Production Inference with Baseten