KISTI TensorRT-LLM Inference Benchmarks

Throughput (tokens/sec) · higher is better

GPU Model Precision PP TP ISL OSL Requests Throughput (tok/s) Version
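
In the column headers, PP and TP are the pipeline-parallel and tensor-parallel sizes, ISL and OSL are the input and output sequence lengths in tokens, and Requests is the number of requests in the run. The sketch below shows how a throughput figure could relate to those columns. It assumes throughput counts generated (output) tokens only and that every request produces exactly OSL tokens; the function names and all numbers are hypothetical, not measured results.

```python
def output_throughput(num_requests: int, osl: int, elapsed_s: float) -> float:
    """Generated tokens per second for a fixed-length run (illustrative only)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_requests * osl / elapsed_s


def gpus_used(pp: int, tp: int) -> int:
    """A PP x TP mapping occupies pp * tp GPUs."""
    return pp * tp


# Hypothetical run: 1000 requests, OSL = 128, finished in 64 s with PP=1, TP=2.
tput = output_throughput(num_requests=1000, osl=128, elapsed_s=64.0)
print(f"{tput:.1f} tok/s total, {tput / gpus_used(1, 2):.1f} tok/s per GPU")
```

Dividing by the GPU count is one way to compare rows that use different PP and TP settings on an equal footing.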

1. Offline (Throughput) vs Online (Serving / Latency)

Feature         Offline Benchmarking                Online Benchmarking
Goal            Peak throughput (Tokens/sec)        Low latency (TTFT) + Throughput
Typical Load    Massive, simultaneous batches       Dynamic/streaming, concurrent requests
Key Optimizer   Pre-compiled Engine, High KV Cache  In-flight batching, Paged Attention
Network         Minimal or no network overhead      Measures network/API latency
Tools           trtllm-bench                        Server: trtllm-serve | Client: AIPerf

  • Offline: measures maximum throughput (tokens per second) by processing large batches of requests simultaneously, with no network latency and no request-rate limit.
  • Online: measures end-to-end performance under realistic streaming load, where requests arrive continuously and need a low time to first token (TTFT) and a high per-user output speed.
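
The online metrics named above can be derived from client-side timestamps. The following is a minimal sketch, not the way AIPerf or trtllm-serve computes them: the `RequestTrace` record, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Client-side timestamps (seconds) for one streamed request (hypothetical record)."""
    sent_at: float
    first_token_at: float
    last_token_at: float
    output_tokens: int


def ttft(t: RequestTrace) -> float:
    """Time to first token: delay between sending and the first streamed token."""
    return t.first_token_at - t.sent_at


def per_user_speed(t: RequestTrace) -> float:
    """Tokens per second seen by one user after the first token arrives."""
    decode_time = t.last_token_at - t.first_token_at
    return (t.output_tokens - 1) / decode_time if decode_time > 0 else 0.0


def system_throughput(traces: list[RequestTrace]) -> float:
    """Total output tokens divided by the wall-clock span of the whole run."""
    start = min(t.sent_at for t in traces)
    end = max(t.last_token_at for t in traces)
    return sum(t.output_tokens for t in traces) / (end - start)


# Two made-up overlapping requests.
traces = [
    RequestTrace(sent_at=0.0, first_token_at=0.25, last_token_at=2.25, output_tokens=101),
    RequestTrace(sent_at=0.5, first_token_at=1.0, last_token_at=3.0, output_tokens=101),
]
print("mean TTFT (s):", mean(ttft(t) for t in traces))
print("per-user tok/s:", [per_user_speed(t) for t in traces])
print("system tok/s:", system_throughput(traces))
```

Offline benchmarking reports only the last of these three numbers. Online benchmarking also tracks the first two, because a server can post high aggregate throughput while individual users still wait a long time for the first token.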

2. Workflow Comparison

  • TensorRT-LLM v0.14.0 is the last version that supports V100 GPUs.
    • Only weight-only quantization is supported on V100. GPTQ, AWQ, SmoothQuant, and INT8-KV-cache are not supported.
    • Reference: TensorRT-LLM issue #200.
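
To illustrate what "weight-only" means in the note above: the weights are stored as low-precision integers with a scale factor, while the activations stay in floating point. The sketch below is a conceptual illustration in plain Python, not TensorRT-LLM's kernels or API. The symmetric per-row INT8 scheme and the function names are assumptions made for the example.

```python
def quantize_rows_int8(weights):
    """Symmetric per-row INT8 quantization: q = round(w / scale), scale = max|w| / 127."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Clamp to the symmetric INT8 range after rounding.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def matvec_weight_only(q_rows, scales, x):
    """y = W x with W dequantized on the fly; the activations x are never quantized."""
    return [scale * sum(q * xi for q, xi in zip(row, x))
            for row, scale in zip(q_rows, scales)]


# Small made-up weight matrix and activation vector.
W = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

q, s = quantize_rows_int8(W)
approx = matvec_weight_only(q, s, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print("exact :", exact)
print("approx:", approx)
```

The quantized result differs from the exact one only by a small rounding error, while each stored weight takes 8 bits instead of 16.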

3. llama3.py