KISTI TensorRT-LLM Inference Benchmarks

Throughput (tokens/sec) · higher is better

GPU	Model	Precision	PP	TP	ISL	OSL	Requests	Throughput (tok/s)	Version

1. Offline (Throughput) vs Online (Serving / Latency)

Feature	Offline Benchmarking	Online Benchmarking
Goal	Peak throughput (Tokens/sec)	Low latency (TTFT) + Throughput
Typical Load	Massive, simultaneous batches	Dynamic/streaming, concurrent requests
Key optimizations	Pre-built engine, large KV cache	In-flight batching, paged attention
Network	Minimal or no network overhead	Measures network/API latency
Tools	`trtllm-bench`	Server: `trtllm-serve`; Client: `AIPerf`

Offline: measures maximum throughput (tokens per second) by processing large batches of requests simultaneously, with no network latency and no request-rate limit.
Online: measures end-to-end performance under realistic streaming load, where requests arrive continuously and need a low time to first token (TTFT) and a high per-user token rate.
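
The two modes reduce to different arithmetic over the same per-request timestamps. A minimal sketch in Python, assuming a hypothetical `RequestTrace` record; the names here are illustrative and are not part of `trtllm-bench` or `AIPerf`:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (all times in seconds)."""
    sent_at: float          # when the client submitted the request
    first_token_at: float   # when the first output token arrived
    done_at: float          # when the last output token arrived
    output_tokens: int      # number of generated tokens


def output_throughput(traces):
    """Offline-style metric: total output tokens / wall-clock span of the run."""
    start = min(t.sent_at for t in traces)
    end = max(t.done_at for t in traces)
    return sum(t.output_tokens for t in traces) / (end - start)


def mean_ttft(traces):
    """Online-style metric: average time to first token across requests."""
    return mean(t.first_token_at - t.sent_at for t in traces)


# Two requests submitted together, each producing 128 tokens in 4 seconds.
traces = [
    RequestTrace(sent_at=0.0, first_token_at=0.5, done_at=4.0, output_tokens=128),
    RequestTrace(sent_at=0.0, first_token_at=0.75, done_at=4.0, output_tokens=128),
]
print(output_throughput(traces))  # 64.0
print(mean_ttft(traces))          # 0.625
```

An offline run reports only the first number. An online run also reports the second, since a high aggregate token rate can hide slow first tokens for individual users.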

2. Workflow Comparison

TensorRT-LLM v0.14.0 is the last version that supports V100 GPUs.
- Only weight-only quantization is supported on V100; GPTQ, AWQ, SmoothQuant, and INT8 KV cache are not.
- Reference: TensorRT-LLM issue #200.
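
The V100 constraints above can be encoded as a pre-flight check before an engine build. A minimal sketch, assuming V100 is identified by CUDA compute capability 7.0 and that versions are plain `major.minor.patch` strings. The quantization labels (`"weight_only"`, `"gptq"`, and so on) are hypothetical names for this example, not TensorRT-LLM option names:

```python
V100_COMPUTE_CAPABILITY = (7, 0)   # Volta / V100
LAST_V100_VERSION = (0, 14, 0)     # last V100-capable release, per the note above

# Hypothetical labels for this sketch; not TensorRT-LLM option names.
V100_UNSUPPORTED = {"gptq", "awq", "smoothquant", "int8_kv_cache"}


def parse_version(version):
    """Turn '0.14.0' into (0, 14, 0); assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def check_v100_config(compute_capability, trtllm_version, quantization):
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    if compute_capability != V100_COMPUTE_CAPABILITY:
        return problems  # the rules below apply to V100 only
    if parse_version(trtllm_version) > LAST_V100_VERSION:
        problems.append(f"TensorRT-LLM {trtllm_version} is newer than 0.14.0")
    if quantization in V100_UNSUPPORTED:
        problems.append(f"{quantization} is not supported on V100")
    return problems


print(check_v100_config((7, 0), "0.14.0", "weight_only"))  # []
print(check_v100_config((7, 0), "0.15.0", "awq"))          # two problems reported
```

Returning a list rather than raising on the first failure lets a benchmark script report every incompatibility in one pass.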

3. llama3.py