Testing was performed on models with quantized weights from NVIDIA's
Model Optimizer HuggingFace Collection.
• PP: Pipeline parallelism (multi-node inference)
• TP: Tensor parallelism (multi-GPU inference)
• ISL: Benchmark input sequence length
• OSL: Benchmark output sequence length
• Requests: The number of requests to generate for dataset generation
• For shorter ISL/OSL combinations, a larger number of requests was used so that the system reached a steady state, because requests enter and exit the system at a much faster rate.
• For longer ISL/OSL combinations, requests remain in the system longer, so fewer requests are needed to reach a steady state.
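As a rough illustration of how the terms above fit together, the following Python sketch models one benchmark configuration. The class name `BenchmarkCase`, its fields, and every numeric value are hypothetical and chosen only to show the trend described above (short sequences paired with many requests, long sequences with fewer). They are not the settings used in the testing. The GPU count assumes PP and TP are the only parallelism dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark configuration, using the terms defined above (hypothetical type)."""
    pp: int        # pipeline-parallel size
    tp: int        # tensor-parallel size
    isl: int       # input sequence length, in tokens
    osl: int       # output sequence length, in tokens
    requests: int  # number of requests generated for the dataset

    @property
    def gpus(self) -> int:
        # Assumes PP and TP are the only parallelism dimensions in use.
        return self.pp * self.tp

    @property
    def tokens_per_request(self) -> int:
        # Tokens handled per request: prompt plus generated output.
        return self.isl + self.osl


# Illustrative values only: shorter sequences get more requests, longer get fewer.
short_case = BenchmarkCase(pp=1, tp=1, isl=128, osl=128, requests=30000)
long_case = BenchmarkCase(pp=1, tp=1, isl=2048, osl=2048, requests=1500)

for case in (short_case, long_case):
    print(f"ISL/OSL={case.isl}/{case.osl} GPUs={case.gpus} requests={case.requests}")
```

Keeping ISL, OSL, and the request count in one record makes it easier to see that the request count is tuned per sequence-length pair rather than fixed across the whole benchmark.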