NVIDIA recently released the much-anticipated GeForce RTX 30 Series of graphics cards, with the largest and most powerful, the RTX 3090, boasting 24 GB of memory and 10,496 CUDA cores. It is the natural upgrade to 2018’s 24 GB Titan RTX, and we were eager to benchmark the training performance of the latest GPU against the Titan with modern deep learning workloads.
Based on the specs alone, the RTX 3090 offers a big jump in the number of CUDA cores, which should give us a nice speed-up on FP32 tasks. However, NVIDIA cut the number of tensor cores in GA102 (compared to the GA100 found in A100 cards), which may impact FP16 performance.
| | Titan RTX | RTX 3090 |
|---|---|---|
| Architecture | Turing TU102 | Ampere GA102 |
| Memory bandwidth | 672 GB/sec | 936 GB/sec |
- Driver version: 455.23.05
- CUDA version: 11.1
- TensorFlow: tf-nightly 2.4.0.dev20200928
It is very important to use the latest version of CUDA (11.1) and the latest TensorFlow: some features, such as TensorFloat-32 (TF32), are not yet available in a stable release at the time of writing.
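With tf-nightly on an Ampere card, TF32 execution can be toggled through the experimental config API. A minimal sketch (the API lives under `tf.config.experimental` in the nightly builds and may move before a stable release):

```python
import tensorflow as tf

# TF32 runs FP32 matmuls and convolutions on tensor cores with a
# reduced-precision mantissa; on Ampere GPUs it is on by default.
tf.config.experimental.enable_tensor_float_32_execution(True)

# Verify the setting took effect.
print(tf.config.experimental.tensor_float_32_execution_enabled())  # True
```

Disabling it (`enable_tensor_float_32_execution(False)`) is a quick way to check how much of an FP32 speed-up comes from TF32 alone.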
We use our own fork of the Lambda Tensorflow Benchmark which measures the training performance for several deep learning models trained on ImageNet.
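At its core, the benchmark times training steps and reports throughput in images per second. A minimal sketch of that measurement loop, with a placeholder `train_step` standing in for one forward/backward pass (both the function and the batch size are illustrative names of ours, not the harness’s API):

```python
import time

BATCH_SIZE = 64  # hypothetical per-GPU batch size


def train_step():
    # Placeholder for one optimizer update; a real benchmark would run
    # the model's forward and backward pass here.
    pass


def measure_throughput(num_steps=100, warmup=10):
    # Warm-up steps are excluded so one-time costs (graph tracing,
    # cuDNN autotuning) don't skew the average.
    for _ in range(warmup):
        train_step()
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    elapsed = time.perf_counter() - start
    return num_steps * BATCH_SIZE / elapsed  # images/sec


print(f"{measure_throughput():.1f} images/sec")
```

Comparing this images/sec figure between the two cards, per model and per precision, gives the speed-up ratios reported below.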
(Results table: training throughput in images/sec for each model on the Titan RTX and RTX 3090, in FP32 and FP16.)
We’re able to achieve a 1.4-1.6x training speed-up for all models trained with FP32! As expected, the FP16 gains are less pronounced: a 1.0-1.2x speed-up for most models, and a slight drop for Inception.
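For the FP16 runs, the usual route in TensorFlow 2.4 is the Keras mixed-precision policy, which computes in float16 on tensor cores while keeping variables in float32 for numeric stability. A sketch (the benchmark fork may configure this differently):

```python
from tensorflow.keras import mixed_precision

# Layers built after this call compute in float16 but store their
# variables in float32.
mixed_precision.set_global_policy('mixed_float16')

policy = mixed_precision.global_policy()
print(policy.compute_dtype, policy.variable_dtype)  # float16 float32
```

Because FP16 already leans on the tensor cores, the reduced tensor-core count in GA102 is the likely reason the FP16 speed-ups trail the FP32 ones.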
Please get in touch at firstname.lastname@example.org with any questions or comments!