Budget GPUs for CUDA development: a short buying guide
In this post we go through the main considerations for picking a budget GPU for CUDA development.
GPU computing has been all the rage for the last few years, and that is a trend which is likely to continue. From machine learning and scientific computing to computer graphics, there is a lot to be excited about in the area, so it makes sense to be a little worried about missing out on the potential benefits of GPU computing in general, and of CUDA, the dominant framework, in particular.
But how can you get your feet wet in CUDA development? Do you need to buy an expensive GPU? Should you try to do everything in the cloud, leveraging the free tier for experimentation and then using a large instance for the large runs?
If you are serious about GPU computing, it’s going to take a lot of learning and experimenting. Besides, once you are developing a real application, you’ll need at least one computer you fully control (which free tiers that grant access to 1/10 of a card do not provide) to run thorough tests and understand your code’s performance. Once the application is well tested in a development environment, the intensive runs can indeed be performed in the cloud or on a computer cluster.
So, what is a reasonable budget GPU that we should choose to start developing CUDA applications?
GPU vs CPU
CUDA has been around for more than a decade, but that doesn’t mean it makes sense to get a 10-year-old GPU. After all, probably the first thing to try to understand is the CPU vs GPU trade-off. If you get a GPU that is too old and underpowered, you’ll see no benefits over your current CPU, which can be confusing. On the other hand, if the card you get is far more powerful than your current CPU, it could lead you to think that GPUs are just fantastic for every possible task.
A possible approach to get a sound understanding of the bang-for-buck of GPUs vs CPUs is to spend the same money on a GPU as what your current CPU goes for. So, if you have a $200 CPU, look for a $200 GPU. If you do just that, and run your own benchmarks for the type of application you are interested in, you’ll get a fair estimate of the speed-up factor per invested dollar.
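The rule of thumb above is easy to turn into a number. A minimal sketch (the benchmark timings below are made-up placeholders, not real measurements):

```python
def speedup_per_dollar(cpu_seconds: float, gpu_seconds: float,
                       cpu_price: float, gpu_price: float) -> float:
    """Speed-up factor of GPU over CPU, normalized by relative price."""
    speedup = cpu_seconds / gpu_seconds
    return speedup * (cpu_price / gpu_price)

# With a $200 CPU and a $200 GPU the price ratio is 1, so the metric
# reduces to the raw speed-up of the benchmark:
print(speedup_per_dollar(cpu_seconds=120.0, gpu_seconds=8.0,
                         cpu_price=200.0, gpu_price=200.0))  # 15.0
```

If the GPU costs twice as much as the CPU, the same raw speed-up counts for half as much, which is exactly the point of price-matching the comparison.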
Gaming vs Data-center GPUs
One of the basic considerations is whether to buy one of the older data-center GPU models, like the Tesla P100, as a budget GPU for CUDA development. However, I believe buying old server-grade GPUs is generally a bad idea for a budget CUDA-development GPU, for the following reasons:
- Data-center GPUs don’t have a display output, so if you intend to use a monitor on your development PC, you would need an extra GPU just for display. Alternatively, you could set up the GPU in a server and connect to it remotely, but this extra work is hardly worthwhile if your goal is to learn CUDA.
- A server-grade GPU will typically be 4-5 years older than a similarly-priced gaming GPU, which translates into higher power consumption and, more importantly, fewer modern features to try out, such as tensor cores.
CUDA version support and tensor cores
Before looking for very cheap gaming GPUs just to try them out, another thing to consider is whether those GPUs are supported by the latest CUDA version. To check this, look up the architecture (or, equivalently, the major version of the compute capability) of the different NVIDIA cards.
For convenience, I’ve compiled the following information from various online sources, including the NVIDIA website:
| Architecture | Compute Capability | CUDA Toolkit Support | Tensor Core Precision |
|---|---|---|---|
| Fermi | 2.x | CUDA 3.2 until CUDA 8 | — |
| Kepler | 3.x | CUDA 5 until CUDA 10 | — |
| Maxwell | 5.x | CUDA 6 until CUDA 11 | — |
| Pascal | 6.x | CUDA 8 and later | — |
| Volta | 7.0 | CUDA 9 and later | FP16 |
| Turing | 7.5 | CUDA 10 and later | FP16, INT8, INT4 |
| Ampere | 8.0 / 8.6 | CUDA 11.1 and later | FP64, TF32, BF16, FP16, INT8 |
| Ada Lovelace | 8.9 | CUDA 11.8 and later | FP8, FP16, BF16, TF32, INT8 |
| Hopper | 9.0 | CUDA 12 and later | FP64, TF32, FP8, FP16, BF16, INT8 |
In a few words, Fermi, Kepler, and Maxwell cards are unsupported by the current version of the CUDA Toolkit. This means that in order to use CUDA, you would have to install an older version of the CUDA Toolkit, which is doable, but can lead to software compatibility problems down the line, so I would personally avoid those models to save myself a lot of potential headaches on the development side.
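The support windows above are easy to encode in a small lookup, which is handy when eyeing a second-hand card. A sketch in Python (the mapping is my own summary of the table above, not an official API, so double-check against the CUDA release notes):

```python
# Per architecture: (compute capability major version,
#                    last CUDA Toolkit major version that supports it,
#                    or None if still supported).
SUPPORT = {
    "Fermi":   (2, 8),
    "Kepler":  (3, 10),
    "Maxwell": (5, 11),
    "Pascal":  (6, None),
    "Volta":   (7, None),
    "Turing":  (7, None),
    "Ampere":  (8, None),
}

def supported_by(arch: str, toolkit_major: int) -> bool:
    """True if the given CUDA Toolkit major version still supports `arch`."""
    last = SUPPORT[arch][1]
    return last is None or toolkit_major <= last

print(supported_by("Maxwell", 12))  # False: needs CUDA 11 or older
print(supported_by("Pascal", 12))   # True
```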
Another important feature to consider is tensor cores, which are compute cores specialized for matrix-matrix multiplication. Tensor cores were introduced with the Volta architecture, and back then provided only half-precision arithmetic. They gained notoriety because machine learning frameworks such as TensorFlow and PyTorch customarily leverage them to provide so-called mixed-precision training (see, for example, the TensorFlow documentation on the topic).
Interestingly, since the Ampere architecture was introduced, third-generation tensor cores provide accelerated matrix multiplication in full double-precision arithmetic.
So the minimum compute capability you should aim for depends on the type of CUDA development you intend to do. If you want to leverage low-precision tensor cores, there are some budget cards (such as the RTX 2060/2080 series) you could buy. If low-precision arithmetic is not your cup of tea (for example, if numerical stability concerns outweigh the potential benefit of accelerated matrix multiplication), you could either consider cheaper alternatives without tensor cores at all (such as the GTX 1060/1080 series), or go straight to the Ampere architecture (RTX 3060/3090 series).
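To see concretely why low precision raises numerical-stability concerns, here is a tiny host-side sketch using Python’s `struct` module, which can round-trip values through the same IEEE half-precision (FP16) format that first-generation tensor cores operate on:

```python
import struct

def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# FP16 has a 10-bit mantissa, so above 2048 consecutive integers collide:
print(round_to_fp16(2049.0))  # 2048.0
```

This kind of rounding is exactly why mixed-precision training keeps the accumulators and master weights in FP32 while doing the bulk multiplications in FP16.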
Now let’s look at more detailed specs of different gaming GPUs.
CUDA cores, memory bandwidth, and other specs of NVIDIA GeForce cards
There are many relevant specs to check out when buying a GPU for CUDA development. For reference, I compiled the following table from various online sources, with some of the most relevant specs for the most common GeForce models.
| Model | CUDA Cores | Tensor Cores | Memory (GB) | Memory Bandwidth (GB/s) |
|---|---|---|---|---|
| GeForce GTX 1050 | 640 | — | 2 | 112 |
| GeForce GTX 1060 | 1280 | — | 6 | 192 |
| GeForce GTX 1070 | 1920 | — | 8 | 256 |
| GeForce GTX 1070 Ti | 2432 | — | 8 | 256 |
| GeForce GTX 1080 | 2560 | — | 8 | 320 |
| GeForce GTX 1080 Ti | 3584 | — | 11 | 484 |
| GeForce RTX 2060 | 1920 | 240 | 6 | 336 |
| GeForce RTX 2070 | 2304 | 288 | 8 | 448 |
| GeForce RTX 2080 | 2944 | 368 | 8 | 448 |
| GeForce RTX 2080 Ti | 4352 | 544 | 11 | 616 |
| GeForce RTX 3060 | 3584 | 112 | 12 | 360 |
| GeForce RTX 3060 Ti | 4864 | 152 | 8 | 448 |
| GeForce RTX 3070 | 5888 | 184 | 8 | 448 |
| GeForce RTX 3070 Ti | 6144 | 192 | 8 | 608 |
| GeForce RTX 3080 | 8704 | 272 | 10 | 760 |
| GeForce RTX 3080 Ti | 10240 | 320 | 12 | 912 |
| GeForce RTX 3090 | 10496 | 328 | 24 | 936 |
| GeForce RTX 3090 Ti | 10752 | 336 | 24 | 1008 |
| GeForce RTX 4070 Ti | 7680 | 240 | 12 | 504 |
| GeForce RTX 4080 | 9728 | 304 | 16 | 717 |
| GeForce RTX 4090 | 16384 | 512 | 24 | 1008 |
Not included in the above table is the clock speed, for a very simple reason: it sits around 1,500 MHz all the way from the GTX 1070 to the RTX 3090. Only with the RTX 40 series did the GPU clock start to see some improvement, at about 2,300 MHz for the 4070 Ti and up.
The memory bandwidth refers to that of the GPU itself, not to the speed of CPU-GPU communication, which is substantially slower. In a few words, the memory bandwidth specifies how fast data in GPU memory can reach the CUDA cores. It’s a very important metric, as it can quickly become the bottleneck in low-arithmetic-intensity computations that touch a lot of memory, leaving your CUDA cores idle while waiting for data to arrive.
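A back-of-the-envelope illustration: consider a SAXPY-like update, y = a*x + y, which performs 2 flops per element but moves 12 bytes (read x, read y, write y), so its runtime is set almost entirely by memory bandwidth. A rough estimate in Python (idealized: it ignores launch overhead and assumes the kernel is purely bandwidth-bound):

```python
def saxpy_time_seconds(n: int, bandwidth_gb_s: float) -> float:
    """Lower bound on runtime of y = a*x + y over n float32 elements,
    assuming the kernel is limited only by memory bandwidth."""
    bytes_moved = 12 * n  # 4-byte read of x, read of y, write of y
    return bytes_moved / (bandwidth_gb_s * 1e9)

# On a GTX 1070 Ti (~256 GB/s), streaming 100 million elements takes about:
t = saxpy_time_seconds(100_000_000, 256.0)
print(f"{t * 1e3:.1f} ms")  # ~4.7 ms, no matter how many CUDA cores sit idle
```

Doubling the CUDA core count would not make this kernel any faster; only a card with higher memory bandwidth would.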
Final remarks: what to buy now and what I bought
GPU prices vary sharply, and I bought mine at a time when all were super-high.
If you only want to get your feet wet with CUDA development, going from CPU programming with 4, 8, or even 12 cores to several thousand cores is a real game changer, and it will probably teach you all the CUDA fundamentals you want to learn. The bare minimum GPU you want for this is probably one of the 1060 series, which can be bought second-hand for about $100, or even less.
In my case, I was not interested in the potential headaches of low-precision arithmetic, which ruled out the 2060 series, and the 3060 series was a little too expensive at the time. Not only were prices really high back then (they have recently come down to some extent), but the increased power draw meant I would also have needed to replace my 550 W power supply with a larger one.
In the end, I bought a used GeForce GTX 1070 Ti from a friend for about $100, and had to replace one cooler plus the thermal paste to ensure durability.
I will certainly write more about this in later posts, but the first benchmark I ran showed a speed-up of between 10 and 60 times over my AMD Ryzen 5 1400 4-core CPU (which is a little dated, I know). So it was good enough for my needs at the moment.
I do regret a little bit not having tensor cores to experiment with, but I’m not sure I would have paid 3x just for that privilege. Also, my typical workload does not revolve around massive matrix-matrix multiplications, but rather integration of ODEs, FFTs, and custom interpolation routines, so I’m not even sure it would have been worth it. I also wanted to begin my journey into CUDA development with a general-purpose GPU, and tensor cores seemed too specialized to begin with.
Now that chip prices seem to have dropped in general, I would probably buy a GeForce RTX 3060, which goes for around $280 on Amazon (or $220 second-hand), just to get my feet wet with the double-precision tensor cores. The GeForce RTX 3060, being among the newer models, is also quite power efficient, so it can be paired with a 550 W / 650 W power supply, depending on your CPU’s power draw.
Depending on the price, stretching out to the RTX 3060 Ti could be worthwhile, as both the CUDA core and tensor core counts rise by about a third (4,864 vs 3,584 CUDA cores; 152 vs 112 tensor cores). Note also that the consumer Ampere cards keep a ratio of about 32 CUDA cores per tensor core, not far from the most expensive server-grade GPUs: the H100 has 18,432 CUDA cores and 640 tensor cores, or almost 29 CUDA cores per tensor core. So the RTX 3060 seems a very sensible choice for a CUDA development card.
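These ratios are easy to recompute from published core counts; a quick sketch (spec counts as I found them, worth double-checking against NVIDIA’s own pages):

```python
# (CUDA cores, tensor cores) per card, from published specs.
cards = {
    "RTX 3060":    (3584, 112),
    "RTX 3060 Ti": (4864, 152),
    "H100":        (18432, 640),
}

for name, (cuda_cores, tensor_cores) in cards.items():
    ratio = cuda_cores / tensor_cores
    print(f"{name}: {ratio:.1f} CUDA cores per tensor core")
```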
In any case, all these cards are perfectly suitable if the goal is to get started with CUDA development. However, given the price fluctuations and the heavy marketing involved, I believe that from a developer’s perspective it is important to keep an eye on the fine details of each model’s specs (like the memory bandwidth!) to see whether the advertised improvements over previous models are worth it for the application at hand.
Personally, I have a growing interest in CUDA development and will probably start posting how-to tutorials and benchmarks, so stay tuned!