Budget GPUs for CUDA development: a short buying guide

by Martin D. Maas, Ph.D

In this post we go through some important considerations on how to pick a budget GPU for CUDA development.

GPU computing has been all the rage for the last few years, and that is a trend which is likely to continue in the future. From machine learning and scientific computing to computer graphics, there is a lot to be excited about in the area, so it makes sense to be a little worried about missing out on the potential benefits of GPU computing in general, and of CUDA, the dominant framework, in particular.

But how can you get your feet wet in CUDA development? Do you need to buy an expensive GPU? Should you try to do everything in the cloud, leveraging the free tier for experimentation and then using a large instance for the large runs?

If you are serious about GPU computing, it’s going to take a lot of studying and experimenting. Besides, once you are developing a real application, you’ll need at least one computer you fully control (which free tiers that grant access to 1/10 of a card do not provide) in order to run thorough tests and understand your code’s performance. Once the application is well tested in a development environment, the intensive runs can of course be moved to the cloud or to a computing cluster.

So, what is a reasonable budget GPU that we should choose to start developing CUDA applications?

GPU vs CPU

CUDA has been around for more than a decade, but that doesn’t mean it makes sense to get a 10-year-old GPU. After all, probably the first thing to understand is the CPU vs GPU trade-off. If you get a GPU that is too old and underpowered, you’ll see no benefit over your current CPU, which can be confusing. On the other hand, if the card you get is far more powerful than your current CPU, it could lead you to think that GPUs are just fantastic for every possible task.

A possible approach to get a sound understanding of the bang-for-buck of GPUs vs CPUs is to spend the same money on a GPU as your current CPU goes for. So, if you have a $200 CPU, look for a $200 GPU. If you do just that, and run your own benchmarks for the type of application you are interested in, you’ll get a fair estimate of the speed-up per invested dollar.
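
To make this concrete, here is the kind of minimal timing comparison I have in mind: a single-precision SAXPY timed on the CPU with std::chrono and on the GPU with CUDA events. The kernel and problem size are just placeholders, so swap in an operation representative of your own workload.

```cpp
// saxpy_bench.cu -- minimal sketch of a CPU-vs-GPU timing comparison.
// The SAXPY kernel and array size are placeholders; adjust to your workload.
#include <cstdio>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;                       // ~16M elements
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // --- CPU timing ---
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i) y[i] = 2.0f * x[i] + y[i];
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpu_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // --- GPU timing (kernel only, excludes host<->device transfers) ---
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("CPU: %.2f ms, GPU: %.2f ms, speed-up: %.1fx\n",
           cpu_ms, gpu_ms, cpu_ms / gpu_ms);

    cudaFree(dx); cudaFree(dy);
    return 0;
}
```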

Consumer-grade vs old data-center GPUs

Server-grade GPUs will typically be 4-5 years older than a similarly-priced gaming GPU, so one of the most basic questions is whether you should get one of those, like the Tesla P100, as a budget GPU for CUDA development.

There are a couple of downsides to the older server-grade GPUs, most importantly:

  • Data-center GPUs don’t have a display output, so if you intend to use a monitor on your development PC, you would need an extra GPU just to use for display. Alternatively, you could set up your GPU in a server and connect to it remotely.
  • Older GPUs have higher power consumption.
  • Older GPUs offer fewer features to try out, such as tensor cores. This is reflected in the so-called compute capability, which is 6.0 for the Tesla P100, for example.
  • Cooling can be challenging: server-grade cards expect the chassis to provide airflow, so they usually don’t ship with their own fans.

Now, on the positive side for server-grade GPUs, there is a big difference in double-precision floating-point performance.

Arguably, mostly for market segmentation reasons, NVIDIA deliberately caps the F64 throughput of consumer-grade GPUs at 1/32 or even 1/64 of their F32 throughput. In server-grade GPUs, the ratio is about 1/2, as you would normally expect. So if raw double-precision (not just single-precision) performance is one of your goals, then you should definitely skip the consumer-grade GPUs.
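
If you want to see this ratio on your own card, a rough sketch like the following (a register-resident FMA loop, timed once in F32 and once in F64) will expose it. The file and kernel names are my own, and since the loop overhead is included in the timing, the printed ratio is only an approximation of the nominal F32:F64 factor.

```cpp
// fp_ratio.cu -- rough sketch for estimating the F32:F64 throughput ratio.
#include <cstdio>
#include <cuda_runtime.h>

// Each thread performs ITERS fused multiply-adds in registers and writes
// the result out so the compiler cannot eliminate the loop.
template <typename T>
__global__ void fma_loop(T* out, T a, T b) {
    const int ITERS = 1 << 16;
    T x = a + threadIdx.x;        // slightly different start per thread
    for (int i = 0; i < ITERS; ++i)
        x = x * a + b;            // one FMA per iteration
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

template <typename T>
float time_kernel(const char* label) {
    const int blocks = 1024, threads = 256;
    T* out;
    cudaMalloc(&out, blocks * threads * sizeof(T));

    fma_loop<T><<<blocks, threads>>>(out, (T)1.000001, (T)0.5);  // warm-up
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_loop<T><<<blocks, threads>>>(out, (T)1.000001, (T)0.5);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%s: %.2f ms\n", label, ms);
    cudaFree(out);
    return ms;
}

int main() {
    float f32 = time_kernel<float>("F32");
    float f64 = time_kernel<double>("F64");
    printf("F64 is roughly %.0fx slower than F32 on this device\n", f64 / f32);
    return 0;
}
```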

CUDA version support and tensor cores

Before hunting for very cheap gaming GPUs just to try them out, another thing to consider is whether those GPUs are supported by the latest CUDA version. To find out, you need to check the architecture (or equivalently, the major version of the compute capability) of the different NVIDIA cards.
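
If you already have an NVIDIA card at hand and just want to know where it falls in the table below, a few lines against the CUDA runtime API will print its compute capability (the deviceQuery sample bundled with the Toolkit reports the same information):

```cpp
// device_query.cu -- print the compute capability of each visible GPU,
// so you can match it against the architecture table below.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```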

For convenience, I’ve compiled the following information from various online sources, including the NVIDIA website:

| Architecture | CUDA support | Compute Capability | Tensor Core Precision |
|---|---|---|---|
| Fermi | CUDA 3.2 until CUDA 8 | 2.0 | No |
| Kepler | CUDA 5 until CUDA 10 | 3.0 | No |
| Maxwell | CUDA 6 until CUDA 11 | 5.0 | No |
| Pascal | CUDA 8 and later | 6.0 | No |
| Volta | CUDA 9 and later | 7.0 | F16 |
| Turing | CUDA 10 and later | 7.5 | F16 |
| Ampere | CUDA 11.1 and later | 8.0 | F64 |
| Lovelace | CUDA 11.8 and later | 8.9 | F64 |
| Hopper | CUDA 12 and later | 9.0 | F64 |

In a few words, Fermi, Kepler, and Maxwell cards are unsupported by the current version of the CUDA Toolkit. To use CUDA with them, you would have to install an older version of the Toolkit, which is doable, but can lead to software compatibility problems down the line, so I would personally avoid those models to save myself a lot of headaches on the development side.

Another important feature to consider is tensor cores, which are compute cores specialized in matrix-matrix multiplication. Tensor cores were introduced with the Volta architecture, and back then provided only half-precision arithmetic. They gained notoriety because machine learning frameworks such as TensorFlow and PyTorch customarily leverage them to provide so-called mixed-precision training (see for example the TensorFlow documentation on this).

Interestingly, since the Ampere architecture was introduced, 3rd generation tensor cores provide accelerated matrix multiplication in full double-precision arithmetic.
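
To give an idea of what this looks like in code, here is a minimal sketch of the kind of call that those frameworks issue under the hood: a GEMM with F16 inputs and F32 accumulation through cublasGemmEx, which cuBLAS is free to route through the tensor cores on Volta and newer cards. I’m assuming CUDA 11 or later, where the compute type is a cublasComputeType_t.

```cpp
// gemm_fp16.cu -- sketch of a mixed-precision GEMM through cuBLAS.
// Compile with: nvcc gemm_fp16.cu -lcublas   (CUDA 11+ assumed)
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;                        // square matrices for simplicity
    std::vector<__half> a(n * n, __float2half(1.0f));
    std::vector<__half> b(n * n, __float2half(1.0f));
    std::vector<float>  c(n * n, 0.0f);

    __half *da, *db; float *dc;
    cudaMalloc(&da, n * n * sizeof(__half));
    cudaMalloc(&db, n * n * sizeof(__half));
    cudaMalloc(&dc, n * n * sizeof(float));
    cudaMemcpy(da, a.data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // F16 inputs, F32 accumulation: the classic mixed-precision setup.
    float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, da, CUDA_R_16F, n, db, CUDA_R_16F, n,
                 &beta,  dc, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaMemcpy(c.data(), dc, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f (expected %d)\n", c[0], n);

    cublasDestroy(handle);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```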

So the minimum compute capability you should aim for depends on the type of CUDA development you intend to do. If you want to leverage low-precision tensor cores, there are some budget cards (such as the RTX 2060/2080 series) you could buy. If low-precision arithmetic is not your cup of tea (for example, if numerical stability concerns outweigh the potential benefit of accelerated matrix multiplication), you could either consider cheaper alternatives without tensor cores at all (such as the GTX 1060/1080 series), or go straight to the Ampere architecture (RTX 3060/3090 series).

Now let’s look at more detailed specs of different gaming GPUs.

CUDA cores, memory bandwidth, and other specs of NVIDIA GeForce cards

There are many relevant specs to check out when buying a GPU for CUDA development. For reference, I compiled the following table from various online sources, with some of the most relevant specs for the most common GeForce models.

| Card | Compute Capability | CUDA Cores | Tensor Cores | Memory | Memory bandwidth |
|---|---|---|---|---|---|
| GeForce GTX 1050 | 6.1 | 640 | - | 2 GB | 112 GB/s |
| GeForce GTX 1060 | 6.1 | 1,280 | - | 6 GB | 192 GB/s |
| GeForce GTX 1070 | 6.1 | 1,920 | - | 8 GB | 256 GB/s |
| GeForce GTX 1070 Ti | 6.1 | 2,432 | - | 8 GB | 256 GB/s |
| GeForce GTX 1080 | 6.1 | 2,560 | - | 8 GB | 320 GB/s |
| GeForce GTX 1080 Ti | 6.1 | 3,584 | - | 11 GB | 484 GB/s |
| GeForce RTX 2060 | 7.5 | 1,920 | 240 | 6 GB | 336 GB/s |
| GeForce RTX 2070 | 7.5 | 2,304 | 288 | 8 GB | 448 GB/s |
| GeForce RTX 2080 | 7.5 | 2,944 | 368 | 8 GB | 448 GB/s |
| GeForce RTX 2080 Ti | 7.5 | 4,352 | 544 | 11 GB | 616 GB/s |
| GeForce RTX 3060 | 8.6 | 3,584 | 112 | 8/12 GB | 360 GB/s |
| GeForce RTX 3060 Ti | 8.6 | 4,864 | 328 | 8 GB | 448 GB/s |
| GeForce RTX 3070 | 8.6 | 5,888 | 184 | 8 GB | 448 GB/s |
| GeForce RTX 3070 Ti | 8.6 | 6,144 | 192 | 8 GB | 608 GB/s |
| GeForce RTX 3080 | 8.6 | 8,960 | 272 | 10/12 GB | 760 GB/s |
| GeForce RTX 3080 Ti | 8.6 | 10,240 | 320 | 12 GB | 912 GB/s |
| GeForce RTX 3090 | 8.6 | 10,496 | 328 | 24 GB | 936 GB/s |
| GeForce RTX 3090 Ti | 8.6 | 10,752 | 336 | 24 GB | 936 GB/s |
| GeForce RTX 4070 Ti | 8.9 | 7,680 | 240 | 12 GB | 288 GB/s |
| GeForce RTX 4080 | 8.9 | 9,728 | 304 | 16 GB | 716 GB/s |
| GeForce RTX 4090 | 8.9 | 16,384 | 512 | 24 GB | 1 TB/s |

Not included in the above table is the clock speed, for a very simple reason: it sits around 1500 MHz all the way from the GTX 1070 to the RTX 3090. Only with the RTX 40 series did GPU clocks start to see some improvement, at around 2300 MHz for the 4070 Ti and up.

The memory bandwidth refers to that of the GPU’s own memory, not to the speed of CPU-GPU communication, which is substantially slower. In a few words, the memory bandwidth specifies how fast data in GPU memory can reach the CUDA cores. It’s a very important metric, as it can quickly become the bottleneck in low-arithmetic-intensity computations that touch a lot of memory, leaving your CUDA cores idle while they wait for data to arrive.
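
You can measure the effective bandwidth of your own card with a purely memory-bound kernel; below is a minimal sketch (the bandwidthTest sample that ships with the CUDA Toolkit does this more thoroughly). The copy kernel reads and writes each element once, so the traffic is two times four bytes per element.

```cpp
// bandwidth.cu -- sketch of measuring effective device-memory bandwidth
// with a simple copy kernel; the numbers are rough.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 26;                    // 64M floats = 256 MB per buffer
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    copy<<<(n + 255) / 256, 256>>>(in, out, n);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each element is read once and written once: 2 * 4 bytes of traffic.
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(in); cudaFree(out);
    return 0;
}
```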

What I’ve learned from my GeForce GTX 1070 Ti

GPU prices vary sharply, and I bought mine at a time when they were all sky-high. I was being very cheap, and didn’t even want to replace my 550 W power supply, so I bought a used GeForce GTX 1070 Ti from a friend and had to replace one cooler plus the thermal paste to ensure durability, for a total cost of around $100.

The first benchmark I ran showed a speed-up of between 10 and 60 times in single precision versus the AMD Ryzen 5 1400 4-core CPU I had at the time. The comparison isn’t completely fair, because for $100 I could have gotten a much better CPU. I later upgraded my CPU, but spent more than $100. I estimate that $100 worth of CPU would have bought roughly a 2X improvement over my older CPU, so a fair dollar-to-dollar comparison would put the speed-up at between 5 and 30 times.

I do regret a little bit not having tensor cores to experiment with, but I’m not sure I would have paid 3X just for that privilege. I wanted to begin my journey into CUDA development with a general-purpose GPU and saw the tensor cores as too specialized to begin with. Also, my typical workload does not revolve exclusively around massive matrix-matrix multiplications, but rather around integration of ODEs, FFTs, and custom interpolation routines, so I’m still not even sure it would have been worth it.

Now, what I absolutely regret is that the double-precision performance of this GPU is typically worse than that of my CPU. This is simply because it is a consumer-grade GPU, and NVIDIA, while allowing F64 operations, deliberately throttles their performance. I didn’t know that when I bought it, and it has become a problem for me, because in some of the applications I’m interested in, being limited to single precision can cause big headaches.

The Tesla P100… My next budget CUDA card?

I fortunately got my feet wet in CUDA development for under $100, and bought a card that can still be used for gaming ;-). However, for my next card I still don’t want to break the bank, I don’t want to be limited to single-precision, and I also don’t mind ignoring the Tensor cores for a while.

With this in mind, and considering that an interesting old server-grade GPU, the Tesla P100, has finally come down in price (it can be found for around $200 on the second-hand market), I think that could be my next CUDA development card.

For completeness, let’s add its specs to the comparison.

| Card | Compute Capability | CUDA Cores | Tensor Cores | Memory | Memory bandwidth |
|---|---|---|---|---|---|
| Tesla P100 | 6.0 | 3,584 | - | 16 GB | 732 GB/s |

When compared to the GeForce GTX 1080 Ti, the Tesla P100 has about 1.5X the memory (16 GB vs 11 GB) and about 1.5X the memory bandwidth (732 GB/s vs 484 GB/s). And at 5.3 TeraFLOPS of F64, it seems it can be up to 10X faster than a similarly-priced CPU, which will typically top out at a few hundred GigaFLOPS in double precision.

As the Tesla P100 doesn’t have a display output, it could be paired either with a CPU that has integrated graphics, or installed in a proper home server that I would connect to remotely, like the ones some home-server enthusiasts are building using X99 motherboards with old Xeon CPUs.

While the Tesla P100 is a budget solution to the single-precision limitation typical of consumer-grade GPUs, it needs to be installed on a server, which in particular requires proper cooling — and server cooling is typically very noisy.

What about AMD Radeon Pro VII?

For those of us running scientific simulations, the limitation of consumer-grade GPUs to single-precision can be very frustrating or, at least, a big potential source for headaches.

While not CUDA-based, there is an exception to this rule: AMD’s Radeon VII and Radeon Pro VII. These are consumer-grade GPUs which can reach 3.52 TFLOPS and 6.5 TFLOPS in double precision, respectively. These cards can be installed in an ordinary PC, and come with standard cooling and video output.