
Consumer GPUs to run LLMs

Not sure if this is the right place; if not, please let me know.

GPU prices in the US have been a horrific bloodbath thanks to the scalpers recently. So for this discussion, let's keep it to MSRP and the lucky people who managed to afford those insane MSRPs and actually find the GPU they wanted.

Which GPU are you using, and which LLMs are you running on it? How is their performance? On average, what size of model can you run smoothly on your GPU (7B, 14B, 20-24B, etc.)?

What GPU do you recommend for a decent amount of VRAM vs. price (MSRP)? If you're using a TOTL RX 7900 XTX/4090/5090 with 24+ GB of VRAM, comment below with some performance estimates too.

My use case: code assistants for Terraform plus general shell and YAML, plain chat, and some image generation. Also being able to still pay rent after spending all my savings on a GPU with a pathetic amount of VRAM (LOOKING AT BOTH OF YOU, BUT ESPECIALLY YOU NVIDIA YOU JERK). I'd prefer to stay under $600 if possible, but I also want to run models like Mistral Small, so I suppose I have no choice but to spend a huge sum of money.

Thanks


You can probably tell that I'm not very happy with the current consumer PC market, but I decided to post in case we find any gems in the wild.

39 comments
    • I don't mind multiple GPUs, but my motherboard doesn't have 2+ electrically connected x16 slots. I could build a new home server (I've been thinking about it), but consumer platforms simply don't have the PCIe lanes for two actual x16 slots. I'd have to go back to Broadwell Xeons for that, and those are really power hungry. Oh well, I don't think it matters considering how power hungry GPUs are now.

      • I haven't looked into the issue of PCIe lanes and the GPU.

        In theory it shouldn't matter much on a narrower PCIe link, if I understand correctly (unlikely). The only time a lot of data moves over the bus is when the model layers are initially loaded. With Oobabooga, for example, when I load a model my desktop RAM monitor widget usually doesn't even have time to refresh and show how much memory was used on the CPU side. What sits in the GPU is around 90% static. I have a script that monitors VRAM usage so I can tune the maximum number of layers to offload (rough sketch at the end of this reply). I leave headroom for the context to build up over time, but nothing major changes after the initial load. You just set the number of layers to offload to the GPU and load the model; however many seconds that takes is an irrelevant startup delay that only happens once when the server starts.

        So assuming the kernel modules and hardware support the narrower link, it should work... I think. There are laptops that can drive an external GPU over Thunderbolt too, so I don't think a full x16 slot is too baked in.
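
        A minimal sketch of that kind of monitor, assuming an NVIDIA card with nvidia-smi on the PATH (the query flags are standard nvidia-smi options; the 5-second poll interval is just an arbitrary choice):

            import subprocess
            import time

            def vram_usage_mib():
                # Ask nvidia-smi for used/total VRAM in MiB, one line per GPU.
                out = subprocess.check_output(
                    [
                        "nvidia-smi",
                        "--query-gpu=memory.used,memory.total",
                        "--format=csv,noheader,nounits",
                    ],
                    text=True,
                )
                used, total = (int(x) for x in out.splitlines()[0].split(","))
                return used, total

            if __name__ == "__main__":
                # Poll while the model loads and the context fills up, so you can
                # see how much headroom a given layer-offload setting leaves.
                while True:
                    used, total = vram_usage_mib()
                    print(f"VRAM: {used} / {total} MiB ({used / total:.0%})")
                    time.sleep(5)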

  • Thing is, you can trade speed for quality. For coding support you can settle for Llama 3.2 or a smaller deepseek-r1 and still get most of what you need on a smaller GPU, then scale up to a bigger, slower model when you need something cleaner. I've had a small laptop with 16 GB of total memory and a 4060 mobile serving as a makeshift home server with an LLM and a few other things, and... well, it's not instant, but I can get the sort of thing you need out of it.

    Sure, if I'm digging in and want something faster I can run something else on my bigger desktop GPU, but a lot of the time I don't have to.

    Like I said below, though, I'm in the process of trying to move that over to an Arc A770 with 16 GB of VRAM that I had just lying around; I saw it on sale for a couple hundred bucks when I needed a temporary GPU replacement for a smaller PC. I've tried running LLMs on it before and it's not... super fast, but it'll handle 14B models just fine. That's going to be your sweet spot on home GPUs anyway; anything larger than 16 GB and you're talking 3090, 4090, or 5090, pretty much exclusively. (Sketch of the kind of local setup I mean below.)
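
    A minimal sketch of running a 14B model locally, assuming llama-cpp-python built with a GPU backend (CUDA on NVIDIA, SYCL or Vulkan for the A770) and a quantized GGUF already on disk; the model filename and prompt are just placeholders:

        from llama_cpp import Llama  # pip install llama-cpp-python (GPU-enabled build)

        llm = Llama(
            model_path="some-14b-instruct-q4_k_m.gguf",  # placeholder: any ~14B quantized GGUF
            n_gpu_layers=-1,  # offload every layer that fits; lower this on smaller cards
            n_ctx=4096,       # modest context to leave VRAM headroom
        )

        resp = llm.create_chat_completion(
            messages=[{"role": "user",
                       "content": "Write a Terraform aws_s3_bucket resource with versioning enabled."}]
        )
        print(resp["choices"][0]["message"]["content"])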
