Running LLMs on your computer locally — focus on the hardware!
I’m using a Nvidia GeForce RTX 4060 Ti 16GB VRam GPU (make sure you don’t get the older 8GB VRam model). That gets me into models up to 16 GB, it has cuda and tensor cores and it doesn’t draw too much power. I usually get a good token per second rate and it doesn’t break my bank. I do use the larger bit quantized models for more accuracy and less hallucinations. In addition I’ve text-generation-webui setup, with nice speech-to-text and text-to-speech locally. Yes, my models speak with me in conversation! Also I like LM Studio. I’m starting to write my own python code for integrating with my local run models.
I’ve haven’t gotten into training my own models (my graphics card is a bit weak for that — you seem to need 24GB of VRam for that minimum; an those are expensive) , however I am prompt engineering some of them that are uncensored, and getting good results. I think it is truely amazing that a personal PC can do all this locally! No cloud AI fees for me . . .
It’s been over a year since I have been running AI large language models (LLMs) locally on my computer setup. There are many advantages to doing it this way: savings, privacy, control, just to name a few — plus just learning and fun!
The models are free and can be downloaded and installed from Hugging Face. Your main requirements are powerful enough hardware, that means GPU (VRam mostly), CPU ( core count and speed — which doesn’t have to be fast if you have a lot of cores), Memory (a large main memory), Storage (multiple terabytes if you are trying out a lot of different models).
Some of you have much of this already in your gaming rigs, but running AI locally can demand even more power.
I found this video which really explains it simply, in one sitting, and felt it was good and accurate enough to share. Watch this at the top of this article, and then I will tell you more about my cost-effective hardware system setup. Of course it also helps if you can build your own computers. But if not, you’ll still understand what to buy or upgrade.
So if you have an Apple Macbook with one of the powerful new M chips series then you can try some of the AI software tools and models. However, I’m focusing on hardware, specifically desktop PCs, and GPUs, primarily NVIDIA, and building your own. I’m vendor CPU agnostic, as long as it can do the job.
I have a “used” 2013 Intel Xeon 18 core processor with 64 GB of “new” main memory in a “used” ASUS motherbord (X99 Deluxe II) with BIOS updates in a very big computer case. My power supply has been increased to 1000 watts and I used liquid cooling for the processor. I keep having to add more Samsung SSDs for storage of more AI models as mentioned before. I’m up to 2TB and now considering 4TB SSD, because of all the other things I do with the machine, graphics editing in 360 video, etc.
For graphics processing I have a ASUS ProArt GeForce RTX™ 4060 Ti 16GB OC Edition GDDR6 Graphics Card (PCIe 4.0, 16GB GDDR6, DLSS 3, HDMI 2.1a, DisplayPort 1.4a). This gives me 16 GB of VRam to work with, since I have an older motherboard. Yes, it’s not the fastest gaming card, but it’s what I could afford and the extra VRam really comes in handy for the models above 8GB of VRam, even quantized.
As you can tell my system is not new or even the fastest, but its fast enough and powerful enough to do the job, and I can tell you it cost less too! Plus it can be expanded and grown with more graphics cards. So you can get used Xeon processors with new Chinese motherboards on Aliexpress for very affordable prices. Then slap in a decent graphics card like the one above and your off to the races . . . You can even go dual core motherboards, unfortunately mine isn’t. Absolutely love these videos on the building below as an example of the work involved, nice job guys! Every one of these guys has an interesting and unique tech geek vive . . . and hardware building chops. Use their ideas to build your own.
Update: here is a nice article with a formula for the amount of VRAM needed for very large LLMs. Although the largest models I run are usually over 8GB but less than what will fit in my VRAM: