Thread 'Running local AI inference without tying up my GPU 24/7 – any better options?'

Author	Message
nzt2048 New member Joined: 15 Jun 26 Posts: 1	Message 119403 - Posted: 15 Jun 2026, 19:05:03 UTC

Hey everyone,

I've been crunching BOINC on a rig with a decent RTX card (keeping it dedicated mostly to GPU projects like Einstein or whatever's available), but I've also been experimenting with local LLMs for some personal/research stuff: things like coding assistance, data analysis, and creative writing tasks.

The problem is obvious: running even moderately sized models for inference locally eats GPU memory and power pretty quickly, especially if you want decent speed. I don't have the budget or space for multiple high-end cards right now, and I hate throttling my BOINC contribution just to chat with a model.

Has anyone found good ways to offload AI inference without self-hosting everything? Quantization helps a bit (I've tried GGUF with llama.cpp and Ollama), but for bigger or uncensored models it's still a pain on consumer hardware.

Update: Just found icelake.io and have been using them (along with some SAM tools) for inference. It's private, no heavy local GPU load, and handles the heavy lifting remotely while giving me full control/flexibility without the usual corporate filters. Pretty handy when I just need quick results without dedicating my rig.

Curious what others in the GPU/BOINC crowd are doing these days: cloud options, remote inference services, or clever local tricks that play nice with volunteer computing? Any recommendations or gotchas I should watch for (power draw, latency, privacy, etc.)?

Thanks!
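
For anyone wanting to put rough numbers on the "quantization helps a bit" point, here is a back-of-the-envelope sketch. It only estimates the memory taken by the model weights (parameter count times bits per weight) and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name and the example figures (a 7B-parameter model, 4.5 bits per weight as a stand-in for a 4-bit quantization) are assumptions chosen for illustration, not measurements from any particular GGUF file.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Hypothetical helper for illustration: it does not account for the
    KV cache, activations, or framework overhead.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed example: a 7B-parameter model at a few precisions.
    # 4.5 bits/weight is an assumed effective rate for a 4-bit quantization.
    for label, bits in [("16-bit", 16.0), ("8-bit", 8.0), ("~4.5-bit", 4.5)]:
        print(f"7B model at {label}: {weight_memory_gib(7, bits):.1f} GiB")
```

Comparing the result against the free VRAM left over while a BOINC GPU task is running gives a quick first check on whether a given model could share the card at all.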

Copyright © 2026 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
