Thread 'Maximize GPU Science Throughput via Parallel Task Saturation'

Message boards : GPUs : Maximize GPU Science Throughput via Parallel Task Saturation


kasdashdfjsah

Joined: 29 Jan 24
Posts: 96
Message 118131 - Posted: 19 Jan 2026, 21:50:56 UTC

Most GPUs and iGPUs are sadly underutilized when running BOINC because they default to running only one task at a time, creating idle gaps while the hardware waits for data. By forcing multiple concurrent tasks, you can fill these gaps and maintain 100% hardware saturation, which is especially important for modern high-core-count silicon.

On my base M4 Mac Mini, I found that jumping straight to 20 tasks caused a system freeze, but backing off to 10 tasks (1 per GPU core) achieved perfect stability and maximum output.

Based on that experience, I would recommend these Safety Tiers for anyone looking to optimize their setup:

For iGPUs (M4, Panther Lake, etc.): run 1 task per GPU core (e.g., 10 tasks for a 10-core chip).
For mid-range discrete GPUs: run 1 task per GB of VRAM (e.g., 10-12 tasks for a 12GB card).
For high-end cards (RTX 4090, etc.): try 1 task per 1,000 CUDA cores (approx. 16 tasks).

To enable this, you need to create an app_config.xml file in your specific project folder. Replace PROJECT_URL with the folder name and APP_NAME with the application's internal name found in your task properties.
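For readability, here is the app_config.xml the commands below write out, with APP_NAME as a placeholder for the application name from your task properties:

```xml
<app_config>
  <app>
    <name>APP_NAME</name>
    <gpu_versions>
      <gpu_usage>0.1</gpu_usage>
      <cpu_usage>0.1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

gpu_usage is the fraction of one GPU each task claims, so 0.1 lets the client schedule 10 tasks on one GPU at once.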

Mac/Linux Terminal (note: a plain "sudo printf ... > file" fails, because the redirection runs as your user rather than root, so pipe through sudo tee instead; on Linux the data directory is usually /var/lib/boinc-client rather than the macOS path shown): cd "/Library/Application Support/BOINC Data/projects/PROJECT_URL/" && printf '<app_config>\n  <app>\n    <name>APP_NAME</name>\n    <gpu_versions>\n      <gpu_usage>0.1</gpu_usage>\n      <cpu_usage>0.1</cpu_usage>\n    </gpu_versions>\n  </app>\n</app_config>\n' | sudo tee app_config.xml

Windows (PowerShell Admin): Set-Location "C:\ProgramData\BOINC\projects\PROJECT_URL"; $xml = '<app_config><app><name>APP_NAME</name><gpu_versions><gpu_usage>0.1</gpu_usage><cpu_usage>0.1</cpu_usage></gpu_versions></app></app_config>'; $xml | Out-File -FilePath "app_config.xml" -Encoding ascii

To apply this, go to Options in the BOINC Manager and click Read config files. You can scale the 0.1 value up or down based on your core count (e.g., 0.05 for 20 tasks).
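To make the gpu_usage scaling concrete, here is a small helper of my own (not part of BOINC) that converts a desired concurrent task count into the fraction to put in app_config.xml, and shows how many CPU cores the matching cpu_usage values reserve in total:

```python
def app_config_fractions(n_tasks: int, cpu_per_task: float = 0.1):
    """Return (gpu_usage, total_cpu_reserved) for n_tasks concurrent GPU tasks.

    BOINC treats gpu_usage as the fraction of one GPU each task claims,
    so running n_tasks at once needs gpu_usage = 1 / n_tasks.
    """
    gpu_usage = 1.0 / n_tasks
    total_cpu = n_tasks * cpu_per_task  # cores BOINC budgets, not actual load
    return gpu_usage, total_cpu

# 10 tasks -> gpu_usage 0.1, one full CPU core budgeted in total
print(app_config_fractions(10))  # (0.1, 1.0)
# 20 tasks -> gpu_usage 0.05
print(app_config_fractions(20))  # (0.05, 2.0)
```

Keep in mind cpu_usage is only a scheduling budget; the real CPU load per task depends on the application.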

Hoping this will increase the total contributed BOINC GPU compute power significantly. Let me know if you have any questions, and I'll try to help out as best I can :)

P.s. Keep an eye on your CPU usage so the GPU doesn't get "starved" of instructions. If your CPU is pinned at 100%, your GPU throughput will actually drop.
Keith Myers
Volunteer tester
Help desk expert

Joined: 17 Nov 16
Posts: 935
United States
Message 118133 - Posted: 19 Jan 2026, 22:33:15 UTC - in response to Message 118131.  

A very silly recommendation that ignores known GPU crunching facts. You won't be able to run 16 concurrent GPU tasks on any discrete video card because they simply don't have enough VRAM, even the datacenter versions with 32GB or more.

I don't know of any GPU project with such low VRAM usage per task. Certainly none producing serious science. For example, the latest Einstein O4MDG tasks were using as much as 5GB per task, so you could only fit 2 concurrent tasks on typical 10-12GB VRAM cards.
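That VRAM arithmetic can be sketched as a quick feasibility check (the numbers are just the examples from this thread, and the 1 GB headroom for the driver/display is my own assumption):

```python
def max_concurrent_tasks(vram_gb: float, per_task_gb: float,
                         headroom_gb: float = 1.0) -> int:
    """How many GPU tasks fit in VRAM, leaving headroom for the driver/display."""
    usable = vram_gb - headroom_gb
    return max(1, int(usable // per_task_gb))

# Einstein O4MDG at ~5 GB/task on a 12 GB card -> 2 concurrent tasks
print(max_concurrent_tasks(12, 5))    # 2
# The more common ~3-4 GB/task series on the same card -> about 3
print(max_concurrent_tasks(12, 3.5))  # 3
```

Exceeding this bound doesn't slow tasks down gracefully; they error out with insufficient memory, as noted above.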

Even with the more common task series, which were running 3-4GB per task, you could only run 3X concurrent tasks safely without erroring out because of insufficient memory.

You do bring up a salient point about CPU and memory bandwidth starvation when pushing data in and out of the cards. That is another limitation on attempting high GPU task concurrency, especially with typical two-memory-channel desktop systems.
Grant (SSSF)

Joined: 7 Dec 24
Posts: 243
Message 118139 - Posted: 20 Jan 2026, 6:34:38 UTC
Last modified: 20 Jan 2026, 6:37:44 UTC

Being able to run multiple Tasks, and producing more work per hour than when running just a single Task, indicates that the application can use further optimisation in order to take advantage of the hardware.

Each and every application is different, and the only way to see what is best is to read the forums at the project for that particular application, and see what others have found. Then try it for yourself.

For Numberfields, my RTX 2060 Super provides maximum output when running one Task at a time. It can run more, but then throughput drops off significantly as it's already at 100% load with just one Task.
My RTX 4070Ti Super gives the most work per hour when running 2 Tasks.
My RTX 4080 Super is best with 3 Tasks.
And in each case, each Task requires 1 CPU core/thread to support it.

That's for Windows running an OpenCL application.
For LINUX, which has a CUDA application, the numbers could be very different, particularly the need for CPU support; much less support is most likely required.

Back in the days of Seti, the optimised applications for the GPUs (both Windows and LINUX) gave their best output when running only 1 Task at a time, while the stock application gave its best output when running multiple Tasks at a time (but that was still less than the optimised applications).
And in the case of iGPUs, 1 Task at a time will generally result in 100% load of the iGPU. Running any more Tasks will reduce its already pathetic output even further, as well as impact even more on the CPU work output.
Grant
Darwin NT.
kasdashdfjsah

Joined: 29 Jan 24
Posts: 96
Message 118157 - Posted: 21 Jan 2026, 16:56:19 UTC

Thanks for the feedback.

I see the concern about VRAM and 'time slicing' on high-end discrete cards.

My 'Safety Tiers' were based on testing the new M4 Mac Mini, where unified memory handles high concurrency differently than traditional setups.

On this iGPU, 10 tasks was the sweet spot for stability and saturation.

Grant and Keith, you're right that on a 4090 or 5090, 1-3 tasks usually max out the hardware, and Einstein O4MDG would hit a VRAM wall much sooner.

I should have been clearer that this is heavily app-dependent.

I’ll update my notes: users should start with 1-2 tasks and only scale up if the GPU is clearly underutilized.

The goal is filling 'idle gaps' to keep the silicon busy. Appreciate the reality check on the high-end side!
ahorek

Joined: 18 Jan 26
Posts: 2
Czech Republic
Message 118205 - Posted: 26 Jan 2026, 19:20:34 UTC

There's a misconception here about how a GPU operates. You can’t just split a GPU into smaller pieces like a CPU to boost utilization (MPS allows it, but it’s specific to CUDA & Linux).

Running multiple work units only improves performance if there’s a memory bottleneck or if thread utilization is too low. Switching between multiple apps on the same GPU could help fill those gaps and improve the overall performance.

This usually helps on:
* Einstein, GPUGrid, Amicable, and AP27 (a little)
* Small GFN (16-18) (helps because the tasks are small on high-end GPUs, and running more of them reduces the CPU overhead)
* Large GFN, PG Sieve, Asteroids, and Minecraft already max out GPU usage, so running multiple tasks will probably degrade performance rather than improve it
Monitoring GPU utilization with GPU-Z (on Windows) is a good way to see whether the app might actually benefit from it...

VRAM consumption is another limit, especially with EaH tasks that benefit the most due to "inefficiencies" of the current apps. 4x is the recommended maximum on high-end GPUs like 5090. Running more is rarely helpful.

10 GPU tasks in parallel is pointless, especially on an iGPU. Even server GPUs like the B200 don’t see any benefit from so many tasks, despite having enough memory to hold them all.

On slow GPUs and iGPUs, it’s generally recommended to run only one GPU WU. You can experiment with multiple WUs to see if performance improves, but it really depends on the application, your GPU performance, and available VRAM. Running multiple GPU WUs on a slow GPU is likely to reduce performance rather than improve it, and since it’s hard to predict how a given app will behave on your GPU, this option isn’t enabled by default.
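The "experiment and measure" approach described above can be sketched simply: time a fixed batch of tasks at each concurrency level and keep the setting with the highest tasks per hour. The measurements below are hypothetical, just to show the comparison:

```python
def best_concurrency(throughput_by_n: dict) -> int:
    """Pick the concurrent-task count with the highest measured tasks/hour.

    throughput_by_n maps concurrent-task count -> measured tasks per hour,
    filled in by running the same app at 1x, 2x, 3x and timing a batch.
    """
    return max(throughput_by_n, key=throughput_by_n.get)

# Hypothetical measurements for one app on one GPU:
measured = {1: 20.0, 2: 26.0, 3: 25.0}  # tasks/hour at 1x, 2x, 3x
print(best_concurrency(measured))  # 2
```

The point is that the curve peaks early and app-dependently, which is why no single tier rule works across projects.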


Copyright © 2026 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.