Message boards : Projects : Anything and Everything to do with (WCG) World Community Grid
| Author | Message |
|---|---|
|
Send message Joined: 30 Mar 20 Posts: 707
|
For the moment WCG work is flowing again. But, for how long? That is the question. |
Bill FreauffSend message Joined: 26 Mar 11 Posts: 242
|
In reply to Grumpy Swede's message of 2 Dec 2025: For the moment WCG work is flowing again. But, for how long? That is the question. And to ask a question after "the" question ... and when will there be "Stat's" ... ? |
|
Send message Joined: 30 Mar 20 Posts: 707
|
I don't want to jinx things, but the flow of work over the last couple of days has been very good. I also noticed that tasks crunched, uploaded, and reported before the migration are now starting to validate. The same goes for cached tasks crunched during the migration and uploaded and reported after the system came back. Good job, Dylan, and the entire team. |
|
Send message Joined: 30 Mar 20 Posts: 707
|
New update from WCG: December 4, 2025

The BOINC feeder/scheduler issue reporting "tasks committed to other platforms" is resolved; details on the resolution, and on plans to keep this issue from coming back, are further down. The validation backlog pass has begun for workunits that were held over the break, and for workunits that fell through our new validation logic unvalidated. We intend to ramp up these passes in the coming days, and now that we know our scripting works to backfill validations, next week we will report on progress and project expected dates for fully backfilling all such cases and finally catching validations up to in-flight work. We will not restart the file_deleter or db_purge BOINC services until we have validated every file we possess that was uploaded before or after the break, including sending resends for some cases of "orphans".

What was the workaround for the feeder/scheduler blockage caused by an hr_class mismatch between results for the same workunit? The resolution we chose for now was simply to purge stale feeder entries, effectively resetting their hr_class (homogeneous redundancy class) to 0 and allowing any host/platform to download a result that sits in memory for too long. The feeder can be started with a CLI option specifying how long a result may occupy a slot before it considers this course.

What does resetting hr_class=0 as a workaround accomplish? The hr_class=0 reset matches the value assigned to fresh workunit results being sent out for the first time, essentially telling the scheduler that any host/platform may claim and compute the result (i.e., _0 and _1 results have hr_class=0, while resends consult the hr_class of the host that already reported results). 
There is some computational overhead, as a second tier of validation is then required to confirm that the exact gene signatures and their scores are "the same" between results computed on different platforms, in the case of purged resends that had their hr_class reset to 0. We intend to disable hr_class (homogeneous redundancy) completely for MCM1 at some point in the future, and instead rely directly on this currently secondary validation, recording the delta between exact scores and verifying equivalence of the gene signatures found for results sent to different platforms, to ensure they are within a reasonable error bound/tolerance as a rule.

Does this workaround affect the integrity of MCM1 results? No, but it does introduce new edge cases to account for. The score can vary, within the upper and lower bounds of possible floating-point error, between platforms for the same workunit. Ensuring that the floating-point calculations are not different enough to invalidate the computational result is a vastly easier problem when using the hr_class mechanism. However, because MCM1 produces a list of genes as well as a score, the only additional validation criteria we incur by disabling hr_class are, ostensibly, "score is just below the threshold on this system" exclusion and "score is just above the threshold on this system" inclusion, for specific signatures very close to the configured threshold. In these cases, we can take the union of the additional results slightly above or below the threshold score across all results for a workunit, provided the rest of the results above the threshold are equivalent.

Why have hr_class at all for MCM1 then? Indeed. 
We intend to track the above cases, and any other validation failures where we can discern an unforeseen effect of allowing resends to potentially go to different platforms. We will also try this "disable hr_class if the feeder gets stuck" system for MAM1, which does have a numerical optimization routine exploring the signature search space that could change the actual signatures under test due to floating-point error, and so may not be a good candidate for this (and yet the calculations are valid, so any reasonable overlap, or a "canary" or "spike-in" validation system, might be considered sufficient validation...). If we are satisfied with the outcome of post-processing results that came from different platforms, we can disable it. This will accelerate throughput and discovery for MCM1, and possibly MAM1, while buying time to resolve this issue more permanently for applications such as ARP1, to which this thinking does not apply: there, the floating-point calculations must be byte-wise equivalent between results, or the result is simply invalid. Once we can confirm that newer 8.x+ BOINC clients permitting WSL on Windows hosts are the only source of this hr_class confusion bug, and possibly of the "W"/"W" os_name and os_version truncation bug, we can apply a targeted fix. |
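The hr_class reset the update describes can be modeled in a few lines. This is a minimal, hypothetical Python sketch (not WCG's or BOINC's actual code; all names, fields, and thresholds are illustrative) of the idea: a result stuck in a feeder slot too long has its hr_class reset to 0, after which the scheduler will hand it to any platform:

```python
# Minimal sketch of the "purge stale feeder entries" workaround described
# above. Hypothetical model, not the real BOINC feeder/scheduler.
import time

HR_ANY = 0  # hr_class 0: fresh-workunit semantics, any host/platform may claim it

class Result:
    def __init__(self, name, hr_class=HR_ANY):
        self.name = name
        self.hr_class = hr_class        # resends inherit this from the first reporting host
        self.enqueued_at = time.time()  # when the feeder loaded it into a shared-memory slot

def purge_stale(slots, max_occupancy_secs):
    """Feeder pass: reset hr_class on results that sat in a slot too long,
    so they are no longer pinned to one platform class."""
    now = time.time()
    for result in slots:
        if result.hr_class != HR_ANY and now - result.enqueued_at > max_occupancy_secs:
            result.hr_class = HR_ANY  # any host may now download and compute it

def can_send(result, host_hr_class):
    """Scheduler check: a host may claim a result only if its platform class
    matches, or the result is open to all platforms."""
    return result.hr_class == HR_ANY or result.hr_class == host_hr_class
```

Usage: a resend pinned to hr_class 3 is invisible to an hr_class-5 host until `purge_stale` resets it, which mirrors the "result sits in memory for too long" condition in the update.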
|
Send message Joined: 30 Mar 20 Posts: 707
|
We're back to an old problem again: "Another scheduler instance is running for this host". Of course, that means no new tasks will be sent, and finished tasks can't be reported. I reported the issue to the WCG team. Edit, added: Problem fixed. The explanation was: "httpd logs filled up the disk, tried restarting the feeder only but websphere also ended up in a bad state, archived and rotated the logs and restarted the websphere server and apps, if all goes well back up in a few minutes here" |
|
Send message Joined: 30 Mar 20 Posts: 707
|
The entire WCG site is now bogged down in cold molasses, and at times even responds with a: "503 Service Unavailable No server is available to handle this request." Edit: Molasses removed from the WCG site now. |
|
Send message Joined: 5 Nov 11 Posts: 46
|
Seems to be OK HERE. Getting work and a few validations.. |
|
Send message Joined: 30 Mar 20 Posts: 707
|
In reply to Hadrian's message of 13 Dec 2025: Seems to be OK HERE. Getting work and a few validations.. Sure, it's OK now. My post was almost 3 days ago. |
|
Send message Joined: 9 Jan 13 Posts: 31
|
Well I am still getting a 403 Forbidden error when I try to log on to the WCG website to look at my stats. -- You don't have permission to access this resource. "My streaks are directly proportional to WCG uptime. I think I would have a better chance during a football game." |
|
Send message Joined: 10 May 07 Posts: 1782
|
Well I am still getting a 403 Forbidden error when I try to log on to the WCG website to look at my stats. Make sure you are using https not http to access the website https://www.worldcommunitygrid.org/ |
|
Send message Joined: 9 Jan 13 Posts: 31
|
In reply to Dr Who Fan's message of 14 Dec 2025: Well I am still getting a 403 Forbidden error when I try to log on to the WCG website to look at my stats. I am.... and even clicking on your link gives me the error. Edit: the Jurisica site works fine, it is just the WCG site that doesn't. |
|
Send message Joined: 9 Jan 13 Posts: 31
|
Deleting the WCG cookies from my browser fixed the problem. Unsure how the cookie got corrupted but it did. |
DaveSend message Joined: 28 Jun 10 Posts: 3259
|
All downloads are failing for me. I suspect it is something to do with tethering my phone and a connection even slower than my dead broadband, even though the machine at home seems to be downloading tasks, albeit with the odd failed download. |
|
Send message Joined: 3 Nov 20 Posts: 24
|
All forum threads are down, error received: mvnForum Fatal Error Message : Cannot init system. Reason : Assertion in ForumUserServlet. Hans S. |
|
Send message Joined: 30 Mar 20 Posts: 707
|
In reply to Hans Sveen's message of 15 Dec 2025: All forum threads are down, error received: Yup, seen that several times before over the years. I'll contact Igor Jurisica so he can ping Dylan. |
|
Send message Joined: 3 Nov 20 Posts: 24
|
Hi! The forums are now working again! Thanks to Igor and Dylan 👍😊 |
|
Send message Joined: 30 Mar 20 Posts: 707
|
In reply to Hans Sveen's message of 15 Dec 2025: Hi! Yes indeed, and you managed to post the "MCM1_0244000" just minutes before I was about to do that. Congratulations Hans 😊 |
|
Send message Joined: 3 Nov 20 Posts: 24
|
Hi again! Thank You, Grumpy Swede 😉 Yes, this time around it was me, next time it will be You (Maybe!) ✌️🤞🤞 Hans S. |
|
Send message Joined: 30 Mar 20 Posts: 707
|
Another update from the WCG team: December 15, 2025

Forum service restored. Degraded service starting roughly 03:00 UTC on December 15th, 2025 led to a crash at roughly 12:30 UTC the same day; service was restored at approximately 20:00 UTC on December 15th, 2025. We have seen this before: database connections waiting indefinitely, or long enough to eventually reach the thread-pool maximum, cause an OOM kill of the forum application under WAS. This is the meaning of the "ForumUserServlet unable to initialize" message displayed while the application was down. Previously, poor parameterization of the thread pool under WAS caused connections to the database to stay open instead of timing out under lock contention. We thought we had ameliorated the issue on the WAS side, but clearly this needs another look. At the very least, pending a confirmed fix or workaround (e.g. better parameters and logic for managing the thread pool), we will deploy alerts for the specific WARN and ERROR messages logged by WAS leading up to the crash, which should provide a window of many hours within which we can fix this manually before the forum application crashes.

After announcing we would be accelerating validations "in the coming days", validations stalled again last week. Unfortunately, we continue to face issues with MCM1 validations. There are multiple categories of missed validations: orphaned "singles", mis-routing of one or both results, incorrectly invalidated result pairs, a missing resend condition, and now a floating-point tolerance too stringent for hr_class-reset workunits. That reset was the workaround to the impossible platform logjam issue, at the expense of having to validate workunits based on similarity of scientific results within a tolerance instead of strict equivalence of scientific results. 
The concept of validity for MCM1 became: result pairs that have equivalent gene-signature membership for signatures above the threshold score, an equivalent list of gene signatures above the threshold score, and a similar score within a configurable error bound passed to the validator on startup, when the two results are run with the same parameters and random seed value. While these cases should therefore be validated by the secondary validator we subscribed to the "validation failure" queue downstream of the primary, checksum-based validator, our tolerance for floating-point error was too stringent, and we will be replaying the failure queue from an earlier offset to catch these cases for recently resent workunits.

Regarding our approach to crediting workunits held during the downtime by scanning the filesystem and checking the database: the process began last week, after indexing the locations of result files for all workunits across all filesystems on the backend, so that validations involving file transfers could avoid walking remote filesystems and simply fetch the required remote result from wherever it had been uploaded or archived. Initial testing suggested we would catch most if not all missed validations using this approach, though the scripts running on each worker node would have to run for some time. Clearly, despite thinking the timestamp-based approach was simply passing through points in time with few missing validations early on, we are not making the expected progress. We are reviewing logs and stats on what has been processed so far to figure out what we missed and how to adjust. Some validations have occurred for each case, just nowhere near the expected throughput/hit rate we projected. So, we are tentatively hopeful we can fix this quickly and finally start making a dent. |
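The relaxed validity criteria described in these updates can be sketched as follows. This is a hypothetical Python model, not MCM1's real validator: the signature structure, threshold handling, and tolerance are all illustrative assumptions. Two results agree if the signatures clearly above the threshold match, their scores agree within the tolerance, and signatures hovering near the threshold on either platform are unioned in rather than treated as a mismatch:

```python
# Rough sketch of tolerance-based cross-platform validation, as described
# in the update. Hypothetical code; the real MCM1 validator differs.

def validate_pair(result_a, result_b, threshold, score_tol):
    """Each result maps gene-signature id -> score.
    Returns (valid, accepted signatures)."""
    # Signatures clearly above the threshold must match on both platforms.
    above_a = {s for s, score in result_a.items() if score >= threshold + score_tol}
    above_b = {s for s, score in result_b.items() if score >= threshold + score_tol}
    if above_a != above_b:
        return False, set()
    # Their scores must agree within the configured error bound.
    for sig in above_a:
        if abs(result_a[sig] - result_b[sig]) > score_tol:
            return False, set()
    # Signatures within score_tol of the threshold on either platform may be
    # included on one system and excluded on the other due to float jitter;
    # take the union instead of invalidating the pair.
    near = {s for r in (result_a, result_b) for s, sc in r.items()
            if abs(sc - threshold) <= score_tol}
    return True, above_a | near
```

For example, a signature scoring 0.801 on one platform and 0.799 on the other, against a 0.8 threshold, would land in the unioned near-threshold set rather than invalidating the workunit, while a disagreement on a clearly-above-threshold signature still fails validation.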
|
Send message Joined: 3 Nov 20 Posts: 24
|
As yesterday, forum threads are down again! Error received: mvnForum Fatal Error Message : Cannot init system. Reason : Assertion in ForumUserServlet. Hans S. Edit: It is fixed, works as expected 👍 |
Copyright © 2026 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.