Thread 'Database race condition'

Message boards : Server programs : Database race condition

Demis

Joined: 11 Jul 17
Posts: 14
Russia
Message 118805 - Posted: 13 Apr 2026, 10:03:06 UTC

How does this manifest itself?
Sometimes the transitioner processes the entire workunit table and sets the transition_time field to Tue Jan 19 06:14:07 MSK 2038.

This is clearly noticeable when the workunit table contains tens of millions of entries.
(I know that such a large number of entries isn't ideal, but it happens for various reasons.)

Description below:
Point 1: on line 149 of sched/transitioner.cpp https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L149
the function "int handle_wu( " begins.
The details of how it is called are not important here.

Point 2: on line 620, inside this function, there is a condition: "(wu_item.canonical_resultid || wu_item.error_mask)"
https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L620

Point 3: if "wu_item.canonical_resultid" is nonzero (true), then "wu_item.error_mask" is NOT evaluated. This is how the || operator short-circuits.
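
The short-circuit behavior in Point 3 can be illustrated with a minimal sketch. The struct WuItem, the counter rhs_evaluations, and the functions read_error_mask and is_finished are hypothetical names made up for this illustration; they are not part of the BOINC source. Only the shape of the condition is taken from line 620.

```cpp
// Hypothetical stand-in for the two fields tested on line 620.
struct WuItem {
    long canonical_resultid;
    int error_mask;
};

// Counts how many times the right-hand operand of || is evaluated.
static int rhs_evaluations = 0;

// Wraps the read of error_mask so that each evaluation can be observed.
static int read_error_mask(const WuItem& wu) {
    ++rhs_evaluations;
    return wu.error_mask;
}

// Same shape as the condition on line 620: true if either field is nonzero.
// When canonical_resultid is nonzero, read_error_mask() is never called.
bool is_finished(const WuItem& wu) {
    return wu.canonical_resultid || read_error_mask(wu);
}
```

Calling is_finished on an item with canonical_resultid = 42 leaves rhs_evaluations at 0, while an item with canonical_resultid = 0 increments it. The overall result of the condition is the same either way when at least one field is nonzero; short-circuiting only skips the second read.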

Point 4: on line 621, transition_time is assigned: "wu_item.transition_time = INT_MAX"
where INT_MAX is 2147483647, and 2147483647 as a Unix timestamp is "Tue Jan 19 06:14:07 MSK 2038" (in my time zone, of course).
https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L621
This is logical, since 2147483647 in binary is 31 one bits (0x7FFFFFFF), which is the maximum value of a signed 32-bit int.
That detail is not important. What's important is that on line 621 transition_time became equal to INT_MAX. Let's remember this, too.
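
The timestamp arithmetic in Point 4 can be checked with a small sketch. It assumes a platform where int is 32 bits (so INT_MAX is 2147483647) and where time_t can hold that value; format_utc is a hypothetical helper name, not something from the BOINC code.

```cpp
#include <climits>
#include <ctime>
#include <string>

// Formats a Unix timestamp as UTC text, e.g. "2038-01-19 03:14:07 UTC".
// Returns an empty string if the conversion fails.
std::string format_utc(std::time_t t) {
    // std::gmtime returns a pointer to static storage, so copy it right away.
    const std::tm* p = std::gmtime(&t);
    if (p == nullptr) {
        return std::string();
    }
    std::tm tm_utc = *p;
    char buf[32];
    std::strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC", &tm_utc);
    return std::string(buf);
}
```

Under those assumptions, format_utc(INT_MAX) gives "2038-01-19 03:14:07 UTC", which is 06:14:07 in MSK (UTC+3) and matches the value seen in the workunit table.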

Point 5: Further in the code, under my current conditions, the transition_time variable doesn't change.

Point 6: on line 676 the update function is called: "retval = transitioner.update_workunit(wu_item, wu_item_original);"
https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L676

Point 7: This function (update_workunit) begins at line 1790 of db/boinc_db.cpp.
https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/db/boinc_db.cpp#L1790

Point 8: Look very carefully at the code starting at line 1813. I quote:
https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/db/boinc_db.cpp#L1813-L1819
    // Don't update transition_time if it changed in database because something
    // happened in background (usually, another result was uploaded).
    // Instead, force another run of transitioner to handle these changes.
    if (ti.transition_time != ti_original.transition_time) {
        sprintf(buf, " transition_time=if(transition_time=%d,%d,%d),", ti_original.transition_time, ti.transition_time, (int)time(NULL));
        strcat(updates, buf);
    }
end quote
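
To make the quoted sprintf concrete, here is a minimal sketch of the text it produces and of how that text would evaluate. It assumes the fragment ends up in the SET clause of a MySQL UPDATE statement, where IF(cond, a, b) yields a when cond is true and b otherwise. Both function names are hypothetical, and the current time is passed in as a parameter instead of calling time(NULL) so that the result is reproducible.

```cpp
#include <cstdio>
#include <string>

// Builds the same fragment as the quoted sprintf on line 1817.
std::string transition_time_fragment(int original, int updated, int now) {
    char buf[256];
    std::snprintf(buf, sizeof buf,
                  " transition_time=if(transition_time=%d,%d,%d),",
                  original, updated, now);
    return std::string(buf);
}

// Models IF(transition_time=original, updated, now) for one row:
// in_db is the value currently stored in the database for that row.
int resulting_transition_time(int in_db, int original, int updated, int now) {
    return (in_db == original) ? updated : now;
}
```

With original = 1000, updated = 2147483647 and now = 2000, the fragment reads " transition_time=if(transition_time=1000,2147483647,2000),". Under the stated assumption, a row that still holds 1000 would receive 2147483647, and a row that was changed in the background (for example to 1500) would receive 2000. Whether this matches the intent of the developer's comment is exactly what needs to be confirmed or refuted.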

Point 9: Line 1817 builds part of the string (appended to "updates") that is later written to the log.

Point 10: On line 1837, the string built earlier is written to the log file. This is where I noticed a possible cause of the problem under discussion.

Point 11: The most interesting part.
On line 1838, the query that updates the field in the database is executed! How so? (This is an exclamation, not a question!)
https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/db/boinc_db.cpp#L1838
After all, the developer's comment starting at line 1813 says:
"Don't update transition_time if it changed in database because something happened in background (usually, another result was uploaded). Instead, force another run of transitioner to handle these changes."
That is, the developer described the check logic in the comment, but did NOT implement an early exit from the update in the code.

How can we fix this problem?
Simply add return 0; after line 1818.

Please confirm or refute my analysis.

As far as I know, the problem described has been present for several years.
Fixing it requires changes to code that has been stable for many years.
ID: 118805
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 3256
United Kingdom
Message 118807 - Posted: 13 Apr 2026, 12:40:13 UTC - in response to Message 118805.  

You are more likely to get an answer from someone who knows this code (unlike me) by posting on GitHub.
ID: 118807
Demis

Joined: 11 Jul 17
Posts: 14
Russia
Message 118808 - Posted: 13 Apr 2026, 14:57:38 UTC - in response to Message 118807.  

Thank you for your reply.
Yes, I understand that.
That's why I created a ticket:
https://github.com/BOINC/boinc/issues/6994

But to avoid duplicating information, I've put the full description of the problem here.

In addition:
I believe the problem described and its solution require very careful study, preferably by several different specialists, because I could be wrong.
That's why I posted the main description here: I believe this issue is best discussed in this thread.
ID: 118808


Copyright © 2026 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.