Message boards : Server programs : Database race condition
| Author | Message |
|---|---|
|
Joined: 11 Jul 17 Posts: 14
|
How does this manifest itself? Sometimes the transitioner processes the entire workunit table, filling the transition_time field with the value Tue Jan 19 06:14:07 MSK 2038. This is clearly noticeable when the workunit table contains tens of millions of entries. (I know that such a large number of entries isn't ideal, but it happens for various reasons.)

Description below:

Point 1: at line 149 of sched/transitioner.cpp https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L149 there is a function "int handle_wu(". The details of its call are not important to us now.

Point 2: on line 620, inside this function, there's a condition: "(wu_item.canonical_resultid || wu_item.error_mask)" https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L620

Point 3: if "wu_item.canonical_resultid" is non-zero (that is, evaluates to true), then "wu_item.error_mask" is NOT checked. This is the whole point of the || operator: it short-circuits.

Point 4: on line 621, the assignment "wu_item.transition_time = INT_MAX" is executed, where INT_MAX is 2147483647, and as a Unix timestamp 2147483647 is "Tue Jan 19 06:14:07 MSK 2038" (in my location, of course). https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L621 This is logical, since 2147483647 in binary form is 31 consecutive one bits (0x7FFFFFFF), which is the maximum value of a signed 32-bit integer. That detail is not important. What's important is that on line 621 transition_time became equal to INT_MAX. Let's remember this, too.

Point 5: Further in the code, under my current conditions, the transition_time variable doesn't change again.
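Points 3 and 4 can be checked with a minimal sketch. The names WuItem, read_error_mask, is_finished and format_utc are hypothetical and exist only for this illustration; they are not BOINC code, and the sketch assumes a platform where int is 32 bits wide.

```cpp
#include <cassert>
#include <climits>
#include <ctime>
#include <string>

// Assumption: int is 32 bits wide, as on the usual server platforms.
static_assert(INT_MAX == 2147483647, "this sketch assumes a 32-bit int");

// Hypothetical stand-in for the workunit fields mentioned in the post.
struct WuItem {
    long canonical_resultid;
    int error_mask;
    int transition_time;
};

// Counts how often the right-hand operand of || is actually evaluated.
static int error_mask_reads = 0;

int read_error_mask(const WuItem& wu) {
    ++error_mask_reads;
    return wu.error_mask;
}

// Same shape as the quoted condition: the right operand is evaluated
// only when canonical_resultid is zero.
bool is_finished(const WuItem& wu) {
    return wu.canonical_resultid || read_error_mask(wu);
}

// Formats a Unix timestamp as "YYYY-MM-DD HH:MM:SS" in UTC.
std::string format_utc(std::time_t t) {
    const std::tm* utc = std::gmtime(&t);
    assert(utc != nullptr);
    char buf[32];
    std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", utc);
    return buf;
}
```

With a non-zero canonical_resultid, is_finished returns true without touching error_mask. format_utc(INT_MAX) gives 2038-01-19 03:14:07 in UTC, which is 06:14:07 in MSK (UTC+3).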
Point 6: on line 676 the update function is called: "retval = transitioner.update_workunit(wu_item, wu_item_original);" https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/sched/transitioner.cpp#L676

Point 7: The function (update_workunit) begins at line 1790 of the db/boinc_db.cpp file. https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/db/boinc_db.cpp#L1790

Point 8: Look very carefully at the code starting at line 1813. I quote: https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/db/boinc_db.cpp#L1813-L1819

// Don't update transition_time if it changed in database because something
// happened in background (usually, another result was uploaded).
// Instead, force another run of transitioner to handle these changes.
if (ti.transition_time != ti_original.transition_time) {
    sprintf(buf, " transition_time=if(transition_time=%d,%d,%d),", ti_original.transition_time, ti.transition_time, (int)time(NULL));
    strcat(updates, buf);
}

End of quote.

Point 9: Line 1817 partially generates a message for the log.

Point 10: On line 1837, the string generated earlier by the code is written to the log file. This is where I saw a possible cause of the problem we are discussing.

Point 11: The most interesting part. On line 1838, a request to update a field in the database is executed! How so? (This is my exclamation, not a question!) https://github.com/BOINC/boinc/blob/deeb9b8612d2f919a0e52f555b7470f40c178219/db/boinc_db.cpp#L1838 After all, the developer's comment at line 1813 says: "Don't update transition_time if it changed in database because something happened in background (usually, another result was uploaded). Instead, force another run of transitioner to handle these changes." That is, the developer described the check logic in the comment, but did NOT implement aborting the update in the code.

How can we fix this problem? Simply add return 0; after line 1818.

Please confirm or refute my analysis. The problem described has been present for several years, as far as I know. Fixing it requires changes to code that has been stable for many years. |
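For reference when checking the analysis, here is a minimal sketch of the text that the quoted sprintf call appends to updates, together with a model of how MySQL evaluates IF(condition, a, b): it returns a when the condition holds and b otherwise. The function names and sample values are hypothetical and not part of BOINC; whether this fragment ends up in the query executed on line 1838 should be verified against the linked source.

```cpp
#include <cassert>
#include <climits>
#include <cstdio>
#include <string>

// Builds the same fragment as the quoted sprintf call, using snprintf
// on a local buffer. The parameter names are hypothetical.
std::string transition_time_fragment(int original, int updated, int now) {
    char buf[256];
    int n = std::snprintf(buf, sizeof(buf),
                          " transition_time=if(transition_time=%d,%d,%d),",
                          original, updated, now);
    assert(n > 0 && n < static_cast<int>(sizeof(buf)));
    return buf;
}

// Models MySQL's IF(in_db = original, updated, now) for one row:
// keep the transitioner's value only if the row was not changed in
// the meantime, otherwise store the current time.
int resolved_transition_time(int in_db, int original, int updated, int now) {
    return in_db == original ? updated : now;
}
```

With the sample values 100, INT_MAX and 200, the fragment reads " transition_time=if(transition_time=100,2147483647,200),".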
Dave Joined: 28 Jun 10 Posts: 3256
|
You are more likely to get an answer from someone who knows what they are talking about (unlike myself with this code) by posting on GitHub. |
|
Joined: 11 Jul 17 Posts: 14
|
Thank you for your reply. Yes, I understand that, which is why I created a ticket: https://github.com/BOINC/boinc/issues/6994 To avoid duplicating information, I've put the full description of the problem here. Additionally, I believe the problem described and its solution require very careful study, preferably by several specialists, because I could be wrong. That's why I posted the main post here: I believe this issue is best discussed here. |
Copyright © 2026 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.