Message boards : BOINC client : BOINC 6.2.xx - crashes all over the place
Send message Joined: 7 Aug 08 Posts: 1
Hi, not sure if someone else already posted it, but I do think 6.2.xx is broken like hell. I've seen this on numerous projects while being paired with wingmen who use 6.2.xx. It's always the same message:

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Can't get shared memory segment name: shmget() failed
</message>
]]>

Example: http://genlife.is-a-geek.org/genlife/workunit.php?wuid=18194

Seen some crashes on Milkyway and S@H too... always the same pattern/error message.
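For background, that message comes from the shared-memory handshake between the client and the science app; on POSIX systems the call underneath is shmget(2), which fails if the segment was never created or if system-wide segment limits are exhausted. Below is a minimal sketch of that attach step, assuming plain SysV IPC - the key and size are invented for illustration, not BOINC's actual slot-derived values:

/* Illustrative sketch only, not BOINC's real code: attach to an
 * existing SysV shared-memory segment and report errno on failure. */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

static void *attach_segment(key_t key, size_t size) {
    int id = shmget(key, size, 0);   /* look up an existing segment */
    if (id < 0) {
        /* ENOENT: never created; EMFILE/ENOSPC: segment limits hit */
        fprintf(stderr, "shmget() failed: %s\n", strerror(errno));
        return NULL;
    }
    void *p = shmat(id, NULL, 0);    /* map it into this process */
    return (p == (void *)-1) ? NULL : p;
}

If segments were being leaked rather than never created, running "ipcs -m" on the affected host would show them piling up.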
Send message Joined: 19 Dec 05 Posts: 96
Hi, I ran 6.2.11 for several weeks (two or maybe three) on two different 32-bit machines, one on CentOS4 and the other on Red Hat Enterprise Linux 5. Both worked fine, without that message. I upgraded the RHEL5 machine to 6.2.14 and it, too, works fine. I could not upgrade the CentOS4 machine to 6.2.14, but that was because one of the CentOS4 libraries (glibc?) is too old. I should upgrade that machine to CentOS5, but I have not gotten to that yet.
Send message Joined: 5 Oct 06 Posts: 5148
Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

I have one 6.2.14 client for testing purposes. It was running absolutely smoothly, with no errors at all, until this happened:

10-Sep-2008 11:10:54 [lhcathome] Sending scheduler request: To fetch work. Requesting 57827 seconds of work, reporting 0 completed tasks
10-Sep-2008 11:11:16 [---] Project communication failed: attempting access to reference site
10-Sep-2008 11:11:18 [---] Internet access OK - project servers may be temporarily down.
10-Sep-2008 11:11:19 [lhcathome] Scheduler request failed: Couldn't connect to server
10-Sep-2008 11:28:04 [SETI@home] Computation for task 14au08af.23089.18477.5.8.223_1 finished
10-Sep-2008 11:28:04 [SETI@home] Starting ap_14au08aa_B0_P1_00115_20080909_26999.wu_0
10-Sep-2008 11:28:04 [SETI@home] Starting task ap_14au08aa_B0_P1_00115_20080909_26999.wu_0 using astropulse version 435
10-Sep-2008 11:28:06 [SETI@home] Started upload of 14au08af.23089.18477.5.8.223_1_0
10-Sep-2008 11:28:14 [SETI@home] Finished upload of 14au08af.23089.18477.5.8.223_1_0
10-Sep-2008 12:02:55 [SETI@home] Computation for task 15au08aa.16292.17250.5.8.155_1 finished
10-Sep-2008 12:02:55 [SETI@home] Starting 14au08af.23089.24203.5.8.28_0
10-Sep-2008 12:02:55 [SETI@home] Starting task 14au08af.23089.24203.5.8.28_0 using setiathome_enhanced version 528
10-Sep-2008 12:02:57 [SETI@home] Started upload of 15au08aa.16292.17250.5.8.155_1_0
10-Sep-2008 12:03:04 [SETI@home] Finished upload of 15au08aa.16292.17250.5.8.155_1_0
10-Sep-2008 12:57:31 [SETI@home] Computation for task 14au08af.23089.24203.5.8.28_0 finished
10-Sep-2008 12:57:31 [SETI@home] Starting 15au08aa.16292.20931.5.8.4_0
10-Sep-2008 12:57:31 [SETI@home] Starting task 15au08aa.16292.20931.5.8.4_0 using setiathome_enhanced version 528
10-Sep-2008 12:57:33 [SETI@home] Started upload of 14au08af.23089.24203.5.8.28_0_0
10-Sep-2008 12:57:40 [SETI@home] Finished upload of 14au08af.23089.24203.5.8.28_0_0
10-Sep-2008 13:21:00 [lhcathome] Sending scheduler request: To fetch work. Requesting 57797 seconds of work, reporting 0 completed tasks
10-Sep-2008 13:21:05 [lhcathome] Scheduler request succeeded: got 0 new tasks
10-Sep-2008 13:52:08 [SETI@home] Computation for task 15au08aa.16292.20931.5.8.4_0 finished
10-Sep-2008 13:52:08 [SETI@home] Starting 14au08ae.28085.72.13.8.135_0
10-Sep-2008 13:52:08 [SETI@home] Starting task 14au08ae.28085.72.13.8.135_0 using setiathome_enhanced version 528
10-Sep-2008 13:52:10 [SETI@home] Started upload of 15au08aa.16292.20931.5.8.4_0_0
10-Sep-2008 13:52:18 [SETI@home] Finished upload of 15au08aa.16292.20931.5.8.4_0_0
10-Sep-2008 14:46:06 [SETI@home] Computation for task 14au08ae.28085.72.13.8.135_0 finished
[b][color=red]10-Sep-2008 14:46:06 [SETI@home] Starting 14au08ae.28085.890.13.8.242_0
10-Sep-2008 14:46:06 [SETI@home] Starting 14au08ae.28085.3344.13.8.113_1[/color][/b]
10-Sep-2008 14:46:07 [SETI@home] Computation for task 14au08ae.28085.890.13.8.242_0 finished
10-Sep-2008 14:46:07 [SETI@home] Output file 14au08ae.28085.890.13.8.242_0_0 for task 14au08ae.28085.890.13.8.242_0 absent
10-Sep-2008 14:46:07 [SETI@home] Computation for task 14au08ae.28085.3344.13.8.113_1 finished
10-Sep-2008 14:46:07 [SETI@home] Output file 14au08ae.28085.3344.13.8.113_1_0 for task 14au08ae.28085.3344.13.8.113_1 absent
10-Sep-2008 14:46:07 [SETI@home] Starting 14au08ae.28085.3753.13.8.163_1
10-Sep-2008 14:46:08 [SETI@home] Started upload of 14au08ae.28085.72.13.8.135_0_0
10-Sep-2008 14:46:08 [SETI@home] Computation for task 14au08ae.28085.3753.13.8.163_1 finished
10-Sep-2008 14:46:08 [SETI@home] Output file 14au08ae.28085.3753.13.8.163_1_0 for task 14au08ae.28085.3753.13.8.163_1 absent
10-Sep-2008 14:46:08 [SETI@home] Starting 14au08af.1803.6207.6.8.251_0
10-Sep-2008 14:46:09 [SETI@home] Computation for task 14au08af.1803.6207.6.8.251_0 finished
10-Sep-2008 14:46:09 [SETI@home] Output file 14au08af.1803.6207.6.8.251_0_0 for task 14au08af.1803.6207.6.8.251_0 absent

This is a quad core, and is attached to a variety of projects; however, as of today, every project is set to NNT except SETI and LHC. LHC had no work at the time, so effectively the host had become a SETI-only cruncher. Further, it had (and still has) three Astropulse tasks running - you can see where the third AP task started, at 11:28:04. They are 40-hour-plus tasks, so only one core remained available for SETI MB work, and you can see how the tasks start one at a time - at 12:02:55, 12:57:31, 13:52:08 etc. I run with a conservative 1-day cache, so there is no question of tasks being pre-empted for EDF.

Then, at 14:46:06 (highlighted), BOINC tried to start two tasks at once. They both crashed with the "shmget() failed" error, and BOINC then proceeded to trash the remaining 74 tasks in the cache, one per second. Fortunately, it didn't trash the running Astropulse tasks, and it did go into a 24-hour backoff on scheduler contact with SETI (no reason apparent in the logs - the only scheduler contacts are:

10-Sep-2008 09:36:57 [SETI@home] Sending scheduler request: To fetch work. Requesting 101 seconds of work, reporting 2 completed tasks
10-Sep-2008 09:37:02 [SETI@home] Scheduler request succeeded: got 1 new tasks

and

10-Sep-2008 17:41:07 [SETI@home] Fetching scheduler list
10-Sep-2008 17:41:12 [SETI@home] Master file download succeeded
10-Sep-2008 17:41:17 [SETI@home] Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 86 completed tasks
10-Sep-2008 17:41:22 [SETI@home] Scheduler request succeeded: got 0 new tasks
10-Sep-2008 17:47:41 [---] Exit requested by user

when I got home). So the only oddity I can see is that double task start at 14:46:06, which would have meant five tasks running on a four-core CPU.

Host ID 4292666 at SETI, now upgraded to BOINC v6.2.18 (service install, as before).
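Assuming the double start really was the trigger, the invariant that broke is simple: never run more concurrent tasks than the host has cores. A speculative sketch of such a guard - the names and structure here are hypothetical, not BOINC's actual CPU scheduler code:

/* Hypothetical guard, not BOINC's real scheduler: refuse to start a
 * task when every core is already claimed. The 14:46:06 log shows two
 * starts in the same second on a fully loaded quad, which suggests an
 * equivalent check was skipped or not applied atomically. */
#include <stdbool.h>
#include <stddef.h>

struct sched_state {
    size_t nrunning;   /* tasks currently executing */
    size_t ncpus;      /* cores the host reports */
};

static bool try_start_task(struct sched_state *s) {
    if (s->nrunning >= s->ncpus)
        return false;    /* all cores busy: do not launch another app */
    s->nrunning++;       /* claim a core before launching */
    return true;
}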
Send message Joined: 2 Sep 05 Posts: 103
Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

It's worth checking whether there are any extra messages in stderrdae.txt. If there was a problem setting up the shared memory security descriptors, the error messages will have been written directly to the stderr file stream (you won't see them in stdoutdae.txt or the BOINC Manager message tab).

"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
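For what it's worth, the reason such errors can only turn up in stderrdae.txt: when the client runs detached (e.g. as a service), its stdout and stderr streams are redirected to stdoutdae.txt and stderrdae.txt respectively, so anything written to raw stderr outside the normal message log never reaches the Manager. A generic sketch of that redirection - the file names follow the BOINC convention, but the startup code itself is invented for illustration:

/* Generic daemon-style redirection, not BOINC's exact startup code.
 * After this, printf() output lands in stdoutdae.txt while raw
 * fprintf(stderr, ...) output lands only in stderrdae.txt. */
#include <stdio.h>

int main(void) {
    if (!freopen("stdoutdae.txt", "a", stdout)) return 1;
    if (!freopen("stderrdae.txt", "a", stderr)) return 1;

    printf("message you would see in the Manager's log\n");
    fprintf(stderr, "shared-memory security setup failed\n");  /* stderr only */
    return 0;
}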
Send message Joined: 5 Oct 06 Posts: 5148
Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Worth a look - but it seems stderrdae.txt hasn't been written to since 7 May 2008, and only contains (multiple iterations of):

UNRECOGNIZED: suspend_if_no_recent_input
Send message Joined: 29 Aug 05 Posts: 15599
Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Not really, as it was fixed in 6.2.18. See its change log, which says:

I was able to verify the BOINCTray.exe issue and the shared-mem and handle leaks. I'm not sure how any of us could test the client crash scenario; I ran through the basic battery of tests against BOINC Alpha. I guess we'll just have to let the people who discovered it let us know if the problem is fixed.
Send message Joined: 5 Oct 06 Posts: 5148
Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.
Send message Joined: 29 Aug 05 Posts: 15599
None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.

Um... why would you want to do that anyway? Or am I missing something?
Send message Joined: 5 Oct 06 Posts: 5148
None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.

Well, I don't want to - but it seems my CC v6.2.14 did (at 10-Sep-2008 14:46:06, see the log above), and that's what provoked the first attack of the shmget() faileds.