Message boards : BOINC client : can't download new work units
Joined: 29 Aug 05 · Posts: 147

Back to polling, then. BOINC does not normally poll the projects: the server code sends no information about whether or not it has work unless the client has asked for work, and if the client asks for work, the work is simply sent. Some projects cannot take any significant number of additional server requests, as their databases are already heavily loaded. The restrictions: no extra server contacts (the client is on its own on this one), and resource share should be honored as well as possible.

BOINC WIKI
Joined: 19 Jan 07 · Posts: 1179

> The server code does not send any information about whether or not it has work unless the client has asked for work. If the client asks for work, it is just sent.

I believe his suggestion was to modify the server too, to allow a "do you have any work?" request.

> Some projects cannot take any significant number of additional server requests as the databases are pretty badly loaded already.

That's an implementation issue. Rom Walton has a blog post on why returning results immediately is bad for database load, which shows the database queries needed by a scheduler request. If the scheduler weren't a CGI, a lot of state (such as user information) could be cached in memory instead of being re-read from the database each time.

I have some ideas for a whole new scheduler protocol that would lower the number of DB queries needed (from Rom's post: authentication, the platform check, and the prefs check would be done less often) and give the scheduler a lot more power, by letting the client know the properties of available work before actually accepting it. It could potentially also offer "push" features: the server letting clients know when work is available, instead of waiting for clients to ask for it, which is useful for low-latency computing. Of course, there are issues with implementation complexity (particularly time to implement) and compatibility...
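The post only sketches this "know the work's properties before accepting it" idea in outline. As an illustration, here is a hypothetical client-side selection step; the `WorkOffer` fields, the `choose` function, and all numbers are invented for this sketch and are not part of any actual BOINC protocol:

```python
from dataclasses import dataclass

@dataclass
class WorkOffer:
    app: str            # application name
    est_flops: float    # estimated computation size of the task
    deadline_s: float   # seconds until the report deadline

def choose(offers, free_cpu_s, flops_per_s):
    """Accept only the offers this host can finish both within its
    free CPU time and before each offer's deadline, considering the
    most urgent offers first."""
    accepted = []
    for o in sorted(offers, key=lambda o: o.deadline_s):
        need = o.est_flops / flops_per_s  # CPU seconds this task needs
        if need <= free_cpu_s and need <= o.deadline_s:
            accepted.append(o)
            free_cpu_s -= need
    return accepted
```

With a scheme like this the client could decline tasks it cannot finish on time before they are ever assigned, instead of aborting or missing deadlines after the fact.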
Joined: 29 Aug 05 · Posts: 147

> The server code does not send any information about whether or not it has work unless the client has asked for work. If the client asks for work, it is just sent.

Authentication, the platform check, and the prefs check are done once per connection as far as I know. They must be done once per connection, as each connection is from a different client. The projects are already keeping large amounts of the active DB in memory.

There are 1.8 million hosts attached to S@H (admittedly, some of these are no longer doing work). Assume that 864,000 hosts are actually attached on a regular basis, and that on average each of these contacts the server 6 times per day. That is 60 updates PER SECOND (note that even once per day would be 10 contacts per second). This is the root of the database problems at S@H: the pipeline has to be able to handle at least this number of contacts. One of the results of queueing theory is that a system that is more than 50% loaded is headed for trouble, so the system has to be able to handle 120 updates per second. That means you have, on average, about 8 milliseconds to deal with each transaction.
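The arithmetic in the post above can be reproduced directly; every input is the post's stated assumption, not a measured value:

```python
# Back-of-the-envelope scheduler load estimate from the post.
hosts = 864_000          # hosts assumed attached on a regular basis
contacts_per_day = 6     # assumed average server contacts per host
seconds_per_day = 86_400

updates_per_sec = hosts * contacts_per_day / seconds_per_day
# Queueing-theory rule of thumb: stay under 50% utilization,
# so provision for twice the expected arrival rate.
required_rate = updates_per_sec * 2
budget_ms = 1000 / required_rate  # average time budget per transaction

print(updates_per_sec)      # 60.0
print(required_rate)        # 120.0
print(round(budget_ms, 1))  # 8.3
```

The 8 ms figure in the post is this 8.3 ms budget, rounded down; it is an average budget across all transactions, not a hard per-request limit.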
Joined: 19 Jan 07 · Posts: 1179

> There are 1.8 million hosts that S@H has attached [...] So the system has to be able to actually handle 120 updates per second. That means that you have on average 8 milliseconds to deal with each transaction.

With those numbers, my idea surely doesn't scale. Like, at all :)
Joined: 29 Aug 05 · Posts: 147

> There are 1.8 million hosts that S@H has attached [...] So the system has to be able to actually handle 120 updates per second. That means that you have on average 8 milliseconds to deal with each transaction.

Speeding up database access is one of the things S@H has worked hardest on. A few other notes: each WU has to be verified (another DB access) and created (another DB access); each view of a forum thread is another DB access, and each post is two more; generating the stats dump is a whole batch of DB accesses. S@H will take any help they can get in reducing the number of database hits. Adding more database hits is just not going to happen for them: the resend of ghost tasks added enough database access that it had to be dropped. Adding a mandatory call-in to check for work once per day or so is just not going to happen, because of the numbers.

Now, given that the client is on its own, what is the best compromise that we can come up with? I believe that it is the one that is currently coded (nobody is perfectly happy with it, but there are times when a compromise is the best overall solution).
Joined: 29 Aug 05 · Posts: 147

> Now, given that the client is on its own, what is the best compromise that we can come up with? I believe that it is the one that is currently coded (nobody is perfectly happy with it, but there are times that a compromise is the best overall solution).

OK, so CPDN is running solo for a year in EDF (earliest-deadline-first mode). Should we nag the user daily about the LTD (long-term debt) that is correctly accumulating? I think that would be a bad idea.
Joined: 29 Aug 05 · Posts: 147

> Now, given that the client is on its own, what is the best compromise that we can come up with? I believe that it is the one that is currently coded (nobody is perfectly happy with it, but there are times that a compromise is the best overall solution).

No. The problem is that CPDN takes over for a year, and all of the other projects gain LTD. BTW, this is how resource shares are honored over the long term. Warning the user that corrective action needs to be taken is the same as telling the user to intentionally violate his own resource shares.
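The "other projects gain LTD" mechanism can be illustrated with a toy accounting model. This is a deliberately simplified sketch: the function name, the per-step accounting, and the zero-mean normalization are illustrative assumptions, not BOINC's actual debt algorithm:

```python
def update_debts(debts, shares, ran, dt):
    """One step of a simplified long-term-debt model: every attached
    project earns debt in proportion to its resource share, the
    project that actually ran pays back the CPU time it used, and the
    debts are then re-centered to sum to zero."""
    total_share = sum(shares.values())
    for p in debts:
        debts[p] += dt * shares[p] / total_share  # entitled CPU time
        if p == ran:
            debts[p] -= dt                        # CPU time consumed
    mean = sum(debts.values()) / len(debts)
    for p in debts:
        debts[p] -= mean                          # keep debts zero-sum
    return debts
```

Running it with two equal-share projects while only one ever executes shows the effect described above: the running project's debt goes steadily negative while the idle project's debt grows, so the idle project is owed time once it has work again.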
Joined: 5 Oct 06 · Posts: 5142

> The way I understand the problem (and maybe I don't understand it fully), CPDN accumulating debt while running solo for a year won't cause the user any real grief during that year so there is no need for any warning while CPDN is running solo. If I understand correctly, the user runs into trouble when s/he turns on other projects. At that point BOINC could advise the user that corrective action needs to be taken. If corrective action means resetting the debts and/or adjusting shares then so be it.

The 'problem' often manifests itself the other way round: the user has been running a mix of projects, got used to the task-switching behaviour, and then wonders why it stops when they add CPDN to the mix.

@JM7: I think you're getting this the wrong way round. BOINC has defined that resource shares are a long-term commitment, and written code to match. Now you are trying to force users to accept that paradigm. Why should users not also choose which projects to support with their CPU cycles on a short-term basis? That is just as valid a contribution to the science.
Joined: 29 Aug 05 · Posts: 147

> The way I understand the problem (and maybe I don't understand it fully), CPDN accumulating debt while running solo for a year won't cause the user any real grief during that year [...] If corrective action means resetting the debts and/or adjusting shares then so be it.

Because of deadlines. Meeting deadlines can disrupt the smooth working of short-term resource shares. BOINC does its best to meet the resource shares in the short term as well as the long term, but meeting deadlines is a higher-priority goal than short-term resource share. If meeting deadlines were not the higher priority, much work would be returned late, and the screams would be even louder: "Why am I not getting credit?" Once you start trying to meet deadlines, resource-share balancing has to be long term.
Joined: 16 Apr 06 · Posts: 386

... If the semantics are adjusted slightly to "has any work been given out for this platform recently?", then the answer could be cached by Apache and regenerated every hour or so. It isn't necessary to know whether there is work available right at this instant, just whether work has generally been available. So the database hit would potentially be one query per hour per platform, regardless of how many clients are connected to the project. There would be no need to authenticate clients, or even to go anywhere near the DB, except for these hourly refreshes. Such is the power of mod_proxy :-) (or alternatively mod_cache, mod_file_cache, etc.).
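The cached-answer idea can be sketched in a few lines. The names `had_recent_work` and `query_db`, and the one-hour TTL, are illustrative assumptions rather than any real BOINC server interface; in practice the post proposes doing this at the Apache layer (mod_proxy, mod_cache) rather than in application code:

```python
import time

CACHE_TTL = 3600.0  # regenerate the answer at most once per hour

_cache = {}  # platform -> (expires_at, had_work)

def had_recent_work(platform, query_db, now=None):
    """Answer "has work been given out for this platform recently?"
    from a local cache, touching the database only when the cached
    entry has expired. query_db stands in for the one real DB query."""
    now = time.time() if now is None else now
    entry = _cache.get(platform)
    if entry is None or entry[0] <= now:
        entry = (now + CACHE_TTL, query_db(platform))
        _cache[platform] = entry
    return entry[1]
```

The point of the design is exactly what the post says: the cost is one query per platform per TTL period, independent of how many clients ask.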
Joined: 29 Aug 05 · Posts: 147

... And now I present the other side, which has been screaming at me: there are people who say that not calculating LTD in ALL cases violates their resource shares. There is also still the problem of clients that are not connected for long stretches of time.
Joined: 17 Apr 08 · Posts: 22

Since I started this thread several days ago, all of my projects "woke up" and started asking for work again. I did nothing apart from resetting CPDN once, and just let things take their course. Now all projects (except orbit@home) have asked for and received work units and are happily crunching away. I still have no clue what happened, but as I read the posts it's apparent that several of you do, and I appreciate your assistance. My most sincere thanks to all of you who offered suggestions and information (I am particularly interested in the detach/reattach suggestion), even though I freely admit I am not as up on the mechanics of BOINC as I could be.

Bob Graham
Joined: 29 Aug 05 · Posts: 147

> Since I started this thread several days ago, all of my projects "woke up" and started asking for work again. I did nothing apart from resetting CPDN once and just let things take their course. Now all projects (except orbit@home) have asked for and received work units and are happily crunching away.

Most likely CPDN was in the middle of a very long task that would either not finish on time or would finish very close to the deadline. That causes work fetch to stop for all projects until the situation clears up (the exceptions are keeping other CPUs busy and, in the most recent versions, keeping the queue full).
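The deadline check described above can be modeled with a toy earliest-deadline-first simulation. This is a simplified illustration, not the client's actual scheduler code; the function name and the `(remaining, deadline)` task representation are assumptions of the sketch:

```python
def work_fetch_allowed(tasks, now, cpus=1):
    """Simulate running pending tasks earliest-deadline-first on
    `cpus` processors; if any task would finish after its deadline,
    the client stops fetching new work for all projects.
    `tasks` is a list of (remaining_cpu_seconds, deadline) pairs."""
    finish = [now] * cpus  # time at which each processor is next free
    for remaining, deadline in sorted(tasks, key=lambda t: t[1]):
        finish.sort()
        finish[0] += remaining  # run on the soonest-free processor
        if finish[0] > deadline:
            return False        # this deadline is at risk
    return True
```

A CPDN-style task with more remaining CPU time than time left before its deadline trips the check, which matches the behavior the poster describes: work fetch stops everywhere until the at-risk task is cleared up.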
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.