Message boards : BOINC client : can't download new work units
Joined: 29 Aug 05 · Posts: 147

Back to polling, then. BOINC does not normally poll the projects: the server code sends no information about whether or not it has work unless the client has asked for work, and if the client asks for work, the work is simply sent. Some projects cannot take any significant number of additional server requests, as their databases are already heavily loaded. The restrictions: no extra server contacts (the client is on its own on this one), and resource share should be honored as well as possible.

BOINC WIKI
Joined: 19 Jan 07 · Posts: 1179

> The server code does not send any information about whether or not it has work unless the client has asked for work. If the client asks for work, it is just sent.

I believe his suggestion was to modify the server too, to allow a "do you have any work?" request.

> Some projects cannot take any significant number of additional server requests as the databases are pretty badly loaded already.

That's an implementation issue. Rom Walton has a blog post on why returning results immediately is bad for database load, which shows the database queries needed by a scheduler request. If the scheduler weren't a CGI, a lot of state (such as user information) could be cached in memory instead of being re-read from the database each time.

I have some ideas for a whole new scheduler protocol that would lower the number of DB queries needed (from Rom's post: authentication, the platform check, and the prefs check would be done less often) and give the scheduler a lot more power, by letting the client know the properties of available work before actually accepting it. It could potentially also offer "push" features: the server letting clients know when work is available, instead of waiting for clients to ask for it, which is useful for low-latency computing. Of course, there are issues with implementation complexity (particularly time to implement) and compatibility...
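The post only sketches this "know the work's properties before accepting it" idea in outline. As an illustration, here is a hypothetical client-side selection step; the `WorkOffer` fields, the `choose` function, and all numbers are invented for this sketch and are not part of any actual BOINC protocol:

```python
from dataclasses import dataclass

@dataclass
class WorkOffer:
    app: str            # application name
    est_flops: float    # estimated computation size of the task
    deadline_s: float   # seconds until the report deadline

def choose(offers, free_cpu_s, flops_per_s):
    """Accept only the offers this host can finish both within its
    free CPU time and before each offer's deadline, considering the
    most urgent offers first."""
    accepted = []
    for o in sorted(offers, key=lambda o: o.deadline_s):
        need = o.est_flops / flops_per_s  # CPU seconds this task needs
        if need <= free_cpu_s and need <= o.deadline_s:
            accepted.append(o)
            free_cpu_s -= need
    return accepted
```

With a scheme like this the client could decline tasks it cannot finish on time before they are ever assigned, instead of aborting or missing deadlines after the fact.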
Joined: 29 Aug 05 · Posts: 147

> The server code does not send any information about whether or not it has work unless the client has asked for work. If the client asks for work, it is just sent.

Authentication, the platform check, and the prefs check are done once per connection as far as I know. They must be done once per connection, as each connection is from a different client. The projects are already keeping large amounts of the active DB in memory.

There are 1.8 million hosts attached to S@H (admittedly, some of these are no longer doing work). Assume that 864,000 hosts are actually attached on a regular basis, and that on average each of these contacts the server 6 times per day. That is 60 updates PER SECOND (note that even once per day would be 10 contacts per second). This is the root of the database problems at S@H: the pipeline has to be able to handle at least this number of contacts. One of the results of queueing theory is that a system that is more than 50% loaded is headed for trouble, so the system has to be able to handle 120 updates per second. That means you have, on average, about 8 milliseconds to deal with each transaction.
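The arithmetic in the post above can be reproduced directly; every input is the post's stated assumption, not a measured value:

```python
# Back-of-the-envelope scheduler load estimate from the post.
hosts = 864_000          # hosts assumed attached on a regular basis
contacts_per_day = 6     # assumed average server contacts per host
seconds_per_day = 86_400

updates_per_sec = hosts * contacts_per_day / seconds_per_day
# Queueing-theory rule of thumb: stay under 50% utilization,
# so provision for twice the expected arrival rate.
required_rate = updates_per_sec * 2
budget_ms = 1000 / required_rate  # average time budget per transaction

print(updates_per_sec)      # 60.0
print(required_rate)        # 120.0
print(round(budget_ms, 1))  # 8.3
```

The 8 ms figure in the post is this 8.3 ms budget, rounded down; it is an average budget across all transactions, not a hard per-request limit.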
Joined: 19 Jan 07 · Posts: 1179

> There are 1.8 million hosts that S@H has attached [...] So the system has to be able to actually handle 120 updates per second. That means that you have on average 8 milliseconds to deal with each transaction.

With those numbers, my idea surely doesn't scale. Like, at all :)
Joined: 29 Aug 05 · Posts: 147

> There are 1.8 million hosts that S@H has attached [...] So the system has to be able to actually handle 120 updates per second. That means that you have on average 8 milliseconds to deal with each transaction.

Speeding up database access is one of the things S@H has worked hardest on. A few other notes: each WU has to be verified (another DB access) and created (another DB access); each view of a forum thread is another DB access, and each post is two more; generating the stats dump is a whole batch of DB accesses. S@H will take any help they can get in reducing the number of database hits. Adding more database hits is just not going to happen for them: the resend of ghost tasks added enough database access that it had to be dropped. Adding a mandatory call-in to check for work once per day or so is just not going to happen, because of the numbers.

Now, given that the client is on its own, what is the best compromise that we can come up with? I believe that it is the one that is currently coded (nobody is perfectly happy with it, but there are times when a compromise is the best overall solution).
Joined: 29 Aug 05 · Posts: 147

> Now, given that the client is on its own, what is the best compromise that we can come up with? I believe that it is the one that is currently coded (nobody is perfectly happy with it, but there are times that a compromise is the best overall solution).

OK, so CPDN is running solo for a year in EDF (earliest-deadline-first mode). Should we nag the user daily about the LTD (long-term debt) that is correctly accumulating? I think that would be a bad idea.
Joined: 29 Aug 05 · Posts: 147

> Now, given that the client is on its own, what is the best compromise that we can come up with? I believe that it is the one that is currently coded (nobody is perfectly happy with it, but there are times that a compromise is the best overall solution).

No. The problem is that CPDN takes over for a year, and all of the other projects gain LTD. BTW, this is how resource shares are honored over the long term. Warning the user that corrective action needs to be taken is the same as telling the user to intentionally violate his own resource shares.
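The "other projects gain LTD" mechanism can be illustrated with a toy accounting model. This is a deliberately simplified sketch: the function name, the per-step accounting, and the zero-mean normalization are illustrative assumptions, not BOINC's actual debt algorithm:

```python
def update_debts(debts, shares, ran, dt):
    """One step of a simplified long-term-debt model: every attached
    project earns debt in proportion to its resource share, the
    project that actually ran pays back the CPU time it used, and the
    debts are then re-centered to sum to zero."""
    total_share = sum(shares.values())
    for p in debts:
        debts[p] += dt * shares[p] / total_share  # entitled CPU time
        if p == ran:
            debts[p] -= dt                        # CPU time consumed
    mean = sum(debts.values()) / len(debts)
    for p in debts:
        debts[p] -= mean                          # keep debts zero-sum
    return debts
```

Running it with two equal-share projects while only one ever executes shows the effect described above: the running project's debt goes steadily negative while the idle project's debt grows, so the idle project is owed time once it has work again.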
Joined: 5 Oct 06 · Posts: 5142

> The way I understand the problem (and maybe I don't understand it fully), CPDN accumulating debt while running solo for a year won't cause the user any real grief during that year so there is no need for any warning while CPDN is running solo. If I understand correctly, the user runs into trouble when s/he turns on other projects. At that point BOINC could advise the user that corrective action needs to be taken. If corrective action means resetting the debts and/or adjusting shares then so be it.

The 'problem' often manifests itself the other way round: the user has been running a mix of projects, got used to the task-switching behaviour, and then wonders why it stops when they add CPDN to the mix.

@JM7: I think you're getting this the wrong way round. BOINC has defined that resource shares are a long-term commitment, and written code to match. Now you are trying to force users to accept that paradigm. Why should users not also choose which projects to support with their CPU cycles on a short-term basis? That is just as valid a contribution to the science.
Joined: 29 Aug 05 · Posts: 147

> The way I understand the problem (and maybe I don't understand it fully), CPDN accumulating debt while running solo for a year won't cause the user any real grief during that year [...] If corrective action means resetting the debts and/or adjusting shares then so be it.

Because of deadlines. Meeting deadlines can disrupt the smooth working of short-term resource shares. BOINC does its best to meet the resource shares in the short term as well as the long term, but meeting deadlines is a higher-priority goal than short-term resource share. If meeting deadlines were not the higher priority, much work would be returned late, and the screams would be even louder: "Why am I not getting credit?" Once you start trying to meet deadlines, resource-share balancing has to be long term.
Joined: 16 Apr 06 · Posts: 386

... If the semantics are adjusted slightly to "has any work been given out for this platform recently?", then the answer could be cached by Apache and regenerated every hour or so. It isn't necessary to know whether there is work available right at this instant, just whether work has generally been available. So the database hit would potentially be one query per hour per platform, regardless of how many clients are connected to the project. There would be no need to authenticate clients, or even to go anywhere near the DB, except for these hourly refreshes. Such is the power of mod_proxy :-) (or alternatively mod_cache, mod_file_cache, etc.).
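The cached-answer idea can be sketched in a few lines. The names `had_recent_work` and `query_db`, and the one-hour TTL, are illustrative assumptions rather than any real BOINC server interface; in practice the post proposes doing this at the Apache layer (mod_proxy, mod_cache) rather than in application code:

```python
import time

CACHE_TTL = 3600.0  # regenerate the answer at most once per hour

_cache = {}  # platform -> (expires_at, had_work)

def had_recent_work(platform, query_db, now=None):
    """Answer "has work been given out for this platform recently?"
    from a local cache, touching the database only when the cached
    entry has expired. query_db stands in for the one real DB query."""
    now = time.time() if now is None else now
    entry = _cache.get(platform)
    if entry is None or entry[0] <= now:
        entry = (now + CACHE_TTL, query_db(platform))
        _cache[platform] = entry
    return entry[1]
```

The point of the design is exactly what the post says: the cost is one query per platform per TTL period, independent of how many clients ask.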
Joined: 29 Aug 05 · Posts: 147

... And now I present the other side, which has been screaming at me: there are people who say that not calculating LTD in ALL cases violates their resource shares. There is also still the problem of clients that are not connected for long stretches of time.
Joined: 17 Apr 08 · Posts: 22

Since I started this thread several days ago, all of my projects "woke up" and started asking for work again. I did nothing apart from resetting CPDN once, and just let things take their course. Now all projects (except orbit@home) have asked for and received work units and are happily crunching away. I still have no clue what happened, but as I read the posts it's apparent that several of you do, and I appreciate your assistance. My most sincere thanks to all of you who offered suggestions and information (I am particularly interested in the detach/reattach suggestion), even though I freely admit I am not as up on the mechanics of BOINC as I could be.

Bob Graham
Joined: 29 Aug 05 · Posts: 147

> Since I started this thread several days ago, all of my projects "woke up" and started asking for work again. I did nothing apart from resetting CPDN once and just let things take their course. Now all projects (except orbit@home) have asked for and received work units and are happily crunching away.

Most likely CPDN was in the middle of a very long task that would either not finish on time or would finish very close to the deadline. That causes work fetch to stop for all projects until the situation clears up (the exceptions are keeping other CPUs busy and, in the most recent versions, keeping the queue full).
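The deadline check described above can be modeled with a toy earliest-deadline-first simulation. This is a simplified illustration, not the client's actual scheduler code; the function name and the `(remaining, deadline)` task representation are assumptions of the sketch:

```python
def work_fetch_allowed(tasks, now, cpus=1):
    """Simulate running pending tasks earliest-deadline-first on
    `cpus` processors; if any task would finish after its deadline,
    the client stops fetching new work for all projects.
    `tasks` is a list of (remaining_cpu_seconds, deadline) pairs."""
    finish = [now] * cpus  # time at which each processor is next free
    for remaining, deadline in sorted(tasks, key=lambda t: t[1]):
        finish.sort()
        finish[0] += remaining  # run on the soonest-free processor
        if finish[0] > deadline:
            return False        # this deadline is at risk
    return True
```

A CPDN-style task with more remaining CPU time than time left before its deadline trips the check, which matches the behavior the poster describes: work fetch stops everywhere until the at-risk task is cleared up.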
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.