259990 – sci-misc/boinc-6.4.5-r1 discontinues processing when not connected to net.

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	New packages
Hardware:	All Linux

Importance:	Normal enhancement
Assignee:	Gentoo Science Related Packages

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-02-23 10:14 UTC by John (EBo) David
Modified:	2013-06-30 20:37 UTC
CC List:	2 users

See Also:
Package list:
Runtime testing required:	---

Description John (EBo) David 2009-02-23 10:14:11 UTC

I have a laptop which I often connect via wireless.  After rebooting I discovered that boinc did not start but was pending a net connection.  At that point I had two tasks processing and another downloaded waiting; so I had hours of work I could be doing.

To work around this I simply commented out "need net" in the depend section of the /etc/init.d/boinc init script.

It is clear why this dependency was added, but it breaks the ability of boinc to work with an intermittent connection.   Running boinc without this connection causes the scheduler to ping once a minute to see if a connection has been established.  I am not sure what the best way to resolve this situation is, but personally I would like to be able to continue processing while the machines are not connected.

  EBo --

Comment 1 Tomáš Chvátal (RETIRED) gentoo-dev

2009-02-24 21:21:34 UTC

Well, I understand your point.
But problems with boinc and no net are more common, so the dep was added.
You can start ANY net interface to get boinc running (for example, I have net.eth0 (cable) started all the time, even if there is no cable plugged in, so I can "boinc").

Comment 2 John (EBo) David 2009-02-24 22:13:04 UTC

Actually, that is not working for me.  I have eth0 started up and waiting for a connection (via ifplugd), and lo with a loop-back.  This condition holds true as long as one of the devices does not have a valid ipv4/6 net address (or at least that is the way it appears).  As a note, I do not start the address with a fall-back IP number since I have had that interfere with ifplugd's (re)connection.

It makes sense for you to leave it the way it is, but it does behave differently than I saw advertised somewhere.  Maybe this can be handled with some configuration (a REQUIRE_NET, or ALLOW_DISCONNECTED_RUN...)

Should we go ahead and change this to WONTFIX, or is there room for further discussion?

Comment 3 Tomáš Chvátal (RETIRED) gentoo-dev

2009-02-24 22:26:26 UTC

No, no, don't close it. I will try to think up something, but I can't promise it will be really soon (a few weeks or so).

Comment 4 John (EBo) David 2009-02-24 23:06:27 UTC

Fair enough...  I'll see if I can think of something too.

Thanks for considering this ;-)

Comment 5 John (EBo) David 2009-02-25 03:16:04 UTC

Tomáš,

More info...

I decided to take a look at what was going on with my WCG statistics since they did not seem quite right. I have been running DDDT in the background for the last week on my core2duo machine (with SMP running and two tasks basically running constantly). I should be returning 6 to 8 results/day. As it is I am returning 2 to 4...

Looking at my message log I see:

Tue Feb 24 20:48:42 2009|World Community Grid|Restarting task dddt0902b0075_100479_0 using dddt version 606
Tue Feb 24 20:48:42 2009|World Community Grid|Scheduler request failed: Couldn't resolve host name
Tue Feb 24 20:48:43 2009|World Community Grid|Task dddt0902b0084_100038_0 exited with zero status but no 'finished' file
Tue Feb 24 20:48:43 2009|World Community Grid|If this happens repeatedly you may need to reset the project.
Tue Feb 24 20:49:44 2009|World Community Grid|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 0 completed tasks
Tue Feb 24 20:49:49 2009|World Community Grid|Scheduler request completed: got 0 new tasks

It appears that I am dropping between 50-75% of all my results, and the problem appears to be related to dropped network connection.
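A quick way to put a number on this is to count the "no 'finished' file" lines in a saved copy of the message log. The following is only a sketch: it assumes the pipe-delimited layout shown in the excerpt above (timestamp|project|message), and the function name `count_lost_tasks` and the log path are made up for illustration.

```shell
#!/bin/sh
# Count, per project, the tasks that exited without a 'finished' file.
# Assumes the layout from the excerpt above: timestamp|project|message.
count_lost_tasks() {
    # $1: path to a saved copy of the BOINC message log (hypothetical path)
    # The dots in the pattern stand in for the single quotes around 'finished'.
    awk -F'|' '/exited with zero status but no .finished. file/ { lost[$2]++ }
               END { for (p in lost) printf "%s: %d\n", p, lost[p] }' "$1"
}

# Run against a log file if one was given on the command line.
if [ -n "${1:-}" ]; then
    count_lost_tasks "$1"
fi
```

Comparing that count with the number of "Restarting task" lines over the same period would give a rough drop rate, assuming every lost task leaves this message behind.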

My current network connection is flaky at best, and will continue to be so until the new cable is installed to the pole (due to hurricane damage). I expect this problem to continue for maybe another month. But this may provide us a stress-test situation of computing on intermittent connectivity.

Anyway, I thought I would mention this for something else to add to your head-scratching. It might help point to other scheduling scenarios and tests.

Hope this helps, and best regards.

EBo --

Comment 6 Sebastian Günther 2009-02-27 12:47:54 UTC

Well, one could just edit /etc/rc.conf (baselayout-2 !) this way:

# Do we allow any started service in the runlevel to satisfy the dependency
# or do we want all of them regardless of state? For example, if net.eth0
# and net.eth1 are in the default runlevel then with rc_depend_strict="NO"
# both will be started, but services that depend on 'net' will work if either
# one comes up. With rc_depend_strict="YES" we would require them both to
# come up.
rc_depend_strict="NO"

Then net.lo is enough to satisfy the net dependency. The downside is that some other net-dependent services may fail to start (e.g. ntp-client).
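To make the rule in that rc.conf comment concrete, here is a toy model of it in shell. This is not OpenRC's actual code; `net_satisfied` and the "up"/"down" states are invented for illustration, and it only mirrors the wording of the comment quoted above.

```shell
#!/bin/sh
# Toy model of the rc_depend_strict rule quoted above (not OpenRC's real logic).
# Usage: net_satisfied STRICT STATE...   where each STATE is "up" or "down"
net_satisfied() {
    strict=$1; shift
    up=0; total=0
    for state in "$@"; do
        total=$((total + 1))
        if [ "$state" = "up" ]; then
            up=$((up + 1))
        fi
    done
    if [ "$strict" = "YES" ]; then
        # Strict: every net service in the runlevel must have come up.
        [ "$total" -gt 0 ] && [ "$up" -eq "$total" ]
    else
        # Not strict: any single started net service is enough.
        [ "$up" -ge 1 ]
    fi
}

# net.lo is up, net.eth0 is still waiting for a cable:
net_satisfied NO  up down && echo "NO:  net satisfied"
net_satisfied YES up down || echo "YES: net not satisfied"
```

With the non-strict setting, the loopback interface alone satisfies 'net', which is why boinc would start without a cable or wireless link.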

Comment 7 John (EBo) David 2009-02-28 07:47:45 UTC

baselayout-2 is currently marked as unstable on my distro (gentoo).  If there are no other solutions then I guess I will not participate when my laptop is connected wirelessly (since I appear to be losing roughly 50% of the results), and/or wait until baselayout-2 becomes stable.  I have too many things going on at the moment to comfortably try that change.

Thanks,

  EBo --

Comment 8 Sebastian Günther 2009-02-28 14:33:08 UTC

There is a similar approach in baselayout 1, but I don't recall the exact file and conf var.

Comment 9 John (EBo) David 2009-03-04 04:55:25 UTC

Sorry for the long delay...

(In reply to comment #8)
> There is a similar approach in baselayout 1, but I don't recall the exact file
> and conf var.

RC_NET_STRICT_CHECKING="no"

appears to be the equivalent.  I have not fully tested it though, but it seems to work as expected.

I am still losing maybe 50% of my tasks, but that is unrelated to this.  I will see if I can track it down and either post another bug or see if I can discuss it on IRC or something.

Thanks

  EBo --

Comment 10 John (EBo) David 2009-04-30 12:51:03 UTC

I was going through my open issues and came back across this.

For the last couple of months I discontinued using boinc because I am still dropping 50-75% of my results.  I have spent a little more time trying to figure out what is going on and have the following to add.  As a note, I have only briefly looked at the code and cannot offer a patch, but can discuss some of the overall behaviour:

1) Some time within the last hour or so, the system is notified to download the next packet.

2) as soon as a job is finished, the new job is loaded and run.

3) when the new job is started the server is contacted and initial handshaking is done to announce that a potential result is done -- however the result is not uploaded.

4) the job continues to run until it is ready to load another job (see 1 above).  At this point the previous results are finally loaded.

Now if I lose my net connection any time in the 4 to 6 hours between steps 3) and 4), I seem to lose that job entirely.  My guess at this point is that the scheduler needs a loop around the hand-shake/upload portion of step 3) which runs until all solutions which are ready are uploaded.
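The loop suggested here could look roughly like the sketch below. It is purely illustrative: `retry_until_uploaded` and its arguments are hypothetical and are not taken from the BOINC client, whose real upload logic has not been examined in this report.

```shell
#!/bin/sh
# Hypothetical sketch of "keep trying until the result is uploaded".
# Usage: retry_until_uploaded MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
retry_until_uploaded() {
    max_tries=$1; delay=$2; shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0        # upload confirmed; only now is it safe to discard the result
        fi
        try=$((try + 1))
        sleep "$delay"      # link is probably down; keep the result and wait
    done
    return 1                # still not uploaded; the caller should keep the result for later
}
```

A caller would pass whatever command performs the upload, for example `retry_until_uploaded 5 60 upload_result some_result_file`, where `upload_result` is again a made-up name. The point of the design is that a failed attempt never causes the finished result to be thrown away.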

Hope this helps.

  EBo --

Comment 11 John (EBo) David 2009-06-07 03:07:49 UTC

I just reviewed this again, and the thing I forgot to mention before is that I have a dual core machine and allow 100% usage when not otherwise occupied.  The problem appears to surround the fact that there are two processes running simultaneously.  I am running a new test case with only 49% of the processing space (i.e. about 98% of a single processor).  I'll report the behaviour I find...
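For reference, the arithmetic behind the percentage setting on a two-core machine can be checked with a small helper. `cores_allowed` is a made-up name, and how the BOINC client rounds the resulting fraction is not established anywhere in this report.

```shell
#!/bin/sh
# cores_allowed NCPUS PERCENT -> fractional number of cores the setting permits
cores_allowed() {
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.2f\n", n * p / 100 }'
}

cores_allowed 2 49    # just under one full core
cores_allowed 2 51    # just over one full core
```

A setting just under 50% therefore allows slightly less than one core, which may matter if the client rounds the figure down to a whole number of tasks.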

  EBo --

Comment 12 John (EBo) David 2009-06-07 03:10:46 UTC

Ok... that's weird.  At 49%, boinc_gui tells me that I "Won't get new processes".  We will see what I get with 51%...

Comment 13 John (EBo) David 2009-06-07 05:04:27 UTC

Ok... 51% runs, but only after I cancel one of the two processes it wants to run on this two core machine...

I think that the scheduler needs to be re-examined with respect to multi-core/multi-processor machines.

I'll report back what I see with 51% utilization...

  EBo --

Comment 14 Tomáš Chvátal (RETIRED) gentoo-dev

2009-06-07 20:13:15 UTC

Just a quick note: I am reading through this, but still have no clue why it hate you.

Comment 15 John (EBo) David 2009-06-07 22:01:56 UTC

(In reply to comment #14)
> Just quick note, i am reading through this, but still have no clue why it hate
> you.

When I first read this I thought you meant that "you" hated me not "boinc", and I was REALLY confused?!?!?!  I was trying to figure out how I had offended you... Ok I get it now ;-)

When I set the multi-use CPU to >50.00 (set to 51%) it seems to work correctly, running only a single process.  It was strange that it choked like it did when I gave it 49%.  My guess is that there is a bug in the scheduler which marks a result as bad if it cannot confirm each step (send, waiting for result, waiting for confirmation, etc.).  The problem of dropping results seems to happen only when I lose my net connection in the middle of reporting the results and getting credit for them.

BTW, the fan on my laptop is about to die, so until I can fix or replace it I will be shutting off Boinc -- to keep the heat down.  I'll let the current job run and post tonight, but I need to keep things cool...

Thanks for all your help, and with any luck we can get this sorted out sometime...

Comment 16 Martin Doucha 2010-08-07 23:40:20 UTC

What exactly are those problems with Boinc and no net? I've run perfectly fine for years with "after net" instead of "need net" in the init script.
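The change described here touches only the depend() function of /etc/init.d/boinc. The fragment below is a sketch, not the shipped script: the real depend() may list other dependencies, and only the net line is shown. In OpenRC terms, `need` is a hard dependency (the service will not start without it), while `after` only affects start order.

```shell
# Sketch of the depend() function in /etc/init.d/boinc (other entries omitted).
depend() {
    # need net    # hard dependency: boinc would refuse to start without a net service
    after net     # ordering only: start after net if it is being started, but do not require it
}
```

With `after net`, the client starts at boot whether or not an interface has come up, so already downloaded tasks can keep processing while the machine is offline.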

Comment 17 John (EBo) David 2010-08-08 10:32:21 UTC

Martin,

It has been over a year since I looked at this.  "after net" should work fine, but last time I checked I was still dropping completed work packets more than 50% of the time.  This had something to do with a handshaking failure when reporting results while the net was down; the results were being thrown away instead of being uploaded later.  I do consider the dropped results a bug.  I mean, why spend the electricity and wear-and-tear when the results are just being pitched?

Unfortunately, the laptop I was using for this has developed *issues* with the fan, and I have since throttled the processor to keep it from overheating (the cost of replacing the fan is 25% of replacing the entire laptop), so I am not running boinc at the moment...

Hope this helps.

Comment 18 trogdog 2012-04-18 12:19:32 UTC

Is this bug still relevant?

Comment 19 Justin Lecher (RETIRED) gentoo-dev

2013-06-30 14:47:49 UTC

Is this still present in version 7.2.0?

Comment 20 John (EBo) David 2013-06-30 20:37:33 UTC

(In reply to Justin Lecher from comment #19)
> Is this still present in version 7.2.0?

It will take a little while to test this.  Will try to report back in the next week or two.