Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!

Bug 397875

Summary: sys-apps/openrc-9999 commit f583030e3cbf: bad assumptions break net
Product: Gentoo Hosted Projects Reporter: Duncan <1i5t5.duncan>
Component: OpenRCAssignee: OpenRC Team <openrc>
Status: VERIFIED FIXED    
Severity: blocker    
Priority: Normal    
Version: unspecified   
Hardware: All   
OS: Linux   
Whiteboard: openrc:oldnet
Package list:
Runtime testing required: ---

Description Duncan 2012-01-06 12:03:44 UTC
This is a nasty one as it entirely broke networking here, including loopback/net.lo, thus the severity boost.  I'd recommend reverting f583030e3cbf while a fix is devised.

The conceptual issue appears to be that the net-modules functionality is rather less than intuitive, to the point SpanKY could break networking entirely without a clue he was touching anything even marginally sensitive, and even with the single small commit and a very broken net as a result to point the way, it took me quite some time banging my head against code before it suddenly hit me how it was supposed to work, and thus what went wrong.  Perhaps some warning comments in the modules about the danger of messing with this might help prevent other devs making similar mistakes in the future.  (FWIW, the comment in the xdm initscript is a great example, by which I too have been enlightened.)

The technical root issue is this:  In _modules_load in net.lo, the net module program directive (which I haven't found documentation for, BTW, the runscript or rc manpages might be appropriate...) does NOT simply conveniently provide the path to the net module base-program-binary as it might appear on first glance, and an empty program entry does NOT disable the module as I (and apparently SpanKY) intuitively expected it to do.

Rather, a missing program directive is interpreted by the logic in net.lo as disabling THAT PART OF THE MODULE ONLY, while processing and loading of the module in general and other triggers in particular continue.  Specifically, if a program entry is NON-EMPTY, the file it points to is tested executable and the entire module disabled if it is not executable, **BUT**, if the program entry is EMPTY, the test for executability is skipped and module loading continues (as it needs to for modules such as dhcpcd), loading other triggers such as program_start, program_stop, (interface)_up, _down, etc.

With that explanation it should be possible to see what went wrong in that commit.  I have net-tools/ifconfig installed but not iproute2/ip.  Without ip installed, the "which" introduced by that commit returns an empty string, thus changing a hard-coded path (non-empty program directive) to an empty string and thus an empty program directive.

The result is that the test for hard-coded-path executable and skip to the next module without further processing if the executable isn't found, now gets skipped entirely because the path is empty, so the rest of the module can load other functions and triggers.  The result of /that/ was that the rest of the iproute2 module was loaded, and when the interface tried to come up, the iproute2 _up trigger triggered, and of course failed because iproute2 isn't installed, thus failing the entire initscript.  This happened for both net.lo and net.eth0, symlinked to it.

But it isn't as simple as changing the general module-loading logic to bail out entirely on program-empty, because modules such as dhcpcd have an empty program directive, but a valid program_start directive.  Bailing out on program-empty would bail out on dhcpcd without loading the valid program_start.


What I propose instead is something similar to this commit at the module level (as opposed to the net.lo script level), but with the "which" exit code checked or the output string tested for empty then and there.  If "which" returns an error (or empty), set some pre-designated arbitrary value instead of "", then change the code in net.lo to treat that pre-designated value as it currently does a non-empty value that tests as non-executable -- skip further processing with a "continue" to consideration of the next module.

It /could/ be as simple as setting a known-to-not-exist path such as /dev/null/invalid_file instead of setting empty on a "which" fail.  That should "just work" but is arguably a bit of a hack.

Meanwhile, while researching this, I ran into additional difficulties due to not finding any shell_var (as an rc alias) documentation either.  I ended up having to do a search in git-whatchanged for shell_var, and then a git-show on the commit that converted it from a shell code to a C code function, just to figure out what it was actually doing.  Again, documentation in the manpages would have been rather useful!  It's quite likely there's other as yet not clearly documented (at least where I could find it) functionality, too.  (Ask if you want me to file the docs requests as separate bugs.  Mainly I'm too tired to properly file them now, but I'll forget if I don't write it down, so I'm punting.)
Comment 1 William Hubbs gentoo-dev 2012-01-06 20:11:30 UTC
This note is for the rest of the openrc team... I'm looking into this real quick.
Comment 2 William Hubbs gentoo-dev 2012-01-06 21:20:52 UTC
I confirmed this myself as well and reverted the commit you suggested in
commit 84aa4ba.

I am about to release openrc-0.9.8, so thanks for the catch.

I do have another way I'm thinking about doing this, but, I'll look into
it after the release.

Thanks much for the report.

William
Comment 3 Duncan 2012-01-08 21:22:50 UTC
Unfortunately, while reverting the commit for release fixed the problem temporarily, it's baackkk! =:^(


commit 61e05331d14a08fa909526fda15470a1ca4927dd reintroduces the problem... somehow!  I'm on d02d3af02e4254b04949de546c5d53af82cc2fc2 now, with the problem worked around as below.

I don't have time to dig further ATM, but the program line definitely has something to do with it, as the following in iproute2_depend (unindented are my changes, note that tabs are converted to spaced due to my select/paste method) works:

iproute2_depend()
{
        local x
        x=$(_which ip)
#       [ -z "$x" ] && return 1
        program $x
program /dev/null/does_not_exist
        provide interface
        after ifconfig
}

So simply returning 1 and not finishing the depend doesn't work, but deliberately setting program to an invalid location DOES work.

Either the error return is being ignored and the module is loaded anyway, or something else is loading the module, somehow.  Setting an invalid program line works, but not setting it at all doesn't.

Without double-checking, I'd guess that there's nothing catching that "return 1" error (which I had to comment out to get the stuff below it to process, so it's definitely bailing out of the depend as the logic suggests it should), so the thing is still loaded... unless the program line is set invalid.
Comment 4 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2012-01-09 00:28:49 UTC
Please retest 9999, I've got a new commit in.

http://git.overlays.gentoo.org/gitweb/?p=proj/openrc.git;a=commit;h=4255ba175bd7c3ccadc5bc894d00ccb844467067

If you have network issues with it, I want to see the output of:
/etc/init.d/net.$IFACE --verbose start
Comment 5 Duncan 2012-01-09 10:39:18 UTC
(In reply to comment #4)
> Please retest 9999, I've got a new commit in.
> 
> http://git.overlays.gentoo.org/gitweb/?p=proj/openrc.git;a=commit;h=4255ba175bd7c3ccadc5bc894d00ccb844467067
> 
> If you have network issues with it, I want to see the output of:
> /etc/init.d/net.$IFACE --verbose start

Ahh!  Someone that can think like openrc's net script does!
Verified/fixed! =:^)

FWIW I have a "portage git-helper" script that for git-based overlays or live-packages, makes it MUCH easier to pull and then check updates (git-whatchanged and if I see anything interesting, git-show) before I do the actual build.  For openrc and a couple other live packages I follow upstream very closely on, I routinely use that script to check updates since the last pull, and paid special attention this time due to this bug.

Reading your commit I saw that for the first time since the triggering commit, someone was actually following the net.lo script _load_modules logic and working with it, not trying to fight it.  As such, I was quite confident of the fix (barring some trivial typo/thinko) even before I built and tried it.  Sure enough, merged and rebooted, and networking (both net.lo and net.eth0 interfaces) came up exactly like it's supposed to do!  =:^)

So we have the old logic working again, but now with the additional flexibility of being able to find the path (and/or builtins) dynamically! And I can proudly point to that new flexibility as something I helped debug!  =:^)

Thanks!  The new flexibility will be a tremendous help with the coming /usr dustup, etc.  Plus even aside from that, a little extra flexibility is a good thing!