Bug 481320 - Hardlink issues with grsec's RBAC when Portage merges files from TMPDIR to rootfs on the same ext4 filesystem.
Status: CONFIRMED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Unclassified
Hardware: All Linux
Importance: Normal normal
Assignee: Portage team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 193766
Reported: 2013-08-16 14:34 UTC by Piotr Karbowski (RETIRED)
Modified: 2017-08-09 08:00 UTC (History)

See Also:
Package list:
Runtime testing required: ---


Attachments
A proof of concept. (portage-no-hardlinks.patch,426 bytes, patch)
2014-02-11 17:17 UTC, Piotr Karbowski (RETIRED)
syncfs python script which may help to reproduce the bug (syncfs.py,550 bytes, text/plain)
2014-11-01 07:45 UTC, Zac Medico

Description Piotr Karbowski (RETIRED) gentoo-dev 2013-08-16 14:34:47 UTC
When Portage merges files from PORTAGE_TMPDIR on the same filesystem, it first hardlinks and then moves the files. The kernel gets confused by this mechanism, and Grsecurity's RBAC fails hard because of it.

The kernel still thinks that, for example, /usr/sbin/ntpd is within PORTAGE_TMPDIR, which results in errors like:

(root:U:/usr/sbin/ntpd) denied unlink of /run/openntpd.pid by /var/tmp/portage/net-misc/openntpd-3.9_p1-r4/image/usr/sbin/ntpd[ntpd:20899] uid/euid:0/0 gid/egid:0/0, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0 

While moving or copying is okay, hardlinking confuses the kernel because the inode number stays the same, and the results are terrible: the policy does not match, and so on.
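To make the inode point concrete, here is a minimal sketch that is independent of Portage and grsecurity. It only shows that a hardlink shares the inode number of the original file while a copy gets a new one. The helper name inode_report and the file names are made up for illustration.

```python
import os
import shutil


def inode_report(workdir):
    """Return inode numbers for a file, a hardlink to it, and a copy of it."""
    original = os.path.join(workdir, "original")
    with open(original, "w") as handle:
        handle.write("payload\n")

    # A hardlink is just a second name for the same inode.
    os.link(original, os.path.join(workdir, "linked"))

    # A copy is a new file and therefore gets its own inode.
    shutil.copy2(original, os.path.join(workdir, "copied"))

    return {
        name: os.stat(os.path.join(workdir, name)).st_ino
        for name in ("original", "linked", "copied")
    }
```

Calling inode_report() on a scratch directory should report the same number for "original" and "linked" and a different one for "copied".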

Also, lsof seems to have a similar issue:
sugoi [SSH] ~ # lsof | grep 'DEL.*lib'  
irssi      1193         piotr  DEL       REG              253,2              148168 /var/tmp/portage/dev-libs/glib-2.36.3-r2/image/usr/lib64/libglib-2.0.so.0.3600.3
irssi      1193         piotr  DEL       REG              253,2              148166 /var/tmp/portage/dev-libs/glib-2.36.3-r2/image/usr/lib64/libgmodule-2.0.so.0.3600.3

Not sure whether it is related, but it seems like it is.

Portage should have a feature like no-merge-hardlink to avoid hardlinks. It could also auto-enable it if /dev/grsec is present or /proc/self/status ends with 'RBAC:.*'.

FWIW, move or copy is no problem, as the inode number does change.

So far I hack movefile.py to not do hardlinks, which is a very hacky workaround.
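A rough sketch of what such auto-detection could look like. This is not existing Portage code; the function names are hypothetical, and the assumption that an RBAC-enabled kernel exposes a line starting with "RBAC:" in /proc/self/status is taken from the description above, not verified against grsecurity.

```python
import os


def status_mentions_rbac(status_text):
    """True if any line of a /proc/<pid>/status dump starts with 'RBAC:'."""
    return any(line.startswith("RBAC:") for line in status_text.splitlines())


def should_avoid_hardlinks(dev_grsec="/dev/grsec",
                           status_path="/proc/self/status"):
    """Guess whether grsecurity RBAC is around (heuristic, assumption-based)."""
    if os.path.exists(dev_grsec):
        return True
    try:
        with open(status_path) as handle:
            return status_mentions_rbac(handle.read())
    except OSError:
        # No procfs or unreadable status file: assume RBAC is absent.
        return False
```

Both paths are parameters so the check can be exercised against fake files without a grsec kernel.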


Reproducible: Always
Comment 1 Piotr Karbowski (RETIRED) gentoo-dev 2013-09-24 16:24:16 UTC
*friendliest of friendly bump*
Comment 2 Piotr Karbowski (RETIRED) gentoo-dev 2013-09-28 16:24:47 UTC
Full of sorrow upgrade.

[2355630.241248] grsec: (root:U:/) denied access to hidden file /var/log/everything by /var/tmp/portage/app-admin/metalog-3-r1/image/usr/sbin/metalog[metalog:1071] uid/euid:0/0 gid/egid:0/0, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
[2355630.274948] grsec: (root:U:/) denied unlink of /run/metalog.pid by /var/tmp/portage/app-admin/metalog-3-r1/image/usr/sbin/metalog[metalog:1071] uid/euid:0/0 gid/egid:0/0, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Comment 3 Piotr Karbowski (RETIRED) gentoo-dev 2014-02-11 17:17:32 UTC
Created attachment 370162 [details, diff]
A proof of concept.

Adding a proof of concept that gets the job done. It would be quite nice if Portage supported FEATURE=no-hardlinks.
Comment 4 Pavel Kazakov (RETIRED) gentoo-dev 2014-02-11 18:44:31 UTC
Taking a look
Comment 5 Pavel Kazakov (RETIRED) gentoo-dev 2014-02-12 19:54:57 UTC
The proof-of-concept code should prevent portage itself from creating any hard links, but I'm curious what happens if a package tries to create a hard link. Using your proof of concept code, can you try emerging a package that creates a hardlink (e.g. dev-vcs/git)?
Comment 6 SpanKY gentoo-dev 2014-02-13 08:07:44 UTC
this sounds like the kernel is broken.  why not report it to the grsec devs ?
Comment 7 Piotr Karbowski (RETIRED) gentoo-dev 2014-02-13 20:11:14 UTC
@Pavel: Would you kindly tell me what git hardlinks? I don't see any links besides symbolic ones in the git ebuild.

@Vapier: Spender is not willing to fix it on the grsecurity side (he did not agree that this is a grsec issue; rather, he thinks Portage does nasty things to merge files). Since movefile.py already has SELinux-specific if-logic, I think it wouldn't hurt to have a no-hardlink switch/feature.
Comment 8 Piotr Karbowski (RETIRED) gentoo-dev 2014-02-13 20:12:38 UTC
Follow-up.

@Pavel: The only issue is when Portage first hardlinks the temp file into the real filesystem and then moves one hardlink onto another; that confuses portage. Hardlinks as such do not break anything, and forcing a policy reload (gradm -R) 'fixes' the issue after the merge.
Comment 9 Pavel Kazakov (RETIRED) gentoo-dev 2014-02-14 18:16:30 UTC
(In reply to Piotr Karbowski from comment #7)
> @Pavel: Would you kindly tell me what git hardlinks? I don't see any links
> besides symbolic ones in the git ebuild.

I was advised on the IRC channel that it creates a hardlink in /usr/libexec/git-core.


(In reply to Piotr Karbowski from comment #8)
> Follow-up.
> 
> @Pavel: The only issue is when Portage first hardlinks the temp file into the
> real filesystem and then moves one hardlink onto another; that confuses
> portage. Hardlinks as such do not break anything, and forcing a policy
> reload (gradm -R) 'fixes' the issue after the merge.

I'm actually not too familiar with grsec's RBAC, so I'll have to see what vapier's take on this is
Comment 10 Piotr Karbowski (RETIRED) gentoo-dev 2014-09-06 08:07:46 UTC
*friendliest of friendly bumps*

Can we maybe get an additional FEATURE for emerge to not hardlink? This very issue strikes me in the back every time I forget to localpatch/epatch_user Portage on new deployments.
Comment 11 Anthony Basile gentoo-dev 2014-09-06 11:49:30 UTC
(In reply to Piotr Karbowski from comment #0)
> When Portage merges files from PORTAGE_TMPDIR on the same filesystem, it
> first hardlinks and then moves the files. The kernel gets confused by this
> mechanism, and Grsecurity's RBAC fails hard because of it.
> 

What's most likely happening here is that you have rbac rules that don't allow the hard linking.  You probably created your rules by full system learning with `gradm -F`.  During the learning, your system never did any emerge hard linking and so it never became part of the rbac rules.  Of course, when you switch from learning to enforcing, you're going to get the hard linking denied.


(In reply to SpanKY from comment #6)
> this sounds like the kernel is broken.  why not report it to the grsec devs ?

If my assumption above is correct, it's not broken; it's doing exactly what you asked it to do.


(In reply to Piotr Karbowski from comment #7)
> @Vapier: Spender is not willing to fix it on the grsecurity side (he did not
> agree that this is a grsec issue; rather, he thinks Portage does nasty things
> to merge files). Since movefile.py already has SELinux-specific if-logic, I
> think it wouldn't hurt to have a no-hardlink switch/feature.

Did you discuss with him what rbac rules are causing the denial?  If we come up with some ruleset that helps here, we'll document it.


(In reply to Piotr Karbowski from comment #10)
> *friendliest of friendly bumps*
> 
> Can we maybe get an additional FEATURE for emerge to not hardlink? This very
> issue strikes me in the back every time I forget to localpatch/epatch_user
> Portage on new deployments.

I feel uncomfortable with this.  I'd prefer to first be convinced that you can't fix this by getting the correct rbac rules.
Comment 12 Piotr Karbowski (RETIRED) gentoo-dev 2014-09-06 14:31:37 UTC
Now it's even more troublesome.

It seems that os.rename, or a simple `mv`, breaks my RBAC-enabled system, as per this report: https://forums.grsecurity.net/viewtopic.php?p=14402&sid=4acb6042709c239b28821a5dc71e5b15#p14402

Even with the no-hardlink patch it fails hard, although it used to work. movefile.py does not seem to have changed much, so it must be a bug in grsecurity. I am using grsecurity-3.0-3.15.10-201408140023.patch at the moment.

And Spender refused to take action on it, but apparently it is now not only a hardlink issue but also a rename/mv issue.

At this point I think there's no reason to modify Portage, as grsec is to blame here. :<
Comment 13 Anthony Basile gentoo-dev 2014-09-06 15:31:30 UTC
(In reply to Piotr Karbowski from comment #12)
> Now it's even more troublesome.
> 
> It seems that os.rename, or a simple `mv`, breaks my RBAC-enabled system, as
> per this report:
> https://forums.grsecurity.net/viewtopic.php?p=14402&sid=4acb6042709c239b28821a5dc71e5b15#p14402
> 
> Even with the no-hardlink patch it fails hard, although it used to work.
> movefile.py does not seem to have changed much, so it must be a bug in
> grsecurity. I am using grsecurity-3.0-3.15.10-201408140023.patch at the
> moment.

Did your rules change between when it did and when it didn't work?

> 
> And Spender refused to take action on it, but apparently it is now not only
> a hardlink issue but also a rename/mv issue.

There may be no action to take here.  The kernel may be doing exactly what you told it to do.

> 
> At this point I think there's no reason to modify Portage, as grsec is to
> blame here. :<

I still don't clearly see that.  All of the above misbehavior can be explained by bad rules.  So it's still not clear to me whether this is a valid bug or not.
Comment 14 Piotr Karbowski (RETIRED) gentoo-dev 2014-09-06 15:46:54 UTC
The rules didn't change, only the kernel. And there's nothing about disallowing symlinks. And of course I was running emerge in the admin role.

See the linked post: simply replacing the subject binary makes RBAC think that the running binary has a different path. It's a bug.
Comment 15 Anthony Basile gentoo-dev 2014-09-06 16:06:07 UTC
(In reply to Piotr Karbowski from comment #14)
> The rules didn't change, only the kernel. And there's nothing about
> disallowing symlinks. And of course I was running emerge in the admin role.
> 
> See the linked post: simply replacing the subject binary makes RBAC think
> that the running binary has a different path. It's a bug.

Thanks.  And you can revert to an earlier version where everything does work?
Comment 16 Piotr Karbowski (RETIRED) gentoo-dev 2014-09-06 16:13:07 UTC
Not at this very moment, but according to /boot on that box I was running 3.15.2, and 3.14.5 before that. I am pretty sure that rename wasn't causing problems on 3.14, but I need Spender to check on it, too.

As soon as I have any real info about either the downgrade or 'upstream', I will update this bug.
Comment 17 Anthony Basile gentoo-dev 2014-09-06 16:59:57 UTC
(In reply to Piotr Karbowski from comment #16)
> Not at this very moment, but according to /boot on that box I was running
> 3.15.2, and 3.14.5 before that. I am pretty sure that rename wasn't causing
> problems on 3.14, but I need Spender to check on it, too.
> 
> As soon as I have any real info about either the downgrade or 'upstream', I
> will update this bug.

The point here is to bracket any change so we know where it was introduced.
Comment 18 Piotr Karbowski (RETIRED) gentoo-dev 2014-09-06 17:12:52 UTC
Okay, tested 3.14.5: no 'mv' problem (the hardlink is another story).

3.15.2 and 3.15.10 suffer from this rename issue.
Comment 19 Anthony Basile gentoo-dev 2014-09-07 22:16:30 UTC
(In reply to Piotr Karbowski from comment #18)
> Okay, tested 3.14.5: no 'mv' problem (the hardlink is another story).
> 
> 3.15.2 and 3.15.10 suffer from this rename issue.

Has the hardlink issue always been an issue?
Comment 20 Piotr Karbowski (RETIRED) gentoo-dev 2014-09-08 14:13:06 UTC
The hardlink bug has been there for as long as I have been using RBAC, so years? I wasn't aware of it while my Portage tmpdir was on a different partition than rootfs.

The rename issue is an upstream bug - https://lkml.org/lkml/2014/9/7/6
Comment 21 Zac Medico gentoo-dev 2014-10-30 20:26:51 UTC
(In reply to Piotr Karbowski from comment #7)
> @Vapier: Spender is not willing to fix it on the grsecurity side (he did not
> agree that this is a grsec issue; rather, he thinks Portage does nasty things
> to merge files). Since movefile.py already has SELinux-specific if-logic, I
> think it wouldn't hurt to have a no-hardlink switch/feature.

Maybe you can whitelist portage, so that it can do whatever it needs to do? Restricting portage behavior based on some kernel security heuristics would be backwards. The heuristics should be configurable so that package managers like portage have the freedom to create hardlinks.
Comment 22 Piotr Karbowski (RETIRED) gentoo-dev 2014-10-30 20:44:22 UTC
A whitelist does not help, because I can disable RBAC, emerge an update, enable RBAC, and it still fails: as far as the kernel is concerned, /usr/sbin/ntpd is in fact /var/tmp/portage/net-misc/openntpd-3.9_p1-r4/image/usr/sbin/ntpd. Just look at this:

    sugoi [SSH] ~ # ls -l /proc/*/exe 2>/dev/null | grep tmp
    lrwxrwxrwx 1 root         view_proc 0 Oct 29 22:52 /proc/1637/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root         view_proc 0 Oct 29 22:52 /proc/1638/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root         view_proc 0 Oct 29 22:52 /proc/1639/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root         view_proc 0 Oct 29 22:52 /proc/1640/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root         view_proc 0 Oct 29 22:52 /proc/1641/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root         view_proc 0 Oct 29 22:52 /proc/1642/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 piotr        view_proc 0 Oct 29 22:53 /proc/1709/exe -> /var/portage/tmp/portage/net-irc/irssi-0.8.17/image/usr/bin/irssi (deleted)

I see this even on systems where RBAC was never enabled:

    localhost ~ # ls -l /proc/*/exe 2>/dev/null | grep tmp
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1229/exe -> /var/portage/tmp/portage/net-misc/openssh-6.6.1_p1-r4/image/usr/sbin/sshd (deleted)
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1309/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1310/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1311/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1312/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1313/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1314/exe -> /var/portage/tmp/portage/sys-apps/util-linux-2.25.2/image/sbin/agetty (deleted)
    lrwxrwxrwx 1 root    root    0 Oct 30 21:46 /proc/1384/exe -> /var/portage/tmp/portage/app-shells/bash-4.3_p30/image/bin/bash (deleted)

This whole mechanism seems a bit broken, and it seems to fail under a vanilla kernel, too. Recently there was an even greater problem; see this for reference - https://lkml.org/lkml/2014/9/6/120

I am out of ideas for how to handle it, but it seems that the way Portage does the merge, by creating a hardlink and then renaming onto it, confuses the kernel as a whole.
Comment 23 Zac Medico gentoo-dev 2014-10-30 20:59:59 UTC
(In reply to Piotr Karbowski from comment #22)
> A whitelist does not help, because I can disable RBAC, emerge an update,
> enable RBAC, and it still fails: as far as the kernel is concerned,
> /usr/sbin/ntpd is in fact
> /var/tmp/portage/net-misc/openntpd-3.9_p1-r4/image/usr/sbin/ntpd.

Does this issue persist through a reboot? If so, where are these paths recorded?
Comment 24 Piotr Karbowski (RETIRED) gentoo-dev 2014-10-30 21:05:15 UTC
Rebooting the whole system, or just killing the PID that originates from that path and starting a new copy, does the trick, too.

So the steps to reproduce are:
  1. Start any service (or any process that stays alive long enough to be a test case).
  2. Update/downgrade said service (emerge will not replace the file if the version is the same) while PORTAGE_TMPDIR points to the same filesystem that the service executable is on (presumably rootfs).
  3. Check the /proc/<pid>/exe symlink of the started service.
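Step 3 can be automated. The following is only a sketch of a checker, roughly equivalent to the `ls -l /proc/*/exe | grep tmp` one-liner used earlier in this bug; the function name and the default "/portage/" needle are made up for illustration.

```python
import os


def find_stale_exes(proc_root="/proc", needle="/portage/"):
    """List (pid, target) pairs whose exe symlink points at a deleted file
    under a path containing `needle`."""
    stale = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            target = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue  # kernel thread, exited process, or no permission
        if target.endswith(" (deleted)") and needle in target:
            stale.append((int(entry), target))
    return sorted(stale)
```

Run as root so that the exe links of all processes are readable. Every PID it reports is a process that would have to be restarted to get a sane path again.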
Comment 25 Zac Medico gentoo-dev 2014-10-30 21:14:49 UTC
(In reply to Piotr Karbowski from comment #24)
> So the steps to reproduce are:
>   1. Start any service (or any process that stays alive long enough to be a
> test case).
>   2. Update/downgrade said service (emerge will not replace the file if the
> version is the same) while PORTAGE_TMPDIR points to the same filesystem that
> the service executable is on (presumably rootfs).
>   3. Check the /proc/<pid>/exe symlink of the started service.

So, why does /proc/*/exe point to files in /var/tmp/portage even though the kernel is executing the old files that have been replaced on disk? It seems to me that the kernel's /proc/*/exe data is corrupt. Why not fix the corruption?
Comment 26 Piotr Karbowski (RETIRED) gentoo-dev 2014-10-30 21:27:51 UTC
From my standpoint, the kernel does indeed seem to be to blame; however, Portage is the only thing that has hit this bug so far, and some workarounds in Portage can mitigate the issue.

The mechanism that maintains the exe symlink is used by RBAC, which makes this issue a serious problem rather than a cosmetic one.

I am confused about what Portage is really doing, as I am unable to reproduce it without Portage.

sabre /bin # ./busybox sleep 999999 &
[1] 19399
sabre /bin # cp busybox /var/portage/tmp/foo      
‘busybox’ -> ‘/var/portage/tmp/foo’
sabre /bin # ln -f /var/portage/tmp/foo /bin/busybox 
‘/bin/busybox’ => ‘/var/portage/tmp/foo’
sabre /bin # mv /var/portage/tmp/foo /bin/busybox
mv: overwrite ‘/bin/busybox’? y
removed ‘/var/portage/tmp/foo’
sabre /bin # ls -l /proc/19399/exe
lrwxrwxrwx 1 root view_proc 0 Oct 30 22:25 /proc/19399/exe -> /bin/busybox (deleted)

I was pretty sure that's what portage's movefile does.

Can you help with reproducing it without Portage?
Comment 27 Zac Medico gentoo-dev 2014-10-30 22:10:34 UTC
(In reply to Piotr Karbowski from comment #26)
> I was pretty sure that's what portage's movefile does.

If the package does not install hardlinks, then movefile simply does the equivalent of this:

# mv /var/portage/tmp/foo /bin/busybox 

If the package installs hardlinks, then movefile does this:

# mv /var/portage/tmp/foo /bin/busybox
# ln /bin/busybox /bin/.busybox-hardlink._portage_merge_.XXXX
# mv /bin/.busybox-hardlink._portage_merge_.XXXX /bin/busybox-hardlink

If I do either of these, then I cannot reproduce the /proc/*/exe corruption with linux-3.14.21 (selinux not enabled).

However, if you are using selinux, then portage uses libselinux to call setfscreatecon(context) before it renames/moves the file. So, maybe your corruption is related to the special selinux context handling.
Comment 28 Piotr Karbowski (RETIRED) gentoo-dev 2014-10-31 08:08:53 UTC
SELinux is not used on any of my boxes.

It's 100% reproducible on all of my systems, too.

Previously I just patched movefile.py to set hardlink_candidates=None, which was indeed a workaround, but it worked. That was when movefile.py lived at /usr/lib64/portage/pym/portage/util/movefile.py.

It stopped working a while ago, pretty much when movefile.py was moved to the site-packages of python2.7 and python3. I can confirm that the hack is still in there; it just no longer works.

> If the package does not install hardlinks, then movefile simply does the
> equivalent of this:
> 
> # mv /var/portage/tmp/foo /bin/busybox 

That does not seem to be the case; it does try to use hardlinks when merging normal files, per movefile.py:

        # For atomic replacement, first create the link as a temp file
        # and them use os.rename() to replace the destination.
        if hardlink_candidates:
                head, tail = os.path.split(dest)
                hardlink_tmp = os.path.join(head, ".%s._portage_merge_.%s" % \
                        (tail, os.getpid()))
                try:
                        os.unlink(hardlink_tmp)
                except OSError as e:
                        if e.errno != errno.ENOENT:
                                writemsg(_("!!! Failed to remove hardlink temp file: %s\n") % \
                                        (hardlink_tmp,), noiselevel=-1)
                                writemsg("!!! %s\n" % (e,), noiselevel=-1)
                                return None
                        del e
                for hardlink_src in hardlink_candidates:
                        try:
                                os.link(hardlink_src, hardlink_tmp)
                        except OSError:
                                continue
                        else:
                                try:
                                        os.rename(hardlink_tmp, dest)

I have no idea why the hack no longer works, but I am pretty sure that's the code that triggers the bug.
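For anyone who wants to experiment with the pattern in isolation, here is a self-contained sketch of the same link-to-temp-name-then-rename technique. It is not Portage's movefile.py: the function name and the "_sketch_" temp suffix are hypothetical, and error handling is reduced to the minimum.

```python
import errno
import os


def replace_with_hardlink(src, dest):
    """Atomically make `dest` a hardlink to `src` via a temp name in dest's dir."""
    head, tail = os.path.split(dest)
    tmp = os.path.join(head, ".%s._sketch_.%d" % (tail, os.getpid()))

    # Remove a stale temp file from an earlier, interrupted run.
    try:
        os.unlink(tmp)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise

    os.link(src, tmp)          # second name for the source inode
    try:
        os.rename(tmp, dest)   # atomic replacement of the destination
    except OSError:
        os.unlink(tmp)
        raise
```

After the call, src and dest are two names for one inode, which is exactly the situation the description of this bug complains about.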
Comment 29 Piotr Karbowski (RETIRED) gentoo-dev 2014-10-31 08:14:37 UTC
And the most reproducible scenario:

    # mkdir /foo
    # chown portage:portage /foo
    # echo PORTAGE_TMPDIR=/foo >>/etc/make.conf
    # emerge -1 '=openssh-6.6.1_p1-r4'
    # /etc/init.d/sshd start
    # emerge -1 '=openssh-6.6_p1-r1'
    # ls -l /proc/`pidof sshd`/exe    
    lrwxrwxrwx 1 root view_proc 0 Oct 31 08:52 /proc/7893/exe -> /foo/portage/net-misc/openssh-6.6_p1-r1/image/usr/sbin/sshd (deleted)
Comment 30 Zac Medico gentoo-dev 2014-10-31 09:25:24 UTC
(In reply to Piotr Karbowski from comment #28)
> > If the package does not install hardlinks, then movefile simply does the
> > equivalent of this:
> > 
> > # mv /var/portage/tmp/foo /bin/busybox 
> 
> That does not seem to be the case; it does try to use hardlinks when merging
> normal files, per movefile.py:
> 
>         # For atomic replacement, first create the link as a temp file
>         # and them use os.rename() to replace the destination.
>         if hardlink_candidates:

The hardlink_candidates variable is an empty list if the package did not install any hardlinks to the current file. It's also empty when merging the first file of a group of files that the package hardlinked together.
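To illustrate the rule just described (not the actual Portage implementation, which I have not reproduced here), a sketch that groups files by device and inode and gives each file the list of earlier files it could be hardlinked to. The name plan_hardlinks is hypothetical.

```python
import os


def plan_hardlinks(paths):
    """Map each path to the earlier paths in `paths` that share its inode."""
    seen = {}   # (st_dev, st_ino) -> paths already visited
    plan = {}
    for path in paths:
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        # The first file of a hardlink group, or a file with no extra
        # links at all, gets an empty candidate list.
        plan[path] = list(seen.get(key, []))
        seen.setdefault(key, []).append(path)
    return plan
```

With this model, a package that ships no hardlinks yields an empty list for every file, so the link-then-rename branch quoted above would never be taken.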
Comment 31 Piotr Karbowski (RETIRED) gentoo-dev 2014-10-31 16:29:32 UTC
So I re-tested this on the latest vanilla kernel, 3.17.2, and the exe symlink is broken.

@Zac, could you please help with reproducing it outside Portage so I can push it upstream?
Comment 32 Zac Medico gentoo-dev 2014-11-01 01:16:44 UTC
What filesystem are you using?  The file system may be at fault. For example, here's a bug report involving /proc/self/exe corruption with NFSv4:

	https://bugzilla.redhat.com/show_bug.cgi?id=511278
Comment 33 Piotr Karbowski (RETIRED) gentoo-dev 2014-11-01 07:07:26 UTC
ext4 on all the deployments where I've tried to reproduce it.
Comment 34 Zac Medico gentoo-dev 2014-11-01 07:26:51 UTC
(In reply to Piotr Karbowski from comment #33)
> ext4 on all the deployments that I've tried to reproduce it.

Thanks, that's good to know, especially since I run exclusively btrfs here.

Another possibly contributing factor to consider is that Portage calls syncfs after it finishes merging a package. I'll write a small Python script that we may need in order to reproduce the issue.
Comment 35 Zac Medico gentoo-dev 2014-11-01 07:45:21 UTC
Created attachment 387938 [details]
syncfs python script which may help to reproduce the bug
Comment 36 Piotr Karbowski (RETIRED) gentoo-dev 2014-11-01 18:09:05 UTC
How is one supposed to use the syncfs script? If I run it on /usr/sbin/sshd after the exe symlink becomes broken, nothing changes.
Comment 37 Zac Medico gentoo-dev 2014-11-01 20:50:06 UTC
(In reply to Piotr Karbowski from comment #36)
> How is one supposed to use the syncfs script? If I run it on /usr/sbin/sshd
> after the exe symlink becomes broken, nothing changes.

The syncfs script is provided in order to help duplicate the Portage behavior that leads up to the exe symlink corruption. If you have a working test case that reliably triggers exe symlink corruption "outside of portage", then there's no need for the syncfs script. If you're still trying to duplicate whatever Portage does to trigger exe symlink corruption, then you should run the syncfs script just after moving files around, to see if that triggers the corruption.

The syncfs call will sync the entire filesystem that the argument file belongs to. For example, to sync the root filesystem, call 'syncfs.py /'. If /usr is separate, then you need to call 'syncfs.py /usr' after you update files in /usr/*bin.
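For readers without access to the attachment, this is roughly how syncfs(2) can be reached from Python. It is a guess at the general shape, not the contents of the attached syncfs.py, and it assumes a Linux C library that exports syncfs.

```python
import ctypes
import os


def sync_filesystem(path):
    """Flush the filesystem containing `path` by calling syncfs(2) via libc."""
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    fd = os.open(path, os.O_RDONLY)
    try:
        if libc.syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), path)
    finally:
        os.close(fd)
```

Usage mirrors the description above: sync_filesystem("/") for the root filesystem, sync_filesystem("/usr") if /usr is a separate one.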
Comment 38 Zac Medico gentoo-dev 2014-11-10 00:25:39 UTC
You can trace the syscalls that emerge is making by using strace as follows:

	strace -o strace.log -f emerge [args]

Extract the lines that mention files of interest like this:

	grep /bin/ strace.log > strace_bin.log
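If the grep output is still too noisy, the log can be narrowed down further to the rename/link/unlink family of syscalls, which are the ones suspected in this bug. A small sketch follows; the syscall list is my own selection and the function name is made up.

```python
import re

# Syscalls that create, replace, or remove directory entries.
SYSCALL = re.compile(
    r"\b(rename|renameat|renameat2|link|linkat|unlink|unlinkat)\("
)


def interesting_lines(lines, path_fragment):
    """Keep strace lines that mention `path_fragment` in one of the syscalls above."""
    return [
        line.rstrip("\n")
        for line in lines
        if path_fragment in line and SYSCALL.search(line)
    ]
```

For example, interesting_lines(open("strace.log"), "/usr/sbin/") keeps only the directory-entry operations that touch /usr/sbin.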
Comment 39 Piotr Karbowski (RETIRED) gentoo-dev 2014-12-27 17:12:12 UTC
I was able to reproduce it without Portage, with a simple Perl rename().

# for i in `pidof sshd`; do ls -l /proc/$i/exe; done
lrwxrwxrwx 1 root root 0 Dec 27 18:09 /proc/29047/exe -> /usr/sbin/sshd

# cp sshd /root/foo

# strace -f perl -e 'rename("/root/foo", "/usr/sbin/sshd")' 2>&1 | grep sshd
rename("/root/foo", "/usr/sbin/sshd")   = 0

# for i in `pidof sshd`; do ls -l /proc/$i/exe; done
lrwxrwxrwx 1 root root 0 Dec 27 18:09 /proc/29047/exe -> /root/sshd (deleted)

The bug does not kick in, however, if the source file is in the very same directory or the source path is within the target directory, like

rename("/usr/sbin/foo", "/usr/sbin/sshd")

and

rename("/usr/sbin/bar/sshd", "/usr/sbin/sshd")

Going to report it to lkml.
Comment 40 Anthony Basile gentoo-dev 2014-12-28 12:47:14 UTC
(In reply to Piotr Karbowski from comment #39)
> I was able to reproduce it without Portage, with a simple Perl rename().
> 
> # for i in `pidof sshd`; do ls -l /proc/$i/exe; done
> lrwxrwxrwx 1 root root 0 Dec 27 18:09 /proc/29047/exe -> /usr/sbin/sshd
> 
> # cp sshd /root/foo
> 
> # strace -f perl -e 'rename("/root/foo", "/usr/sbin/sshd")' 2>&1 | grep sshd
> rename("/root/foo", "/usr/sbin/sshd")   = 0
> 
> # for i in `pidof sshd`; do ls -l /proc/$i/exe; done
> lrwxrwxrwx 1 root root 0 Dec 27 18:09 /proc/29047/exe -> /root/sshd (deleted)
> 
> The bug does not kick in, however, if the source file is in the very same
> directory or the source path is within the target directory, like
> 
> rename("/usr/sbin/foo", "/usr/sbin/sshd")
> 
> and
> 
> rename("/usr/sbin/bar/sshd", "/usr/sbin/sshd")
> 
> Going to report it to lkml.

Thank you.  This is a clear statement of the problem.  It affects hardened, but it is a vanilla kernel issue where they changed how /proc/<pid>/exe works.  I can see advantages to this new way of doing it, so you may find lkml not very friendly.

Please put the URL of the email in this bug so we can follow it.
Comment 41 Piotr Karbowski (RETIRED) gentoo-dev 2014-12-28 14:54:48 UTC
Here's the lkml thread https://lkml.org/lkml/2014/12/27/85

I've also mailed Spender but I've got no response from him at all.
Comment 42 Piotr Karbowski (RETIRED) gentoo-dev 2015-02-15 18:10:49 UTC
Just an update.

After multiple bumps on lkml it seems that not a single developer gives a damn about this rename() issue, leaving RBAC permanently broken.

How should this issue be handled here? Grsecurity is supported by Gentoo, so maybe someone here has enough willpower to find/hack another way that RBAC can use to track the origin of running processes? Or maybe a hack in Portage to copy from $DESTDIR and rename() within the same dir as the target file?
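The last idea could look roughly like the sketch below: copy the file into the destination directory under a temporary name, then rename() it onto the target, so the rename never crosses directories. This is only an illustration with a made-up function name and temp prefix. Whether it actually avoids the exe symlink problem rests on the observation in comment 39 and has not been verified beyond that.

```python
import os
import shutil
import tempfile


def install_via_sibling_copy(src, dest):
    """Copy `src` next to `dest`, then rename the copy over `dest`."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp = tempfile.mkstemp(prefix=".sketch_merge_", dir=dest_dir)
    os.close(fd)
    try:
        shutil.copy2(src, tmp)   # new inode, same data, mode and timestamps
        os.rename(tmp, dest)     # same-directory, atomic replacement
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

The cost is a full data copy per file, which is what merging across filesystems already pays.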
Comment 43 Zac Medico gentoo-dev 2015-02-16 17:40:33 UTC
(In reply to Piotr Karbowski from comment #42)
> After multiple bumps on lkml it seems that not a single developer gives a
> damn about this rename() issue, leaving RBAC permanently broken.

Since grsecurity requires a patched kernel anyway, whoever maintains the grsecurity patches might be interested in developing a patch for this.

> How should this issue be handled here? Grsecurity is supported by Gentoo, so
> maybe someone here has enough willpower to find/hack another way that RBAC
> can use to track the origin of running processes?

I doubt that there is any other way to accomplish this, other than fixing the kernel.

> Or maybe a hack in portage to
> copy from $DESTDIR and rename() within the same dir as target file?

This is not very appealing. I think the onus is on the grsecurity upstream to fix this.
Comment 44 Zac Medico gentoo-dev 2015-02-16 18:27:46 UTC
As a workaround, you can mount a different filesystem on /var/tmp/portage. You can also bind mount a directory from your root filesystem to /var/tmp/portage, and that is enough to make rename fail, so that files are copied instead of renamed.
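The reason this workaround helps is that rename() cannot cross mount points: it fails with EXDEV, and the caller then has to fall back to copying. A simplified sketch of that fallback logic is below. It is not Portage's actual code; the function name and the ".sketch_tmp" suffix are hypothetical.

```python
import errno
import os
import shutil


def move_file(src, dest):
    """Rename if possible; on EXDEV, copy to a temp name and rename that."""
    try:
        os.rename(src, dest)
        return "renamed"
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
    # Different mount point (this includes bind mounts of the same
    # filesystem): copy the data, then replace the target atomically.
    tmp = dest + ".sketch_tmp"
    shutil.copy2(src, tmp)
    os.rename(tmp, dest)
    os.unlink(src)
    return "copied"
```

Note that comparing st_dev of source and destination would not be enough to predict which branch is taken, since a bind mount shares the device number but still makes rename() fail.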
Comment 45 Anthony Basile gentoo-dev 2015-02-16 19:19:59 UTC
(In reply to Piotr Karbowski from comment #42)
> Just an update.
> 
> After multiple bumps on lkml it seems that not a single developer gives a
> damn about this rename() issue, leaving RBAC permanently broken.
> 

While I get what's going on here and why the change is annoying, what I don't get is why you don't just work around the issue the way you said above.  It doesn't break RBAC, it just means that on upgrade with portage, RBAC rules will deny the newly renamed process resources.  So as you said, restart that process:

(In reply to Piotr Karbowski from comment #24)
> Rebooting the whole system, or just killing the PID that originates from
> that path and starting a new copy, does the trick, too.
> 

You should restart a service after an upgrade anyhow.  It not only addresses this issue but ensures that the new running image gets whatever changes were just made (especially important if it's a security upgrade).

Backporting the old renaming code is possible but I'm worried about what side effects it might have.  Upstream made this change for a reason which I don't know so it worries me undoing it.

spender is busy these days moving.  I'll speak to him about this and see what his take is.  If it turns out that there's some security issue then I'm not going to backport, otherwise I will.  Maybe we can come up with some scripts to help identify and restart services to automate this for you.

Since it's a vanilla issue, I'd like to hear what Mike Pagano's idea is here.

@Mike: comment 39 shows the change from 3.14 to 3.15 clearly.  It causes a problem when role-based rules are applied, since the process magically changes name when its executable is replaced on the filesystem.