Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 439584 - sys-apps/portage: sync disk after merging each package
Summary: sys-apps/portage: sync disk after merging each package
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core (show other bugs)
Hardware: All All
: High major (vote)
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS
Depends on:
Blocks: 431026
  Show dependency tree
 
Reported: 2012-10-25 04:37 UTC by Zac Medico
Modified: 2012-10-28 01:53 UTC (History)
1 user (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Zac Medico gentoo-dev 2012-10-25 04:37:24 UTC
The intention is to minimize data-loss in the event of a power failure. It's inefficient to sync for individual files, so it seems that the best compromise is to sync once after each package is merged.

A global sync may be somewhat invasive on a multi-user system, so it should probably only be done as a last resort, when no other method is available. On Linux, we can keep a list of mount points touched when merging the package, and then call syncfs(2) for each mount point. Python doesn't have a syncfs wrapper function, but this appears to work:

   ctypes.CDLL("libc.so.6").syncfs(fd)
Comment 1 Zac Medico gentoo-dev 2012-10-25 04:57:50 UTC
Also, we can add a FEATURES="merge-sync" setting, so that people can disable this behavior if they want.
Comment 2 Daniel Robbins 2012-10-25 05:30:39 UTC
Original bug was reported by Funtoo, and here is a link to my most recent comment on this:

http://bugs.funtoo.org/browse/FL-160#comment-11036

Basic point is - Portage has an IO bug, it breaks the rules about ensuring data is written to disk. The optimal solution is to call the operating system's "sync" command after every individual package merge. This is very optimal from an IO perspective and will fix the problem. There is no need for complicated work-arounds to avoid calling sync (ie. doing syncfs), as any user on a system can call "sync" 10 times a second if they want. If this were problematic, sync could only be run by the super-user. It hasn't been a problem, so a package manager that really NEEDS to sync should not be afraid to sync.
Comment 3 Zac Medico gentoo-dev 2012-10-25 08:39:44 UTC
This is fixed in git:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=e440548da95775019dc89bb612f8049240eed4e9
http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=978f3c6d514f0fcf9329d536cc43cf1230e23112
http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=639a084f3aebe09ee580f958b6b7ec7922e9e07d

(In reply to comment #2)
> There is no need
> for complicated work-arounds to avoid calling sync (ie. doing syncfs), as
> any user on a system can call "sync" 10 times a second if they want.

Shrug, maybe somebody with a multi-user system will appreciate it. Anyway, I think the code is pretty clean and maintainable.
Comment 4 Zac Medico gentoo-dev 2012-10-25 08:46:55 UTC
(In reply to comment #3)
> http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;
> h=639a084f3aebe09ee580f958b6b7ec7922e9e07d

I've done a fixup on that last commit, so that FEATURES=merge-sync can be disabled:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=10b6d0129d062f4d5d8a7611023c3f8cc43f1eab
Comment 5 Zac Medico gentoo-dev 2012-10-25 09:36:31 UTC
Avoid loading the libc library multiple times in child processes:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=fa83d9cbe167f5fb54052f544cad4ddd7a264843
Comment 6 Zac Medico gentoo-dev 2012-10-25 15:35:46 UTC
Here's a tweak for syncfs on parent directories of _unmerged_ files:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=6d94d08143c3f198a8ebfca1573cebefaa3e8bb0
Comment 7 Zac Medico gentoo-dev 2012-10-25 15:44:45 UTC
(In reply to comment #6)
> Here's a tweak for syncfs on parent directories of _unmerged_ files:
> 
> http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;
> h=6d94d08143c3f198a8ebfca1573cebefaa3e8bb0

I've pushed a fixup on that commit:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=3f2451c0901ee75e3e4e0f11628a0a3a981eff88
Comment 8 Daniel Robbins 2012-10-25 16:27:35 UTC
I appreciate you fixing this critical bug so quickly. I do have some concerns about a few things in the code:

A package manager correctly writing data to the disk is not a "feature", it's a requirement, so having a FEATURE to turn it off is very bad. It needs to be on all the time. As long as files are rename()d during merge, there needs to be a sync of some kind. This is due to how IO works on POSIX systems. It should only be possible to turn things off that are truly features.

The other thing is that when you fall back to "sync", is it going to find "/bin/sync" on Linux and nothing else? In other words, is the path scrubbed?

And some comments would be nice, to explain the importance of this code -- to ensure that it is preserved in future versions of Portage. Lack of comments combined with the FEATURE aspect of it sort of cloud the fact that this behavior is critical to ensure data integrity, particularly on non-ext* filesystems.

Other than that, looks good, and as always we appreciate this fix :)
Comment 9 Fabian Groffen gentoo-dev 2012-10-25 16:38:04 UTC
(In reply to comment #8)
> The other thing is that when you fall back to "sync", is it going to find
> "/bin/sync" on Linux and nothing else? In other words, is the path scrubbed?

I hope not, since coreutils provides this one (/usr-merge put aside), and it seems to even do something on Prefix-powered systems.

Following your reasoning that it is important, I wonder why not always just call `sync', instead of the complex code to try and do something with filedescriptors.  The invasiveness is not going to be less, IMO.

What bothers me though, is that there is no real transactional system in Portage (AFAICT) so, the sync is really artificial, the system will be left in an unknown state if the power is removed during (e)merging.  Probably VDB should be atomic and transactional, copying to the live filesystem is dangerous, but perhaps also done much more careful to ensure putting it life is just a rename on the same filesystem, potentially with replay-log prepared to, with minimal tools, finish the merge operation when interrupted.
Comment 10 Zac Medico gentoo-dev 2012-10-25 19:09:32 UTC
(In reply to comment #8)
> The other thing is that when you fall back to "sync", is it going to find
> "/bin/sync" on Linux and nothing else? In other words, is the path scrubbed?

Currently, it just uses a normal PATH lookup, using the calling environment.

(In reply to comment #9)
> Following your reasoning that it is important, I wonder why not always just
> call `sync', instead of the complex code to try and do something with
> filedescriptors.  The invasiveness is not going to be less, IMO.

Portage is used in lots of different environments, and I would prefer not to assume that a global 'sync' operation is desirable. For example, consider if portage is running in a chroot on a separate filesystem. There's no reason the to sync anything outside the chroot, and syncfs accomplishes this.
Comment 11 Zac Medico gentoo-dev 2012-10-25 20:01:59 UTC
Here's a fix for some corner cases with _unmerged_ files:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=690312722e111cf813b591dd4dccb4400f5f7f79
Comment 12 Zac Medico gentoo-dev 2012-10-25 20:08:11 UTC
(In reply to comment #8)
> A package manager correctly writing data to the disk is not a "feature",
> it's a requirement, so having a FEATURE to turn it off is very bad. It needs
> to be on all the time. As long as files are rename()d during merge, there
> needs to be a sync of some kind. This is due to how IO works on POSIX
> systems. It should only be possible to turn things off that are truly
> features.

As Fabian mentioned in comment #9, you really need a transactional system in order to cover all your bases. The reason that I added the FEATURES setting to turn this off is that there may be some users for which calling a global sync is somehow detrimental, so I wanted to give them a way out, especially considering that this behavior is new.
Comment 13 Daniel Robbins 2012-10-26 00:00:34 UTC
The current fix Zac implemented massively limits the damage, to the point where I consider the bug fixed. With the fix, the only time you are screwed is when Portage is in the middle of the "merge" phase (ie. actually copying files to disk from the image/ directory,) and you lose power. But guess what -- if you lose power during this phase, you're screwed anyway (ie. you're going to have an incomplete package written to disk). So the "danger window" has been closed from potentially hours (emerge -auDN world on XFS can result in HUNDREDS of corrupted package binaries) to a period of a second or so for every package merged, and if you hit the "danger window", the impact is limited to the package currently being installed (and, again, problems are unavoidable during this phase unless we implement filesystem snapshotting/rollback.) This is a HUGE improvement and reduces the likelihood of corrupt data by orders and orders of magnitude.

I agree that /bin/sync would have been sufficient. I prefer simpler solutions. With syncfs, you are optimizing a corner case (Portage in chroot/container) that doesn't even need to be optimized. Also, if you are in a chroot, it's likely that you are going to still be on the main Gentoo filesystem. And containers generally share a common filesystem to minimize disk usage. So in my opinion it's a premature and unnecessary optimization, but that's Zac's call. In theory, it *could* help, and apart from adding quite a few python function calls does not slow things down.

It would be nice if vardb was truly ACID, and it would be nice to have filesystem-level snapshots, but with this fix, I think we're in good shape and vardb and other advanced data integrity improvements can be worked on in the future to make things even better.
Comment 14 Zac Medico gentoo-dev 2012-10-26 01:26:06 UTC
(In reply to comment #13)
> The current fix Zac implemented massively limits the damage, to the point
> where I consider the bug fixed. With the fix, the only time you are screwed
> is when Portage is in the middle of the "merge" phase (ie. actually copying
> files to disk from the image/ directory,) and you lose power. But guess what
> -- if you lose power during this phase, you're screwed anyway (ie. you're
> going to have an incomplete package written to disk). So the "danger window"
> has been closed from potentially hours (emerge -auDN world on XFS can result
> in HUNDREDS of corrupted package binaries) to a period of a second or so for
> every package merged, and if you hit the "danger window", the impact is
> limited to the package currently being installed (and, again, problems are
> unavoidable during this phase unless we implement filesystem
> snapshotting/rollback.) This is a HUGE improvement and reduces the
> likelihood of corrupt data by orders and orders of magnitude.

Agreed.

> I agree that /bin/sync would have been sufficient. I prefer simpler
> solutions. With syncfs, you are optimizing a corner case (Portage in
> chroot/container) that doesn't even need to be optimized. Also, if you are
> in a chroot, it's likely that you are going to still be on the main Gentoo
> filesystem. And containers generally share a common filesystem to minimize
> disk usage. So in my opinion it's a premature and unnecessary optimization,
> but that's Zac's call. In theory, it *could* help, and apart from adding
> quite a few python function calls does not slow things down.

KISS is also my general preference, but in this case I thought it would be more fun (and possibly useful for some people) to implement the syncfs thing. I've also been wanting to use ctypes in portage, and this gave me something to try it with.

> It would be nice if vardb was truly ACID, and it would be nice to have
> filesystem-level snapshots, but with this fix, I think we're in good shape
> and vardb and other advanced data integrity improvements can be worked on in
> the future to make things even better.

A simple improvement would be to merge all the files into place under temporary file names, and then rename all of the files in one big batch of renames. It's not as good as filesystem-level snapshots, but it's better than what we have now, and it's portable.

The tricky thing about filesystem-level snapshots is that they may need to interact with the bootloader, in order to ensure that its filesystem-level snapshots will boot according to the user's bootloader configuration.
Comment 15 Zac Medico gentoo-dev 2012-10-26 20:13:10 UTC
This is fixed in 2.1.11.31 and 2.2.0_alpha142.
Comment 16 Duncan 2012-10-28 01:53:23 UTC
(In reply to comment #13)
> The current fix Zac implemented massively limits the damage, to the point
> where I consider the bug fixed.

Agreed.  Good work. =:^)

> I agree that /bin/sync would have been sufficient. I prefer simpler
> solutions. With syncfs, you are optimizing a corner case (Portage in
> chroot/container) that doesn't even need to be optimized. Also, if you are
> in a chroot, it's likely that you are going to still be on the main Gentoo
> filesystem. And containers generally share a common filesystem to minimize
> disk usage. So in my opinion it's a premature and unnecessary optimization,
> but that's Zac's call. In theory, it *could* help, and apart from adding
> quite a few python function calls does not slow things down.

(Just seeing this after checking the portage changelog before an update.  So it's a bit of commenting after the update, but FWIF & FTR, I have exactly that sort of chroot case in point...)

General amd64/no-multilib system, with a 32-bit chroot setup in general accordance with the gentoo/amd64 32-bit chroot guide to build for my 32-bit only netbook.  The chroot is on its own mountpoint, which contains all paths the 32-bit portage installs to as well as /var/db/portage, in ordered to better manage backups.

So syncfs there does usefully limit the sync scope, as Zac pointed out.  I wouldn't complain about a general sync either; better that than the existing lack of sync at all, but with syncfs already implemented, keep it. =:^)

As for making it a feature, in the light of the current big hubbub about ext4 corruption with obscure and known dangerous but not well warned about in the documentation mount options, if it's exposed as a FEATURES option, just be sure the docs are very clear about the consequences of NOT having it enabled.  After doing the update, I see merge-sync listed in the make.conf manpage under FEATURES, and it DOES mention that it helps prevent dataloss, but at least personally, I'd prefer the wording be a bit stronger.  Something about "strongly recommend", maybe (maybe even in CAPS), or a WARNING IN ALL CAPS line.  It's also worth noting that the the FEATURES var in in make.globals doesn't have any in-file-comment documentation at all, for merge-sync or any of the other FEATURES listed there.  make.conf.example has a comment basically referring to the manpage, which is fine IMO, as long as the manpage is sufficiently clear.  And as I said, in light of the current ext4 hubbub I'd stress the chance of data loss if disabled a bit more in the manpage, but it /does/ at least mention it.

Meanwhile, as always, thanks, Zac. =:^)