Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 722270 - sys-apps/portage: add FEATURE to skip overwrite of identical files during merge
Summary: sys-apps/portage: add FEATURE to skip overwrite of identical files during merge
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core (show other bugs)
Hardware: All All
: Normal enhancement
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS, PullRequest
Depends on: 905355
Blocks: 835380
  Show dependency tree
 
Reported: 2020-05-10 21:20 UTC by Zac Medico
Modified: 2024-01-16 05:16 UTC (History)
7 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Zac Medico gentoo-dev 2020-05-10 21:20:49 UTC
We'll need a file comparison function. We can have a C extension function to compare 2 files via mmap, and a pure python fallback implementation.
Comment 1 Zac Medico gentoo-dev 2020-05-10 21:50:43 UTC
I thought we might be able to use python's mmap object if mmap == mmap uses memcmp like str == str does. However, this test case prints False for two identical files:

#!/usr/bin/python

import mmap
import sys

path1 = sys.argv[1]
path2 = sys.argv[2]

with open(path1, "r+b") as f1, open(path2, "r+b") as f2:
    m1 = mmap.mmap(f1.fileno(), 0)
    m2 = mmap.mmap(f2.fileno(), 0)
    print(m1 == m2)
Comment 2 Michael Egger 2023-02-19 20:35:49 UTC
I would like to give this a shot as this sounds like a great first issue to tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution for file comparison in this case?
Comment 3 Arsen Arsenović gentoo-dev 2023-02-19 21:05:51 UTC
(In reply to Michael Egger from comment #2)
> I would like to give this a shot as this sounds like a great first issue to
> tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution
> for file comparison in this case?

Those functions only do a fast-and-loose comparison via stat (), not as precise as Zac specified.  Sadly, here, there's not many cases to optimize via stat anyway, besides checking if, say, sizes differ and skipping the memcmp step in that case, as all files are new (i.e. have different (dev,ino) pairs, times, etc), and so, only size and type are meaningful.

I am unsure this would speed anything up, personally; I expect that it makes a meaningful difference only for very large files, and that it pessimises small files, which are the majority of files portage installs; but, naturally, only testing can prove either case.

One should also take into account the caches Linux implicitly does.  If packages are sufficiently small, writing all files unconditionally might fit into that cache, and reading might end up being the slower operation.

I'd like to see some testing, so if you are interested, please give it as shot.
Comment 4 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-02-19 21:06:42 UTC
(In reply to Michael Egger from comment #2)
> I would like to give this a shot as this sounds like a great first issue to
> tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution
> for file comparison in this case?

It looks like that's implemented in Python rather than C within cpython, but maybe it's optimised enough.

I think it's a good place to start investigating this, but it'll need benchmarking (time how long emerges with already-installed files & fresh ones take before/after). It'll also require checking on a CoW filesystem whether it actually does what's intended or not (note that we already use copy_file_range and friends where possible on merging).

This is not me saying it's a bad idea, just that we want to test whether or not it regresses the "new version" case significantly and whether it actually helps the "same or mostly the same" file(s) case.

There's a few options to help mitigation of any slowdown too. We might end up optimising this with some suggested heuristics like:
- different slot -> don't bother comparing
- if the perf hit is bad in the many changes case, only do the check on new revisions rather than new versions, as it's more likely the files will be the same
Comment 5 James Le Cuirot gentoo-dev 2023-02-19 21:40:04 UTC
I think Zac filed this at my request. I had wanted it over concerns about SSD wear, although I've now more or less been convinced that that's not a real concern on any modern half-decent SSD. I thought it would also speed things up for spinning disks, but that was based on my proof of concept that compared the mtime of the existing file and the md5sum in CONTENTS. Zac felt that we couldn't trust the mtimes. I thought this seemed overly paranoid. My implementation is at https://github.com/chewi/portage/commit/f7e3e3753ab3a808865c2e351898b3054fd638ec.

I did look into optimal zero-copy comparisons, but I found the subject quite hard to understand, and I'm not sure it would work when your PORTAGE_TMPDIR is on tmpfs like mine is.
Comment 6 Austin S. Hemmelgarn 2023-02-20 13:12:26 UTC
Irrespective of any benefit for SSD wear-leveling (might still be beneficial for low-level flash storage though), this is beneficial for system snapshots and some backup systems. By not unconditionally doing full file replacements, snapshots and incremental backups end up much smaller, and that’s where I personally see a benefit here, even if it makes merging a bit slower.

Someone actually opened a thread on Reddit about that particular use-case for this type of thing just recently: https://www.reddit.com/r/Gentoo/comments/116jgsj/can_portage_not_clobber/

I could also envision it being beneficial for cases where copy_file_range is on a CoW filesystem (for example, split PORTAGE_TMPDIR setups). In those cases, copying a file is potentially significantly more resource-intensive than just reading the existing file for comparison due to the write amplification implicit in the filesystem itself.
Comment 7 Austin S. Hemmelgarn 2023-02-20 13:13:13 UTC
Correction, meant ‘when copy_file_range is not usable on a CoW filesystem' in that last paragraph.
Comment 8 Larry the Git Cow gentoo-dev 2023-05-23 00:26:20 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=a87be47f7d3245050da43d7c3ab4760d47e9fac5

commit a87be47f7d3245050da43d7c3ab4760d47e9fac5
Author:     gcarq <egger.m@protonmail.com>
AuthorDate: 2023-02-21 00:04:26 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-05-23 00:22:09 +0000

    mergeme: Don't overwrite files if the content matches
    
    Uses filecmp.cmp(shallow=False) to compare file contents and
    doesn't replace them if they are equal. This results in less disk
    churn and helps to keep filesystem snapshots as small as possible.
    
    Closes: https://bugs.gentoo.org/722270
    Signed-off-by: gcarq <egger.m@protonmail.com>
    Signed-off-by: Sam James <sam@gentoo.org>

 lib/portage/dbapi/vartree.py | 47 +++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 20 deletions(-)

Additionally, it has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=7ca20833feaeaa8364b0ef4911cb81ccfa041de4

commit 7ca20833feaeaa8364b0ef4911cb81ccfa041de4
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2023-05-23 00:25:49 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-05-23 00:25:49 +0000

    NEWS: update for merge improvements w/ CoW
    
    Bug: https://bugs.gentoo.org/722270
    Signed-off-by: Sam James <sam@gentoo.org>

 NEWS | 9 +++++++++
 1 file changed, 9 insertions(+)
Comment 9 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-05-23 00:29:31 UTC
commit 5572b73744121f67c4e55040966bfe91a0e5fb6c
Author: gcarq <egger.m@protonmail.com>
Date:   Tue Mar 21 17:19:13 2023 +0100

    mergeme: Consider file mode and xattr for comparison

    Implements dblink._needs_move(...) to take care of file equality
    checks which consist of file mode, extended attributes and
    file content.

    Signed-off-by: gcarq <egger.m@protonmail.com>
    Signed-off-by: Sam James <sam@gentoo.org>

commit 89703c688868c9eb8cd6115cb42ff92f0b9668b8
Author: gcarq <egger.m@protonmail.com>
Date:   Mon May 15 11:55:05 2023 +0200

    mergeme: Update mtime if file is equal and introduce ignore-mtime

    Updates the mtime per default if the file is equal and adds a new
    feature flag called "ignore-mtime" to ignore mtime updates if the
    target file is equal.

    Signed-off-by: gcarq <egger.m@protonmail.com>
    Closes: https://github.com/gentoo/portage/pull/991
    Signed-off-by: Sam James <sam@gentoo.org>

commit 6da3e0fd0ddf5577771cf39b6ac329ee51051a22
Author: gcarq <egger.m@protonmail.com>
Date:   Mon Mar 27 15:47:52 2023 +0200

    mergeme: Put xattr comparison logic behind xattr feature flag

    Signed-off-by: gcarq <egger.m@protonmail.com>
    Signed-off-by: Sam James <sam@gentoo.org>

commit 35b5e4d71ebb5c7408dba7dc27cff0c22cad1562
Author: gcarq <egger.m@protonmail.com>
Date:   Mon Mar 27 15:31:44 2023 +0200

    mergeme: Rely on mydmode instead of calling os.path.exists again

    Signed-off-by: gcarq <egger.m@protonmail.com>
    Signed-off-by: Sam James <sam@gentoo.org>
Comment 10 Larry the Git Cow gentoo-dev 2023-06-01 01:23:47 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=08be91eebdbff0de0e033efe30c633219a9859ca

commit 08be91eebdbff0de0e033efe30c633219a9859ca
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2023-06-01 01:22:47 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-06-01 01:23:18 +0000

    sys-apps/portage: add 3.0.48
    
    Closes: https://bugs.gentoo.org/722270
    Closes: https://bugs.gentoo.org/879687
    Closes: https://bugs.gentoo.org/898232
    Closes: https://bugs.gentoo.org/898366
    Closes: https://bugs.gentoo.org/905355
    Closes: https://bugs.gentoo.org/905358
    Closes: https://bugs.gentoo.org/905868
    Closes: https://bugs.gentoo.org/906129
    Closes: https://bugs.gentoo.org/906156
    Signed-off-by: Sam James <sam@gentoo.org>

 sys-apps/portage/Manifest              |   1 +
 sys-apps/portage/portage-3.0.48.ebuild | 296 +++++++++++++++++++++++++++++++++
 2 files changed, 297 insertions(+)
Comment 11 Larry the Git Cow gentoo-dev 2024-01-16 05:16:11 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=c9ed8136f9f0c56dc7b72d400eaa0acbd338778f

commit c9ed8136f9f0c56dc7b72d400eaa0acbd338778f
Author:     Konstantin Tokarev <annulen@yandex.ru>
AuthorDate: 2024-01-12 17:47:45 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2024-01-16 05:16:08 +0000

    vartree: _needs_move() should replace file if filecmp fails for whatever reason
    
    When trying to update I've got "OSError: [Errno 5] Input/output error"
    exception traceback from portage originating from _needs_move(). It turned
    out that 3 files in my root filesystem were corrupted, which caused filecmp
    to fail. This patch makes portage replace original file in this case.
    
    Bug: https://bugs.gentoo.org/722270
    Signed-off-by: Konstantin Tokarev <annulen@yandex.ru>
    Closes: https://github.com/gentoo/portage/pull/1233
    Signed-off-by: Sam James <sam@gentoo.org>

 lib/portage/dbapi/vartree.py | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)