We'll need a file comparison function. We can have a C extension function to compare 2 files via mmap, and a pure python fallback implementation.
I thought we might be able to use python's mmap object if mmap == mmap uses memcmp like str == str does. However, this test case prints False for two identical files: #!/usr/bin/python import mmap import sys path1 = sys.argv[1] path2 = sys.argv[2] with open(path1, "r+b") as f1, open(path2, "r+b") as f2: m1 = mmap.mmap(f1.fileno(), 0) m2 = mmap.mmap(f2.fileno(), 0) print(m1 == m2)
I would like to give this a shot as this sounds like a great first issue to tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution for file comparison in this case?
(In reply to Michael Egger from comment #2) > I would like to give this a shot as this sounds like a great first issue to > tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution > for file comparison in this case? Those functions only do a fast-and-loose comparison via stat (), not as precise as Zac specified. Sadly, here, there's not many cases to optimize via stat anyway, besides checking if, say, sizes differ and skipping the memcmp step in that case, as all files are new (i.e. have different (dev,ino) pairs, times, etc), and so, only size and type are meaningful. I am unsure this would speed anything up, personally; I expect that it makes a meaningful difference only for very large files, and that it pessimises small files, which are the majority of files portage installs; but, naturally, only testing can prove either case. One should also take into account the caches Linux implicitly does. If packages are sufficiently small, writing all files unconditionally might fit into that cache, and reading might end up being the slower operation. I'd like to see some testing, so if you are interested, please give it as shot.
(In reply to Michael Egger from comment #2) > I would like to give this a shot as this sounds like a great first issue to > tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution > for file comparison in this case? It looks like that's implemented in Python rather than C within cpython, but maybe it's optimised enough. I think it's a good place to start investigating this, but it'll need benchmarking (time how long emerges with already-installed files & fresh ones take before/after). It'll also require checking on a CoW filesystem whether it actually does what's intended or not (note that we already use copy_file_range and friends where possible on merging). This is not me saying it's a bad idea, just that we want to test whether or not it regresses the "new version" case significantly and whether it actually helps the "same or mostly the same" file(s) case. There's a few options to help mitigation of any slowdown too. We might end up optimising this with some suggested heuristics like: - different slot -> don't bother comparing - if the perf hit is bad in the many changes case, only do the check on new revisions rather than new versions, as it's more likely the files will be the same
I think Zac filed this at my request. I had wanted it over concerns about SSD wear, although I've now more or less been convinced that that's not a real concern on any modern half-decent SSD. I thought it would also speed things up for spinning disks, but that was based on my proof of concept that compared the mtime of the existing file and the md5sum in CONTENTS. Zac felt that we couldn't trust the mtimes. I thought this seemed overly paranoid. My implementation is at https://github.com/chewi/portage/commit/f7e3e3753ab3a808865c2e351898b3054fd638ec. I did look into optimal zero-copy comparisons, but I found the subject quite hard to understand, and I'm not sure it would work when your PORTAGE_TMPDIR is on tmpfs like mine is.
Irrespective of any benefit for SSD wear-leveling (might still be beneficial for low-level flash storage though), this is beneficial for system snapshots and some backup systems. By not unconditionally doing full file replacements, snapshots and incremental backups end up much smaller, and that’s where I personally see a benefit here, even if it makes merging a bit slower. Someone actually opened a thread on Reddit about that particular use-case for this type of thing just recently: https://www.reddit.com/r/Gentoo/comments/116jgsj/can_portage_not_clobber/ I could also envision it being beneficial for cases where copy_file_range is on a CoW filesystem (for example, split PORTAGE_TMPDIR setups). In those cases, copying a file is potentially significantly more resource-intensive than just reading the existing file for comparison due to the write amplification implicit in the filesystem itself.
Correction, meant ‘when copy_file_range is not usable on a CoW filesystem' in that last paragraph.
The bug has been closed via the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=a87be47f7d3245050da43d7c3ab4760d47e9fac5 commit a87be47f7d3245050da43d7c3ab4760d47e9fac5 Author: gcarq <egger.m@protonmail.com> AuthorDate: 2023-02-21 00:04:26 +0000 Commit: Sam James <sam@gentoo.org> CommitDate: 2023-05-23 00:22:09 +0000 mergeme: Don't overwrite files if the content matches Uses filecmp.cmp(shallow=False) to compare file contents and doesn't replace them if they are equal. This results in less disk churn and helps to keep filesystem snapshots as small as possible. Closes: https://bugs.gentoo.org/722270 Signed-off-by: gcarq <egger.m@protonmail.com> Signed-off-by: Sam James <sam@gentoo.org> lib/portage/dbapi/vartree.py | 47 +++++++++++++++++++++++++------------------- 1 file changed, 27 insertions(+), 20 deletions(-) Additionally, it has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=7ca20833feaeaa8364b0ef4911cb81ccfa041de4 commit 7ca20833feaeaa8364b0ef4911cb81ccfa041de4 Author: Sam James <sam@gentoo.org> AuthorDate: 2023-05-23 00:25:49 +0000 Commit: Sam James <sam@gentoo.org> CommitDate: 2023-05-23 00:25:49 +0000 NEWS: update for merge improvements w/ CoW Bug: https://bugs.gentoo.org/722270 Signed-off-by: Sam James <sam@gentoo.org> NEWS | 9 +++++++++ 1 file changed, 9 insertions(+)
commit 5572b73744121f67c4e55040966bfe91a0e5fb6c Author: gcarq <egger.m@protonmail.com> Date: Tue Mar 21 17:19:13 2023 +0100 mergeme: Consider file mode and xattr for comparison Implements dblink._needs_move(...) to take care of file equality checks which consist of file mode, extended attributes and file content. Signed-off-by: gcarq <egger.m@protonmail.com> Signed-off-by: Sam James <sam@gentoo.org> commit 89703c688868c9eb8cd6115cb42ff92f0b9668b8 Author: gcarq <egger.m@protonmail.com> Date: Mon May 15 11:55:05 2023 +0200 mergeme: Update mtime if file is equal and introduce ignore-mtime Updates the mtime per default if the file is equal and adds a new feature flag called "ignore-mtime" to ignore mtime updates if the target file is equal. Signed-off-by: gcarq <egger.m@protonmail.com> Closes: https://github.com/gentoo/portage/pull/991 Signed-off-by: Sam James <sam@gentoo.org> commit 6da3e0fd0ddf5577771cf39b6ac329ee51051a22 Author: gcarq <egger.m@protonmail.com> Date: Mon Mar 27 15:47:52 2023 +0200 mergeme: Put xattr comparison logic behind xattr feature flag Signed-off-by: gcarq <egger.m@protonmail.com> Signed-off-by: Sam James <sam@gentoo.org> commit 35b5e4d71ebb5c7408dba7dc27cff0c22cad1562 Author: gcarq <egger.m@protonmail.com> Date: Mon Mar 27 15:31:44 2023 +0200 mergeme: Rely on mydmode instead of calling os.path.exists again Signed-off-by: gcarq <egger.m@protonmail.com> Signed-off-by: Sam James <sam@gentoo.org>
The bug has been closed via the following commit(s): https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=08be91eebdbff0de0e033efe30c633219a9859ca commit 08be91eebdbff0de0e033efe30c633219a9859ca Author: Sam James <sam@gentoo.org> AuthorDate: 2023-06-01 01:22:47 +0000 Commit: Sam James <sam@gentoo.org> CommitDate: 2023-06-01 01:23:18 +0000 sys-apps/portage: add 3.0.48 Closes: https://bugs.gentoo.org/722270 Closes: https://bugs.gentoo.org/879687 Closes: https://bugs.gentoo.org/898232 Closes: https://bugs.gentoo.org/898366 Closes: https://bugs.gentoo.org/905355 Closes: https://bugs.gentoo.org/905358 Closes: https://bugs.gentoo.org/905868 Closes: https://bugs.gentoo.org/906129 Closes: https://bugs.gentoo.org/906156 Signed-off-by: Sam James <sam@gentoo.org> sys-apps/portage/Manifest | 1 + sys-apps/portage/portage-3.0.48.ebuild | 296 +++++++++++++++++++++++++++++++++ 2 files changed, 297 insertions(+)
The bug has been referenced in the following commit(s): https://gitweb.gentoo.org/proj/portage.git/commit/?id=c9ed8136f9f0c56dc7b72d400eaa0acbd338778f commit c9ed8136f9f0c56dc7b72d400eaa0acbd338778f Author: Konstantin Tokarev <annulen@yandex.ru> AuthorDate: 2024-01-12 17:47:45 +0000 Commit: Sam James <sam@gentoo.org> CommitDate: 2024-01-16 05:16:08 +0000 vartree: _needs_move() should replace file if filecmp fails for whatever reason When trying to update I've got "OSError: [Errno 5] Input/output error" exception traceback from portage originating from _needs_move(). It turned out that 3 files in my root filesystem were corrupted, which caused filecmp to fail. This patch makes portage replace original file in this case. Bug: https://bugs.gentoo.org/722270 Signed-off-by: Konstantin Tokarev <annulen@yandex.ru> Closes: https://github.com/gentoo/portage/pull/1233 Signed-off-by: Sam James <sam@gentoo.org> lib/portage/dbapi/vartree.py | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)