Bug 722270 - sys-apps/portage: add FEATURE to skip overwrite of identical files during merge
Summary: sys-apps/portage: add FEATURE to skip overwrite of identical files during merge
Status: CONFIRMED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core
Hardware: All All
Importance: Normal enhancement
Assignee: Portage team
URL:
Whiteboard:
Keywords: PullRequest
Depends on:
Blocks:
 
Reported: 2020-05-10 21:20 UTC by Zac Medico
Modified: 2023-03-08 04:59 UTC (History)
CC List: 6 users

See Also:
Package list:
Runtime testing required: ---


Description Zac Medico gentoo-dev 2020-05-10 21:20:49 UTC
We'll need a file comparison function. We can have a C extension function to compare 2 files via mmap, and a pure python fallback implementation.
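A minimal sketch of what the pure Python fallback could look like, assuming a chunked read with a size check up front. The name files_identical and the 64 KiB chunk size are hypothetical and untuned; this is not code from Portage.

```python
import os

CHUNK_SIZE = 1 << 16  # hypothetical buffer size, not tuned


def files_identical(path1, path2, chunk_size=CHUNK_SIZE):
    """Return True if both regular files have identical contents."""
    # Files of different sizes can never match, so skip reading them.
    if os.stat(path1).st_size != os.stat(path2).st_size:
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            b1 = f1.read(chunk_size)
            b2 = f2.read(chunk_size)
            if b1 != b2:
                return False
            if not b1:  # both files reached EOF without a mismatch
                return True
```

Comparing bytes objects chunk by chunk keeps memory use bounded regardless of file size.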
Comment 1 Zac Medico gentoo-dev 2020-05-10 21:50:43 UTC
I thought we might be able to use python's mmap object if mmap == mmap uses memcmp like str == str does. However, this test case prints False for two identical files:

#!/usr/bin/python

import mmap
import sys

path1 = sys.argv[1]
path2 = sys.argv[2]

with open(path1, "r+b") as f1, open(path2, "r+b") as f2:
    m1 = mmap.mmap(f1.fileno(), 0)
    m2 = mmap.mmap(f2.fileno(), 0)
    print(m1 == m2)
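The False result is consistent with mmap objects not defining their own equality, so == falls back to object identity. A possible workaround, sketched under that assumption, is to compare memoryview wrappers of the two maps, which compare by content. The helper name mmap_identical is hypothetical, and whether this approaches memcmp speed has not been measured here.

```python
import mmap
import os


def mmap_identical(path1, path2):
    """Compare two files by content through read-only memory maps."""
    size = os.stat(path1).st_size
    if size != os.stat(path2).st_size:
        return False
    if size == 0:
        return True  # mmap refuses to map empty files
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        with mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ) as m1, \
                mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ) as m2:
            # memoryview equality compares the underlying bytes; the views
            # are released before the maps are closed.
            with memoryview(m1) as v1, memoryview(m2) as v2:
                return v1 == v2
```

Opening with "rb" and ACCESS_READ also avoids needing write permission on either file.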
Comment 2 Michael Egger 2023-02-19 20:35:49 UTC
I would like to give this a shot as this sounds like a great first issue to tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution for file comparison in this case?
Comment 3 Arsen Arsenović gentoo-dev 2023-02-19 21:05:51 UTC
(In reply to Michael Egger from comment #2)
> I would like to give this a shot as this sounds like a great first issue to
> tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution
> for file comparison in this case?

Those functions only do a fast-and-loose comparison via stat(), not as precise as Zac specified. Sadly, there aren't many cases to optimize via stat here anyway, besides checking whether, say, the sizes differ and skipping the memcmp step in that case. All files are new (i.e. they have different (dev,ino) pairs, times, etc.), so only size and type are meaningful.

I am unsure this would speed anything up, personally; I expect that it makes a meaningful difference only for very large files, and that it pessimises small files, which are the majority of files portage installs; but, naturally, only testing can prove either case.

One should also take into account the caching Linux does implicitly. If packages are sufficiently small, writing all files unconditionally might fit into that cache, and reading might end up being the slower operation.

I'd like to see some testing, so if you are interested, please give it a shot.
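For reference, the Python documentation for filecmp.cmp describes a shallow parameter: with the default shallow=True, matching os.stat() signatures are enough for two files to be treated as equal, while shallow=False reads and compares the contents. A small sketch of the non-shallow call follows; the wrapper name same_contents is hypothetical.

```python
import filecmp


def same_contents(path1, path2):
    """Byte-for-byte comparison using the standard library."""
    # filecmp caches results keyed on stat signatures; clear the cache so a
    # file rewritten in place is not reported from a stale entry.
    filecmp.clear_cache()
    return filecmp.cmp(path1, path2, shallow=False)
```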
Comment 4 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-02-19 21:06:42 UTC
(In reply to Michael Egger from comment #2)
> I would like to give this a shot as this sounds like a great first issue to
> tackle. Is https://docs.python.org/3/library/filecmp.html a valid solution
> for file comparison in this case?

It looks like that's implemented in Python rather than C within cpython, but maybe it's optimised enough.

I think it's a good place to start investigating this, but it'll need benchmarking (time how long emerges with already-installed files & fresh ones take before/after). It'll also require checking on a CoW filesystem whether it actually does what's intended or not (note that we already use copy_file_range and friends where possible on merging).

This is not me saying it's a bad idea, just that we want to test whether or not it regresses the "new version" case significantly and whether it actually helps the "same or mostly the same" file(s) case.
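A rough micro-benchmark along these lines could look like the sketch below. It only times the "files are identical" case on temporary files, so it is no substitute for timing real emerges; the function name, file size, and repeat count are all hypothetical, and the page cache will heavily influence the numbers.

```python
import filecmp
import os
import shutil
import tempfile
import time


def time_merge_strategies(size=1 << 20, repeats=5):
    """Return (always_copy, compare_first) timings in seconds."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "image")      # stands in for the new file
        dst = os.path.join(tmp, "installed")  # stands in for the old file
        with open(src, "wb") as f:
            f.write(os.urandom(size))
        shutil.copyfile(src, dst)

        # Strategy 1: overwrite unconditionally.
        start = time.perf_counter()
        for _ in range(repeats):
            shutil.copyfile(src, dst)
        always_copy = time.perf_counter() - start

        # Strategy 2: compare contents first, copy only on mismatch.
        start = time.perf_counter()
        for _ in range(repeats):
            filecmp.clear_cache()
            if not filecmp.cmp(src, dst, shallow=False):
                shutil.copyfile(src, dst)
        compare_first = time.perf_counter() - start
    return always_copy, compare_first
```

Running it with several sizes (for example a few KiB versus hundreds of MiB) would give a first impression of where the crossover lies, if there is one.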

There are a few options to help mitigate any slowdown too. We might end up optimising this with some suggested heuristics like:
- different slot -> don't bother comparing
- if the perf hit is bad in the many changes case, only do the check on new revisions rather than new versions, as it's more likely the files will be the same
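The two heuristics above could be sketched as a single predicate. Everything here is hypothetical: the function names, the revisions_only switch, and the assumption that versions are plain strings with an optional Gentoo-style "-rN" revision suffix.

```python
import re

# Matches a trailing Gentoo-style revision suffix such as "-r1".
_REVISION_RE = re.compile(r"-r\d+$")


def strip_revision(version):
    """Drop a trailing -rN suffix, e.g. '1.2.3-r1' -> '1.2.3'."""
    return _REVISION_RE.sub("", version)


def should_compare(old_slot, new_slot, old_version, new_version,
                   revisions_only=False):
    """Decide whether comparing files before overwriting is worthwhile."""
    # Different slot: files are unlikely to overlap, so don't bother.
    if old_slot != new_slot:
        return False
    # Optionally compare only for revision bumps of the same version.
    if revisions_only:
        return strip_revision(old_version) == strip_revision(new_version)
    return True
```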
Comment 5 James Le Cuirot gentoo-dev 2023-02-19 21:40:04 UTC
I think Zac filed this at my request. I had wanted it over concerns about SSD wear, although I've now more or less been convinced that that's not a real concern on any modern half-decent SSD. I thought it would also speed things up for spinning disks, but that was based on my proof of concept that compared the mtime of the existing file and the md5sum in CONTENTS. Zac felt that we couldn't trust the mtimes. I thought this seemed overly paranoid. My implementation is at https://github.com/chewi/portage/commit/f7e3e3753ab3a808865c2e351898b3054fd638ec.
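As an illustration only, and not a reproduction of the linked commit, an mtime-plus-md5 check might look like the sketch below. It assumes CONTENTS entries for regular files have the form "obj <path> <md5> <mtime>", and it trusts mtimes, which is exactly the point disputed above. All function names are hypothetical.

```python
import hashlib
import os


def parse_obj_line(line):
    """Split an assumed 'obj <path> <md5> <mtime>' entry into its fields."""
    kind, rest = line.rstrip("\n").split(" ", 1)
    if kind != "obj":
        raise ValueError("not an obj entry")
    # Split from the right because the path itself may contain spaces.
    path, md5, mtime = rest.rsplit(" ", 2)
    return path, md5, int(mtime)


def md5_of(path, chunk_size=1 << 16):
    """Hash a file in chunks to keep memory use bounded."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def can_skip(installed_path, new_path, recorded_md5, recorded_mtime):
    """True if the installed file looks untouched and the new file matches."""
    # A changed mtime suggests the installed file was modified after merge.
    if int(os.stat(installed_path).st_mtime) != recorded_mtime:
        return False
    # The installed file is never read; only the new file is hashed.
    return md5_of(new_path) == recorded_md5
```

The appeal of this approach is that the existing file is only stat()ed, not read, at the cost of relying on the recorded mtime.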

I did look into optimal zero-copy comparisons, but I found the subject quite hard to understand, and I'm not sure it would work when your PORTAGE_TMPDIR is on tmpfs like mine is.
Comment 6 Austin S. Hemmelgarn 2023-02-20 13:12:26 UTC
Irrespective of any benefit for SSD wear-leveling (might still be beneficial for low-level flash storage though), this is beneficial for system snapshots and some backup systems. By not unconditionally doing full file replacements, snapshots and incremental backups end up much smaller, and that’s where I personally see a benefit here, even if it makes merging a bit slower.

Someone actually opened a thread on Reddit about that particular use-case for this type of thing just recently: https://www.reddit.com/r/Gentoo/comments/116jgsj/can_portage_not_clobber/

I could also envision it being beneficial for cases where copy_file_range is on a CoW filesystem (for example, split PORTAGE_TMPDIR setups). In those cases, copying a file is potentially significantly more resource-intensive than just reading the existing file for comparison due to the write amplification implicit in the filesystem itself.
Comment 7 Austin S. Hemmelgarn 2023-02-20 13:13:13 UTC
Correction: I meant 'when copy_file_range is not usable on a CoW filesystem' in that last paragraph.