Bug 194348 - emerge should merge files only if differs
Summary: emerge should merge files only if differs
Status: RESOLVED WONTFIX
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core
Hardware: All Linux
Importance: High enhancement
Assignee: Portage team
URL:
Whiteboard:
Keywords:
Duplicates: 24258
Depends on:
Blocks:
 
Reported: 2007-10-01 07:42 UTC by Marcello Magaldi
Modified: 2007-10-02 10:42 UTC (History)
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Description Marcello Magaldi 2007-10-01 07:42:48 UTC
When emerge updates a package, it also merges the files that have not changed.
If possible, it would be a good idea to compare the files (for example with diff) before merging and to overwrite only those that differ.
A good example is pidgin: the icons don't change between 2.2.0 and 2.2.1, so the already installed 2.2.0 icons could be kept since they are the same.

Reproducible: Always
Comment 1 Zac Medico gentoo-dev 2007-10-01 07:51:49 UTC
Now that we preserve time stamps during merge, we could conceivably skip merge if the mtime and size match. If the filesystem has built in checksums, it would be even better, but we'd need a somewhat portable api to access those.

Is there some specific reason why you're trying to optimize this, like limited write cycles on flash media? Absent built in checksums in the filesystem, it just doesn't seem worth the trouble.
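
The mtime-and-size check suggested in this comment could look roughly like the sketch below. It is only an illustration, not Portage code: the helper name `probably_unchanged` is hypothetical, and matching metadata is a heuristic that does not prove the contents are identical.

```python
import os
import stat


def probably_unchanged(src, dst):
    """Return True if dst is a regular file with the same size and mtime as src.

    Hypothetical helper, not part of Portage. Metadata only: the file
    contents are never read.
    """
    try:
        s = os.lstat(src)
        d = os.lstat(dst)
    except OSError:
        return False  # one of the files is missing, so the merge cannot be skipped
    if not (stat.S_ISREG(s.st_mode) and stat.S_ISREG(d.st_mode)):
        return False  # only compare regular files
    return s.st_size == d.st_size and s.st_mtime_ns == d.st_mtime_ns
```

A merge loop could call `probably_unchanged(new_file, installed_file)` and skip the write when it returns True, at the cost of trusting the metadata.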
Comment 2 Marcello Magaldi 2007-10-01 08:07:48 UTC
(In reply to comment #1)
> Now that we preserve time stamps during merge, we could conceivably skip merge
> if the mtime and size match. 
> If the filesystem has built in checksums, it would
> be even better, but we'd need a somewhat portable api to access those.

What I mean is that the icons of pidgin-2.2.0 and pidgin-2.2.1 should not differ.
That is, the time metadata of those files will differ, but the data will not. My suggestion is to compare the data, e.g. with diff or a similar program, and then
overwrite only if the files differ. The timestamp mechanism used in portage is probably not enough to prevent a file from being overwritten by an identical file; maybe it is good for preventing overwrites when you re-emerge the same version of a package.
 
> Is there some specific reason why you're trying to optimize this, like limited
> write cycles on flash media? Absent built in checksums in the filesystem, it
> just doesn't seem worth the trouble.
> 

A good reason for making this improvement is large updates such as openoffice or eclipse. Perhaps there are some files that don't change from one version to the next. IMO writing less data to disk can reduce the merge time.
Anyway, I don't know whether running diff would slow down the whole process so much that simply overwriting takes less time. If that is the case, forget my suggestion or choose a different, faster program than diff.
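
The proposal in this comment, comparing the data and overwriting only on a difference, can be sketched with Python's standard `filecmp` module instead of an external diff. The helper name `merge_if_different` is hypothetical, and this is not how Portage merges files; it only shows the idea being discussed.

```python
import filecmp
import os
import shutil


def merge_if_different(src, dst):
    """Copy src over dst only when dst is missing or its bytes differ.

    Returns True if a write happened. Hypothetical helper, not Portage code.
    """
    # shallow=False forces a byte-by-byte comparison (after a cheap size check).
    if os.path.isfile(dst) and filecmp.cmp(src, dst, shallow=False):
        return False  # identical contents: skip the write
    shutil.copy2(src, dst)  # copy2 also carries over the timestamps
    return True
```

Note that when the contents are identical, the installed file has to be read in full, which is the cost the following comments weigh against the saved write.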
Comment 3 SpanKY gentoo-dev 2007-10-01 08:21:48 UTC
not sure if this is doable ... while the old timestamps/hashes/filesizes are cached in the vdb, that does not mean the file on disk is exactly the same ... so you'd have to read in the file completely and re-compute the hashes (since it is entirely possible to have a file with modified data but the same timestamps/filesizes)

at this point, you've pretty much blown any possible performance gain and in fact, made the situation worse on systems with slow CPUs (where calculating hashes is quite CPU bound, not I/O bound)
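
To make the cost described here concrete, the sketch below re-computes a file's digest and compares it with a previously recorded value. The function names, the choice of SHA-256, and the idea that the recorded digest arrives as a hex string are assumptions for illustration; the comment only says that hashes are cached in the vdb.

```python
import hashlib


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Read the whole file in chunks and return its hex digest."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        # Every byte of the file is read here: this is the I/O and CPU
        # cost that the verification cannot avoid.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_recorded(path, recorded_hexdigest, algo="sha256"):
    """Return True if the file on disk still matches the recorded digest."""
    return file_digest(path, algo) == recorded_hexdigest
```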
Comment 4 SpanKY gentoo-dev 2007-10-01 08:22:12 UTC
*** Bug 24258 has been marked as a duplicate of this bug. ***
Comment 5 Marcello Magaldi 2007-10-01 08:27:59 UTC
(In reply to comment #3)
> not sure if this is doable ... while the old timestamps/hashes/filesizes are
> cached in the vdb, that does not mean the file on disk is exactly the same ...
> so you'd have to read in the file completely and re-compute the hashes (since
> it is entirely possible to have a file with modified data but the same
> timestamps/filesizes)
> 
> at this point, you've pretty much blown any possible performance gain and in
> fact, made the situation worse on systems with slow cpu's (where calculating
> hashes is quite cpu bound, not i/o bound)
> 


You're right, perhaps it isn't doable, or it introduces more disadvantages than advantages. I don't know much about how portage works; as far as I'm concerned, you could close this bug or keep it open only to discuss alternative solutions to this "problem".
Comment 6 SpanKY gentoo-dev 2007-10-01 08:35:14 UTC
it really isn't a portage issue, just look at it on the file scale of things

in order to skip replacing a file, you have to know that the two files are identical.  the only way to know if they are identical is to compare them.  but this requires doing I/O on the disk that you're trying to avoid in the first place.  caching and comparing file metadata is inadequate as file metadata can be exactly the same even while the file data differs.
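
The point about metadata can be demonstrated with a small, self-contained sketch. It creates two files of equal size, forces the same mtime on both, and then compares them twice with `filecmp.cmp`: once by metadata only (`shallow=True`) and once byte by byte (`shallow=False`). The function name, file names, and timestamp value are arbitrary choices for the demonstration.

```python
import filecmp
import os
import tempfile


def same_metadata_different_data(directory):
    """Return (metadata_equal, content_equal) for two crafted files."""
    a = os.path.join(directory, "a")
    b = os.path.join(directory, "b")
    with open(a, "wb") as f:
        f.write(b"old!")
    with open(b, "wb") as f:
        f.write(b"new!")  # same length, different bytes
    stamp = 1191227714  # arbitrary fixed time in seconds
    os.utime(a, (stamp, stamp))
    os.utime(b, (stamp, stamp))
    # shallow=True trusts file type, size and mtime without reading data.
    metadata_equal = filecmp.cmp(a, b, shallow=True)
    # shallow=False reads and compares the actual contents.
    content_equal = filecmp.cmp(a, b, shallow=False)
    return metadata_equal, content_equal


with tempfile.TemporaryDirectory() as d:
    print(same_metadata_different_data(d))
```

The metadata comparison reports the files as equal while the content comparison reports a difference, which is why a metadata-only check cannot guarantee that a skipped file was really unchanged.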
Comment 7 Marcello Magaldi 2007-10-01 08:38:57 UTC
(In reply to comment #6)
> it really isnt a portage issue, just look at it on the file scale of things
> 

yes it's true

> in order to skip replacing of a file, you have to know that two files are
> exact.  the only way to know if they are exact is if you compare them.  but
> this requires doing I/O on the disk that you're trying to avoid in the first
> place.  caching and comparing file metadata is inadequate as file metadata can
> be exactly the same even while the file data differs.
> 

I proposed comparing whole files using diff since IMO reading a file takes less time than writing it. Am I wrong?

Comment 8 SpanKY gentoo-dev 2007-10-01 09:10:46 UTC
that depends ... in many cases where the build directory (/var/tmp/portage/) lies on the same filesystem as the destination (say /usr/share/), there will be no performance hit really at all from overwriting the existing file in /usr/share/ with the one in /var/tmp/portage/ since you're talking about simply doing one unlink() followed by one rename() ... so reading in the file from /usr/share/ will actually be a lot worse

so the question comes down to, is the performance difference from reading a file and computing its hash less than that of simply writing out the file, and if so, is it such a significant gain to warrant the additional complexity (and there will be quite a bit) added to the package manager
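
The same-filesystem case described in this comment can be sketched as follows. The helper `install_file` and the temporary file suffix are hypothetical, and this is not Portage's merge code. The sketch uses `os.replace`, which on POSIX maps to rename() and replaces an existing destination in one step, so no separate unlink() is needed; only when the rename fails with EXDEV (source and destination on different filesystems) does any file data get copied.

```python
import errno
import os
import shutil


def install_file(src, dst):
    """Move src to dst; return "rename" or "copy" to show which path ran."""
    try:
        # Same filesystem: only directory entries change, no file data
        # is read or written.
        os.replace(src, dst)
        return "rename"
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
    # Different filesystems: the data really has to be copied.
    tmp = dst + ".tmp-merge"  # hypothetical temporary name
    shutil.copy2(src, tmp)
    os.replace(tmp, dst)  # swap the finished copy into place
    os.unlink(src)
    return "copy"
```

In the "rename" case, reading the installed file first to decide whether to skip it would add I/O rather than save it.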
Comment 9 Marius Mauch (RETIRED) gentoo-dev 2007-10-02 10:42:12 UTC
As already said, the performance gain would be questionable at best. And when the files are different, the additional read would definitely decrease performance, and I'm pretty sure that this additional cost will outweigh any potential benefit of not overwriting some files on 99% of all systems.