Having to work with multiple package rebuilds recently, I've noticed that Portage stalls with very high I/O during the package install/removal phase. At first I thought there was some issue with vdb efficiency, but after stracing it a bit I've noticed that Portage stat()s a lot of installed files -- binaries, AFAICS. This is very bad for two reasons:
1. It makes Portage very slow, especially when system load is high. I was able to get 1-2 minute package install delays because of this, and this makes multiple package rebuilds (e.g. when testing ebuilds) nearly impossible.
2. It can cause aggressive caching on some filesystems, practically throwing all other (i.e. more useful) data out of the cache. In other words, it makes other programs slow after each package install.
I suppose the code is somehow related to preserved-libs handling. However, what is the exact rationale for doing it? Do we really have to check all installed files every time?
The problem is that LinkageMapELF uses inode numbers to internally identify/compare files, and the preserve-libs code rebuilds the LinkageMapELF state every time it enters a critical section.
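To illustrate why inode-based identification costs one stat() per file, here is a minimal sketch. It is not Portage's actual code: the helper names (file_key, build_key_map, demo) and the file names are made up for illustration, and it only assumes that files are keyed by their (device, inode) pair.

```python
import os
import tempfile

def file_key(path):
    """Return a (device, inode) pair for the file that path resolves to."""
    st = os.stat(path)  # follows symlinks; one syscall per path
    return (st.st_dev, st.st_ino)

def build_key_map(paths):
    """Group paths by the file they resolve to (hypothetical helper)."""
    groups = {}
    for p in paths:
        groups.setdefault(file_key(p), []).append(p)
    return groups

def demo():
    # A library plus a SOVERSION-style symlink pointing at it.
    with tempfile.TemporaryDirectory() as d:
        lib = os.path.join(d, "libfoo.so.1.2.3")
        link = os.path.join(d, "libfoo.so.1")
        with open(lib, "w") as f:
            f.write("x")
        os.symlink("libfoo.so.1.2.3", link)
        return build_key_map([lib, link])
```

Both paths end up under the same key, which is what makes the symlink and the real library comparable, but rebuilding such a map from scratch means touching every installed file again.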
If there's a way to guarantee that a concurrent process hasn't invalidated all of the LinkageMapELF state, then the portion of state that hasn't been invalidated can be recycled.
I don't really understand why you need inode numbers there. What exactly do you need other than paths and their SO_NEEDED entries?
Maybe we really don't need inode numbers. As I recall, the code initially tracked files using canonicalized paths (with os.path.realpath). Of course, canonical paths have a similar problem to inode numbers, in that they represent a state that could be changed by a concurrent process.
I've tested removing the whole 'canonicalization' thingie and it seems to be required for Portage to identify SOVERSION symlinks -- without it the actual library is kept but the needed SOVERSION symlink is removed. So yeah, maybe keep realpath() but call it only as needed rather than on all files.
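A rough sketch of what "realpath() only as needed" could look like follows. This is an assumption about one possible approach, not how Portage does it: LazyCanonicalizer, same_file and demo are hypothetical names, and the cache can go stale if a concurrent process changes the filesystem, which is the same caveat raised above for canonical paths.

```python
import os
import tempfile

class LazyCanonicalizer:
    """Resolve paths with os.path.realpath() on first use and cache the result."""

    def __init__(self):
        self._cache = {}
        self.resolved = 0  # number of realpath() calls actually made

    def canonical(self, path):
        if path not in self._cache:
            self._cache[path] = os.path.realpath(path)
            self.resolved += 1
        return self._cache[path]

    def same_file(self, a, b):
        """True if both paths canonicalize to the same location."""
        return self.canonical(a) == self.canonical(b)

def demo():
    with tempfile.TemporaryDirectory() as d:
        lib = os.path.join(d, "libfoo.so.1.2.3")
        soname_link = os.path.join(d, "libfoo.so.1")
        other = os.path.join(d, "libbar.so.2")  # never queried, never resolved
        for p in (lib, other):
            with open(p, "w") as f:
                f.write("x")
        os.symlink("libfoo.so.1.2.3", soname_link)
        c = LazyCanonicalizer()
        match = c.same_file(soname_link, lib)
        return match, c.resolved
```

Only the two paths that are actually compared get resolved; the unrelated file is never touched, so the cost scales with the number of lookups rather than the number of installed files.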