I suggest using a database, such as MySQL or even SQLite, to store the /var/db/pkg information instead of regular files. I think it could improve Portage's speed a bit, because having tons of small files in a filesystem slows performance down.

Reproducible: Always
One of the advantages of using plain files instead of a "real database" is that your package manager won't be crippled if your database library is somehow broken (or not installed yet). It's possible to get the best of both worlds by using a database to index /var/db/pkg. In fact, Portage already does something like that by caching the /var/db/pkg content in /var/cache/edb/vdb_metadata.pickle. Since the mtime changes from bug #290428, we can optimize index/cache validation further by using the top-level and category-level directory timestamps to detect changes inside them.
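To illustrate the timestamp idea above, here is a minimal sketch, not Portage's actual code. It assumes that installing or removing a package updates the mtime of the affected category directory, and that adding or removing a category updates the top-level directory. The function names and the "." key for the top level are made up for this example.

```python
import os


def snapshot_dir_mtimes(vdb_root):
    """Record the mtime of the top-level directory and of each category directory."""
    stamps = {".": os.stat(vdb_root).st_mtime_ns}
    with os.scandir(vdb_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                stamps[entry.name] = entry.stat(follow_symlinks=False).st_mtime_ns
    return stamps


def stale_categories(vdb_root, saved):
    """Return the directory names whose timestamp differs from the saved snapshot.

    "." stands for the top-level directory. An empty result means the cache
    can be trusted without looking at any per-package file.
    """
    current = snapshot_dir_mtimes(vdb_root)
    names = set(current) | set(saved)
    return {name for name in names if current.get(name) != saved.get(name)}
```

With this approach only the categories that are reported as stale would need to be re-read; everything else could be served from the cache.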
OK, Zac, you're the expert after all. I don't know anything about Portage's internals; I am only a user who read that a filesystem with tons of small files could be a waste of performance. It was only an idea to add an SQLite backend (and possibly to enable a global MySQL database), for instance to allow storing all dependencies, installed packages and so on for a whole network on a single machine. But of course you can discard this idea if it seems too difficult or not a good idea after all.

What is really the overall, or even a mixture of files in /var/db/pkg and a SQL database?

Question 2: can both be used? For example, /var/db/pkg for some kind of bootstrap and the other for production machines?

Thanks again.
(In reply to comment #2)
> What is really the overall, or even a mixture of files in /var/db/pkg and a sql
> database?

I don't understand. Can you rephrase?

> Question 2: can both be used? for example /var/db/pkg for some kind of
> bootstrap and the other for production machines?

I think the best approach is to use both, with the existing /var/db/pkg format being the master, and another type of database being used to optimize read-only queries (while being fully redundant/disposable). As said, Portage already uses an approach like this, but it could be optimized some more.
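As a rough illustration of "files as master, disposable index for read-only queries", the sketch below builds an SQLite index from a /var/db/pkg-style tree. It is an assumption-laden example, not what Portage does: the table layout and function names are invented here, and only the SLOT file of each package directory is indexed.

```python
import os
import sqlite3


def rebuild_index(vdb_root, index_path):
    """Throw away any old index and regenerate it from the directory tree."""
    if os.path.exists(index_path):
        os.remove(index_path)  # the index is disposable; the files stay authoritative
    con = sqlite3.connect(index_path)
    with con:  # commits on success
        con.execute("CREATE TABLE pkg (category TEXT, pf TEXT, slot TEXT)")
        for category in sorted(os.listdir(vdb_root)):
            cat_dir = os.path.join(vdb_root, category)
            if not os.path.isdir(cat_dir):
                continue
            for pf in sorted(os.listdir(cat_dir)):
                pkg_dir = os.path.join(cat_dir, pf)
                if not os.path.isdir(pkg_dir):
                    continue
                slot = ""
                slot_file = os.path.join(pkg_dir, "SLOT")
                if os.path.isfile(slot_file):
                    with open(slot_file, encoding="utf-8") as f:
                        slot = f.read().strip()
                con.execute("INSERT INTO pkg VALUES (?, ?, ?)", (category, pf, slot))
    con.close()


def installed_in_category(index_path, category):
    """Read-only query answered from the index alone, without touching the tree."""
    con = sqlite3.connect(index_path)
    try:
        return con.execute(
            "SELECT pf, slot FROM pkg WHERE category = ? ORDER BY pf", (category,)
        ).fetchall()
    finally:
        con.close()
```

Because the index is only ever derived from the tree, deleting it loses nothing; the next call to `rebuild_index` recreates it.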
(In reply to Zac Medico from comment #3)
> (In reply to comment #2)
> > What is really the overall, or even a mixture of files in /var/db/pkg and a sql
> > database?
>
> I don't understand. Can you rephrase?
>
> > Question 2: can both be used? for example /var/db/pkg for some kind of
> > bootstrap and the other for production machines?
>
> I think the best approach is to use both, with the existing /var/db/pkg
> format being the master, and another type of database being used to optimize
> read-only queries (while being fully redundant/disposable). As said, portage
> already uses an approach like this, but it could be optimized some more.

Yes, Zac, I was referring to using both, but keeping /var/db/pkg only as a consistent backup as long as the database works. That would imply checking the database first: if it does not work, use the file info; if it does, use the database info (that would also apply to /usr/portage/*).

This way, current behavior is maintained (since, as you say, we need /usr/portage/* and /var/db/pkg/* anyway in case something bad happens to the database), but with the proposed solution the files can be completely ignored in, for example, the dependency check, which would gain a *lot* of performance in my opinion.
And also, since all files are updated anyway, Portage's database can be safely disabled or removed without rendering Portage useless.
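The "check the database first, fall back to the files" behavior described in the last two comments could look roughly like the sketch below. This is only an illustration under stated assumptions: the `installed` table, its `cpv` column and the function names are hypothetical, and a real implementation would also have to verify that the database is up to date.

```python
import os
import sqlite3


def list_from_files(vdb_root):
    """Authoritative but slower path: walk the category/package directories."""
    pkgs = []
    for category in sorted(os.listdir(vdb_root)):
        cat_dir = os.path.join(vdb_root, category)
        if os.path.isdir(cat_dir):
            pkgs.extend(f"{category}/{pf}" for pf in sorted(os.listdir(cat_dir)))
    return pkgs


def list_installed(vdb_root, index_path):
    """Return (packages, source), where source is "index" or "files"."""
    if os.path.isfile(index_path):
        try:
            con = sqlite3.connect(index_path)
            try:
                rows = con.execute("SELECT cpv FROM installed ORDER BY cpv").fetchall()
            finally:
                con.close()
            return [row[0] for row in rows], "index"
        except sqlite3.Error:
            pass  # missing table, corrupt file, etc.: ignore the database
    return list_from_files(vdb_root), "files"
```

Removing or corrupting the database file only changes which path answers the query, which matches the point that the database can be disabled without rendering Portage useless.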
(In reply to David Carlos Manuelda from comment #4)
> That will imply checking database first, if it does not work, then use files
> info, if it does, then use database info (that would apply also for
> /usr/portage/*

/usr/portage is a bit different, because it's more likely that people edit ebuilds in it.

> This way, current behavior is maintained (as anyways as you say, we need
> /usr/portage/* and /var/db/pkg/* in case something bad happens with
> database), but with proposed solution, it can be completely ignored in
> dependency check (for example) which will gain a *lot* of performance in my
> opinion

As Zac explained in comment 3, there already is a mechanism like this for /var/db/pkg (just not an SQL database). All that needs to happen is that the timestamps are used in more places to check whether the cache is up to date.
(In reply to Sebastian Luther (few) from comment #6)
> (In reply to David Carlos Manuelda from comment #4)
> > That will imply checking database first, if it does not work, then use files
> > info, if it does, then use database info (that would apply also for
> > /usr/portage/*
>
> /usr/portage is a bit different, because it's more likely that people edit
> ebuilds in it.

One could consider adding an SQLite version of the metadata/cache to the rsync tree. It's up for debate whether it is any more beneficial than the cache format, though (fewer fopens/mmaps, likely).
(In reply to Fabian Groffen from comment #7)
> (In reply to Sebastian Luther (few) from comment #6)
> > (In reply to David Carlos Manuelda from comment #4)
> > > That will imply checking database first, if it does not work, then use files
> > > info, if it does, then use database info (that would apply also for
> > > /usr/portage/*
> >
> > /usr/portage is a bit different, because it's more likely that people edit
> > ebuilds in it.
>
> One could consider adding an sqlite version of the metadata/cache to the
> rsync tree. It's up for debate if it is any beneficial over the cache
> format, though (less fopens/mmaps likely).

How does this help with "people are more likely to edit it"? There would need to be a way for the user to say "I promise to never edit an ebuild in this repo".

Portage has had a database-based metadata cache in the past, which you would update after each sync. IIRC it's now defunct. Maybe someone knows how much of an improvement it was.
(In reply to Sebastian Luther (few) from comment #8)
> (In reply to Fabian Groffen from comment #7)
> > (In reply to Sebastian Luther (few) from comment #6)
> > > /usr/portage is a bit different, because it's more likely that people edit
> > > ebuilds in it.
> >
> > One could consider adding an sqlite version of the metadata/cache to the
> > rsync tree. It's up for debate if it is any beneficial over the cache
> > format, though (less fopens/mmaps likely).
>
> How does this help with "people are more likely to edit it"? There would
> need to be a way for the user to say "I promise to never edit an ebuild in
> this repo".

The rsync tree is not to be edited, therefore it's shipped with cache. Providing that cache in a different format would not change anything about the rules for invalidating it.

> Portage has had a database based metadata cache in the past which you would
> update after each sync. IIRC it's now defunct. Maybe someone knows how much
> of an improvement it was.

Just to avoid confusion here: I agree that the on-disk format is the best we have for as long as we want to keep it all just plain text (we do). I just feel this bug is about the performance of Portage due to the IO it needs to do because of the many files that make up the cache, see comment #1. There is something to be said for reading a single file vs potentially thousands of them. But then, it doesn't even need to be a database; storing everything as one giant YAML file (or even pickle -- though very Python-specific and problematic for security) would probably do for systems that can make the tradeoff for memory.
(In reply to Fabian Groffen from comment #9)
> The rsync tree is not to be edited, therefore it's shipped with cache.

That's not how it is. Ebuilds are checked for modification and the metadata is regenerated if needed. From the metadata cache perspective, it's currently perfectly safe to edit the rsync tree.

> > Portage has had a database based metadata cache in the past which you would
> > update after each sync. IIRC it's now defunct. Maybe someone knows how much
> > of an improvement it was.
>
> Just to avoid confusions here: I agree that the on-disk format is the best
> we have for as long as we want to keep it all just plain text (we do). I
> just feel this bug is about the performance of Portage due to IO it needs to
> do because of the many files that build up the cache, see comment #1. There
> is something to say for reading a single file vs potentially thousands of
> them. But then, it doesn't even need to be database, storing all as a giant
> YAML file (or even pickle -- though very Python specific and security
> problematic) would probably do for systems that can do the tradeoff for
> memory.

Now I'm confused. Are we talking about the rsync tree or the vdb? My comment was about the rsync tree. For the vdb we already have that "one single file".

It may be that we're still checking too many files for modification in the vdb. Since there's the rule to update directory mtimes on package installation/removal in the vdb, it should only be required to check the mtime of /var/db/pkg and then rely completely on the cache. I don't know what the current status of this is in Portage.
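For readers unfamiliar with the modification check mentioned at the start of this comment, the sketch below shows the general idea of validating one cache entry against its ebuild. It is a simplification and not Portage's code: it assumes a KEY=VALUE cache entry that carries the ebuild's MD5 digest under a `_md5_` key, it ignores eclasses entirely, and the function names are invented for this example.

```python
import hashlib


def parse_cache_entry(text):
    """Parse the KEY=VALUE lines of one cache entry into a dict."""
    entry = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            entry[key] = value
    return entry


def cache_entry_is_current(ebuild_path, entry):
    """True if the digest stored in the entry matches the ebuild on disk.

    Assumption: the entry stores the digest under "_md5_". A mismatch means
    the ebuild was edited and its metadata would have to be regenerated.
    """
    with open(ebuild_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return entry.get("_md5_") == digest
```

A check like this is what makes local edits safe from the cache's point of view, at the cost of reading every ebuild whose entry is consulted.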
(In reply to Sebastian Luther (few) from comment #10)
> (In reply to Fabian Groffen from comment #9)
> > The rsync tree is not to be edited, therefore it's shipped with cache.
>
> That's not how it is. Ebuilds are checked for modification and the metadata is
> regenerated if needed. From the metadata cache perspective, it's currently
> perfectly safe to edit the rsync tree.

If the rsync tree were meant to be edited, we wouldn't waste time/cycles on generating a cache for it. Yes, you CAN edit the ebuilds, and Portage WILL do the right thing. Most of the time, though, the tree will be pristine, without any modifications. Therefore, it is beneficial to invest in speedups that rely on the tree not being modified.

> > Just to avoid confusions here: I agree that the on-disk format is the best
> > we have for as long as we want to keep it all just plain text (we do). I
> > just feel this bug is about the performance of Portage due to IO it needs to
> > do because of the many files that build up the cache, see comment #1. There
> > is something to say for reading a single file vs potentially thousands of
> > them. But then, it doesn't even need to be database, storing all as a giant
> > YAML file (or even pickle -- though very Python specific and security
> > problematic) would probably do for systems that can do the tradeoff for
> > memory.
>
> Now I'm confused. Are we talking about the rsync tree or the vdb? My comment
> was about the rsync tree. For the vdb we already have that "one single
> file". It may be that we're still checking too many files for modification
> in the vdb. Since there's the rule to update directory mtimes on package
> installation/removal in the vdb, it should only be required to check the
> mtime of /var/db/pkg and then rely completely on the cache. I don't know
> what the current status of this in portage.

I'm talking about the rsync tree. metadata/md5-cache contains many directories and files.
I think the OP referred to /var/db/pkg, but his IO problems are related to the rsync tree (and its cache). Therefore I suggested using a simple db or a single file (metadata/cache.sqlite3 or metadata/cache.yaml). Perhaps hopelessly off-topic for this bug. If you say /var/db/pkg contains a single file nowadays, there's not much to improve there any more. I don't have that in any of my installs, so I can't check that.
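To make the single-file suggestion above concrete, here is a hedged sketch that packs a directory of small per-package cache files into one SQLite file. The two-level category/package layout, the `entry` table and the function names are assumptions for the example; nothing like this is claimed to exist in Portage or in the rsync tree.

```python
import os
import sqlite3


def pack_cache(cache_dir, db_path):
    """Copy every category/package file under cache_dir into one SQLite file.

    Returns the number of entries stored.
    """
    count = 0
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute("DROP TABLE IF EXISTS entry")
        con.execute("CREATE TABLE entry (cpv TEXT PRIMARY KEY, body TEXT)")
        for category in sorted(os.listdir(cache_dir)):
            cat_dir = os.path.join(cache_dir, category)
            if not os.path.isdir(cat_dir):
                continue
            for pf in sorted(os.listdir(cat_dir)):
                path = os.path.join(cat_dir, pf)
                if not os.path.isfile(path):
                    continue
                with open(path, encoding="utf-8") as f:
                    con.execute(
                        "INSERT INTO entry VALUES (?, ?)", (f"{category}/{pf}", f.read())
                    )
                count += 1
    con.close()
    return count


def lookup(db_path, cpv):
    """Fetch one entry by its category/package-version key, or None if absent."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute("SELECT body FROM entry WHERE cpv = ?", (cpv,)).fetchone()
    finally:
        con.close()
    return row[0] if row else None
```

A reader of the packed file opens one file instead of one per package; whether that is actually faster than the existing cache format is the open question raised earlier in this thread.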
I feel like our users would not be happy with running a daemon for /var/db/pkg. However, Matt did recently suggest on the ML changing to another format: https://archives.gentoo.org/gentoo-portage-dev/message/891b99bc55239b475fd8d71659dc60ec