As we all know, emerge --search/--searchdesc actions are embarrassingly slow (from most users' perspectives, anyway), especially in comparison to external tools like eix and esearch. Wouldn't it be nice if the performance of emerge's search functionality were more competitive with other offerings? Then external search tools might not be seen as an absolute necessity.

To solve this problem, I propose that we add support for a package description index file. For example, I have patched egencache so that it generates a suitable index, formatted as a series of lines like this:

sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1: sandbox'd LD_PRELOAD hack

Using this format, the index file for the entire gentoo repo consumes approximately 1.5 MB. The whole file can be quickly searched as a stream (it need not be in memory all at once), yielding emerge --search/--searchdesc performance that is competitive with app-portage/esearch.

The index can be generated either on the server side by egencache, or on the client side by a post emerge --sync hook. It makes sense to support both modes of operation, so that server-side generation is purely optional.

As an alternative to my proposal, others may propose using a binary database to hold all of the metadata (including everything currently distributed as small text files in the metadata/md5-cache directory). However, I would prefer to stick with the package description index that I have proposed, for the following reasons:

1) Package names and descriptions are, by far, the most commonly searched items, so emerge --search/--searchdesc actions should be sufficient for most users. More advanced queries are better suited to something like eix-db or sqlite, but the majority of users have negligible interest in performing such queries, so it's hard to justify distributing a relatively large binary database inside the package repository (it puts extra load on the rsync servers). I think it's better to generate such databases on the client side, using $repo/metadata/md5-cache as a source when available.

2) A plain text index, like the one I have proposed, is small enough (1.5 MB for the current gentoo repo) that the additional load it puts on the rsync servers should be manageable. Also, for repositories distributed via a VCS such as git, changes to the plain text index will transfer efficiently (only differences are transferred).
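To illustrate the streaming search described above, here is a minimal sketch that reads the comma-separated index one line at a time. It is not the code from the patch: the helper names parse_line and search_index are hypothetical, and the version pattern is a simplified assumption rather than Portage's full version grammar.

```python
import io
import re

# Simplified version pattern (an assumption for this sketch; Portage's
# real version grammar is richer than this).
_VERSION = r"\d+(?:\.\d+)*[a-z]?(?:_(?:alpha|beta|pre|rc|p)\d*)*(?:-r\d+)?"
_CPV_RE = re.compile(r"^(?P<cp>.+)-(?P<ver>" + _VERSION + r")$")


def parse_line(line):
    """Split one index line into (cp, versions, description), or None."""
    head, sep, desc = line.rstrip("\n").partition(": ")
    if not sep:
        return None  # no ": " separator, so the line is malformed
    first, *rest = head.split(",")
    match = _CPV_RE.match(first)
    if match is None:
        return None
    # The first version is glued to the package name; the rest are bare.
    return match.group("cp"), [match.group("ver")] + rest, desc


def search_index(stream, pattern, search_desc=False):
    """Yield matching entries while reading the stream line by line."""
    regex = re.compile(pattern, re.IGNORECASE)
    for line in stream:
        entry = parse_line(line)
        if entry is None:
            continue
        cp, versions, desc = entry
        if regex.search(cp) or (search_desc and regex.search(desc)):
            yield entry


sample = io.StringIO(
    "sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1: sandbox'd LD_PRELOAD hack\n"
)
for cp, versions, desc in search_index(sample, "sandbox"):
    print(cp, versions, desc)
```

Because search_index is a generator over the file object, only one line is held in memory at a time, which is the property the proposal relies on.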
Created attachment 386860
emerge --search: use description index

This adds an egencache --update-pkg-desc-index action which generates a plain-text index of package names, versions, and descriptions. The index can then be used to optimize emerge --search / --searchdesc actions. If the package description index is missing from a particular repository, then all metadata for that repository is obtained using the normal portdbapi.aux_get method.

Searching of installed packages is optimized to take advantage of vardbapi._aux_cache, which is backed by vdb_metadata.pickle. See the IndexedVardb docstring for more details.
What happens if the index is outdated? If Portage doesn't fall back to regular search, then it's a major regression over the current code.
(In reply to Michał Górny from comment #2)
> What happens if the index is outdated?

The search assumes that the list of packages in the index is correct, so it only searches those packages. If any of those ebuilds have been removed, it triggers the "emerge: search: aux_get() failed, skipping" message in search.py.

> If Portage doesn't fallback to regular search, then it's a major regression over
> the current code.

I suppose we could add a --search-index=<y|n> option. Would it be acceptable to you to have this option enabled by default? Then you could set EMERGE_DEFAULT_OPTS="--search-index=n" if you wanted to persistently disable it.
Created attachment 386866
emerge --search: use description index

This updated patch adds --search-index < y | n >:

For users who would like to modify ebuilds in a repository without running egencache afterwards, the new emerge --search-index < y | n > option can be used to get non-indexed search. Alternatively, the user could simply remove the stale index file in order to disable the search index for a particular repository.

I'll be maintaining this patch in the following branch:

https://github.com/zmedico/portage/tree/bug_525718
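The per-repository decision described here (use the index unless --search-index=n was given or the index file is absent) could be sketched roughly as follows. This is only an illustration: use_index is a hypothetical helper, and the default index_relpath is an assumed location, since the thread does not say where the index file lives inside a repository.

```python
import os
import tempfile


def use_index(repo_path, search_index=True,
              index_relpath="metadata/pkg_desc_index"):
    """Return True if indexed search should be used for this repository.

    index_relpath is an assumed location for the index file.
    """
    if not search_index:
        # Corresponds to --search-index=n: always do a non-indexed search.
        return False
    # A missing (or deliberately removed) index disables indexed search
    # for this repository only.
    return os.path.isfile(os.path.join(repo_path, index_relpath))


with tempfile.TemporaryDirectory() as repo:
    print(use_index(repo))  # no index file yet, so fall back
    os.makedirs(os.path.join(repo, "metadata"))
    with open(os.path.join(repo, "metadata", "pkg_desc_index"), "w"):
        pass
    print(use_index(repo))
    print(use_index(repo, search_index=False))
```

Note that this check only covers a missing index; a stale index that still exists would be used as-is, which is the case the --search-index option is meant for.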
Additionally, an extra 1.5 MB to update on each rsync run would be a noticeable load.
We can publish a news item: if a user doesn't want the index, they can add it to the rsync exclude list. It's not that big a file!
Created attachment 386988
emerge --search: use description index

This updated patch changes the index format to use spaces instead of commas, for readability. This example is given in man/portage.5:

sys-apps/sed 4.2 4.2.1 4.2.1-r1 4.2.2: Super-useful stream editor
sys-apps/usleep 0.1: A wrapper for usleep

Hopefully that's easier on the eyes (thanks to Michał Górny for the suggestion).

Also, Michał has brought it to my attention that git will send the whole file instead of the delta, unless an expensive `git repack` operation is performed. Maybe it's possible to repack the user.git each time the index is generated? Currently, the master rsync mirror runs egencache every 30 minutes. If user.git syncs at the same interval, it would need to be repacked at the same interval.

Anyway, it would be nice to merge this patch, even if we don't have the resources now to generate the index for gentoo on the server side. We could follow up this patch later with a post emerge --sync hook for client-side index generation.
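As a rough illustration of the space-separated format, the sketch below formats and parses entries shaped like the man/portage.5 example. The function names format_entry and parse_entry are hypothetical and are not taken from the patch.

```python
def format_entry(cp, versions, desc):
    """Build one index line: '<cp> <ver> <ver> ...: <description>'."""
    return "{} {}: {}".format(cp, " ".join(versions), desc)


def parse_entry(line):
    """Split one index line back into (cp, versions, description)."""
    head, sep, desc = line.rstrip("\n").partition(": ")
    if not sep:
        raise ValueError("missing ': ' separator: {!r}".format(line))
    # The first space-separated token is category/package; the rest
    # are versions.
    cp, *versions = head.split(" ")
    return cp, versions, desc


line = format_entry(
    "sys-apps/sed",
    ["4.2", "4.2.1", "4.2.1-r1", "4.2.2"],
    "Super-useful stream editor",
)
print(line)
print(parse_entry(line))
```

In this sketch the package name is simply the first token, so no version-matching pattern is needed to separate it from the first version.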
In order to optimize IndexedVardb so that it doesn't have to do expensive listdir calls in /var/db/pkg/*, I plan to have vardbapi maintain a small log file that is updated with each merge and unmerge. The log will keep track of all merges/unmerges that have occurred since the most recent update of vdb_metadata.pickle (vdb_metadata.pickle is not updated for every single merge/unmerge, since that would lead to excessive re-writing of a large file). Then, IndexedVardb can use this log file together with vdb_metadata.pickle to get a complete view of /var/db/pkg, without the need to call listdir inside /var/db/pkg/*.
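The idea of combining the cached package list with a merge/unmerge log could look something like the sketch below. The log format here ("merge <cpv>" or "unmerge <cpv>", one per line) and the apply_log helper are assumptions made purely for illustration; the actual log layout is not specified in this comment.

```python
def apply_log(cached_cpvs, log_lines):
    """Replay merge/unmerge records on top of the cached package set.

    cached_cpvs: packages known as of the last vdb_metadata.pickle update.
    log_lines: records written since then, oldest first, in a
    hypothetical "<action> <cpv>" format.
    """
    installed = set(cached_cpvs)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed records
        action, cpv = parts
        if action == "merge":
            installed.add(cpv)
        elif action == "unmerge":
            installed.discard(cpv)
    return installed


cached = {"sys-apps/sed-4.2.1", "sys-apps/usleep-0.1"}
log = ["merge sys-apps/sed-4.2.2", "unmerge sys-apps/sed-4.2.1"]
print(sorted(apply_log(cached, log)))
```

Replaying the log in order yields the current set of installed packages without listing any directory under /var/db/pkg.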
This is in the master branch now:

https://github.com/gentoo/portage/commit/d22ba91d4adc177551be8b4be95a6fc1f061fc2e
https://github.com/gentoo/portage/commit/7bc992a02c89c6c3f76b09bfc978c104fb1c2b9a
https://github.com/gentoo/portage/commit/96c2e57685659211c1e33281e08dfaff04d05b58
https://github.com/gentoo/portage/commit/5424b91133b3b155b0e6ddc08fb46ba301d971f8
https://github.com/gentoo/portage/commit/55c8c8bc7a781e3f71ce92922eea64ad4cafce3c
https://github.com/gentoo/portage/commit/d800d224ab38c0f524d3fe858ebe201cbfa903c1
https://github.com/gentoo/portage/commit/646b671d4afb92e0bb81664568544e01e8456dc2
Released in portage-2.2.16.