Bug 525718 - sys-apps/portage: emerge search actions should use an index to improve performance (like esearch)
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core - Interface (emerge)
Hardware: All All
Importance: Normal enhancement
Assignee: Portage team
Keywords: InVCS
Depends on:
Blocks: 240187 484436
Reported: 2014-10-18 02:58 UTC by Zac Medico
Modified: 2015-02-15 05:33 UTC

See Also:
Package list:
Runtime testing required: ---

emerge --search: use description index (emerge-search-use-description-index.patch,13.09 KB, patch)
2014-10-18 03:19 UTC, Zac Medico
emerge --search: use description index (emerge-search-use-description-index.patch,15.52 KB, patch)
2014-10-18 05:40 UTC, Zac Medico
emerge --search: use description index (emerge-search-use-description-index.patch,15.74 KB, patch)
2014-10-19 21:41 UTC, Zac Medico

Description Zac Medico gentoo-dev 2014-10-18 02:58:04 UTC
As we all know, emerge --search/--searchdesc actions are embarrassingly slow (from most users' perspectives, anyway), especially in comparison to external tools like eix and esearch.

Wouldn't it be nice if the performance of emerge's search functionality was more competitive with other offerings? Then, external search tools might not be seen as an absolute necessity.

In order to solve this problem, I propose that we add support for a package description index file. For example, I have patched egencache so that it will generate a suitable index formatted as a series of lines like this:

sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1: sandbox'd LD_PRELOAD hack

Using this format, the index file for the entire gentoo repo consumes approximately 1.5 MB. The whole file can be quickly searched as a stream (the whole file need not be in memory at once), yielding emerge --search/--searchdesc performance that is competitive with app-portage/esearch.
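A minimal sketch of how such a file could be searched as a stream, assuming the comma-separated line layout shown above. The function names (parse_line, search_index) and the regular expression used to separate the package name from the first version are illustrative assumptions, not code from the attached patch; matching here is done against the full category/package string for simplicity.

```python
import io
import re

# Assumed layout of one index line (comma-separated format from this comment):
#   <category>/<package>-<ver1>,<ver2>,...: <description>
# The greedy ".+" makes the split happen at the LAST "-" followed by a digit,
# so a revision suffix such as "-r2" stays inside the version.
_HEAD_RE = re.compile(r"^(?P<cp>.+)-(?P<first_ver>\d.*)$")


def parse_line(line):
    """Split one index line into (cp, versions, description)."""
    head, desc = line.rstrip("\n").split(": ", 1)
    first, *rest = head.split(",")
    match = _HEAD_RE.match(first)
    if match is None:
        raise ValueError(f"unparseable index line: {line!r}")
    return match.group("cp"), [match.group("first_ver"), *rest], desc


def search_index(stream, pattern, search_desc=False):
    """Yield matching entries while reading the stream one line at a time.

    Only the current line is held in memory, so the whole index never
    needs to be loaded at once.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for line in stream:
        if not line.strip():
            continue
        cp, versions, desc = parse_line(line)
        if regex.search(cp) or (search_desc and regex.search(desc)):
            yield cp, versions, desc


if __name__ == "__main__":
    sample = io.StringIO(
        "sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1: sandbox'd LD_PRELOAD hack\n"
    )
    for entry in search_index(sample, "preload", search_desc=True):
        print(entry)
```

Passing an open file object instead of the StringIO sample gives the same line-at-a-time behaviour on a real index file.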

The index can either be generated on the server side by egencache, or on the client side by a post emerge --sync hook. It makes sense to support both modes of operation, so that server side generation is purely optional.

As an alternative to my proposal, others may propose to use a binary database to hold all of the metadata (including everything currently distributed as small text files in the metadata/md5-cache directory). However, I would prefer to stick with the package description index that I have proposed for the following reasons:

1) The package names and descriptions are, by far, the most commonly searched items. So, for general use, emerge --search/--searchdesc actions should be sufficient for most users. More advanced queries are better suited to something like eix-db or sqlite, but the majority of users have negligible interest in performing such advanced queries, so it's hard to justify distributing a relatively large binary database inside the package repository (it puts extra load on the rsync servers). So, I think it's better to generate such databases on the client side, using $repo/metadata/md5-cache as a source when available.

2) A plain text index, like the one I have proposed, is small enough (1.5 MB for current gentoo repo) so that the additional load it puts on the rsync servers should be manageable. Also, for repositories distributed via a vcs such as git, changes to the plain text index will transfer efficiently (only differences are transferred).
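As a rough illustration of client-side generation from $repo/metadata/md5-cache, the sketch below walks a cache directory and emits lines in the comma-separated format proposed above. It assumes one cache file per package version under <category>/<package>-<version>, each containing a DESCRIPTION=... line. The helper names are hypothetical, versions are listed in plain lexicographic order rather than Portage's version order, and when descriptions differ between versions the last one read is kept; none of this is taken from the egencache patch.

```python
import os
import re
from collections import defaultdict

# Split "<category>/<package>-<version>" at the last "-" followed by a digit.
_CPV_RE = re.compile(r"^(?P<cp>.+)-(?P<ver>\d.*)$")


def read_description(cache_file):
    """Return the DESCRIPTION value from one cache file ("" if absent)."""
    with open(cache_file, encoding="utf-8") as f:
        for line in f:
            if line.startswith("DESCRIPTION="):
                return line[len("DESCRIPTION="):].rstrip("\n")
    return ""


def build_index_lines(md5_cache_dir):
    """Build one index line per category/package found in the cache dir."""
    versions = defaultdict(list)
    descriptions = {}
    for category in sorted(os.listdir(md5_cache_dir)):
        cat_dir = os.path.join(md5_cache_dir, category)
        if not os.path.isdir(cat_dir):
            continue
        for pf in sorted(os.listdir(cat_dir)):
            match = _CPV_RE.match(f"{category}/{pf}")
            if match is None:
                continue  # not a <package>-<version> entry
            cp = match.group("cp")
            versions[cp].append(match.group("ver"))
            # Assumption: keep the description of the last version read.
            descriptions[cp] = read_description(os.path.join(cat_dir, pf))
    return [
        f"{cp}-{','.join(versions[cp])}: {descriptions[cp]}"
        for cp in sorted(versions)
    ]
```

Writing the returned lines to a file, one per line, would produce an index that a streaming search can consume.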
Comment 1 Zac Medico gentoo-dev 2014-10-18 03:19:11 UTC
Created attachment 386860
emerge --search: use description index

This adds an egencache --update-pkg-desc-index action which generates a plain-text index of package names, versions, and descriptions. The index can then be used to optimize emerge --search / --searchdesc actions. If the package description index is missing from a particular repository, then all metadata for that repository is obtained using the normal portdbapi.aux_get method.

Searching of installed packages is optimized to take advantage of vardbapi._aux_cache, which is backed by vdb_metadata.pickle. See the IndexedVardb docstring for some more details.
Comment 2 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2014-10-18 04:08:39 UTC
What happens if the index is outdated? If Portage doesn't fall back to regular search, then it's a major regression over the current code.
Comment 3 Zac Medico gentoo-dev 2014-10-18 04:24:56 UTC
(In reply to Michał Górny from comment #2)
> What happens if the index is outdated?

It assumes that the list of packages in the index is correct, so it only searches those packages. If any of those ebuilds have been removed, it triggers the "emerge: search: aux_get() failed, skipping" message in

> If Portage doesn't fall back to regular search, then it's a major regression over
> the current code.

I suppose we could add a --search-index=<y|n> option. Would it be acceptable to you to have this option enabled by default? Then you could set EMERGE_DEFAULT_OPTS="--search-index=n" if you wanted to persistently disable it.
Comment 4 Zac Medico gentoo-dev 2014-10-18 05:40:06 UTC
Created attachment 386866
emerge --search: use description index

This updated patch adds --search-index < y | n >:

For users that would like to modify ebuilds in a repository without
running egencache afterwards, the new emerge --search-index < y | n >
option can be used to get non-indexed search. Alternatively, the user
could simply remove the stale index file, in order to disable the
search index for a particular repository.
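The per-repository choice described here (use the index unless --search-index is n or the index file is absent) might be sketched as follows. The function name choose_search_backend and the index_relpath parameter are hypothetical; the actual file name and code path are defined by the patch, not by this sketch.

```python
import os


def choose_search_backend(repo_path, index_relpath, search_index="y"):
    """Return "index" or "full" for one repository.

    "full" stands for the non-indexed search that reads metadata for
    every package; "index" stands for searching the description index.
    """
    if search_index != "y":
        # The user asked for non-indexed search (--search-index=n).
        return "full"
    if os.path.isfile(os.path.join(repo_path, index_relpath)):
        return "index"
    # No index file (for example, the user removed a stale one).
    return "full"
```

Because the check is made per repository, removing the index from one repository leaves indexed search in place for the others.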

I'll be maintaining this patch in the following branch:
Comment 5 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2014-10-18 14:54:36 UTC
Additionally, 1.5M extra update on each rsync run would be a noticeable extra load.
Comment 6 Brian Dolbec gentoo-dev 2014-10-18 15:49:06 UTC
We can make a news item; if a user doesn't want it, they can add it to the rsync exclude list. It's not that big a file!
Comment 7 Zac Medico gentoo-dev 2014-10-19 21:41:33 UTC
Created attachment 386988
emerge --search: use description index

This updated patch changes the index format to use spaces instead of commas, for readability. This example is given in man/portage.5:

sys-apps/sed 4.2 4.2.1 4.2.1-r1 4.2.2: Super-useful stream editor
sys-apps/usleep 0.1: A wrapper for usleep

Hopefully that's easier on the eyes (thanks to Michał Górny for the suggestion).
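With spaces as separators, a line can be taken apart without any version-detection logic. The following is a small sketch based only on the two example lines above; parse_index_line and format_index_line are illustrative names, not functions from the patch.

```python
def parse_index_line(line):
    """Split "<cp> <ver1> <ver2> ...: <description>" into its parts."""
    head, desc = line.rstrip("\n").split(": ", 1)
    cp, *versions = head.split(" ")
    return cp, versions, desc


def format_index_line(cp, versions, desc):
    """Inverse of parse_index_line: build one index line (no newline)."""
    return f"{' '.join([cp, *versions])}: {desc}"


if __name__ == "__main__":
    print(parse_index_line("sys-apps/sed 4.2 4.2.1 4.2.1-r1 4.2.2: Super-useful stream editor"))
```

The first space-separated field is the category/package name and the remaining fields before the colon are versions, which is what makes this format simpler to parse than the comma-separated one.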

Also, Michał has brought it to my attention that git will send the whole file instead of the delta, unless an expensive `git repack` operation is performed. Maybe it's possible to repack the user.git each time the index is generated? Currently, the master rsync mirror runs egencache every 30 minutes. If user.git syncs at the same interval, it would need to be repacked at the same interval.

Anyway, it would be nice to merge this patch, even if we don't have the resources now to generate the index for gentoo on the server side. We could follow up this patch later with a post emerge --sync hook for client-side index generation.
Comment 8 Zac Medico gentoo-dev 2014-11-05 22:18:15 UTC
In order to optimize IndexedVardb so that it doesn't have to do expensive listdir calls in /var/db/pkg/*, I plan to have vardbapi maintain a small log file that is updated with each merge and unmerge. The log will keep track of all merges/unmerges that have occurred since the most recent update of vdb_metadata.pickle (vdb_metadata.pickle is not updated for every single merge/unmerge, since that would lead to excessive re-writing of a large file). Then, IndexedVardb can use this log file together with vdb_metadata.pickle to get a complete view of /var/db/pkg, without the need to call listdir inside /var/db/pkg/*.
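The idea of combining a snapshot with a log of later changes can be sketched as below. The log record format ("merge <cpv>" / "unmerge <cpv>", one per line) and the function name installed_packages are assumptions made for illustration; the plan above does not specify the log layout.

```python
def installed_packages(snapshot, log_lines):
    """Replay a merge/unmerge log on top of a snapshot of installed packages.

    snapshot:  iterable of package identifiers known when the pickle
               was last written.
    log_lines: iterable of "merge <cpv>" or "unmerge <cpv>" records
               appended since then, oldest first (hypothetical format).
    """
    packages = set(snapshot)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action, cpv = line.split(None, 1)
        if action == "merge":
            packages.add(cpv)
        elif action == "unmerge":
            packages.discard(cpv)
        else:
            raise ValueError(f"unknown log action: {action!r}")
    return packages
```

Replaying the log in order yields the current package set without listing any directories, and the log can be truncated whenever the snapshot is rewritten.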
Comment 10 Brian Dolbec gentoo-dev 2015-02-15 05:33:16 UTC
Released in portage-2.2.16.