525718 – sys-apps/portage: emerge search actions should use an index to improve performance (like esearch)


Status:	RESOLVED FIXED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Core - Interface (emerge)
Hardware:	All All

Importance:	Normal enhancement
Assignee:	Portage team

URL:	http://thread.gmane.org/gmane.linux.g...
Whiteboard:
Keywords:	InVCS

Depends on:
Blocks:	240187 484436

Reported:	2014-10-18 02:58 UTC by Zac Medico
Modified:	2015-02-15 05:33 UTC
CC List:	1 user

See Also:	412471
Package list:
Runtime testing required:	---

Attachments
emerge --search: use description index (emerge-search-use-description-index.patch, 13.09 KB, patch) 2014-10-18 03:19 UTC, Zac Medico
emerge --search: use description index (emerge-search-use-description-index.patch, 15.52 KB, patch) 2014-10-18 05:40 UTC, Zac Medico
emerge --search: use description index (emerge-search-use-description-index.patch, 15.74 KB, patch) 2014-10-19 21:41 UTC, Zac Medico


Description Zac Medico gentoo-dev

2014-10-18 02:58:04 UTC

As we all know, emerge --search/--searchdesc actions are embarrassingly slow (from most users' perspectives, anyway), especially in comparison to external tools like eix and esearch.

Wouldn't it be nice if the performance of emerge's search functionality was more competitive with other offerings? Then, external search tools might not be seen as an absolute necessity.

In order to solve this problem, I propose that we add support for a package description index file. For example, I have patched egencache so that it will generate a suitable index formatted as a series of lines like this:

sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1: sandbox'd LD_PRELOAD hack

Using this format, the index file for the entire gentoo repo consumes approximately 1.5 MB. The whole file can be quickly searched as a stream (the whole file need not be in memory at once), yielding emerge --search/--searchdesc performance that is competitive with app-portage/esearch.
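The streaming search described above can be illustrated with a minimal sketch. It is not the code from the attached patch: the function name search_index and the second sample line are hypothetical, and for simplicity the pattern is matched against the whole text before the first ": " (package name plus versions) rather than against the package name alone.

```python
import io
import re
from typing import Iterable, Iterator, Tuple


def search_index(lines: Iterable[str], pattern: str,
                 search_desc: bool = False) -> Iterator[Tuple[str, str]]:
    """Yield (entry, description) pairs from index lines matching pattern.

    `entry` is everything before the first ": " (package name plus its
    comma-separated versions). Lines are consumed one at a time, so the
    index never has to be held in memory as a whole.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for line in lines:
        entry, sep, desc = line.rstrip("\n").partition(": ")
        if not sep:
            continue  # not an index line; skip it
        if regex.search(entry) or (search_desc and regex.search(desc)):
            yield entry, desc


# Hypothetical in-memory stand-in for an index file opened with open().
sample = io.StringIO(
    "sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1: sandbox'd LD_PRELOAD hack\n"
    "sys-apps/sed-4.2,4.2.1: Super-useful stream editor\n"
)
name_hits = list(search_index(sample, "sandbox"))
sample.seek(0)
desc_hits = list(search_index(sample, "stream", search_desc=True))
```

Because the function is a generator over an iterable of lines, it works the same on an open file object as on the in-memory sample used here.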

The index can either be generated on the server side by egencache, or on the client side by a post emerge --sync hook. It makes sense to support both modes of operation, so that server side generation is purely optional.

As an alternative to my proposal, others may propose to use a binary database to hold all of the metadata (including everything currently distributed as small text files in the metadata/md5-cache directory). However, I would prefer to stick with the package description index that I have proposed for the following reasons:

1) The package names and descriptions are, by far, the most commonly searched items. So, for general use, emerge --search/--searchdesc actions should be sufficient for most users. More advanced queries are better suited to something like eix-db or sqlite, but the majority of users have negligible interest in performing such advanced queries, so it's hard to justify distributing a relatively large binary database inside the package repository (it puts extra load on the rsync servers). So, I think it's better to generate such databases on the client side, using $repo/metadata/md5-cache as a source when available.

2) A plain text index, like the one I have proposed, is small enough (1.5 MB for current gentoo repo) so that the additional load it puts on the rsync servers should be manageable. Also, for repositories distributed via a vcs such as git, changes to the plain text index will transfer efficiently (only differences are transferred).

Comment 1 Zac Medico gentoo-dev

2014-10-18 03:19:11 UTC

Created attachment 386860
emerge --search: use description index

This adds an egencache --update-pkg-desc-index action which generates a plain-text index of package names, versions, and descriptions. The index can then be used to optimize emerge --search / --searchdesc actions. If the package description index is missing from a particular repository, then all metadata for that repository is obtained using the normal portdbapi.aux_get method.

Searching of installed packages is optimized to take advantage of vardbapi._aux_cache, which is backed by vdb_metadata.pickle. See the IndexedVardb docstring for more details.

Comment 2 Michał Górny archtester

2014-10-18 04:08:39 UTC

What happens if the index is outdated? If Portage doesn't fall back to regular search, then it's a major regression over the current code.

Comment 3 Zac Medico gentoo-dev

2014-10-18 04:24:56 UTC

(In reply to Michał Górny from comment #2)
> What happens if the index is outdated?

It assumes that the list of packages in the index is correct, so it only searches those packages. If any of those ebuilds have been removed, it triggers the "emerge: search: aux_get() failed, skipping" message in search.py.

> If Portage doesn't fall back to regular search, then it's a major regression over
> the current code.

I suppose we could add a --search-index=<y|n> option. Would it be acceptable to you to have this option enabled by default? Then you could set EMERGE_DEFAULT_OPTS="--search-index=n" if you wanted to persistently disable it.

Comment 4 Zac Medico gentoo-dev

2014-10-18 05:40:06 UTC

Created attachment 386866
emerge --search: use description index

This updated patch adds --search-index < y | n >:

For users that would like to modify ebuilds in a repository without
running egencache afterwards, the new emerge --search-index < y | n >
option can be used to get non-indexed search. Alternatively, the user
could simply remove the stale index file, in order to disable the
search index for a particular repository.
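
The resulting per-repository choice between indexed and non-indexed search could look roughly like the sketch below. This is only an illustration of the behaviour described in this bug, not the code in the patch: the function name choose_search_mode, its return values, and the file name used in the example are all hypothetical.

```python
import os
import tempfile


def choose_search_mode(index_path: str, search_index: bool = True) -> str:
    """Return "index" or "aux_get" for one repository.

    search_index mirrors --search-index=<y|n>; index_path is wherever the
    repository's description index would live (hypothetical parameter).
    """
    if search_index and os.path.isfile(index_path):
        return "index"
    # Option disabled, or index file missing/removed: regular metadata lookup.
    return "aux_get"


with tempfile.TemporaryDirectory() as repo:
    index_path = os.path.join(repo, "index-file")  # hypothetical file name
    missing = choose_search_mode(index_path)        # no index file yet
    open(index_path, "w").close()                   # create an empty index
    present = choose_search_mode(index_path)
    disabled = choose_search_mode(index_path, search_index=False)
```

Removing the index file and passing --search-index=n both end up on the same non-indexed path, which matches the two workarounds mentioned above.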

I'll be maintaining this patch in the following branch:

	https://github.com/zmedico/portage/tree/bug_525718

Comment 5 Michał Górny archtester

2014-10-18 14:54:36 UTC

Additionally, 1.5M extra update on each rsync run would be a noticeable extra load.

Comment 6 Brian Dolbec (RETIRED) gentoo-dev

2014-10-18 15:49:06 UTC

We can make a news item; if a user doesn't want it, they can add it to the rsync exclude list. It's not that big a file!

Comment 7 Zac Medico gentoo-dev

2014-10-19 21:41:33 UTC

Created attachment 386988
emerge --search: use description index

This updated patch changes the index format to use spaces instead of commas, for readability. This example is given in man/portage.5:

sys-apps/sed 4.2 4.2.1 4.2.1-r1 4.2.2: Super-useful stream editor
sys-apps/usleep 0.1: A wrapper for usleep

Hopefully that's easier on the eyes (thanks to Michał Górny for the suggestion).
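
As a rough illustration of how a line in this space-separated format can be taken apart, here is a minimal sketch. The names IndexEntry and parse_index_line are hypothetical and are not taken from the patch; the sketch only assumes the layout shown in the two example lines above.

```python
from typing import List, NamedTuple, Optional


class IndexEntry(NamedTuple):
    cp: str              # category/package, e.g. "sys-apps/sed"
    versions: List[str]  # versions in the order they appear on the line
    description: str


def parse_index_line(line: str) -> Optional[IndexEntry]:
    """Parse "<cat/pkg> <ver> [<ver> ...]: <description>" or return None."""
    head, sep, desc = line.rstrip("\n").partition(": ")
    if not sep:
        return None  # no ": " separator, so this is not an index line
    fields = head.split()
    if len(fields) < 2:
        return None  # need a package name and at least one version
    return IndexEntry(fields[0], fields[1:], desc)


entry = parse_index_line(
    "sys-apps/sed 4.2 4.2.1 4.2.1-r1 4.2.2: Super-useful stream editor\n"
)
```

With spaces as the separator, the package name is simply the first field, so no version-parsing logic is needed to tell the name apart from the versions.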

Also, Michał has brought it to my attention that git will send the whole file instead of the delta, unless an expensive `git repack` operation is performed. Maybe it's possible to repack the user.git each time the index is generated? Currently, the master rsync mirror runs egencache every 30 minutes. If user.git syncs at the same interval, it would need to be repacked at the same interval.

Anyway, it would be nice to merge this patch, even if we don't have the resources now to generate the index for gentoo on the server side. We could follow up this patch later with a post emerge --sync hook for client-side index generation.

Comment 8 Zac Medico gentoo-dev

2014-11-05 22:18:15 UTC

In order to optimize IndexedVardb so that it doesn't have to do expensive listdir calls in /var/db/pkg/*, I plan to have vardbapi maintain a small log file that is updated with each merge and unmerge. The log will keep track of all merges/unmerges that have occurred since the most recent update of vdb_metadata.pickle (vdb_metadata.pickle is not updated for every single merge/unmerge, since that would lead to excessive re-writing of a large file). Then, IndexedVardb can use this log file together with vdb_metadata.pickle to get a complete view of /var/db/pkg, without the need to call listdir inside /var/db/pkg/*.
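
The idea of combining a snapshot with a log of later changes can be sketched as follows. The log format is not specified in this bug, so the (action, cpv) tuples, the action names "merge" and "unmerge", and the function name replay_log are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def replay_log(snapshot: Iterable[str],
               log: Iterable[Tuple[str, str]]) -> Set[str]:
    """Apply merge/unmerge records, in order, on top of a snapshot.

    snapshot: installed packages known at the last snapshot update.
    log: hypothetical (action, cpv) records written since that update.
    """
    installed = set(snapshot)
    for action, cpv in log:
        if action == "merge":
            installed.add(cpv)
        elif action == "unmerge":
            installed.discard(cpv)  # tolerate records for unknown packages
        else:
            raise ValueError("unknown log action: %r" % (action,))
    return installed


current = replay_log(
    {"sys-apps/sed-4.2.1", "sys-apps/usleep-0.1"},
    [("merge", "sys-apps/sed-4.2.2"), ("unmerge", "sys-apps/sed-4.2.1")],
)
```

Replaying the records in order matters: a package that is merged and later unmerged must end up absent, which a plain set union of snapshot and log entries would get wrong.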

Comment 9 Zac Medico gentoo-dev

2014-12-07 23:20:04 UTC

This is in the master branch now:

https://github.com/gentoo/portage/commit/d22ba91d4adc177551be8b4be95a6fc1f061fc2e
https://github.com/gentoo/portage/commit/7bc992a02c89c6c3f76b09bfc978c104fb1c2b9a
https://github.com/gentoo/portage/commit/96c2e57685659211c1e33281e08dfaff04d05b58
https://github.com/gentoo/portage/commit/5424b91133b3b155b0e6ddc08fb46ba301d971f8
https://github.com/gentoo/portage/commit/55c8c8bc7a781e3f71ce92922eea64ad4cafce3c
https://github.com/gentoo/portage/commit/d800d224ab38c0f524d3fe858ebe201cbfa903c1
https://github.com/gentoo/portage/commit/646b671d4afb92e0bb81664568544e01e8456dc2

Comment 10 Brian Dolbec (RETIRED) gentoo-dev

2015-02-15 05:33:16 UTC

Released in portage-2.2.16.