Bug 283610 - Proposal and offer of assistance in vastly improving the search functionality of bugs.gentoo.org (bugzilla)
Summary: Proposal and offer of assistance in vastly improving the search functionality...
Status: RESOLVED OBSOLETE
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Bugzilla
Hardware: All Linux
Importance: High minor
Assignee: Bugzilla Admins
URL: http://www.sphinxsearch.com/
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-09-03 20:01 UTC by Matthew Gregory Sr.
Modified: 2013-09-09 15:22 UTC
CC List: 2 users

See Also:
Package list:
Runtime testing required: ---


Description Matthew Gregory Sr. 2009-09-03 20:01:34 UTC
I have spent the past 8 months working alongside our DBA on creating and improving FTI search and our database.  I have used Gentoo for 8 years and bugs.gentoo.org for more than 7 of them, so I have a lot of experience with, and many hours of frustration behind, the old Bugzilla search function used by bugs.gentoo.org.

Now that I know that bugs.gentoo.org uses MySQL as its back-end, I know that I can vastly improve both the quality of the bug search engine and the overall speed of the search, with little or no impact on the normal operation of bugs.gentoo.org.

I propose using the Sphinx SQL FTI search engine to index the bugs currently in the database, as well as all of the comments that have been entered on those bugs, and then extending the footer section of the standard Bugzilla pages with a new search box for FTI searching on bugs.  During the trial phase, running both searches side by side would let us verify that the new search finds at least the same results as the old one and measure how much it improves on it.

The nature of the Sphinx FTI search is such that no changes should be required to the existing Bugzilla data structure except for new tables, and the new search functionality is handled almost entirely within MySQL stored procedures.  The only Bugzilla changes required should be adding the new FTI search input field and the search handler.  I have a great deal of experience implementing Sphinx FTI with MySQL and considerable experience with Perl as well, although I sincerely doubt much Perl will be needed for this.  The standard Perl DBI code should be able to pass everything off to the stored procedure, and the regular search result handling script should be able to handle the Sphinx search results just as well as it handles the current results, with little or no change.  :-D

Reproducible: Always

Steps to Reproduce:
Use the standard bugzilla search... blech
Actual Results:  
vary

Expected Results:  
fast and reliable full text searches through both the bugs and the comments on those bugs.

Let's git 'er done y'all
Comment 1 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2009-09-03 20:56:32 UTC
We're in the process of migrating to Bugzilla 3.

I've got no objections to Sphinx for this use-case, and I'd encourage you to consider writing said patch, but against the latest Bugzilla, to be applied AFTER the migration.

A couple of suggestions for it:
- privacy protection for locked bugs is important. The search engine must not leak bugs.
- I'd like the Sphinx to NOT be the MySQL engine variant, but stand-alone.
- Turnaround time for updates is very important.
- Ability for admins to force reindexing of a given bug id #, with the old content dropped.

Our database setup for Bugzilla is not small. It's got two dedicated DB servers, each with master and slave, with master-master and master-slave replication.

Lastly, in my own experience with FTI, Sphinx had issues with documents that had content deleted/updated. I've spoken in person to the author about them, and he's been promising to look at them for years now, but still nothing. Lucene is my personal preference (3 years' experience with it), but since we run zero Java in Gentoo infrastructure, Sphinx or PLucene would be suitable alternatives.
Comment 2 Matthew Gregory Sr. 2009-09-04 03:47:25 UTC
>> A couple of suggestions for it:
>> - privacy protection for locked bugs is important. The search engine must not leak bugs.

This should pose no complications.  Whatever flag marks a bug as locked can be added to the indexer requirements so that locked bugs are simply left out of the search index.  If you want them searchable, but only by people with access to the locked bugs, that can be accomplished in any of several ways (most probably a separate index holding those records in addition to the ones in the general search, and a stored procedure that looks at both indexes instead of only the one).
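
To illustrate the two-index idea above, here is a minimal sketch in Python (chosen only for illustration, not what Bugzilla itself would use).  The index names `bugs_public` and `bugs_restricted` and the single `can_see_restricted` flag are hypothetical; Bugzilla's real access control is group-based and more granular, so the post-filter step is included as an assumed second line of defense against leaks.

```python
from typing import Iterable, List, Set

PUBLIC_INDEX = "bugs_public"          # hypothetical index of unlocked bugs
RESTRICTED_INDEX = "bugs_restricted"  # hypothetical index of locked bugs


def indexes_for_user(can_see_restricted: bool) -> List[str]:
    """Return the names of the indexes this user's search may consult."""
    indexes = [PUBLIC_INDEX]
    if can_see_restricted:
        indexes.append(RESTRICTED_INDEX)
    return indexes


def filter_hits(hit_ids: Iterable[int], visible_ids: Set[int]) -> List[int]:
    """Drop any hit the user may not see, preserving the ranking order.

    visible_ids is assumed to come from the application's own
    permission check, not from the search engine.
    """
    return [bug_id for bug_id in hit_ids if bug_id in visible_ids]
```

With this shape, a user without access never queries the restricted index at all, and anything that slips through is still removed before results are displayed.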

>> - I'd like the Sphinx to NOT be the MySQL engine variant, but stand-alone.

I'm curious about this, not because I disagree (in truth I neither agree nor disagree), but because I'd like to know what has influenced you against using the Sphinx MySQL engine.  My tests show that it is easier to produce stored procedures with the Sphinx engine and that some operations are impossible without it, while speed is virtually unaffected either way.

Again, I want to stress that I am not disagreeing with you, merely curious about it.

>> - Turnaround time for updates is very important.

Agreed and noted.

>> - Ability for admins to force reindexing of a given bug id #, with the old content dropped.

AFAIK there is no way to tell Sphinx to update an index by re-indexing only certain content or records unless you design that ability into the structure of the Sphinx indexes themselves.  When indexing descriptions of 20+ million businesses, we subdivided them by region (states, territories, small countries) so that we could re-index only a given region when an update was required outside of the regularly scheduled index updates (twice a day in our case, probably more often for bugs.gentoo.org).  Something like this could probably be worked in based on a date/time range for bugs.gentoo.org.  Either way, the indexer could certainly be kicked off by the admins at any time.  In my experience, if you plan on re-indexing a great deal, the indexing should be done on a machine other than the database server because of memory use.  A properly configured production MySQL database uses a lot of memory, and so does the Sphinx indexer.  The indexer's memory use is configurable, but I have noticed that limiting it during an index rebuild slows the rebuild down horribly.
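
As a rough sketch of the date-range idea above, the following Python maps each bug to a per-year index so that forcing a re-index of one bug only means rebuilding the index it lives in.  The per-year granularity and the `bugs_<year>` naming are assumptions made for illustration; the right partition size would depend on the real data volume.

```python
from datetime import date
from typing import Iterable, List


def shard_for(created: date) -> str:
    """Map a bug's creation date to a per-year index name (hypothetical naming)."""
    return f"bugs_{created.year}"


def shards_to_rebuild(created_dates: Iterable[date]) -> List[str]:
    """Return the distinct indexes that must be rebuilt for the given bugs."""
    return sorted({shard_for(d) for d in created_dates})
```

An admin tool could look up the creation date of the bug in question, call `shard_for` on it, and trigger the indexer for that one index only.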

I intend to install a local copy of everything and experiment a bit.  I have 2 TB of disk space and 8 GB of memory available to experiment with, but since I am unfamiliar with the size of the data for bugs.gentoo.org, you will have to let me know if this isn't nearly enough to run my tests.  Also, I need to know whether a MySQL dump of unlocked bugs could be provided to me for the tests (I need a relatively accurate sampling of the data and only want to fall back on generating it myself as a plan B).
Comment 3 Alex Legler (RETIRED) archtester gentoo-dev Security 2013-09-09 15:22:00 UTC
I'm closing this bug due to inactivity. Should the problem you are trying to fix still persist and you still wish to help, please reopen this bug.