Summary: | Proposal and offer of assistance in vastly improving the search functionality of bugs.gentoo.org (bugzilla) | ||
---|---|---|---|
Product: | Gentoo Infrastructure | Reporter: | Matthew Gregory Sr. <skyleach> |
Component: | Bugzilla | Assignee: | Bugzilla Admins <bugzilla> |
Status: | RESOLVED OBSOLETE | ||
Severity: | minor | CC: | djc, idl0r |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://www.sphinxsearch.com/ | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- |
Description
Matthew Gregory Sr.
2009-09-03 20:01:34 UTC
We're in the process of migrating to Bugzilla 3. I've got no objections to Sphinx for this use-case, and I'd encourage you to consider writing said patch, but against the latest Bugzilla, to be applied AFTER the migration.

A couple of suggestions for it:
- Privacy protection for locked bugs is important. The search engine must not leak bugs.
- I'd like the Sphinx to NOT be the MySQL engine variant, but stand-alone.
- Turnaround time for updates is very important.
- Ability for admins to force reindexing of a given bug id #, with the old content dropped.

Our database setup for Bugzilla is not small. It's got two dedicated DB servers, each with master and slave, with master-master and master-slave replication.

Lastly, in my own experience with FTI, Sphinx had issues with documents that had content deleted/updated. I've spoken in person to the author about them, and he's been promising to look at them for years now, but still nothing. Lucene is my personal preference (3 years experience with it), but since we run zero-Java in Gentoo infrastructure, Sphinx or PLucene would be suitable alternatives.

>> A couple of suggestions for it:
>> - privacy protection for locked bugs is important. The search engine must not leak bugs.

This should offer no complications. Whatever flag marks a bug as locked can be added to the indexer's requirements so that locked bugs are simply left out of the search index. If you want them searchable, but only by people with access to the locked bugs, that can be accomplished in any of several ways (most probably with a separate index holding those records in addition to those in the general search, and a stored procedure that looks at both instead of only the one).

>> - I'd like the Sphinx to NOT be the MySQL engine variant, but stand-alone.

I'm curious about this, not because I disagree (in truth I neither agree nor disagree), but because I'd like to know what has influenced you against using the Sphinx MySQL engine.
My tests show that it is easier to produce stored procedures with the Sphinx engine, and there are some operations that are impossible without it, while speed is virtually unaffected either way. Again, I want to stress that I am not disagreeing with you, merely curious about it.

>> - Turnaround time for updates is very important.

Agreed and noted.

>> - Ability for admins to force reindexing of a given bug id #, with the old content dropped.

AFAIK there is no way to tell Sphinx to update the index by re-indexing only certain content or records unless you design that ability into the structure of the Sphinx indexes themselves. When indexing descriptions of 20+ million businesses, we subdivided them by region (states, territories, small countries) so that we could re-index only a given region when an update was required outside of the regularly scheduled index updates (twice a day in our case, probably more often for bugs.gentoo.org). Something like this could probably be worked in based on a date/time range for bugs.gentoo.org. Either way, the indexer could certainly be started by the admins at any time.

In my experience, if you plan on re-indexing a great deal, then the indexing should be done by a machine other than the database server due to memory use. A properly configured production MySQL database uses a lot of memory, and the Sphinx indexer also uses a lot of memory. The indexer's memory use is configurable, but I have noticed that limiting it during an index rebuild slows the rebuild down horribly.

I intend to install a local copy of everything and experiment a bit. I have 2 TB of disk space and 8 GB of memory available to experiment with on this, but since I am unfamiliar with the size of the data for bugs.gentoo.org, you will have to let me know if this isn't nearly enough to run my tests.
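
To make the date/time range idea above a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `bugs_<year>` index names, the helper functions, and the sample bug ids are all hypothetical, and none of this is Sphinx or Bugzilla API. It assumes each bug is assigned to a partition by its creation date, which never changes, so a bug always stays in the same partition and a forced re-index of one bug only has to rebuild that one partition.

```python
from collections import defaultdict
from datetime import date


def shard_name(created: date) -> str:
    """Map a bug's creation date to a per-year index name (hypothetical naming)."""
    return f"bugs_{created.year}"


def build_shards(bugs: dict[int, date]) -> dict[str, list[int]]:
    """Group bug ids into per-year partitions, keyed by index name."""
    shards: dict[str, list[int]] = defaultdict(list)
    for bug_id, created in bugs.items():
        shards[shard_name(created)].append(bug_id)
    # Sort ids so each partition has a stable, predictable order.
    return {name: sorted(ids) for name, ids in shards.items()}


def shard_to_reindex(bugs: dict[int, date], bug_id: int) -> str:
    """Return the only partition that must be rebuilt to refresh one bug."""
    return shard_name(bugs[bug_id])


if __name__ == "__main__":
    # Made-up sample data: bug id -> creation date.
    sample = {
        1: date(2008, 1, 15),
        2: date(2008, 11, 2),
        3: date(2009, 9, 3),
    }
    print(build_shards(sample))
    print(shard_to_reindex(sample, 2))
```

Per-year partitions are just one possible granularity; smaller ranges would make a forced re-index cheaper at the cost of more indexes to search across.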
Also, I need to know if a MySQL dump of unlocked bugs could be provided to me for the tests (I need a relatively accurate sampling of the data and only want to fall back on generating it myself as a plan B).

I'm closing this bug due to inactivity. Should the problem you are trying to fix still persist and you still wish to help, please reopen this bug.