Running "equery belongs md5sum" seemed slow, so I investigated and found that most of the run time is spent in regular expression matching. The regular expression used contains some redundancy:

    /md5sum$|^/current/working/dir/md5sum$

Although there are two branches, the first branch matches everything the second does. The second branch exists in case a symlink in the current path points to a file with a different name. For example, if /bin/ksh -> /bin/mksh, then executing "equery belongs ksh" in /bin will search for /ksh$|^/bin/mksh$.

I have written a patch that removes this redundancy while retaining the symlink functionality. It improves the run time of "equery belongs" by 9% for me. I have tested it only on Python 3.3, and I don't know Python well myself; this was an excuse to learn it a little. The patch will need to be adjusted to work with all supported versions. Thank you.

--- old_helpers.py	2013-07-06 11:44:38.524514791 +0100
+++ helpers.py	2013-07-06 16:03:06.241003837 +0100
@@ -338,6 +338,46 @@
 		return paths
 
+	@staticmethod
+	def _filter_redundant_regexes(rs):
+		"""
+		Given a list of regular expressions to be joined via "|".join(),
+		removes those branches that have non-unique suffixes:
+			['a$', 'ba$'] --> ['a$']
+		for the purpose of improving efficiency.
+
+		This is used in _prepare_search_regex to make filename matching
+		more efficient.
+
+		@type rs: list
+		@param rs: file path regular expressions, each ending with '$'
+		@rtype: list
+		@return: list with possibly fewer branches
+
+		@raise ValueError: if a regular expression does not end with '$'
+		"""
+
+		# sort in place, longest first
+		rs.sort(key=len, reverse=True)
+		rs_filtered = []
+
+		# keep a list item only if no other item is a suffix of it
+		for i in range(len(rs)):
+			found = False
+			for j in range(i + 1, len(rs)):
+				if rs[i].endswith(rs[j]):
+					found = True
+					break
+			if not found:
+				if not rs[i].endswith('$'):
+					raise ValueError(
+						"_filter_redundant_regexes needs all regexes to end with '$'")
+				rs_filtered.append(rs[i])
+
+		# we assume a search with very few hits; sort order is unimportant
+		return rs_filtered
+
 	def _prepare_search_regex(self, queries):
 		"""Create a regex out of the queries"""
@@ -357,7 +397,9 @@
 			else:
 				query = "/%s$" % re.escape(query)
 			result.append(query)
-		result = "|".join(result)
+
+		# remove the redundancy here; 9% improvement in run time with "equery belongs"
+		result = "|".join(self._filter_redundant_regexes(result))
 		return result
 # =========

Reproducible: Always
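To illustrate the idea outside of gentoolkit, here is a minimal standalone sketch of the suffix-filtering step. The function name mirrors the one in the patch; everything else (the sample branches, the module-free layout) is illustrative and not taken from helpers.py:

```python
def filter_redundant_regexes(rs):
    """Drop alternation branches that are suffixes of another branch.

    Every branch must end with '$'. A branch like '^/current/working/dir/md5sum$'
    is redundant if a branch such as '/md5sum$' already matches everything it would.
    """
    # Longest first, so a surviving branch can only be shadowed by a
    # shorter (or equal-length) branch that appears later in the list.
    rs = sorted(rs, key=len, reverse=True)
    kept = []
    for i, branch in enumerate(rs):
        if not branch.endswith('$'):
            raise ValueError("all regexes must end with '$'")
        # Keep the branch only if no later branch is a suffix of it.
        if not any(branch.endswith(other) for other in rs[i + 1:]):
            kept.append(branch)
    return kept

# The md5sum case from the report: the anchored branch is redundant.
print(filter_redundant_regexes(['/md5sum$', '^/current/working/dir/md5sum$']))
# -> ['/md5sum$']

# The symlink case keeps both branches, since neither is a suffix of the other.
print(sorted(filter_redundant_regexes(['/ksh$', '^/bin/mksh$'])))
# -> ['/ksh$', '^/bin/mksh$']
```

A side effect of the suffix test is that exact duplicate branches are also removed, since a string is a suffix of an identical string.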
Please attach your patch to the bug.
Created attachment 353022 [details, diff]
patch to speed up regex file search, written for Python 3
There's no need to credit me for this in the ChangeLog; Gentoo owns it now. Thanks. Neil.