Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 475958 - app-portage/gentoolkit - Make regex file matching quicker
Summary: app-portage/gentoolkit - Make regex file matching quicker
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages (show other bugs)
Hardware: x86 Linux
: Normal enhancement (vote)
Assignee: Portage Tools Team
Keywords: PATCH
Depends on:
Reported: 2013-07-06 15:25 UTC by ncahill_alt
Modified: 2013-07-10 17:20 UTC (History)
0 users

See Also:
Package list:
Runtime testing required: ---

patch to speed up regex file search, written for Python 3 (bug475958.patch,1.66 KB, patch)
2013-07-10 17:13 UTC, ncahill_alt
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description ncahill_alt 2013-07-06 15:25:00 UTC
Having run "equery belongs md5sum" which seemed to run slowly, I investigated and saw that most of the run time is due to regular expression matching.  The regular expression used has some redundancy: /md5sum$|^/current/working/dir/md5sum$

Although we have two branches, the first branch will match everything the second will.  The second branch is there in case a symlink in the current path points to a file with a different name.  So for example, let /bin/ksh -> /bin/mksh, then executing "equery belongs ksh" in /bin will search for /ksh$|^/bin/mksh$.

I have written a patch to improve this while retaining the symlink functionality.  This improves the run time of "equery belongs" by 9% for me.  I have tested this only on Python 3.3 and don't know Python well myself.  This was an excuse to learn it a little.  This patch will need to be adjusted to work with all versions.

Thank you.

---	2013-07-06 11:44:38.524514791 +0100
+++	2013-07-06 16:03:06.241003837 +0100
@@ -338,6 +338,46 @@
 		return paths
+	@staticmethod
+	def _filter_redundant_regexes(rs):
+		"""
+		Given a list of regular expressions to be joined via "|".join(),
+		removes those branches that have non-unique suffixes: 
+			['a$', 'ba$'] --> ['a$']
+		for the purpose of improving efficiency.  
+		This is used in _prepare_search_regex to make filename matching
+		more efficient.
+		@type rs: list
+		@param rs: file path regular expressions each ending with '$'
+		@rtype: list
+		@return: list with possibly fewer branches
+		@raise ValueError: if a regular expression does not end with '$'
+		"""
+		# sort in place
+		rs.sort(key= len, reverse= True)
+		rs_filtered = []
+		# keep a list item only if no suffix of it is present
+		for i in range(len(rs)):
+			found = False
+			for j in range(i + 1, len(rs)):
+				if rs[i].endswith(rs[j]):
+					found = True
+					break
+			if not found:
+				if not rs[i].endswith('$'):
+					raise ValueError(
+						"_filter_redundant_regexes needs all regexes to end with '$'")
+				rs_filtered.append(rs[i])
+		# we assume a search with very few hits; sort order is unimportant
+		return rs_filtered
 	def _prepare_search_regex(self, queries):
 		"""Create a regex out of the queries"""
@@ -357,7 +397,9 @@
 					query = "/%s$" % re.escape(query)
-		result = "|".join(result)
+		# remove same redundancy here, 9% improvement in run time with "equery belongs"
+		result = "|".join( self._filter_redundant_regexes(result) )
 		return result
 # =========

Reproducible: Always
Comment 1 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-07-06 17:35:33 UTC
Please attach your patch to the bug.
Comment 2 ncahill_alt 2013-07-10 17:13:43 UTC
Created attachment 353022 [details, diff]
patch to speed up regex file search, written for Python 3
Comment 3 ncahill_alt 2013-07-10 17:20:03 UTC
There's no need to credit me for this in the Changelog, Gentoo owns it now.