Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!

Bug 475958

Summary: app-portage/gentoolkit - Make regex file matching quicker
Product: Gentoo Linux Reporter: ncahill_alt
Component: Current packagesAssignee: Portage Tools Team <tools-portage>
Status: CONFIRMED ---    
Severity: enhancement Keywords: PATCH
Priority: Normal    
Version: unspecified   
Hardware: x86   
OS: Linux   
Whiteboard:
Package list:
Runtime testing required: ---
Attachments: patch to speed up regex file search, written for Python 3

Description ncahill_alt 2013-07-06 15:25:00 UTC
Having run "equery belongs md5sum" which seemed to run slowly, I investigated and saw that most of the run time is due to regular expression matching.  The regular expression used has some redundancy: /md5sum$|^/current/working/dir/md5sum$

Although we have two branches, the first branch will match everything the second will.  The second branch is there in case a symlink in the current path points to a file with a different name.  So for example, let /bin/ksh -> /bin/mksh, then executing "equery belongs ksh" in /bin will search for /ksh$|^/bin/mksh$.

I have written a patch to improve this while retaining the symlink functionality.  This improves the run time of "equery belongs" by 9% for me.  I have tested this only on Python 3.3 and don't know Python well myself.  This was an excuse to learn it a little.  This patch will need to be adjusted to work with all versions.

Thank you.

--- old_helpers.py	2013-07-06 11:44:38.524514791 +0100
+++ helpers.py	2013-07-06 16:03:06.241003837 +0100
@@ -338,6 +338,46 @@
 
 		return paths
 
+	@staticmethod
+	def _filter_redundant_regexes(rs):
+		"""
+		Given a list of regular expressions to be joined via "|".join(),
+		removes those branches that have non-unique suffixes: 
+			['a$', 'ba$'] --> ['a$']
+		for the purpose of improving efficiency.  
+		
+		This is used in _prepare_search_regex to make filename matching
+		more efficient.
+
+		@type rs: list
+		@param rs: file path regular expressions each ending with '$'
+		@rtype: list
+		@return: list with possibly fewer branches
+		
+		@raise ValueError: if a regular expression does not end with '$'
+		"""
+
+		# sort in place
+		rs.sort(key= len, reverse= True)
+		rs_filtered = []
+
+		# keep a list item only if no suffix of it is present
+		for i in range(len(rs)):
+			found = False
+			for j in range(i + 1, len(rs)):
+				if rs[i].endswith(rs[j]):
+					found = True
+					break
+			if not found:
+				if not rs[i].endswith('$'):
+					raise ValueError(
+						"_filter_redundant_regexes needs all regexes to end with '$'")
+				rs_filtered.append(rs[i])
+
+		# we assume a search with very few hits; sort order is unimportant
+		return rs_filtered
+
+
 	def _prepare_search_regex(self, queries):
 		"""Create a regex out of the queries"""
 
@@ -357,7 +397,9 @@
 				else:
 					query = "/%s$" % re.escape(query)
 				result.append(query)
-		result = "|".join(result)
+
+		# remove same redundancy here, 9% improvement in run time with "equery belongs"
+		result = "|".join( self._filter_redundant_regexes(result) )
 		return result
 
 # =========










Reproducible: Always
Comment 1 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-07-06 17:35:33 UTC
Please attach your patch to the bug.
Comment 2 ncahill_alt 2013-07-10 17:13:43 UTC
Created attachment 353022 [details, diff]
patch to speed up regex file search, written for Python 3
Comment 3 ncahill_alt 2013-07-10 17:20:03 UTC
There's no need to credit me for this in the Changelog, Gentoo owns it now.

Thanks.
Neil.