475958 – app-portage/gentoolkit - Make regex file matching quicker

Bug 475958 - app-portage/gentoolkit - Make regex file matching quicker

Summary: app-portage/gentoolkit - Make regex file matching quicker

Status:	CONFIRMED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages (show other bugs)
Hardware:	x86 Linux

Importance:	Normal enhancement
Assignee:	Portage Tools Team

URL:
Whiteboard:
Keywords:	PATCH

Depends on:
Blocks:

Reported:	2013-07-06 15:25 UTC by ncahill_alt
Modified:	2013-07-10 17:20 UTC (History)
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Attachments
patch to speed up regex file search, written for Python 3 (bug475958.patch,1.66 KB, patch) 2013-07-10 17:13 UTC, ncahill_alt	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description ncahill_alt 2013-07-06 15:25:00 UTC

Having run "equery belongs md5sum" which seemed to run slowly, I investigated and saw that most of the run time is due to regular expression matching.  The regular expression used has some redundancy: /md5sum$|^/current/working/dir/md5sum$

Although we have two branches, the first branch will match everything the second will.  The second branch is there in case a symlink in the current path points to a file with a different name.  So for example, let /bin/ksh -> /bin/mksh, then executing "equery belongs ksh" in /bin will search for /ksh$|^/bin/mksh$.

I have written a patch to improve this while retaining the symlink functionality.  This improves the run time of "equery belongs" by 9% for me.  I have tested this only on Python 3.3 and don't know Python well myself.  This was an excuse to learn it a little.  This patch will need to be adjusted to work with all versions.

Thank you.

--- old_helpers.py	2013-07-06 11:44:38.524514791 +0100
+++ helpers.py	2013-07-06 16:03:06.241003837 +0100
@@ -338,6 +338,46 @@
 
 		return paths
 
+	@staticmethod
+	def _filter_redundant_regexes(rs):
+		"""
+		Given a list of regular expressions to be joined via "|".join(),
+		removes those branches that have non-unique suffixes: 
+			['a$', 'ba$'] --> ['a$']
+		for the purpose of improving efficiency.  
+		
+		This is used in _prepare_search_regex to make filename matching
+		more efficient.
+
+		@type rs: list
+		@param rs: file path regular expressions each ending with '$'
+		@rtype: list
+		@return: list with possibly fewer branches
+		
+		@raise ValueError: if a regular expression does not end with '$'
+		"""
+
+		# sort in place
+		rs.sort(key= len, reverse= True)
+		rs_filtered = []
+
+		# keep a list item only if no suffix of it is present
+		for i in range(len(rs)):
+			found = False
+			for j in range(i + 1, len(rs)):
+				if rs[i].endswith(rs[j]):
+					found = True
+					break
+			if not found:
+				if not rs[i].endswith('$'):
+					raise ValueError(
+						"_filter_redundant_regexes needs all regexes to end with '$'")
+				rs_filtered.append(rs[i])
+
+		# we assume a search with very few hits; sort order is unimportant
+		return rs_filtered
+
+
 	def _prepare_search_regex(self, queries):
 		"""Create a regex out of the queries"""
 
@@ -357,7 +397,9 @@
 				else:
 					query = "/%s$" % re.escape(query)
 				result.append(query)
-		result = "|".join(result)
+
+		# remove same redundancy here, 9% improvement in run time with "equery belongs"
+		result = "|".join( self._filter_redundant_regexes(result) )
 		return result
 
 # =========










Reproducible: Always

Comment 1 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-07-06 17:35:33 UTC

Please attach your patch to the bug.

Comment 2 ncahill_alt 2013-07-10 17:13:43 UTC

Created attachment 353022 [details, diff]
patch to speed up regex file search, written for Python 3

Comment 3 ncahill_alt 2013-07-10 17:20:03 UTC

There's no need to credit me for this in the Changelog, Gentoo owns it now.

Thanks.
Neil.