Description
erik quanstrom
2005-05-21 09:58:09 UTC
More info (and fix): http://savannah.gnu.org/patch/?func=detailitem&item_id=3934 Oops. Wrong window. Correct link is http://savannah.gnu.org/patch/?func=detailitem&item_id=3803 Created attachment 59486 [details, diff] Patch #1: false positives in fgrep https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=116909 Created attachment 59488 [details, diff]
Patch#2: Improper reset of counter (fixed in upstrem CVS)
Created attachment 59490 [details, diff]
Patch #3: MBS_SUPPORT fixes
Improper handling of multibyte encodings
Created attachment 59492 [details, diff]
Patch #4: ignore case is nor always honoured
Created attachment 59493 [details, diff]
Patch #5: Handle UTF-8 as special case
Created attachment 59494 [details, diff]
Patch #6: more UTF-8 optimizations
Read changes are done in patch #5 and #6 - the rest are just fixes from official CVS needed for #5 and #6. All patches can be applied against grep 2.5.1 or grep 2.5.1a. Can be found in Fedora, for example. These two urls are relevant: * https://bugzilla.redhat.com/bugzilla/long_list.cgi?buglist=69900 * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121313 Created attachment 74453 [details]
Testcase that includes some multibyte UTF-8 characters (2^16-(2^8+1))
Testcase that includes some multibyte UTF-8 characters (2^16-(2^8+1)),
generated with:
$ perl -M"encoding 'utf8'" -le 'print 1,chr for 2**8+1..2**16' > out
Created attachment 74455 [details] gprof output of grep being run on attachment 74453 [details] gprof(1) output of grep(1) being run on attachment 74453 [details], around 99% of execution time is being spent on check_multibyte_string() which gets called 65290 times. $ time LC_ALL=en_US.utf8 grep -v ^1 out real 4m24.222s user 3m22.043s sys 0m1.079s (In reply to comment #8) > Created an attachment (id=59494) [edit] > Patch #6: more UTF-8 optimizations This patch fails to apply to a clean 2.5.1 tree, all the other patches fail partly or fail totally on a patched gentoo tree of the same package. i tried these patches at the time and they did work for me. the result was speedy. however i see that i neglected to note the exact version of grep that i was working with. actually, i wonder if more character sets could be handled without mbtowc conversion. basically write a bytewise engine that only does mbtowc conversion for the "." operator and character sets. any self-syncing character set (e.g. single-byte sequences cannot be part of a multibyte character) should be doable in this way. but, hey, there's probablly something i'm missing. This seems more like it belongs to base-system as they are in charge of sys-apps/grep. Created attachment 83645 [details, diff] Take 2, Patch 1: Special-case UTF-8, massive speed-up (60x on my testcase) I'd hit this bug a little while ago, and came up with a fix that applies against the 2.5.1a ebuild in the tree. I stole the 3 Red Hat Entreprise patches that weren't already in Portage, from this SRPM: https://rhn.redhat.com/errata/RHBA-2005-565.html http://rpmfind.net//linux/RPM/redhat/enterprise/updates/3WS/grep-2.5.1-24.5.src.html I'm attaching 3 patches, and my version of the ebuild. The first one gives the bulk of the performance improvements. The later two are optional, as far as I can tell. Patch 2 gives another 25% speed-up on my test-case. Patch 3 is supposed to fix a bug with the '-w' option, but my quick attempt to reproduce it didn't work. My test-case is as follows: lesha@sheepiness ~ $ ls -Ral /var /dev /lib &> STUFF ^C (after a few seconds) $ ls -l STUFF -rw-r--r-- 1 lesha users 2510848 Apr 1 13:46 STUFF $ export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG real 0m12.451s user 0m12.019s sys 0m0.021s $ time grep '1$' STUFF > /dev/null real 0m0.045s user 0m0.041s sys 0m0.002s After all three patches are applied: export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG real 0m0.257s user 0m0.227s sys 0m0.008s Created attachment 83646 [details, diff]
Take 2, Patch 2: Disable DFA by default in multibyte locales (25% speedup in my test case)
Created attachment 83647 [details, diff] Take 2, Patch 3: Fixes for a bug involving the -w option. I haven't encountered this bug, but here's what the commit log says: Fixed -w handling for EGexecute. Now 'make check' passes. For more information, go here: http://savannah.gnu.org/patch/?func=detailitem&item_id=3809 Created attachment 83648 [details]
Take 2, Ebuild: Current grep-2.5.1a.ebuild, with the patches inserted.
The patches are applied in the same order: 1, 2, 3.
CCing Mike Frysinger, because he made the last several non-trivial changes to the ebuild, and so might actually do something instead of shifting the responsibility to someone else :) Oops, I neglected to mention that the ebuild I attached is marked ~x86 stable. That may not be the right thing to check in... while these patches are an improvement, there is something wrong with the approach as the utf-8 case is still an /order of magnitude slower than the ascii case/. there is no reason for this. utf-8 requires no special handling for the seach pattern you're using. only character classes and "." need to know anything about the width of a utf-8 character. - erik (In reply to comment #16) > $ time grep '1$' STUFF > /dev/null > > real 0m0.045s > user 0m0.041s > sys 0m0.002s > > > After all three patches are applied: > export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG > > real 0m0.257s > user 0m0.227s > sys 0m0.008s > (In reply to comment #22) Erik, are you suggesting that the maintainers hold off incorporating a fix until there's a "correct" method available? One possible reason for the remaining 5x slowdown is that in utf-8 mode grep needs to know where the character boundaries are. So, it needs to at least do utf-8 decoding. Anyway, my feeling is that it's better to have the 5x penalty (at the expense of fewer people being motivated to fix it), than to have a 300x penalty. So, I'm for incorporation. accept my aplogies for the misdirected rant. i'm a little frustrated with gnu grep. your patches are good and well considered. and clearly this is a lot better than nothing. however, my frustration lies in the fact that one doesn't need to know where the character boundaries are unless maching a single /unknown/ character as in "." or a negative character class. your test case matched a single known letter. i think you wanted "^l". it is not possible for "l" to match anything but l in utf-8, because all multi-byte encodings have their bucky bits set. also a character that is encoded as >1 byte will also only match that character. thanks for the good work. - erik (In reply to comment #23) > (In reply to comment #22) > > Erik, are you suggesting that the maintainers hold off incorporating a fix > until there's a "correct" method available? > > One possible reason for the remaining 5x slowdown is that in utf-8 mode grep > needs to know where the character boundaries are. So, it needs to at least do > utf-8 decoding. > > Anyway, my feeling is that it's better to have the 5x penalty (at the expense > of fewer people being motivated to fix it), than to have a 300x penalty. So, > I'm for incorporation. > upstream is actively working on this *** Bug 159138 has been marked as a duplicate of this bug. *** |