grep performance is very poor with LANG=en_us.utf8 LC_ALL=en_us.utf8 the testcase i have is a file with 350k modes uids gids and filenames. with LANG=en_US the testcase takes 2:33. without it takes 1.93 /secs/ it does not make a difference if you change any of the use settings. e.g. USE=+/-nls or USE=+/-pcre makes no measureable differece in the times. i did some profiling and found the problem to be in check_multibyte string. -------------- testcase: ; time grep -c '^00004' elocate.db 32189 113.76user 10.79system 2:33.67elapsed 81%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+237minor)pagefaults 0swaps ; printenv | grep utf LANG=en_US.utf8 LC_ALL=en_US.utf8 ; LANG=en_US LC_ALL=C time grep -c '^00004' elocate.db 32189 0.20user 0.24system 0:01.93elapsed 23%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+176minor)pagefaults 0swaps ; gprof grep gmon.out Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 94.03 65.18 65.18 33284 0.00 0.00 check_multibyte_string [etc] Reproducible: Always Steps to Reproduce: 1. LANG=en_US.utf8 LC_ALL=en_US.utf8 grep <args> 2. wait wait wait 3. results Actual Results: results were correct ... eventually Expected Results: correct results in a reasonable time. Portage 2.0.51.19 (default-linux/x86/2005.0, gcc-3.3.5-20050130, glibc-2.3.4. 20041102-r1, 2.6.11-gentoo-r8ewq i686) ================================================================= System uname: 2.6.11-gentoo-r8ewq i686 Pentium III (Coppermine) Gentoo Base System version 1.4.16 Python: dev-lang/python-2.3.5 [2.3.5 (#1, Apr 29 2005, 10:26:12)] distcc 2.16 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled] dev-lang/python: 2.3.5 sys-apps/sandbox: [Not Present] sys-devel/autoconf: 2.59-r6, 2.13 sys-devel/automake: 1.7.9-r1, 1.8.5-r3, 1.5, 1.4_p6, 1.6.3, 1.9.5 sys-devel/binutils: 2.15.92.0.2-r7 sys-devel/libtool: 1.5.16 virtual/os-headers: 2.6.8.1-r2 ACCEPT_KEYWORDS="x86" AUTOCLEAN="yes" CFLAGS="-O2 -march=pentium3 -fomit-frame-pointer -pipe" CHOST="i686-pc-linux-gnu" CONFIG_PROTECT="/bin:/usr/bin:/var/bind /etc /usr/kde/2/share/config /usr/kde/3. 3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3/share/config / usr/lib/X11/xkb /usr/share/config /var/qmail/control" CONFIG_PROTECT_MASK="/bin:/usr/bin:/etc/gconf /etc/gconf /etc/terminfo /etc/env. d" CXXFLAGS="-O2 -march=pentium3 -fomit-frame-pointer -pipe" DISTDIR="/usr/portage/distfiles" FEATURES="autoaddcvs autoconfig ccache distcc distlocks sandbox sfperms strict" GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/Linux/ distributions/gentoo" LANG="en_US.utf8" LC_ALL="en_US.utf8" MAKEOPTS="-j2" PKGDIR="/usr/portage/packages" PORTAGE_TMPDIR="/var/tmp" PORTDIR="/usr/portage" SYNC="rsync://rsync.gentoo.org/gentoo-portage" USE="x86 X acpi adns alsa apm arts avi berkdb bitmap-fonts blas bzlib cdparanoia cdr crypt cups curl divx4linux dts dvd dvdr dvdread emboss encode esd fam foomaticdb fortran gdbm gif gnome gpm gtk gtk2 imlib ipv6 jpeg kde lapack libcaca libg++ libwww mad mhash mikmod mmap mmx motif mp3 mpeg ncurses nls nvidia offensive ogg oggvorbis opengl oss pam pdflib perl png python qt quicktime readline real rtc sdl spell sse ssl svga tcltk theora tiff truetype truetype-fonts type1-fonts unicode usb vorbis xanim xml xml2 xmms xpm xv xvid xvmc zlib userland_GNU kernel_linux elibc_glibc" Unset: ASFLAGS, CBUILD, CTARGET, LDFLAGS, LINGUAS, PORTDIR_OVERLAY
More info (and fix): http://savannah.gnu.org/patch/?func=detailitem&item_id=3934
Oops. Wrong window. Correct link is http://savannah.gnu.org/patch/?func=detailitem&item_id=3803
Created attachment 59486 [details, diff] Patch #1: false positives in fgrep https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=116909
Created attachment 59488 [details, diff] Patch#2: Improper reset of counter (fixed in upstrem CVS)
Created attachment 59490 [details, diff] Patch #3: MBS_SUPPORT fixes Improper handling of multibyte encodings
Created attachment 59492 [details, diff] Patch #4: ignore case is nor always honoured
Created attachment 59493 [details, diff] Patch #5: Handle UTF-8 as special case
Created attachment 59494 [details, diff] Patch #6: more UTF-8 optimizations
Read changes are done in patch #5 and #6 - the rest are just fixes from official CVS needed for #5 and #6. All patches can be applied against grep 2.5.1 or grep 2.5.1a. Can be found in Fedora, for example.
These two urls are relevant: * https://bugzilla.redhat.com/bugzilla/long_list.cgi?buglist=69900 * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121313
Created attachment 74453 [details] Testcase that includes some multibyte UTF-8 characters (2^16-(2^8+1)) Testcase that includes some multibyte UTF-8 characters (2^16-(2^8+1)), generated with: $ perl -M"encoding 'utf8'" -le 'print 1,chr for 2**8+1..2**16' > out
Created attachment 74455 [details] gprof output of grep being run on attachment 74453 [details] gprof(1) output of grep(1) being run on attachment 74453 [details], around 99% of execution time is being spent on check_multibyte_string() which gets called 65290 times. $ time LC_ALL=en_US.utf8 grep -v ^1 out real 4m24.222s user 3m22.043s sys 0m1.079s
(In reply to comment #8) > Created an attachment (id=59494) [edit] > Patch #6: more UTF-8 optimizations This patch fails to apply to a clean 2.5.1 tree, all the other patches fail partly or fail totally on a patched gentoo tree of the same package.
i tried these patches at the time and they did work for me. the result was speedy. however i see that i neglected to note the exact version of grep that i was working with. actually, i wonder if more character sets could be handled without mbtowc conversion. basically write a bytewise engine that only does mbtowc conversion for the "." operator and character sets. any self-syncing character set (e.g. single-byte sequences cannot be part of a multibyte character) should be doable in this way. but, hey, there's probablly something i'm missing.
This seems more like it belongs to base-system as they are in charge of sys-apps/grep.
Created attachment 83645 [details, diff] Take 2, Patch 1: Special-case UTF-8, massive speed-up (60x on my testcase) I'd hit this bug a little while ago, and came up with a fix that applies against the 2.5.1a ebuild in the tree. I stole the 3 Red Hat Entreprise patches that weren't already in Portage, from this SRPM: https://rhn.redhat.com/errata/RHBA-2005-565.html http://rpmfind.net//linux/RPM/redhat/enterprise/updates/3WS/grep-2.5.1-24.5.src.html I'm attaching 3 patches, and my version of the ebuild. The first one gives the bulk of the performance improvements. The later two are optional, as far as I can tell. Patch 2 gives another 25% speed-up on my test-case. Patch 3 is supposed to fix a bug with the '-w' option, but my quick attempt to reproduce it didn't work. My test-case is as follows: lesha@sheepiness ~ $ ls -Ral /var /dev /lib &> STUFF ^C (after a few seconds) $ ls -l STUFF -rw-r--r-- 1 lesha users 2510848 Apr 1 13:46 STUFF $ export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG real 0m12.451s user 0m12.019s sys 0m0.021s $ time grep '1$' STUFF > /dev/null real 0m0.045s user 0m0.041s sys 0m0.002s After all three patches are applied: export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG real 0m0.257s user 0m0.227s sys 0m0.008s
Created attachment 83646 [details, diff] Take 2, Patch 2: Disable DFA by default in multibyte locales (25% speedup in my test case)
Created attachment 83647 [details, diff] Take 2, Patch 3: Fixes for a bug involving the -w option. I haven't encountered this bug, but here's what the commit log says: Fixed -w handling for EGexecute. Now 'make check' passes. For more information, go here: http://savannah.gnu.org/patch/?func=detailitem&item_id=3809
Created attachment 83648 [details] Take 2, Ebuild: Current grep-2.5.1a.ebuild, with the patches inserted. The patches are applied in the same order: 1, 2, 3.
CCing Mike Frysinger, because he made the last several non-trivial changes to the ebuild, and so might actually do something instead of shifting the responsibility to someone else :)
Oops, I neglected to mention that the ebuild I attached is marked ~x86 stable. That may not be the right thing to check in...
while these patches are an improvement, there is something wrong with the approach as the utf-8 case is still an /order of magnitude slower than the ascii case/. there is no reason for this. utf-8 requires no special handling for the seach pattern you're using. only character classes and "." need to know anything about the width of a utf-8 character. - erik (In reply to comment #16) > $ time grep '1$' STUFF > /dev/null > > real 0m0.045s > user 0m0.041s > sys 0m0.002s > > > After all three patches are applied: > export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG > > real 0m0.257s > user 0m0.227s > sys 0m0.008s >
(In reply to comment #22) Erik, are you suggesting that the maintainers hold off incorporating a fix until there's a "correct" method available? One possible reason for the remaining 5x slowdown is that in utf-8 mode grep needs to know where the character boundaries are. So, it needs to at least do utf-8 decoding. Anyway, my feeling is that it's better to have the 5x penalty (at the expense of fewer people being motivated to fix it), than to have a 300x penalty. So, I'm for incorporation.
accept my aplogies for the misdirected rant. i'm a little frustrated with gnu grep. your patches are good and well considered. and clearly this is a lot better than nothing. however, my frustration lies in the fact that one doesn't need to know where the character boundaries are unless maching a single /unknown/ character as in "." or a negative character class. your test case matched a single known letter. i think you wanted "^l". it is not possible for "l" to match anything but l in utf-8, because all multi-byte encodings have their bucky bits set. also a character that is encoded as >1 byte will also only match that character. thanks for the good work. - erik (In reply to comment #23) > (In reply to comment #22) > > Erik, are you suggesting that the maintainers hold off incorporating a fix > until there's a "correct" method available? > > One possible reason for the remaining 5x slowdown is that in utf-8 mode grep > needs to know where the character boundaries are. So, it needs to at least do > utf-8 decoding. > > Anyway, my feeling is that it's better to have the 5x penalty (at the expense > of fewer people being motivated to fix it), than to have a 300x penalty. So, > I'm for incorporation. >
upstream is actively working on this
*** Bug 159138 has been marked as a duplicate of this bug. ***