Bug 93443

Summary:	grep 79x slowdown with LANG=en_us.utf8 LC_ALL=en_us.utf8
Product:	Gentoo Linux	Reporter:	erik quanstrom <quanstro>
Component:	[OLD] Core system	Assignee:	Gentoo's Team for Core System packages <base-system>
Status:	RESOLVED UPSTREAM
Severity:	normal	CC:	avarab, betelgeuse, gentoo, snarkmaster, utf8
Priority:	High
Version:	unspecified
Hardware:	All
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	Patch #1: false positives in fgrep
	Patch #2: Improper reset of counter (fixed in upstream CVS)
	Patch #3: MBS_SUPPORT fixes
	Patch #4: ignore case is not always honoured
	Patch #5: Handle UTF-8 as special case
	Patch #6: more UTF-8 optimizations
	Testcase that includes some multibyte UTF-8 characters (2^16-(2^8+1))
	gprof output of grep being run on attachment 74453
	Take 2, Patch 1: Special-case UTF-8, massive speed-up (60x on my testcase)
	Take 2, Patch 2: Disable DFA by default in multibyte locales (25% speedup in my test case)
	Take 2, Patch 3: Fixes for a bug involving the -w option.
	Take 2, Ebuild: Current grep-2.5.1a.ebuild, with the patches inserted.

Description erik quanstrom 2005-05-21 09:58:09 UTC

grep performance is very poor with LANG=en_us.utf8 LC_ALL=en_us.utf8
the testcase i have is a file with 350k modes uids gids and filenames.
with LANG=en_US.utf8 the testcase takes 2:33. without it (LC_ALL=C) it takes 1.93 /secs/

it does not make a difference if you change any of the use settings.
e.g. USE=+/-nls or USE=+/-pcre makes no measurable difference in the times.

i did some profiling and found the problem to be in check_multibyte_string.

--------------

testcase:

; time grep -c '^00004' elocate.db 
32189
113.76user 10.79system 2:33.67elapsed 81%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+237minor)pagefaults 0swaps

; printenv | grep utf
LANG=en_US.utf8
LC_ALL=en_US.utf8


; LANG=en_US LC_ALL=C time grep -c '^00004' elocate.db 
32189
0.20user 0.24system 0:01.93elapsed 23%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+176minor)pagefaults 0swaps

; gprof grep gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 94.03     65.18    65.18    33284     0.00     0.00  check_multibyte_string
[etc]

Reproducible: Always
Steps to Reproduce:
1. LANG=en_US.utf8 LC_ALL=en_US.utf8 grep <args>
2. wait wait wait
3. results

Actual Results:  
results were correct ... eventually

Expected Results:  
correct results in a reasonable time.

Portage 2.0.51.19 (default-linux/x86/2005.0, gcc-3.3.5-20050130, glibc-2.3.4.20041102-r1, 2.6.11-gentoo-r8ewq i686)
=================================================================
System uname: 2.6.11-gentoo-r8ewq i686 Pentium III (Coppermine)
Gentoo Base System version 1.4.16
Python:              dev-lang/python-2.3.5 [2.3.5 (#1, Apr 29 2005, 10:26:12)]
distcc 2.16 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
dev-lang/python:     2.3.5
sys-apps/sandbox:    [Not Present]
sys-devel/autoconf:  2.59-r6, 2.13
sys-devel/automake:  1.7.9-r1, 1.8.5-r3, 1.5, 1.4_p6, 1.6.3, 1.9.5
sys-devel/binutils:  2.15.92.0.2-r7
sys-devel/libtool:   1.5.16
virtual/os-headers:  2.6.8.1-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-O2 -march=pentium3 -fomit-frame-pointer -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/bin:/usr/bin:/var/bind /etc /usr/kde/2/share/config /usr/kde/3.3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3/share/config /usr/lib/X11/xkb /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/bin:/usr/bin:/etc/gconf /etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O2 -march=pentium3 -fomit-frame-pointer -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig ccache distcc distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/Linux/distributions/gentoo"
LANG="en_US.utf8"
LC_ALL="en_US.utf8"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 X acpi adns alsa apm arts avi berkdb bitmap-fonts blas bzlib cdparanoia 
cdr crypt cups curl divx4linux dts dvd dvdr dvdread emboss encode esd fam 
foomaticdb fortran gdbm gif gnome gpm gtk gtk2 imlib ipv6 jpeg kde lapack 
libcaca libg++ libwww mad mhash mikmod mmap mmx motif mp3 mpeg ncurses nls 
nvidia offensive ogg oggvorbis opengl oss pam pdflib perl png python qt 
quicktime readline real rtc sdl spell sse ssl svga tcltk theora tiff truetype 
truetype-fonts type1-fonts unicode usb vorbis xanim xml xml2 xmms xpm xv xvid 
xvmc zlib userland_GNU kernel_linux elibc_glibc"
Unset:  ASFLAGS, CBUILD, CTARGET, LDFLAGS, LINGUAS, PORTDIR_OVERLAY

Comment 1 Canal Vorfeed 2005-05-21 14:45:21 UTC

More info (and fix):

http://savannah.gnu.org/patch/?func=detailitem&item_id=3934

Comment 2 Canal Vorfeed 2005-05-21 15:06:28 UTC

Oops. Wrong window.

Correct link is

http://savannah.gnu.org/patch/?func=detailitem&item_id=3803

Comment 3 Canal Vorfeed 2005-05-21 15:09:48 UTC

Created attachment 59486 [details, diff]
Patch #1: false positives in fgrep

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=116909

Comment 4 Canal Vorfeed 2005-05-21 15:14:19 UTC

Created attachment 59488 [details, diff]
Patch #2: Improper reset of counter (fixed in upstream CVS)

Comment 5 Canal Vorfeed 2005-05-21 15:16:19 UTC

Created attachment 59490 [details, diff]
Patch #3: MBS_SUPPORT fixes

Improper handling of multibyte encodings

Comment 6 Canal Vorfeed 2005-05-21 15:18:35 UTC

Created attachment 59492 [details, diff]
Patch #4: ignore case is not always honoured

Comment 7 Canal Vorfeed 2005-05-21 15:20:03 UTC

Created attachment 59493 [details, diff]
Patch #5: Handle UTF-8 as special case

Comment 8 Canal Vorfeed 2005-05-21 15:22:38 UTC

Created attachment 59494 [details, diff]
Patch #6: more UTF-8 optimizations

Comment 9 Canal Vorfeed 2005-05-21 15:25:13 UTC

The real changes are in patches #5 and #6. The rest are just fixes from official
CVS that #5 and #6 need.

All patches can be applied against grep 2.5.1 or grep 2.5.1a. They can be found in
Fedora, for example.

Comment 10 Ævar Arnfjörð Bjarmason 2005-12-10 13:35:15 UTC

These two urls are relevant:

* https://bugzilla.redhat.com/bugzilla/long_list.cgi?buglist=69900
* https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121313

Comment 11 Ævar Arnfjörð Bjarmason 2005-12-10 13:48:22 UTC

Created attachment 74453 [details]
Testcase that includes some multibyte UTF-8 characters (2^16-(2^8+1))

Testcase that includes some multibyte UTF-8 characters (2^16-(2^8+1)),
generated with:

$ perl -M"encoding 'utf8'" -le 'print 1,chr for 2**8+1..2**16' > out
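
For readers without Perl at hand, here is a rough Python sketch of the same generator. The function name `make_testcase` is made up for this illustration, and the use of the `surrogatepass` error handler is an assumption about how the Perl one-liner treats the surrogate range U+D800..U+DFFF, so the bytes may not be identical to the attached file. Note that the range is inclusive at both ends, so this produces 65280 lines.

```python
def make_testcase() -> bytes:
    """Build lines of the form '1' + <one character> + newline, encoded as UTF-8.

    Covers code points 2**8 + 1 through 2**16 inclusive, mirroring the
    Perl range 2**8+1..2**16.
    """
    lines = []
    for cp in range(2**8 + 1, 2**16 + 1):
        # 'surrogatepass' lets U+D800..U+DFFF be encoded instead of raising;
        # this is an assumption about what the Perl one-liner emits.
        encoded = chr(cp).encode("utf-8", "surrogatepass")
        lines.append(b"1" + encoded + b"\n")
    return b"".join(lines)
```

Writing the result to disk with `open("out", "wb").write(make_testcase())` should give a file usable with the grep invocation in the next comment.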

Comment 12 Ævar Arnfjörð Bjarmason 2005-12-10 13:58:46 UTC

Created attachment 74455 [details]
gprof output of grep being run on attachment 74453 [details]

gprof(1) output of grep(1) being run on attachment 74453 [details], around 99% of
execution time is being spent on check_multibyte_string() which gets called
65290 times.

$ time LC_ALL=en_US.utf8 grep -v ^1 out

real	4m24.222s
user	3m22.043s
sys	0m1.079s

Comment 13 Ævar Arnfjörð Bjarmason 2005-12-10 15:01:08 UTC

(In reply to comment #8)
> Created an attachment (id=59494) [edit]
> Patch #6: more UTF-8 optimizations

This patch fails to apply to a clean 2.5.1 tree. All the other patches fail
partly or totally on a patched Gentoo tree of the same package.

Comment 14 erik quanstrom 2005-12-10 18:21:22 UTC

i tried these patches at the time and they did work for me.
the result was speedy.

however i see that i neglected to note the exact version of grep
that i was working with.

actually, i wonder if more character sets could be handled without
mbtowc conversion. basically write a bytewise engine that only does
mbtowc conversion for the "." operator and character classes. any self-syncing
character set (i.e. one where single-byte sequences cannot be part of a multibyte
character) should be doable in this way.

but, hey, there's probably something i'm missing.
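
The idea above can be sketched in a few lines of Python. This is only an illustration, not grep's code: the names `utf8_char_len`, `match_at` and `search` are hypothetical, the pattern language is limited to literals and ".", the input is assumed to be valid UTF-8, and line handling is ignored. Literals are compared byte for byte; the only place the matcher looks at character width is the "." operator.

```python
def utf8_char_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def match_at(pattern: str, data: bytes, pos: int) -> bool:
    """Match a pattern of literals and '.' against data at byte offset pos."""
    for ch in pattern:
        if pos >= len(data):
            return False
        if ch == ".":
            # The only step that needs to know how wide a character is.
            pos += utf8_char_len(data[pos])
        else:
            literal = ch.encode("utf-8")
            # Plain byte comparison, no decoding of the input.
            if not data.startswith(literal, pos):
                return False
            pos += len(literal)
    return True


def search(pattern: str, data: bytes) -> bool:
    """True if pattern matches starting at any character boundary in data."""
    return any(
        match_at(pattern, data, i)
        for i in range(len(data))
        if (data[i] & 0xC0) != 0x80  # skip continuation bytes
    )
```

For example, `search("caf.", "naïve café".encode("utf-8"))` is true because "." consumes the two bytes of "é", while `search("e", "é".encode("utf-8"))` is false because no byte of "é" equals the ASCII byte for "e".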

Comment 15 Petteri Räty (RETIRED) gentoo-dev 2006-04-01 08:43:03 UTC

This seems more like it belongs to base-system as they are in charge of sys-apps/grep.

Comment 16 Alexey Spiridonov 2006-04-01 11:57:16 UTC

Created attachment 83645 [details, diff]
Take 2, Patch 1: Special-case UTF-8, massive speed-up (60x on my testcase) 

I'd hit this bug a little while ago, and came up with a fix that applies against the 2.5.1a ebuild in the tree. I stole the 3 Red Hat Enterprise patches that weren't already in Portage, from this SRPM:

https://rhn.redhat.com/errata/RHBA-2005-565.html
http://rpmfind.net//linux/RPM/redhat/enterprise/updates/3WS/grep-2.5.1-24.5.src.html

I'm attaching 3 patches, and my version of the ebuild. The first one gives the bulk of the performance improvements. 

The latter two are optional, as far as I can tell. Patch 2 gives another 25% speed-up on my test-case. Patch 3 is supposed to fix a bug with the '-w' option, but my quick attempt to reproduce it didn't work. 

My test-case is as follows:

lesha@sheepiness ~ $ ls -Ral /var /dev /lib &> STUFF
^C (after a few seconds)

$ ls -l STUFF
-rw-r--r--  1 lesha users 2510848 Apr  1 13:46 STUFF

$ export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG

real    0m12.451s
user    0m12.019s
sys     0m0.021s

$ time grep '1$' STUFF > /dev/null 

real    0m0.045s
user    0m0.041s
sys     0m0.002s


After all three patches are applied:
export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG

real    0m0.257s
user    0m0.227s
sys     0m0.008s
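
As a quick sanity check on the numbers in this comment, the ratios can be worked out from the wall-clock ("real") times. The variable names below are made up for this illustration. By these figures the UTF-8 run is roughly 277x slower than the run without LANG before patching and roughly 5.7x slower after, which is about a 48x improvement for this particular pair of runs.

```python
# Wall-clock ("real") times in seconds, copied from the runs above.
utf8_unpatched = 12.451  # LANG=en_US.UTF-8, before the patches
utf8_patched = 0.257     # LANG=en_US.UTF-8, after all three patches
no_lang = 0.045          # LANG not exported

slowdown_before = utf8_unpatched / no_lang
slowdown_after = utf8_patched / no_lang
speedup = utf8_unpatched / utf8_patched

print(f"UTF-8 penalty before patches: {slowdown_before:.0f}x")
print(f"UTF-8 penalty after patches:  {slowdown_after:.1f}x")
print(f"speed-up from the patches:    {speedup:.0f}x")
```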

Comment 17 Alexey Spiridonov 2006-04-01 11:58:21 UTC

Created attachment 83646 [details, diff]
Take 2, Patch 2: Disable DFA by default in multibyte locales (25% speedup in my test case)

Comment 18 Alexey Spiridonov 2006-04-01 12:01:15 UTC

Created attachment 83647 [details, diff]
Take 2, Patch 3: Fixes for a bug involving the -w option. 

I haven't encountered this bug, but here's what the commit log says:
  Fixed -w handling for EGexecute. Now 'make check' passes.
For more information, go here:
  http://savannah.gnu.org/patch/?func=detailitem&item_id=3809

Comment 19 Alexey Spiridonov 2006-04-01 12:03:26 UTC

Created attachment 83648 [details]
Take 2, Ebuild: Current grep-2.5.1a.ebuild, with the patches inserted.

The patches are applied in the same order: 1, 2, 3.

Comment 20 Alexey Spiridonov 2006-04-01 12:06:04 UTC

CCing Mike Frysinger, because he made the last several non-trivial changes to the ebuild, and so might actually do something instead of shifting the responsibility to someone else :)

Comment 21 Alexey Spiridonov 2006-04-01 12:15:51 UTC

Oops, I neglected to mention that the ebuild I attached is marked ~x86 stable. That may not be the right thing to check in...

Comment 22 erik quanstrom 2006-04-01 12:27:29 UTC

while these patches are an improvement, there is something
wrong with the approach as the utf-8 case is still an /order
of magnitude slower than the ascii case/.

there is no reason for this. utf-8 requires no special handling
for the search pattern you're using. only character classes and "."
need to know anything about the width of a utf-8 character.

- erik

(In reply to comment #16)

> $ time grep '1$' STUFF > /dev/null 
> 
> real    0m0.045s
> user    0m0.041s
> sys     0m0.002s
> 
> 
> After all three patches are applied:
> export LANG=en_US.UTF-8; time grep '1$' STUFF > /dev/null; export -n LANG
> 
> real    0m0.257s
> user    0m0.227s
> sys     0m0.008s
>

Comment 23 Alexey Spiridonov 2006-04-02 13:14:04 UTC

(In reply to comment #22)

Erik, are you suggesting that the maintainers hold off incorporating a fix until there's a "correct" method available? 

One possible reason for the remaining 5x slowdown is that in utf-8 mode grep needs to know where the character boundaries are. So, it needs to at least do utf-8 decoding. 

Anyway, my feeling is that it's better to have the 5x penalty (at the expense of fewer people being motivated to fix it), than to have a 300x penalty. So, I'm for incorporation.

Comment 24 erik quanstrom 2006-04-02 18:22:50 UTC

accept my apologies for the misdirected rant.  i'm a little frustrated with gnu grep.  your patches are good and well considered, and clearly this is a lot better than nothing.

however, my frustration lies in the fact that one doesn't need to know where the character boundaries are unless matching a single /unknown/ character as in "." or a negative character class.  your test case matched a single known letter. i think you wanted "^l". it is not possible for "l" to match anything but l in utf-8, because every byte of a multi-byte sequence has its bucky (high) bit set. likewise, a character that is encoded as >1 byte will only match that character.
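
The high-bit property described above is easy to check exhaustively. The sketch below is an illustration only (the function name `ascii_never_inside_multibyte` is hypothetical): it encodes every non-ASCII code point Python can encode and confirms that none of the resulting bytes falls in the ASCII range, which is why a bytewise search for an ASCII letter cannot land inside a multi-byte character.

```python
def ascii_never_inside_multibyte() -> bool:
    """Check that no byte of any multi-byte UTF-8 sequence is below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid scalar values, skip them
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True


# A raw byte search for ASCII "l" finds nothing in text that has no "l",
# even though the text is full of multi-byte characters.
sample = "łódź 日本語".encode("utf-8")
found_l = b"l" in sample
```

Here `ascii_never_inside_multibyte()` returns `True` and `found_l` is `False`.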

thanks for the good work.  

- erik

(In reply to comment #23)
> (In reply to comment #22)
> 
> Erik, are you suggesting that the maintainers hold off incorporating a fix
> until there's a "correct" method available? 
> 
> One possible reason for the remaining 5x slowdown is that in utf-8 mode grep
> needs to know where the character boundaries are. So, it needs to at least do
> utf-8 decoding. 
> 
> Anyway, my feeling is that it's better to have the 5x penalty (at the expense
> of fewer people being motivated to fix it), than to have a 300x penalty. So,
> I'm for incorporation.
>

Comment 25 SpanKY gentoo-dev 2006-11-05 01:04:19 UTC

upstream is actively working on this

Comment 26 SpanKY gentoo-dev 2006-12-26 13:45:22 UTC

*** Bug 159138 has been marked as a duplicate of this bug. ***