When sed is compiled with LANG set to some sort of UTF-8 value (i.e. not C, or en) and is run with a similar environment variable, sed loses the ability to distinguish between most upper case and lower case characters in character classes.

Reproducible: Always

Steps to Reproduce:
1. LANG="en_US-UTF-8" emerge sed
2. LANG="en_US-UTF-8" echo qwerasdfzxcv | sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'

Actual Results:
sed outputs: "qwer"a"sdfzxcv"

Expected Results:
sed outputs: qwerasdfzxcv

It seems that if LANG isn't UTF-8 at either compile time or run time, then sed works correctly. Also, note that sed correctly classifies the character a as a lowercase character.
Some clarifications ...

Firstly, this appears to be about which locale is active when sed is run (rather than built). Secondly, the test case is invalid. Try it in this manner:

1) echo qwerasdfzxcv | LANG="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
2) echo qwerasdfzxcv | LANG="en_GB" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
3) echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'

In my case, (3) results in the seemingly spurious output:

"qwer"a"sdfzxcv"

I tried the same thing on Debian Etch (which supports UTF) and the result in all 3 cases is:

qwerasdfzxcv

I confess to being somewhat baffled.
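The reworked commands above move the LANG assignment from echo to sed, presumably because a variable assignment prefixed to a command only reaches that one command in the pipeline. A minimal sketch of that shell rule follows. DEMO_VAR is a hypothetical variable used only for illustration, so that the demonstration does not depend on which locales are installed.

```shell
# DEMO_VAR is a made-up name for this sketch; it stands in for LANG.
unset DEMO_VAR

# Assignment on the left-hand command: the right-hand command never sees it.
left=$(DEMO_VAR=set-on-left true | sh -c 'echo "${DEMO_VAR:-unset}"')

# Assignment directly in front of the right-hand command: it is visible there.
right=$(true | DEMO_VAR=set-on-right sh -c 'echo "${DEMO_VAR:-unset}"')

echo "left:  $left"    # prints "left:  unset"
echo "right: $right"   # prints "right: set-on-right"
```

By the same rule, LANG="..." echo ... | sed ... changes the locale of echo only, and sed runs with whatever locale the calling shell already had.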
Use [[:upper:]] instead of [A-Z].
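A minimal sketch of that suggestion, assuming a POSIX shell and sed. The function name quote_upper and its locale argument are hypothetical, introduced only for this example. The C locale is used in the calls because it is always available.

```shell
# quote_upper LOCALE: wrap each run of upper-case letters read from stdin
# in double quotes, running sed under the given locale.
# (Hypothetical helper for illustration, not part of sed.)
quote_upper() {
    LC_ALL="$1" sed -e 's/\([[:upper:]][[:upper:]]*\)/"\1"/g'
}

echo qwerasdfzxcv | quote_upper C    # prints qwerasdfzxcv
echo qwerASDFzxcv | quote_upper C    # prints qwer"ASDF"zxcv
```

The class [[:upper:]] is defined by the character classification of the locale (LC_CTYPE) rather than by its collation order, so it should not pick up lower-case letters the way a range such as [A-Z] can. To try it under another locale, pass a name listed by locale -a as the argument.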
Yes, '[[:upper:]]' does produce the expected results. But there's more to this. I did a side-by-side comparison of Gentoo and Debian Etch using a similar test case (provided by kojiro) for all locales:

for l in $(locale -a); do echo testing LC_ALL=$l; LC_ALL=$l <<< qwerasdfzxcv sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'; done

... and the results were:

Gentoo:
testing LC_ALL=C
qwerasdfzxcv
testing LC_ALL=en_GB.utf8
"qwer"a"sdfzxcv"
testing LC_ALL=en_US.utf8
"qwer"a"sdfzxcv"
testing LC_ALL=POSIX
qwerasdfzxcv

Debian:
testing LC_ALL=C
qwerasdfzxcv
testing LC_ALL=en_GB
qwerasdfzxcv
testing LC_ALL=en_GB.iso88591
qwerasdfzxcv
testing LC_ALL=en_GB.iso885915
qwerasdfzxcv
testing LC_ALL=en_GB.utf8
qwerasdfzxcv
testing LC_ALL=POSIX
qwerasdfzxcv

Marty then put his finger on it. He suggested that it may be a result of the collation order (controlled independently by LC_COLLATE if so desired). So ... more tests:

Gentoo
------
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: "qwer"a"sdfzxcv"
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Debian
------
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Given a collation order of "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ" in the UTF locales, the range A-Z covers every letter except the lowercase 'a', which sorts before 'A'. That explains Gentoo's results, but not why Gentoo appears to differ from other distributions in honouring the collation order.

So, is this NOTABUG? In my opinion it is a bug, and I think Debian's behaviour is sensible. There are situations where the locale - the collation order in this case - should alter the outcome, but I'm not sure that changing the meaning of, say, '[A-Z]' in a regular expression is one of them. I'm curious to hear the opinions of anyone else on this matter.
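To make the collation explanation concrete, here is a rough sketch that simulates the range A-Z under the interleaved order quoted above, without needing a UTF-8 locale installed. It assumes that order is exactly what the locale uses for these letters. The names order, range_members, members, excluded and simulated are hypothetical, chosen for this example. It illustrates the reasoning only and is not how sed or the C library evaluates ranges internally.

```shell
# Collation order as quoted in the comment above.
order="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

# range_members START END: print every character of $order from START
# through END inclusive, i.e. what [START-END] would cover in that order.
range_members() {
    rest=${order#*"$1"}                        # text after START
    printf '%s%s%s\n' "$1" "${rest%%"$2"*}" "$2"
}

members=$(range_members A Z)

# Letters of the alphabet that fall outside the simulated range.
excluded=$(printf '%s\n' "$order" | LC_ALL=C tr -d "$members")

# Spell the range out as an explicit list and run sed in the C locale.
simulated=$(echo qwerasdfzxcv |
    LC_ALL=C sed -e "s/\([$members][$members]*\)/\"\1\"/g")

echo "members:   $members"     # AbBcC...zZ, every letter except a
echo "excluded:  $excluded"    # prints "excluded:  a"
echo "simulated: $simulated"   # prints: simulated: "qwer"a"sdfzxcv"
```

The simulated output matches the Gentoo result, which is consistent with [A-Z] being interpreted by collation order there.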
There doesn't seem to be any concrete information on how LC_COLLATE should affect sed, if at all. On BSD systems, the man page says "The COLUMNS, LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of sed as described in environ(7)" but offers no specific information.
Please, read the documentation before filing bugs.
try sed-4.1.5-r1 ... it should give you consistent regex behavior with other distros
Reopening to reassign to base-system.
Adding userrel to the cc list.
Yes indeed, the behavior appears to make much more sense with 4.1.5-r1. Shouldn't we add the inSVN keyword?
no