Bug 208051 - sys-apps/sed broken when compiled and ran with LANG=UTF-8
Summary: sys-apps/sed broken when compiled and ran with LANG=UTF-8
Status: VERIFIED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High normal
Assignee: Gentoo's Team for Core System packages
Reported: 2008-01-29 14:24 UTC by marty rosenberg
Modified: 2008-01-30 03:06 UTC
Description marty rosenberg 2008-01-29 14:24:33 UTC
When sed is compiled with LANG set to some sort of UTF-8 value (i.e. not C or en) and is run with a similar setting, sed loses the ability to distinguish between most uppercase and lowercase characters in character classes.

Reproducible: Always

Steps to Reproduce:
1. LANG="en_US-UTF-8" emerge sed
2. LANG="en_US-UTF-8" echo qwerasdfzxcv | sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'

Actual Results:  
sed outputs:
"qwer"a"sdfzxcv"

Expected Results:  
sed outputs:
qwerasdfzxcv

It seems that if LANG isn't UTF-8 at either compile time or run time, sed works correctly. Also, note that sed correctly classifies the character a as a lowercase character.
Comment 1 kfm 2008-01-29 14:50:34 UTC
Some clarifications ... Firstly, this appears to be about which locale is active when sed is run (rather than built). Secondly, the test case is invalid. Try it in this manner:

1) echo qwerasdfzxcv | LANG="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
2) echo qwerasdfzxcv | LANG="en_GB" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
3) echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'

In my case, (3) results in the seemingly spurious output: "qwer"a"sdfzxcv"

I tried the same thing on Debian Etch (which supports UTF) and the result in all 3 cases is: qwerasdfzxcv

I confess to being somewhat baffled.
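
For anyone wondering why the original test case is invalid: a variable assignment placed before a command applies only to that one command, so in the reporter's step 2 it was echo, not sed, that received the LANG value. A minimal sketch of the scoping rule follows, using DEMO_VAR, a made-up variable name chosen only for illustration:

```shell
#!/bin/sh
# DEMO_VAR is a hypothetical variable used only for this illustration.
unset DEMO_VAR

# The assignment prefixes the command on the left of the pipe,
# so the command on the right never sees it.
left=$(DEMO_VAR=set true | sh -c 'echo "${DEMO_VAR:-unset}"')

# The assignment prefixes the command on the right of the pipe,
# so that command does see it.
right=$(true | DEMO_VAR=set sh -c 'echo "${DEMO_VAR:-unset}"')

echo "prefix on left of pipe:  $left"    # prints: unset
echo "prefix on right of pipe: $right"   # prints: set
```

This is why the three commands above put the LANG assignment directly in front of sed.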
Comment 2 Arfrever Frehtes Taifersar Arahesis (RETIRED) gentoo-dev 2008-01-29 15:26:29 UTC
Use [[:upper:]] instead of [A-Z].
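
A short sketch of that suggestion, for reference. The helper name quote_upper and the mixed-case input qwerASDFzxcv are assumptions added here for illustration; they are not from the original report.

```shell
#!/bin/sh
# quote_upper is a hypothetical helper name: it wraps each run of
# uppercase letters in double quotes using the POSIX class [[:upper:]],
# which does not depend on the collation order of the locale.
quote_upper() {
    sed -e 's/\([[:upper:]][[:upper:]]*\)/"\1"/g'
}

all_lower=$(printf '%s\n' qwerasdfzxcv | quote_upper)
mixed=$(printf '%s\n' qwerASDFzxcv | quote_upper)

echo "$all_lower"   # prints: qwerasdfzxcv
echo "$mixed"       # prints: qwer"ASDF"zxcv
```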
Comment 3 kfm 2008-01-29 16:34:15 UTC
Yes, '[[:upper:]]' does produce the expected results. But there's more to this.

I did a side-by-side comparison of Gentoo and Debian Etch using a similar test case (provided by kojiro) for all locales:

for l in $(locale -a); do echo testing LC_ALL=$l; LC_ALL=$l <<< qwerasdfzxcv sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'; done

... and the results were:

Gentoo:

  testing LC_ALL=C
  qwerasdfzxcv
  testing LC_ALL=en_GB.utf8
  "qwer"a"sdfzxcv"
  testing LC_ALL=en_US.utf8
  "qwer"a"sdfzxcv"
  testing LC_ALL=POSIX
  qwerasdfzxcv

Debian:

  testing LC_ALL=C
  qwerasdfzxcv
  testing LC_ALL=en_GB
  qwerasdfzxcv
  testing LC_ALL=en_GB.iso88591
  qwerasdfzxcv
  testing LC_ALL=en_GB.iso885915
  qwerasdfzxcv
  testing LC_ALL=en_GB.utf8
  qwerasdfzxcv
  testing LC_ALL=POSIX
  qwerasdfzxcv

Marty then put his finger on it. He suggested that it may be as a result of the collation order (controlled independently by LC_COLLATE if so desired). So ... more tests:

Gentoo
------
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: "qwer"a"sdfzxcv"

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Debian
------
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv
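
The collation override used in the tests above could be wrapped in a small helper along these lines. This is only a sketch: sed_c is a made-up name, and it sets LC_ALL rather than LC_COLLATE because LC_ALL, when set in the environment, takes precedence over the individual LC_* variables. Note that LC_ALL=C also switches character classification to the C locale, which may not be wanted for multibyte input.

```shell
#!/bin/sh
# sed_c is a hypothetical helper name: run sed with C (byte-order)
# collation so that a range such as [A-Z] covers only A through Z.
# Assumption: forcing the whole locale to C is acceptable for the input.
sed_c() {
    LC_ALL=C sed "$@"
}

forced=$(printf '%s\n' qwerasdfzxcv | sed_c -e 's/\([A-Z][A-Z]*\)/"\1"/g')
echo "$forced"   # prints: qwerasdfzxcv
```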

Given a collation order of "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ" in the UTF locales, it explains Gentoo's results but not the fact that it appears to differ from other distributions in honouring the collation order.
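
To make the collation explanation concrete, here is a rough simulation. It does not use any UTF-8 locale. Instead it takes the collation string quoted above, keeps everything from "A" through "Z" in that order, and uses the result as an explicit bracket expression under the C locale. The variable names are illustrative only.

```shell
#!/bin/sh
# Collation order quoted in the comment above.
order="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

# Keep the characters that sort from "A" through "Z" inclusive,
# which is what a collation-based [A-Z] range would cover.
range="A${order#*A}"   # drop everything up to the first "A", then put "A" back
range="${range%Z*}Z"   # drop everything from the last "Z", then put "Z" back

# Every letter except lowercase "a" ends up in the range.
echo "$range"   # prints: AbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

# Use the expanded set explicitly, with the locale pinned to C.
result=$(printf '%s\n' qwerasdfzxcv |
    LC_ALL=C sed -e "s/\([$range][$range]*\)/\"\1\"/g")
echo "$result"  # prints: "qwer"a"sdfzxcv"
```

The output matches what Gentoo's sed produced in the UTF-8 locales, which is consistent with the range being expanded by collation order rather than by byte value.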

So, is this NOTABUG? In my opinion, it is a bug, and Debian's behaviour is sensible. There are situations where the locale (the collation order in this case) should alter the outcome, but I'm not sure that changing the meaning of, say, '[A-Z]' in a regular expression is one of them. I'm curious to hear the opinions of anyone else on this matter.

There doesn't seem to be any concrete information on how LC_COLLATE should, if at all, affect sed either. In BSD systems, the man page says "The COLUMNS, LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of sed as described in environ(7)" but offers no specific information.
Comment 4 Jakub Moc (RETIRED) gentoo-dev 2008-01-29 20:51:56 UTC
Please, read the documentation before filing bugs. 
Comment 5 SpanKY gentoo-dev 2008-01-30 00:26:10 UTC
try sed-4.1.5-r1 ... it should give you consistent regex behavior with other distros
Comment 6 Bo Ørsted Andresen (RETIRED) gentoo-dev 2008-01-30 01:11:22 UTC
Reopening to reassign to base-system.
Comment 7 Jorge Manuel B. S. Vicetto (RETIRED) gentoo-dev 2008-01-30 01:14:59 UTC
Adding userrel to the cc list.
Comment 8 michael@smith-li.com 2008-01-30 02:52:27 UTC
Yes indeed, the behavior appears to make much more sense with 4.1.5-r1.

Shouldn't we add the inSVN keyword?
Comment 9 SpanKY gentoo-dev 2008-01-30 02:58:41 UTC
no