Bug 208051

Summary:	sys-apps/sed broken when compiled and run with LANG=UTF-8
Product:	Gentoo Linux	Reporter:	marty rosenberg <marty.rosenberg>
Component:	[OLD] Core system	Assignee:	Gentoo's Team for Core System packages <base-system>
Status:	VERIFIED FIXED
Severity:	normal	CC:	comrel, kfm, michael, rane
Priority:	High
Version:	unspecified
Hardware:	All
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---

Description marty rosenberg 2008-01-29 14:24:33 UTC

When sed is compiled with LANG set to some sort of UTF-8 value (i.e. not C or en) and is run with a similar environment variable, sed loses the ability to distinguish between most uppercase and lowercase characters in character classes such as [A-Z].

Reproducible: Always

Steps to Reproduce:
1.LANG="en_US-UTF-8" emerge sed
2.LANG="en_US-UTF-8" echo qwerasdfzxcv | sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'

Actual Results:  
sed outputs:
"qwer"a"sdfzxcv"

Expected Results:  
sed outputs:
qwerasdfzxcv

It seems that if LANG isn't UTF-8 at either compile time or run time, then sed works correctly. Also, note that sed correctly classifies the character a as a lowercase character.

Comment 1 kfm 2008-01-29 14:50:34 UTC

Some clarifications ... Firstly, this appears to be about which locale is active when sed is run (rather than built). Secondly, the test case is invalid. Try it in this manner:

1) echo qwerasdfzxcv | LANG="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
2) echo qwerasdfzxcv | LANG="en_GB" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
3) echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'

In my case, (3) results in the seemingly spurious output: "qwer"a"sdfzxcv"

I tried the same thing on Debian Etch (which supports UTF-8) and the result in all 3 cases is: qwerasdfzxcv

I confess to being somewhat baffled.
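
A note on why the original test case was invalid: a variable assignment placed in front of a pipeline applies only to the first command, so in the report's step 2 it was echo, not sed, that saw the LANG value. The sketch below shows the same effect with a throwaway variable; DEMO_VAR is a hypothetical name used only for this illustration.

```shell
# DEMO_VAR is a hypothetical variable, not something sed or the locale system reads.
unset DEMO_VAR

# Assignment prefixes the FIRST command of the pipeline: only `true` sees it.
first=$(DEMO_VAR=set true | sh -c 'echo "${DEMO_VAR:-unset}"')

# Assignment prefixes the LAST command: the reader (sed, in the report) sees it.
second=$(true | DEMO_VAR=set sh -c 'echo "${DEMO_VAR:-unset}"')

echo "prefix on first command: $first"
echo "prefix on last command:  $second"
```

This is why the corrected test cases put LANG directly in front of sed.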

Comment 2 Arfrever Frehtes Taifersar Arahesis (RETIRED) gentoo-dev 2008-01-29 15:26:29 UTC

Use [[:upper:]] instead of [A-Z].
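
As a rough illustration of that suggestion, the sketch below runs the same substitution with [[:upper:]] in place of [A-Z]. The mixed-case input is made up for the example, and LC_ALL=C is pinned only so the result is reproducible; the class is defined by character type rather than by collation order, so it should behave the same way in the UTF-8 locales, though that is not verified here.

```shell
# Sample inputs: the second one (with an uppercase run) is invented for illustration.
lower_only=$(printf '%s\n' qwerasdfzxcv |
    LC_ALL=C sed -e 's/\([[:upper:]][[:upper:]]*\)/"\1"/g')
mixed=$(printf '%s\n' qwerASDFzxcv |
    LC_ALL=C sed -e 's/\([[:upper:]][[:upper:]]*\)/"\1"/g')

echo "$lower_only"   # nothing is uppercase, so nothing is quoted
echo "$mixed"        # only the uppercase run is quoted
```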

Comment 3 kfm 2008-01-29 16:34:15 UTC

Yes, '[[:upper:]]' does produce the expected results. But there's more to this.

I did a side-by-side comparison of Gentoo and Debian Etch using a similar test case (provided by kojiro) for all locales:

for l in $(locale -a); do echo testing LC_ALL=$l; LC_ALL=$l <<< qwerasdfzxcv sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'; done

... and the results were:

Gentoo:

  testing LC_ALL=C
  qwerasdfzxcv
  testing LC_ALL=en_GB.utf8
  "qwer"a"sdfzxcv"
  testing LC_ALL=en_US.utf8
  "qwer"a"sdfzxcv"
  testing LC_ALL=POSIX
  qwerasdfzxcv

Debian:

  testing LC_ALL=C
  qwerasdfzxcv
  testing LC_ALL=en_GB
  qwerasdfzxcv
  testing LC_ALL=en_GB.iso88591
  qwerasdfzxcv
  testing LC_ALL=en_GB.iso885915
  qwerasdfzxcv
  testing LC_ALL=en_GB.utf8
  qwerasdfzxcv
  testing LC_ALL=POSIX
  qwerasdfzxcv

Marty then put his finger on it. He suggested that it may be a result of the collation order (which can be controlled independently via LC_COLLATE). So ... more tests:

Gentoo
------
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: "qwer"a"sdfzxcv"

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Debian
------
Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv
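
The LC_COLLATE override in these tests relies on the usual POSIX precedence for locale variables: LC_ALL, if set and non-empty, wins; otherwise the category variable (here LC_COLLATE) applies; otherwise LANG. A minimal sketch of that lookup follows. effective_collate is a hypothetical helper written for this note, not part of sed or libc, and the final fallback to C is an assumption (the default is implementation-defined).

```shell
# Hypothetical helper: print the locale that governs collation, given the
# values of LC_ALL, LC_COLLATE and LANG (an empty string stands for "unset").
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"    # LC_ALL overrides every category
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"    # then the category-specific variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"    # then LANG
    else
        printf '%s\n' C       # assumed fallback; implementation-defined
    fi
}

# The two tests above, expressed as (LC_ALL, LC_COLLATE, LANG):
effective_collate "" ""  "en_GB.UTF-8"    # prints en_GB.UTF-8
effective_collate "" "C" "en_GB.UTF-8"    # prints C
```

Note that an LC_ALL already present in the environment would mask the LC_COLLATE setting, so it needs to be unset for these tests to mean anything.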

Given a collation order of "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ" in the UTF-8 locales, the range A-Z covers everything that sorts between A and Z, which includes the lowercase letters b to z but not a. That explains Gentoo's results, but not why Gentoo appears to differ from other distributions in honouring the collation order.
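
To make the effect of that collation order concrete, here is a small simulation in plain shell. It treats [A-Z] as "any character that sorts between A and Z in the quoted order" and marks the characters of the test input accordingly. It is only a sketch of the idea, not how sed or glibc actually evaluate ranges; pos and in_range are hypothetical helpers.

```shell
# Collation order as quoted in this comment.
order="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

# Hypothetical helper: print the 0-based position of a character in $order.
pos() {
    prefix=${order%%"$1"*}
    printf '%s\n' "${#prefix}"
}

# Hypothetical helper: succeed if $1 sorts between A and Z inclusive.
in_range() {
    [ "$(pos "$1")" -ge "$(pos A)" ] && [ "$(pos "$1")" -le "$(pos Z)" ]
}

# Walk the test input; keep characters inside the range, print "." otherwise.
matched=""
for c in q w e r a s d f z x c v; do
    if in_range "$c"; then
        matched="$matched$c"
    else
        matched="$matched."
    fi
done
printf '%s\n' "$matched"   # prints qwer.sdfzxcv
```

Only a falls outside the range, which matches the two quoted runs on either side of it in "qwer"a"sdfzxcv".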

So, is this NOTABUG? In my opinion it is a bug, and Debian's behaviour is the sensible one. There are situations where the locale (the collation order in this case) should alter the outcome, but I'm not sure that changing the meaning of, say, '[A-Z]' in a regular expression is one of them. I'm curious to hear the opinions of anyone else on this matter.

There doesn't seem to be any concrete information on how LC_COLLATE should affect sed, if at all. On BSD systems, the man page says "The COLUMNS, LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of sed as described in environ(7)" but offers no specific information.

Comment 4 Jakub Moc (RETIRED) gentoo-dev 2008-01-29 20:51:56 UTC

Please, read the documentation before filing bugs.

Comment 5 SpanKY gentoo-dev 2008-01-30 00:26:10 UTC

try sed-4.1.5-r1 ... it should give you consistent regex behavior with other distros

Comment 6 Bo Ørsted Andresen (RETIRED) gentoo-dev 2008-01-30 01:11:22 UTC

Reopening to reassign to base-system.

Comment 7 Jorge Manuel B. S. Vicetto (RETIRED) Gentoo Infrastructure 2008-01-30 01:14:59 UTC

Adding userrel to the cc list.

Comment 8 michael@smith-li.com 2008-01-30 02:52:27 UTC

Yes indeed, the behavior appears to make much more sense with 4.1.5-r1.

Shouldn't we add the inSVN keyword?

Comment 9 SpanKY gentoo-dev 2008-01-30 02:58:41 UTC

no