Summary: | sys-apps/sed broken when compiled and ran with LANG=UTF-8 | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | marty rosenberg <marty.rosenberg> |
Component: | [OLD] Core system | Assignee: | Gentoo's Team for Core System packages <base-system> |
Status: | VERIFIED FIXED | ||
Severity: | normal | CC: | comrel, kfm, michael, rane |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | | Runtime testing required: | --- |
Description
marty rosenberg
2008-01-29 14:24:33 UTC
Some clarifications ... Firstly, this appears to be about which locale is active when sed is run (rather than built). Secondly, the test case is invalid. Try it in this manner:

    1) echo qwerasdfzxcv | LANG="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
    2) echo qwerasdfzxcv | LANG="en_GB" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
    3) echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'

In my case, (3) results in the seemingly spurious output:

    "qwer"a"sdfzxcv"

I tried the same thing on Debian Etch (which supports UTF-8) and the result in all 3 cases is:

    qwerasdfzxcv

I confess to being somewhat baffled.

Use [[:upper:]] instead of [A-Z].

Yes, '[[:upper:]]' does produce the expected results. But there's more to this. I did a side-by-side comparison of Gentoo and Debian Etch using a similar test case (provided by kojiro) for all locales:

    for l in $(locale -a); do echo testing LC_ALL=$l; LC_ALL=$l <<< qwerasdfzxcv sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'; done

... and the results were:

Gentoo:

    testing LC_ALL=C
    qwerasdfzxcv
    testing LC_ALL=en_GB.utf8
    "qwer"a"sdfzxcv"
    testing LC_ALL=en_US.utf8
    "qwer"a"sdfzxcv"
    testing LC_ALL=POSIX
    qwerasdfzxcv

Debian:

    testing LC_ALL=C
    qwerasdfzxcv
    testing LC_ALL=en_GB
    qwerasdfzxcv
    testing LC_ALL=en_GB.iso88591
    qwerasdfzxcv
    testing LC_ALL=en_GB.iso885915
    qwerasdfzxcv
    testing LC_ALL=en_GB.utf8
    qwerasdfzxcv
    testing LC_ALL=POSIX
    qwerasdfzxcv

Marty then put his finger on it. He suggested that it may be a result of the collation order (controlled independently by LC_COLLATE if so desired). So ... more tests:

Gentoo
------

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: "qwer"a"sdfzxcv"

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Debian
------

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

Test: echo qwerasdfzxcv | LANG="en_GB.UTF-8" LC_COLLATE="C" sed -e 's/\([A-Z][A-Z]*\)/"\1"/g'
Result: qwerasdfzxcv

A collation order of "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ" in the UTF-8 locales explains Gentoo's results: the range [A-Z] then covers every letter except the lowercase 'a' that sorts before 'A'. It does not explain why Gentoo appears to differ from other distributions in honouring the collation order.

So, is this NOTABUG? In my opinion it is a bug, and I think Debian's behaviour is sensible. There are situations where the locale (the collation order in this case) should alter the outcome, but I'm not sure that changing the meaning of, say, '[A-Z]' in a regular expression is one of them. I'm curious to hear the opinions of anyone else on this matter.

There doesn't seem to be any concrete information on how LC_COLLATE should, if at all, affect sed either. On BSD systems, the man page says "The COLUMNS, LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of sed as described in environ(7)" but offers no specific information.

Please, read the documentation before filing bugs.

try sed-4.1.5-r1 ... it should give you consistent regex behavior with other distros

Reopening to reassign to base-system.

Adding userrel to the cc list.

Yes indeed, the behavior appears to make much more sense with 4.1.5-r1.

Shouldn't we add the inSVN keyword?

no
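
For anyone stuck on a sed build that still honours the locale's collation order in ranges, here is a minimal sketch of the two workarounds that came up in this thread: pinning the collation to C for the one command, or using the POSIX character class instead of a range. The input string and the sed expression are the ones from the tests above; the variable names are only illustrative, and LC_ALL is used here instead of LC_COLLATE on the assumption that something else in the environment may already set LC_ALL (which, when set, overrides the individual LC_* variables).

```shell
#!/bin/sh
# Sketch only: variable names are hypothetical; the test string and the
# sed expression are taken from the thread.
input=qwerasdfzxcv

# Workaround 1: force the C locale for this single command, so that
# [A-Z] means the ASCII range A..Z whatever LANG is set to.
out_collate=$(printf '%s\n' "$input" | LC_ALL=C sed -e 's/\([A-Z][A-Z]*\)/"\1"/g')

# Workaround 2: use the character class instead of a range; it asks
# "is this an uppercase letter?" rather than "does this sort between A and Z?".
out_class=$(printf '%s\n' "$input" | sed -e 's/\([[:upper:]][[:upper:]]*\)/"\1"/g')

printf '%s\n%s\n' "$out_collate" "$out_class"
```

Since the input contains no uppercase letters, both lines should come back unchanged. Note that LC_ALL=C also switches LC_CTYPE to C for that command, so the character-class form is probably the safer choice when the data contains non-ASCII text.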