Bug 385507 - Localization Guide: expand on collation issues (en_US is case insensitive)
Status: RESOLVED WONTFIX
Alias: None
Product: Gentoo Hosted Projects
Classification: Unclassified
Component: eselect
Hardware: All Linux
Importance: Normal enhancement
Assignee: Gentoo eselect Team
URL: http://www.gentoo.org/doc/en/guide-lo...
Reported: 2011-10-03 14:09 UTC by Andy Dalton
Modified: 2011-10-24 21:10 UTC
Description Andy Dalton 2011-10-03 14:09:54 UTC
According to GNU sed's bug-reporting web page [1], when a UTF-8 locale is in effect and sed uses glibc's collation functions, bracket ranges such as [a-z] can end up matching characters case-insensitively.  I can reproduce this on my machine:

$ echo AAA | sed 's/[a-z]/a/'
aAA

[a-z] should not have matched A.


As far as I understand, my system is correctly configured for UTF-8, according to "Using UTF-8 With Gentoo" [2]:

$ eselect locale list
Available targets for the LANG variable:
  [1]   C
  [2]   en_US
  [3]   en_US.iso88591
  [4]   en_US.utf8
  [5]   POSIX
  [6]   en_US.UTF-8 *
  [ ]   (free form)

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Sed's bug-reporting page [1] says this isn't a bug, but is related to the collation function provided by glibc with UTF-8 locales.  They mention you can export a variable LC_COLLATE=C to work around the problem.  I can verify that this addresses that particular problem:

$ export LC_COLLATE=C
$ echo AAA | sed 's/[a-z]/a/'
AAA
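
Note: the export above changes collation for the rest of the shell session. A minimal sketch of scoping the override to a single command instead is shown here. It uses LC_ALL rather than LC_COLLATE on the assumption that LC_ALL may already be set in the environment, in which case it would take precedence over LC_COLLATE. The variable name `result` is only illustrative.

```shell
# Sketch: apply the C locale to one sed invocation only,
# leaving the locale of the surrounding shell untouched.
# LC_ALL overrides LC_COLLATE and LANG, so it is used for the prefix.
result=$(printf 'AAA\n' | LC_ALL=C sed 's/[a-z]/a/')

# In the C locale, [a-z] covers only lowercase ASCII letters,
# so the uppercase input passes through unchanged.
printf '%s\n' "$result"   # AAA
```

Because the assignment is a prefix to the command, it applies only to that sed process; later commands in the same script still see the original locale.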

A question I have is, is it safe for 'eselect locale set' to include LC_COLLATE=C in /etc/env.d/02locale, and if so, should it?  If not, what would be the "correct" way to resolve this problem?  If there is a universal fix, should that be included in [2]?

This bug may be related to bug 275986, but this problem seems more serious than a simple "test failed" because of the ubiquitous use of sed in shell scripts.

If you need any additional information, please let me know.


[1] http://www.gnu.org/software/sed/manual/html_node/Reporting-Bugs.html
[2] http://www.gentoo.org/doc/en/utf-8.xml

Reproducible: Always

Steps to Reproduce:
1. echo AAA | sed 's/[a-z]/a/'

Actual Results:  
aAA

Expected Results:  
AAA
Comment 1 Rafał Mużyło 2011-10-03 19:40:12 UTC
To be honest, any script that uses an [a-z] range to match the alphabet without *explicitly* setting LC_ALL=C is broken by design.
Just compare the output of
echo ttt | LANG=C sed -e 's:[a-z]::'
and
echo ttt | LANG=et_EE sed -e 's:[a-z]::'
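
For context: in the Estonian alphabet z is ordered before t, so under a collation-based range the letter t can fall outside [a-z]. Besides pinning LC_ALL=C, a script can avoid ranges altogether by using a POSIX character class, which is defined by character type rather than by collation order. The following is a minimal sketch of that alternative; the variable name `out_class` is only illustrative.

```shell
# Sketch: match "any lowercase letter" with a POSIX character class
# instead of the range [a-z]. Class membership comes from LC_CTYPE,
# so it does not depend on where a letter sorts in the alphabet.
out_class=$(echo ttt | sed -e 's:[[:lower:]]::')

# The first lowercase letter is removed.
echo "$out_class"   # tt
```

Note that [[:lower:]] may also match non-ASCII lowercase letters in some locales, so a script that needs strictly ASCII a through z is still better served by LC_ALL=C.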
Comment 2 Mike Gilbert gentoo-dev 2011-10-09 14:45:06 UTC
I don't think setting LC_COLLATE globally is going to fly; that affects more than just sed. For example, the output order of "ls" depends on this setting.

You have already given what seems to be a nice simple solution: just set LC_COLLATE if you need to do case-sensitive matching.
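
The effect on "ls" can be sketched as follows. This is an illustration only: it creates three hypothetical file names in a temporary directory and lists them under the C locale, where names are ordered by byte value. Under en_US the same listing would typically interleave upper and lower case instead.

```shell
# Sketch: show how collation affects the order of ls output.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/B" "$demo_dir/c"

# In the C locale, uppercase ASCII sorts before lowercase ASCII.
# paste joins the one-name-per-line output into a single line.
c_order=$(LC_ALL=C ls "$demo_dir" | paste -sd' ' -)
echo "$c_order"   # B a c

rm -r "$demo_dir"
```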
Comment 3 SpanKY gentoo-dev 2011-10-11 00:39:57 UTC
it is not an encoding issue.  your locale (en_US) says to do case-insensitive matching.  if you don't like that behavior, use a diff LC_COLLATE value in your environment.

also, you misread the bug reporting page.  what you reference is telling you that there *isn't* a bug here, but rather your expectations are incorrect.
Comment 4 Andy Dalton 2011-10-11 00:51:09 UTC
(In reply to comment #3)
> it is not an encoding issue.  your locale (en_US) says to do case-insensitive
> matching.  if you don't like that behavior, use a diff LC_COLLATE value in your
> environment.
> 
> also, you misread the bug reporting page.  what you reference is telling you
> that there *isn't* a bug here, but rather your expectations are incorrect.

No, I didn't misread the bug reporting page, that's why I specifically said the following in my original report: "Sed's bug-reporting page [1] says this isn't a bug, but is related to the collation function provided by glibc with UTF-8 locales."

I understand the issue. My hope in filing this report was that either (1) there was a way to universally resolve the issue, or (2) you might find somewhere in Gentoo's documentation to address this issue.  It seems (1) isn't an option, but I still think (2) would be beneficial.  If you disagree, I respect your decision.
Comment 5 SpanKY gentoo-dev 2011-10-11 01:34:08 UTC
your bug summary says "sed does not function correctly".  that is not the issue at all ... sed is operating perfectly.  further, it has nothing to do with UTF8 locales.  you're smooshing two different issues into one ... the UTF8 note is not under the case insensitive section.

this isn't relevant to the UTF8 guide.  the localization guide does mention case matching issues already, but i'll leave it up to the docs team if they want to expand upon it.

http://www.gentoo.org/doc/en/guide-localization.xml#doc_chap3_sect3
Comment 6 Andy Dalton 2011-10-11 01:59:24 UTC
(In reply to comment #5)
> your bug summary says "sed does not function correctly".  that is not the issue
> at all ... sed is operating perfectly.  further, it has nothing to do with UTF8
> locales.  you're smooshing two different issues into one ... the UTF8 note is
> not under the case insensitive section.
> 
> this isn't relevant to the UTF8 guide.  the localization guide does mention
> case matching issues already, but i'll leave it up to the docs team if they
> want to expand upon it.
> 
> http://www.gentoo.org/doc/en/guide-localization.xml#doc_chap3_sect3

I understand your point, and I agree -- I was unintentionally smooshing the issues into one.

I hadn't seen the localization guide, but now that I've taken a look, I do see the following note:

"Note: Some programs are written in such a way that they expect traditional English ordering of the alphabet, while some locales, most notably the Estonian one, use a different ordering. Therefore it's recommended to explicitly set LC_COLLATE to C when dealing with system-wide settings."

Then, in Listing 3.1, it gives an example of how /etc/env.d/02locale might look:
LANG="de_DE.UTF-8"
LC_COLLATE="C"

Since 'eselect locale' exists, I tend to avoid manually editing that file in favor of using that tool.  If it's recommended to explicitly set LC_COLLATE to C, could 'eselect locale set' not do that for us?
Comment 7 Sven Vermeulen (RETIRED) gentoo-dev 2011-10-23 10:44:00 UTC
Documentation-wise, I don't think it needs to be elaborated more. Regarding the "eselect locale" behavior -> reassigning to eselect folks.
Comment 8 Ulrich Müller gentoo-dev 2011-10-23 22:39:29 UTC
(In reply to comment #0)
> A question I have is, is it safe for 'eselect locale set' to include
> LC_COLLATE=C in /etc/env.d/02locale, and if so, should it?

I think that hardcoding such side effects is not a good idea. Please note that eselect currently leaves any extra variable assignments in the 02locale file alone.

What we could do is to extend the "set" and "show" actions such that they would operate on a different variable than LANG. Maybe something like "eselect locale set --variable=LC_COLLATE C" and "eselect locale show --variable=LC_COLLATE"?
Comment 9 Andy Dalton 2011-10-24 13:19:46 UTC
(In reply to comment #8)
> I think that hardcoding such side effects is not a good idea. Please note that
> eselect currently leaves any extra variable assignments in the 02locale file
> alone.
> 
> What we could do is to extend the "set" and "show" actions such that they would
> operate on a different variable than LANG. Maybe something like "eselect locale
> set --variable=LC_COLLATE C" and "eselect locale show --variable=LC_COLLATE"?

I was afraid that hardcoding the LC_COLLATE variable wouldn't be a good choice.  Extending the functionality for the "set" and "show" actions for 'eselect locale' would be great.
Comment 10 Sven Vermeulen (RETIRED) gentoo-dev 2011-10-24 18:54:12 UTC
As much as I like eselect for providing a uniform interface towards our users for various settings, doesn't that seem a bit too ... overengineered? 

What's wrong with editing /etc/env.d/02locale? For a user, he'll need to see documentation on it anyhow, regardless of the solution (as eselect won't tell the user that there is a variable clause available as well).
Comment 11 Ulrich Müller gentoo-dev 2011-10-24 21:10:12 UTC
(In reply to comment #10)
> As much as I like eselect for providing a uniform interface towards our users
> for various settings, doesn't that seem a bit too ... overengineered? 
> 
> What's wrong with editing /etc/env.d/02locale? For a user, he'll need to see
> documentation on it anyhow, regardless of the solution (as eselect won't tell
> the user that there is a variable clause available as well).

Why did you reassign this bug to eselect then? ;)

But right, the eselect module covers the basic use case. If we extend (and thereby complicate) it, then at some point reading its documentation will require more effort for the user than editing the config file.

Closing as WONTFIX.