Bug 208082 - Localization Guide Troubleshooting
Summary: Localization Guide Troubleshooting
Status: RESOLVED FIXED
Alias: None
Product: [OLD] Docs on www.gentoo.org
Classification: Unclassified
Component: Other documents
Hardware: All Linux
Importance: High minor
Assignee: Docs Team
URL: http://www.gentoo.org/doc/en/guide-lo...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-29 18:57 UTC by michael@smith-li.com
Modified: 2008-10-10 22:23 UTC
CC List: 6 users

See Also:
Package list:
Runtime testing required: ---


Description michael@smith-li.com 2008-01-29 18:57:57 UTC
Coreutils and other programs are occasionally very sensitive to locale settings. Bug 208051 and bug 186349 are examples of such sensitivity.

For the most part I expect users will want to have their locale settings determine things like default date and currency formatting, and not affect the collation order of regular expressions.

Setting LC_COLLATE=C in /etc/env.d/02locale often resolves this kind of problem if it's not overridden by LC_ALL.

So I recommend that the Localization Guide be changed to make mention of problems caused by a non-POSIX collation order, something like this:

"Note: Locale settings can sometimes cause unexpected behavior in utilities that use glibc's regular expressions library, like sed and grep. Setting LC_COLLATE=C can prevent such unexpected behavior without impacting the rest of your localization, as long as you don't override it with LC_ALL."
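To illustrate the kind of difference meant here, a minimal sketch (not taken from either linked bug) that forces the C collation order for a single pipeline. Under C, comparison is by byte value, so upper-case ASCII letters sort before lower-case ones and [a-z] matches only lower-case ASCII letters. Under a locale such as en_US the results may differ, depending on the glibc version and the locale definition.

```sh
# Sort under the C collation order: plain byte comparison,
# so every upper-case ASCII letter precedes every lower-case one.
printf '%s\n' b A a B | LC_ALL=C sort
# prints A, B, a, b (one per line)

# Match a range under the C collation order: [a-z] covers only
# the lower-case ASCII letters, so "B" is not matched.
printf '%s\n' a B | LC_ALL=C grep '[a-z]'
# prints a
```

Here LC_ALL=C is used only as a per-command prefix to pin the behaviour for the demonstration; it is not a suggestion to set LC_ALL globally.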
Comment 1 Jan Kundrát (RETIRED) gentoo-dev 2008-01-29 19:25:30 UTC
We already explain that LANG and LC_ALL serve as default settings. Please also note that people have to set LC_CTYPE to some utf8-aware locale in order to use utf-8 on their system.

If there's a utility that breaks horribly when run with a non-C locale, it's a bug in the application, not in the locale itself. Applications' authors should be familiar with locales before they start to use them.

INVALID from me.
Comment 2 michael@smith-li.com 2008-01-29 20:10:02 UTC
(In reply to comment #1)
> We already explain that LANG and LC_ALL serve as default settings.

LC_ALL is not a good setting, as it specifically triggers this kind of problem.

> Please also note that people have to set LC_CTYPE to some utf8-aware locale
> in order to use utf-8 on their system.

To be sure, this problem is not related to UTF-8 or any encoding. It's a locale problem, not an encoding problem.

> Applications' authors should be familiar with locales before they start to
> use them.

Applications' authors can't be held accountable for sed behaving differently with the same locale on two different systems. Furthermore, how would you propose that they become familiar? The localization guide does not mention this problem. Neither does sed's info or man page. It's not mentioned in any GNU documentation page for regex or glibc. The POSIX manual page for sed makes the most complete reference to it:

              Determine the locale for the behavior of ranges, equivalence
              classes, and multi-character collating elements within
              regular expressions.

If it's a problem that application developers don't understand locales, then it is indeed a documentation problem. Even if it is a bug in glibc or coreutils, it is still something that bears mentioning to every person who sets their locale to something besides POSIX or C.
Comment 3 Jan Kundrát (RETIRED) gentoo-dev 2008-01-29 20:47:11 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > We already explain that LANG and LC_ALL serve as default settings.
> 
> LC_ALL is not a good setting, as it specifically triggers this kind of problem.

s/default settings/settings that override others when they aren't set/, sorry for confusion

> To be sure, this problem is not related to UTF-8 or any encoding. It's a locale
> problem, not an encoding problem.

Right, but one has to set LC_CTYPE to some utf-8 enabled value in order to use utf-8 on the system.

> Applications' authors can't be held accountable by sed behaving differently
> with the same locale on two different systems.

Indeed, but then it's a bug in sed. This bug should be fixed instead of asking users to employ workarounds.

> Furthermore how would you
> propose that they become familiar? The localization guide does not mention this
> problem. Neither does sed's info or man page. It's not mentioned in any GNU
> documentation page for regex or glibc. The POSIX manual page for sed makes the
> most complete reference to it:
> 
>               Determine the locale for the behavior of ranges, equivalence
>               classes, and multi-character collating elements within
>               regular expressions.

Anyone who uses stuff like tr should read more than our localization guide. If they did, they'd know that some locales, such as Estonian, have a different character ordering, and that the programmer should make sure she explicitly asks for a C locale when she needs it. Of course, when our patches somehow affect the application in an unexpected way, it's a bug on our side, but you should tell the maintainers of tr/sed/grep/..., not the documentation team.

> If it's a problem that application developers don't understand locales, then it
> is indeed a documentation problem.

Our localization guide is targeted at users, not application developers, sorry.

> Even if it is a bug in glibc or coreutils,
> it is still something that bears mentioning to every person who sets their
> locale to something besides POSIX or C.

We should fix whatever needs fixing; documenting bugs doesn't scale.
Comment 4 SpanKY gentoo-dev 2008-01-29 23:55:06 UTC
why does our localization guide show setting LC_ALL at the system level?  this can't possibly be something we want our users to think is OK.  Code Listing 3.2 should show only LANG being set, not LC_ALL.

the original suggestion was to add a semi-FAQ about unexpected (to the user) behavior when using certain values of LANG.  for example, a LANG of en_US will produce a collation that often times people do not expect.  by setting LC_COLLATE to C, they can get their expected behavior, while retaining localization output for everything else.
Comment 5 Bo Ørsted Andresen (RETIRED) gentoo-dev 2008-01-30 17:07:11 UTC
(In reply to comment #4)
> why does our localization guide show setting LC_ALL at the system level ?
> this cant possibly be something we want our users to think is OK.  Code
> Listing 3.2 should show only LANG being set, not LC_ALL.

+1

Users should never set LC_ALL on the system level. It prevents LC_ALL=C emerge foo from working when someone files a bug report in some weird language.
Comment 6 Jan Kundrát (RETIRED) gentoo-dev 2008-01-30 22:24:31 UTC
Er, I confused LC_CTYPE with LC_COLLATE, my bad, sorry.
Comment 7 kfm 2008-01-31 13:30:22 UTC
I'm curious - why would it be necessary to define LC_COLLATE with a utf-8 locale in order to use utf-8 at large? I'm currently using LANG="en_GB.UTF-8" LC_COLLATE="C" and it does not appear to have compromised anything. If there are significant drawbacks in doing so, I would rather like to know.
Comment 8 SpanKY gentoo-dev 2008-01-31 16:20:40 UTC
please review comment #8 again ... no one is saying any setting is required nor is utf8 a factor here at all
Comment 9 michael@smith-li.com 2008-01-31 17:47:30 UTC
(In comment #1, Jan said)
> We already explain that LANG and LC_ALL serve as default settings. Please also
> note that people have to set LC_CTYPE to some utf8-aware locale in order to 
> use utf-8 on their system.

(In comment #6, Jan said)
> Er, I confused LC_CTYPE with LC_COLLATE, my bad, sorry.

I think Kerin's comment #7 is asking why LC_COLLATE *or* LC_CTYPE would have to be UTF-8 in order to use UTF-8 at large.

I'd like to know, too. After all, if LC_COLLATE=C is an invalid collation locale for an otherwise UTF-8 system, then my suggested fix to the Localization Guide is invalid because it somehow damages the "UTF-8ness" of the system.
Comment 10 kfm 2008-01-31 17:52:52 UTC
Er, I was responding directly to Jan's assertion in comment 1 ...

> We already explain that LANG and LC_ALL serve as default settings. Please also
> note that people have to set LC_CTYPE to some utf8-aware locale in order to use
> utf-8 on their system.

He said he meant LC_COLLATE, not LC_CTYPE. So, the assertion is "people have to
set LC_COLLATE to some utf8-aware locale in order to use utf-8 on their
system".

I don't know about you but that seems to me to be precisely saying that a
setting is required - in lieu of utf-8 support no less - doesn't it? So, in
turn, I am directly asking what the basis is for that assertion. And it _is_
on-topic.

Now, which comment was I supposed to read again?

PS: I support kojiro's request.
Comment 11 Jan Kundrát (RETIRED) gentoo-dev 2008-01-31 19:00:21 UTC
To make it clear, when I was speaking about LC_COLLATE, that was an error. On my system, I use LC_CTYPE=cs_CZ.utf-8 for utf-8, and that's the required setting for a "utf-8 enabled system", AFAIK.

So, spanky/others, what should we change and how exactly? I can understand that setting LANG globally might be wrong, but what's the absolutely correct way to switch system's language, then?
Comment 12 Bo Ørsted Andresen (RETIRED) gentoo-dev 2008-01-31 19:41:24 UTC
(In reply to comment #11)
> So, spanky/others, what should we change and how exactly? I can understand
> that setting LANG globally might be wrong, but what's the absolutely correct
> way to switch system's language, then?

Setting LANG is correct. Setting LC_ALL isn't and shouldn't be mentioned as an option.
Comment 13 kfm 2008-03-03 14:53:37 UTC
> To make it clear, when I was speaking about LC_COLLATE, that was an error.
> On my system, I use LC_CTYPE=cs_CZ.utf-8 for utf-8, and that's the required
> setting for "utf-8 enabled system", AFAIK.

It isn't, given that the user should be defining LANG instead. From the Single UNIX Specification:

LANG - "This variable shall determine the locale category for native language, local customs, and coded character set in the absence of the LC_ALL and other LC_* ( LC_COLLATE , LC_CTYPE , LC_MESSAGES , LC_MONETARY , LC_NUMERIC , LC_TIME ) environment variables. This can be used by applications to determine the language to use for error messages and instructions, collating sequences, date formats, and so on."

In other words, all LC_* settings will follow LANG unless otherwise overridden.

Of course, LC_ALL behaves similarly except that it takes precedence over _all_ other environment variables and this is why we should not (under any circumstances) be suggesting it to users as a general localisation method.
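The precedence described above can be sketched as a small shell function. This is only an illustration: effective_locale is a hypothetical helper name, not an existing tool, and the final fallback to "C" when nothing is set is an assumption (the specification leaves that default implementation-defined).

```sh
# Hypothetical helper: print the locale that would apply to one category,
# following the order LC_ALL, then the category's own variable, then LANG.
effective_locale() {
    category=$1
    case $category in
        LC_COLLATE|LC_CTYPE|LC_MESSAGES|LC_MONETARY|LC_NUMERIC|LC_TIME) ;;
        *) echo "unknown category: $category" >&2; return 1 ;;
    esac
    # LC_ALL, when set and non-empty, overrides everything else.
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
        return 0
    fi
    # Otherwise the category's own variable wins over LANG.
    eval "value=\${$category:-}"
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: fall back to "C" when nothing at all is set.
        printf '%s\n' "C"
    fi
}
```

With LC_ALL unset, LANG=en_GB.UTF-8 and LC_COLLATE=C, the function reports C for LC_COLLATE and en_GB.UTF-8 for LC_TIME. As soon as LC_ALL is set to anything non-empty, every category reports the LC_ALL value, which is why a global LC_ALL leaves no room for per-category overrides.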

Incidentally, here's a post corroborating that which we have been expounding; that setting LC_ALL instead of LANG is not a great idea and, yes, there are people who actually set LC_COLLATE as well so as to avoid strange behaviour with UTF-8 locales and, as such, it's a splendid idea for us to mention it:

http://mail.nl.linux.org/linux-utf8/2003-03/msg00025.html

In fact, the author of the above post makes a good point about how LC_ALL can sometimes be useful when invoking certain applications:

"LC_ALL is most useful to switch off any i18n-environment-variable
dependent behaviour with LC_ALL=C. Programs (such as all known Linux
releases of acroread) with loads of problems in their locale handling
are best called with LC_ALL=C."

Of course, for such a workaround to work, it would have been necessary not to set LC_ALL globally in the first place ...
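A minimal sketch of that per-command workaround, assuming LC_ALL is not defined in the calling environment. The variable names inner and outer are only for illustration; a child shell stands in for the misbehaving program.

```sh
# Make sure LC_ALL is not set in this shell, mirroring a system
# where it is not defined globally.
unset LC_ALL

# The assignment prefix applies to this one child process only.
inner=$(LC_ALL=C sh -c 'printf %s "$LC_ALL"')

# Back in the calling shell, LC_ALL is still unset.
outer=${LC_ALL:-unset}

printf 'inside: %s, afterwards: %s\n' "$inner" "$outer"
# prints: inside: C, afterwards: unset
```

The same prefix form works for any single invocation, so the rest of the session keeps whatever LANG and LC_* values it already had.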

So, it should not be defined globally. Rather, LANG should. And it would be reasonable to alert users as to the problems that can occur with collation when using UTF8 locales and to inform them that defining LC_COLLATE in addition to LANG will proceed to make things work as people generally expect.
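As a sketch of what that could look like in the file named in the original description (/etc/env.d/02locale), with en_GB.UTF-8 used purely as an example value:

```sh
# /etc/env.d/02locale (sketch; en_GB.UTF-8 is only an example value)
LANG="en_GB.UTF-8"
LC_COLLATE="C"
# LC_ALL is deliberately not set here.
```

On Gentoo, changes under /etc/env.d normally take effect after running env-update and re-sourcing /etc/profile; whether that applies unchanged to a given setup is an assumption to verify locally.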

Also, I find this statement in the localisation guide very dubious and would like it to be removed based upon the notion that there is likely to be no hard evidence to support it:

"Note: Even though most programs work with LC_ALL only, some of them misbehave if LC_ALL is set but LANG isn't. If you want to play safe, set them both."

And, this statement:

"Most typically users only set the LANG variable and perhaps LC_CTYPE variable on user level by adding definitions to shells startup files defining the environment variable manually from command line:"

... is clearly based upon a misapprehension because, as I've already explained, defining LC_CTYPE in addition to LANG is redundant.

Then there's this statement:

"A common practice is to use only per user locale settings and leave the default system locale unset"

How common is it really? It doesn't seem to be common as far as other distros are concerned. I have a Debian system also and that sets a default system locale in /etc/default/locale. Same for Ubuntu. The last time I used Slackware, it exported a locale in /etc/profile.d/lang.sh. This is worth bringing up because when Gentoo is seen to prescribe a different approach for no obvious reason, I think it's fair to ask why. As far as the habits of the Gentoo userbase are concerned, who knows? My point is that it's a speculative comment and doesn't come across as being evidence-based. If we are to present one of two possible approaches for defining a locale (i.e. system-wide or per-user) then I believe that we should only disclose factual information that enables the user to make an informed decision as to which method is appropriate for their needs.

As it stands, the document seems to express a bias towards modifying locales on a per-user basis but I don't think it presents a clear case for doing it either way and instead appears to prescribe the habits of the author.

Personally, I'm biased towards simply setting LANG system-wide because (a) it's very easy to do and very easy to explain how to do to others (b) it's a case of set it and forget it. In any case, perhaps it would help if both approaches were documented under individual headings/sections with an introduction beforehand explaining the relative merits of either approach.
Comment 14 Dan Coats 2008-04-25 17:32:17 UTC
Rather than open a new bug, I thought I would mention that the guide should be updated for openrc/baselayout2 as well.
Comment 15 nm (RETIRED) gentoo-dev 2008-07-20 00:43:15 UTC
(In reply to comment #14)
> Rather than open a new bug, I thought I would mention that the guide should be
> updated for openrc/baselayout2 as well.

You haven't said what updates are necessary, nor where it needs updates. Closing until we hear back.
Comment 16 michael@smith-li.com 2008-09-30 00:53:28 UTC
The document should say that if you intend to set LANG to a locale other than C, you should explicitly set LC_COLLATE to 'C'. It should say that you should never set LC_ALL to anything unless you are testing for locale problems and know what you are doing.
Comment 17 nm (RETIRED) gentoo-dev 2008-10-07 06:15:27 UTC
(In reply to comment #16)
> The document should say that if you intend to set LANG to a locale other than
> C, you should explicitly set LC_COLLATE to 'C'. It should say that you should
> never set LC_ALL to anything unless you are testing for locale problems and
> know what you are doing.


Why, and why? :)
Comment 18 kfm 2008-10-07 13:51:03 UTC
I find it remarkable that this bug continues to drag on like this. This is a typical example of how, when something is addressed in the wrong way in Gentoo for long enough, it somehow becomes right and unassailable by effective scrutiny and common sense.

Do people even consider reading any of the comments here before chiming in? Setting LC_ALL globally is clearly daft. I explained why in great detail in comment 13. There are plenty of disadvantages in doing so ... application weirdness, not being able to define any LC_* variable due to the immutable nature of LC_ALL. No-one has presented so much as one advantage in doing so and I don't believe that anyone can or will.

No other distro sets LC_ALL. They set LANG like they're supposed to.

As for setting LC_COLLATE, the opening comment pointed to two concrete examples as to why, if left unset, it can cause surprising - and sometimes outright borken - behaviour in applications, in the case where a UTF-8 locale is in use. By the way, there is at least one distro, Arch Linux, that exports LC_COLLATE="C" for this very reason - and that's in addition to LANG. So clearly, Michael and I are not the only folks that recognise that there is a problem here. I can't comprehend how anyone would deem this _not_ to be a worthy topic to be covered in the localisation guide.

Finally, I want to re-iterate the point that telling the user to set LC_CTYPE is thoroughly pointless. As I have already explained quite clearly, it inherits from LANG (and LC_ALL for that matter) anyway.
Comment 19 kfm 2008-10-07 13:59:33 UTC
About LC_ALL, just to be absolutely clear: it should be left for those rare cases where the user needs to override LANG (and all of its LC_* children) in one fell swoop. Setting LC_ALL in the manner prescribed by the guide makes it impossible for the user to change _any_ locale related setting within his imported environment.
Comment 20 Łukasz Damentko (RETIRED) gentoo-dev 2008-10-07 14:29:05 UTC
Kerin, 

It would be much easier if you provided patches for the docs instead of discussing Gentoo politics in this bug. We're quite tired of politics in this project already, if you want to push things forward, better stay on topic.


Josh,

Let's drop LC_ALL like all other distros I checked (debian, ubuntu, freebsd, mandriva) do, leave LC_ALL blank and set LC_COLLATE to C like Kerin suggested.

It will be something like:

LANG="pl_PL"
LC_CTYPE="pl_PL"
LC_NUMERIC="pl_PL"
LC_TIME="pl_PL"
LC_COLLATE="C"
LC_MONETARY="pl_PL"
LC_MESSAGES="pl_PL"
LC_PAPER="pl_PL"
LC_NAME="pl_PL"
LC_ADDRESS="pl_PL"
LC_TELEPHONE="pl_PL"
LC_MEASUREMENT="pl_PL"
LC_IDENTIFICATION="pl_PL"
LC_ALL=

We should suggest that instead of the LANG + LC_ALL we're using at the moment. We could use .UTF-8 too (for example LC_NAME="pl_PL.UTF-8") if we wanted to go UTF-8 in guide-localization. I strongly suggest that.

Bug #238235 seems to be a good example of the problem Kerin writes about. See http://bugs.gentoo.org/show_bug.cgi?id=238235#c3 for example.

What we should change:

Code listing 3.2 in guide-localization and "Setting the locale" section in utf-8.xml. And probably something in the handbook (but I can't find what).
Comment 21 kfm 2008-10-07 16:30:54 UTC
> It would be much easier if you provided patches for the docs instead of

OK, Łukasz - that's a fair point and duly taken. I have actually contributed documentation changes directly in the past and am aware of the potential effectiveness of such a strategy.

However - and let me be perfectly clear - I would point out that, as there has been a difficulty in (a) establishing anything remotely resembling a consensus that there is a problem that needs to be addressed (b) thus, in reaching any point where anyone can agree to do anything, my focus has been on actually making a case for doing something - which is apparently enough of a challenge in itself. I think that the case had been quite clearly made and my irritation stems from having to repeat the same points again.

> discussing Gentoo politics in this bug. We're quite tired of politics in this
> project already, if you want to push things forward, better stay on topic.

Whilst not wishing to rile any developers who have a stake in this bug, I make no apology for making clear my frustration at this stage - particularly when I have been researching the matter and making valid contributions to the bug which, up until now, have demonstrably been falling upon seemingly deaf ears.

In any case, there is nothing more that I have to say on the bug so I'll refrain from making any further comments; ergo, no more "off-topic" remarks from me ;)

Also, thank you for taking the issue into consideration.
Comment 22 Jan Kundrát (RETIRED) gentoo-dev 2008-10-10 16:06:13 UTC
(In reply to comment #20)
> Let's drop LC_ALL like all other distros I checked (debian, ubuntu, freebsd,
> mandriva) do, leave LC_ALL blank and set LC_COLLATE to C like Kerin suggested.
> 
> It will be something like:
> 
> LANG="pl_PL"
> LC_CTYPE="pl_PL"
> LC_NUMERIC="pl_PL"
> LC_TIME="pl_PL"
> LC_COLLATE="C"
> LC_MONETARY="pl_PL"
> LC_MESSAGES="pl_PL"
> LC_PAPER="pl_PL"
> LC_NAME="pl_PL"
> LC_ADDRESS="pl_PL"
> LC_TELEPHONE="pl_PL"
> LC_MEASUREMENT="pl_PL"
> LC_IDENTIFICATION="pl_PL"
> LC_ALL=
> 
> We should suggest that instead of LANG + LC_ALL we're using at the moment. We
> could use .UTF-8 too (for example LC_NAME="pl_PL.UTF-8") if we wanted to go
> UTF-8 in guide-localization. I strongly suggest that.
> 
> Bug #238235 seems to be a good example of the problem Kerin writes about. See
> http://bugs.gentoo.org/show_bug.cgi?id=238235#c3 for example.
> 
> What we should change:
> 
> Code listing 3.2 in guide-localization and "Setting the locale" section in
> utf-8.xml. And probably something in the handbook (but I can't find what).

I've done something similar. I haven't touched the Handbook.

BTW, I'd really like to know what I was drinking when writing comment #6.

Reporters/commenters, I'm sorry it took so long to fix. You know, we're slackers :p.
Comment 23 Heiko Baums 2008-10-10 20:06:28 UTC
I unfortunately can't reopen this bug.

(In reply to comment #22)
> I've done something similar. I haven't touched the Handbook.

I don't think that these comments in the handbook are really the correct way.
As you, Jan Kundrát, wrote in comment #3, I think it's a bug in programs like sed, or in glibc, if some programs fail with an LC_COLLATE other than "C". Because LC_COLLATE determines how things are sorted on Linux, including the KDE menu, I need LC_COLLATE="de_DE.UTF-8": I want an alphabetical sort order, and I don't like having lower-case characters sorted below the upper-case characters.

So if programs don't work correctly with an LC_COLLATE other than "C", a bug report should be filed with the upstream developers of these programs, or with the glibc developers if this is a glibc issue.
Comment 24 kfm 2008-10-10 22:23:17 UTC
I said I had nothing further to add. Well I was wrong :)

Jan, thank you for addressing the issue in the documentation - particularly with respect to LC_ALL.

Łukasz, I forgot to say last time that I agree that recommending a UTF-8 based locale for LANG is a very good idea because unicode is defined in USE almost everywhere within the "default" and "default-linux" namespace, and the overwhelming majority of users will use such a profile.

Heiko, I do not agree that it is wrong to present overriding LC_COLLATE in the documentation as a valid *potential* configuration choice ... bearing in mind that the user is responsible for the definition of all locale settings in this distro and that there are pragmatic reasons for doing so for those users who just want their systems to work and not have to worry about the obscure bugs that may be triggered otherwise.

Consider the sentiment expressed in Comment 2 and that locale handling is a widely misunderstood affair among users and developers alike. Also consider that Gentoo - in contrast to Arch - is extraordinarily flexible, extraordinarily portable ... and innately complex as a result. Not only is the process of building software handled by the user, there are a huge number of other variables to contend with; seldom are two Gentoo Linux systems perfectly alike. The impact of a given bug can be far, far greater in scope than it may be for a 'conventional' distro. Consequently, its developers face some particularly challenging issues during the development process, when supporting its userbase and even when soliciting help from upstream.

Were you a Gentoo user, you may indeed find that the KMenu is not ordered as you expect if you choose to define LC_COLLATE in this manner. Aside from being arguably trivial, this would not actually be a bug; this would be a normative behaviour that is introduced by a user-initiated choice (please do remember that Gentoo's base system does not specify any locale defaults). In contrast, the examples that have been touched upon here are genuine bugs that have triggered surprising and/or genuinely broken behaviour.

What's more important - that a user may export LC_COLLATE such that he encounters a minor (and easily rectified) annoyance with respect to the ordering of the items in their desktop menu? Or that he may export LC_COLLATE so as to mitigate a class of bug that can result in fundamental system tools not producing the expected results? I'm in the latter camp.

On the other hand, I would concede that it could perhaps be addressed in a more in-depth manner than it has been, in that the change is not accompanied by an explanation of why the user might want to override LC_COLLATE (or not). When I get some time, I may do what Łukasz suggested to me and present a patch against the documentation that elaborates a little further and helps the user to make a better-informed choice, obviously without getting dragged too far into the details.