Bug 113233 - UTF-8 Guide uses 'xx_YY.UTF-8' while only 'xx_YY.utf8' works for newer glibc
Status: RESOLVED INVALID
Alias: None
Product: [OLD] Docs-user
Classification: Unclassified
Component: Localisation Guide
Hardware: All Linux
Importance: High enhancement
Assignee: Docs Team
URL: http://www.gentoo.org/doc/en/utf-8.xm...
Reported: 2005-11-22 02:18 UTC by Wiktor Wandachowicz
Modified: 2005-11-23 05:16 UTC
Description Wiktor Wandachowicz 2005-11-22 02:18:36 UTC
For a long time I thought that the UTF-8 support in Gentoo is broken. I always
followed the "UTF-8 Guide" but I was never able to persuade Midnight Commander to
use UTF-8 correctly. Today I'm finishing a new amd64 install and I thought about
trying the UTF-8 again. I've followed the Guide and I've found that it proposes
incorrect settings for LC_ALL / LANG variables. So if I set "en_US.UTF-8" or
"pl_PL.UTF-8" for LC_ALL in /etc/env.d/02locale I'm unable to turn on UTF-8
support. However, if I use "en_US.utf8" or "pl_PL.utf8" it works!


Reproducible: Always
Steps to Reproduce:
1. Visit the URL that demonstrates the problem
2. See that in "Code Listing 2.1: Checking for an existing UTF-8 locale" there
   are "en_GB" and "en_GB.UTF-8" entries listed
3. Try executing the following commands:
   # cat /etc/locales.build
   # locale -a
   # cat /etc/env.d/02locale

Actual Results:  
Important part of my /etc/locales.build file:

----- QUOTE BEGIN -----
en_US/ISO-8859-1
en_US.UTF-8/UTF-8
pl_PL/ISO-8859-2
pl_PL.UTF-8/UTF-8
----- QUOTE END -----

The output of 'locale -a':

# locale -a
C
en_US
en_US.utf8
pl_PL
pl_PL.utf8
POSIX

Contents of the /etc/env.d/02locale file:

# cat /etc/env.d/02locale
# /etc/env.d/02locale:
# Define default system locale
LANG="pl_PL.utf8"
LC_ALL="pl_PL.utf8"


Expected Results:  
I understand that the "xx_YY.UTF-8" setting worked once, otherwise it wouldn't be
in the Guide. However, 'locale -a' displays something else than the "UTF-8 Guide"
mentions, so I think that it could be enhanced somehow.
I guess that it may still work as it is for compatibility reasons with older
glibc (?). However, newer glibc (and I've tried all: x86, ~x86, amd64 and ~amd64)
no longer use "xx_YY.UTF-8" variant, but "xx_YY.utf8" instead. At least that's
what my experience shows.
So, maybe the Guide could notify the users that ".UTF-8" may not always be
correct, and ".utf8" suffix should be used instead. That would shorten the
troubleshooting time enormously - from months to hours in my case (!).

Please consider testing this issue and fix the "UTF-8 Guide" if the problem is
repeatable. Thanks!
Comment 1 Jakub Moc (RETIRED) gentoo-dev 2005-11-22 02:50:27 UTC
Works fine here:

<snip>
# locale
LANG=cs_CZ.UTF-8
...

# locale -a
C
cs_CZ
cs_CZ.utf8
en_US
en_US.utf8
POSIX

Besides, per yesterday's conversation on #-dev, this turns out to be ncurses
issue, not a glibc one.

CCing truedfx for some comments...
Comment 2 Wiktor Wandachowicz 2005-11-22 05:39:55 UTC
Maybe that's the problem. I'll try to remerge "mc" with "-ncurses +slang"
and see what happens. I'll also do everything with LANG="pl_PL.UTF-8" to see
if it makes any difference.

Comment 3 SpanKY gentoo-dev 2005-11-22 07:08:32 UTC
I wouldn't bother ... the slang in portage is old and broken ... no one
has updated it to the 2.0 version, which has fixed UTF-8 handling

try upgrading to ncurses 5.5
Comment 4 Harald van Dijk (RETIRED) gentoo-dev 2005-11-22 07:19:22 UTC
> Besides, per yesterday's conversation on #-dev, this turns out to be ncurses
> issue, not a glibc one.

> CCing truedfx for some comments...

Actually, the ncurses issue was the exact opposite: .utf8 didn't work, .UTF-8
did. ncurses 5.4 in the Linux console needs .UTF-8 locales; with .utf8 locales,
it would not realise that it should not use the terminfo description, and would
try to print lines using the wrong character codes, leading to screen corruption.

As for this bug, I'm not sure what's wrong, but both .UTF-8 and .utf8 work here
(glibc 2.3.6-r1, ncurses 5.5-r1), so I'd also suggest checking how ncurses 5.5
behaves.
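
The 'locale -a' listings in this report show only the ".utf8" spelling even
though /etc/locales.build asks for ".UTF-8", which is consistent with glibc
treating the two spellings as the same codeset. A minimal sketch of a check
that ignores the spelling difference follows. The helper names
normalize_locale and locale_available are hypothetical, and the normalization
rule used here (lowercase the codeset, drop non-alphanumeric characters) is a
simplifying assumption, not a copy of glibc's own code.

```shell
# Hypothetical helper: rewrite the codeset part of a locale name to
# lowercase alphanumerics, so "pl_PL.UTF-8" and "pl_PL.utf8" compare equal.
# A trailing "@modifier" is kept untouched.
normalize_locale() {
    name=$1
    modifier=
    case $name in
        *@*) modifier=@${name#*@}; name=${name%%@*} ;;
    esac
    case $name in
        *.*)
            base=${name%%.*}
            codeset=${name#*.}
            codeset=$(printf '%s' "$codeset" | tr -d -c '[:alnum:]' | tr '[:upper:]' '[:lower:]')
            printf '%s.%s%s\n' "$base" "$codeset" "$modifier"
            ;;
        *)
            printf '%s%s\n' "$name" "$modifier"
            ;;
    esac
}

# Hypothetical helper: succeed if the requested locale is listed by
# `locale -a` under either spelling.
locale_available() {
    wanted=$(normalize_locale "$1")
    for listed in $(locale -a 2>/dev/null); do
        [ "$(normalize_locale "$listed")" = "$wanted" ] && return 0
    done
    return 1
}
```

For example, `locale_available pl_PL.UTF-8 && echo present` would report
whether the Polish UTF-8 locale has been generated, whichever way its name is
written in 'locale -a'.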
Comment 5 Wiktor Wandachowicz 2005-11-22 13:24:13 UTC
Now I've checked several things and have a better overview.

I created several text files with different encodings (I used "iconv" to convert
between charsets). I think that I finally got UTF-8 running on the console,
because the tests confirmed it. All of you were right, I just didn't
believe I had gotten what I wanted.

I just want to ask what you think about this:
- I set the font and translation in /etc/conf.d/consolefont
- I set LC_ALL="pl_PL.UTF-8"
- This gives me a good result, because files with UTF-8 characters are displayed
  correctly on the console (cat)
- less is less than ideal, because sometimes the output may be garbled, but
  I can control its behaviour through the LESSCHARSET variable

<now the tricky part>
- I start Midnight Commander (compiled with "-ncurses +slang") and suddenly
  the hints right over its "command line" are garbled - they are cut sometimes
  in the middle or give funny visual effects on the background.
  Test files are displayed correctly (F4 - Edit).
- I suspect that this is because of the fact that the translated file
  (hints.mc.pl) uses the ISO-8859-2 encoding
- I convert the original file (using iconv) to the UTF-8 encoding, which fixes
  this problem
- On-line help misses all the localized characters, and displays spaces instead
- I couldn't figure out how to fix that, using iconv didn't help

<and another one>
- I have man pages translated into Polish, so I try "man bash"
- Lots of localized characters are displayed incorrectly
- I played with /etc/man.conf and tried all possible combinations of
  NROFF setting, but this didn't improve the situation
- I suspect that this is also caused by the fact that man pages use ISO-8859-2

Now my questions:
* Is it really necessary to convert all ISO-8859-2 encoded files to UTF-8
  just in order to display them correctly on UTF-8 enabled console?
  (and I'm not asking about X terminal of any kind)
* Should the man pages be converted from ISO-8859-2 to UTF-8 just in order
  to display them correctly on the UTF-8 enabled console?

If the answer to both questions is "yes", then it looks like changing the
locale to *.UTF-8 is not worth the trouble right now. Lots of resources use
the ISO-8859-* standard encodings, and dealing with them on UTF-8 console
is troublesome. Of course, such documents can be converted both ways, but the
conversion still needs to be done (the worst-case scenario: convert from
ISO-8859-* to UTF-8 just to see or edit the file, and convert back afterwards).

What is your opinion on this?
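
One option short of converting files permanently is to convert only the
stream that reaches the screen. A minimal sketch, assuming the file really is
ISO-8859-2 and that iconv is installed; the helper name latin2_to_utf8 is
hypothetical, and this covers viewing only, not editing.

```shell
# Hypothetical helper: print an ISO-8859-2 encoded file as UTF-8 on stdout.
# The file on disk is left unchanged.
latin2_to_utf8() {
    iconv -f ISO-8859-2 -t UTF-8 < "$1"
}

# Example: page through a translated file on a UTF-8 console.
# The file name is the one mentioned above; its location is an assumption.
#   latin2_to_utf8 hints.mc.pl | less
```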
Comment 6 Jakub Moc (RETIRED) gentoo-dev 2005-11-22 13:42:19 UTC
Uhm, from the above, I pretty much see this as an mc-specific issue and a Polish
man pages issue. I can't see any of the mentioned problems with cs_CZ.UTF-8 (the
man pages are definitely displayed correctly), and mc works as well.

IMHO, you should file new bugs about mc and man-pages-pl, this bug looks INVALID
to me (read - not a documentation issue). 
Comment 7 Wiktor Wandachowicz 2005-11-23 05:16:54 UTC
Ok, that's perfectly reasonable.
I withdraw my request and mark the bug invalid.

I'll do more tests and file new bugs if appropriate, as you suggest.

Thanks for your time!