18375 – Enabling UTF-8 support?

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High normal
Assignee:	utf8 herd (RETIRED)

Depends on:	15880
Blocks:	24267

Reported:	2003-03-28 10:56 UTC by Danny Milosavljevic
Modified:	2004-10-25 13:07 UTC
CC List:	23 users

Description Danny Milosavljevic 2003-03-28 10:56:13 UTC

Hi!

Is it just me, or are most of the packages compiled without UTF-8 support? (Even the UTF-8 locales for glibc 2.2 *and* 2.3 are not installed! O_o) If there already is a USE flag for that, I apologize.

Otherwise I'd suggest adding "utf8" to the list of valid USE flags...
(intended to be set before bootstrap so that it takes global effect?)

then add to the ncurses ebuilds:
  IUSE="utf8"
  use utf8 && myconf="${myconf} --enable-widec"

and to the dialog ebuilds:
  IUSE="utf8"
  w=""
  use utf8 && w="w"
  econf --with-ncurses"${w}" || die     <--- instead of just "--with-ncurses"
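
The conditional above can be tried outside Portage with a minimal sketch. The `use` function here is only a hypothetical stand-in for Portage's helper of the same name (it just checks the USE variable), and `ncurses_flag` is an illustrative name, not part of any ebuild.

```shell
# Hypothetical stand-in for Portage's `use` helper: succeeds when the
# given flag appears as a whole word in $USE.
use() {
    case " ${USE:-} " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the configure switch that the dialog ebuild would hand to econf.
ncurses_flag() {
    w=""
    if use utf8; then
        w="w"
    fi
    printf '%s\n' "--with-ncurses${w}"
}

USE="utf8 X"
ncurses_flag
```

With "utf8" in USE this prints the wide-character switch; without it, the plain "--with-ncurses" is printed.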
  
btw dialog's dialogs still look like poo in utf8, but at least the windows are properly aligned now (actually rectangular :->)... maybe a font issue.

The /etc/rc.conf's KEYMAP has already been changed to support unicode (KEYMAP="-u <whatevermap>").

As most of the newer packages already incorporate optional utf8 support, all that is left is turning it on (if it is optional at all, and not already required).

Although ncurses, for one, does the weird thing of renaming its library when utf8 is turned on O_o oh well, I guess it can't be helped. (I linked libncursesw.so to libncurses.so and now it works even with apps not recompiled after that. Why shouldn't it? All programs have to check the locale (e.g. the LANG variable) on STARTUP to decide whether to use utf8, not at compile time [see posix standard] ;))

I "fixed" glibc's utf8 locales for the time being. First I tried:
localedef -v -c -i de_AT -f UTF-8 de_AT.UTF-8
That did not work(?), even with cwd /usr/share/locale.

But then
localedef -v -c -i de_AT -f UTF-8 /usr/share/locale/de_AT.UTF-8
did partially work (XFree/Gtk/Qt all stopped complaining about an unknown locale, and umlauts are now two bytes long, as they should be; I checked using jstar, which isn't utf8 aware, in a utf8 xterm ;))
I don't understand the difference between these two commands in terms of what gets executed, but there seems to be an inconsistency, an undocumented change or a bug somewhere... (or I just overlooked something ;))
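
One possible explanation, going by the localedef man page: an output argument that contains a slash is treated as a directory path, while one without a slash is treated as a locale name and stored wherever that glibc version keeps its compiled locales. The helper below only classifies the argument the same way; `localedef_target` is a made-up name for illustration and it does not run localedef.

```shell
# Classify a localedef output argument: "path" if it contains a slash,
# "name" otherwise. Illustrative helper only; nothing is compiled here.
localedef_target() {
    case "$1" in
        */*) echo "path" ;;
        *)   echo "name" ;;
    esac
}

localedef_target de_AT.UTF-8
localedef_target /usr/share/locale/de_AT.UTF-8
```

So the two commands would write to different places, which could explain why only one of them appeared to work on that particular system.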

Bash has an outstanding "bug": it does not respect a LANG="de_AT.UTF-8" set within a shell (where to set it earlier?) for its own readline "lib" until a child shell is spawned. In the next subshell(s) it works like a breeze. Maybe there is a "reread config" builtin command though? *lost*

Also, for setting the LANG environment variable, is there a config file?
If not, I'd suggest putting:
  /etc/profile.d/lang: 
     source /etc/conf.d/lang
and
  /etc/conf.d/lang:
     LANG="whatever"
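
A minimal sketch of what the profile script could look like, assuming /etc/conf.d/lang holds a plain LANG="..." assignment as proposed above. The function name `load_lang_conf` and its optional path argument are illustrative only.

```shell
# Source the language config file if it is readable, then export LANG
# so that child processes inherit it. Does nothing if the file is missing.
load_lang_conf() {
    conf="${1:-/etc/conf.d/lang}"
    if [ -r "$conf" ]; then
        . "$conf"
        if [ -n "${LANG:-}" ]; then
            export LANG
        fi
    fi
}

load_lang_conf
```

Guarding on readability keeps a login from failing when the config file has not been created yet.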

A per-user setting would also be possible, but that seems like overkill to me.

That's all I can think of for now :->

laters :)

Comment 1 Danny Milosavljevic 2003-04-06 05:24:34 UTC

Also, for the program "screen" to work with color, create the file /etc/profile.d/linux-utf8:
[ "${TERM}" = "linux" ] && export TERM="linux-utf8"

This of course assumes that utf8 is used, which isn't the case when the proposed "USE" flag is not set. Dunno how to check that in bash. Maybe it is enough to check in the corresponding ebuild and only create the file linux-utf8 when the USE flag is set.
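
One way to check at runtime whether a UTF-8 locale is in effect is to look at the locale variables in their usual order of precedence (LC_ALL, then LC_CTYPE, then LANG). This is only a sketch: the function names are made up, and it inspects the variable text rather than asking the C library, so it assumes the charset is spelled out in the locale name.

```shell
# Print the locale name that governs character handling.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Succeed if a locale name carries a UTF-8 codeset suffix.
is_utf8_name() {
    case "$1" in
        *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if the current environment selects a UTF-8 locale.
is_utf8_locale() {
    is_utf8_name "$(effective_ctype)"
}

if is_utf8_locale; then
    echo "UTF-8 locale in effect"
fi
```

Where the `locale` utility is available, comparing the output of `locale charmap` against "UTF-8" is an alternative that asks the C library directly.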


Hey, is anyone actually reading this ? :D

Comment 2 Danny Milosavljevic 2003-04-06 05:49:53 UTC

also for curses related programs to work with the last comment,

cp /usr/share/terminfo/l/linux /usr/share/terminfo/l/linux-utf8

phew...

Comment 3 Danny Milosavljevic 2003-04-06 06:04:10 UTC

still looking into why the border characters (of the program "dialog", for example) look fine in a utf8 xterm, but wrong in a utf8 console... weird...
Can't seem to find any font which does it right(tm).

Comment 4 Danny Milosavljevic 2003-04-17 12:39:26 UTC

To make screen work correctly in a utf8 xterm I had to add the following to /etc/termcap and remove the original "v0|term" entry:

v0|xterm|xterm in Unicode (UTF-8) mode:\
        :am:km:mi:ms:xn:\
        :co#80:it#8:li#24:\
        :AL=\E[%dL:DC=\E[%dP:DL=\E[%dM:DO=\E[%dB:IC=\E[%d@:\
        :K1=\EOw:K2=\EOu:K3=\EOy:K4=\EOq:K5=\EOs:LE=\E[%dD:\
        :RI=\E[%dC:UP=\E[%dA:ae=^O:al=\E[L:as=^N:bl=^G:bt=\E[Z:\
        :cd=\E[J:ce=\E[K:cl=\E[H\E[2J:cm=\E[%i%d;%dH:cr=^M:\
        :cs=\E[%i%d;%dr:ct=\E[3g:dc=\E[P:dl=\E[M:do=^J:ec=\E[%dX:\
        :ei=\E[4l:ho=\E[H:ic=\E[@:im=\E[4h:\
        :is=\E7\E[r\E[m\E[?7h\E[?1;3;4;6l\E[4l\E8\E>:\
        :k0=\E[21~:k1=\E[11~:k2=\E[12~:k3=\E[13~:k4=\E[14~:\
        :k5=\E[15~:k6=\E[17~:k7=\E[18~:k8=\E[19~:k9=\E[20~:\
        :kD=\E[3~:kI=\E[2~:kN=\E[6~:kP=\E[5~:kb=\177:kd=\EOB:\
        :ke=\E[?1l\E>:kh=\EOH:kl=\EOD:kr=\EOC:ks=\E[?1h\E=:\
        :ku=\EOA:le=^H:md=\E[1m:me=\E[m\017:mr=\E[7m:nd=\E[C:\
        :rc=\E8:sc=\E7:se=\E[27m:sf=^J:so=\E[7m:sr=\EM:st=\EH:ta=^I:\
        :te=\E[2J\E[?47l\E8:ti=\E7\E[?47h:ue=\E[24m:up=\E[A:\
        :us=\E[4m:vb=\E[?5h\E[?5l:ve=\E[?25h:vi=\E[?25l:\
        :vs=\E[?25h:

Note that the whitespace at the beginning of every continuation line here is "<space><tab>".

Now I can do:
xterm -u8 -e screen
and that's good :D

Comment 5 Markus Bertheau (RETIRED) gentoo-dev

2003-04-17 15:21:03 UTC

Danny: doesn't screen -U do what you want?

Comment 6 Danny Milosavljevic 2003-04-25 12:36:49 UTC

No, it's the same as without -U.
What does partially do what I want is
export TERM="xterm-utf8"
then screen
but many MANY programs can't cope with that setting (did I mention MANY already? ;))

However, with that termcap modification I didn't have a single problem until now (my "xterm" launcher does now include "screen" per default, pretty cool ;))

Comment 7 Danny Milosavljevic 2003-06-01 07:06:21 UTC

to fix the de-latin1 keymap, comment out the line

alt keycode 13 = Meta_acute

in /usr/share/keymaps/i386/qwertz/de-latin1.map so that it looks like so:

#alt keycode 13 = Meta_acute
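
The same edit can be scripted. The sketch below comments out one exactly matching line in a file; `comment_out_line` is an illustrative name, and the temporary-file approach is just one way to do it without relying on a particular sed's in-place option.

```shell
# Prefix every line that exactly equals $2 with "#", rewriting file $1.
# The target is passed through the environment so awk does not
# interpret backslashes in it; both sides are forced to strings.
comment_out_line() {
    file=$1
    tmp="${file}.tmp.$$"
    TARGET="$2" awk '($0 "") == (ENVIRON["TARGET"] "") { print "#" $0; next } { print }' \
        "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For the keymap above the call would be `comment_out_line /usr/share/keymaps/i386/qwertz/de-latin1.map "alt keycode 13 = Meta_acute"`, run as root, followed by reloading the keymap (for example with loadkeys).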

Comment 8 Danny Milosavljevic 2003-06-01 07:14:19 UTC

What happened? Nobody interested in UTF-8 support?

If you are interested, tell me how I can help better :)

I've already helped three people set up their gentoo with utf-8 support, but after the third time it gets kinda repetitive and boring ;) what about fixing the official gentoo huh ? :) (ok, maybe moving into stable *after* 1.4, but what about experimental ebuilds for glibc, baselayout, ncurses and dialog which fix utf-8 ?)

Thanks...

Comment 9 Thomas Scheffler 2003-07-11 12:19:27 UTC

Hi,

I'm interested in utf-8 support, too. I really cannot imagine why Gentoo lacks support for it when every major distribution has implemented it. Is this a feature?
Does somebody at Gentoo already care about it? I mean, this bug is quite "old" now.

Comment 10 Alastair Tse (RETIRED) gentoo-dev

2003-07-11 13:08:07 UTC

i'm interested in getting utf-8 support on gentoo. i was wondering how far ska-fan has got with this. if he's not interested, i wouldn't mind this being assigned to me so i can work on this a bit more.

Comment 11 Markus Bertheau (RETIRED) gentoo-dev

2003-07-11 13:37:37 UTC

Feel free.

Comment 12 Thomas Scheffler 2003-07-12 14:15:47 UTC

How long do you think it would take to implement this and provide masked ebuilds? I'm willing to test it when it comes to that point. I find it great that somebody seems to care. Post a message when you're ready.

Comment 13 Danny Milosavljevic 2003-07-18 09:42:19 UTC

yay yay :)  Hi, liquidx

Ok, since the above stuff is rather unordered, I will make a summary of the 
valid points to ease the task:
1) The ncurses ebuild needs --enable-widec, and a link from libncursesw.so to libncurses.so
2) The /etc/rc.conf's KEYMAP has already been changed to support unicode (KEYMAP="-u <whatevermap>").
3) to create a UTF-8 locale: localedef -v -c -i de_AT -f UTF-8 de_AT.UTF-8 (DOES work now)
 (of course replace de_AT by your favourite languages ^^)
4) set LANG to "??_??.UTF-8"; strictly speaking, there should be a way to set LANG before bash is even started, i.e. between login and starting bash, dunno how. bash internal readline is rather spooky with this.
5) Modified termcap entry for xterm (Comment #4), this fixed loads of things
6) to fix the de-latin1 keymap, comment out the line "alt keycode 13 = Meta_acute" in "/usr/share/keymaps/i386/qwertz/de-latin1.map" (Comment #7)
7) dialog ebuild needs --with-ncursesw, but is evil enough in console even with it. (not a single problem in xterms tho)

Now invalid points:
- Forget that cp terminfo crap I wrote (Comment #2)
- localedef miraculously does not need full path anymore now ;)
- forget that export TERM="linux-utf8" (Comment #1), I was just stupid

TODO:
- Console UTF-8 border characters (Comment #3)... console is evil...

I hope this helps :)

Comment 14 Zhen Lin 2003-07-19 09:19:45 UTC

Unicode in the 0xb8000/0xb0000 console is not needed. The console only supports 256 glyphs!

Plan 9 (plan9.bell-labs.com) takes a different approach - EVERYTHING supports UTF-8 and only UTF-8 (and therefore, 7-bit ASCII). They don't have a text console,  only a graphical (pixel, not character) terminal, in order to render Unicode glyphs.

Does the framebuffer support Unicode glyphs? If not, maybe UTF-8 in console is not worth it.

A reason why UTF-8 dialog breaks so badly may be because the box drawing characters are represented differently in UTF-8, compared to 8-bit OEM/DOS

Comment 15 Markus Bertheau (RETIRED) gentoo-dev

2003-07-19 09:34:23 UTC

The vga text console on x86 supports at least 512 glyphs.

Comment 16 Carlos Henrique Bauer 2003-07-29 08:06:17 UTC

I'm trying to make UTF-8 work on my machine, too. Here is what I have
worked out so far.
 
1) According to some of the above comments, the correct locale data
directory path is /usr/share/locale/*, but when I run "strace program" I
can see the program is looking for the locale files in
/usr/lib/locale/* (glibc-2.3.2-1).
 
For example, if I run "strace vi" I can see the following in the output:

open("/usr/lib/locale/pt_BR.UTF-8/LC_IDENTIFICATION", O_RDONLY)
 
As a matter of fact, after I changed the output of localedef to
/usr/lib/locale, I no longer got "Locale not supported by C library"
messages from programs.

In Red Hat 7.2 and 9.0, the UTF-8 locale data directories are stored in /usr/lib/locale.
 
So, what is the correct path for UTF-8 locales?
 
2) I configured the console font as LatArCyrHeb-16. I tested the
border characters with alsamixer and they seem to be fine.
 
The problem: the font is being set by /etc/init.d/consolefont just for
the 1st virtual terminal. In the other ones, all the colored
characters are incorrectly displayed.
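
The strace check from point 1 can be narrowed down to just the locale files a program tries to open. This is a rough sketch: `locale_paths_from_strace` is a made-up helper that filters strace output on stdin, and it assumes the usual `open("path", flags)` line format shown above.

```shell
# Read strace output on stdin and print the unique quoted paths that
# contain "/locale/" from open()/openat() calls.
locale_paths_from_strace() {
    grep -E 'open(at)?\(' |
        grep -o '"[^"]*/locale/[^"]*"' |
        tr -d '"' |
        sort -u
}
```

A possible use is `strace -o trace.log vi`, quitting the editor, and then running `locale_paths_from_strace < trace.log` to see which directory the C library actually searches.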

Comment 17 Danny Milosavljevic 2003-07-31 00:29:30 UTC

1) In current glibc versions, localedef can create locales without the full path (just the name is fine); they also changed the real path for locales to /usr/lib/locale.

2) Oh, I had the problem that not every console was set to utf-8... as a (hacky whacky) workaround I used:
unicode_start >>/etc/issue

then it worked ;)
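
A less hacky variant of the same idea is to send the console's "select UTF-8" escape sequence (ESC % G, per the Linux console_codes documentation) to each virtual terminal directly. This is a sketch under assumptions: `send_utf8_select` is a made-up name, the device list is whatever consoles exist on the machine, and it only covers the output side; the keyboard side still needs Unicode mode (for example via kbd_mode -u).

```shell
# Write the ESC % G sequence to every device (or file) given as argument.
send_utf8_select() {
    for dev in "$@"; do
        printf '\033%%G' > "$dev"
    done
}
```

Run as root, a call such as `send_utf8_select /dev/tty1 /dev/tty2 /dev/tty3` would cover the first three consoles.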

But the font was always correctly set here... do you mind attaching your /etc/init.d/consolefont here so I can take a look what it is doing ?  :)

Comment 18 Neil Watson 2003-08-18 08:46:07 UTC

I definitely agree that UTF-8 support should be added to Gentoo.  Having it 
default would be the best.  I've managed to get xterm (using uxterm) to view 
UTF-8 but, things like console and even glibc seem to be compiled without 
support.

Comment 20 Alexander Winston 2003-08-29 22:40:26 UTC

UTF-8 support is certainly one area that is lacking in Gentoo that should not
be. Recently, due to some inexplicable hardware problems with my Gentoo install,
I decided to survey the competition, namely Red Hat Linux 9, aka Shrike. (For
some reason this seems to happen about once a week; I suppose I'm really good at
breaking operating systems.) The UTF-8 support is absolutely impeccable; I've yet
to find a flaw. Consoles, terminals, libraries, applications: they all seem to
have perfect UTF-8 support. How Red Hat manages to pull off this trick I
may never know, but I sincerely hope it can be mirrored in Gentoo Linux.

Comment 22 Alexander Winston 2003-08-29 22:43:50 UTC

DISCLAIMER: Red Hat is not the competition. I'm just using the term so I can
weave an interesting yarn for you folks. :)

Comment 23 Zhen Lin 2003-09-01 05:58:29 UTC

How does Red Hat pull off this trick? Massive patches. If only there was an easy way to systematically pull patches out of a SRPM...

In any case, look at bug #27700 for some ebuilds. slang is the first one to utilise the RH9 patches.

Comment 24 Patrick Kursawe (RETIRED) gentoo-dev

2003-09-11 23:41:28 UTC

Danny: I don't find an entry starting with "v0" in my /etc/termcap. On my system, this file is provided by libtermcap-compat-1.2.3

Comment 25 Danny Milosavljevic 2003-09-25 02:41:56 UTC

Patrick: 
Hmm, you are right... massive overhaul in termcap. I'll check whether the new termcap
fixes utf-8 for good when I get home today...

Comment 26 Thomas Raschbacher gentoo-dev

2003-10-04 07:52:48 UTC

hey dannym... any news? :)

Comment 27 Thomas Scheffler 2003-10-28 04:07:28 UTC

There is one console input issue (32111) with utf-8 so I add dependency to
that bug here.

Comment 28 Thomas Raschbacher gentoo-dev

2003-12-08 23:13:12 UTC

is anyone actually working on this?

regards

Comment 29 Rui Malheiro 2003-12-28 06:24:45 UTC

I'm also working on getting a full UTF-8 gentoo box. My main complaint is that glibc should include as many *.UTF-8 locales as possible.

Even if precompiling all possible locales is not the solution, during compile we could test for USE=utf8 and $LANG in order to run localedef as a final step of emerge glibc. Or maybe use a make.conf variable, LOCALEDEF.
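
As a rough illustration of the make.conf idea, the sketch below turns a space-separated list of locale names into the localedef calls that would build them. It only prints the commands (a dry run). The list format is an assumption, since the suggestion above does not define one, and entries are assumed to have the plain form language_TERRITORY.CHARMAP without an @modifier.

```shell
# Print one localedef command per locale name in the space-separated
# list $1. Names without a ".CHARMAP" part are skipped with a warning.
localedef_commands() {
    for name in $1; do
        case "$name" in
            *.*) ;;
            *) echo "skipping $name: no charmap given" >&2; continue ;;
        esac
        input=${name%%.*}     # e.g. de_AT
        charmap=${name#*.}    # e.g. UTF-8
        printf 'localedef -c -i %s -f %s %s\n' "$input" "$charmap" "$name"
    done
}

# Hypothetical value, as it might be set in make.conf.
LOCALEDEF="de_AT.UTF-8 pt_BR.UTF-8"
localedef_commands "$LOCALEDEF"
```

Piping the output to a shell as root would actually build the locales; keeping it as a dry run makes the list easy to review first.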

Comment 30 Danny Milosavljevic 2004-02-12 05:41:53 UTC

The status is that, with these steps, a fully working utf-8 based system can be achieved. Now someone modify the ebuilds in portage already :->

Still needed is an ugly bash workaround to make bash's internal readline recognize utf-8 (in .bash_login: (call a new subshell) bash; that's it)

Also the de keymap has some weirdness with Meta_acute... I don't really know what it is about. I saw some comment about "the key below the 4"; if that's it, it's "E", and AltGr + E = Euro-Sign. This should not be present in de, but only in de@euro, though. Speculating here.
Can someone shed a light on this ?

As for to-be-used locale configuration, are there plans to add a variable to make.conf for limiting to-be-installed locales ?
I have dozens of locales I'll never use, however, the locales I want to use (de_AT.UTF-8) are missing.

comments?

Comment 31 Alexander Jenisch 2004-02-18 08:19:09 UTC

i've tried localedef -v -c -i de_AT -f UTF-8 de_AT.UTF-8. shouldn't this make a de_AT.UTF-8 directory in /usr/share/locale?

Comment 32 Sergey Kuleshov (RETIRED) gentoo-dev

2004-02-22 10:29:00 UTC

I am also interested in UTF8. So far I have patched slang and mc to get them working with UTF. But I faced the following troubles:

1) In MC I can't enter multibyte characters (those not from the ASCII set)
2) In some applications compiled against ncursesw some (actually most) of the gettext translated strings are not displayed at all - try centericq with LC_ALL=ru_RU.UTF-8
3) Border chars still don't work :(

Comment 33 Danny Milosavljevic 2004-03-04 07:48:16 UTC

Alexander:

no, newer stuff goes into a database file in /usr/lib/locale

You should be able to tell if it works by doing
LANG="de_AT.UTF-8" anygtk2program

if it complains, it doesn't :)

Comment 34 Danny Milosavljevic 2004-03-07 02:37:02 UTC

LANG="blahblah" locale charmap
is a good way to see if it works
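
Another check is to look for the locale in the list of installed ones. The sketch below compares names after normalizing the codeset (lowercase, hyphens removed), on the assumption that this matches how glibc's `locale -a` spells them, e.g. "de_AT.utf8". Both function names are made up for illustration.

```shell
# Normalize "xx_YY.Codeset[@modifier]": lowercase the codeset and drop
# hyphens. Names without a codeset are printed unchanged.
normalize_locale() {
    case "$1" in
        *.*)
            lang=${1%%.*}
            rest=${1#*.}
            codeset=${rest%%@*}
            modifier=""
            case "$rest" in
                *@*) modifier="@${rest#*@}" ;;
            esac
            codeset=$(printf '%s' "$codeset" | tr '[:upper:]' '[:lower:]' | tr -d '-')
            printf '%s.%s%s\n' "$lang" "$codeset" "$modifier"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

# Succeed if locale $1 appears in the list of names read from stdin.
locale_in_list() {
    wanted=$(normalize_locale "$1")
    while IFS= read -r line; do
        if [ "$(normalize_locale "$line")" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}
```

Typical use would be `locale -a | locale_in_list de_AT.UTF-8 && echo installed`.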

Comment 35 Danny Milosavljevic 2004-03-07 02:40:37 UTC

Svyatogor:

hmm... confirming the mc problem... maybe the author knows details ?

as for borderchars, there are two ways to get them working in a linux console:
1) use framebuffer or
2) use a special font, sacrifice the bold attribute and have 256 normal chars and 256 line drawing chars...

Comment 36 Lars Weiler (RETIRED) gentoo-dev

2004-03-17 18:05:44 UTC

I'm also testing utf-8 on my machine.  The problem I currently run into is described in bug 20006 (mutt with ncursesw).  mutt compiles fine with it, but vim does not.  So I decided to install both versions of ncurses in parallel on my system.  I don't know if that is possible with the ebuild, as it has to be compiled twice (once with --enable-widec and once without).  I read that this should work on LFS, maybe it's working for us, too?

Comment 37 Heinrich Wendel (RETIRED) gentoo-dev

2004-04-11 05:51:00 UTC

ncurses and slang change the names of their library when compiled with utf-8 support, should we

a.) symlink them to the old names
b.) install the non-utf8 version as well (like redhat)
c.) don't do anything

Comment 38 Zhen Lin 2004-04-11 05:53:49 UTC

Option (b) is good for backwards compatibility, but it means that apps will compile with non UTF-8 ncurses by default.

Option (a) is not very good... I prefer to use a ldscript. See bug #27700

Comment 39 Heinrich Wendel (RETIRED) gentoo-dev

2004-04-11 06:51:11 UTC

what are the advantages of a ld-script over a symlink?

Comment 40 Zhen Lin 2004-04-11 08:04:37 UTC

An ldscript will cause resulting binaries to be linked to the *w name. It will also cause binaries linked to the non-wide character version to break. The good thing about this is that, if the wide character version is source-compatible but binary-incompatible (like slang), it will force recompilation instead of causing strange bugs. ldscripts are already used by ncurses -- see bug #4411.

But that's just my view.
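
For readers unfamiliar with the idea, here is a minimal sketch of such an ldscript, assuming GNU ld: a small text file placed under the old library name that redirects the link editor to the wide-character library. The function name and the file path in the usage line are illustrative, and this is not the code from the ncurses ebuild.

```shell
# Write a GNU ld script to $1 that makes "-l<oldname>" resolve to the
# library named $2 at link time.
write_ldscript() {
    out=$1
    lib=$2
    printf '/* GNU ld script */\nINPUT ( -l%s )\n' "$lib" > "$out"
}
```

For example, `write_ldscript ./libncurses.so ncursesw` creates a libncurses.so that only affects linking: programs built with -lncurses then record libncursesw as their dependency, while already-built binaries get no runtime symlink to fall back on.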

Comment 41 Heinrich Wendel (RETIRED) gentoo-dev

2004-04-24 08:12:57 UTC

ok, we really need this settled, can anybody confirm what Zhen said?

Comment 42 Heinrich Wendel (RETIRED) gentoo-dev

2004-08-19 09:41:50 UTC

I committed utf-8 enabled ncurses, slang and dialog ebuilds. You have to set "unicode" in your USE flags and emerge them; please test them.

Comment 43 Heinrich Wendel (RETIRED) gentoo-dev

2004-08-19 09:43:16 UTC

i couldn't patch mc yet, the utf-8 patch seems to be incompatible with the latest security fixes

Comment 44 Danny Milosavljevic 2004-09-05 11:00:42 UTC

bash 2 has a problem with initial locale setting which I only circumvented earlier.

The correct solution is this patch:
http://lists.debian.or.jp/debian-devel/200210/msg00047.html

Symptoms are that in a newly started login shell, backspace works wrong, and if one starts another sub-bash *without* changing anything, it suddenly works there.

bash 3 has fixed that already.
bash 2 ebuilds should incorporate this patch.

Comment 45 Heinrich Wendel (RETIRED) gentoo-dev

2004-09-16 06:45:42 UTC

I opened a new bug for the bash problem and fixed mc in the meantime. I will close this meta bug now. If you have problems with utf8, open a new bug and assign it to utf8@gentoo.org.
