Summary: | sys-libs/glibc: SHIFT_JIS has a different encoding behavior when generated with JIS_X0201 | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Jey <jehan> |
Component: | [OLD] Core system | Assignee: | Gentoo Toolchain Maintainers <toolchain> |
Status: | RESOLVED INVALID | ||
Severity: | normal | ||
Priority: | High | ||
Version: | unspecified | ||
Hardware: | x86 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- |
Description
Jey
2008-08-23 18:54:47 UTC
I have just made a small test. If I do the same, but invert the lines in locale.gen:

    ja_JP JIS_X0201
    ja_JP SHIFT_JIS

then I of course get the same errors on screen (but only JIS_X0201 "apparently" fails at the end), and now both encodings are shown as SHIFT_JIS:

    # LC_ALL=ja_JP.shiftjis locale charmap
    SHIFT_JIS
    # LC_ALL=ja_JP.jisx0201 locale charmap
    SHIFT_JIS

As though the last one is the "winner". I also have a ja_JP.UTF-8, but it still shows UTF-8 as its charmap, and I have several fr_FR locales with different encodings without any such "encoding stealing". It happens only for these 2 lines. I guess this may be related to the errors displayed? Or maybe to the fact that JIS_X0201 is "used" in ISO_2022, on which SHIFT-JIS is apparently based (as far as I understood). Anyway, there is an error somewhere.

You definitely know what is going on here better than I do. Could you please file a bug upstream, as this behaviour seems to have existed for a very long time?
http://sources.redhat.com/bugzilla/

i'll take care of triaging/moving upstream. re-opening until that happens.

the warning from trying to generate ja_JP.SHIFT_JIS should be there. if you look at the character map, it is slightly modified from standard ASCII:

    byte 0x5C should be \ but it's ¥ instead
    byte 0x7E should be ~ but it's ‾ instead

ISO C requires characters 0x00 through 0x7C have the same values as ASCII. this one does not, hence you get a warning.
https://en.wikipedia.org/wiki/Shift_JIS#Shift_JIS_byte_map

for the 2nd part, your config file is invalid. the first col needs to be unique because that's the value used when setting locale variables. so when you do:

    ja_JP SHIFT_JIS

this allows you to do LANG=ja_JP and it'll be the same as ja_JP.SHIFT_JIS. but when you then do:

    ja_JP SHIFT_JIS
    ja_JP JIS_X0201

the 2nd entry clobbers the first one. you instead want to do:

    ja_JP.SHIFT_JIS SHIFT_JIS
    ja_JP.JIS_X0201 JIS_X0201

now you can do LANG=ja_JP.SHIFT_JIS and LANG=ja_JP.JIS_X0201.
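to make the "first col needs to be unique" rule concrete, here is a rough sketch of a duplicate check over locale.gen-style text. it is only an illustration, not the actual check that locale-gen performs; the function name `find_clobbered` and the two sample configs are made up for this example, and it assumes entries are simply `<locale> <charmap>` with `#` comments.

```python
from collections import defaultdict


def find_clobbered(locale_gen_text):
    """Return {locale_name: [charmaps, ...]} for names listed more than once.

    Hypothetical helper: assumes each entry is "<locale> <charmap>" and
    that "#" starts a comment. Not the real locale-gen logic.
    """
    seen = defaultdict(list)
    for raw in locale_gen_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2:
            continue  # skip anything that is not a two-column entry
        name, charmap = fields
        seen[name].append(charmap)
    # A name that appears more than once means later entries overwrite earlier ones.
    return {name: maps for name, maps in seen.items() if len(maps) > 1}


# The config from the report: both entries share the name "ja_JP".
BAD = """\
ja_JP SHIFT_JIS
ja_JP JIS_X0201
ja_JP.UTF-8 UTF-8
"""

# The suggested fix: every first column is unique.
GOOD = """\
ja_JP.SHIFT_JIS SHIFT_JIS
ja_JP.JIS_X0201 JIS_X0201
ja_JP.UTF-8 UTF-8
"""

print(find_clobbered(BAD))   # {'ja_JP': ['SHIFT_JIS', 'JIS_X0201']}
print(find_clobbered(GOOD))  # {}
```

with the reporter's config, `ja_JP` shows up with two charmaps, which is the clobbering described above; the fixed config reports nothing.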
i think the default for LANG=ja_JP should be:

    ja_JP EUC-JP

although you're free to set it however you like on your system.

the locale.gen config file is misleading here in its comments, so i cleaned that up:
http://sources.gentoo.org/gentoo/src/patchsets/glibc/extra/locale/locale.gen?r1=1.1&r2=1.2
http://sources.gentoo.org/gentoo/src/patchsets/glibc/extra/locale/locale.gen.5?r1=1.3&r2=1.4

also the locale-gen tool should catch & warn about this, so i fixed that too:
http://sources.gentoo.org/gentoo/src/patchsets/glibc/extra/locale/locale-gen?r1=1.37&r2=1.38

for the last part, all the spew when trying to generate ja_JP.JIS_X0201 is correct. let's break it down one at a time. the first warning:

    /usr/share/i18n/locales/ja_JP:14877: LC_MESSAGES: unknown character in field `yesexpr'

if we look at yesexpr in that file, it has:

    yesexpr "<U005E><U0028><U005B><U0079><U0059><UFF59><UFF39><U005D>/
    <U007C><U306F><U3044><U007C><U30CF><U30A4><U0029>"

if we look at all the characters defined in /usr/share/i18n/charmaps/JIS_X0201, we see that it does not define these two that are used in the yesexpr:

    は <U306F> /xe3/x81/xaf HIRAGANA LETTER HA
    い <U3044> /xe3/x81/x84 HIRAGANA LETTER I

and if we consult the encoding for JIS_X_0201, we see that while it provides the katakana alphabet, it does not provide any hiragana characters:
https://en.wikipedia.org/wiki/JIS_X_0201

so localedef complains that it is not possible to create a "yesexpr" because it wants to include hiragana, but the encoding only supports katakana. while the warning is confusing, it's more or less WAI (working as intended).

i think i covered everything, albeit not exactly timely ;).
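for anyone who wants to see which symbols in that yesexpr are hiragana, a small sketch follows. it decodes the `<Uxxxx>` symbols quoted above and picks out the hiragana ones by Unicode character name. this is not how localedef does its check (localedef compares against the charmap file); the helper names `symbols` and `hiragana_in` are invented for the example, and the yesexpr string is copied from the comment above with the line continuation removed.

```python
import re
import unicodedata

# yesexpr from /usr/share/i18n/locales/ja_JP as quoted in this bug,
# with the "/" line continuation removed.
YESEXPR = (
    "<U005E><U0028><U005B><U0079><U0059><UFF59><UFF39><U005D>"
    "<U007C><U306F><U3044><U007C><U30CF><U30A4><U0029>"
)


def symbols(expr):
    """Return the code points named by <Uxxxx> symbols, in order."""
    return [int(h, 16) for h in re.findall(r"<U([0-9A-Fa-f]{4,8})>", expr)]


def hiragana_in(expr):
    """Return (code point, Unicode name) for each hiragana symbol in expr."""
    found = []
    for cp in symbols(expr):
        name = unicodedata.name(chr(cp), "")
        if name.startswith("HIRAGANA"):
            found.append((f"U+{cp:04X}", name))
    return found


# The regular expression the symbols spell out.
print("".join(chr(cp) for cp in symbols(YESEXPR)))
# The symbols an encoding without hiragana cannot represent.
print(hiragana_in(YESEXPR))
# [('U+306F', 'HIRAGANA LETTER HA'), ('U+3044', 'HIRAGANA LETTER I')]
```

the decoded expression accepts "y"/"Y" (plain and fullwidth), "はい" and "ハイ"; the two hiragana letters are the ones listed above as missing from the charmap.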