
Bug 868159

Summary: sys-libs/glibc: locale-gen: revisit special treatment of C.UTF-8
Product: Gentoo Linux Reporter: Sam James <sam>
Component: Current packages Assignee: Gentoo Toolchain Maintainers <toolchain>
Status: UNCONFIRMED ---    
Severity: normal CC: gentoo, kfm
Priority: Normal    
Version: unspecified   
Hardware: All   
OS: Linux   
Whiteboard:
Package list:
Runtime testing required: ---

Description Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-03 04:59:46 UTC
C.UTF-8 is in glibc as of 2.35 so we didn't need to patch it in anymore (which plenty of other distros did too).

From #gentoo:
[07:06:45]  <ormaaaj> sam_: I noticed eselect locale gets its list by parsing the `locale -a` output, and locale gets them from the locale-archive. Since glibc versions started including the C.UTF-8 locale, it outputs "C.utf8". I logged into a debian system from before that was added and theirs uniquely didn't accept C.utf8. I didn't dig into how locale-gen gets the
[07:06:45]  <ormaaaj> names - I assume either the locale defs or an internal enum.
[07:07:38]  <ormaaaj> So the way we're generating these seems to not even be universally compatible with all glibc versions. `locale -m` output is inconsistent with `locale -a`, but looks more correct and in line with e.g. https://encoding.spec.whatwg.org/#names-and-labels. it might be better to map the locale -a names to those to construct the string.
[07:19:10]  <tirnanog> arch is like that too. it doesn't tolerate C.utf8, does tolerate C.UTF-8, yet for everything else - e.g. en_US.utf8 vs en_US.UTF-8 - it doesn't matter at all (which is traditional glibc behaviour). that gentoo tolerates C.utf8 is, at least, consistent. I don't understand why these differences exist.
[07:23:07]  <ormaaaj> I think it's due to the way distros were "patching" in their own definition before it went upstream relatively recently. IIRC gentoo was one of them.
[07:25:49]  <tirnanog> yeah. still, C.UTF-8 is officially supported as of glibc 2.35, I think. so why would arch and gentoo, taking those two as an example, be different now? the arch behaviour seems off to me. ".utf8" has always worked; it's odd for C to be treated any differently from the others.
[07:26:35]  <tirnanog> I didn't look into it yet so I don't have any answer.
[07:34:05]  <tirnanog> one visible artifact of that distinction is that, in the affected distros, locale -a appears to show "C.UTF-8" while showing the ".utf8" suffix for other locales, whereas gentoo shows only the ".utf8" suffix and always accepts it both ways.
[07:51:29]  <ormaaaj> Looks like it just greps them out of the locale-archive file. Just a sloppy implementation.
[07:58:32]  <ormaaaj> http://dpaste.com/HFWS4HXEG
[08:06:43]  <tirnanog> hmm. it gets weirder. gentoo has a novel locale-gen which always includes "C.UTF-8 UTF-8" in the course of generating an archive. arch doesn't. basically: if you put "C.UTF-8 UTF-8" in locale.gen, either of C.UTF-8 or C.utf8 are accepted as valid locale names. if you don't, only C.UTF-8 is (in glibc 2.35+, that is, whether it be visible in the locale archive or not).
[08:06:59]  <ormaaaj> $() part is the intended paste. broken alias.
[08:08:08]  <ormaaaj> hm
[08:08:44]  <tirnanog> that explains why C.utf8 works in gentoo then. if I add "C.UTF-8 UTF-8" to locale.gen in arch and run locale-gen, C.utf8 suddenly starts working there too, in addition to C.UTF-8. oh, and you get _both_ representations showing up in locale -a thereafter. seems rather messy.
[08:09:46]  <tirnanog> in short, glibc 2.35 and onwards will always support C.UTF-8 but not C.utf8 unless the locale was explicitly generated and incorporated into the archive.
[08:15:04]  <tirnanog> I think gentoo used to patch support for C.UTF-8 in prior to 2.35. I suppose shoehorning it in via the locale-gen script is now an anachronism.
[08:15:16]  <tirnanog> still, it all seems pretty messy on the glibc side.
[08:15:54]  <tirnanog> ultimately, a solid case for not always writing it out properly as "UTF-8".
[08:16:02]  <tirnanog> er, for always, I mean.
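
The acceptance differences described in the log above can be checked on any given system with a small probe. What follows is a minimal Python sketch, not part of the original report: `probe_locale_names` is a hypothetical helper name, and the results depend entirely on the installed glibc version and on what has been generated into the locale archive.

```python
import locale


def probe_locale_names(names):
    """Return {name: bool} saying whether setlocale() accepts each name.

    A plain string is handed to the C library's setlocale() unchanged,
    so the answer reflects what libc itself accepts on this system.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    results = {}
    try:
        for name in names:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
                results[name] = True
            except locale.Error:
                results[name] = False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore what was there
    return results


if __name__ == "__main__":
    candidates = ["C.UTF-8", "C.utf8", "en_US.UTF-8", "en_US.utf8"]
    for name, accepted in probe_locale_names(candidates).items():
        print(f"{name}: {'accepted' if accepted else 'rejected'}")
```

If the log is accurate, a glibc 2.35+ system without "C.UTF-8 UTF-8" in locale.gen would be expected to accept C.UTF-8 and reject C.utf8, while a system that generated the locale would accept both. No output is shown here because it varies by system.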
Comment 1 Andreas K. Hüttel archtester gentoo-dev 2022-09-09 20:56:57 UTC
There's two things here.

(In reply to Sam James from comment #0)
> C.UTF-8 is in glibc as of 2.35 so we didn't need to patch it in anymore
> (which plenty of other distros did too).

And we dropped our patch.

> [08:06:43]  <tirnanog> hmm. it gets weirder. gentoo has a novel locale-gen
> which always includes "C.UTF-8 UTF-8" in the course of generating an
> archive. arch doesn't. basically: if you put "C.UTF-8 UTF-8" in locale.gen,
> either of C.UTF-8 or C.utf8 are accepted as valid locale names. if you
> don't, only C.UTF-8 is (in glibc 2.35+, that is, whether it be visible in
> the locale archive or not).
[...]
> [08:15:04]  <tirnanog> I think gentoo used to patch support for C.UTF-8 in
> prior to 2.35. I suppose shoehorning it in via the locale-gen script is now
> an anachronism.
> [08:15:16]  <tirnanog> still, it all seems pretty messy on the glibc side.

We still need this because too many things break if *no* UTF-8 locale is available. Think Python.
[And someone would remove it for sure. "Mah don't need no stinkin unicode."]
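
Whether any UTF-8 locale shows up in a `locale -a` listing can be checked with a few lines of Python. This is only an illustrative sketch: `utf8_locales` is a hypothetical helper, and the sample listing is made up rather than taken from a real system.

```python
import re

# Matches a codeset suffix such as ".utf8", ".UTF-8" or ".utf-8",
# optionally followed by a modifier such as "@euro".
_UTF8_SUFFIX = re.compile(r"\.utf-?8(@.*)?$", re.IGNORECASE)


def utf8_locales(listing):
    """Return the names in a `locale -a`-style listing that use UTF-8."""
    return [
        line.strip()
        for line in listing.splitlines()
        if _UTF8_SUFFIX.search(line.strip())
    ]


# Made-up sample listing, one locale name per line.
sample = """C
C.utf8
POSIX
en_US
en_US.iso88591
en_US.utf8
"""

print(utf8_locales(sample))  # ['C.utf8', 'en_US.utf8']
```

An empty result does not necessarily mean that no UTF-8 locale is usable: according to the log in comment #0, glibc 2.35+ accepts C.UTF-8 whether or not it is visible in the locale archive.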
Comment 2 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-09-09 22:56:57 UTC
The bug was for the utf8 vs UTF-8 issue.
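
To make the utf8 vs UTF-8 distinction concrete, here is a minimal Python sketch of comparing two locale names while ignoring how the codeset is spelled. It is loosely modelled on the traditional behaviour described in comment #0, where en_US.utf8 and en_US.UTF-8 are interchangeable. The normalization rule used here (lowercase, drop anything that is not a letter or digit) is an assumption for illustration and not a copy of glibc's code, and all function names are hypothetical.

```python
def split_locale_name(name):
    """Split 'lang_TERRITORY.codeset@modifier' into (base, codeset, modifier)."""
    rest, _, modifier = name.partition("@")
    base, _, codeset = rest.partition(".")
    return base, codeset, modifier


def normalize_codeset(codeset):
    """Lowercase the codeset and drop anything that is not a letter or digit."""
    return "".join(ch for ch in codeset.lower() if ch.isalnum())


def same_locale(a, b):
    """True if two names differ only in how the codeset is spelled."""
    base_a, codeset_a, modifier_a = split_locale_name(a)
    base_b, codeset_b, modifier_b = split_locale_name(b)
    return (base_a, normalize_codeset(codeset_a), modifier_a) == (
        base_b,
        normalize_codeset(codeset_b),
        modifier_b,
    )


print(same_locale("C.UTF-8", "C.utf8"))            # True
print(same_locale("en_US.UTF-8", "en_US.utf8"))    # True
print(same_locale("en_US.UTF-8", "en_GB.UTF-8"))   # False
```

Under this rule C.utf8 and C.UTF-8 compare equal, which is the permissive behaviour the log attributes to systems where the C.UTF-8 locale has been generated into the archive.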
Comment 3 Andreas K. Hüttel archtester gentoo-dev 2023-05-08 21:53:58 UTC
(In reply to Sam James from comment #2)
> The bug was for the utf8 vs UTF-8 issue.

Then I don't understand what the problem is; for some reason we are just more permissive?