Files below /etc should not have UTF-8 filenames without a good reason.

'NetLock_Arany_=Class_Gold=_Főtanúsítvány.pem' -> '../../../usr/share/ca-certificates/mozilla/NetLock_Arany_=Class_Gold=_Főtanúsítvány.crt'
is created by certdata2pem.py.

Ideally certdata2pem.py generates filenames with printable characters only.

Reproducible: Always
> files below /etc should not have UTF-8 filenames without a good reason

Please cite the source of this rule/policy.
We discussed it in #gentoo-dev, and being unable to type the name of a file in /etc seems reasonable enough as motivation.
So, what characters should be allowed? Any ASCII except NUL and / (which includes control characters)? Or printable ASCII U+0021 to U+007e only (note that /\:*"?<>| may be problematic on some filesystems)? Or only the POSIX Portable Filename Character Set as defined in https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_282? That may be too limited since it excludes e.g. the plus sign.
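To make the three candidate character sets concrete, here is a minimal Python sketch that reports the strictest set a given filename fits into. The function name classify and the labels it returns are hypothetical, chosen only for illustration; they are not part of Portage or certdata2pem.py.

```python
import string

# POSIX Portable Filename Character Set: A-Z a-z 0-9 period underscore hyphen
PORTABLE = frozenset(string.ascii_letters + string.digits + "._-")


def classify(name: str) -> str:
    """Return the strictest candidate set that `name` fits into.

    Hypothetical helper for illustration only.
    """
    if not name:
        raise ValueError("empty filename")
    if all(c in PORTABLE for c in name):
        return "posix-portable"
    # Printable ASCII U+0021..U+007E, minus the path separator.
    if all("\u0021" <= c <= "\u007e" and c != "/" for c in name):
        return "printable-ascii"
    # Any ASCII except NUL and '/', which admits space and control characters.
    if all(ord(c) < 0x80 and c not in "\0/" for c in name):
        return "ascii"
    # Everything else: non-ASCII characters, or a NUL or '/' in the name.
    return "other"


for candidate in ("ca-certificates.crt", "libstdc++.so", "a b",
                  "NetLock_Arany_=Class_Gold=_Főtanúsítvány.pem"):
    print(candidate, "->", classify(candidate))
```

The plus sign is what pushes "libstdc++.so" out of the POSIX portable set, which is the limitation mentioned above.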
Another argument against using UTF-8 for filenames may be that even if you could type these characters on your keyboard, the name may still not match because it may be in a different normalization form (see https://unicode.org/reports/tr15/). For example, the "á" from the example could be either "á" (NFC, U+00e1 LATIN SMALL LETTER A WITH ACUTE) or "á" (NFD, U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT). It happens to be the first one, but there's no direct way to distinguish it.

$ touch Főtanúsítvány
$ touch Főtanúsítvány
$ ls -1
Főtanúsítvány
Főtanúsítvány
$ ls -1 | hexdump -C
00000000  46 6f cc 8b 74 61 6e 75  cc 81 73 69 cc 81 74 76  |Fo..tanu..si..tv|
00000010  61 cc 81 6e 79 0a 46 c5  91 74 61 6e c3 ba 73 c3  |a..ny.F..tan..s.|
00000020  ad 74 76 c3 a1 6e 79 0a                           |.tv..ny.|
00000028

There is also the issue of confusables, e.g. A (U+0041 LATIN CAPITAL LETTER A), Α (U+0391 GREEK CAPITAL LETTER ALPHA), and А (U+0410 CYRILLIC CAPITAL LETTER A), which might even have a security impact.
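The same difference can be reproduced without touching the filesystem. The following is a small Python sketch using the standard unicodedata module; the name is written with escapes so that the source itself does not depend on how an editor normalizes it.

```python
import unicodedata

# "Főtanúsítvány" spelled with precomposed code points (NFC).
name = "F\u0151tan\u00fas\u00edtv\u00e1ny"

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# The two strings render identically but compare unequal.
print("equal as strings:", nfc == nfd)
print("code points:", len(nfc), "vs", len(nfd))

# The UTF-8 bytes are what the kernel stores as the filename.
print("NFC:", nfc.encode("utf-8").hex(" "))
print("NFD:", nfd.encode("utf-8").hex(" "))
```

The two byte sequences printed here correspond to the two names in the hexdump output, without the trailing newlines.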
There are known problems with non-UTF-8 filenames (e.g. bug #690480), but UTF-8 filenames containing non-ASCII characters should work well. I am against the anglocentric assumption that only 26+26 letters can be used. If you want to type a filename, use tab completion, or 'ls' and copy+paste.
Portage could detect the situation where two different installed filenames are identical after NFD normalization and print a warning or error. (Linux handles filenames as bytes, not Unicode characters. If no Linux-foreign filesystems or communication with other operating systems are involved, the only problem for users is visual confusion.) (Such a new check would obviously not make "Főtanúsítvány" invalid in any way.)
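A rough sketch of what such a check could look like, in Python. This is not Portage code: the function name find_normalization_collisions is hypothetical, and it assumes the installed paths have already been decoded from UTF-8 into str.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(paths):
    """Return groups of distinct paths that become equal after NFD normalization.

    Hypothetical helper; assumes `paths` are str values decoded from UTF-8.
    """
    groups = defaultdict(set)
    for path in paths:
        groups[unicodedata.normalize("NFD", path)].add(path)
    # Only groups with more than one distinct spelling are collisions.
    return [sorted(group) for group in groups.values() if len(group) > 1]


installed = [
    "/etc/ssl/certs/F\u0151tan\u00fas\u00edtv\u00e1ny.pem",             # NFC spelling
    "/etc/ssl/certs/Fo\u030btanu\u0301si\u0301tva\u0301ny.pem",         # NFD spelling
    "/etc/ssl/certs/ca-certificates.crt",
]

for group in find_normalization_collisions(installed):
    print("visually identical names:", [p.encode("utf-8") for p in group])
```

A single non-ASCII name on its own produces no group, so a name such as "Főtanúsítvány" would not be flagged by a check like this.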
Regarding ca-certificates:

Nobody actually types these file names, so that argument makes no sense to me. I could see making that argument for config files that the sysadmin commonly has some need to edit. However, I haven't seen any examples of config files with foreign characters in the filename.

Regarding installed files in general:

I don't think it is practical to limit the character set to some subset of printable ASCII. That will just lead to conflicts with upstream developers, and we will probably end up patching things downstream. It seems rather pointless to do this for filenames that people rarely look at or type out anyway.

In the rare case that somebody actually needs to manipulate files with characters that they can't type, shells offer tab-completion and terminals offer copy/paste functions.
Debian's policy is this:
https://www.debian.org/doc/debian-policy/ch-files.html#file-names

In a nutshell, they require ASCII-only for binaries in PATH but UTF-8 elsewhere. I'd guess that their motivation is similar, i.e. names are restricted to ASCII if the user must type them.
(In reply to Mike Gilbert from comment #7)
> Regarding ca-certificates:
>
> Nobody actually types these file names, so that argument makes no sense to
> me.

The ca-certificates package fails to build under a non-UTF-8 locale, see https://bugs.gentoo.org/show_bug.cgi?id=916504

I dug a little more, and a libxcb bug was reported upstream too. It looks like it has been a known bug since 2022. Is there any point in making trouble for other users?

Like to see you here too, Sam! Jonas+1
(In reply to Enrique Domínguez from comment #9)

The solution we usually implement for such problems is to force UTF-8 encoding for installed file names.