Bug 435934 - Portage should accept category / package names with non-ASCII characters
Status: RESOLVED WONTFIX
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core - Ebuild Support
Hardware: All Linux
Importance: Normal enhancement
Assignee: Portage team
Reported: 2012-09-22 20:46 UTC by Michał Górny
Modified: 2021-09-03 08:35 UTC
CC List: 3 users

Description Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2012-09-22 20:46:58 UTC
I used to have a few ebuilds with non-ASCII characters in the package names. Now they no longer work. I feel offended, Sir!

Citing the robustness principle[1]:

  Be conservative in what you send, liberal in what you accept

Thus, I believe portage should not limit accepted names to the letter of the PMS; instead, repoman should warn when committing names not conforming to it.

[1]:http://en.wikipedia.org/wiki/Robustness_principle
Comment 1 Zac Medico gentoo-dev 2012-09-22 20:49:21 UTC
What does the error look like?

We can add a layout.conf setting, which allows you to configure this for the repository.
Comment 2 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2012-09-22 21:07:38 UTC
(In reply to comment #1)
> What does the error look like?

!!! 'aęł' is not a valid package atom.

> We can add a layout.conf setting, which allows you to configure this for the
> repository.

A layout.conf setting would be useful for repoman. Still, you should be liberal in what you accept. We're not Ciaranis to shoot at people for not using the official language.
Comment 3 Zac Medico gentoo-dev 2012-09-22 21:57:21 UTC
Hopefully this fixes all but the repoman file.name check:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=fdd7d8cfcfb3055ba755273b684ef4e02b99c14c
Comment 4 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2012-09-22 22:03:50 UTC
By the way, I hoped this would also revert the non-version-ending enforcement for bug 174536. Portage not doing that would be the first step towards lifting the restriction.
Comment 5 Zac Medico gentoo-dev 2012-09-22 22:10:28 UTC
I did a fixup on the previous commit to include dbapi._category_re:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=0d5b0fbd79ba8b2e7dd5d2f2db7d69cad3e56766

(In reply to comment #4)
> By the way, I hoped this would also revert the non-version-ending
> enforcement for bug 174536. Portage not doing that would be the first step
> towards lifting the restriction.

Please file a separate bug for that, because the two things are only vaguely related.
Comment 6 Zac Medico gentoo-dev 2012-09-23 22:44:36 UTC
Binds filename validation to RepoConfig, so that eventually we'll be able to control it via a layout.conf setting:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=6d8d0c02457c2e94c759fe89db0bef196b78158a
Comment 7 Arfrever Frehtes Taifersar Arahesis 2012-09-23 23:11:47 UTC
Portage should print a fatal error when a repository contains two different directories (of categories or packages) whose names are equivalent in Unicode.

E.g. b"\xc3\xb3" and b"o\xcc\x81"

>>> import unicodedata
>>> unicodedata.name(b"\xc3\xb3".decode())
'LATIN SMALL LETTER O WITH ACUTE'
>>> unicodedata.name(b"o\xcc\x81".decode()[0])
'LATIN SMALL LETTER O'
>>> unicodedata.name(b"o\xcc\x81".decode()[1])
'COMBINING ACUTE ACCENT'
>>> b"\xc3\xb3" == b"o\xcc\x81"
False
>>> b"\xc3\xb3".decode() == b"o\xcc\x81".decode()
False
>>> unicodedata.normalize("NFD", b"\xc3\xb3".decode()) == unicodedata.normalize("NFD", b"o\xcc\x81".decode())
True


http://en.wikipedia.org/wiki/Unicode_equivalence
http://en.wikipedia.org/wiki/Combining_character
http://en.wikipedia.org/wiki/Precomposed_character
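
The check proposed in this comment could be sketched roughly as follows. This is only an illustration, not Portage code: the helper name find_equivalent_names is hypothetical, and NFC is assumed as the normalization form (NFD would group the names the same way).

```python
import unicodedata
from collections import defaultdict


def find_equivalent_names(names, form="NFC"):
    """Return groups of distinct names that become equal after normalization.

    Hypothetical helper for illustration; not part of Portage.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    # Keep only groups where two or more distinct spellings collide.
    return [sorted(set(group)) for group in groups.values() if len(set(group)) > 1]


# Directory names as they might be read from one category directory.
names = [b"\xc3\xb3".decode(), b"o\xcc\x81".decode(), "foo"]
clashes = find_equivalent_names(names)
for clash in clashes:
    print("equivalent names:", [n.encode() for n in clash])
```

A caller could treat any non-empty result as the fatal error condition described above.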
Comment 8 Arfrever Frehtes Taifersar Arahesis 2012-09-23 23:47:19 UTC
When an ebuild of package b"app-misc/a" from repository X contains b"DEPEND=app-misc/\xc3\xb3", and a b"app-misc/\xc3\xb3" directory exists in repository Y and a b"app-misc/o\xcc\x81" directory exists in repository Z, then ebuilds from both the b"app-misc/\xc3\xb3" directory (from repository Y) and the b"app-misc/o\xcc\x81" directory (from repository Z) should be able to satisfy this dependency.

(X can be Y or X can be Z, but Y cannot be Z.)
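
A minimal sketch of that matching rule, assuming package names are compared after NFC normalization. The providers function and the repos mapping are hypothetical and only model the scenario above; real dependency resolution in Portage is far more involved.

```python
import unicodedata


def nfc(text):
    """Normalize text to NFC so canonically equivalent names compare equal."""
    return unicodedata.normalize("NFC", text)


def providers(dep, repos):
    """Return (repository, package) pairs whose name is equivalent to dep."""
    wanted = nfc(dep)
    return [
        (repo, pkg)
        for repo, pkgs in repos.items()
        for pkg in pkgs
        if nfc(pkg) == wanted
    ]


# Repository Y uses the precomposed form, repository Z the combining form.
repos = {
    "Y": ["app-misc/\u00f3"],
    "Z": ["app-misc/o\u0301"],
}
matches = providers("app-misc/\u00f3", repos)
print(matches)
```

With this comparison, the packages from both Y and Z are returned for the single dependency string.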
Comment 9 SpanKY gentoo-dev 2014-01-06 18:56:22 UTC
unless i missed something, this is just a "nice to have" since such characters are forbidden by PMS
Comment 10 Ulrich Müller gentoo-dev 2015-05-05 12:47:30 UTC
(In reply to Arfrever Frehtes Taifersar Arahesis from comment #8)
> When ebuild of package b"app-misc/a" from repository X contains
> b"DEPEND=app-misc/\xc3\xb3" and b"app-misc/\xc3\xb3" directory exists in
> repository Y and b"app-misc/o\xcc\x81" directory exists in repository Z,
> then ebuilds from both  b"app-misc/\xc3\xb3" (from repository Y) and
> b"app-misc/o\xcc\x81" (from repository Z) directories should be able to
> satisfy this dependency.
> 
> (X can be Y or X can be Z, but Y cannot be Z.)

Yeah, those are the kinds of problems that would result from allowing arbitrary chars in package names. Should app-misc/A, app-misc/А, and app-misc/Α map to the same package, too (that's Latin, Cyrillic, and Greek A, respectively)? And how about app-misc/abcd, app-misc/dcba, and app-misc/‮abcd‬? (The last one is "abcd" with right-to-left directional override, i.e. b"app-misc/\xe2\x80\xaeabcd\xe2\x80\xac".)
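
To illustrate why normalization alone would not settle these cases, here is a small sketch using Python's unicodedata module. It assumes NFKC as the most aggressive standard normalization form; the variable names are only for illustration.

```python
import unicodedata

# Latin, Cyrillic, and Greek capital A: visually alike, distinct code points.
lookalikes = ["A", "\u0410", "\u0391"]
for ch in lookalikes:
    print(hex(ord(ch)), unicodedata.name(ch))

# Even compatibility normalization keeps the three letters distinct.
normalized = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(normalized))

# "abcd" wrapped in RIGHT-TO-LEFT OVERRIDE and POP DIRECTIONAL FORMATTING.
overridden = "\u202eabcd\u202c"
# Both control characters are in category Cf (format), which a validator
# could reject outright.
has_format_chars = any(unicodedata.category(ch) == "Cf" for ch in overridden)
print(overridden.encode(), has_format_chars)
```

Detecting lookalikes across scripts would need a separate confusables table, which the standard library does not provide.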

Can Portage follow the spec, please? PMS is quite explicit about what characters are allowed in package names. Also GLEP 31 limits filenames to ASCII.


(In reply to Michał Górny from comment #0)
> Citing the robustness principle[1]:
> 
>   Be conservative in what you send, liberal in what you accept

Nope, this might apply to user input, but certainly it doesn't apply to the tree. We aim for interoperability between different package managers, therefore PMs should be rather strict about what they accept as valid ebuilds.

> !!! 'aęł' is not a valid package atom.

Right, it isn't.
Comment 11 SpanKY gentoo-dev 2015-05-31 02:20:01 UTC
(In reply to Ulrich Müller from comment #10)

i don't think PMS is as explicit as you describe.  example:
A package name may contain any of the characters [A-Za-z0-9+_-]. It must not begin with a hyphen or a plus sign, and must not end in a hyphen followed by anything matching the version syntax described in section 3.2.

that does not state the package name is limited to that regex.  i.e. it doesn't say "may only contain" or otherwise say that other characters are forbidden.  that might have been the intention, but it isn't what the spec says ;).
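
For comparison, the strict reading of the quoted rule (only the listed characters are allowed) could be sketched as below. This is not Portage's implementation: is_strict_package_name is a hypothetical name, and the version pattern is an approximation of the PMS version syntax written from memory, so treat it as an assumption.

```python
import re

# Allowed characters; the first one may not be a hyphen or a plus sign.
_NAME_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9+_-]*")

# Approximation (assumption) of "a hyphen followed by anything matching
# the version syntax", anchored at the end of the name.
_VERSION_SUFFIX_RE = re.compile(
    r"-[0-9]+(\.[0-9]+)*[a-z]?(_(alpha|beta|pre|rc|p)[0-9]*)*(-r[0-9]+)?$"
)


def is_strict_package_name(name):
    """Return True if name passes the strict reading of the quoted rule."""
    if not _NAME_RE.fullmatch(name):
        return False
    return _VERSION_SUFFIX_RE.search(name) is None


for candidate in ["foo-bar", "gtk+", "aęł", "-foo", "+foo", "foo-1.2"]:
    print(candidate, is_strict_package_name(candidate))
```

Under the liberal reading argued for in this comment, only the two "must not" checks would remain and the character class would be dropped.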
Comment 12 Ulrich Müller gentoo-dev 2015-05-31 11:44:42 UTC
(In reply to SpanKY from comment #11)

PMS is written with an informed and well-disposed reader in mind. So sometimes the wording is not absolutely watertight, in order to keep the spec readable.

> A package name may contain any of the characters [A-Za-z0-9+_-].

There cannot be any reasonable doubt about the intended meaning of this.
Comment 13 Arfrever Frehtes Taifersar Arahesis 2015-10-24 20:39:24 UTC
*** Bug 563984 has been marked as a duplicate of this bug. ***
Comment 14 Ulrich Müller gentoo-dev 2021-09-03 08:35:24 UTC
Closing, as discussed in #gentoo-dev.