https://devmanual.gentoo.org/ebuild-writing/misc-files/metadata/index.html has the following rule:

lang: <description>, <longdescription>, <use>, <doc>
In every case where a description is required, there must be at least an english description. If an additional description in another language is given, this attribute is used to specify the language used. The format is the two-character language code as defined by the ISO-639-1 norm.

In fact, a two-letter language code is too limited. A 4-digit script section (ISO 15924: http://www.unicode.org/iso15924/iso15924-en.html) is also required to identify a written language. Take Chinese for example: "zh" should be either "zh_Hant" or "zh_Hans". They look different:

This drumstick is delicious! <- en
这个鸡腿太好吃了! <- zh_Hans
這個鷄腿太好喫了! <- zh_Hant

The last two lines are both "zh", have exactly the same meaning, and are both widely used, but they are written with different characters.
Sorry, there is a typo: a script code has 4 characters (not digits). Some languages do not need a script code suffix.
repoman already allows any value of the lang attribute, e.g. lang="Wu language: Shanghainese dialect". So the DevManual should simply be more liberal.
The devmanual isn't normative here, but follows what is specified in the relevant GLEPs. For the lang attribute, this is GLEP 56, which in turn references the Developer Handbook. At the time of approval of GLEP 56, it said: "The format is the two-character language code as defined by the ISO-639-1 norm."
<https://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo/xml/htdocs/proj/en/devrel/handbook/hb-guide-metadata.xml?hideattic=0&revision=1.11&view=markup#l132>

It just so happens that a revised GLEP 68 for metadata is in the making and will presumably be approved in the next council meeting. Its current draft also says that the lang attribute contains an ISO 639-1 language code.

As the purpose is to specify a *language*, not a dialect or an orthography variant, I'm not convinced that we should allow anything beyond the 2 (or possibly 3) letter code of ISO-639-*. In particular, I'm not aware of any of our tools implementing anything beyond that (e.g., being able to parse RFC 5646 style "de-CH-1901: German as used in Switzerland using the 1901 variant orthography" or "zh-cmn-Hans-CN: Chinese, Mandarin, Simplified script, as used in China").

(In reply to Arfrever Frehtes Taifersar Arahesis from comment #2)
> repoman already allows any value of lang attribute.
> E.g. lang="Wu language: Shanghainese dialect"
>
> So just DevManual should be more liberal.

Nope, I don't think we would go for arbitrary text there. One could consider RFC 5646, but then again, it would delay GLEP 68 because there is no reference implementation (which is required).
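To give an idea of what parsing RFC 5646 style tags involves, here is a minimal Python sketch. It is not an existing Gentoo tool. The function name parse_langtag is hypothetical, and the pattern covers only a simplified subset of the RFC 5646 grammar: language, optional script, optional region, and variants. Extended language subtags (such as the "cmn" in "zh-cmn-Hans-CN"), extensions, private-use subtags, and grandfathered tags are deliberately left out, and nothing is checked against the IANA subtag registry.

```python
import re

# Simplified subset of the RFC 5646 "langtag" production (an assumption of
# this sketch, not the full grammar): language, optional script, optional
# region, then zero or more variants.
_LANGTAG = re.compile(
    r"""
    (?P<language>[A-Za-z]{2,3})
    (?:-(?P<script>[A-Za-z]{4}))?
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
    \Z
    """,
    re.VERBOSE,
)


def parse_langtag(tag):
    """Split a tag into its subtags, or return None if it does not fit
    the simplified pattern above. Only the syntax is checked."""
    m = _LANGTAG.match(tag)
    if m is None:
        return None
    variants = [v for v in m.group("variants").split("-") if v]
    return {
        # Conventional casing: language lower, script Title, region UPPER.
        "language": m.group("language").lower(),
        "script": m.group("script").title() if m.group("script") else None,
        "region": m.group("region").upper() if m.group("region") else None,
        "variants": [v.lower() for v in variants],
    }


if __name__ == "__main__":
    for tag in ("de-CH-1901", "zh-Hans-CN", "zh-cmn-Hans-CN"):
        print(tag, "->", parse_langtag(tag))
```

With this subset, "de-CH-1901" splits into language, region, and one variant, while "zh-cmn-Hans-CN" is rejected because extended language subtags are not handled. A complete implementation would have to cover the remaining productions of the grammar as well.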
I don't really mind any particular format here. I referenced ISO-639-1 because that's what ulm suggested as least likely to break existing tools, and correct for all current uses. As far as I'm concerned, we could go for any machine-parsable format as long as it's compatible with ISO-639-1 (i.e. existing uses).
Any reason to keep this open? GLEP 68 already sets ISO-639-1 as part of the specification.
(In reply to Göktürk Yüksek from comment #5)
> Any reason to keep this open? GLEP 68 already sets ISO-639-1 as part of the
> specification.

Let's look at this file:
https://github.com/gentoo/gentoo/blob/master/app-accessibility/metadata.xml

It contains various translations like:

<longdescription lang="de">
Die Kategorie app-accessibility enthält Programme für barrierefreies
Arbeiten (Accessibility), wie beispielsweise Screenreader.
</longdescription>
<longdescription lang="nl">
De app-accessibility categorie bevat applicaties die de toegankelijkheid
bevorderen, bijvoorbeeld een schermlezer.
</longdescription>
<longdescription lang="ja">
app-accessibilityカテゴリィにはアクセシビリティと
手伝うパッケージが含まれます。
</longdescription>

Unfortunately, simplified Chinese (zh_Hans) and traditional Chinese (zh_Hant) translations find no place here.

It's impossible to use something like

<longdescription lang="zh">

to stand for either simplified Chinese (zh_Hans) or traditional Chinese (zh_Hant). In fact, "zh" is not the name of a real written language.

We need a better rule.
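For illustration, a consumer of such a file might select a description by language roughly as follows. This is only a sketch using Python's standard xml.etree.ElementTree. The function name pick_longdescription, the English placeholder text, and the choice to treat a missing lang attribute as "en" are assumptions of the example, not something stated in this report. The German and Dutch texts are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# Cut-down sample modelled on the category metadata quoted above.
# The English text is a placeholder invented for this example.
CATEGORY_METADATA = """\
<catmetadata>
  <longdescription lang="en">
    (placeholder English description)
  </longdescription>
  <longdescription lang="de">
    Die Kategorie app-accessibility enthält Programme für barrierefreies
    Arbeiten (Accessibility), wie beispielsweise Screenreader.
  </longdescription>
  <longdescription lang="nl">
    De app-accessibility categorie bevat applicaties die de toegankelijkheid
    bevorderen, bijvoorbeeld een schermlezer.
  </longdescription>
</catmetadata>
"""


def pick_longdescription(xml_text, wanted, fallback="en"):
    """Return (lang, text) for the wanted language, else for the fallback
    language, else None. Matching is by exact string comparison."""
    root = ET.fromstring(xml_text)
    by_lang = {}
    for elem in root.iter("longdescription"):
        # Assumption: an element without a lang attribute counts as English.
        lang = elem.get("lang", "en")
        # Collapse the line breaks and indentation inside the element.
        by_lang[lang] = " ".join((elem.text or "").split())
    for lang in (wanted, fallback):
        if lang in by_lang:
            return lang, by_lang[lang]
    return None


if __name__ == "__main__":
    print(pick_longdescription(CATEGORY_METADATA, "nl"))
    print(pick_longdescription(CATEGORY_METADATA, "zh-Hans"))
```

Because the lookup is an exact match on the attribute value, a single lang="zh" entry could not serve both scripts. A request for "zh-Hans" in this sample simply falls back to English.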
(In reply to yuchen.xie from comment #6)
> Unfortunately, simplified Chinese (zh_Hans) and traditional Chinese (zh_Hant)
> translations find no place here.
>
> It's impossible to use something like
>
> <longdescription lang="zh">
>
> to stand for either simplified Chinese (zh_Hans) or traditional Chinese
> (zh_Hant). In fact "zh" is not a real written language name.
>
> We need a better rule.

It is easy to ask for this, but more difficult to write a proper specification. Presumably (and since we don't want to reinvent the wheel) it should be based on BCP 47 (http://www.rfc-editor.org/rfc/bcp/bcp47.txt) or a subset of it. Things that we might want to cover:
- Package and category metadata (GLEP 68)
- News items (GLEP 42)
- Wiki translations

The wiki seems to use "zh-cn" as language code. Not sure what standard that is.
Is there a real use case for this? That is:

1. Are you going to actively translate descriptions into both variants?

2. Do we have any software that will actually support choosing between the two variants? Can this be done reasonably, i.e. without having to maintain a huge locale -> langcode mapping table in the PM?

3. Do you suspect users will actually need the two variants?
(In reply to Michał Górny from comment #8)
> 1. Are you going to actively translate descriptions into both variants?

Translation into simplified Chinese script can be provided by Gentoo users from e.g. mainland China, while translation into traditional Chinese script can be provided by Gentoo users from e.g. Taiwan.

> 3. Do you suspect users will actually need the two variants?

Different variants are for different sets of users. Differences between graphemes are often significant:
https://en.wikipedia.org/wiki/Simplified_Chinese_characters#Method_of_simplification

https://en.wikipedia.org/wiki/Simplified_Chinese_characters#Education :
"In general, schools in Mainland China, Malaysia and Singapore use simplified characters exclusively, while schools in Hong Kong, Macau, and Taiwan use traditional characters exclusively."
(In reply to Ulrich Müller from comment #7)

I second your idea of using BCP 47 or a subset of it. It is widely adopted in modern web browsers.

> (In reply to yuchen.xie from comment #6)
> > Unfortunately, simplified Chinese (zh_Hans) and traditional Chinese (zh_Hant)
> > translations find no place here.
> >
> > It's impossible to use something like
> >
> > <longdescription lang="zh">
> >
> > to stand for either simplified Chinese (zh_Hans) or traditional Chinese
> > (zh_Hant). In fact "zh" is not a real written language name.
> >
> > We need a better rule.
>
> It is easy to ask for this, but more difficult to write a proper
> specification. Presumably (and since we don't want to reinvent the wheel) it
> should be based on BCP 47 (http://www.rfc-editor.org/rfc/bcp/bcp47.txt) or a
> subset of it. Things that we might want to cover:
> - Package and category metadata (GLEP 68)
> - News items (GLEP 42)
> - Wiki translations
>
> The wiki seems to use "zh-cn" as language code. Not sure what standard that
> is.
(In reply to Michał Górny from comment #8)
> 2. Do we have any software that will actually support choosing between the
> two variants? Can this be done reasonably, i.e. without having to maintain a
> huge locale -> langcode mapping table in the PM?

Maybe a more pragmatic approach would be to allow any value that is legal in LINGUAS. For example, for Chinese there are "zh_CN", "zh_HK", and "zh_TW".
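As a rough sketch of how a tool could choose a description without a mapping table, the following Python code turns a POSIX-style locale name into a hyphenated tag and then drops trailing subtags until something matches, in the spirit of the "lookup" scheme of RFC 4647. The function names locale_to_tag and lookup are hypothetical, and no package manager is claimed to work this way.

```python
def locale_to_tag(locale_name):
    """Turn a POSIX-style locale name such as "zh_TW.UTF-8" into a
    hyphenated tag such as "zh-TW" by dropping codeset and modifier."""
    base = locale_name.split(".", 1)[0].split("@", 1)[0]
    return base.replace("_", "-")


def lookup(available, wanted, default="en"):
    """Return the most specific available tag that is a prefix of the
    wanted tag (compared case-insensitively), or the default."""
    by_lower = {tag.lower(): tag for tag in available}
    subtags = wanted.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
        # A single-character subtag left at the end is dropped as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


if __name__ == "__main__":
    available = {"en", "de", "zh-Hant"}
    print(lookup(available, locale_to_tag("de_CH.UTF-8")))
    print(lookup(available, "zh-Hant-TW"))
    print(lookup(available, locale_to_tag("zh_TW.UTF-8")))
```

Truncation alone resolves "de-CH" to "de" and "zh-Hant-TW" to "zh-Hant", but a region-based name such as "zh_TW" never reaches a script-based tag such as "zh-Hant" and ends up at the default. Bridging that gap would still need some region-to-script information, which is the mapping-table concern raised in the quoted question.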
Meanwhile we have a package with <longdescription lang="zh">, namely app-dicts/sword-ChiSB (CCing its maintainer). According to upstream, the language is traditional Chinese (zh-Hant). So, how about updating GLEP 68 to allow IETF language tags (BCP 47) instead of ISO 639-1? We already use them for the L10N USE_EXPAND variable, so there is a precedent.
Update posted to gentoo-dev: https://archives.gentoo.org/gentoo-dev/message/0d0ea85d6b1efe334124154fa9956e93
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/data/xml-schema.git/commit/?id=197c545067748a1ecf6b371d3646a3e725923264

commit 197c545067748a1ecf6b371d3646a3e725923264
Author: Ulrich Müller <ulm@gentoo.org>
AuthorDate: 2022-05-22 05:32:05 +0000
Commit: Ulrich Müller <ulm@gentoo.org>
CommitDate: 2022-05-22 06:09:14 +0000

    metadata.xsd: Use xs:language for lang attributes

    Use a built-in datatype of XML Schema instead of hand-crafting our own.

    Bug: https://bugs.gentoo.org/578294
    Signed-off-by: Ulrich Müller <ulm@gentoo.org>

 metadata.xsd | 224 ++---------------------------------------------------------
 1 file changed, 6 insertions(+), 218 deletions(-)
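For readers unfamiliar with xs:language: XML Schema Part 2 restricts its lexical form with the pattern [a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*. The Python sketch below applies only that pattern. The function name is_xs_language is hypothetical, and the sketch skips the whitespace collapsing that a schema validator performs before matching.

```python
import re

# Lexical pattern of xs:language as given in XML Schema Part 2.
# It checks the shape of a tag only, not whether its subtags are registered.
XS_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_xs_language(value):
    """Return True if value matches the xs:language pattern in full."""
    return XS_LANGUAGE.fullmatch(value) is not None


if __name__ == "__main__":
    for value in ("zh-Hant", "pt-BR", "zh_Hans",
                  "Wu language: Shanghainese dialect"):
        print(repr(value), is_xs_language(value))
```

Hyphenated tags such as "zh-Hant" or "pt-BR" pass, while the underscore form "zh_Hans" and free text such as the repoman example from earlier in this report do not. Since the check is purely syntactic, a well-formed but meaningless tag passes too.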
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/devmanual.git/commit/?id=389882dac0cb2e2a174cf70fdad778b71a4538d3

commit 389882dac0cb2e2a174cf70fdad778b71a4538d3
Author: Ulrich Müller <ulm@gentoo.org>
AuthorDate: 2022-05-22 06:24:57 +0000
Commit: Ulrich Müller <ulm@gentoo.org>
CommitDate: 2022-05-27 09:02:31 +0000

    ebuild-writing/misc-files/metadata: Language tags can be BCP 47

    This corresponds to the update of GLEP 68.

    Bug: https://bugs.gentoo.org/578294
    Signed-off-by: Ulrich Müller <ulm@gentoo.org>

 ebuild-writing/misc-files/metadata/text.xml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/data/glep.git/commit/?id=f6ba29bfdb9572e186bb2cdf5c8380ac9a62ae63

commit f6ba29bfdb9572e186bb2cdf5c8380ac9a62ae63
Author: Ulrich Müller <ulm@gentoo.org>
AuthorDate: 2022-05-22 05:53:45 +0000
Commit: Ulrich Müller <ulm@gentoo.org>
CommitDate: 2022-05-22 05:53:45 +0000

    glep-0068: Update language identifiers from ISO 639-1 to BCP 47

    This will allow codes like pt-BR or zh-Hant which is already used
    in at least one longdescription in the Gentoo repository.

    Note that the L10N USE_EXPAND and GLEP 42 news items also use BCP 47
    for language names.

    Bug: https://bugs.gentoo.org/578294
    Signed-off-by: Ulrich Müller <ulm@gentoo.org>

 glep-0068.rst | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)