578294 – GLEP 68: The format of "lang" attribute in metadata.xml is insufficient

Status:	RESOLVED FIXED

Alias:	None

Product:	Documentation
Classification:	Unclassified
Component:	GLEP Changes
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Michał Górny

Reported:	2016-03-26 13:18 UTC by yuchen.xie
Modified:	2022-06-12 19:10 UTC
CC List:	5 users

See Also:	847223

Description yuchen.xie 2016-03-26 13:18:40 UTC

https://devmanual.gentoo.org/ebuild-writing/misc-files/metadata/index.html has the following rules:


Attribute: lang
Tags: <description>, <longdescription>, <use>, <doc>
Description: In every case where a description is required, there must be at least an English description. If an additional description in another language is given, this attribute is used to specify the language used. The format is the two-character language code as defined by the ISO-639-1 norm.

In fact, a two-letter language code is too limited. A 4-digit script code (ISO 15924: http://www.unicode.org/iso15924/iso15924-en.html) is also required to identify the actual written language.

Take Chinese as an example. "zh" should be either "zh_Hant" or "zh_Hans".
They look different:

This drumstick is delicious!   <- en
这个鸡腿太好吃了！              <- zh_Hans
這個鷄腿太好喫了！              <- zh_Hant
 
The last two lines are both "zh" and have exactly the same meaning. Both forms are widely used, yet their written characters differ.

Comment 1 yuchen.xie 2016-03-27 00:02:18 UTC

Sorry, there is a typo: a script code has 4 characters (not digits).

Some languages do not need a script code suffix.

Comment 2 Arfrever Frehtes Taifersar Arahesis 2016-04-03 09:40:07 UTC

repoman already allows any value of lang attribute.
E.g. lang="Wu language: Shanghainese dialect"

So just DevManual should be more liberal.

Comment 3 Ulrich Müller gentoo-dev 2016-04-03 12:09:30 UTC

The devmanual isn't normative here, but follows what is specified in the relevant GLEPs. For the lang attribute, this is GLEP 56, which in turn references the Developer Handbook. At the time of approval of GLEP 56, it said:

"The format is the two-character language code as defined by the ISO-639-1 norm."

<https://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo/xml/htdocs/proj/en/devrel/handbook/hb-guide-metadata.xml?hideattic=0&revision=1.11&view=markup#l132>

It just so happens that a revised GLEP 68 for metadata is in the making and will presumably be approved in the next council meeting. Its current draft also says that the lang attribute contains an ISO 639-1 language code.

As the purpose is to specify a *language*, not a dialect or an orthography variant, I'm not convinced that we should allow anything beyond the 2 (or possibly 3) letter code of ISO-639-*. In particular, I'm not aware of any of our tools implementing anything beyond that (e.g., being able to parse RFC 5646 style tags such as "de-CH-1901" (German as used in Switzerland using the 1901 variant orthography) or "zh-cmn-Hans-CN" (Chinese, Mandarin, Simplified script, as used in China)).


(In reply to Arfrever Frehtes Taifersar Arahesis from comment #2)
> repoman already allows any value of lang attribute.
> E.g. lang="Wu language: Shanghainese dialect"
> 
> So just DevManual should be more liberal.

Nope, I don't think we would go for arbitrary text there. One could consider RFC 5646, but then again, it would delay GLEP 68 because there is no reference implementation (which is required).
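
For illustration only, here is a minimal Python sketch of what parsing RFC 5646 style tags could look like. It is not a reference implementation and not part of any Gentoo tool. The names `LANGTAG` and `parse_tag` are hypothetical, and the pattern covers only a simplified subset of the grammar (language, one optional extlang, script, region, variants), leaving out extensions, private-use subtags and grandfathered tags.

```python
import re

# Simplified subset of the RFC 5646 "langtag" grammar. Assumption: at most one
# extlang subtag, and no extension or private-use subtags.
LANGTAG = re.compile(
    r"""
    (?P<language>[A-Za-z]{2,3})
    (?:-(?P<extlang>[A-Za-z]{3}))?
    (?:-(?P<script>[A-Za-z]{4}))?
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
    """,
    re.VERBOSE,
)


def parse_tag(tag):
    """Split a language tag into named subtags, or return None if it does not match."""
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return None
    # Keep only the subtags that are actually present.
    parts = {name: value for name, value in match.groupdict().items() if value}
    if "variants" in parts:
        parts["variants"] = parts["variants"].strip("-").split("-")
    return parts


print(parse_tag("de-CH-1901"))
print(parse_tag("zh-cmn-Hans-CN"))
print(parse_tag("Wu language: Shanghainese dialect"))
```

The two tags quoted above split into their subtags, while free text such as the repoman example from comment #2 is rejected. A syntactic check like this does not verify that the subtags are registered with IANA.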

Comment 4 Michał Górny archtester 2016-04-03 22:10:56 UTC

I don't really mind any particular format here. I referenced ISO-639-1 because that's what ulm suggested as least likely to break existing tools, and correct for all current uses. As far as I'm concerned, we could go for any machine-parsable format as long as it's compatible with ISO-639-1 (i.e. existing uses).

Comment 5 Göktürk Yüksek archtester 2016-05-03 06:36:02 UTC

Any reason to keep this open? GLEP 68 already sets ISO-639-1 as part of the specification.

Comment 6 yuchen.xie 2016-05-03 12:00:59 UTC

(In reply to Göktürk Yüksek from comment #5)
> Any reason to keep this open? GLEP 68 already sets ISO-639-1 as part of the
> specification.

Let's look at this file:

https://github.com/gentoo/gentoo/blob/master/app-accessibility/metadata.xml

It contains various translations like:

	<longdescription lang="de">
		Die Kategorie app-accessibility enthält Programme für barrierefreies
		Arbeiten (Accessibility), wie beispielsweise Screenreader.
	</longdescription>
	<longdescription lang="nl">
		De app-accessibility categorie bevat applicaties die de
		toegankelijkheid bevorderen, bijvoorbeeld een schermlezer.
	</longdescription>
	<longdescription lang="ja">
		app-accessibilityカテゴリィにはアクセシビリティと
		手伝うパッケージが含まれます。
	</longdescription>

Unfortunately, simplified Chinese (zh_Hans) and traditional Chinese (zh_Hant) translations have no place here.

It's impossible to use something like

 <longdescription lang="zh">

to stand for either simplified Chinese (zh_Hans) or traditional Chinese (zh_Hant). In fact, "zh" is not the name of a real written language.

We need a better rule.
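
As a rough sketch of the point, the Python snippet below reads `lang` attributes from a metadata file with the standard library's `xml.etree.ElementTree`. The sample document is hypothetical: it uses the hyphenated tag forms "zh-Hans" and "zh-Hant", and its Chinese entries are the sample sentences from the bug description, not real translations of any category text. The helper name `descriptions_by_lang` is also an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical category metadata; the texts are placeholders taken from the
# bug description, not actual translations.
METADATA = """\
<catmetadata>
  <longdescription lang="en">This drumstick is delicious!</longdescription>
  <longdescription lang="zh-Hans">这个鸡腿太好吃了！</longdescription>
  <longdescription lang="zh-Hant">這個鷄腿太好喫了！</longdescription>
</catmetadata>
"""


def descriptions_by_lang(xml_text):
    """Map each lang attribute value to its whitespace-normalised text."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter("longdescription"):
        lang = elem.get("lang", "en")  # assumption: a missing lang means English
        result[lang] = " ".join((elem.text or "").split())
    return result


for lang, text in descriptions_by_lang(METADATA).items():
    print(lang, text)
```

With script subtags each variant gets its own key. With a bare lang="zh", only one of the two Chinese texts could be stored under that key.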

Comment 7 Ulrich Müller gentoo-dev 2016-05-03 13:44:46 UTC

(In reply to yuchen.xie from comment #6)
> Unfortunately, simplified Chinese (zh_Hans) and tradition Chinese (zh_Hant)
> translations find no place here.
> 
> It's impossible to use something like
> 
>  <longdescription lang="zh">
> 
> to stands for either simplified Chinese (zh_Hans) or tradition Chinese
> (zh_Hant). In fact "zh" is not a real written language name.
> 
> We need a better rule.

It is easy to ask for this, but more difficult to write a proper specification. Presumably (and since we don't want to reinvent the wheel) it should be based on BCP 47 (http://www.rfc-editor.org/rfc/bcp/bcp47.txt) or a subset of it. Things that we might want to cover:
- Package and category metadata (GLEP 68)
- News items (GLEP 42)
- Wiki translations

The wiki seems to use "zh-cn" as language code. Not sure what standard that is.

Comment 8 Michał Górny archtester 2016-05-04 18:47:59 UTC

Is there a real use case for this? That is:

1. Are you going to actively translate descriptions into both variants?

2. Do we have any software that will actually support choosing between the two variants? Can this be done reasonably, i.e. without having to maintain a huge locale -> langcode mapping table in the PM?

3. Do you suspect users will actually need the two variants?
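
To make question 2 concrete, here is a hedged Python sketch of how a tool might choose a description from the user's locale. Nothing here is taken from an existing package manager. `IMPLIED_SCRIPT`, `candidates` and `pick` are hypothetical names, and the tiny table is an illustrative assumption rather than a complete locale to language code mapping.

```python
# Hypothetical, deliberately tiny table: which script a Chinese locale
# usually implies. A real tool would need a fuller, maintained mapping.
IMPLIED_SCRIPT = {
    ("zh", "CN"): "Hans",
    ("zh", "SG"): "Hans",
    ("zh", "TW"): "Hant",
    ("zh", "HK"): "Hant",
}


def candidates(locale):
    """Return language tags to try for a POSIX locale, most specific first."""
    name = locale.split(".")[0].split("@")[0]  # drop codeset and modifier
    lang, _, region = name.partition("_")
    tags = []
    script = IMPLIED_SCRIPT.get((lang, region))
    if script:
        tags.append(f"{lang}-{script}")
    if region:
        tags.append(f"{lang}-{region}")
    tags.append(lang)
    tags.append("en")  # an English description is always required
    return tags


def pick(descriptions, locale):
    """Pick the best available description for the locale, or None."""
    for tag in candidates(locale):
        if tag in descriptions:
            return descriptions[tag]
    return None


print(candidates("zh_TW.UTF-8"))
print(pick({"en": "English text", "zh-Hant": "Traditional text"}, "zh_TW.UTF-8"))
```

The sketch compares tags as exact strings. Language tags are case-insensitive, so a real implementation would normalise case first. The table is the part that grows if script subtags have to be inferred from region-based locales.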

Comment 9 Arfrever Frehtes Taifersar Arahesis 2016-05-04 19:17:36 UTC

(In reply to Michał Górny from comment #8)
> 1. Are you going to actively translate descriptions into both variants?

Translation into simplified Chinese script can be provided by Gentoo users from e.g. mainland China, while translation into traditional Chinese script can be provided by Gentoo users from e.g. Taiwan.

> 3. Do you suspect users will actually need the two variants?

Different variants are for different sets of users.

Differences between graphemes are often significant:
https://en.wikipedia.org/wiki/Simplified_Chinese_characters#Method_of_simplification

https://en.wikipedia.org/wiki/Simplified_Chinese_characters#Education :
"In general, schools in Mainland China, Malaysia and Singapore use simplified characters exclusively, while schools in Hong Kong, Macau, and Taiwan use traditional characters exclusively."

Comment 10 yuchen.xie 2016-05-05 03:55:49 UTC

(In reply to Ulrich Müller from comment #7)
I second your idea of using BCP 47 or a subset of it. It is widely adopted in modern web browsers.

> (In reply to yuchen.xie from comment #6)
> > Unfortunately, simplified Chinese (zh_Hans) and tradition Chinese (zh_Hant)
> > translations find no place here.
> > 
> > It's impossible to use something like
> > 
> >  <longdescription lang="zh">
> > 
> > to stands for either simplified Chinese (zh_Hans) or tradition Chinese
> > (zh_Hant). In fact "zh" is not a real written language name.
> > 
> > We need a better rule.
> 
> It is easy to ask for this, but more difficult to write a proper
> specification. Presumably (and since we don't want to reinvent the wheel) it
> should be based on BCP 47 (http://www.rfc-editor.org/rfc/bcp/bcp47.txt) or a
> subset of it. Things that we might want to cover:
> - Package and category metadata (GLEP 68)
> - News items (GLEP 42)
> - Wiki translations
> 
> The wiki seems to use "zh-cn" as language code. Not sure what standard that
> is.

Comment 11 Ulrich Müller gentoo-dev 2016-05-07 09:17:10 UTC

(In reply to Michał Górny from comment #8)
> 2. Do we have any software that will actually support choosing between the
> two variants? Can this be done reasonably, i.e. without having to maintain a
> huge locale -> langcode mapping table in the PM?

Maybe a more pragmatic approach would be to allow any value that is legal in LINGUAS. For example, for Chinese there are "zh_CN", "zh_HK", and "zh_TW".

Comment 12 Ulrich Müller gentoo-dev 2022-01-17 19:58:39 UTC

Meanwhile we have a package with <longdescription lang="zh">, namely app-dicts/sword-ChiSB (CCing its maintainer). According to upstream, the language is traditional Chinese (zh-Hant).

So, how about updating GLEP 68 to allow IETF language tags (BCP 47) instead of ISO 639-1? We already use them for the L10N USE_EXPAND variable, so there is a precedent.

Comment 13 Ulrich Müller gentoo-dev 2022-05-22 06:18:58 UTC

Update posted to gentoo-dev:
https://archives.gentoo.org/gentoo-dev/message/0d0ea85d6b1efe334124154fa9956e93

Comment 14 Larry the Git Cow gentoo-dev 2022-05-23 06:25:04 UTC

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/data/xml-schema.git/commit/?id=197c545067748a1ecf6b371d3646a3e725923264

commit 197c545067748a1ecf6b371d3646a3e725923264
Author:     Ulrich Müller <ulm@gentoo.org>
AuthorDate: 2022-05-22 05:32:05 +0000
Commit:     Ulrich Müller <ulm@gentoo.org>
CommitDate: 2022-05-22 06:09:14 +0000

    metadata.xsd: Use xs:language for lang attributes
    
    Use a built-in datatype of XML Schema instead of hand-crafting our own.
    
    Bug: https://bugs.gentoo.org/578294
    Signed-off-by: Ulrich Müller <ulm@gentoo.org>

 metadata.xsd | 224 ++---------------------------------------------------------
 1 file changed, 6 insertions(+), 218 deletions(-)
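
As a rough guide to what the built-in datatype accepts, the Python sketch below applies the lexical pattern that XML Schema Part 2 gives for xs:language. The helper name `is_xs_language` is hypothetical. The check is purely syntactic and does not confirm that a tag is a registered BCP 47 tag.

```python
import re

# Lexical pattern given for xs:language in XML Schema Part 2 (Datatypes).
XS_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_xs_language(value):
    """Return True if value matches the xs:language lexical pattern."""
    return XS_LANGUAGE.fullmatch(value) is not None


for value in ("de", "pt-BR", "zh-Hant", "zh_Hans", "Wu language: Shanghainese dialect"):
    print(value, is_xs_language(value))
```

Hyphenated tags such as "zh-Hant" match, while the underscore spelling "zh_Hans" used earlier in this bug and arbitrary free text do not.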

Comment 15 Larry the Git Cow gentoo-dev 2022-05-27 09:04:44 UTC

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/devmanual.git/commit/?id=389882dac0cb2e2a174cf70fdad778b71a4538d3

commit 389882dac0cb2e2a174cf70fdad778b71a4538d3
Author:     Ulrich Müller <ulm@gentoo.org>
AuthorDate: 2022-05-22 06:24:57 +0000
Commit:     Ulrich Müller <ulm@gentoo.org>
CommitDate: 2022-05-27 09:02:31 +0000

    ebuild-writing/misc-files/metadata: Language tags can be BCP 47
    
    This corresponds to the update of GLEP 68.
    
    Bug: https://bugs.gentoo.org/578294
    Signed-off-by: Ulrich Müller <ulm@gentoo.org>

 ebuild-writing/misc-files/metadata/text.xml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comment 16 Larry the Git Cow gentoo-dev 2022-06-12 19:09:55 UTC

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/data/glep.git/commit/?id=f6ba29bfdb9572e186bb2cdf5c8380ac9a62ae63

commit f6ba29bfdb9572e186bb2cdf5c8380ac9a62ae63
Author:     Ulrich Müller <ulm@gentoo.org>
AuthorDate: 2022-05-22 05:53:45 +0000
Commit:     Ulrich Müller <ulm@gentoo.org>
CommitDate: 2022-05-22 05:53:45 +0000

    glep-0068: Update language identifiers from ISO 639-1 to BCP 47
    
    This will allow codes like pt-BR or zh-Hant which is already used
    in at least one longdescription in the Gentoo repository.
    
    Note that the L10N USE_EXPAND and GLEP 42 news items also use BCP 47
    for language names.
    
    Bug: https://bugs.gentoo.org/578294
    Signed-off-by: Ulrich Müller <ulm@gentoo.org>

 glep-0068.rst | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)