Compilation fails with:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 103-104: ordinal not in range(256)

Reproducible: Always

Steps to Reproduce:
1. emerge -1av app-doc/gimp-help
2. Compilation fails with UnicodeEncodeError

Actual Results:
Compilation fails with UnicodeEncodeError: 'latin-1' codec can't encode characters in position 103-104: ordinal not in range(256)

Expected Results:
No error should be encountered.

There are also some things I don't understand here:
Why can't it choose the right locale for each language?
Why doesn't it choose UTF-8 to be able to cope with any language?
Why does it use its own xml2po script, when there is a system-wide xml2po installed?
Why does it choose all possible languages
Languages: ca da de el en en_GB es fi fr it ja ko nl nn pt_BR ro ru zh_CN
when I only have a small subset of them globally enabled?
Why do I already have version 2.8.2 installed without problems (I even had v. 2.0 installed, which I unmerged) and I get this now? What changed so much between versions?

See attached build.log.
Some info ========= Portage 3.0.28 (python 3.9.9-final-0, default/linux/amd64/17.1/hardened, gcc-11.2.0, glibc-2.33-r7, 5.4.168-gentoo x86_64) ================================================================= System uname: Linux-5.4.168-gentoo-x86_64-Intel-R-_Core-TM-_i7-6700HQ_CPU_@_2.60GHz-with-glibc2.33 Timestamp of repository gentoo: Sat, 22 Jan 2022 sh bash 5.1_p8 ld GNU ld (Gentoo 2.37_p1 p0) 2.37 app-misc/pax-utils: 1.3.3::gentoo app-shells/bash: 5.1_p8::gentoo dev-java/java-config: 2.3.1::gentoo dev-lang/perl: 5.34.0-r6::gentoo dev-lang/python: 2.7.18_p13::gentoo, 3.6.15::gentoo, 3.7.12_p1::gentoo, 3.8.12_p1-r1::gentoo, 3.9.9-r1::gentoo, 3.10.0_p1-r1::gentoo dev-lang/rust: 1.58.1::gentoo dev-lang/rust-bin: 1.53.0::gentoo dev-util/cmake: 3.21.4::gentoo dev-util/meson: 0.60.3::gentoo sys-apps/baselayout: 2.7-r3::gentoo sys-apps/openrc: 0.42.1::gentoo sys-apps/sandbox: 2.25::gentoo sys-devel/autoconf: 2.13-r1::gentoo, 2.69-r4::gentoo, 2.71-r1::gentoo sys-devel/automake: 1.11.6-r3::gentoo, 1.12.6::gentoo, 1.13.4-r2::gentoo, 1.14.1::gentoo, 1.15.1-r2::gentoo, 1.16.4::gentoo sys-devel/binutils: 2.37_p1::gentoo sys-devel/binutils-config: 5.4::gentoo sys-devel/clang: 12.0.1::gentoo, 13.0.0::gentoo sys-devel/gcc: 7.5.0::gentoo, 8.3.0-r1::gentoo, 8.4.0::gentoo, 9.3.0::gentoo, 11.2.0::gentoo sys-devel/gcc-config: 2.5-r1::gentoo sys-devel/libtool: 2.4.6-r6::gentoo sys-devel/lld: 13.0.0::gentoo sys-devel/llvm: 12.0.1::gentoo, 13.0.0::gentoo sys-devel/make: 4.3::gentoo sys-kernel/linux-headers: 5.15-r3::gentoo (virtual/os-headers) sys-libs/glibc: 2.33-r7::gentoo
Created attachment 765323 [details] build.log
gimp-help-2.8.x used Python 2.7, and gimp-help-2.10 was patched to use Python 3.x. Python 3 uses UTF-8 by default, and all the documentation *.po files are in UTF-8. Could you please check whether UTF-8 is enabled on your system and properly set up [1]? The use of "latin-1" is strange. I currently can't reproduce this issue on my system.

> Why can't it choose the right locale for each language?
As far as I can see, the *.po files used to generate the documentation are in UTF-8, and xml2po currently processes all languages first and then builds the docs for the chosen ones. Sorry, I have maintained the package only since version 2.10 and don't know the details of the 2.8 build process.

> Why doesn't it choose UTF-8 to be able to cope with any language?
Could you please provide the output of the 'locale -a' and 'cat /usr/src/linux/.config | grep -i UTF' commands?

> Why does it use its own xml2po script, when there is a system-wide xml2po installed?
Maybe it really is worth unbundling the xml2po tool and forcing the build to use the one from the app-text/gnome-doc-utils package. I could try to do that.

> Why does it choose all possible languages
It initially generates some XML for all languages, but in the end it builds the help documentation only for the preferred languages.

[1]: https://wiki.gentoo.org/wiki/UTF-8
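A quick way to answer the "is UTF-8 enabled" question from Python's point of view is to ask the interpreter which encodings it derived from the environment. The following is only a diagnostic sketch (the function name `describe_text_encodings` is made up for illustration); it uses standard-library calls and would need to be run as the same user, and with the same locale variables, as the failing build:

```python
import locale
import sys


def describe_text_encodings() -> dict:
    """Collect the encodings Python derived from the current environment."""
    return {
        # Encoding implied by LC_ALL / LC_CTYPE / LANG (False: do not call setlocale).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing or when output is redirected with '>'.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names and command-line arguments.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_text_encodings().items():
    print(f"{name}: {value}")
```

If the "preferred" or "stdout" value is not a UTF-8 variant, text written to standard output is encoded with that character set instead, which may be where a 'latin-1' codec enters the picture.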
(In reply to Sergey Torokhov from comment #2) > Could you please provide the output of 'locale -a' and 'cat > /usr/src/linux/.config | grep -i UTF' commands? > locale -a C C.utf8 POSIX de_DE de_DE.iso88591 de_DE.iso885915@euro de_DE.utf8 de_DE@euro el_GR el_GR.iso88597 el_GR.utf8 en_US en_US.iso88591 en_US.utf8 es_ES es_ES.iso88591 es_ES.iso885915@euro es_ES.utf8 es_ES@euro fr_FR fr_FR.iso88591 fr_FR.iso885915@euro fr_FR@euro it_IT it_IT.iso88591 it_IT.iso885915@euro it_IT.utf8 it_IT@euro cat /usr/src/linux/.config | grep -i UTF CONFIG_EXFAT_DEFAULT_IOCHARSET="utf8" CONFIG_FAT_DEFAULT_UTF8=y CONFIG_NLS_UTF8=y I do have LINGUAS='en el de it es fr' L10N='en el de it es fr' and the linguas_de linguas_el linguas_en linguas_es linguas_fr linguas_it linguas_ru global USE flags in my /etc/portage/make.conf but I am not sure they are used, because I see no language-specific USE flags in this package - actually, no USE flags at all: equery uses app-doc/gimp-help !!! No USE flags found for app-doc/gimp-help-2.10.0-r1 So, as you see, I would love to have also the ru version too (given that I have set linguas_ru in make.conf), but it does not take it: [DEP] xml/es/.deps.mk [DEP] xml/fr/.deps.mk [DEP] xml/de/.deps.mk [DEP] xml/it/.deps.mk [DEP] xml/el/.deps.mk Of course, maybe there are no ru files there at all. The user root (that runs the emerge command) has the en_US locale: LANG=en_US LC_CTYPE="en_US" LC_NUMERIC="en_US" LC_TIME="en_US" LC_COLLATE=C LC_MONETARY="en_US" LC_MESSAGES="en_US" LC_PAPER="en_US" LC_NAME="en_US" LC_ADDRESS="en_US" LC_TELEPHONE="en_US" LC_MEASUREMENT="en_US" LC_IDENTIFICATION="en_US" LC_ALL= LC_ALL is here empty, but that's correct, as all the rest is filled. UTF-8 is set up correctly here. Browsers, terminals, editors - all use the right locale, depending on the user. Fonts are set up correctly and all is displayed fine. This has been so for years now. root has en_US, users have their own locale, whatever they choose. Everything works as it should. 
I guess the bundled xml2po makes a choice about locale that is wrong. Let's see: At the start, all seems to go fine: cp gimp-keys.xml xml/gimp-keys-en.xml ../tools/xml2po.py -p po/ca.po ./gimp-keys.xml > xml/gimp-keys-ca.xml ../tools/xml2po.py -p po/de.po ./gimp-keys.xml > xml/gimp-keys-de.xml /bin/mkdir -p xml /bin/mkdir -p xml ../tools/xml2po.py -p po/el.po ./gimp-keys.xml > xml/gimp-keys-el.xml ../tools/xml2po.py -p po/fi.po ./gimp-keys.xml > xml/gimp-keys-fi.xml ../tools/xml2po.py -p po/fr.po ./gimp-keys.xml > xml/gimp-keys-fr.xml /bin/mkdir -p xml ../tools/xml2po.py -p po/it.po ./gimp-keys.xml > xml/gimp-keys-it.xml /bin/mkdir -p xml Languages that I choose (in make.conf), like de, it, el, fr are fine, including languages that I did NOT choose (and it should not be touching them actually), like ca, or fi. But then it also continues with Japanese, Korean and similar languages - and that's where the problems start - and where the error occurs: It starts with ja: /bin/mkdir -p xml ../tools/xml2po.py -p po/ja.po ./gimp-keys.xml > xml/gimp-keys-ja.xml Traceback (most recent call last): File "/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/../tools/xml2po.py", line 190, in <module> main(sys.argv[1:]) File "/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/../tools/xml2po.py", line 173, in main xml2po_main.merge(mofile, filenames[0]) File "/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/tools/xml2po/__init__.py", line 597, in merge /bin/mkdir -p xml self.out.write(doc.doc.serialize('utf-8', 1)) UnicodeEncodeError: 'latin-1' codec can't encode characters in position 103-110: ordinal not in range(256) and goes on to ko: ../tools/xml2po.py -p po/ko.po ./gimp-keys.xml > xml/gimp-keys-ko.xml make[1]: *** [Makefile:379: xml/gimp-keys-el.xml] Error 1 make[1]: *** Deleting file 'xml/gimp-keys-el.xml' make[1]: *** Waiting for unfinished jobs.... 
Traceback (most recent call last): File "/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/../tools/xml2po.py", line 190, in <module> main(sys.argv[1:]) File "/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/../tools/xml2po.py", line 173, in main xml2po_main.merge(mofile, filenames[0]) File "/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/tools/xml2po/__init__.py", line 597, in merge self.out.write(doc.doc.serialize('utf-8', 1)) UnicodeEncodeError: 'latin-1' codec can't encode characters in position 126-130: ordinal not in range(256) where it breaks with an error about gimp-keys-el.xml, which means about el... All this does not make any sense (you get an error about el intermixed with output from ko) and indicates that maybe parallel execution is the problem. For this reason I had already advised portage to NOT use ninja for app-doc/gimp-help. Still, the problem persisted. Now I see this make -j6 at the start of compilation in build.log. Could that be the problem? I thus decided to run it with -j1: MAKEOPTS='-j1' emerge -av app-doc/gimp-help and now, the error is much more clear: >>> Compiling source in /XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0 ... make -j1 [SRC] src/preface/authors.xml [DEP] xml/fr/.deps.mk [DEP] xml/es/.deps.mk [DEP] xml/it/.deps.mk [DEP] xml/de/.deps.mk [DEP] xml/el/.deps.mk [SRC] src/preface/authors.xml Making all in quickreference make[1]: Entering directory '/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference' /bin/mkdir -p xml ../tools/xml2po.py -p po/ca.po ./gimp-keys.xml > xml/gimp-keys-ca.xml /bin/mkdir -p svg /usr/bin/xsltproc \ ./stylesheets/keys-svg.xsl \ xml/gimp-keys-ca.xml \ > svg/gimp-keys-ca.svg xml/gimp-keys-ca.xml:3: parser error : Input is not proper UTF-8, indicate encoding ! 
Bytes: 0xE8 0x6E 0x63 0x69
<title>Refer�ncia r�pida del GIMP</title>
^
unable to parse xml/gimp-keys-ca.xml
make[1]: *** [Makefile:390: svg/gimp-keys-ca.svg] Error 6
make[1]: *** Deleting file 'svg/gimp-keys-ca.svg'
rm xml/gimp-keys-ca.xml
make[1]: Leaving directory '/XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference'
make: *** [Makefile:570: all-recursive] Error 1

I see multiple problems here:
- Maybe MAKEOPTS should be overridden to '-j1' in the ebuild, in order to get proper errors (and possibly to be able to use one locale at a time? not sure...). But maybe it will also work with other values, once the error is found...
- Maybe xml/gimp-keys-ca.xml really has an encoding problem - I didn't check.
- The ca file should actually not be chosen at all. I don't have ca in the list of my locales, neither in make.conf nor in the system - as you see above. Trying to translate from xml to po for locales that the user did not choose is asking for trouble, because you cannot assume that the system will deal fine with them.

I did a test with the xml2po file that is installed in the system:
xml2po -p /XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/po/ca.po /XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/gimp-keys.xml > gimp-keys-ca.xml
and this worked without errors. I then proceeded to the next command (see output above):
/usr/bin/xsltproc /XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/stylesheets/keys-svg.xsl gimp-keys-ca.xml > gimp-keys-ca.svg
gimp-keys-ca.xml:3: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xE8 0x6E 0x63 0x69
<title>Refer�ncia r�pida del GIMP</title>
^
unable to parse gimp-keys-ca.xml

So even with the system-wide xml2po, the gimp-keys-ca.xml created above is not O.K.: its 3rd line is not UTF-8 (at least the xsltproc parser says so), while its first line indicates UTF-8 encoding:
<?xml version="1.0" encoding="utf-8"?>
To make sure that nothing is stale, I rebuilt dev-libs/libxslt (for xsltproc) and app-text/gnome-doc-utils (for xml2po) and retried the above - but the error persisted. Therefore, there is a problem in the creation of gimp-keys-ca.xml, and maybe in the creation of other gimp-keys-XX.xml files (possibly el too).

Talking about el: doing
xml2po -p /XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/po/el.po /XXXXXX/portage/app-doc/gimp-help-2.10.0-r1/work/gimp-help-2.10.0/quickreference/gimp-keys.xml > gimp-keys-el.xml
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.10/xml2po", line 191, in <module>
    main(sys.argv[1:])
  File "/usr/lib/python-exec/python3.10/xml2po", line 174, in main
    xml2po_main.merge(mofile, filenames[0])
  File "/usr/lib/python3.10/site-packages/xml2po/__init__.py", line 611, in merge
    self.out.write(doc.doc.serialize('utf-8', 1))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 103-110: ordinal not in range(256)
so here we cannot even create gimp-keys-XX.xml for XX=el (something that worked with XX=ca above) using the system-wide xml2po... It used python 3.10 for user root above, but this should not make any difference.

You may want to try the above commands on your own system and tell me if there is any difference. I hope I could help you with all this information.
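The bytes reported by xsltproc fit one possible reading: the Catalan title was written in ISO-8859-1 into a file whose header declares UTF-8. Catalan accented letters exist in ISO-8859-1, so the write would succeed silently, whereas Greek letters do not, so the write for el fails outright. The sketch below checks this with plain Python; the title string is a reconstruction from the log (the two unreadable characters are assumed to be 'è' and 'à'), not a value read from ca.po:

```python
# Reconstructed title (assumption): the unreadable characters are taken to be 'è' and 'à'.
title = "Referència ràpida del GIMP"

latin1_bytes = title.encode("latin-1")   # what an ISO-8859-1 output stream would write
utf8_bytes = title.encode("utf-8")       # what the XML declaration promises


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Show the four bytes starting at the first non-ASCII character, as xsltproc does.
start = latin1_bytes.index(0xE8)
window = " ".join(f"0x{b:02X}" for b in latin1_bytes[start:start + 4])

print(window)
print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))
```

The four bytes printed for the ISO-8859-1 version are the same ones xsltproc complains about, and that version does not decode as UTF-8, while the UTF-8 version does.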
> linguas_de linguas_el linguas_en linguas_es linguas_fr linguas_it linguas_ru
> global USE flags in my /etc/portage/make.conf
> ...
> So, as you see, I would love to have also the ru version too
> (given that I have set linguas_ru in make.conf), but it does not take it:
>
> [DEP] xml/es/.deps.mk
> [DEP] xml/fr/.deps.mk
> [DEP] xml/de/.deps.mk
> [DEP] xml/it/.deps.mk
> [DEP] xml/el/.deps.mk
>
> Of course, maybe there are no ru files there at all.

As far as I know, there is no need for USE flags like 'linguas_fr'; it is sufficient to specify the languages like this:
LINGUAS='en el de it es fr'
L10N="${LINGUAS}"
Sometimes it can be useful to add something like 'de_DE' to LINGUAS.

'ru' documentation is present in the package. As far as I can see from the build process, it processes all available languages for 'quickreference' and then builds the common documentation for the languages specified in the LINGUAS variable, if available.

Unfortunately, excluding some languages from the quickreference processing will not solve the problem if the language is in LINGUAS. At least, the same known "problem" symbols 'è' and 'à' are present in 'fr.po' and 'it.po' too.

Earlier there was a bug, https://bugs.gentoo.org/677198, with a different problem related to multithreading. But the current issue may be related to the Python 3 port. The question is why it is forced to use the latin-1 encoding, and under what conditions that happens. At least I can't reproduce the issue. I'm not sure it is related to the bundled xml2po, but I need to check how the external xml2po processes these files for me.
(In reply to Sergey Torokhov from comment #4) > But current issue maybe is related to python3 porting. The > quiestion is why it forced to use latin-1 encoding and what is conditions > when it takes place? At least I can't reproduced issue. I'm not sure it > related to internal xml2po but I need to check how it process these files > for me with external xml2to. I copied the el.po gimp-keys.xml somewhere, made them readable by anyone and tried: xml2po -p /XXXXXX/el.po /XXXXXX/gimp-keys.xml > gimp-keys-el.xml When this is issued by a user with the el_GR locale, it works without errors. When it is issued by root, it breaks with the error Traceback (most recent call last): File "/usr/lib/python-exec/python3.10/xml2po", line 191, in <module> main(sys.argv[1:]) File "/usr/lib/python-exec/python3.10/xml2po", line 174, in main xml2po_main.merge(mofile, filenames[0]) File "/usr/lib/python3.10/site-packages/xml2po/__init__.py", line 611, in merge self.out.write(doc.doc.serialize('utf-8', 1)) UnicodeEncodeError: 'latin-1' codec can't encode characters in position 103-110: ordinal not in range(256) The same error occurs when I run it as a normal user with the en_US locale. I then tried: LC_ALL='fr_FR' xml2po -p /XXXXXX/el.po /XXXXXX/gimp-keys.xml while being the user with the en_US locale. This again brought the error above. But doing LC_ALL='el_GR' xml2po -p /XXXXXX/el.po /XXXXXX/gimp-keys.xml worked! Therefore, in my system, running xml2po with a locale other than the "right" one (el_GR for el.po, for example) forces latin-1 locale. It is interesting that I get UnicodeEncodeError: 'latin-1' codec can't encode characters in position 103-110: ordinal not in range(256) even when I specifically set the locale to some value, e.g. LC_ALL='fr_FR' xml2po -p /XXXXXX/el.po /XXXXXX/gimp-keys.xml For some reason, it must be "el_GR" when I try to process el.po. This is either xml2po specific, or it has indeed to do with the way python scripts use locale in Gentoo. 
So now I am curious if this happens to you too...
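The pattern in these experiments (el_GR works, fr_FR and en_US fail) would be consistent with the character set behind each locale name. The /etc/locale.gen listing quoted later in this thread maps el_GR to ISO-8859-7 and en_US and fr_FR to ISO-8859-1. A small Python sketch, using made-up sample strings rather than the actual contents of el.po or ja.po, shows which of those character sets can represent which text:

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample strings, chosen only for illustration.
greek = "Γρήγορη αναφορά"
japanese = "クイックリファレンス"

charsets = ("iso8859-1", "iso8859-7", "utf-8")
greek_results = {cs: can_encode(greek, cs) for cs in charsets}
japanese_results = {cs: can_encode(japanese, cs) for cs in charsets}

print(greek_results)
print(japanese_results)
```

Under this reading, nothing "forces" latin-1: a locale without an explicit UTF-8 suffix simply selects its legacy character set, and only a locale whose character set happens to cover the target language lets the write succeed. UTF-8 is the only one of the three that covers both samples.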
Similar to bug 707026 and the others, I guess.
(In reply to Sam James from comment #6)
> Similar to bug 707026 and the others, I guess.

Right on the spot, Sam! :-) Building with
LC_ALL=en_US.UTF-8 emerge -1av app-doc/gimp-help
succeeded!

@Sergey I guess the reason why you could not reproduce it is that your root has a UTF-8 locale, while my root has "en_US", which is mapped to ISO-8859-1 ('latin-1') in my /etc/locale.gen:
en_US ISO-8859-1
en_US.UTF-8 UTF-8
fr_FR ISO-8859-1
fr_FR@euro ISO-8859-15
el_GR.UTF-8 UTF-8
el_GR ISO-8859-7
it_IT.UTF-8 UTF-8
it_IT ISO-8859-1
it_IT@euro ISO-8859-15
es_ES.UTF-8 UTF-8
es_ES ISO-8859-1
de_DE.UTF-8 UTF-8
de_DE ISO-8859-1
de_DE@euro ISO-8859-15
es_ES@euro ISO-8859-15
Since emerge runs as root, it takes whatever locale root has...

@Sam Maybe there is a need for a mechanism to set a UTF-8 locale whenever one deals with documentation packages, even if root has 'latin-1'. Maybe this could be done in some eclass to be inherited by all such doc packages.

It is interesting that xml2po will use the user locale even in situations where it is clear (by "static analysis", so to say...) that it will fail. I mean, what is the sense of insisting on using 'latin-1' (just because the user has it) when you want to process el.po, which is going to *need* el_GR.UTF-8? Of course, this poses the question of which UTF-8 locale to choose - en_US.UTF-8, el_GR.UTF-8, fr_FR.UTF-8...? But it seems that even a "heuristic" value of "en_US.UTF-8" will do. So maybe the eclass could have some "factory" that instantiates an "en_US.UTF-8" locale object by default, unless the ebuild maintainer passes (through a parameter) a different locale to use.
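On the question of why a script follows the user's locale at all: in Python 3 a text stream encodes with whatever encoding it was opened with, and standard output normally takes that from the locale. The sketch below is not xml2po's code; it is a hypothetical helper (`write_text`) that contrasts a stream opened with a locale-style 'latin-1' encoding with one that names UTF-8 explicitly, which is one way a tool could make its output independent of the caller's locale:

```python
import io


def write_text(text: str, encoding: str) -> bytes:
    """Write text through a text stream with the given encoding, return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)   # raises UnicodeEncodeError if the charset lacks a character
    stream.flush()
    return raw.getvalue()


sample = "Ελληνικά"  # hypothetical sample text, not taken from el.po

# An explicitly UTF-8 stream accepts the text whatever the locale is.
utf8_output = write_text(sample, "utf-8")

# A stream that inherited ISO-8859-1 from the locale rejects it.
try:
    write_text(sample, "latin-1")
    latin1_failed = False
except UnicodeEncodeError as exc:
    latin1_failed = True
    print(exc.reason)
```

Exporting a UTF-8 locale for the build, as done with LC_ALL=en_US.UTF-8 above, reaches the same result from the outside without touching the script.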
In fact, I already had a /etc/portage/env/utf8clocale.conf with the sole line
LC_ALL="en_US.UTF-8"
and also a /etc/portage/package.env/utf8clocale which contained:
net-libs/telepathy-glib utf8clocale.conf
app-text/gnome-doc-utils utf8clocale.conf
to which I will now have to add
app-doc/gimp-help utf8clocale.conf
and all will work without any intervention by me the next time I upgrade (~700 days from now... :-))). Since I upgrade very infrequently, I had even forgotten its existence...
@Sam James, thank you for the issue reference!

@segmentation fault, I missed that your "locale" output is en_US rather than en_US.UTF-8. With
LC_CTYPE="en_US" emerge -1 app-doc/gimp-help
the issue is reproducible for me. With LC_CTYPE="C" the issue isn't reproduced, but I'm not sure that it doesn't fall back to "C.UTF-8". I will add
export LC_CTYPE="C.UTF-8"
to the src_compile() phase. I assume this will fix the issue.
As for unbundling xml2po in order to use the one from the external app-text/gnome-doc-utils package, I don't see the point: gnome-doc-utils is an archived project, and the bundled xml2po isn't installed and isn't used as a runtime dependency. So if QA doesn't mind, I will leave it as is. Actually, I would like upstream to provide pre-built help documentation.
The bug has been closed via the following commit(s): https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=e52e53152c51cb41521c3b4df36d51a12b258496 commit e52e53152c51cb41521c3b4df36d51a12b258496 Author: Sergey Torokhov <torokhov-s-a@yandex.ru> AuthorDate: 2022-02-26 23:19:31 +0000 Commit: Sam James <sam@gentoo.org> CommitDate: 2022-02-27 00:52:55 +0000 app-doc/gimp-help: add python_export_utf8_locale Closes: https://bugs.gentoo.org/833566 Signed-off-by: Sergey Torokhov <torokhov-s-a@yandex.ru> Closes: https://github.com/gentoo/gentoo/pull/24290 Signed-off-by: Sam James <sam@gentoo.org> app-doc/gimp-help/gimp-help-2.10.0-r2.ebuild | 7 +++++++ 1 file changed, 7 insertions(+)
(In reply to segmentation fault from comment #7) > @Sam Maybe there is a need for a mechanism to set a UTF-8 locale whenever > one deals with documentation packages, even if root has 'latin-1'. Maybe > this could be done in some eclass to be inherited by all such doc packages. > I reckon we need to go around and shove python_export_utf8_locale into a bunch of packages (from python-utils-r1.eclass), maybe?