There are 2 copies of code which mangles non-UTF-8 filenames:
- lib/portage/package/ebuild/doebuild.py:_post_src_install_uid_fix
  https://gitweb.gentoo.org/proj/portage.git/tree/lib/portage/package/ebuild/doebuild.py?id=3ebe48e61a02cb00c3bb2366e50b4c83ef390ecb#n2085
- lib/portage/dbapi/vartree.py:dblink.treewalk
  https://gitweb.gentoo.org/proj/portage.git/tree/lib/portage/dbapi/vartree.py?id=3ebe48e61a02cb00c3bb2366e50b4c83ef390ecb#n3916

(For non-binary packages, the first copy performs all of the mangling, and the second copy operates on already mangled filenames and does not seem to do more damage.)

    ...
    try:
        parent = _unicode_decode(parent,
            encoding=_encodings['merge'], errors='strict')
    except UnicodeDecodeError:
        new_parent = _unicode_decode(parent,
            encoding=_encodings['merge'], errors='replace')
        new_parent = _unicode_encode(new_parent,
            encoding='ascii', errors='backslashreplace')
        new_parent = _unicode_decode(new_parent,
            encoding=_encodings['merge'], errors='replace')
        os.rename(parent, new_parent)
        ...
    for fname in chain(dirs, files):
        try:
            fname = _unicode_decode(fname,
                encoding=_encodings['merge'], errors='strict')
        except UnicodeDecodeError:
            fpath = _os.path.join(
                parent.encode(_encodings['merge']), fname)
            new_fname = _unicode_decode(fname,
                encoding=_encodings['merge'], errors='replace')
            new_fname = _unicode_encode(new_fname,
                encoding='ascii', errors='backslashreplace')
            new_fname = _unicode_decode(new_fname,
                encoding=_encodings['merge'], errors='replace')
            new_fpath = os.path.join(parent, new_fname)
            os.rename(fpath, new_fpath)
            ...

Any byte invalid in UTF-8 is replaced by the literal ASCII text \ufffd (the backslash escape of U+FFFD REPLACEMENT CHARACTER), so the original byte value is lost.

This results in at least 4 problems:

1. Symbolic links pointing to non-UTF-8 filenames point to files that no longer exist.
   See: non-utf-8_files_test-1.ebuild
2. When >=2 non-directory files (e.g. regular files) or >=2 empty directories (or at least all non-first are empty) have names differing only in bytes that are invalid in UTF-8, the last one silently replaces the others.
   See: non-utf-8_files_test-2.ebuild
3. When >=2 non-empty directories have names differing only in bytes that are invalid in UTF-8, an "OSError: [Errno 39] Directory not empty" exception occurs and a PDB prompt is opened.
   See: non-utf-8_files_test-3.ebuild
4. When a non-directory and a directory have names differing only in bytes that are invalid in UTF-8, an "IsADirectoryError: [Errno 21] Is a directory" exception occurs and a PDB prompt is opened.
   See: non-utf-8_files_test-4.ebuild
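The collision behind problems 2 to 4 can be reproduced outside Portage. The following is a minimal sketch, assuming that _encodings['merge'] is UTF-8 (as the tracebacks in this report suggest) and using plain bytes.decode / str.encode in place of Portage's _unicode_decode / _unicode_encode wrappers; the mangle() helper is hypothetical and only mirrors the three steps from the except branch quoted above.

```python
def mangle(name: bytes, merge_encoding: str = "utf_8") -> str:
    """Mirror the decode/encode/decode chain from the except branch.

    Hypothetical helper for illustration; not a Portage function.
    """
    # Step 1: every invalid byte becomes U+FFFD.
    text = name.decode(merge_encoding, errors="replace")
    # Step 2: U+FFFD is not ASCII, so it is written out as the
    # six ASCII characters backslash, u, f, f, f, d.
    ascii_bytes = text.encode("ascii", errors="backslashreplace")
    # Step 3: pure ASCII decodes without further changes.
    return ascii_bytes.decode(merge_encoding, errors="replace")


# Three distinct on-disk names, as in non-utf-8_files_test-2.ebuild.
names = [b"b\x81", b"b\x82", b"b\x83"]
mangled = [mangle(n) for n in names]

for original, new in zip(names, mangled):
    print(original, "->", new)

# All three names collapse into a single rename target.
print("distinct targets:", len(set(mangled)))
```

Because each os.rename() call uses the same target, the outcome depends only on the file types involved: a file overwrites a file, while a directory target that is not empty or a file/directory mismatch raises the exceptions shown in tests 3 and 4.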
Created attachment 584138
non-utf-8_files_test-1.ebuild

Output:

=== src_install: ls ${ED}/usr/share/non-utf-8 ===
total 4.0K
lrwxrwxrwx 1 root root 2 Jul 23 04:05 aa -> 'a'$'\200'
-rw-r--r-- 1 root root 2 Jul 23 04:05 'a'$'\200'
=== src_install: cat ${ED}/usr/share/non-utf-8/aa ===
1
 * QA Notice: This package installs one or more file names containing
 * characters that are not encoded with the UTF-8 encoding.
 *
 * 	usr/share/non-utf-8/a\ufffd
 *
=== pkg_preinst: ls ${ED}/usr/share/non-utf-8 ===
total 4.0K
-rw-r--r-- 1 root root 2 Jul 23 04:05 a\ufffd
lrwxrwxrwx 1 root root 2 Jul 23 04:05 aa -> a�
=== pkg_preinst: cat ${ED}/usr/share/non-utf-8/aa ===
cat: /var/tmp/portage/app-misc/non-utf-8_files_test-1/image/usr/share/non-utf-8/aa: No such file or directory

Although the actual target of the symbolic link is not changed, the value stored in the VDB is mangled:

# cat /var/db/pkg/app-misc/non-utf-8_files_test-1/CONTENTS
dir /usr
dir /usr/share
dir /usr/share/non-utf-8
obj /usr/share/non-utf-8/a\ufffd b026324c6904b2a9cb4b88d6d61c81d1 1563847540
sym /usr/share/non-utf-8/aa -> a\ufffd 1563847541
Created attachment 584140
non-utf-8_files_test-2.ebuild

Output:

=== src_install: ls ${ED}/usr/share/non-utf-8 ===
total 24K
-rw-r--r-- 1 root root    2 Jul 23 04:49 'b'$'\201'
-rw-r--r-- 1 root root    2 Jul 23 04:49 'b'$'\202'
-rw-r--r-- 1 root root    2 Jul 23 04:49 'b'$'\203'
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 'c'$'\204'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 'c'$'\205'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 'c'$'\206'/
=== src_install: cat ${ED}/usr/share/non-utf-8/b* ===
1
2
3
 * QA Notice: This package installs one or more file names containing
 * characters that are not encoded with the UTF-8 encoding.
 *
 * 	usr/share/non-utf-8/b\ufffd
 * 	usr/share/non-utf-8/b\ufffd
 * 	usr/share/non-utf-8/b\ufffd
 * 	usr/share/non-utf-8/c\ufffd
 * 	usr/share/non-utf-8/c\ufffd
 * 	usr/share/non-utf-8/c\ufffd
 *
=== pkg_preinst: ls ${ED}/usr/share/non-utf-8 ===
total 8.0K
-rw-r--r-- 1 root root    2 Jul 23 04:49 b\ufffd
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 c\ufffd/
=== pkg_preinst: cat ${ED}/usr/share/non-utf-8/b* ===
3

# cat /var/db/pkg/app-misc/non-utf-8_files_test-2/CONTENTS
dir /usr
dir /usr/share
dir /usr/share/non-utf-8
dir /usr/share/non-utf-8/c\ufffd
obj /usr/share/non-utf-8/b\ufffd 6d7fce9fee471194aa8b5b6e47267f03 1563850172
Created attachment 584142
non-utf-8_files_test-3.ebuild

Output:

=== src_install: ls ${ED}/usr/share/non-utf-8 ===
total 12K
drwxr-xr-x 2 root root 4.0K Jul 23 04:51 'd'$'\207'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:51 'd'$'\210'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:51 'd'$'\211'/
=== src_install: cat ${ED}/usr/share/non-utf-8/d*/x ===
1
2
3
Exception in callback AsynchronousTask.wait()
handle: <Handle AsynchronousTask.wait()>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2113, in _post_src_install_uid_fix
    encoding=_encodings['merge'], errors='strict')
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 189, in _unicode_decode
    s = str(s, encoding=encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/events.py", line 127, in _run
    self._callback(*self._args)
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 84, in wait
    self._wait_hook()
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 195, in _wait_hook
    self._exit_listener_stack.pop()(self)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 205, in _ebuild_exit
    self._ebuild_exit_unlocked(ebuild_process)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 265, in _ebuild_exit_unlocked
    _post_src_install_uid_fix(settings, out)
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2124, in _post_src_install_uid_fix
    os.rename(fpath, new_fpath)
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 246, in __call__
    rval = self._func(*wrapped_args, **wrapped_kwargs)
OSError: [Errno 39] Directory not empty: b'/var/tmp/portage/app-misc/non-utf-8_files_test-3/image/usr/share/non-utf-8/d\x88' -> b'/var/tmp/portage/app-misc/non-utf-8_files_test-3/image/usr/share/non-utf-8/d\\ufffd'
--Return--
> /usr/lib64/python3.6/site-packages/portage/util/_eventloop/asyncio_event_loop.py(81)_internal_caller_exception_handler()->None
-> pdb.set_trace()
(Pdb)
Created attachment 584144
non-utf-8_files_test-4.ebuild

Output:

=== src_install: ls ${ED}/usr/share/non-utf-8 ===
total 8.0K
-rw-r--r-- 1 root root    2 Jul 23 04:54 'e'$'\220'
drwxr-xr-x 2 root root 4.0K Jul 23 04:54 'e'$'\221'/
=== src_install: cat ${ED}/usr/share/non-utf-8/e* ===
1
cat: '/var/tmp/portage/app-misc/non-utf-8_files_test-4/image/usr/share/non-utf-8/e'$'\221': Is a directory
Exception in callback AsynchronousTask.wait()
handle: <Handle AsynchronousTask.wait()>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2113, in _post_src_install_uid_fix
    encoding=_encodings['merge'], errors='strict')
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 189, in _unicode_decode
    s = str(s, encoding=encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/events.py", line 127, in _run
    self._callback(*self._args)
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 84, in wait
    self._wait_hook()
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 195, in _wait_hook
    self._exit_listener_stack.pop()(self)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 205, in _ebuild_exit
    self._ebuild_exit_unlocked(ebuild_process)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 265, in _ebuild_exit_unlocked
    _post_src_install_uid_fix(settings, out)
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2124, in _post_src_install_uid_fix
    os.rename(fpath, new_fpath)
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 246, in __call__
    rval = self._func(*wrapped_args, **wrapped_kwargs)
IsADirectoryError: [Errno 21] Is a directory: b'/var/tmp/portage/app-misc/non-utf-8_files_test-4/image/usr/share/non-utf-8/e\x90' -> b'/var/tmp/portage/app-misc/non-utf-8_files_test-4/image/usr/share/non-utf-8/e\\ufffd'
--Return--
> /usr/lib64/python3.6/site-packages/portage/util/_eventloop/asyncio_event_loop.py(81)_internal_caller_exception_handler()->None
-> pdb.set_trace()
(Pdb)
Python 2 does not provide a built-in solution for lossless decoding of invalid bytes.
Python 2 will be discontinued by upstream soon, on 2020-01-01.
Python 2 will probably remain in some form in Gentoo for several years to support some Python-2-only packages, but Portage itself does not need to keep supporting Python 2 for much longer.
Therefore a minimal solution for the problematic code in Portage under Python 2 might be sufficient (at least detecting collisions (Problem #2) and aborting).

Python 3 provides the surrogateescape error handler for lossless decoding of invalid bytes.
Therefore I suggest that the problematic code in Portage under Python 3 be fixed in the following way:
- Use surrogateescape in appropriate places
- Continue printing the QA warning about invalid UTF-8 bytes
- Stop renaming files
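A minimal sketch of both suggestions, written for Python 3 only and assuming UTF-8 as the merge encoding. All helper names (decode_lossless, encode_lossless, is_valid, final_name, find_collisions) are hypothetical and are not existing Portage functions; the collision check simply predicts which names the current renaming code would map to the same target.

```python
from collections import defaultdict


def decode_lossless(name: bytes) -> str:
    # Invalid bytes become lone surrogates U+DC80..U+DCFF instead of U+FFFD.
    return name.decode("utf-8", errors="surrogateescape")


def encode_lossless(name: str) -> bytes:
    # The same handler turns those surrogates back into the original bytes.
    return name.encode("utf-8", errors="surrogateescape")


def is_valid(name: bytes) -> bool:
    # Strict decoding is still usable to decide whether to print a QA warning.
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def final_name(name: bytes) -> str:
    # Name the file would end up with under the current renaming code:
    # valid names are kept, invalid ones go through the mangling chain.
    if is_valid(name):
        return name.decode("utf-8")
    text = name.decode("utf-8", errors="replace")
    return text.encode("ascii", errors="backslashreplace").decode("utf-8")


def find_collisions(names):
    # Group original names by their final name; keep groups of 2 or more.
    groups = defaultdict(list)
    for name in names:
        groups[final_name(name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


raw_names = [b"b\x81", b"b\x82", b"ok"]

# Lossless round trip: nothing needs to be renamed on disk.
decoded = [decode_lossless(n) for n in raw_names]
restored = [encode_lossless(d) for d in decoded]

# Names that would still trigger the QA warning.
qa_warnings = [n for n in raw_names if not is_valid(n)]

# Names that the current code would merge into one file.
collisions = find_collisions(raw_names)
print(collisions)
```

With surrogateescape the decoded strings stay distinct, so the round trip restores the exact bytes. The collision check is the kind of detect-and-abort fallback that could stand in where lossless decoding is not available.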
About decoding / encoding: a good generic solution might be to use _unicode_decode(..., errors="surrogateescape") / _unicode_encode(..., errors="surrogateescape") in dozens or hundreds of places, and to make the Python-2-specific implementations [1] of these 2 functions do something special when errors="surrogateescape" is received.

[1] https://gitweb.gentoo.org/proj/portage.git/tree/lib/portage/__init__.py?id=3ebe48e61a02cb00c3bb2366e50b4c83ef390ecb#n193
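One possible shape for "something special" is a custom codec error handler registered with codecs.register_error(). The sketch below is written and checked against Python 3 only, where it can be compared with the built-in handler; whether the same approach carries over unchanged to Python 2 is an untested assumption. The handler name "hypothetical_surrogateescape" and the function _escape_invalid_bytes are made up for illustration.

```python
import codecs


def _escape_invalid_bytes(exc):
    """Decode-side emulation of surrogateescape (illustrative only)."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch covers decoding only; encoding would need the
        # reverse mapping from U+DC80..U+DCFF back to single bytes.
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Map each undecodable byte 0x80..0xFF to the lone surrogate
    # U+DC80..U+DCFF, then resume decoding after the bad range.
    return "".join(chr(0xDC00 + byte) for byte in bad), exc.end


# Registered under a made-up name so the built-in handler stays untouched.
codecs.register_error("hypothetical_surrogateescape", _escape_invalid_bytes)

sample = b"d\x88\xff"
emulated = sample.decode("utf-8", errors="hypothetical_surrogateescape")
builtin = sample.decode("utf-8", errors="surrogateescape")
print(emulated == builtin)
```

A wrapper such as _unicode_decode could then translate errors="surrogateescape" into the registered handler name on interpreters that lack the built-in one.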
For CONTENTS, we should probably use the GLEP 74 path encoding method so that package managers can use the same implementation for Manifest files: https://www.gentoo.org/glep/glep-0074.html#path-and-filename-encoding
(In reply to Zac Medico from comment #7)
> For CONTENTS, we should probably use the GLEP 74 path encoding method so
> that package managers can use the same implementation for Manifest files:
>
> https://www.gentoo.org/glep/glep-0074.html#path-and-filename-encoding

Actually, the GLEP 74 path encoding method is not intended to encode/decode arbitrary byte strings; it only aims to encode/decode special Unicode characters. However, it is still similar to what we would want for encoding arbitrary byte strings in CONTENTS, since we would want it to encode at least ASCII space and backslash characters.
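To make the idea concrete, here is a rough sketch of a byte-level escaping scheme for CONTENTS paths. The \xHH format and both function names are hypothetical: this is neither the GLEP 74 encoding nor anything Portage currently writes. It only illustrates that escaping space and backslash (plus every byte outside printable ASCII) keeps space-separated CONTENTS fields unambiguous and allows an exact round trip.

```python
import re


def escape_path_bytes(path: bytes) -> str:
    """Escape a raw path as ASCII text (hypothetical format)."""
    out = []
    for byte in path:
        # Keep printable ASCII except space (0x20) and backslash (0x5C).
        if 0x21 <= byte <= 0x7E and byte != 0x5C:
            out.append(chr(byte))
        else:
            out.append("\\x%02x" % byte)
    return "".join(out)


_ESCAPE_RE = re.compile(r"\\x([0-9a-f]{2})")


def unescape_path_bytes(text: str) -> bytes:
    """Reverse escape_path_bytes()."""
    # Every backslash in escaped text starts a \xHH sequence, because
    # literal backslashes were themselves escaped as \x5c.
    restored = _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Code points 0..255 map one-to-one onto bytes via latin-1.
    return restored.encode("latin-1")


raw = b"/usr/share/non-utf-8/a\x80 b\\c"
escaped = escape_path_bytes(raw)
print(escaped)
print(unescape_path_bytes(escaped) == raw)
```

Since the escaped form contains no spaces, a line such as "obj <path> <md5> <mtime>" can still be split on whitespace.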