Bug 690480 - Broken handling of non-UTF-8 files when merging from ${D} to ${ROOT}
Summary: Broken handling of non-UTF-8 files when merging from ${D} to ${ROOT}
Status: CONFIRMED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core
Hardware: All All
Importance: Normal normal
Assignee: Portage team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-23 03:00 UTC by Arfrever Frehtes Taifersar Arahesis
Modified: 2019-07-24 21:53 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
non-utf-8_files_test-1.ebuild (non-utf-8_files_test-1.ebuild,716 bytes, text/plain)
2019-07-23 03:05 UTC, Arfrever Frehtes Taifersar Arahesis
non-utf-8_files_test-2.ebuild (non-utf-8_files_test-2.ebuild,895 bytes, text/plain)
2019-07-23 03:09 UTC, Arfrever Frehtes Taifersar Arahesis
non-utf-8_files_test-3.ebuild (non-utf-8_files_test-3.ebuild,909 bytes, text/plain)
2019-07-23 03:12 UTC, Arfrever Frehtes Taifersar Arahesis
non-utf-8_files_test-4.ebuild (non-utf-8_files_test-4.ebuild,713 bytes, text/plain)
2019-07-23 03:15 UTC, Arfrever Frehtes Taifersar Arahesis

Description Arfrever Frehtes Taifersar Arahesis 2019-07-23 03:00:23 UTC
There are 2 copies of the code that mangles non-UTF-8 filenames:

- lib/portage/package/ebuild/doebuild.py:_post_src_install_uid_fix
  https://gitweb.gentoo.org/proj/portage.git/tree/lib/portage/package/ebuild/doebuild.py?id=3ebe48e61a02cb00c3bb2366e50b4c83ef390ecb#n2085

- lib/portage/dbapi/vartree.py:dblink.treewalk
  https://gitweb.gentoo.org/proj/portage.git/tree/lib/portage/dbapi/vartree.py?id=3ebe48e61a02cb00c3bb2366e50b4c83ef390ecb#n3916

(For non-binary packages, the first copy performs all of the mangling; the second copy then operates on already-mangled filenames and does not seem to do further damage.)


...

try:
	parent = _unicode_decode(parent,
		encoding=_encodings['merge'], errors='strict')
except UnicodeDecodeError:
	new_parent = _unicode_decode(parent,
		encoding=_encodings['merge'], errors='replace')
	new_parent = _unicode_encode(new_parent,
		encoding='ascii', errors='backslashreplace')
	new_parent = _unicode_decode(new_parent,
		encoding=_encodings['merge'], errors='replace')
	os.rename(parent, new_parent)
...
for fname in chain(dirs, files):
	try:
		fname = _unicode_decode(fname,
			encoding=_encodings['merge'], errors='strict')
	except UnicodeDecodeError:
		fpath = _os.path.join(
			parent.encode(_encodings['merge']), fname)
		new_fname = _unicode_decode(fname,
			encoding=_encodings['merge'], errors='replace')
		new_fname = _unicode_encode(new_fname,
			encoding='ascii', errors='backslashreplace')
		new_fname = _unicode_decode(new_fname,
			encoding=_encodings['merge'], errors='replace')
		new_fpath = os.path.join(parent, new_fname)
		os.rename(fpath, new_fpath)
...


Any byte that is invalid in UTF-8 is replaced by the 6-character ASCII sequence \ufffd, i.e. the backslash escape of U+FFFD (REPLACEMENT CHARACTER).
This results in at least 4 problems:

1. Symbolic links pointing to non-UTF-8 filenames point to no longer existing files.
   See: non-utf-8_files_test-1.ebuild

2. When >=2 non-directory files (e.g. regular files) or >=2 empty directories (or at least all but the first are empty) have names that differ only in bytes invalid in UTF-8, the last one silently replaces the others.
   See: non-utf-8_files_test-2.ebuild

3. When >=2 non-empty directories have names that differ only in bytes invalid in UTF-8, an "OSError: [Errno 39] Directory not empty" exception occurs and a PDB prompt is opened.
   See: non-utf-8_files_test-3.ebuild

4. When a non-directory and a directory have names that differ only in bytes invalid in UTF-8, an "IsADirectoryError: [Errno 21] Is a directory" exception occurs and a PDB prompt is opened.
   See: non-utf-8_files_test-4.ebuild
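
The collapse behind these problems can be reproduced outside Portage. The following is only a sketch: mangle() is a hypothetical stand-in for the decode/encode/decode chain in the excerpt above, and it assumes that _encodings['merge'] is 'utf-8' (as the tracebacks in the later comments suggest).

```python
def mangle(name, encoding="utf-8"):
    """Hypothetical stand-in for the rename logic quoted above."""
    try:
        # Valid names pass through unchanged.
        return name.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        # Each invalid byte becomes U+FFFD ...
        replaced = name.decode(encoding, errors="replace")
        # ... which is then written out as the ASCII text \ufffd ...
        escaped = replaced.encode("ascii", errors="backslashreplace")
        # ... and decoded again, yielding a str with a literal backslash.
        return escaped.decode(encoding, errors="replace")


names = [b"b\x81", b"b\x82", b"b\x83"]
mangled = [mangle(n) for n in names]

# All three distinct byte names map to the same new name.
print(mangled)
print(len(set(mangled)))
```

Since the original byte is discarded, nothing in the new name allows the old name to be recovered, which is also why a symbolic link target that is left untouched no longer matches any file.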
Comment 1 Arfrever Frehtes Taifersar Arahesis 2019-07-23 03:05:57 UTC
Created attachment 584138 [details]
non-utf-8_files_test-1.ebuild

Output:

===   src_install: ls ${ED}/usr/share/non-utf-8   ===
total 4.0K
lrwxrwxrwx 1 root root 2 Jul 23 04:05  aa -> 'a'$'\200'
-rw-r--r-- 1 root root 2 Jul 23 04:05 'a'$'\200'
===   src_install: cat ${ED}/usr/share/non-utf-8/aa   ===
1


 * QA Notice: This package installs one or more file names containing
 * characters that are not encoded with the UTF-8 encoding.
 * 
 *      usr/share/non-utf-8/a\ufffd
 *


===   pkg_preinst: ls ${ED}/usr/share/non-utf-8   ===
total 4.0K
-rw-r--r-- 1 root root 2 Jul 23 04:05 a\ufffd
lrwxrwxrwx 1 root root 2 Jul 23 04:05 aa -> a�
===   pkg_preinst: cat ${ED}/usr/share/non-utf-8/aa   ===
cat: /var/tmp/portage/app-misc/non-utf-8_files_test-1/image/usr/share/non-utf-8/aa: No such file or directory



Although the actual target of the symbolic link is not changed, the value stored in the VDB is mangled:

# cat /var/db/pkg/app-misc/non-utf-8_files_test-1/CONTENTS
dir /usr
dir /usr/share
dir /usr/share/non-utf-8
obj /usr/share/non-utf-8/a\ufffd b026324c6904b2a9cb4b88d6d61c81d1 1563847540
sym /usr/share/non-utf-8/aa -> a\ufffd 1563847541
Comment 2 Arfrever Frehtes Taifersar Arahesis 2019-07-23 03:09:30 UTC
Created attachment 584140 [details]
non-utf-8_files_test-2.ebuild

Output:

===   src_install: ls ${ED}/usr/share/non-utf-8   ===
total 24K
-rw-r--r-- 1 root root    2 Jul 23 04:49 'b'$'\201'
-rw-r--r-- 1 root root    2 Jul 23 04:49 'b'$'\202'
-rw-r--r-- 1 root root    2 Jul 23 04:49 'b'$'\203'
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 'c'$'\204'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 'c'$'\205'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 'c'$'\206'/
===   src_install: cat ${ED}/usr/share/non-utf-8/b*   ===
1
2
3


 * QA Notice: This package installs one or more file names containing
 * characters that are not encoded with the UTF-8 encoding.
 * 
 *      usr/share/non-utf-8/b\ufffd
 *      usr/share/non-utf-8/b\ufffd
 *      usr/share/non-utf-8/b\ufffd
 *      usr/share/non-utf-8/c\ufffd
 *      usr/share/non-utf-8/c\ufffd
 *      usr/share/non-utf-8/c\ufffd
 * 


===   pkg_preinst: ls ${ED}/usr/share/non-utf-8   ===
total 8.0K
-rw-r--r-- 1 root root    2 Jul 23 04:49 b\ufffd
drwxr-xr-x 2 root root 4.0K Jul 23 04:49 c\ufffd/
===   pkg_preinst: cat ${ED}/usr/share/non-utf-8/b*   ===
3



# cat /var/db/pkg/app-misc/non-utf-8_files_test-2/CONTENTS
dir /usr
dir /usr/share
dir /usr/share/non-utf-8
dir /usr/share/non-utf-8/c\ufffd
obj /usr/share/non-utf-8/b\ufffd 6d7fce9fee471194aa8b5b6e47267f03 1563850172
Comment 3 Arfrever Frehtes Taifersar Arahesis 2019-07-23 03:12:35 UTC
Created attachment 584142 [details]
non-utf-8_files_test-3.ebuild

Output:

===   src_install: ls ${ED}/usr/share/non-utf-8   ===
total 12K
drwxr-xr-x 2 root root 4.0K Jul 23 04:51 'd'$'\207'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:51 'd'$'\210'/
drwxr-xr-x 2 root root 4.0K Jul 23 04:51 'd'$'\211'/
===   src_install: cat ${ED}/usr/share/non-utf-8/d*/x   ===
1
2
3


Exception in callback AsynchronousTask.wait()
handle: <Handle AsynchronousTask.wait()>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2113, in _post_src_install_uid_fix
    encoding=_encodings['merge'], errors='strict')
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 189, in _unicode_decode
    s = str(s, encoding=encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/events.py", line 127, in _run
    self._callback(*self._args)
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 84, in wait
    self._wait_hook()
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 195, in _wait_hook
    self._exit_listener_stack.pop()(self)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 205, in _ebuild_exit
    self._ebuild_exit_unlocked(ebuild_process)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 265, in _ebuild_exit_unlocked
    _post_src_install_uid_fix(settings, out)
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2124, in _post_src_install_uid_fix
    os.rename(fpath, new_fpath)
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 246, in __call__
    rval = self._func(*wrapped_args, **wrapped_kwargs)
OSError: [Errno 39] Directory not empty: b'/var/tmp/portage/app-misc/non-utf-8_files_test-3/image/usr/share/non-utf-8/d\x88' -> b'/var/tmp/portage/app-misc/non-utf-8_files_test-3/image/usr/share/non-utf-8/d\\ufffd'
--Return--
> /usr/lib64/python3.6/site-packages/portage/util/_eventloop/asyncio_event_loop.py(81)_internal_caller_exception_handler()->None
-> pdb.set_trace()
(Pdb)
Comment 4 Arfrever Frehtes Taifersar Arahesis 2019-07-23 03:15:36 UTC
Created attachment 584144 [details]
non-utf-8_files_test-4.ebuild

Output:

===   src_install: ls ${ED}/usr/share/non-utf-8   ===
total 8.0K
-rw-r--r-- 1 root root    2 Jul 23 04:54 'e'$'\220'
drwxr-xr-x 2 root root 4.0K Jul 23 04:54 'e'$'\221'/
===   src_install: cat ${ED}/usr/share/non-utf-8/e*   ===
1
cat: '/var/tmp/portage/app-misc/non-utf-8_files_test-4/image/usr/share/non-utf-8/e'$'\221': Is a directory


Exception in callback AsynchronousTask.wait()
handle: <Handle AsynchronousTask.wait()>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2113, in _post_src_install_uid_fix
    encoding=_encodings['merge'], errors='strict')
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 189, in _unicode_decode
    s = str(s, encoding=encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/events.py", line 127, in _run
    self._callback(*self._args)
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 84, in wait
    self._wait_hook()
  File "/usr/lib64/python3.6/site-packages/_emerge/AsynchronousTask.py", line 195, in _wait_hook
    self._exit_listener_stack.pop()(self)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 205, in _ebuild_exit
    self._ebuild_exit_unlocked(ebuild_process)
  File "/usr/lib64/python3.6/site-packages/_emerge/EbuildPhase.py", line 265, in _ebuild_exit_unlocked
    _post_src_install_uid_fix(settings, out)
  File "/usr/lib64/python3.6/site-packages/portage/package/ebuild/doebuild.py", line 2124, in _post_src_install_uid_fix
    os.rename(fpath, new_fpath)
  File "/usr/lib64/python3.6/site-packages/portage/__init__.py", line 246, in __call__
    rval = self._func(*wrapped_args, **wrapped_kwargs)
IsADirectoryError: [Errno 21] Is a directory: b'/var/tmp/portage/app-misc/non-utf-8_files_test-4/image/usr/share/non-utf-8/e\x90' -> b'/var/tmp/portage/app-misc/non-utf-8_files_test-4/image/usr/share/non-utf-8/e\\ufffd'
--Return--
> /usr/lib64/python3.6/site-packages/portage/util/_eventloop/asyncio_event_loop.py(81)_internal_caller_exception_handler()->None
-> pdb.set_trace()
(Pdb)
Comment 5 Arfrever Frehtes Taifersar Arahesis 2019-07-23 03:25:44 UTC
Python 2 does not provide a built-in solution for lossless decoding of invalid bytes.
Python 2 will be discontinued by upstream soon, on 2020-01-01.
Python 2 will probably remain in Gentoo in some form for several years to support some Python-2-only packages, but Portage itself does not need to keep supporting Python 2 for much longer.
Therefore a minimal solution for the problematic code in Portage under Python 2 might be sufficient
(at least detecting collisions (Problem #2) and aborting).
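
A collision check of this kind could look roughly like the sketch below. It is not Portage code: mangle() and find_collisions() are hypothetical names, UTF-8 is assumed as the merge encoding, and the example is written for Python 3, although the decode/encode calls it relies on also exist on Python 2 byte strings.

```python
from collections import defaultdict


def mangle(name, encoding="utf-8"):
    """Hypothetical stand-in for the current rename logic."""
    try:
        return name.decode(encoding, "strict")
    except UnicodeDecodeError:
        replaced = name.decode(encoding, "replace")
        escaped = replaced.encode("ascii", "backslashreplace")
        return escaped.decode(encoding, "replace")


def find_collisions(names):
    """Map each mangled name to the raw names that would collide on it."""
    groups = defaultdict(list)
    for raw in names:
        groups[mangle(raw)].append(raw)
    # Only names claimed by more than one directory entry are a problem.
    return {new: raws for new, raws in groups.items() if len(raws) > 1}


entries = [b"b\x81", b"b\x82", b"ok"]
collisions = find_collisions(entries)
if collisions:
    # A caller could abort here instead of renaming anything.
    print("would collide:", collisions)
```

Grouping all entries of a directory before any rename also catches the case where a mangled name coincides with a name that was already valid.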


Python 3 provides the surrogateescape error handler for lossless decoding of invalid bytes.
Therefore I suggest that the problematic code in Portage under Python 3 be fixed in the following way:
- Use surrogateescape in the appropriate places
- Continue printing the QA warning about invalid UTF-8 bytes
- Stop renaming files
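
As an illustration of the suggestion above, here is a minimal Python 3 sketch. The helper names decode_lossless(), encode_lossless() and has_invalid_bytes() are hypothetical and not part of Portage; UTF-8 is assumed as the merge encoding.

```python
def decode_lossless(raw, encoding="utf-8"):
    # Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    return raw.decode(encoding, errors="surrogateescape")


def encode_lossless(text, encoding="utf-8"):
    # The lone surrogates are turned back into the original bytes.
    return text.encode(encoding, errors="surrogateescape")


def has_invalid_bytes(text):
    # True if the name carries escaped bytes, so a QA warning can still be printed.
    return any("\udc80" <= ch <= "\udcff" for ch in text)


for raw in (b"a\x80", b"b\x81", b"b\x82", b"plain"):
    text = decode_lossless(raw)
    # ascii() is used because a str with lone surrogates cannot be printed
    # directly on a strict UTF-8 stdout.
    print(ascii(text), has_invalid_bytes(text), encode_lossless(text) == raw)
```

Because distinct byte names stay distinct after decoding and encode back to the same bytes, no rename is needed and the collisions described in the report cannot arise from the decoding step itself.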
Comment 6 Arfrever Frehtes Taifersar Arahesis 2019-07-23 05:44:14 UTC
Regarding decoding and encoding: a good generic solution might be to use _unicode_decode(..., errors="surrogateescape") / _unicode_encode(..., errors="surrogateescape") in the dozens or hundreds of relevant places, and to make the Python-2-specific implementations [1] of these 2 functions do something special when errors="surrogateescape" is received.
[1] https://gitweb.gentoo.org/proj/portage.git/tree/lib/portage/__init__.py?id=3ebe48e61a02cb00c3bb2366e50b4c83ef390ecb#n193
Comment 7 Zac Medico gentoo-dev 2019-07-24 17:30:22 UTC
For CONTENTS, we should probably use the GLEP 74 path encoding method so that package managers can use the same implementation for Manifest files:

https://www.gentoo.org/glep/glep-0074.html#path-and-filename-encoding
Comment 8 Zac Medico gentoo-dev 2019-07-24 21:53:56 UTC
(In reply to Zac Medico from comment #7)
> For CONTENTS, we should probably use the GLEP 74 path encoding method so
> that package managers can use the same implementation for Manifest files:
> 
> https://www.gentoo.org/glep/glep-0074.html#path-and-filename-encoding

Actually, the GLEP 74 path encoding method is not intended to encode/decode arbitrary byte strings; it only aims to encode/decode special Unicode characters. However, it is still similar to what we would want for encoding arbitrary byte strings in CONTENTS, since we would want it to encode at least ASCII space and backslash characters.
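
To make the idea of escaping arbitrary byte strings concrete, here is a rough sketch. The \xHH notation, the choice of which bytes to escape, and the names escape_path() and unescape_path() are assumptions made for illustration only; this is neither the GLEP 74 method nor an existing CONTENTS format.

```python
def escape_path(raw):
    """Encode arbitrary bytes as ASCII text (hypothetical scheme)."""
    out = []
    for byte in raw:
        # Keep printable ASCII except backslash; 0x20 (space) is escaped too.
        if 0x21 <= byte <= 0x7E and byte != 0x5C:
            out.append(chr(byte))
        else:
            out.append("\\x%02x" % byte)
    return "".join(out)


def unescape_path(text):
    """Reverse escape_path(); every backslash must start a \\xHH escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            hex_digits = text[i + 2:i + 4]
            if text[i + 1:i + 2] != "x" or len(hex_digits) != 2:
                raise ValueError("malformed escape at offset %d" % i)
            out.append(int(hex_digits, 16))
            i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


path = b"/usr/share/non-utf-8/a b\\c\x80"
escaped = escape_path(path)
print(escaped)
print(unescape_path(escaped) == path)
```

With space and backslash always escaped, a whitespace-separated line format stays unambiguous, and names that differ only in invalid bytes keep distinct entries.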