382199 – Handling of non-ASCII file names by emerge (continuation of 382021)

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Portage team

Keywords:	InVCS

Depends on:
Blocks:	172874 381649 406749

Reported:	2011-09-07 20:09 UTC by Klaus Kusche
Modified:	2012-04-29 19:44 UTC
CC List:	0 users


Description Klaus Kusche 2011-09-07 20:09:50 UTC

See 382021 for the whole story:
For emerged file names containing non-ASCII chars, the CONTENTS file
does not contain the filename as it is stored on disk,
but the filename translated to unicode according to the locale setting
in effect during emerge.


Problem 1:
This makes all tools and scripts fail which process CONTENTS and do
not translate filenames back from unicode to the original name.

(In reply to comment #9)
> (In reply to comment #8)
> > However, /var/db/pkg/app-misc/ca-certificates-20110502-r4/CONTENTS
> > contains the translated filenames which do not match the actual filenames 
> > on disk.
> 
> They match, but they are encoded differently. CONTENTS is encoded in UTF-8, and
> the file names on disk are encoded with your ISO8859-15 locale setting. I've
> tested this case with ca-certificates-20110502-r4, and emerge is able to
> correctly merge and unmerge these files.
> 
> > This will certainly confuse third-party tools 
> > (like a security shell script of mine which compares all files on disk
> > with all files listed in all CONTENTS files, finding orphaned and lost files).
> 
> They won't get confused if they use the same codec translations as portage
> does. 

How would you do that in shell scripts, awk, sed or other plain Unix text tools?
The world is not only python...
My system has USE="-nls -unicode", so there is absolutely zero translation or unicode ability in all "plain" tools. 
And when displaying CONTENTS with an 8 bit locale,
it is also not visually readable - my eyes don't perform unicode translation.



Problem 2:
I believe this approach is fundamentally wrong and will fail in many contexts.

> My goal is to keep CONTENTS encoded with UTF-8 encoding, and also to have
> portage to respect your ISO8859-15 locale setting. This is what it does, so I'd
> rather not change it.

I think CONTENTS should exactly mirror what's on disk,
not messing with file names is the only safe approach.
What do you win by going to unicode?

For example, with your approach, things will get messed up if emerge and unmerge
is done with different locale settings (the back-translated file names will 
not match those on disk), which will for example happen if there are 
several administrators using different locales in their login,
or the system's locale has been changed in the meantime.

Your approach will also fail if the original file names contain chars
which are not valid in the current locale. There is no guarantee that the names 
of files installed during emerge are valid strings in the locale set for emerge.
Translation to unicode is usually quite tolerant, 
but translation back to 8 bits is definitely not.
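
Both failure modes can be illustrated with a small Python sketch. The file name
and the helpers `record` and `lookup` are made up for illustration; they are
not portage functions, only a model of "translate on merge, translate back on
unmerge".

```python
# Hypothetical on-disk name: raw bytes as written by the build (here UTF-8).
on_disk = "Főtanúsítvány.crt".encode("utf-8")

def record(name_bytes, locale_encoding):
    """Translate an on-disk name to the UTF-8 form stored in CONTENTS."""
    return name_bytes.decode(locale_encoding).encode("utf-8")

def lookup(contents_bytes, locale_encoding):
    """Translate a CONTENTS entry back to the bytes expected on disk."""
    return contents_bytes.decode("utf-8").encode(locale_encoding)

# Merge under ISO-8859-15, unmerge under UTF-8: the names no longer match.
entry = record(on_disk, "iso8859-15")
mismatch = lookup(entry, "utf-8") != on_disk

# Going to unicode is tolerant, going back to 8 bits is not:
# "ő" has no representation in ISO-8859-15.
try:
    "Főtanúsítvány.crt".encode("iso8859-15")
    back_ok = True
except UnicodeEncodeError:
    back_ok = False

print(mismatch, back_ok)
```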

I'm also not sure if your approach works with binpackages if the generating
and the installing computer have different locales, or if you are 
cross-installing and host and target differ w.r.t. their locale.
Should be tested before making a design decision...

> A possible alternative solution to your perceived problem would be to have the
> ca-certificates ebuild detect the ISO8859-15 locale setting and translate the
> encoding of the file names accordingly. If you want to take that approach, then
> you should file a new bug to have the ca-certificates ebuild translate the file
> names.

Adapting file names to some locale during install is a very bad idea I think:
The application might depend on the original names, even if they are invalid
w.r.t. the currently set locale. The application might fail to find its files
if the file names were translated.

Comment 1 Zac Medico gentoo-dev

2011-09-07 20:48:52 UTC

(In reply to comment #0)
> How would you do that in shell scripts, awk, sed or other plain Unix text
> tools?
> The world is not only python...

The standard utility for codeset conversion is iconv:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
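
For comparison, the conversion that `iconv -f UTF-8 -t ISO-8859-15` performs on
a single name can be sketched in Python as follows. The function name and the
sample file name are hypothetical, and the target encoding is only an
assumption about the reporter's setup.

```python
def contents_name_to_disk(entry: bytes, fs_encoding: str = "iso8859-15") -> bytes:
    """Convert a UTF-8 name as found in CONTENTS to the locale's byte form."""
    return entry.decode("utf-8").encode(fs_encoding)

# Hypothetical entry: "ü" takes two bytes in UTF-8 but one in ISO-8859-15.
utf8_name = "Müller.crt".encode("utf-8")
disk_name = contents_name_to_disk(utf8_name)
print(utf8_name, disk_name)
```

The conversion raises UnicodeEncodeError for any character that the target
encoding cannot represent, so a script using it has to decide how to treat
such names.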

> My system has USE="-nls -unicode", so there is absolutely zero translation or
> unicode ability in all "plain" tools. 
> And when displaying CONTENTS with an 8 bit locale,
> it is also not visually readable - my eyes don't perform unicode translation.

You might consider using a plain ascii locale. Since bug 381509, portage uses utf-8 for file name translation with ascii locales. This is relatively painless since ascii is a subset of utf-8.

> I think CONTENTS should exactly mirror what's on disk,
> not messing with file names is the only safe approach.
> What do you win by going to unicode?

If we allow mixed unknown encodings inside CONTENTS, then it's not possible to know exactly which characters/glyphs are intended. By using a constant UTF-8 encoding, the problem is solved (at least for portage, but ebuilds like ca-certificates also need to respect the locale's file system encoding in order for things to work perfectly).

> Adapting file names to some locale during install is a very bad idea I think:
> The application might depend on the original names, even if they are invalid
> w.r.t. the currently set locale. The application might fail to find its files
> if the file names were translated.

This is why I advocate that ebuilds like ca-certificates be fixed so that they respect the locale's file system encoding for the files that they install.

Comment 2 Zac Medico gentoo-dev

2011-09-07 21:53:46 UTC

Another approach we might consider is to simply hardcode utf-8 encoding for all file names. It would be similar to what we did in bug 381509 for ascii locales, but we would extend it to all locales.

Comment 3 Klaus Kusche 2011-09-08 08:42:27 UTC

First of all, the current approach is definitely broken:

1.) I just unmerged ca-certificates (which was merged with an ISO locale)
using the C locale. There was no error message, but exactly those files
having non-ASCII chars were *not* unmerged (are still on disk).

2.) I emerged ca-certificates with C locale and immediately afterwards
re-emerged it with my default ISO locale. Re-emerge failed due to detected
file collisions for all files with non-ASCII names.

3.) I compared the actual file names on disk and the CONTENTS file 
after an emerge with C locale and an emerge with ISO locale.
The file names on disk were identical, but those in CONTENTS were not
(i.e. for the identical files on disk, different representations were stored
in CONTENTS depending on the locale).

The names in CONTENTS (when displayed as unicode) looked ok for ASCII,
but were garbage for ISO (because the actual filenames on disk are *not* ISO,
but unicode, so translating them from ISO to unicode misinterprets them).

> > I think CONTENTS should exactly mirror what's on disk,
> > not messing with file names is the only safe approach.
> > What do you win by going to unicode?
> 
> If we allow mixed unknown encodings inside CONTENTS, then it's not possible to
> know exactly which characters/glyphs are intended. By using a constant UTF-8
> encoding, the problem is solved (at least for portage, but ebuilds like
> ca-certificates also need to respect the locale's file system encoding in order
> for things to work perfectly).

We don't know which characters/glyphs are intended, anyway, because we don't
know what characters/glyphs the original file names on disk represent:
Their locale is unknown to both emerge and the administrator,
and is not necessarily the locale set during emerge.

So this argument is void, see above: If I emerge ca-certificates with an ISO
locale, currently emerge interprets the file names on disk as ISO 
(which is wrong, because they are already unicode, in spite of the locale),
converts them to unicode (which messes them up, making two glyphs out of each single non-ascii glyph!), and stores them in CONTENTS (where they definitely 
have wrong glyphs now).
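
The doubling can be reproduced in a few lines of Python. This is only a model
of the effect described above, not portage's code, and the single character
used is an arbitrary example.

```python
original = "ü".encode("utf-8")            # 2 bytes on disk, 1 glyph
misread = original.decode("iso8859-15")   # interpreted as ISO: 2 glyphs
stored = misread.encode("utf-8")          # what would then land in CONTENTS

print(len(original.decode("utf-8")), len(misread), stored != original)
```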

So I try to sum up the facts and arguments:

1.) We should not modify the file names on disk:
They must be preserved in the representation generated by the build
to make sure that the application finds its files.

2.) The locale used for file name representation by the build is unknown to us.
It is not necessarily the current filesystem locale, nor the locale set during
emerge. Hence, the chars / glyphs the file names represent are unknown.
It is not even known if they are pure 8 bit encoded or var-length unicode.
Consequently, we cannot translate them reasonably, because we don't know
the source charset to use for the translation.
The only thing known is that they are zero-terminated byte arrays
(at least I hope so).

3. The locale set during emerge is not necessarily the file system locale,
and it is not necessarily the same for all emerge/unmerge operations:
Over time and for different admins, the locale used for emerge might change.
Hence, the operation and permanent data of emerge must not be influenced
by locale settings. 

Alternatively, if emerge does any charset translations, it must store 
the charsets used together with the data to be able to apply
the correct inverse transformation later even when a different locale is set.

4. The main purpose of CONTENTS is *not* to show file names in human-readable 
form (which is impossible anyway, due to 2.), but to allow looking up the files
on disk correctly and easily, *independent* of the locale setting (see 3.).

For me, it follows that there can't be any translation
and the data in CONTENTS must be the same raw byte array used on disk.
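
As a rough sketch of what "raw byte array" handling could look like in Python:
passing a bytes path to `os.listdir` makes it return bytes names without any
decoding, so the result does not depend on the locale. The helper name, the
scratch directory and the sample name are assumptions for the demo; nothing
here is taken from portage.

```python
import os
import tempfile

def names_on_disk(directory: bytes) -> set:
    """Return the raw byte names in a directory; no decoding takes place."""
    return set(os.listdir(directory))  # bytes in, bytes out

# Demo in a scratch directory with a hypothetical non-ASCII name.
with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)
    raw_name = "Müller.crt".encode("utf-8")
    open(os.path.join(tmp_b, raw_name), "wb").close()
    recorded = {raw_name}   # what a byte-exact CONTENTS would hold
    found = names_on_disk(tmp_b)

print(found == recorded)
```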

Comment 4 Zac Medico gentoo-dev

2011-09-08 17:20:37 UTC

(In reply to comment #3)
> First of all, the current approach is definitely broken:
> 
> 1.) I just unmerged ca-certificates (which was merged with an ISO locale)
> using the C locale. There was no error message, but exactly those files
> having non-ASCII chars were *not* unmerged (are still on disk).
> 
> 2.) I emerged ca-certificates with C locale and immediately afterwards
> re-emerged it with my default ISO locale. Re-emerge failed due to detected
> file collisions for all files with non-ASCII names.

This can be solved if we use a constant utf-8 encoding for all locales, as suggested in comment #2.

> We don't know which characters/glyphs are intended, anyway, because we don't
> know what characters/glyphs the original file names on disk represent:
> Their locale is unknown to both emerge and the administrator,
> and is not necessarily the locale set during emerge.

However, the ebuild maintainer can find out the correct encoding from upstream, and can use iconv to convert encodings if necessary.

> 3. The locale set during emerge is not necessarily the file system locale,
> and it is not necessarily the same for all emerge/unmerge operations:
> Over time and for different admins, the locale used for emerge might change.
> Hence, the operation and permanent data of emerge must not be influenced
> by locale settings. 

As said, we can handle this by using a constant utf-8 encoding for all locales.

> For me, it follows that there can't be any translation
> and the data in CONTENTS must be the same raw byte array used on disk.

Alternatively, we could go with constant utf-8 encoding for all locales, and require ebuild maintainers to convert encodings with iconv when necessary.

Comment 5 Klaus Kusche 2011-09-08 19:26:24 UTC

(In reply to comment #4)
> (In reply to comment #3)
> > First of all, the current approach is definitely broken:
> > 
> > 1.) I just unmerged ca-certificates (which was merged with an ISO locale)
> > using the C locale. There was no error message, but exactly those files
> > having non-ASCII chars were *not* unmerged (are still on disk).
> > 
> > 2.) I emerged ca-certificates with C locale and immediately afterwards
> > re-emerged it with my default ISO locale. Re-emerge failed due to detected
> > file collisions for all files with non-ASCII names.
> 
> This can be solved if we use a constant utf-8 encoding for all locales, as
> suggested in comment #2.

If you mean filenames on disk are always assumed to be utf-8 (independent of
the locale currently set) and hence there will be no actual translation between
the name on disk and CONTENTS, we agree.

> > We don't know which characters/glyphs are intended, anyway, because we don't
> > know what characters/glyphs the original file names on disk represent:
> > Their locale is unknown to both emerge and the administrator,
> > and is not necessarily the locale set during emerge.
> 
> However, the ebuild maintainer can find out the correct encoding from upstream,
> and can use iconv to convert encodings if necessary.

Now here we have 3 potential problems which need consideration:

1.) If the ebuild changes the filenames after the compile step has completed,
the application might no longer find its files (because their name has
changed from what the build has built and from what the application expects).

If the build should generate correct filenames from the beginning,
this will most likely require upstream changes to the compile step (difficult).

2.) You might force all global ebuilds to unicode filenames, 
because they should not be locale or region specific anyway. 

However, this might not hold for local ebuilds. 
For example, if I were to construct a local ebuild containing our
forms, templates and examples, this ebuild would definitely install files with
umlauts in their names, and it would definitely use ISO and not unicode
for that, because that's what our users expect and can read in their locale.
They would see only garbage for unicode filenames.

3.) Are you sure that all file systems where portage could potentially install
files on do support unicode as their filesystem locale? 
What's about e.g. ISO images? Are you sure all backup formats do?

Comment 6 Zac Medico gentoo-dev

2011-09-08 20:04:53 UTC

(In reply to comment #5)
> Now here we have 3 potential problems which need consideration:
> 
> 1.) If the ebuild changes the filenames after the compile step has completed,
> the application might no longer find its files (because their name has
> changed from what the build has built and from what the application expects).
> 
> If the build should generate correct filenames from the beginning,
> this will most likely require upstream changes to the compile step (difficult).

In practice, I think it's unlikely to find an upstream distributing file names encoded with anything other than utf-8. If they choose to use some other random encoding, then the ebuild should do conversions inside src_prepare, which is where all patches are supposed to be applied.
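
A sketch of such a conversion step, written in Python only to show the logic
(an ebuild would do this in bash, e.g. with iconv). The helper `plan_renames`
and the sample names are hypothetical, and the source encoding is something the
maintainer would have to learn from upstream.

```python
def plan_renames(names, source_encoding):
    """Map each byte name that is not valid UTF-8 to its UTF-8 equivalent."""
    plan = {}
    for name in names:
        try:
            name.decode("utf-8")      # already valid UTF-8: leave it alone
        except UnicodeDecodeError:
            plan[name] = name.decode(source_encoding).encode("utf-8")
    return plan

names = [b"README", b"M\xfcller.crt", "Müller.pem".encode("utf-8")]
plan = plan_renames(names, "iso8859-15")
print(plan)
```

Each pair in the plan could then be applied with `os.rename`. Note that the
validity check is only a heuristic: some 8-bit names happen to be valid UTF-8
byte sequences and would be skipped.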

> 2.) You might force all global ebuilds to unicode filenames, 
> because they should not be locale or region specific anyway. 
> 
> However, this might not hold for local ebuilds. 
> For example, if I were to construct a local ebuild containing our
> forms, templates and examples, this ebuild would definitely install files with
> umlauts in their names, and it would definitely use ISO and not unicode
> for that, because that's what our users expect and can read in their locale.
> They would see only garbage for unicode filenames.

This would work fine with the existing portage behavior. If we instead hardcode a constant utf-8 encoding, then the ebuild would have to translate the file names to utf-8 in order to get correct results (and the users may need unicode support in programs that access those files).

> 3.) Are you sure that all file systems where portage could potentially install
> files on do support unicode as their filesystem locale? 
> What's about e.g. ISO images? Are you sure all backup formats do?

See the "Character Sets" section of the mkisofs man page for unicode usage details. These days, any format that doesn't support unicode is probably unmaintained and not worth using.

Comment 7 Klaus Kusche 2011-09-08 20:28:53 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > Now here we have 3 potential problems which need consideration:
> > 
> > 1.) If the ebuild changes the filenames after the compile step has completed,
> > the application might no longer find its files (because their name has
> > changed from what the build has built and from what the application expects).
> > 
> > If the build should generate correct filenames from the beginning,
> > this will most likely require upstream changes to the compile step (difficult).
> 
> In practice, I think it's unlikely to find an upstream distributing file names
> encoded with anything other than utf-8. If they choose to use some other random
> encoding, then the ebuild should do conversions inside src_prepare, which is
> where all patches are supposed to be applied.

You're right. Static files should be unicode nowadays. I was thinking about
compile steps which generate files dynamically, perhaps with names depending on 
the current locale or chosen language packs. That would be worst case.

But even for static files renamed during prepare, the application might
need a patch to find and access the renamed file correctly.

> > 2.) You might force all global ebuilds to unicode filenames, 
> > because they should not be locale or region specific anyway. 
> > 
> > However, this might not hold for local ebuilds. 
> > For example, if I were to construct a local ebuild containing our
> > forms, templates and examples, this ebuild would definitely install files with
> > umlauts in their names, and it would definitely use ISO and not unicode
> > for that, because that's what our users expect and can read in their locale.
> > They would see only garbage for unicode filenames.
> 
> This would work fine with the existing portage behavior. If we instead hardcode
> a constant utf-8 encoding, then the ebuild would have to translate the file
> names to utf-8 in order to get correct results (and the users may need unicode
> support in programs that access those files).

Actually, working with files named in an unreadable locale is relatively easy: 
In a GUI, you just click them, on the command line, bash completion or 
wildcards are your friends. Works 95 % even without unicode support.

The problem is just that filenames are misdisplayed and users have to guess 
the meaning of the filename.

> > 3.) Are you sure that all file systems where portage could potentially install
> > files on do support unicode as their filesystem locale? 
> > What's about e.g. ISO images? Are you sure all backup formats do?
> 
> See the "Character Sets" section of the mkisofs man page for unicode usage
> details. These days, any format that doesn't support unicode is probably
> unmaintained and not worth using.

I accept that. I do have such old media still in use, but I'd never put 
an installation or backup onto them.

So:
We request that all installed filenames are utf8, independent of the locale,
and store them unchanged (utf8) in CONTENTS, also independent of the locale.

Then the only thing I'd request is that things don't break for broken ebuilds:
If a filename is invalid utf8, that's worth a warning, but the ebuild
should nevertheless emerge (and unmerge) correctly.
Can that be done?
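
What I have in mind is roughly the following (a Python sketch; `check_utf8` is
a made-up name and the sample names are arbitrary): complain about the name,
but carry on.

```python
import warnings

def check_utf8(name: bytes) -> bool:
    """Return True if name is valid UTF-8; warn (but do not fail) otherwise."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError as exc:
        warnings.warn(f"file name {name!r} is not valid UTF-8: {exc.reason}")
        return False

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = [check_utf8(b"plain.crt"), check_utf8(b"M\xfcller.crt")]

print(results, len(caught))
```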

Comment 8 Zac Medico gentoo-dev

2011-09-09 05:18:16 UTC

(In reply to comment #7)
> You're right. Static files should be unicode nowadays. I was thinking about
> compile steps which generate files dynamically, perhaps with names depending on 
> the current locale or chosen language packs. That would be worst case.
> 
> But even for static files renamed during prepare, the application might
> need a patch to find and access the renamed file correctly.

At least now, this case seems to apply to a practically negligible number of packages, because I haven't heard of any issues like this in the months since portage-2.1.9.x has been stabilized (bug #346819).

> So:
> We request that all installed filenames are utf8, independent of the locale,
> and store them unchanged (utf8) in CONTENTS, also independent of the locale.

When we migrate to this approach, people with non-ascii/utf8 locales may end up with some orphan files, due to the change in translation. Though it's a little messy, I don't expect it to be much of a problem.

> Then the only thing I'd request is that things don't break for broken ebuilds:
> If a filename is invalid utf8, that's worth a warning, but the ebuild
> should nevertheless emerge (and unmerge) correctly.
> Can that be done?

It should work fine. Some files may have to be automatically renamed in order to force them to be valid utf8, so that they will merge and unmerge correctly. This case already triggers an error message via elog, so it's easy to detect.
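
One way such a forced rename could be computed is sketched below in Python.
This is an illustration under the assumption that undecodable bytes are simply
replaced; it is not necessarily the exact scheme portage uses, and `force_utf8`
is a hypothetical name.

```python
def force_utf8(name: bytes) -> bytes:
    """Return a valid UTF-8 name, replacing undecodable bytes with U+FFFD."""
    return name.decode("utf-8", errors="replace").encode("utf-8")

fixed = force_utf8(b"M\xfcller.crt")   # 0xFC is not valid UTF-8
unchanged = force_utf8("Müller.crt".encode("utf-8"))
print(fixed, unchanged)
```

A replacement like this is lossy: two different invalid names can map to the
same result, so a real implementation would also have to deal with collisions.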

Comment 9 Zac Medico gentoo-dev

2011-09-09 20:51:01 UTC

This is fixed in git:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=db32c3e3ca1e3cc724acacc79a5be2343efc13d1

Comment 10 Zac Medico gentoo-dev

2011-09-09 22:37:47 UTC

This is fixed in 2.1.10.15 and 2.2.0_alpha55.