Bug 563094 - repoman should reject ebuilds that include unicode whitespace
Summary: repoman should reject ebuilds that include unicode whitespace
Status: RESOLVED WONTFIX
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Repoman
Hardware: All All
Importance: Normal enhancement
Assignee: Portage team
Reported: 2015-10-15 03:29 UTC by SpanKY
Modified: 2022-07-12 03:32 UTC (History)
Description SpanKY gentoo-dev 2015-10-15 03:29:36 UTC
i had to fix a bunch of ebuilds that used whitespace other than \t, \n, or a plain space here:
http://gitweb.gentoo.org/repo/gentoo.git/commit/?id=c1511618853db61acd458f9f2a9cda0f08fe7cfd

the issue is that it's not easy to even detect this (certainly not by visual inspection), and the characters were used in places where they caused errors.  if you look at the commit above, the first ebuild (bashburn) used \xc2\xa0 (the UTF-8 encoding of a no-break space) between || and die.  since bash does not treat it as whitespace, you'd end up with:
  ...: line xxx:  die: command not found
that means if the sed failed, the die wouldn't actually run, because bash wouldn't run `die`, it'd run `\xc2\xa0die`.
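
As a rough illustration (not a model of bash's parser), the Python sketch below strips only the ASCII blanks that separate shell words and shows what is left as the command name. The sample line is made up for the example; it is not taken from the bashburn ebuild.

```python
# Hypothetical ebuild line with U+00A0 (no-break space) between || and die.
line = 'sed -i s/a/b/ file ||\u00a0die'

# Text to the right of the || operator.
after_operator = line.split('||', 1)[1]

# Stripping only ASCII space, tab, and newline leaves the no-break space
# attached to the command name.
command_word = after_operator.strip(' \t\n')
print(repr(command_word))
print(command_word.encode('utf-8'))

# Python's default strip() treats U+00A0 as whitespace, so tools that use
# Unicode-aware whitespace handling can hide the problem.
print(repr(after_operator.strip()))
```

The second print shows the raw bytes, which begin with \xc2\xa0, matching the name bash fails to find.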

if you look at that commit, there were actually a large number of such bad whitespace uses in ebuilds, most likely from copy & paste.

repoman should detect & reject all unicode whitespace that is not ascii.  the list of whitespace could be built at runtime:
  # Python 2 (uses unichr/xrange)
  import re
  import sys
  whitespace = set(re.findall(
    r'\s', u''.join(unichr(c) for c in xrange(sys.maxunicode + 1)), re.UNICODE))
  whitespace -= {'\t', '\n', '\r', ' '}
  whitespace_re = re.compile(r'[' + u''.join(whitespace) + r']')

but it might be better to just precompute the list:
  # Created by ...
  whitespace_re = re.compile(r'[' + u'\u2001\u2000\u2003\u2002\x85\u2004\u2007\u1680\u2009\u2008\x0b\u200a\x0c\u180e\u2005\x1d\x1c\x1f\x1e\xa0\u3000\u2029\u2028\u2006\u202f\u205f' + r']')
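
A minimal Python 3 sketch of how such a pattern could be used to report offending characters. The character list is copied from the precomputed pattern above; the function name `find_bad_whitespace` and the report format are assumptions made for illustration, not repoman or pkgcheck code.

```python
import re

# Character list copied from the precomputed pattern in the report.
BAD_WHITESPACE = (
    '\u2001\u2000\u2003\u2002\x85\u2004\u2007\u1680\u2009\u2008\x0b\u200a'
    '\x0c\u180e\u2005\x1d\x1c\x1f\x1e\xa0\u3000\u2029\u2028\u2006\u202f\u205f'
)
BAD_WHITESPACE_RE = re.compile('[' + BAD_WHITESPACE + ']')


def find_bad_whitespace(text):
    """Return (line, column, 'U+XXXX') per disallowed character, 1-based."""
    hits = []
    # Split on '\n' only: str.splitlines() would also break on \x0b, \x0c,
    # \x85, \u2028 and similar, which would shift the reported positions.
    for lineno, line in enumerate(text.split('\n'), start=1):
        for match in BAD_WHITESPACE_RE.finditer(line):
            codepoint = 'U+%04X' % ord(match.group())
            hits.append((lineno, match.start() + 1, codepoint))
    return hits


# Hypothetical input: a no-break space between || and die.
print(find_bad_whitespace('a ||\u00a0die\n'))
```

Reporting the code point alongside the position helps here, since the characters themselves are hard to tell apart on screen.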

disallowing these characters shouldn't be a problem.  if an ebuild needs to delete the chars from a file (via sed or tr or something), it can always use bash escape sequences and the raw byte values instead of the literal characters.
Comment 1 Arfrever Frehtes Taifersar Arahesis 2015-10-15 07:09:24 UTC
Some of these characters might be valid in comments or messages not written in English.


unicodedata.category() is another way to find whitespace characters.
https://docs.python.org/3.6/library/unicodedata.html#unicodedata.category
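
A small Python 3 sketch of the category-based approach, assuming "whitespace" is taken to mean the separator categories Zs, Zl, and Zp. The helper name `separator_characters` is made up for this example. This selects a different set than `\s`: control characters such as \x0b, \x0c, and \x85 have category Cc and are not included.

```python
import sys
import unicodedata

# General categories for space, line, and paragraph separators.
SEPARATOR_CATEGORIES = {'Zs', 'Zl', 'Zp'}
# ASCII whitespace that stays allowed in ebuilds.
ASCII_ALLOWED = {'\t', '\n', '\r', ' '}


def separator_characters():
    """Collect every non-ASCII code point in a separator category."""
    found = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in SEPARATOR_CATEGORIES:
            found.add(ch)
    return found - ASCII_ALLOWED


separators = separator_characters()
# The result depends on the Unicode data bundled with the interpreter.
print(unicodedata.unidata_version, len(separators))
```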


Whether a character is a whitespace character depends on Unicode version.

E.g. \u180e was a whitespace character only in Unicode <6.3.0.
http://www.unicode.org/versions/Unicode6.3.0/ :
"The General_Category property value of U+180E MONGOLIAN VOWEL SEPARATOR has been changed from Zs to Cf. The values of other related properties such as Bidi_Class, White_Space, and Other_Default_Ignorable_Code_Point have been updated accordingly."

$ python3.3 -c 'import unicodedata; print(unicodedata.unidata_version)'
6.1.0
$ python3.4 -c 'import unicodedata; print(unicodedata.unidata_version)'
6.3.0
$ python3.3 -c 'import unicodedata; print((unicodedata.name("\u180e"), unicodedata.category("\u180e")))'
('MONGOLIAN VOWEL SEPARATOR', 'Zs')
$ python3.4 -c 'import unicodedata; print((unicodedata.name("\u180e"), unicodedata.category("\u180e")))'
('MONGOLIAN VOWEL SEPARATOR', 'Cf')
$ python3.3 -c 'import re; print(re.match(r"\s", "\u180e", re.UNICODE))'
<_sre.SRE_Match object at 0x7f6f51bf31d0>
$ python3.4 -c 'import re; print(re.match(r"\s", "\u180e", re.UNICODE))'
None
$
Comment 2 SpanKY gentoo-dev 2015-10-15 15:00:33 UTC
(In reply to Arfrever Frehtes Taifersar Arahesis from comment #1)

i don't think it's worth the hassle, especially considering:
 - they have yet to show up in the tree in any valid use (and there have been
   multiple invalid uses as i showed in that commit)
 - Gentoo requires english messages in ebuilds/eclasses (ignoring translations
   in metadata.xml, but we aren't talking about that here)
 - rejecting ebuilds from the main tree/official overlays doesn't preclude
   people from putting them in their own overlays and ignoring the errors
Comment 3 Tim Harder gentoo-dev 2019-12-03 02:17:38 UTC
For anyone interested in the check and not the repoman-specific implementation, it's now available in pkgcheck [1] and should show up in CI as the keyword result "BadWhitespaceCharacter" after the next release.

[1]: https://github.com/pkgcore/pkgcheck/commit/81f97bf1
Comment 4 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-07-12 03:18:14 UTC
repoman support has been removed per bug 835013.

Please file a new bug (or, I suppose, reopen this one) if you feel this check is still applicable to pkgcheck and doesn't already exist.