As reported, see URL:

sys-apps/busybox/files/busybox-1.24.1-unzip.patch: application/octet-stream; charset=binary (size=3903)
sys-apps/busybox/files/busybox-1.24.1-unzip-regression.patch: application/octet-stream; charset=binary (size=4383)

In particular, line 89 of busybox-1.24.1-unzip.patch contains binary garbage.

Policy reference: https://devmanual.gentoo.org/general-concepts/tree/index.html#what-belongs-in-the-tree%3F
read the files and you'll see they're obviously correct
This is not about the files being "correct". The point is that file(1) reports them as non-text files:

$ file -i busybox-1.24.1-unzip*.patch
busybox-1.24.1-unzip.patch: application/octet-stream; charset=binary
busybox-1.24.1-unzip-regression.patch: application/octet-stream; charset=binary

This is because they contain lines like:

+ inflating: ]3j½r«IK-%Ix

It is pointless to argue whether this is a false positive of the QA script or not. However, it clutters said script's output, and fixing it appears to be trivial, by quoting the binary chars with $'' in the offending line:

+ inflating: "$']3j\xc2\xbdr\xc2\xabI\x1b\x12K-%Ix'"
(In reply to Ulrich Müller from comment #2)
you are the one that attempted to reference a policy that does not apply -- these files are not binary images as in png/etc... these are perfectly valid patches. `git format-patch` produced them and `patch` has no problem applying them. they are perfectly valid UTF-8. this is not "binary garbage".

what you're now attempting to do, which is completely new, is reject any file in the tree that `file -i` detects as binary. it flagged these because the patch contains ^[^R, i.e. 0x1b 0x12 (ESC DC2). i'm not going to make bogus changes using non-portable shell logic in upstream code bases to appease a broken check. this runs against common sense.
The UTF-8 encoding isn't the problem, but the strange control characters make file(1) believe that this is binary. ESC would even be fine, but it barfs on DC2. Obviously file (or libmagic) must use some heuristic or another, and it chooses to reject any control chars other than 0x07-0x0d (BEL, BS, HT, LF, VT, FF, CR) and 0x1b (ESC). And I don't think that working around this in the QA script would be a good idea, because it could miss real binaries then.
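To make the heuristic described above concrete, here is a minimal sketch in Python. It is not libmagic's actual code: it only models the rule as stated in this comment (control bytes below 0x20 are tolerated only if they are 0x07-0x0d or 0x1b), and it ignores how file(1) treats 0x7f and bytes above it. The names ALLOWED_CONTROL and offending_control_bytes are made up for illustration; the sample bytes are taken from the $'...' form of the flagged line quoted earlier in this bug.

```python
# Control bytes that, per the comment above, file(1) tolerates in text:
# BEL, BS, HT, LF, VT, FF, CR (0x07-0x0d) and ESC (0x1b).
# Assumption: this models only the rule as described in this bug,
# not libmagic's full character classification.
ALLOWED_CONTROL = set(range(0x07, 0x0E)) | {0x1B}


def offending_control_bytes(data):
    """Return (offset, byte) pairs for control bytes outside the allowed set."""
    return [
        (offset, byte)
        for offset, byte in enumerate(data)
        if byte < 0x20 and byte not in ALLOWED_CONTROL
    ]


# Bytes of the flagged "inflating:" file name, as given by the $'...' quoting.
sample = b"]3j\xc2\xbdr\xc2\xabI\x1b\x12K-%Ix"

# The bytes decode cleanly as UTF-8, so the encoding is not the issue.
decoded = sample.decode("utf-8")

# Only DC2 (0x12) is reported; ESC (0x1b) is in the allowed set.
for offset, byte in offending_control_bytes(sample):
    print(f"offset {offset}: 0x{byte:02x}")
```

Under these assumptions the only byte reported is the DC2 at offset 10, which matches the observation that ESC alone would have been fine.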
i'm not changing the files to satisfy false positives. this is asinine.
Yeah, I guess for any such heuristic approach we must live with the fact that it cannot be perfect and will therefore have false positives and false negatives.
does the system not have a rolling whitelist or something so the output isn't cluttered w/noise ?
(In reply to SpanKY from comment #7)
> does the system not have a rolling whitelist or something so the output
> isn't cluttered w/noise ?

That wasn't necessary so far. Current list is only 8 files: 3 files which are binaries without any doubt, 3 empty files, and the 2 files from this bug.