Bug 862372

Summary: Take steps to detect malicious unicode in source code and pull requests
Product: Gentoo Security Reporter: bo0od
Component: Misc Assignee: Gentoo Security <security>
Status: UNCONFIRMED ---    
Severity: major CC: ajak, gentoo
Priority: Normal    
Version: unspecified   
Hardware: All   
OS: Linux   
See Also: https://bugs.gentoo.org/show_bug.cgi?id=821154
Whiteboard:
Package list:
Runtime testing required: ---

Description bo0od 2022-07-30 19:07:22 UTC
Quote from Trojan Source: Invisible Vulnerabilities (https://trojansource.codes/):

> **Invisible Source Code Vulnerabilities**
>
> Some Vulnerabilities are Invisible
> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.
>
> These adversarial encodings produce no visual artifacts.
>
> **The trick**
>
> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
>
> **The attack**
>
> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.
> ...
> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers.
This attack pattern is tracked as CVE-2021-42574.

CVE-2021-42574 at Red Hat: https://access.redhat.com/security/cve/cve-2021-42574
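
To make the quoted attack concrete, here is a minimal Python sketch of how a reviewer might flag the bidirectional control characters involved. It is not taken from the Trojan Source site or any existing tool; the function names `find_bidi_controls` and `has_unterminated_bidi` are hypothetical, and the per-line balance check is a simplification of the Unicode Bidirectional Algorithm, not a full implementation.

```python
import unicodedata

# Bidirectional control characters that open an embedding, override or isolate.
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, closes an embedding or override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, closes an isolate
BIDI_CONTROLS = EMBEDDING_OPENERS | ISOLATE_OPENERS | {PDF, PDI}


def find_bidi_controls(text):
    """Return (line_number, column, unicode_name) for every bidi control."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings


def has_unterminated_bidi(line):
    """True if the line opens an embedding, override or isolate it never closes.

    Simplified assumption: openers and closers are counted per line and per
    kind, which ignores the finer nesting rules of the real algorithm.
    """
    embeddings = 0
    isolates = 0
    for ch in line:
        if ch in EMBEDDING_OPENERS:
            embeddings += 1
        elif ch in ISOLATE_OPENERS:
            isolates += 1
        elif ch == PDF and embeddings > 0:
            embeddings -= 1
        elif ch == PDI and isolates > 0:
            isolates -= 1
    return embeddings > 0 or isolates > 0
```

A line such as `'if access_level != "user\u202e \u2066// Check if admin\u2069 \u2066" {'` would be reported by both functions, while plain ASCII lines pass untouched.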

> **The supply chain**
>
> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.
>
> **The technique**
>
> There are multiple techniques that can be used to exploit the visual reordering of source code tokens:
> * Early Returns cause a function to short circuit by executing a return statement that visually appears to be within a comment.
> * Commenting-Out causes a comment to visually appear as code, which in turn is not executed.
> * Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
>
> **The variant**
>
> A similar attack exists which uses homoglyphs, or characters that appear near identical.
>
> ...
>
> The above example defines two distinct functions with near indistinguishable visual differences highlighted for reference.
> An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code.
> This attack variant is tracked as CVE-2021-42694.
>
> **The defense**
>
> * Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
> * Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
> * Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
>
> **The paper**
>
> Complete details can be found in the related paper: https://trojansource.codes/trojan-source.pdf

By Nicholas Boucher and Ross Anderson, 2021, arXiv: https://arxiv.org/abs/2111.00169
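
The homoglyph variant and the "mixed-script confusable" defense quoted above can be illustrated with a rough Python sketch. It is only a heuristic under stated assumptions: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`) stands in for the script. The names `scripts_in` and `mixed_script_identifiers` are hypothetical, and legitimate identifiers that mix scripts would also be flagged.

```python
import re
import unicodedata

# A word character that is not a digit, followed by any word characters.
# In Python 3, \w is Unicode-aware for str patterns.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def scripts_in(identifier):
    """Approximate the scripts of an identifier's letters.

    Assumption: the first word of the Unicode character name is used as a
    stand-in for the script, which is not the real Script property.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def mixed_script_identifiers(source):
    """Return (identifier, sorted_scripts) for identifiers mixing scripts."""
    found = []
    for match in IDENTIFIER.finditer(source):
        word = match.group()
        scripts = scripts_in(word)
        if len(scripts) > 1:
            found.append((word, sorted(scripts)))
    return found
```

For example, an identifier spelled `say\u041dello` (with CYRILLIC CAPITAL LETTER EN in place of the Latin `H`) is reported as mixing `CYRILLIC` and `LATIN`, while the all-Latin `sayHello` is not.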

tasks:

- [ ] **check for existing compromises:** scan all distribution source code for existing unicode
- [ ] **educate existing and future distribution source code reviewers:** add a distribution source code reviewer policy to a github repository or on the distribution website which existing and future reviewers need to acknowledge, confirming that they understand the issue. This is more of a reminder and a conversation starter.
- [ ] **remove as much unicode from distribution source code as possible**: by reducing the amount of unicode in distribution source code, audits for malicious unicode with automated tools get simpler. If unicode is considered essential, it should be written as an escaped entity where possible, e.g. `&reg;` instead of `®`.
- [ ] **local check by reviewer:** document tools that distribution source code reviewers could/should use to scan future contributions for malicious unicode
- [ ] **remote cursory check:** add a github pull request hook that notifies when unicode is included in a pull request (This is just an additional, handy layer of protection. Since infrastructure should be distrusted this alone is not a full solution.)
- [ ] **build scripts / CI scripts:** should check if there is unicode in any files except in opt-in expected files. If there is unexpected unicode, the build should error out.
- [ ] **scan upstream projects' source code**: check if these are compromised by malicious unicode
- [ ] **notify upstream projects**: they might not be aware of this issue and might already be compromised by malicious unicode.
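
As a starting point for the build/CI task above, the following Python sketch reports every file outside an opt-in allowlist that contains non-ASCII bytes. It is an illustration, not an existing Gentoo or pkgcheck tool: the names `non_ascii_lines`, `check_tree` and `main` are hypothetical, the allowlist holds paths relative to the scanned root, and the tree is assumed to contain text files only (binary files would be flagged as well; only `.git` directories are skipped).

```python
from pathlib import Path


def non_ascii_lines(path):
    """Return the 1-based numbers of lines in `path` holding non-ASCII bytes."""
    hits = []
    with open(path, "rb") as handle:
        for line_no, raw in enumerate(handle, start=1):
            if not raw.isascii():
                hits.append(line_no)
    return hits


def check_tree(root, allowed=frozenset()):
    """Map each non-allowlisted file (relative POSIX path) to offending lines."""
    root = Path(root)
    report = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # skip repository metadata
        rel = relative.as_posix()
        if rel in allowed:
            continue  # file opted in to containing unicode
        hits = non_ascii_lines(path)
        if hits:
            report[rel] = hits
    return report


def main(argv):
    """Print findings for argv[0] (default: current directory); return exit code."""
    root = argv[0] if argv else "."
    report = check_tree(root)
    for rel, lines in report.items():
        print(f"{rel}: non-ASCII bytes on lines {lines}")
    return 1 if report else 0
```

A CI job could call `sys.exit(main(sys.argv[1:]))` so that unexpected unicode fails the build, and pass the opt-in file list through `allowed` once the project has decided which files are expected to contain it.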

references:

* https://tech.michaelaltfield.net/2021/11/22/bidi-unicode-github-defense/
* https://www.kicksecure.com/wiki/Unicode
Comment 1 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-07-31 05:51:39 UTC
I appreciate you raising the concern but this is a really generic bug that you've filed. It's not really specific to Gentoo at all in the points raised or specific suggestions, i.e. it needs tailoring.
Comment 2 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2024-05-27 00:36:40 UTC
(In reply to bo0od from comment #0)
> - [ ] **check for existing compromises:** scan all distribution
> source code for existing unicode

> - [ ] **educate existing and future distribution source code reviewers:**
> add a distribution source code reviewer policy to a github repository or on
> the distribution website which existing and future reviewers need to
> acknowledge, confirming that they understand the issue. This is more of a
> reminder and a conversation starter.

For GitHub at least, are we happy with the warning that they add (https://github.blog/changelog/2021-10-31-warning-about-bidirectional-unicode-text/)? Do we care about the other avenues of contribution?

> - [ ] **remove as much unicode from distribution source code as possible**:
> by reducing the amount of unicode in distribution source code, audits for
> malicious unicode with automated tools get simpler. If unicode is
> considered essential, it should be written as an escaped entity where
> possible, e.g. `&reg;` instead of `®`.

I'm not sure I see any merit to this, but see also my comments to the following.

> - [ ] **local check by reviewer:** document tools that distribution source
> code reviewers could/should use to scan future contributions for malicious
> unicode

For this one I think some kind of warning makes sense; maybe this is worth adding to pkgcheck? That won't cover all potential contributions to the entire distribution, but maybe it's good enough?

> - [ ] **remote cursory check:** add a github pull request hook that notifies
> when unicode is included in a pull request (This is just an additional,
> handy layer of protection. Since infrastructure should be distrusted this
> alone is not a full solution.)

Obsoleted by the native warning.

> - [ ] **build scripts / CI scripts:** should check if there is unicode in
> any files except in opt-in expected files. If there is unexpected unicode,
> the build should error out.
> - [ ] **scan upstream projects' source code**: check if these are compromised
> by malicious unicode
> - [ ] **notify upstream projects**: they might not be aware of this issue
> and might already be compromised by malicious unicode.

While reasonable goals, I don't think these ecosystem-wide surveillance topics are at all tasks for Gentoo Security to handle.