
Bug 862372

Summary: Take steps to detect malicious unicode in source code and pull requests
Product: Gentoo Security
Component: Misc
Status: UNCONFIRMED
Severity: major
Priority: Normal
Version: unspecified
Hardware: All
OS: Linux
Reporter: bo0od
Assignee: Gentoo Security <security>
CC: gentoo
See Also:
Package list:
Runtime testing required: ---

Description bo0od 2022-07-30 19:07:22 UTC
Quoting "Trojan Source: Invisible Vulnerabilities":

> **Invisible Source Code Vulnerabilities**
> Some Vulnerabilities are Invisible
> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.
> These adversarial encodings produce no visual artifacts.
> **The trick**
> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
> **The attack**
> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.
> ...
> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers.
> This attack pattern is tracked as CVE-2021-42574.

See also: CVE-2021-42574 at Red Hat.
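As a rough illustration of the reordering attack quoted above, the following Python sketch flags bidirectional control characters and reports lines where an embedding, override or isolate is opened but never closed. The names `BIDI_CONTROLS` and `scan_bidi` are hypothetical, and treating every line as its own unit is a simplifying assumption; this is not a tool that the quoted text or Gentoo prescribes.

```python
# Bidirectional control characters (written as escapes so this file stays ASCII).
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202d": "LRO", "\u202e": "RLO",
    "\u202c": "PDF",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI",
    "\u2069": "PDI",
}
EMBEDDING_OPENERS = {"LRE", "RLE", "LRO", "RLO"}  # closed by PDF
ISOLATE_OPENERS = {"LRI", "RLI", "FSI"}           # closed by PDI


def scan_bidi(text):
    """Return (hits, unterminated_lines).

    hits is a list of (line_no, column, name) for every bidi control found.
    unterminated_lines lists the lines that still have an open embedding,
    override or isolate when the line ends (assumption: one line = one unit).
    """
    hits = []
    unterminated_lines = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        embed_depth = 0
        isolate_depth = 0
        for column, ch in enumerate(line, start=1):
            name = BIDI_CONTROLS.get(ch)
            if name is None:
                continue
            hits.append((line_no, column, name))
            if name in EMBEDDING_OPENERS:
                embed_depth += 1
            elif name == "PDF":
                embed_depth = max(0, embed_depth - 1)
            elif name in ISOLATE_OPENERS:
                isolate_depth += 1
            elif name == "PDI":
                isolate_depth = max(0, isolate_depth - 1)
        if embed_depth or isolate_depth:
            unterminated_lines.append(line_no)
    return hits, unterminated_lines
```

A reviewer could run `scan_bidi(open(path, encoding="utf-8").read())` on each changed file and treat any non-empty result as a reason to look at the raw bytes rather than the rendered diff.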

> **The supply chain**
> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.
> **The technique**
> There are multiple techniques that can be used to exploit the visual reordering of source code tokens:
> * Early Returns cause a function to short circuit by executing a return statement that visually appears to be within a comment.
> * Commenting-Out causes a comment to visually appear as code, which in turn is not executed.
> * Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
> **The variant**
> A similar attack exists which uses homoglyphs, or characters that appear near identical.
> ...
> The above example defines two distinct functions with near indistinguishable visual differences highlighted for reference.
> An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code.
> This attack variant is tracked as CVE-2021-42694.
> **The defense**
> * Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
> * Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
> * Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
> **The paper**
> Complete details can be found in the related paper.

By Nicholas Boucher and Ross Anderson, 2021, arXiv.
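For the homoglyph variant quoted above, a minimal heuristic is to flag identifiers whose letters come from more than one script. The sketch below guesses the script from the first word of each character's Unicode name, which is an assumption made for brevity; proper confusable detection is defined in Unicode Technical Standard #39, and an identifier written entirely in a look-alike script would pass this check. The names `script_of` and `mixed_script_identifiers` are hypothetical, and the regular expression also matches words inside comments and strings.

```python
import re
import unicodedata

# Unicode-aware identifier pattern: a letter or underscore, then word characters.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def script_of(ch):
    """Rough script guess: first word of the Unicode name, letters only."""
    if not ch.isalpha():
        return None  # digits and underscores are shared across scripts
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] or None


def mixed_script_identifiers(source):
    """Return (identifier, sorted script names) for mixed-script identifiers."""
    findings = []
    for match in IDENTIFIER.finditer(source):
        ident = match.group()
        scripts = {s for s in map(script_of, ident) if s}
        if len(scripts) > 1:
            findings.append((ident, sorted(scripts)))
    return findings
```

For example, a function name in which a Latin `H` has been swapped for the Cyrillic letter U+041D would be reported with the scripts `CYRILLIC` and `LATIN`, while the all-Latin spelling would not be reported.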


- [ ] **check for potential existing compromises:** scan all distribution source code for existing Unicode.
- [ ] **educate existing and future distribution source code reviewers:** add a source code reviewer policy to a GitHub repository or to the distribution website, which existing and future reviewers must acknowledge to confirm that they understand the issue. This is more of a reminder and a conversation starter.
- [ ] **remove as much Unicode from distribution source code as possible:** reducing the amount of Unicode in distribution source code makes automated audits for malicious Unicode simpler. Where Unicode is considered essential, it should be written in an encoded form if possible, for example `&reg;` instead of `®`.
- [ ] **local check by reviewer:** document tools that distribution source code reviewers could or should use to scan future contributions for malicious Unicode.
- [ ] **remote cursory check:** add a GitHub pull request hook that notifies when Unicode is included in a pull request. (This is just an additional, handy layer of protection. Since infrastructure should be distrusted, this alone is not a full solution.)
- [ ] **build scripts / CI scripts:** these should check for Unicode in any file other than opted-in, expected files. If unexpected Unicode is found, the build should error out.
- [ ] **scan upstream projects' source code:** check whether these are compromised by malicious Unicode.
- [ ] **notify upstream projects:** these might not be aware of this issue and might already be compromised by malicious Unicode.
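
The build script / CI item in the list above could look roughly like the following Python sketch. It reads "Unicode" as "any non-ASCII byte", which is an interpretation rather than something the list states, and the allow-list `ALLOWED` and the function names are hypothetical. A real check would also need to skip binary files.

```python
from pathlib import Path

# Hypothetical opt-in list of files that are expected to contain non-ASCII text.
ALLOWED = {"README.md"}


def non_ascii_lines(data):
    """Return 1-based numbers of the lines that contain any byte >= 0x80."""
    return [
        line_no
        for line_no, line in enumerate(data.split(b"\n"), start=1)
        if any(byte >= 0x80 for byte in line)
    ]


def check_tree(root, allowed=ALLOWED):
    """Map each offending relative path under root to its offending lines."""
    root = Path(root)
    problems = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if ".git" in rel.parts or rel.as_posix() in allowed:
            continue  # skip VCS metadata and opted-in files
        lines = non_ascii_lines(path.read_bytes())
        if lines:
            problems[rel.as_posix()] = lines
    return problems


def main(argv):
    """Print findings and return a process exit status (1 means failure)."""
    root = argv[1] if len(argv) > 1 else "."
    problems = check_tree(root)
    for rel, lines in problems.items():
        print(f"{rel}: non-ASCII bytes on lines {lines}")
    return 1 if problems else 0
```

A build or CI step could end with `sys.exit(main(sys.argv))` so that unexpected non-ASCII content makes the job error out, while files named in the opt-in list are left alone.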


Comment 1 Sam James (archtester, Gentoo Infrastructure, gentoo-dev, Security) 2022-07-31 05:51:39 UTC
I appreciate you raising the concern but this is a really generic bug that you've filed. It's not really specific to Gentoo at all in the points raised or specific suggestions, i.e. it needs tailoring.