Bug 862372 - Take steps to detect malicious unicode in source code and pull requests
Summary: Take steps to detect malicious unicode in source code and pull requests
Status: UNCONFIRMED
Alias: None
Product: Gentoo Security
Classification: Unclassified
Component: Misc
Hardware: All Linux
Importance: Normal major
Assignee: Gentoo Security
Reported: 2022-07-30 19:07 UTC by bo0od
Modified: 2022-07-31 11:40 UTC
CC List: 1 user

Description bo0od 2022-07-30 19:07:22 UTC
Quoting from "Trojan Source: Invisible Vulnerabilities" (https://trojansource.codes/):

> **Invisible Source Code Vulnerabilities**
>
> Some Vulnerabilities are Invisible
> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.
>
> These adversarial encodings produce no visual artifacts.
>
> **The trick**
>
> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
>
> **The attack**
>
> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.
> ...
> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers.
This attack pattern is tracked as CVE-2021-42574.

CVE-2021-42574 at Red Hat: https://access.redhat.com/security/cve/cve-2021-42574
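
To make the quoted attack concrete, here is a minimal Python sketch that reports the bidirectional control characters involved. The function name `find_bidi_controls` and the exact character set are assumptions chosen for illustration (the embedding, override, and isolate controls); this is not a tool referenced in this bug.

```python
import unicodedata

# Bidirectional control characters that can reorder displayed text.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(text):
    """Return (line, column, character name) for each bidi control in text."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits
```

An empty result means none of these characters are present. Any hit in source code is probably worth a manual look, since legitimate uses outside of right-to-left text are rare.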

> **The supply chain**
>
> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.
>
> **The technique**
>
> There are multiple techniques that can be used to exploit the visual reordering of source code tokens:
> * Early Returns cause a function to short circuit by executing a return statement that visually appears to be within a comment.
> * Commenting-Out causes a comment to visually appear as code, which in turn is not executed.
> * Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
>
> **The variant**
>
> A similar attack exists which uses homoglyphs, or characters that appear near identical.
>
> ...
>
> The above example defines two distinct functions with near indistinguishable visual differences highlighted for reference.
> An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code.
> This attack variant is tracked as CVE-2021-42694.
>
> **The defense**
>
> * Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
> * Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
> * Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
>
> **The paper**
>
> Complete details can be found in the related paper: https://trojansource.codes/trojan-source.pdf

By Nicholas Boucher and Ross Anderson, 2021 (arXiv: https://arxiv.org/abs/2111.00169).
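
The homoglyph variant and the "mixed-script confusable" defense quoted above can be illustrated with a rough Python heuristic. Python's standard library does not expose the Unicode Script property, so this sketch approximates a letter's script by the first word of its Unicode character name (for example `LATIN` or `CYRILLIC`). That approximation, and the names `scripts_in` and `mixed_script_identifiers`, are assumptions made for illustration. A real check would more likely rely on proper script and confusable data.

```python
import re
import unicodedata

# Identifier-like tokens: a letter or underscore followed by word characters.
IDENTIFIER = re.compile(r"[^\W\d]\w*")

def scripts_in(token):
    """Approximate the scripts used by a token's letters (heuristic)."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # First word of the character name, e.g. "LATIN" or "CYRILLIC".
                scripts.add(name.split()[0])
    return scripts

def mixed_script_identifiers(source):
    """Return identifier-like tokens whose letters span more than one script."""
    return [tok for tok in IDENTIFIER.findall(source)
            if len(scripts_in(tok)) > 1]
```

For instance, a function named `sayHello` passes, while a lookalike whose `H` is the Cyrillic letter U+041D is reported. The regular expression also matches words inside comments and strings, so the output is a list of candidates for review rather than a verdict.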

tasks:

- [ ] **check for potential existing compromises:** scan all distribution source code for existing unicode
- [ ] **educate existing and future distribution source code reviewers:** add a distribution source code reviewer policy to a GitHub repository or to the distribution website, which existing and future reviewers must acknowledge to confirm that they understand the issue. This is more of a reminder and a conversation starter.
- [ ] **remove as much unicode from distribution source code as possible:** reducing the amount of unicode in distribution source code makes audits for malicious unicode with automated tools simpler. Where unicode is considered essential, it should be written in an encoded ASCII form (for example `&reg;`) instead of the literal character `®`.
- [ ] **local check by reviewer:** document tools that distribution source code reviewers could/should use to scan future contributions for malicious unicode
- [ ] **remote cursory check:** add a GitHub pull request hook that notifies when unicode is included in a pull request. (This is just an additional, handy layer of protection. Since infrastructure should be distrusted, this alone is not a full solution.)
- [ ] **build scripts / CI scripts:** should check if there is unicode in any files except in opt-in expected files. If there is unexpected unicode, the build should error out.
- [ ] **scan upstream projects source code**: check if these are compromised by malicious unicode
- [ ] **notify upstream projects**: these might not be aware of this issue and already compromised by malicious unicode.
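
The scanning and build/CI tasks above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the names `non_ascii_lines`, `scan_tree`, and `main` are hypothetical, "unicode" is interpreted as "any non-ASCII byte", and the opt-in list is a plain set of relative paths. Binary files would have to be listed there as well or skipped by other means.

```python
from pathlib import Path

def non_ascii_lines(data):
    """Return 1-based numbers of lines in bytes `data` with non-ASCII bytes."""
    return [n for n, line in enumerate(data.split(b"\n"), start=1)
            if any(b > 0x7F for b in line)]

def scan_tree(root, allowlist=frozenset()):
    """Map each non-allowlisted file under root to its non-ASCII line numbers."""
    findings = {}
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel in allowlist:
            continue  # opt-in file where unicode is expected
        lines = non_ascii_lines(path.read_bytes())
        if lines:
            findings[rel] = lines
    return findings

def main(argv):
    """argv: [source_dir, allowed_relative_path, ...]; returns an exit code."""
    found = scan_tree(argv[0], frozenset(argv[1:]))
    for rel, lines in found.items():
        print(f"{rel}: non-ASCII bytes on lines {lines}")
    return 1 if found else 0
```

A build script could call `sys.exit(main(sys.argv[1:]))` so that unexpected unicode makes the build error out, while files on the opt-in list are skipped.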

references:

* https://tech.michaelaltfield.net/2021/11/22/bidi-unicode-github-defense/
* https://www.kicksecure.com/wiki/Unicode

Comment 1 Sam James (archtester, Gentoo Infrastructure, gentoo-dev, Security) 2022-07-31 05:51:39 UTC
I appreciate you raising the concern, but this is a really generic bug that you've filed. Neither the points raised nor the specific suggestions are really specific to Gentoo, i.e. it needs tailoring.