|Summary:||Take steps to detect malicious unicode in source code and pull requests|
|Component:||Misc||Assignee:||Gentoo Security <security>|
|Package list:||Runtime testing required:||---|
Description bo0od 2022-07-30 19:07:22 UTC
Quote https://trojansource.codes/ Trojan Source: Invisible Vulnerabilities

> **Invisible Source Code Vulnerabilities**
>
> Some Vulnerabilities are Invisible
> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.
>
> These adversarial encodings produce no visual artifacts.
>
> **The trick**
>
> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
>
> **The attack**
>
> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.
> ...
> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers. This attack pattern is tracked as CVE-2021-42574.

CVE-2021-42574 at Red Hat: https://access.redhat.com/security/cve/cve-2021-42574

> **The supply chain**
>
> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.
>
> **The technique**
>
> There are multiple techniques that can be used to exploit the visual reordering of source code tokens:
> * Early Returns cause a function to short circuit by executing a return statement that visually appears to be within a comment.
> * Commenting-Out causes a comment to visually appear as code, which in turn is not executed.
> * Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
>
> **The variant**
>
> A similar attack exists which uses homoglyphs, or characters that appear near identical.
>
> ...
>
> The above example defines two distinct functions with near indistinguishable visual differences highlighted for reference.
> An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code.
> This attack variant is tracked as CVE-2021-42694.
>
> **The defense**
>
> * Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
> * Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
> * Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
>
> **The paper**
>
> Complete details can be found in the related paper, https://trojansource.codes/trojan-source.pdf, by Nicholas Boucher and Ross Anderson, 2021 (arXiv: https://arxiv.org/abs/2111.00169).

tasks:

- [ ] **check for potential existing compromises:** scan all distribution source code for existing Unicode
- [ ] **educate existing and future distribution source code reviewers:** add a distribution source code reviewer policy to a GitHub repository or the distribution website, which existing and future reviewers need to acknowledge to confirm that they understand the issue. More of a reminder, a conversation starter.
- [ ] **remove as much Unicode from distribution source code as possible:** reducing the amount of Unicode in distribution source code makes audits for malicious Unicode with automated tools simpler. If possible, where Unicode is considered essential, instead of writing `®` directly it should be written as an ASCII-only escape such as `&reg;`.
- [ ] **local check by reviewer:** document tools that distribution source code reviewers could/should use to scan future contributions for malicious Unicode
- [ ] **remote cursory check:** add a GitHub pull request hook that notifies when Unicode is included in a pull request. (This is just an additional, handy layer of protection. Since infrastructure should be distrusted, this alone is not a full solution.)
- [ ] **build scripts / CI scripts:** should check whether there is Unicode in any files except opt-in expected files. If there is unexpected Unicode, the build should error out.
- [ ] **scan upstream projects' source code:** check whether these are compromised by malicious Unicode
- [ ] **notify upstream projects:** these might not be aware of this issue and might already be compromised by malicious Unicode.

references:

* https://tech.michaelaltfield.net/2021/11/22/bidi-unicode-github-defense/
* https://www.kicksecure.com/wiki/Unicode
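For illustration only, the "build scripts / CI scripts" task could look roughly like the Python sketch below. It is not an existing Gentoo tool; the names `scan_text`, `scan_file`, and `main` are hypothetical. It flags the bidirectional control characters listed in the Trojan Source paper and, separately, any other non-ASCII character, and returns a non-zero status when anything is found. It assumes the caller passes only files that are not on the opt-in list, and it does not attempt real mixed-script confusable detection: a homoglyph is reported only as a generic non-ASCII character.

```python
import sys
import unicodedata
from pathlib import Path

# Bidirectional control characters listed in the Trojan Source paper,
# written as escapes so that this file itself stays pure ASCII.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def scan_text(text):
    """Return (line, column, code point, name, kind) for each suspicious character."""
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                kind = "bidi-control"
            elif ord(ch) > 127:
                kind = "non-ascii"
            else:
                continue
            name = unicodedata.name(ch, "UNKNOWN")
            findings.append((line_no, col, f"U+{ord(ch):04X}", name, kind))
    return findings


def scan_file(path):
    """Scan one file; undecodable bytes become U+FFFD and are flagged too."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return scan_text(text)


def main(paths):
    """Print one line per finding; return 1 if anything was found, else 0."""
    failed = False
    for path in paths:
        for line_no, col, code_point, name, kind in scan_file(path):
            failed = True
            print(f"{path}:{line_no}:{col}: {kind} {code_point} {name}")
    return 1 if failed else 0


# Only act when file paths are given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Saved under a hypothetical name such as `check_unicode.py`, it could be run from a build or CI step as `python check_unicode.py file1 file2 ...`, and the step fails if any listed file contains unexpected Unicode. Flagging every non-ASCII character is deliberately blunt: it matches the "error out unless opted in" policy proposed in the task list rather than trying to decide which characters are malicious.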
Comment 1 Sam James 2022-07-31 05:51:39 UTC
I appreciate you raising the concern but this is a really generic bug that you've filed. It's not really specific to Gentoo at all in the points raised or specific suggestions, i.e. it needs tailoring.