Bug 830882 - dev-python/lxml: segfault in Objects/exceptions.c:237 when locale is set to en_US.ISO-8859-15 via app-portage/repoman
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal major
Assignee: Python Gentoo Team
Reported: 2022-01-10 03:19 UTC by Joshua Kinard
Modified: 2022-06-01 05:18 UTC
CC List: 4 users


Attachments
full gdb backtrace (repoman-segfault-20220110.txt,41.12 KB, text/plain)
2022-01-10 03:19 UTC, Joshua Kinard
emerge --info from my dev box (emerge-info-20220110.txt,6.36 KB, text/plain)
2022-01-10 03:19 UTC, Joshua Kinard
lxml test case #1, etree.parse() XML from a file (test1.py,121 bytes, text/plain)
2022-01-14 21:18 UTC, Joshua Kinard
lxml test case #1 test file (test1.xml,55 bytes, application/xml)
2022-01-14 21:19 UTC, Joshua Kinard
lxml test case #2, XML begins w/ newline (test2.py,151 bytes, text/plain)
2022-01-14 21:23 UTC, Joshua Kinard

Description Joshua Kinard gentoo-dev 2022-01-10 03:19:31 UTC
Created attachment 761726 [details]
full gdb backtrace

When attempting to do some MIPS keywording, I ran into an issue where running "repoman full -d -x" in a package directory on my developer git tree caused repoman to throw a segmentation fault in libpython3.10.so.1.0.  I also reproduced it on the main Portage tree that my machines use for emerge, and also under Python-3.9.

I was able to gdb out the cause as being in Python's Objects/exceptions.c:237:

(gdb) file /usr/bin/python
Reading symbols from /usr/bin/python...
Reading symbols from /usr/lib/debug//usr/bin/python-exec2c.debug...

(gdb) directory /ramfs/portage/dev-lang/python-3.10.1-r3/work/Python-3.10.1
Source directories searched: /ramfs/portage/dev-lang/python-3.10.1-r3/work/Python-3.10.1:$cdir:$cwd

(gdb) run /usr/bin/repoman full -d -x
Starting program: /usr/bin/python /usr/bin/repoman full -d -x
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
process 21832 is executing new program: /usr/bin/python3.10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

RepoMan scours the neighborhood...
[Detaching after vfork from child process 21836]
[Detaching after vfork from child process 21857]
[Detaching after vfork from child process 21878]
[Detaching after vfork from child process 21899]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7d1db5f in BaseException_set_tb (_unused_ignored=0x0, tb=<traceback at remote 0x7ffff60a2900>, self=0x0) at Objects/exceptions.c:237
237         Py_XSETREF(self->traceback, tb);

The problem is that 'self' is NULL, so attempting to dereference 'self->traceback' kills it.

Both my developer tree and my working repo tree sit on a NAS that my machines mount over NFSv4.2.  Already tried rebooting both my dev box as well as the NAS itself to rule out the usual culprits, but no dice.

Frame #29 looked interesting, as it's the final codepath that repoman hit before the crash:

#29 0x00007ffff7e05e16 in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x55555587d270, for file /usr/lib/python3.10/site-packages/portage/xml/metadata.py, line 442, in parse_metadata_use (xml_tree=<lxml.etree._ElementTree at remote 0x7ffff628c640>, uselist={}), tstate=0x555555577470)
    at ./Include/internal/pycore_ceval.h:46

Line 442 in that file:
    usetags = xml_tree.findall("use")

I dropped a pdb.set_trace() call just before that line, and then stepped through the Python code until it segfaulted:

# repoman full -d -x

RepoMan scours the neighborhood...
> /usr/lib/python3.10/site-packages/portage/xml/metadata.py(444)parse_metadata_use()
-> usetags = xml_tree.findall("use")
(Pdb) step
--Call--
> /usr/lib/python3.10/encodings/iso8859_15.py(14)decode()
-> def decode(self,input,errors='strict'):
(Pdb)
> /usr/lib/python3.10/encodings/iso8859_15.py(15)decode()
-> return codecs.charmap_decode(input,errors,decoding_table)
(Pdb)
--Return--
> /usr/lib/python3.10/encodings/iso8859_15.py(15)decode()->('src/lxml/_elementpath.py', 24)
-> return codecs.charmap_decode(input,errors,decoding_table)
(Pdb)
Segmentation fault

This indicates that there is some kind of issue with how Python handles the ISO8859-15 locale (Western w/ Euro).  To be sure, I changed to ISO8859-1 (Western w/o Euro), and repoman will then work fine.

My guess is that if this is a core issue with the ISO8859-15 locale, then glibc is probably at fault in some way, but I am unsure how one would debug that.  The following might make it reproducible on your end: to get ISO8859-15, any references to ISO8859-1 need to be commented out in /etc/locale.gen before running locale-gen; otherwise, locale-gen will complain that one is a duplicate of the other and will (silently) skip it.
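
To confirm which encoding the interpreter actually derives from the environment after regenerating locales, a small check like the following can help. This is only a sketch: `locale_codec_report` is a hypothetical helper name, and it uses nothing beyond the standard `locale` and `codecs` modules.

```python
import codecs
import locale


def locale_codec_report():
    """Return the active LC_CTYPE setting, the preferred encoding, and its codec name."""
    try:
        # Adopt whatever LANG / LC_* variables the environment provides.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The requested locale has not been generated on this system.
        pass
    encoding = locale.getpreferredencoding(False)
    return {
        "LC_CTYPE": locale.setlocale(locale.LC_CTYPE),  # query only, no change
        "encoding": encoding,
        "codec": codecs.lookup(encoding).name,
    }


print(locale_codec_report())
```

If the locale was skipped by locale-gen, the reported encoding will typically not be the ISO-8859-15 one that was asked for, which is worth ruling out before chasing the crash itself.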
Comment 1 Joshua Kinard gentoo-dev 2022-01-10 03:19:58 UTC
Created attachment 761727 [details]
emerge --info from my dev box
Comment 2 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-01-13 04:04:50 UTC
I think you'll need to report this upstream and we go from there.

You might want to try to construct a minimal reproducer using the lxml Python module though.

Does it happen outside of lxml too? Just parsing any string with this encoding?
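
One way to probe the last question is to exercise the ISO-8859-15 codec with the standard library alone. The sketch below decodes the same byte string that appears in the pdb session from the description, plus the one byte where ISO-8859-15 and ISO-8859-1 differ. It does not involve lxml at all, so a clean run only suggests the codec by itself is not the culprit.

```python
# Decode with the pure-Python charmap codec, no lxml involved.
raw = b"src/lxml/_elementpath.py"
decoded = raw.decode("iso8859_15")
print(decoded)

# Byte 0xA4 is the euro sign in ISO-8859-15 and the generic currency sign in ISO-8859-1.
euro = b"\xa4".decode("iso8859_15")
currency = b"\xa4".decode("iso8859_1")
print(ascii(euro), ascii(currency))
```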
Comment 3 Joshua Kinard gentoo-dev 2022-01-13 07:08:26 UTC
(In reply to Sam James from comment #2)
> I think you'll need to report this upstream and we go from there.
> 
> You might want to try construct a minimal reproducer using the lxml Python
> module though.
> 
> Does it happen outside of lxml too? Just parsing any string with this
> encoding?

I am not sure.  I did not do any further debugging after the call chain in Python's debugger stopped at iso8859_15.py(15)decode(), so I don't know why it got to the point where it needed to throw an exception, yet passed a NULL 'self' into CPython.  Both Python's codecs and lxml modules are areas of Python I know very little about.

Also, one of the pieces of information I am missing is what in the backend Portage metadata was involved that didn't fly with ISO8859-15?  Clearly repoman was in the middle of parsing something when it choked on its own exception.  Its other modes, like 'manifest', seem to work fine under that codec, just not the 'full' mode.
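
A low-effort way to see which Python frame was active at the moment of the crash is the standard library's `faulthandler` module, which dumps the Python-level traceback when the process receives SIGSEGV. The sketch below writes the dump to a log file; the file name is an arbitrary choice for illustration.

```python
import faulthandler
import os
import tempfile

# Arbitrary log location chosen for this sketch.
log_path = os.path.join(tempfile.gettempdir(), "segfault-traceback.log")

# The file object must stay open for as long as the handler is installed.
crash_log = open(log_path, "w")

# On SIGSEGV (and a few other fatal signals), dump the Python traceback of all threads.
faulthandler.enable(file=crash_log, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled(), "->", log_path)
```

The same handler can be switched on without editing any code by running `python3 -X faulthandler` or by setting `PYTHONFAULTHANDLER=1` in the environment; in that case the traceback goes to stderr.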
Comment 4 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-01-13 08:00:47 UTC
This segfaults for me if I configure my locale as en_US.ISO-8859-15.
```
#!/usr/bin/env python3
from lxml import etree

xml="""
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pkgmetadata SYSTEM "https://www.gentoo.org/dtd/metadata.dtd">
<pkgmetadata>
	<maintainer type="project">
		<email>python@gentoo.org</email>
	</maintainer>
	<stabilize-allarches/>
	<upstream>
		<remote-id type="pypi">build</remote-id>
		<remote-id type="github">pypa/build</remote-id>
	</upstream>
</pkgmetadata>
"""

root = etree.fromstring(xml).findall("use")
print(root)
```
Comment 5 Joshua Kinard gentoo-dev 2022-01-13 09:39:50 UTC
(In reply to Sam James from comment #4)
> This segfaults for me if I configure my locale as en_US.ISO-8859-15.
> ```
> #!/usr/bin/env python3
> from lxml import etree
> 
> xml="""
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE pkgmetadata SYSTEM "https://www.gentoo.org/dtd/metadata.dtd">
> <pkgmetadata>
> 	<maintainer type="project">
> 		<email>python@gentoo.org</email>
> 	</maintainer>
> 	<stabilize-allarches/>
> 	<upstream>
> 		<remote-id type="pypi">build</remote-id>
> 		<remote-id type="github">pypa/build</remote-id>
> 	</upstream>
> </pkgmetadata>
> """
> 
> root = etree.fromstring(xml).findall("use")
> print(root)
> ```

That does indeed SIGSEGV, but in a completely different file:
# gdb
(gdb) file /usr/bin/python
Reading symbols from /usr/bin/python...
Reading symbols from /usr/lib/debug//usr/bin/python-exec2c.debug...

(gdb) directory /ramfs/portage/dev-lang/python-3.10.1-r3/work/Python-3.10.1
Source directories searched: /ramfs/portage/dev-lang/python-3.10.1-r3/work/Python-3.10.1:$cdir:$cwd

(gdb) run py-sigsegv-202201.py
Starting program: /usr/bin/python py-sigsegv-202201.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
process 7966 is executing new program: /usr/bin/python3.10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
__pyx_f_4lxml_5etree__getThreadErrorLog (__pyx_v_name='_GlobalErrorLog') at src/lxml/etree.c:48609
48609   src/lxml/etree.c: No such file or directory.

As the last line indicates, for some reason, the lxml source isn't being unpacked or found (I am probably missing a step somewhere -- isn't lxml built into Python these days?), but the segfault in src/lxml/etree.c looks to be somewhat different than the one in Objects/exceptions.c.  This means you accidentally found a new bug, but one that happens to have a similar root cause to the one I reported.  Hah!

I did initially suspect the metadata.xml file, but I think instead, the buggy data is in the tree metadata that is generated on the rsync master and sent down in $PORTDIR/metadata.  I *think* that's what repoman is scanning when it chokes, but probably need a portage dev to validate that, as they know repoman's internals a lot better than I do.

FWIW, dmesg shows this as similar to my original crash:
repoman[8186]: segfault at 20 ip 00007aaffb480b5f sp 00007ffdb5630b00 error 4 in libpython3.10.so.1.0[7aaffb404000+1f3000]
Code: 1f 84 00 00 00 00 00 0f 1f 40 00 48 83 ec 08 48 85 f6 74 5e 48 3b 35 a0 62 26 00 74 0d 48 8b 05 07 62 26 00 48 39 46 08 75 2b <4c> 8b 47 20 48 ff 06 48 89 77 20 4d 85 c0 74 05 49 ff 08 74 0c 31

And your testcase:
python3[28567]: segfault at 0 ip 00007bc6a34f4abc sp 00007ffdae805130 error 6 in etree.cpython-310-x86_64-linux-gnu.so[7bc6a34ab000+15c000]
Code: 24 38 0f 88 76 02 00 00 49 8b 01 48 83 f8 01 0f 84 49 01 00 00 49 89 01 48 ff 09 0f 84 fa 01 00 00 49 ff 0b 0f 84 c6 01 00 00 <49> ff 0a 0f 84 9c 01 00 00 48 8b bb 90 00 00 00 4c 89 f9 4c 89 c2

Different modules, different source, both are NULL dereferences, but both are related to locale being ISO8859-15.  Weird!
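
The two dmesg lines can be read a little further. On x86, the kernel's "error N" field is the page-fault error code, a small bit mask. The sketch below decodes it; `decode_pf_error` is a hypothetical helper, and the bit meanings are the standard x86 ones.

```python
# Page-fault error-code bits on x86, as printed in the kernel's "error N" field.
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_BITS = (
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
)


def decode_pf_error(code):
    """Return a list of human-readable properties for an x86 page-fault error code."""
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text:
            parts.append(text)
    return parts


for code in (4, 6):
    print(code, decode_pf_error(code))
```

Read this way, "segfault at 20 ... error 4" is a user-mode read at address 0x20, which is consistent with loading a struct field at offset 0x20 from a NULL 'self', and "segfault at 0 ... error 6" is a user-mode write at address 0, which is consistent with decrementing a reference count through a NULL object pointer.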

If you want to try to reproduce my error, set your locale to ISO8859-15 and just run "repoman full -d -x" from a package directory within $PORTDIR.  If you have an idea of how to backtrace that to whatever data repoman was chewing on, then I will open an upstream bug with both testcases and see what they think.
Comment 6 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-01-13 10:08:01 UTC
(In reply to Joshua Kinard from comment #5)
> (In reply to Sam James from comment #4)
> > This segfaults for me if I configure my locale as en_US.ISO-8859-15.
[snip]
>
> 48609   src/lxml/etree.c: No such file or directory.
> 
> As the last line indicates, for some reason, the lxml source isn't being
> unpacked or found (I am probably missing a step somewhere -- isn't lxml
> built into Python these days?), but the segfault in src/lxml/etree.c looks
> to be somewhat different than the one in Objects/exceptions.c.  This means
> you accidentally found a new bug, but one that happens to have a similar
> root cause to the one I reported.  Hah!

Try installing dev-python/lxml with debug symbols & installsources.

And wow!

> 
> I did initially suspect the metadata.xml file, but I think instead, the
> buggy data is in the tree metadata that is generated on the rsync master and
> sent down in $PORTDIR/metadata.  I *think* that's what repoman is scanning
> when it chokes, but probably need a portage dev to validate that, as they
> know repoman's internals a lot better than I do.
> 

Yeah, I went into my git checkout's dev-python/build (noticed it
from your attached backtrace), ran 'repoman full -dx' and got the crash.

You could strace (possibly with limited syscalls like open()) to see
if it touches metadata/*.
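
As a Python-side complement to strace, an audit hook (Python 3.8+) can record every path passed to Python's own open() and os.open(). This is only a sketch with hypothetical helper names, and it has a real limitation: files opened from C code, for example by libxml2 on behalf of lxml, never raise the "open" audit event, so strace remains the authoritative view.

```python
import os
import sys

opened_paths = []


def record_open(event, args):
    """Audit hook: remember every str/bytes path handed to open() or os.open()."""
    if event == "open":
        path = args[0]
        if isinstance(path, (str, bytes)):
            opened_paths.append(os.fsdecode(path))


# Audit hooks cannot be removed again once they are installed.
sys.addaudithook(record_open)


def touched(fragment):
    """Return the recorded paths that contain the given substring."""
    return [p for p in opened_paths if fragment in p]
```

Installed early in the process, `touched("metadata")` afterwards lists the Python-level opens whose path mentions metadata.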

> FWIW, dmesg shows this as similar to my original crash:
> repoman[8186]: segfault at 20 ip 00007aaffb480b5f sp 00007ffdb5630b00 error
> 4 in libpython3.10.so.1.0[7aaffb404000+1f3000]
> Code: 1f 84 00 00 00 00 00 0f 1f 40 00 48 83 ec 08 48 85 f6 74 5e 48 3b 35
> a0 62 26 00 74 0d 48 8b 05 07 62 26 00 48 39 46 08 75 2b <4c> 8b 47 20 48 ff
> 06 48 89 77 20 4d 85 c0 74 05 49 ff 08 74 0c 31
> 
> And your testcase:
> python3[28567]: segfault at 0 ip 00007bc6a34f4abc sp 00007ffdae805130 error
> 6 in etree.cpython-310-x86_64-linux-gnu.so[7bc6a34ab000+15c000]
> Code: 24 38 0f 88 76 02 00 00 49 8b 01 48 83 f8 01 0f 84 49 01 00 00 49 89
> 01 48 ff 09 0f 84 fa 01 00 00 49 ff 0b 0f 84 c6 01 00 00 <49> ff 0a 0f 84 9c
> 01 00 00 48 8b bb 90 00 00 00 4c 89 f9 4c 89 c2
> 
> Different modules, different source, both are NULL dereferences, but both
> are related to locale being ISO8859-15.  Weird!
> 
> If you want to try to reproduce my error, set your locale to ISO8859-15 and
> just run "repoman full -d -x" from a package directory within $PORTDIR.  If
> you have an idea of how to backtrace that to whatever data repoman was
> chewing on, then I will open an upstream bug with both testcases and see
> what they think.

Reproduced that too! I'd try shoving a print() on xml_tree at:
>Line 442 in that file:
>    usetags = xml_tree.findall("use")
Comment 7 Joshua Kinard gentoo-dev 2022-01-13 20:13:07 UTC
Okay, I was wrong, you weren't.  I should have read the top of metadata.py in repoman's source, and it is looking at the metadata.xml files in the package directories, and not the tree-generated metadata.

That said, this "bug" only manifests under some weird conditions, primarily when lxml.etree itself attempts to throw a traceback.  If the traceback comes out of core Python, there is no segfault.  So it is some kind of interaction between lxml and Python going on.  And, I think it is limited to just Gentoo systems.

In your provided testcase, because you use the triple quote to define the "xml" variable, that causes your sample XML to begin with a newline, which lxml really doesn't like.  On an unaffected system, lxml will complain:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3252, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1912, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 2
lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document, line 2, column 6

On an affected system, if using the ISO-8859-15 locale, you get a segmentation fault.  Strangely enough, if you fix that issue, the segmentation fault doesn't happen:

Traceback (most recent call last):
  File "./sam.py", line 18, in <module>
    root = etree.fromstring(xml).findall("use")
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
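
The leading-newline effect is easy to see in isolation. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without lxml installed); the exact error text differs between the two parsers, but both reject an XML declaration that is not at the very start of the document.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; the bug report uses lxml.etree

xml = """
<?xml version="1.0" encoding="UTF-8"?>
<foo>bar</foo>
"""

# The literal starts right after the opening quotes, so its first character is "\n".
print(repr(xml[:7]))


def parse_or_error(text):
    """Parse XML given as str (encoded to bytes first); return the root tag or the error."""
    try:
        return ET.fromstring(text.encode("utf-8")).tag
    except ET.ParseError as exc:
        return "ParseError: %s" % exc


print(parse_or_error(xml))           # declaration sits on line 2, so the parser rejects it
print(parse_or_error(xml.lstrip()))  # declaration comes first, so parsing succeeds
```

Encoding to bytes before parsing also matches the hint in the ValueError above, which asks for bytes input when the text carries an encoding declaration.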

I went on to create another testcase that uses etree.parse() incorrectly:
#!/usr/bin/env python3
from lxml import etree
x=b'<foo>bar</foo>'
x2=etree.parse(x)

etree.parse() expects a file path, not an XML string, so it should naturally return this exception trace:

Traceback (most recent call last):
  File "/usr/portage/local/sys-kernel/mips-sources/./x.py", line 5, in <module>
    x2=etree.parse(x)
  File "src/lxml/etree.pyx", line 3536, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1901, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file '<foo>bar</foo>': failed to load external entity "<foo>bar</foo>"

But if our locale is set to ISO-8859-15, we segfault.  Both your testcase and my testcase fault in the same C function.  It turns out, though, that lxml generates its C sourcecode from Python metacode.  Learning that, I pointed GDB at the generated source and discovered that the generated code is not very helpful:

Program received signal SIGSEGV, Segmentation fault.
__pyx_f_4lxml_5etree__getThreadErrorLog (__pyx_v_name='_GlobalErrorLog') at src/lxml/etree.c:48609
warning: Source file is more recent than executable.
48609         __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
(gdb) list
48604         __Pyx_XDECREF(((PyObject *)__pyx_r));
48605         __Pyx_INCREF(((PyObject *)__pyx_v_log));
48606         __pyx_r = ((struct __pyx_obj_4lxml_5etree__BaseErrorLog *)__pyx_v_log);
48607         __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
48608         __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
48609         __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
48610         goto __pyx_L7_except_return;
48611       }
48612       goto __pyx_L6_except_error;
48613       __pyx_L6_except_error:;
(gdb)

This source is generated from src/lxml/parser.pxi in the lxml source, and the two testcases above fault at two different exception points in the '_raiseParseError' function:

    Your case:
    raise error_log._buildParseException(XMLSyntaxError, u"Document is not well formed")

    My case:
    raise IOError, message

This kinda leads me to think the real fault lies in Objects/exceptions.c, where CPython is not checking a pointer for NULL before attempting to dereference it.  Why the call to 'findall' on an "lxml.etree._ElementTree object" gets all the way into Objects/exceptions.c baffles me.

To that end, I have managed to work out a testcase for my original issue:

test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>bar</foo>

test.py:
#!/usr/bin/env python3
from lxml import etree
x=etree.parse("./test.xml")
x.findall("foo")

I tested this on both my Gentoo dev box (amd64) and my SGI machine (mips), and both return a segmentation fault.  The MIPS machine actually hints that the real bug may be in lxml's _elementpath.py file somewhere:

do_page_fault(): sending SIGSEGV to python3 for invalid read access from 0000000000000010
epc = 0000000076e65d18 in libpython3.10.so.1.0[76dd0000+310000]
ra  = 000000007615d754 in _elementpath.cpython-310-mips64-linux-gnuabin32.so[76150000+30000]

Looking at _elementpath.py, the "findall" function is just a stub for "iterfind":
    def findall(elem, path, namespaces=None):
        return list(iterfind(elem, path, namespaces))

    def iterfind(elem, path, namespaces=None):
        selector = _build_path_iterator(path, namespaces)
        result = iter((elem,))
        for select in selector:
            result = select(result)
        return result

This is really base-level Python here, so I'm not sure if the real issue is hiding further down in _build_path_iterator or if bad C code is being generated.  If I could run lxml as a pure Python module instead of generated C code, I think it'd be easier to trace down (or at least confirm it's bad C code).
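
As a sanity check on the input itself, the same two-line file and the same query can be run through the standard library's `xml.etree.ElementTree`, which does not go through lxml's compiled `_elementpath` module. This is a sketch with a hypothetical helper name; it cannot reproduce the crash, it only shows what a healthy parse-then-findall sequence returns for this data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET  # standard library, not lxml

XML = b'<?xml version="1.0" encoding="UTF-8"?>\n<foo>bar</foo>\n'


def findall_from_file(xml_bytes, path_expr):
    """Write xml_bytes to a temporary file, parse it, and run findall() on the tree."""
    with tempfile.TemporaryDirectory() as tmp:
        xml_path = os.path.join(tmp, "test.xml")
        with open(xml_path, "wb") as fh:
            fh.write(xml_bytes)
        tree = ET.parse(xml_path)
        return tree.getroot().tag, tree.findall(path_expr)


root_tag, matches = findall_from_file(XML, "foo")
print(root_tag, matches)
```

With the standard library, findall("foo") searches the children of the root element, so for this file the match list is empty; the query is trivial, which makes a segfault on it all the more surprising.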

That said, this issue is only happening on my Gentoo systems.  I ran this test on a FreeBSD 13.0-RELEASE-p6 machine under locale ISO8859-15, and it does not segfault, returning a proper traceback.  It was using Python 3.8.x.  On a Devuan Linux 4 system, after setting the locale to ISO8859-15, it also doesn't segfault on the test case and returns proper tracebacks.  That system has Python 3.9.  Ran the test on a CentOS 7 VM as well and got the same results, under Python 3.8.

At this point, the only thing I can think of is it is either compiler flags causing bad code to get emitted by gcc, or we've got a patch in one of the system libraries causing an issue.

In any event, this looks to be more of a general issue in dev-python/lxml and not an issue in repoman.  Want me to close this bug and open a new one under the right bug section for lxml?
Comment 8 Joshua Kinard gentoo-dev 2022-01-14 20:41:19 UTC
FWIW, I have pretty much ruled out the cause being in either gcc-11.2.1 or gcc-10.3.1, as I have tested both out in rebuilding Python-3.10 and dev-python/lxml.  Still segfaulting on the test cases.  I also backed my CFLAGS all the way down to just "-O0 -pipe", still segfaulting.  Also tried lxml-4.6.5, segfault there, too.

Dropping to O0 did expand a little bit more on the crash from your testcase if there's a newline at the beginning of the string, as I now have some visibility into the _Py_DECREF call:

Originally, I couldn't get any further than here:
#1  0x00007ffff701a5cd in __pyx_f_4lxml_5etree__getThreadErrorLog (__pyx_v_name='_GlobalErrorLog') at src/lxml/etree.c:48342
warning: Source file is more recent than executable.
48342         __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
(gdb) list
48337         __Pyx_XDECREF(((PyObject *)__pyx_r));
48338         __Pyx_INCREF(((PyObject *)__pyx_v_log));
48339         __pyx_r = ((struct __pyx_obj_4lxml_5etree__BaseErrorLog *)__pyx_v_log);
48340         __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
48341         __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
48342         __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
48343         goto __pyx_L7_except_return;
48344       }
48345       goto __pyx_L6_except_error;
48346       __pyx_L6_except_error:;
(gdb)

With O0, now I can see this:
#0  0x00007ffff6fded08 in _Py_DECREF (op=0x0) at /usr/include/python3.10/object.h:492
492         if (--op->ob_refcnt != 0) {
(gdb) list
487         // Non-limited C API and limited C API for Python 3.9 and older access
488         // directly PyObject.ob_refcnt.
489     #ifdef Py_REF_DEBUG
490         _Py_RefTotal--;
491     #endif
492         if (--op->ob_refcnt != 0) {
493     #ifdef Py_REF_DEBUG
494             if (op->ob_refcnt < 0) {
495                 _Py_NegativeRefcount(filename, lineno, op);
496             }

'op' is NULL, so '--op->ob_refcnt' dereferences a NULL pointer and will definitely segfault.

Still uncertain how lxml gets to this point.  Vars __pyx_t_5, __pyx_t_7, and __pyx_t_8 are all 0, but only __pyx_t_8 is causing the segfault.  I can't tell if it's supposed to be a pointer or a standard integer or such.

I am going to try running these testcases under pypy3 (once it decides to stop drawing pretty ASCII pictures in my console).  Looking around the lxml source, that will disable the Cython-generated portions, which may let me trace lxml's raw Python code if the bugs can still be reproduced in some fashion.
Comment 9 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-01-14 20:49:35 UTC
We may be able to disable the Cython goo with WITH_CYTHON=false (see also bug 685768, think it may be automagic right now).

Also, I can't say this is necessarily it, but there's some reported compatibility problems with lxml and newer libxslt/libxml2, and it's possible that e.g. FreeBSD didn't upgrade to the problematic versions of those.
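
To compare library versions between an affected and an unaffected machine, lxml exposes the versions it was compiled against and the ones it is running with as module-level tuples. The following is a small sketch; `lxml_build_info` is a hypothetical helper that returns None when lxml is not importable.

```python
def lxml_build_info():
    """Return lxml's view of its library versions, or None if lxml is not installed."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
        "libxslt (runtime)": etree.LIBXSLT_VERSION,
        "libxslt (compiled against)": etree.LIBXSLT_COMPILED_VERSION,
    }


print(lxml_build_info())
```

A mismatch between the runtime and compiled-against tuples would be one concrete thing to report upstream alongside the test cases.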
Comment 10 Joshua Kinard gentoo-dev 2022-01-14 21:09:26 UTC
(In reply to Sam James from comment #9)
> We may be able to disable the Cython goo with WITH_CYTHON=false (see also
> bug 685768, think it may be automagic right now).
> 
> Also, I can't say this is necessarily it, but there's some reported
> compatibility problems with lxml and newer libxslt/libxml2, and it's
> possible that e.g. FreeBSD didn't upgrade to the problematic versions of
> those.

I was looking at lxml's ebuild to see if I could figure out how to disable Cython, but I am not very familiar w/ how our dev-python/* ebuilds work when it comes to configuring dependencies.  Call it being way too used to autoconf-based --disable-foo/--without-foo magic.  Trying pypy3 out looked to be the quickest way to either get a non-cython build or rely on upstream's generated C files.  Starting to think pypy3 is not going to be a quick way, because it is still drawing....fractals (I think) in my console window.

I've got a few more ideas to try, including dropping to Python-3.8, before I give up and put something together for lxml upstream.  I am worried that they may claim it's a fault unique to us and refuse to help, though.  So I want to test all the things I am capable of testing beforehand.

As far as the FreeBSD attempt went, I only tested the prebuilt binpkg versions and did not attempt to install from Ports.  The dependent packages their pkg tool wanted to install were:

    # pkg install py38-lxml
    Updating FreeBSD repository catalogue...
    FreeBSD repository is up to date.
    All repositories are up to date.
    Checking integrity... done (0 conflicting)
    The following 4 package(s) will be affected (of 0 checked):
    
    New packages to be INSTALLED:
            libgcrypt: 1.9.4
            libgpg-error: 1.43
            libxslt: 1.1.34_2
            py38-lxml: 4.7.1
    
    Number of packages to be installed: 4
    
    The process will require 16 MiB more space.

So it's not pulling in libxml2 by default.  I haven't looked at libxslt yet, as I'm somewhat convinced the bug is definitely within lxml itself.  The test cases I have don't even use XSLT.  I've reduced the XML test text down to just two lines, the standard XML declaration and "<foo>bar</foo>".  I'll attach them in a few minutes.
Comment 11 Joshua Kinard gentoo-dev 2022-01-14 21:18:30 UTC
Created attachment 762178 [details]
lxml test case #1, etree.parse() XML from a file

First test case that demonstrates the original SIGSEGV when locale is set to ISO-8859-15 on a Gentoo system.  This test uses lxml.etree.parse() to read two lines of XML from the test1.xml, then call the 'findall' method of a bound lxml.etree._ElementTree object.  The failure will be in Python core, Objects/exceptions.c:237, BaseException_set_tb, because argument 'self' is NULL and a dereference is attempted.

Note: the second test case can also be triggered here if the file argument passed to etree.parse() is missing.
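
For running the attached test cases under more than one locale and telling a crash apart from an ordinary exception, a small harness along these lines may be convenient. It is a sketch: the helper names are hypothetical, and the locale name is taken from this bug's title and may be spelled differently on other systems.

```python
import os
import signal
import subprocess
import sys


def run_under_locale(argv, locale_name):
    """Run argv with LC_ALL set to locale_name and return the child's return code."""
    env = dict(os.environ, LC_ALL=locale_name)
    return subprocess.run(argv, env=env).returncode


def describe(returncode):
    """Translate a subprocess return code into a short human-readable verdict."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by " + name
    return "exited with status %d" % returncode


# Example call, assuming test1.py from the attachment is in the current directory:
#   print(describe(run_under_locale([sys.executable, "test1.py"], "en_US.ISO-8859-15")))
```

An uncaught Python exception shows up as "exited with status 1", while the segfault described here would show up as "killed by SIGSEGV". If the named locale has not been generated, the child silently falls back to another locale, so the result is only meaningful after checking that the locale exists.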
Comment 12 Joshua Kinard gentoo-dev 2022-01-14 21:19:11 UTC
Created attachment 762179 [details]
lxml test case #1 test file

Test file for lxml test case #1
Comment 13 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-01-14 21:19:55 UTC
OK, that libxslt version is ~same as ours.

`WITH_CYTHON=false ebuild lxml-4.7.1.ebuild clean merge` avoids Cython for me.
Comment 14 Joshua Kinard gentoo-dev 2022-01-14 21:23:33 UTC
Created attachment 762180 [details]
lxml test case #2, XML begins w/ newline

Second test case that demonstrates another SIGSEGV when locale is set to ISO-8859-15 on a Gentoo system.  This test uses lxml.etree.fromstring() to read two lines of XML from a variable.  If the XML in that variable begins with a newline, it causes an exception in lxml because lxml expects the first line to begin with "<?xml ...", and the throwing of that exception leads to a SIGSEGV in Python core because a _Py_DECREF() call was passed a NULL pointer variable that it attempts to pre-decrement and dereference.
Comment 15 Joshua Kinard gentoo-dev 2022-01-14 21:25:28 UTC
(In reply to Sam James from comment #13)
> OK, that libxslt version is ~same as ours.
> 
> `WITH_CYTHON=false ebuild lxml-4.7.1.ebuild clean merge` avoids Cython for
> me.

I'll give this a try, as pypy3 failed to compile and I am not going to try and chase that one down.
Comment 16 Joshua Kinard gentoo-dev 2022-01-14 21:29:38 UTC
(In reply to Joshua Kinard from comment #15)
> (In reply to Sam James from comment #13)
> > OK, that libxslt version is ~same as ours.
> > 
> > `WITH_CYTHON=false ebuild lxml-4.7.1.ebuild clean merge` avoids Cython for
> > me.
> 
> I'll give this a try, as pypy3 failed to compile and I am not going to try
> and chase that one down.

Doesn't look like this works:

 * python3_10: running distutils-r1_run_phase python_compile
python3.10 setup.py build -j 14
warning: src/lxml/xmlerror.pxi:657:22: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:658:69: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:659:20: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:664:22: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:665:73: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:666:20: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:671:22: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:672:73: local variable 'args' referenced before assignment
warning: src/lxml/xmlerror.pxi:673:20: local variable 'args' referenced before assignment
Building lxml version 4.7.1.
Building with Cython 0.29.26.
Building against libxml2 2.9.12 and libxslt 1.1.34
Compiling src/lxml/etree.pyx because it changed.
Compiling src/lxml/objectify.pyx because it changed.
Compiling src/lxml/builder.py because it changed.
Compiling src/lxml/_elementpath.py because it changed.
Compiling src/lxml/html/diff.py because it changed.
Compiling src/lxml/html/clean.py because it changed.
Compiling src/lxml/sax.py because it changed.
[1/7] Cythonizing src/lxml/_elementpath.py
[2/7] Cythonizing src/lxml/builder.py
[3/7] Cythonizing src/lxml/etree.pyx
[4/7] Cythonizing src/lxml/html/clean.py
[5/7] Cythonizing src/lxml/html/diff.py
[6/7] Cythonizing src/lxml/objectify.pyx
[7/7] Cythonizing src/lxml/sax.py
[snip]
Comment 17 Joshua Kinard gentoo-dev 2022-01-14 21:41:17 UTC
It looks like setup.py needs --without-cython passed to it.  One of the files mentions that the source distribution is supposed to have pre-generated C files so that building without a cython dependency is possible, but in 4.7.1, at least our version, those pre-generated files are missing:

Building lxml version 4.7.1.
WARNING: Trying to build without Cython, but pre-generated 'src/lxml/etree.c' is not available.
WARNING: Trying to build without Cython, but pre-generated 'src/lxml/objectify.c' is not available.
WARNING: Trying to build without Cython, but pre-generated 'src/lxml/builder.c' is not available.
WARNING: Trying to build without Cython, but pre-generated 'src/lxml/_elementpath.c' is not available.
WARNING: Trying to build without Cython, but pre-generated 'src/lxml/html/diff.c' is not available.
WARNING: Trying to build without Cython, but pre-generated 'src/lxml/html/clean.c' is not available.
WARNING: Trying to build without Cython, but pre-generated 'src/lxml/sax.c' is not available.

I unpacked the source tarball of lxml-4.7.1, and 'find . -name \*.c' returns no results.  Ditto for lxml-4.6.5.  So it looks like upstream is no longer pre-generating those C files, and the build system does not appear to allow a pure Python installation (e.g., even when passing --without-cython, it still attempts to call the C compiler).
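
The `find . -name \*.c` check can also be scripted, for example to compare several unpacked release tarballs in one go. A minimal sketch using `pathlib`; `pregenerated_c_files` is a hypothetical helper name.

```python
from pathlib import Path


def pregenerated_c_files(source_root):
    """List every *.c file below source_root, relative to it, in sorted order."""
    root = Path(source_root)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.c"))


# Example call, assuming the tarball was unpacked into ./lxml-4.7.1:
#   print(pregenerated_c_files("lxml-4.7.1"))
```

An empty list for an unpacked source tree corresponds to the "no results" finding above.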
Comment 18 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-01-14 21:51:20 UTC
(ugh, yes, it looks like it's now hard required.)
Comment 19 Joshua Kinard gentoo-dev 2022-01-14 22:01:26 UTC
Also just tested building dev-python/lxml with cython-0.29.25, segfault in both test cases.  I think that's the end of my available rabbit holes, so I guess I will look up how lxml likes having bugs filed and open something up with upstream and see what they say.
Comment 20 Joshua Kinard gentoo-dev 2022-03-25 21:52:59 UTC
I am going to obsolete this bug since repoman is being deprecated as the primary dev tool.  If someone else wants to investigate this, I can pass some notes over, though I think everything relevant is already included in this bug.
Comment 21 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2022-03-26 05:09:40 UTC
I don't think this is repoman-specific. The crash is in lxml.
Comment 22 Christophe PEREZ 2022-06-01 04:37:27 UTC
seems to be solved in lxml https://bugs.launchpad.net/lxml/+bug/1972907

Could be nice to push lxml-4.9.0-2
Comment 23 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-06-01 04:58:33 UTC
(In reply to Christophe PEREZ from comment #22)
> seems to be solved in lxml https://bugs.launchpad.net/lxml/+bug/1972907
> 
> Could be nice to push lxml-4.9.0-2

I bumped lxml to 4.9.0 earlier and wheel versions shouldn't matter. So, apparently we're done!
Comment 24 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-06-01 04:58:50 UTC
(thanks for finding that!)
Comment 25 Larry the Git Cow gentoo-dev 2022-06-01 05:06:15 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=fc6127e07d7aeeb535e85944400d6471282caad4

commit fc6127e07d7aeeb535e85944400d6471282caad4
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2022-06-01 04:59:37 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2022-06-01 05:06:03 +0000

    dev-python/lxml: revbump w/ tigher cython dep to avoid miscompile
    
    Generated bad exception handling code.
    
    Closes: https://bugs.gentoo.org/830882
    Signed-off-by: Sam James <sam@gentoo.org>

 dev-python/lxml/{lxml-4.9.0.ebuild => lxml-4.9.0-r1.ebuild} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 26 Christophe PEREZ 2022-06-01 05:18:05 UTC
(In reply to Sam James from comment #24)
> (thanks for finding that!)

You're welcome ! ;)

https://github.com/streamlink/streamlink/issues/4562 is MY bug ! :D