Created attachment 452398 [details, diff] proposed fix I noticed that "qpkg -c" was segfaulting randomly on my hardened installs while working fine on my non-hardened system. Calling it 20 times caused at least one segfault on any of my hardened installs. At least portage-utils-0.62 and portage-utils-0.63 are affected by this bug. I started to debugging the issue and it turned out, that app-portage/portage-utils is calling a function to remove white spaces on a pointer, removing (at least) 0x20, 0x9 and 0xc from pointer addresses. The next use of such a "tuned" pointer will of course cause a segfault. My hardened installations with all the address randomization is just making sure that this happens quite regular and exposes the error. Removing the incorrect white space clean up fixes the issue for my, see the attachment.
Created attachment 452400 [details] gdb backtrace of a crash buf = 0x7f800bf7 is a damaged pointer (should probably be 0x7f800bf720)
Not tested, but the initial assumption that the hardened kernel is exposing the issue is probably wrong. I assume now that this bug should affect any system having binary packages which can be cleaned up. The hardened systems I encountered the problem on are all using binary packages which can be cleaned up. And the non-hardened test system doesn't have any...
i can't reproduce it here, and your desc doesn't make much sense ... rmspace doesn't modify the pointer itself. so i don't know what you mean by removing 0x20/0x9/0xc. the only bug i can see is if you have a SLOT file that is excessively large ... the code will incorrectly try to free a stack pointer. but you shouldn't have a SLOT file that is larger than like 1KiB. do you have a really large file ? find /var/db/pkg/ -name SLOT -exec ls -l {} +
Created attachment 453198 [details] find /var/db/pkg/ -name SLOT -exec ls -l {} + I've uploaded the output of the command, but the biggest file is 16B, most are around 2B. Don't thing that qualifies as really big... (I also verified it's still segfaulting without the patch. Here how it looks without debug code: owl alex # qpkg -c Segmentation fault owl alex # qpkg -c * Total space that would be freed in packages directory: 0 K owl alex # qpkg -c * Total space that would be freed in packages directory: 0 K owl alex # qpkg -c * Total space that would be freed in packages directory: 0 K owl alex # qpkg -c * Total space that would be freed in packages directory: 0 K owl alex # qpkg -c * Total space that would be freed in packages directory: 0 K owl alex # qpkg -c Segmentation fault about rmspace: I'm not really a C crack and especially all the pointer stuff is rusty at best. I only looked at some lines out of context here, but after finding out that the pointer loses white-spaces on subsequent calls the rmspace was the obvious candidate. I'll undo the patch and add some printf's again to illustrate that better here. I added some debug printf's last time already and it clearly showed that white space values got removed from the pointer address itself... but when now trying to find a cast error in the code I can't. I may even be able to do that later today, or for sure next week.
Created attachment 453310 [details, diff] patch adding debug code
Created attachment 453312 [details] crash with - most - of debug patch applied This crash log was created without the following code from the patch: + slot[0]=0xa; + slot[1]=0xb; + slot[2]=0xa; + slot[3]=0xb; + slot[4]=0xa; + slot[5]=0xb; + slot[6]=0xa; + slot[7]=0xb; In fact with those lines above it always crash. (see next comment why)
deleting whitespace from the SLOT file shouldn't be a problem i've got a hardened box though where i can reproduce this, so i'll debug there
Created attachment 453314 [details] crash with full debug patch applied Still rusty C skill, but I think I finally understand what's wrong: Problem is, how the slot variable is created and later used here: char slot[_Q_PATH_MAX]; reserves space on the stack and reserves "_Q_PATH_MAX" on it. We then cast slot in the function call: eat_file_at(ctx->vdb_fd, buf, (char **)&slot, &slot_len); But slot is not a pointer its own address. In short: slot == &slot in this case The debug code is underlining that fact: > DEBUG slot created with slot=0x38f5e46dba0 and &slot=0x38f5e46dba0 eat_file_fd is therefore putting a pointer into the array slot, and not updating the pointer of slot. (It's not a pointer variable to start with, of course. Just something which can be used like a pointer): *bufptr = xrealloc(*bufptr, *bufsize); Now calling rmspace on slot will of course "edit" the pointer and not the value the pointer is pointing to. If slot contains non-zero data it will crash always. The code is only working due to slot be set to zero initial by the compiler... Since this will trigger this statement to reserve new ram and put the pointer into slot: if (!*bufptr || *bufsize < read_size) With a non-zero value at the beginning of slot it will then of course segfault with buf[0] = '\0'; See the attachment. Does that make sense now?
Created attachment 453330 [details, diff] proposed fix #2 Corrected the real problem. tested the new fix with k=1; while [ $k -lt 1000 ]; do echo -n "$k "; qpkg -c; k=$(($k+1));done no segfault
It's probably redundant now, but I just tested the debug patch on a non-hardened system. It also crashes, since slot is non-zero, as expected: DEBUG slot created with slot=0x7ffdd28df510 and &slot=0x7ffdd28df510 DEBUG pre-eat : slot=0x7ffdd28df510 (char **)&slot=0x7ffdd28df510 DEBUG eat_file_fd entry bufptr=0x7ffdd28df510 *bufptr=0xb0a0b0a0b0a0b0a DEBUG: pre-crash *bufptr=0xb0a0b0a0b0a0b0a Segmentation fault (core dumped) Without setting slot to non-zero it seems works, through. I would have expected at least one crash from time to time... but it's not.
ok, i was dumb. simple fix: https://gitweb.gentoo.org/proj/portage-utils.git/commit/?id=743d4fe35a8d6029bc0da10b5b3d9f095a07f2c1
(In reply to Alexander Wetzel from comment #9) sorry, didn't see your follow ups. you found the problem here ;). but this version would leak memory since the slot is never freed, and be a waste normally since we only care about the value temporarily. i side-stepped the prob in my commit. long term, as the comment above this func says, i'd like to kill the code with something a bit more dynamic ala the rest of the code in libq/vdb.c.
The merged fix is indeed much better than me messing around:-) But I think the final fix merged also leaks memory, isn't it? slot is only a pointer with a zero value, the called sub then allocates memory and stores the pointer in it, sine the pointer value is zero on the first function call. But the allocated memory is never freed during runtime. When get_vdb_atoms exits we lose the pointer for good, without freeing the memory. Theoretical I guess we should not accept null pointers for the function at all, so it's more obvious the caller also has to free the memory when done.
(In reply to Alexander Wetzel from comment #13) there is no memory leak. slot is a buffer on the stack, and we store its address into the slotp pointer. the subcall will then not allocate a pointer because a valid one already exists. there is still a problem as i alluded to earlier in that, if the SLOT file is larger than _Q_PATH_MAX (which can be 4k or 8k), then we'll get memory corruption when we try to use free() on a stack pointer. but there's no valid case (today) where SLOT files should be that large, so i'm not going to worry about it. i hope this code will get gutted/rewritten like the TODO says in which case the issue goes away.
*** Bug 607146 has been marked as a duplicate of this bug. ***