Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 598974 - app-portage/portage-utils: qpkg: -c randomly segfaults
Summary: app-portage/portage-utils: qpkg: -c randomly segfaults
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Tools (show other bugs)
Hardware: All Linux
: Normal normal
Assignee: Portage Utils Team
URL:
Whiteboard:
Keywords:
: 607146 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-11-04 23:01 UTC by Alexander Wetzel
Modified: 2017-01-26 00:46 UTC (History)
2 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
proposed fix (portage-fix.patch,341 bytes, patch)
2016-11-04 23:01 UTC, Alexander Wetzel
Details | Diff
gdb backtrace of a crash (backtrace,3.85 KB, text/plain)
2016-11-04 23:14 UTC, Alexander Wetzel
Details
find /var/db/pkg/ -name SLOT -exec ls -l {} + (SLOT-list.txt,26.24 KB, text/plain)
2016-11-13 12:16 UTC, Alexander Wetzel
Details
patch adding debug code (debug.patch,1.83 KB, patch)
2016-11-14 20:16 UTC, Alexander Wetzel
Details | Diff
crash with - most - of debug patch applied (debug-out,1.83 KB, text/plain)
2016-11-14 20:20 UTC, Alexander Wetzel
Details
crash with full debug patch applied (debug-out2,1.48 KB, text/plain)
2016-11-14 20:46 UTC, Alexander Wetzel
Details
proposed fix #2 (fix2.patch,551 bytes, patch)
2016-11-14 20:58 UTC, Alexander Wetzel
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Wetzel 2016-11-04 23:01:30 UTC
Created attachment 452398 [details, diff]
proposed fix

I noticed that "qpkg -c" was segfaulting randomly on my hardened installs while working fine on my non-hardened system. Calling it 20 times caused at least one segfault on any of my hardened installs. 

At least portage-utils-0.62 and portage-utils-0.63 are affected by this bug.

I started to debugging the issue and it turned out, that app-portage/portage-utils is calling a function to remove white spaces on a pointer, removing (at least) 0x20, 0x9 and 0xc from pointer addresses. 
The next use of such a "tuned" pointer will of course cause a segfault.

My hardened installations with all the address randomization is just making sure that this happens quite regular and exposes the error.

Removing the incorrect white space clean up fixes the issue for my, see the attachment.
Comment 1 Alexander Wetzel 2016-11-04 23:14:11 UTC
Created attachment 452400 [details]
gdb backtrace of a crash

buf = 0x7f800bf7 is a damaged pointer (should probably be 0x7f800bf720)
Comment 2 Alexander Wetzel 2016-11-05 09:25:13 UTC
Not tested, but the initial assumption that the hardened kernel is exposing the issue is probably wrong. I assume now that this bug should affect any system having binary packages which can be cleaned up.

The hardened systems I encountered the problem on are all using binary packages which can be cleaned up. And the non-hardened test system doesn't have any...
Comment 3 SpanKY gentoo-dev 2016-11-13 00:18:49 UTC
i can't reproduce it here, and your desc doesn't make much sense ... rmspace doesn't modify the pointer itself.  so i don't know what you mean by removing 0x20/0x9/0xc.

the only bug i can see is if you have a SLOT file that is excessively large ... the code will incorrectly try to free a stack pointer.  but you shouldn't have a SLOT file that is larger than like 1KiB.

do you have a really large file ?
  find /var/db/pkg/ -name SLOT -exec ls -l {} +
Comment 4 Alexander Wetzel 2016-11-13 12:16:10 UTC
Created attachment 453198 [details]
find /var/db/pkg/ -name SLOT -exec ls -l {} +

I've uploaded the output of the command, but the biggest file is 16B, most are around 2B. Don't thing that qualifies as really big...
(I also verified it's still segfaulting without the patch. Here how it looks without debug code:

owl alex # qpkg -c
Segmentation fault
owl alex # qpkg -c
 * Total space that would be freed in packages directory: 0 K
owl alex # qpkg -c
 * Total space that would be freed in packages directory: 0 K
owl alex # qpkg -c
 * Total space that would be freed in packages directory: 0 K
owl alex # qpkg -c
 * Total space that would be freed in packages directory: 0 K
owl alex # qpkg -c
 * Total space that would be freed in packages directory: 0 K
owl alex # qpkg -c
Segmentation fault


about rmspace:
I'm not really a C crack and especially all the pointer stuff is rusty at best.

I only looked at some lines out of context here, but after finding out that the pointer loses white-spaces on subsequent calls the rmspace was the obvious candidate. 
I'll undo the patch and add some printf's again to illustrate that better here.

I added some debug printf's last time already and it clearly showed that white space values got removed from the pointer address itself...  but when now trying to find a cast error in the code I can't.
 
I may even be able to do that later today, or for sure next week.
Comment 5 Alexander Wetzel 2016-11-14 20:16:10 UTC
Created attachment 453310 [details, diff]
patch adding debug code
Comment 6 Alexander Wetzel 2016-11-14 20:20:04 UTC
Created attachment 453312 [details]
crash with - most - of debug patch applied

This crash log was created without the following code from the patch:
+       slot[0]=0xa;
+       slot[1]=0xb;
+       slot[2]=0xa;
+       slot[3]=0xb;
+       slot[4]=0xa;
+       slot[5]=0xb;
+       slot[6]=0xa;
+       slot[7]=0xb;

In fact with those lines above it always crash. (see next comment why)
Comment 7 SpanKY gentoo-dev 2016-11-14 20:24:20 UTC
deleting whitespace from the SLOT file shouldn't be a problem

i've got a hardened box though where i can reproduce this, so i'll debug there
Comment 8 Alexander Wetzel 2016-11-14 20:46:57 UTC
Created attachment 453314 [details]
crash with full debug patch applied

Still rusty C skill, but I think I finally understand what's wrong:

Problem is, how the slot variable is created and later used here:

char slot[_Q_PATH_MAX];

reserves space on the stack and reserves "_Q_PATH_MAX" on it.

We then cast slot in the function call:
eat_file_at(ctx->vdb_fd, buf, (char **)&slot, &slot_len);

But slot is not a pointer its own address. 
In short: slot == &slot in this case

The debug code is underlining that fact: 
> DEBUG slot created with slot=0x38f5e46dba0 and &slot=0x38f5e46dba0

eat_file_fd is therefore putting a pointer into the array slot, and not updating the pointer of slot. (It's not a pointer variable to start with, of course. Just something which can be used like a pointer):

*bufptr = xrealloc(*bufptr, *bufsize);

Now calling rmspace on slot will of course "edit" the pointer and not the value the pointer is pointing to.

If slot contains non-zero data it will crash always. The code is only working due to slot be set to zero initial by the compiler...

Since this will trigger this statement to reserve new ram and put the pointer into slot:
if (!*bufptr || *bufsize < read_size) 

With a non-zero value at the beginning of slot it will then of course segfault with
buf[0] = '\0';

See the attachment.

Does that make sense now?
Comment 9 Alexander Wetzel 2016-11-14 20:58:31 UTC
Created attachment 453330 [details, diff]
proposed fix #2

Corrected the real problem.

tested the new fix with 
k=1; while [ $k -lt 1000 ]; do echo -n "$k "; qpkg -c; k=$(($k+1));done

no segfault
Comment 10 Alexander Wetzel 2016-11-14 21:41:46 UTC
It's probably redundant now, but I just tested the debug patch on a non-hardened system.

It also crashes, since slot is non-zero, as expected:

DEBUG slot created with slot=0x7ffdd28df510 and &slot=0x7ffdd28df510
DEBUG pre-eat : slot=0x7ffdd28df510 (char **)&slot=0x7ffdd28df510
DEBUG eat_file_fd entry bufptr=0x7ffdd28df510 *bufptr=0xb0a0b0a0b0a0b0a
DEBUG: pre-crash *bufptr=0xb0a0b0a0b0a0b0a
Segmentation fault (core dumped)

Without setting slot to non-zero it seems works, through. I would have expected at least one crash from time to time... but it's not.
Comment 12 SpanKY gentoo-dev 2016-11-15 04:05:02 UTC
(In reply to Alexander Wetzel from comment #9)

sorry, didn't see your follow ups.  you found the problem here ;).  but this version would leak memory since the slot is never freed, and be a waste normally since we only care about the value temporarily.  i side-stepped the prob in my
commit.

long term, as the comment above this func says, i'd like to kill the code with something a bit more dynamic ala the rest of the code in libq/vdb.c.
Comment 13 Alexander Wetzel 2016-11-18 21:17:23 UTC
The merged fix is indeed much better than me messing around:-)

But I think the final fix merged also leaks memory, isn't it?
slot is only a pointer with a zero value, the called sub then allocates memory and stores the pointer in it, sine the pointer value is zero on the first function call.

But the allocated memory is never freed during runtime. When get_vdb_atoms exits we lose the pointer for good, without freeing the memory.

Theoretical I guess we should not accept null pointers for the function at all, so it's more obvious the caller also has to free the memory when done.
Comment 14 SpanKY gentoo-dev 2016-11-18 22:58:45 UTC
(In reply to Alexander Wetzel from comment #13)

there is no memory leak.  slot is a buffer on the stack, and we store its address into the slotp pointer.  the subcall will then not allocate a pointer because a valid one already exists.

there is still a problem as i alluded to earlier in that, if the SLOT file is larger than _Q_PATH_MAX (which can be 4k or 8k), then we'll get memory corruption when we try to use free() on a stack pointer.  but there's no valid case (today) where SLOT files should be that large, so i'm not going to worry about it.  i hope this code will get gutted/rewritten like the TODO says in which case the issue goes away.
Comment 15 SpanKY gentoo-dev 2017-01-26 00:46:16 UTC
*** Bug 607146 has been marked as a duplicate of this bug. ***