168334 – app-portage/portage-utils-0.1.24: /usr/bin/q -r dies with a segfault after emerge --sync

Status:	RESOLVED FIXED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Tools
Hardware:	AMD64 Linux

Importance:	High minor
Assignee:	Portage Utils Team

URL:
Whiteboard:
Keywords:	InVCS

Depends on:
Blocks:

Reported:	2007-02-25 15:36 UTC by Georgi Georgiev
Modified:	2007-04-05 18:43 UTC
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Attachments
debugging-results.txt (bug.txt, 7.88 KB, text/plain) 2007-02-25 15:38 UTC, Georgi Georgiev
emerge--info.txt (einfo.txt, 5.94 KB, text/plain) 2007-02-25 15:39 UTC, Georgi Georgiev
q-bug-168334.diff (q-bug-168334.diff, 1.01 KB, patch) 2007-02-27 17:56 UTC, solar (RETIRED)

Description Georgi Georgiev 2007-02-25 15:36:00 UTC

emerge --sync gives me this message every time:
...
>>> Updating Portage cache:   100%
/etc/portage/postsync.d/q-reinitialize: line 1: 810714 Segmentation fault      /usr/bin/q -r
 * spawn failed of /etc/portage/bin/post_sync

However, if I run /etc/portage/bin/post_sync from the command line there is no problem.

I'll post the rest of my discoveries after posting this (bugzilla just yelled at me that the post is too long).

Comment 1 Georgi Georgiev 2007-02-25 15:38:21 UTC

Created attachment 111201 [details]
debugging-results.txt

Here is the output after I looked for the problem a bit. It has some interesting gdb output... and well, it is my original post (before bugzilla yelled at me that it is too long).

Comment 2 Georgi Georgiev 2007-02-25 15:39:41 UTC

Created attachment 111202 [details]
emerge--info.txt

And here is emerge --info.

Comment 3 solar (RETIRED) gentoo-dev 2007-02-25 17:35:10 UTC

Your CFLAGS looked sane and yet something has gone really wrong for you 
and I can't reproduce it. Best I can say for now is to simply try 
rebuilding the portage-utils. If this is happening on every --sync for you then clearly the best option is to remove the +x bit in the q-reinitialize script.

Comment 4 Georgi Georgiev 2007-02-26 00:57:48 UTC

(In reply to comment #3)
> Your CFLAGS looked sane and yet something has gone really wrong for you 
> and I can't reproduce it. Best I can say for now is to simply try 
> rebuilding the portage-utils.

Could it be amd64 or kernel related? It is reproducible on two separate machines here -- the one I reported for (attachment #111202 [details]) and a dual opteron. Both are running vanilla 2.6.20 but the problem was present earlier.

By the way, I had already tried rebuilding portage-utils (sort of). I don't know how diligently you looked at attachment #111201 [details], but I compiled portage-utils using "make debug" in its $S and with -O0 (so that gdb gives readable output) and it still didn't work.

> If this is happening on every --sync for you then
> clearly the best option is to remove the +x bit in the q-reinitialize script. 

Oh, well, thanks for the advice, but it's not *that* bad. It's actually a good reminder that there is something that needs fixing.

Comment 5 solar (RETIRED) gentoo-dev 2007-02-26 01:25:33 UTC

(In reply to comment #4)
> Oh, well, thanks for the advice, but it's not *that* bad. It's actually a good
> reminder that there is something that needs fixing.

I guess, but that is the hard part to tell. It's not reproducible here and it only happens for you when called from portage. We can see from the backtrace that applets is clearly getting messed up, but why, I can't say.

I did try on an amd64-multilib host. 

miranda bin # emerge --sync
>>> Starting rsync with rsync://64.127.121.98/gentoo-portage...
Welcome to owl.gentoo.org

Server Address : 64.127.121.98
Contact Name   : mirror-admin@gentoo.org
Hardware       : 4 x Intel(R) Xeon(TM) CPU 2.40GHz, 1024MB RAM


Please note: common gentoo-netiquette says you should not sync more
than once a day.  Users who abuse the rsync.gentoo.org rotation
may be added to a temporary ban list.


MOTD brought to you by motd-o-matic, version 0.3

receiving file list ... done
./
metadata/
metadata/timestamp.chk
deleting .ebuild.x

Number of files: 144717
Number of files transferred: 1
Total file size: 164787760 bytes
Total transferred file size: 32 bytes
Literal data: 32 bytes
Matched data: 0 bytes
File list size: 3368401
File list generation time: 1.319 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 193
Total bytes received: 3368924

sent 193 bytes  received 3368924 bytes  1347646.80 bytes/sec
total size is 164787760  speedup is 48.91

>>> Updating Portage cache:  100%

miranda bin # echo $?      
0

miranda bin # wc -l /usr/portage/.ebuild.x 
22925 /usr/portage/.ebuild.x

miranda bin # qdepends -k CFLAGS app-portage/portage-utils-0.1.24
app-portage/portage-utils-0.1.24: -march=k8 -fomit-frame-pointer -O2 -pipe

Comment 6 Georgi Georgiev 2007-02-26 01:45:08 UTC

I hope I'm not missing the obvious but....

(In reply to comment #5)
> sent 193 bytes  received 3368924 bytes  1347646.80 bytes/sec
> total size is 164787760  speedup is 48.91
> 
> >>> Updating Portage cache:  100%
> 
> miranda bin # echo $?      
> 0

... in the end, shouldn't you be getting a message like
q: Updating ebuild cache ...
q: Finished 22925 entries in 0.255503 second

You sure you have +x on q-reinitialize?

I promise to take a better look at this tonight.

Comment 7 Georgi Georgiev 2007-02-26 09:49:06 UTC

(In reply to comment #6)
> I hope I'm not missing the obvious but....

Indeed, I was. My q-reinitialize was older and did not have the -q.

Anyway, I just tried 
- unpacking a stage3 tarball in a chroot
- emerge portage portage-utils
- chmod +x /etc/portage/postsync.d/q*
- emerge --sync
Could not reproduce the problem.

Just for fun I symlinked /bin/bash in /etc/portage/postsync.d and am trying this and that but without much success. The problem *is* present but I cannot trace it.

I am playing with gdb, but any pointers about what I could try are appreciated.

Comment 8 Georgi Georgiev 2007-02-26 10:16:09 UTC

Alright, gdb paid off (sort of).

I haven't pinpointed the problem, but it has to do with my huge INSTALL_MASK. I also have the feeling that 1024 is a magic number around there. There is something wrong in the make.conf parser, and even though I have no idea why it only gets triggered when called from portage, there is no doubt a problem there.

I confirmed it in both 64-bit and 32-bit chroots.

So, if you want a "steps to reproduce".

1. emerge -u portage portage-utils
2. chmod +x /etc/portage/postsync.d/q-reinitialize
3. add the INSTALL_MASK from attachment #111202 [details] to /etc/make.conf
4. emerge --sync
5. watch it blow up

I'll see if splitting the mask in multiple lines will do any good (or I could simply drop it), but regardless... that would only be a temporary workaround.

Comment 9 Georgi Georgiev 2007-02-26 11:51:43 UTC

Alright, this thing is reproducible without running it from portage. It's just that the critical length is different.

I put in make.conf the following
INSTALL_MASK=""
INSTALL_MASK="${INSTALL_MASK} 123456789 123456789 "... (100 chars total)
Repeat the line above and adjust it until the problem is triggered. Here is some output for different lengths of INSTALL_MASK:

chutz@possum ~ $ q >/dev/null ; portageq envvar INSTALL_MASK | wc -m
1105
chutz@possum ~ $ q >/dev/null ; portageq envvar INSTALL_MASK | wc -m
q: Unknown applet 'q'
1106
chutz@possum ~ $ q >/dev/null ; portageq envvar INSTALL_MASK | wc -m
q: Unknown applet 'q'
1107
chutz@possum ~ $ q >/dev/null ; portageq envvar INSTALL_MASK | wc -m
Segmentation fault
1108

From a chroot the critical length was much lower -- less than 80 characters.

Anyway, at the risk of becoming annoying, I thought I'd tell you how to easily reproduce the problem.

Comment 10 solar (RETIRED) gentoo-dev 2007-02-26 15:46:21 UTC

INSTALL_MASK=$(perl -e 'print "A " x 16384') q -r
And I can't reproduce.
Even putting your exact same install mask settings in make.conf, I still can't reproduce.

However, note that the INSTALL_MASK buffer is defined with a size of 1024:
char install_mask[1024] = "";  (line 101 of main.c)

So if you want to set a breakpoint on initialize_portage_env(), that would probably help track it down a bit further.

Comment 11 Georgi Georgiev 2007-02-26 16:05:20 UTC

... WHAT! I cannot reproduce this thing *anywhere* anymore. Not in a chroot, not in a real root on either of the machines. And it's not like I've done anything. I just don't know what to think. The only thing that changed since the last time I tried is the date.

I'll try to break this curse by setting it WORKSFORME as that's exactly what is happening right now. I'd like to see that spirit try to prove *that* wrong.

Comment 12 Georgi Georgiev 2007-02-27 01:24:34 UTC

The curse is broken, the problem is back :-D

It is still hard for me to reproduce the exact conditions, but the problem is a simple overflow related to the maximum length of INSTALL_MASK.
More precisely, the problem appears to be in strincr_var.

Say you have two lines in /etc/make.conf like:
INSTALL_MASK="some dummy looooong value (>1024 bytes)"
INSTALL_MASK="foo foo bar bar"
what happens when the first line is parsed is that strincr_var properly sets vars_to_read[1].value to the dummy looong value, truncated at 1024 bytes.

When the next line is parsed, strincr_var is called with an already full vars_to_read[1].value. Nevertheless, it immediately appends a space and the full value of the line currently being parsed. This is not correct. Furthermore, it *appends* the value of the line, even though it should overwrite the old value.

I am pretty sure that I only had a single line in make.conf when I reported this, but let's fix the problems one at a time.
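
For illustration only, here is a minimal sketch of what a bounded append could look like. This is not the actual strincr_var code and not the patch attached to this bug; the name append_bounded and its return convention are hypothetical. The idea is simply to check the remaining room before writing the separator and the new text, so a full buffer is left alone instead of being written past its end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from portage-utils).
 * Appends src to dst, separated by one space if dst is non-empty.
 * dst_size is the full capacity of dst, including the terminating NUL.
 * Never writes outside dst. Returns 0 if src fit completely, -1 if it
 * was truncated or could not be appended at all. */
static int append_bounded(char *dst, size_t dst_size, const char *src)
{
    const char *end;
    size_t used, room, need, n;

    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return -1;

    /* Find the current length without reading past the buffer. */
    end = memchr(dst, '\0', dst_size);
    if (end == NULL) {
        dst[dst_size - 1] = '\0';   /* repair a missing terminator */
        return -1;
    }
    used = (size_t)(end - dst);
    room = dst_size - used - 1;     /* bytes free before the NUL */

    if (*src == '\0')
        return 0;                   /* nothing to append */

    if (used > 0) {
        /* Need room for the separator plus at least one character. */
        if (room < 2)
            return -1;
        dst[used++] = ' ';
        room--;
    }

    need = strlen(src);
    n = need < room ? need : room;
    memcpy(dst + used, src, n);
    dst[used + n] = '\0';
    return need > room ? -1 : 0;
}
```

With a helper like this, the second INSTALL_MASK line in the example above would be rejected (return value -1) rather than written over whatever follows the buffer in memory.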

Comment 13 solar (RETIRED) gentoo-dev 2007-02-27 17:41:59 UTC

Ok now we are starting to get somewhere. I can reproduce undesired behavior. Still no segv however.

(echo INSTALL_MASK=\"$(perl -e 'print "A" x 1024')\" ; echo  INSTALL_MASK=\"\${INSTALL_MASK} $(perl -e 'print "A" x 1024')\" ;  echo  INSTALL_MASK=\"\${INSTALL_MASK} $(perl -e 'print "A" x 1024')\") >> /etc/make.conf

Then 
DEBUG=1 q

For me I can see PORTDIR= got overwritten. I'll see what I can do..

Comment 14 solar (RETIRED) gentoo-dev 2007-02-27 17:56:38 UTC

Created attachment 111448 [details, diff]
q-bug-168334.diff

Give this a spin, please.
I added a sanity check and raised the default size of install_mask quite a bit.
Then I raised the size of some other buffers, such as binhost and features.

cvs -d:pserver:anonymous@anoncvs.gentoo.org:/var/cvsroot -q co -R gentoo-projects/portage-utils

Comment 15 solar (RETIRED) gentoo-dev 2007-02-27 23:58:20 UTC

I pushed this patch to cvs. So simply cvs co/up and see if all is fine.. It should be.

Comment 16 Georgi Georgiev 2007-02-28 03:55:06 UTC

I'll try the patch as soon as I get home (in 6 hours I guess). Until then...
I recall that INSTALL_MASK was getting appended to even without having ${INSTALL_MASK}. This may be a separate problem, but thought I'd make it clear. Steps to reproduce:

echo INSTALL_MASK=foo >> make.conf
echo INSTALL_MASK=bar >> make.conf
env DEBUG=1 q >/dev/null
look at INSTALL_MASK="foo bar"
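
As a rough sketch of the shell-like behaviour expected here (a later assignment replaces the old value unless it explicitly references ${INSTALL_MASK}), something along these lines could work. This is an assumption about the desired semantics, not code from portage-utils; assign_var, VAL_MAX and the error convention are all hypothetical, and references to other variables are deliberately copied through unexpanded.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define VAL_MAX 256  /* illustrative scratch capacity, not a portage-utils constant */

/* Hypothetical sketch: NAME="rhs" replaces the old value; only a literal
 * ${NAME} inside rhs pulls the old value back in. Other ${VAR} references
 * are copied as-is. Returns 0 on success, or -1 if the result would not
 * fit, in which case value is left unchanged. */
static int assign_var(char *value, size_t value_size,
                      const char *name, const char *rhs)
{
    char ref[64], out[VAL_MAX];
    size_t ref_len, old_len, used = 0;
    int n;

    assert(value != NULL && name != NULL && rhs != NULL);

    /* Build the self-reference token, e.g. "${INSTALL_MASK}". */
    n = snprintf(ref, sizeof ref, "${%s}", name);
    if (n < 0 || (size_t)n >= sizeof ref)
        return -1;
    ref_len = (size_t)n;
    old_len = strlen(value);

    while (*rhs != '\0') {
        const char *piece = rhs;
        size_t piece_len = 1;

        if (strncmp(rhs, ref, ref_len) == 0) {
            piece = value;          /* substitute the old value */
            piece_len = old_len;
            rhs += ref_len;
        } else {
            rhs++;                  /* copy one ordinary character */
        }
        if (used + piece_len >= sizeof out)
            return -1;              /* keep room for the NUL */
        memcpy(out + used, piece, piece_len);
        used += piece_len;
    }
    out[used] = '\0';

    if (used >= value_size)
        return -1;
    memcpy(value, out, used + 1);   /* commit only after everything fit */
    return 0;
}
```

Under these assumptions, assigning "foo" and then "bar" leaves the value as "bar", while a following assignment of "${INSTALL_MASK} baz" yields "bar baz".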

Comment 17 Georgi Georgiev 2007-02-28 10:14:41 UTC

(In reply to comment #15)
> I pushed this patch to cvs. So simply cvs co/up and see if all is fine.. It
> should be.

OK, I can confirm that it doesn't segfault now. Which is great and I thank you for that.

However, I cannot say that it works, because the parser certainly doesn't behave normally
- it can only increment INSTALL_MASK and doesn't care about the presence of
  ${INSTALL_MASK} (short of removing it)
- it doesn't support ${OTHERVAR} either (as taken from a comment)
- CONFIG_PROTECT is defined as _Q_STR (same as ARCH) while INSTALL_MASK is
  _Q_ISTR (same as FEATURES). I am pretty sure that CONFIG_PROTECT should be
  _Q_ISTR as well.
- personally, I believe that if the parser is properly written, there should be
  no need for _Q_STR *and* _Q_ISTR

Is there any good reason not to use "portageq envvar" and avoid reading the profiles altogether?

Comment 18 solar (RETIRED) gentoo-dev 2007-02-28 15:13:28 UTC

(In reply to comment #17)
> (In reply to comment #15)
> > I pushed this patch to cvs. So simply cvs co/up and see if all is fine.. It
> > should be.
> 
> OK, I can confirm that it doesn't segfault now. Which is great and I thank you
> for that.


good

> However, I cannot say that it works, because the parser certainly doesn't
> behave normally

Well, that's somewhat debatable: "normally" in terms of q, in terms of Python's shlex code, or in terms of a shell?

> - it can only increment INSTALL_MASK and doesn't care about the presence of
>   ${INSTALL_MASK} (short of removing it)

I'll look into that when I get a chance. 


> - it doesn't support ${OTHERVAR} either (as taken from a comment)

That's expected and I don't expect that to change anytime soon.

> - CONFIG_PROTECT is defined as _Q_STR (same as ARCH) while INSTALL_MASK is
>   _Q_ISTR (same as FEATURES). I am pretty sure that CONFIG_PROTECT should be
>   _Q_ISTR as well.

Yeah ok. _Q_ISTR was coded after the orig _Q_STR code.


> - personally, I believe that if the parser is properly written, there should be
>   no need for _Q_STR *and* _Q_ISTR

It's not exactly easy to write a bash parser in C.

> Is there any good reason not to use "portageq envvar" and avoid reading the
> profiles altogether?

Yeah we don't want to shell out and make python calls.

Comment 19 solar (RETIRED) gentoo-dev 2007-04-05 18:42:15 UTC

This is released in 0.1.25

Bug #168334 ; q -r dies with a segfault after emerge --sync
Bug #168442 ; does not  properly parse the profile location
Bug #170795 ; add a -E/--eclass option to qgrep
Bug #170797 ; add a -s/--skip-comments option to qgrep
Bug #171024 ; opening '/usr/portage/.metadata.x' failed
Bug #171374 ; Misc enhancements for qgrep
Bug #172240 ; -A/-B options for qgrep (context lines) 
Bug #172338 ; qgrepping through installed ebuilds (in the VDB) 
Bug #173005 ; Colorized output for qgrep.

Comment 20 solar (RETIRED) gentoo-dev 2007-04-05 18:43:01 UTC

Closing