Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 355731 - media-libs/gegl-0.1.6: fails to compile
Summary: media-libs/gegl-0.1.6: fails to compile
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages (show other bugs)
Hardware: All Linux
: High minor (vote)
Assignee: Hanno Böck
URL:
Whiteboard:
Keywords:
Depends on: 356845
Blocks: 360595
  Show dependency tree
 
Reported: 2011-02-20 19:17 UTC by Alex Brandt (RETIRED)
Modified: 2011-04-04 19:20 UTC (History)
3 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
Build Log (build.log,36.26 KB, text/plain)
2011-02-20 19:26 UTC, Alex Brandt (RETIRED)
Details
emerge --info =media-libs/gegl-0.1.6 (emerge.info,4.91 KB, text/plain)
2011-02-20 19:27 UTC, Alex Brandt (RETIRED)
Details
emerge -pqv =media-libs/gegl-0.1.6 (emerge.txt,116 bytes, text/plain)
2011-02-20 19:27 UTC, Alex Brandt (RETIRED)
Details
backtrace of write to malloc_f (babl-malloc_f-write.bt,3.63 KB, text/plain)
2011-02-28 19:27 UTC, Martin von Gagern
Details
watchpoint for babl (babl-malloc_f-watchpoint.patch,4.34 KB, patch)
2011-02-28 19:38 UTC, Martin von Gagern
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Brandt (RETIRED) gentoo-dev 2011-02-20 19:17:53 UTC
media-libs/gegl-0.1.6 fails to compile with a segmentation fault.  Appears to be due to the doc use flag as this package compiles fine with USE=-doc.

Reproducible: Always

Steps to Reproduce:
1.USE=doc emerge gegl
2.Segmentation Fault
3.Error stopping compilation

Actual Results:  
Segmentation fault in documentation generation.

Expected Results:  
Successful compilation of the package.
Comment 1 Alex Brandt (RETIRED) gentoo-dev 2011-02-20 19:26:32 UTC
Created attachment 263187 [details]
Build Log
Comment 2 Alex Brandt (RETIRED) gentoo-dev 2011-02-20 19:27:01 UTC
Created attachment 263189 [details]
emerge --info =media-libs/gegl-0.1.6
Comment 3 Alex Brandt (RETIRED) gentoo-dev 2011-02-20 19:27:20 UTC
Created attachment 263191 [details]
emerge -pqv =media-libs/gegl-0.1.6
Comment 4 Rafał Mużyło 2011-02-20 19:46:08 UTC
Seems to be a dupe of bug 287007, but as this one is better documented and more recent, not resolving as such.

Can you reproduce the segfault after emerging gegl, if you run that command from command line ?
Comment 5 Rafał Mużyło 2011-02-20 19:46:58 UTC
Sorry, I meant bug 291374.
Comment 6 Alex Brandt (RETIRED) gentoo-dev 2011-02-21 18:11:20 UTC
I have a hard time following some of the autotools Makefiles.  This is how I reproduced the segfault: cd /var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/docs/gallery && make.  

When trying to run this specific command: GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations ../../bin/gegl OpenRaster-00.xml -o `echo OpenRaster-00.png | sed s?./??` > `echo OpenRaster-00.png | sed s?./?? | sed -e s/png/txt/`, I get the segfault as well.

The segfault appears to be coming from the gegl script since running the following also segfaults: GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations ../../bin/gegl OpenRaster-00.xml.

If you need any more information let me know.
Comment 7 Rafał Mużyło 2011-02-21 18:51:28 UTC
No, I meant something like:
after installing gegl, extract OpenRaster-00.xml and data/romedalen.jpg
and run 'GEGL_SWAP=RAM GEGL_DEBUG_TIME=yes gegl OpenRaster-00.xml -o OpenRaster-00.png'
if that doesn't crash, then probably problem is eitheer ebuild or portage related.
Comment 8 Alex Brandt (RETIRED) gentoo-dev 2011-02-21 20:05:21 UTC
Ok, that makes sense.  Sorry I wasn't following that before.  When I run the command with the installed binary I get the same result (which makes since because that binary is the one installed on my system right now).

elijah gallery # GEGL_SWAP=RAM GEGL_DEBUG_TIME=yes gegl OpenRaster-00.xml -o
zsh: segmentation fault  GEGL_SWAP=RAM GEGL_DEBUG_TIME=yes gegl OpenRaster-00.xml -o
Comment 9 Rafał Mużyło 2011-02-22 18:33:56 UTC
I'm on x86 and on command line it works fine, but could this be a shell problem ?
Does it work with bash ?
Comment 10 Alex Brandt (RETIRED) gentoo-dev 2011-02-22 20:40:46 UTC
I made sure to run it with bash and received the same results.  I had considered the zsh usage to be a problem and wanted to rule that out.  The only other difference is that I'm on ~mad64.  Could it be something in that build?  Do you need any other information that would help you track it down?
Comment 11 Martin von Gagern 2011-02-23 21:26:22 UTC
Running the last command from comment #6 using bash identifies this command segfaulting:
exec /var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/bin/.libs/gegl
     OpenRaster-00.xml

So I suggest this command to debug the issue:
cd /var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/docs/gallery; \
GEGL_PATH=../../operations LD_LIBRARY_PATH=\
/var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/gegl/.libs:/usr/lib64 \
gdb --args ../../bin/.libs/gegl OpenRaster-00.xml

(gdb) run
Starting program: /var/tmp/portage/media-libs/gegl-0.1.6/work/
                  gegl-0.1.6/bin/.libs/gegl OpenRaster-00.xml
[Thread debugging using libthread_db enabled]
Traceback (most recent call last):
  File "/usr/share/gdb/auto-load/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.2/
        libstdc++.so.6.0.14-gdb.py", line 59, in <module>
    from libstdcxx.v6.printers import register_libstdcxx_printers
ImportError: No module named libstdcxx.v6.printers

Program received signal SIGSEGV, Segmentation fault.
g_static_private_get (private_key=0x3b398031a0) at gthread.c:1661
1661      else if (private_key->index <= array->len)
(gdb) bt
#0  g_static_private_get
    (private_key=0x3b398031a0) at gthread.c:1661
#1  0x0000003b39601abc in g_module_open
    (file_name=0x683ec0
     "../../operations/workshop/external/.libs/gtk-display.so",
     flags=0) at gmodule.c:468
#2  0x00007ffff7dd0bb4 in gegl_module_open
    (module=0x649c10) at geglmodule.c:386
#3  0x00007ffff7dd125f in gegl_module_load
    (module=0x649c10) at geglmodule.c:135
#4  0x00007ffff7dd143b in gegl_module_new
    (filename=<value optimized out>, load_inhibit=0, verbose=0)
    at geglmodule.c:226
#5  0x00007ffff7dd19d4 in gegl_module_db_module_initialize
    (file_data=<value optimized out>, user_data=<value optimized out>)
    at geglmoduledb.c:343
#6  0x00007ffff7dd0aa9 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:234
#7  0x00007ffff7dd0ae6 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:207
#8  0x00007ffff7dd0ae6 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:207
#9  0x00007ffff7dd0ae6 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:207
#10 0x00007ffff7da6560 in gegl_post_parse_hook
    (context=<value optimized out>, group=<value optimized out>,
     data=<value optimized out>, error=<value optimized out>)
    at gegl-init.c:532
#11 0x0000003b342515e8 in g_option_context_parse
    (context=0x608840, argc=<value optimized out>,
     argv=0x7fffffffbe60, error=0x7fffffffbe20) at goption.c:1962
#12 0x00007ffff7da5ed2 in gegl_init
    (argc=0x7fffffffbe6c, argv=0x7fffffffbe60) at gegl-init.c:193
#13 0x00000000004018bb in main
    (argc=2, argv=0x7fffffffcfa8) at gegl.c:99
(gdb) print array
$2 = (GArray *) 0x1

The function in question, g_static_private_get, seems to come from /usr/lib64/libglib-2.0.so.0.2600.1 which belongs to dev-libs/glib-2.26.1-r1. The array pointer most certainly should point to some other (i.e. valid) address, I guess. So that's where the segfault actually occurs. As to who is to blame, I haven't understood enough to know. Not sure I'll find time to investigate.
Comment 12 Martin von Gagern 2011-02-24 11:34:23 UTC
Using hardware watchpoints it seems that the nvidia OpenGL driver might be part of the problem. It's he who's setting the value which g_static_private_get interprets as self->private_data. However I'm not sure whether nvidia libGL accidentially overwrites data it has no right to, or whether some other problem makes memory regions overlap in a way they should not.

(gdb) run
Starting program: /var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/bin/.libs/gegl OpenRaster-00.xml
[Thread debugging using libthread_db enabled]
Hardware watchpoint 1: $1->private_data

Old value = <unreadable>
New value = (gpointer) 0x0
__libc_calloc (...) at malloc.c:4131
4131          *(d+4) = 0;
(gdb) bt 6
#0  __libc_calloc (...) at malloc.c:4131
#1  0x0000003b36e7a212 in ?? () from /usr/lib64/libGL.so.1
#2  0x0000003b36e7a511 in ?? () from /usr/lib64/libGL.so.1
#3  0x0000003b36e7a814 in ?? () from /usr/lib64/libGL.so.1
#4  0x0000003b2da0dae9 in call_init (...) at dl-init.c:70
#5  0x0000003b2da0dc27 in _dl_init (...) at dl-init.c:134
(More stack frames follow...)
(gdb) c
Continuing.
Hardware watchpoint 1: $1->private_data

Old value = (gpointer) 0x0
New value = (gpointer) 0x1
0x0000003b36e7a243 in ?? () from /usr/lib64/libGL.so.1
(gdb) bt 6
#0  0x0000003b36e7a243 in ?? () from /usr/lib64/libGL.so.1
#1  0x0000003b36e7a511 in ?? () from /usr/lib64/libGL.so.1
#2  0x0000003b36e7a814 in ?? () from /usr/lib64/libGL.so.1
#3  0x0000003b2da0dae9 in call_init (...) at dl-init.c:70
#4  0x0000003b2da0dc27 in _dl_init (...) at dl-init.c:134
#5  0x0000003b2da1204e in dl_open_worker (...) at dl-open.c:480
(More stack frames follow...)
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff70ebe8c in g_static_private_get (private_key=0x7ffff77a3150) at gthread.c:1661
1661      else if (private_key->index <= array->len)
(gdb) p $1
$7 = (GRealThread *) 0x6c53b0
(gdb) p self
$8 = (GRealThread *) 0x6c53b0
Comment 13 Martin von Gagern 2011-02-24 18:45:36 UTC
Cross reference: this issue here seems to have a lot in common with bug #353224: it looks like an issue with thread local storage which got initialized incorrectly during dlopen. In both cases gtk and opengl seem to be involved in some way. It's not the same, though, as I've applied the fix for that bug to my system here, and still encounter the issue. If the original reporter did something similar, the fix itself might be broken in some way. Otherwise I guess it's a different but probably related issue.

Some more details on the issue: the functions pthread_getspecific and setspecific seem to play a crucial role. They are used in the gthread-posix.c implementation of glib. It seems that during the lifetime of the application, the key 0 has different meanings. In my run, the first call to g_thread_self after threads have been initialized will set the value for key 0 to 0x606d90. After that I get the following:

Old value = (void *) 0x606d90
New value = (void *) 0x0
0x0000003b2fa0c646 in __pthread_getspecific (key=<value optimized out>) at pthread_getspecific.c:63
63              result = data->data = NULL;
(gdb) bt
#0  0x0000003b2fa0c646 in __pthread_getspecific
    (key=<value optimized out>) at pthread_getspecific.c:63
#1  0x00007ffff739cd9e in g_private_get_posix_impl
    (private_key=0x607670) at gthread-posix.c:290
#2  0x00007ffff70ecce2 in g_thread_self
    () at gthread.c:2113
#3  0x00007ffff70ebe53 in g_static_private_get
    (private_key=0x7ffff77a3150) at gthread.c:1652
#4  0x00007ffff75a223f in g_module_error
    () at gmodule.c:604
#5  0x00007ffff75a21e5 in g_module_close
    (module=0x68a330) at gmodule.c:590
#6  0x00007ffff7dd0dcc in gegl_module_close
    (module=<value optimized out>) at geglmodule.c:406
#7  gegl_module_unload
    (module=<value optimized out>) at geglmodule.c:192
#8  0x00007ffff7dd1452 in gegl_module_new
    (filename=<value optimized out>, load_inhibit=0, verbose=0)
    at geglmodule.c:227
#9  0x00007ffff7dd19d4 in gegl_module_db_module_initialize
    (file_data=<value optimized out>, user_data=<value optimized out>)
    at geglmoduledb.c:343
#10 0x00007ffff7dd0aa9 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:234
#11 0x00007ffff7dd0ae6 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:207
#12 0x00007ffff7dd0ae6 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:207
#13 0x00007ffff7dd0ae6 in gegl_datafiles_read_directories
    (path_str=<value optimized out>, flags=G_FILE_TEST_EXISTS, 
     loader_func=0x7ffff7dd1940 <gegl_module_db_module_initialize>,
     user_data=0x6102f0) at gegldatafiles.c:207
#14 0x00007ffff7da6560 in gegl_post_parse_hook
    (context=<value optimized out>, group=<value optimized out>,
     data=<value optimized out>, 
     error=<value optimized out>) at gegl-init.c:532
#15 0x00007ffff70cddee in g_option_context_parse
    (context=0x608840, argc=0x7fffffffbe6c, argv=0x7fffffffbe60,
     error=0x7fffffffbe20)
    at goption.c:1962
#16 0x00007ffff7da5ed2 in gegl_init
    (argc=0x7fffffffbe6c, argv=0x7fffffffbe60) at gegl-init.c:193
#17 0x00000000004018bb in main
    (argc=2, argv=0x7fffffffcfa8) at gegl.c:99
(gdb) frame 2
#2  0x00007ffff70ecce2 in g_thread_self () at gthread.c:2113
2113      GRealThread* thread = g_private_get (g_thread_specific_private);
(gdb) finish
Run till exit from #2  0x00007ffff70ecce2 in g_thread_self () at gthread.c:2113
Value returned is $5 = (GThread *) 0x68a330

Old value = (void *) 0x0
New value = (void *) 0x6c53b0
__pthread_setspecific (key=0, value=0x6c53b0) at pthread_setspecific.c:90
90        level2->seq = seq;
(gdb) bt
#0  __pthread_setspecific (key=0, value=0x6c53b0) at pthread_setspecific.c:90
#1  0x0000003b36e7a54a in ?? () from /usr/lib64/libGL.so.1
#2  0x0000003b36e7a814 in ?? () from /usr/lib64/libGL.so.1
#3  0x0000003b2da0dae9 in call_init
    (l=0x681520, argc=2, argv=0x7fffffffcfa8, env=0x7fffffffcfc0)
    at dl-init.c:70
#4  0x0000003b2da0dc27 in _dl_init
    (main_map=0x6836d0, argc=2, argv=0x7fffffffcfa8, env=0x7fffffffcfc0)
    at dl-init.c:134
#5  0x0000003b2da1204e in dl_open_worker (a=<value optimized out>)
    at dl-open.c:480
...

So it seems that gegl/glib tries to access the thread-local g_thread_self pointer after it has been invalidated (due to some kind of close function) and reclaimed by the nvidia OpenGL implementation. I'd believe the reason for this bug somewhere in the call stack of that first trace. Why does gegl_module_new call gegl_module_unload? Haven't looked at the code there yet, won't do so today.
Comment 14 Martin von Gagern 2011-02-25 06:27:15 UTC
New observation: __pthread_key_create and pthread_key_delete are interesting breakpoints. The key in question is key 0, which is used twice, once by glib and once by libGL. There is indeed a key delete in between, which is the reason that this duplicate use can happen in the first place. It involves a broken stack and /usr/lib/qt4/libQtCore.so.4 from x11-libs/qt-core-4.7.1-r1:

Breakpoint 2, pthread_key_delete (key=0) at pthread_key_delete.c:31
31        if (__builtin_expect (key < PTHREAD_KEYS_MAX, 1))
(gdb) bt
#0  pthread_key_delete (key=0) at pthread_key_delete.c:31
#1  0x0000003b2ee353b5 in __cxa_finalize (d=0x3b35aa0700) at cxa_finalize.c:56
#2  0x0000003b3565e6e6 in ?? () from /usr/lib/qt4/libQtCore.so.4
#3  0x00007fffffffb358 in ?? ()
#4  0x0000000000000012 in ?? ()
#5  0x00007fffffffb570 in ?? ()
#6  0x0000003b357d0ce1 in _fini () from /usr/lib/qt4/libQtCore.so.4
#7  0x0000000000000038 in ?? ()
#8  0x0000003b2da12b12 in _dl_close_worker (map=<value optimized out>)
    at dl-close.c:271
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Comment 15 Martin von Gagern 2011-02-28 11:39:00 UTC
Finally nailed this thing down, I think. It's a bug in x11-libs/qt-core-4.7.1-r1. When compiled with -O0 and USE=debug, I get this backtrace:

#0  destroy_current_thread_data_key () at thread/qthread_unix.cpp:141
#1  0x00007ffff4c31a1d in destroy_current_thread_data_key__dest_class__::~destroy_current_thread_data_key__dest_class__ (this=0x7ffff50d7ca8, 
    __in_chrg=<value optimized out>) at thread/qthread_unix.cpp:143
#2  0x0000003b2ee353b5 in __cxa_finalize (d=0x7ffff50d76e0)
    at cxa_finalize.c:56
#3  0x00007ffff4c0ad26 in ?? () from /usr/lib/qt4/libQtCore.so.4
#4  0x00007fffffffb358 in ?? ()
#5  0x0000000000000012 in ?? ()
#6  0x00007fffffffb570 in ?? ()
#7  0x00007ffff4dcb9c1 in _fini () from /usr/lib/qt4/libQtCore.so.4
#8  0x0000000000000038 in ?? ()
#9  0x0000003b2da12b12 in _dl_close_worker (map=<value optimized out>)
    at dl-close.c:271
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

The matching create_current_thread_data_key has never been called, as a breakpoint confirms. This issue has been reported at QT upstream:
http://bugreports.qt.nokia.com/browse/QTBUG-10861
I've filed Gentoo bug #356845 about it.

Unfortunately, with this issue addressed, the gegl build still segfaults. What's worse, it doesn't segfault when run through gdb --args. So I'm somewhat clueless as to where the error occurs now. In any case, this whole mess still isn't fixed, and requires even more work.
Comment 16 Martin von Gagern 2011-02-28 13:53:55 UTC
OK, got a core dump, which looks like an infinite call loop inside media-libs/babl-0.1.4:

#0  0x00007f7d535534c6 in extension_new (path=0x7f7d53566a20 "",
    dl_handle=0x0, destroy=Cannot access memory at address 0x7ffff8c4cff8)
    at babl-extension.c:59
#1  0x00007f7d53553582 in babl_extension_quiet_log () at babl-extension.c:81
#2  0x00007f7d5355c586 in real_babl_log (file=0x7f7d535691be "babl-memory.c",
    line=94, function=0x7f7d53569410 "functions_sanity", 
    fmt=0x7f7d53569188 "babl memory function(s) attempted switched on the fly")
    at babl-internal.h:140
#3  0x00007f7d5355c727 in functions_sanity () at babl-memory.c:94
#4  0x00007f7d5355c791 in babl_malloc (size=41) at babl-memory.c:111
#5  0x00007f7d535534e2 in extension_new (path=0x7f7d53566a20 "", dl_handle=0x0,
    destroy=0) at babl-extension.c:62
...

I read this as babl trying to report a problem with its memory management subsystem, and the reporting facility itself using the memory management subsystem. Bad!

To remedy this problem, let's replace the call to babl_fatal in function_sanity with a simple abort call. With this workaround, we can get a backtrace to where the error first occurs:
#0  in raise (sig=<value optimized out>) at raise.c:64
#1  in abort () at abort.c:92
#2  in functions_sanity () at babl-memory.c:96
#3  in babl_malloc (size=117) at babl-memory.c:114
#4  in babl_calloc (nmemb=1, size=117) at babl-memory.c:260
#5  in babl_fish_path (source=0x2117530, destination=0x2117770)
    at babl-fish-path.c:266
#6  in babl_fish (source=0x2117530, destination=0x2117770)
    at babl-fish.c:224
#7  in gegl_buffer_iterate (buffer=0x213fb10, scale=1.0000000000081855, 
    rect=0x7fff039f487c, format=0x2117770, dest_buf=0x7f6221354010,
    rowstride=2912) at gegl-buffer-access.c:395
#8  gegl_buffer_get_unlocked (buffer=0x213fb10, scale=1.0000000000081855,
    rect=0x7fff039f487c, format=0x2117770, dest_buf=0x7f6221354010,
    rowstride=2912) at gegl-buffer-access.c:985
#9  in spawnrender (data=<value optimized out>, foo=<value optimized out>)
    at gegl-node.c:896
#10 in gegl_node_blit (self=0x2144080, scale=<value optimized out>, 
    roi=<value optimized out>, format=<value optimized out>,
    destination_buf=0x7f6221354010, rowstride=<value optimized out>,
    flags=GEGL_BLIT_DEFAULT) at gegl-node.c:1011
#11 in render_rectangle (processor=0x2367000, progress=0x0)
    at gegl-processor.c:478
#12 gegl_processor_render (processor=0x2367000, progress=0x0)
    at gegl-processor.c:616
#13 gegl_processor_work (processor=0x2367000, progress=0x0)
    at gegl-processor.c:726
#14 in gegl_node_process (self=<value optimized out>) at gegl-node.c:1752
#15 in main (argc=4, argv=0x7fff039f5e98) at gegl.c:247

So it seems that babl is trying to report a problem that shouldn't occur in the first place, and that doesn't occur when running under gdb. Still trying to understand why it does occur in the standalone app.
Comment 17 Martin von Gagern 2011-02-28 14:15:51 UTC
Added some debug output to babl:

babl_set_malloc:
 first_malloc_used: (nil)
 old malloc_f:      0x3b2ee77d60
 malloc_function:   0x3b2ee77d60
babl_set_free:
 first_free_used:   (nil)
 old free_f:        0x3b2ee77c80
 free_function:     0x3b2ee77c80
First use of malloc:
 first_malloc_used: (nil)
 malloc_f:          0x3b2ee77d60
 first_free_used:   (nil)
 free_f:            0x3b2ee77c80
babl memory function(s) attempted switched on the fly:
 first_malloc_used: 0x3b2ee77d60
 malloc_f:          0x3b37200960
 first_free_used:   0x3b2ee77c80
 free_f:            0x3b372008a0

Address range 0x0000003b37200550 to 0x0000003b37200b13 belongs to /usr/lib64/opengl/nvidia/lib/libnvidia-tls.so.260.19.36 according to
(gdb) info sharedlibrary

So something or other is replacing the malloc and free functions for babl without calling babl_set_malloc/babl_set_free. The function used seems to belong to some thread-local data from nvidia. Last time I accused that driver (in comment #12) I've been wrong. Not sure who's to blame this time.

If only I could reproduce this within gdb, I could add a hardware watchpoint to see where those symbols change. Strange that I can't.
Comment 18 Martin von Gagern 2011-02-28 19:27:47 UTC
Created attachment 264203 [details]
backtrace of write to malloc_f

OK, got a watchpoint without gdb, and it indeed is the nvidia driver that does write to the memory in question, i.e. to the static malloc_f variable of babl. Not really surprising, as the written address points to some nvidia tls memory. Nevertheless problematic for further investigation: lacking the source it's really hard to tell where that var came from.
Comment 19 Martin von Gagern 2011-02-28 19:38:41 UTC
Created attachment 264205 [details, diff]
watchpoint for babl

This patch adds a watchpoint and some debug output to the babl source code. Posting it here for reference, in case someone else wants to investigate this.

1. Run "ulimit -c unlimited" to enable core dumps.
2. Unpack babl, apply this patch, run autogen.sh, configure, make.
3. cd /var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/docs/gallery; \
   GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations \
   LD_LIBRARY_PATH=/var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/\
gegl/.libs:/PATH/TO/BABL/babl/.libs:/usr/lib64 \
   /var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/bin/.libs/gegl \
   OpenRaster-00.xml -o OpenRaster-00.png
4. App forks, parent exits, child dumps core on SIGTRAP
5. GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations \
   LD_LIBRARY_PATH=/var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/\
gegl/.libs:/PATH/TO/BABL/babl/.libs:/usr/lib64 gdb \
   /var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/bin/.libs/gegl core

Now you have a gdb instance to examine the situation when the core dump occurred. "objdump -d /usr/lib64/libGL.so.1" will help translate any code addresses at least into assembly. Don't know how well that works without prelink, though.
Comment 20 Martin von Gagern 2011-03-04 08:55:31 UTC
OK, I tracked this back to the beginning of the function in the libGL assembly code. At address 3b36e94bf0 <glXCreateWindow+0x28720> the %rdi register seems to already point to some address within babl memory. This is later offset by the value from %ecx at call time, but I'm pretty sure the error is in %rdi.

So where does that value come from? As there are 4 places where that function gets called, things become increasingly more difficult. Does anyone now a good tool to follow information flow through such code?

I'd inform nvidia linux support about this, but http://www.nvnews.net/ seems out of order at the moment.
Comment 21 Petr Zima 2011-04-02 20:52:55 UTC
We are mixing two different issues related to USE=doc here:

1) Segfault due to a bug in Nvidia's opengl, already addressed in Bug 302107.

2) Another segfault, which I am also experiencing even on ~x86 (32bit) with xf86-video-ati and mesa opengl. This one happens only with USE=-png, otherwise the docs are generated fine. Hence "doc" should imply "png" (the offending command asks for png output). Of course, it should not segfault anyway, but I am not going to debug it thoroughly at the moment.

There is also a duplicate Bug 291374 of one the above and a related Bug 287007.
Comment 22 Martin von Gagern 2011-04-03 10:14:51 UTC
(In reply to comment #21)
> We are mixing two different issues related to USE=doc here:
> 
> 1) Segfault due to a bug in Nvidia's opengl, already addressed in Bug 302107.

Looks like my comment #16 through comment #20 might be that bug. Can someone please mark this one here as depending on that one?

I'm still pretty sure that my comment #11 thorugh comment #15 are due to bug #356845, so I guess that might be a third issue mixed into all of this.

> 2) Another segfault [...] happens only with USE=-png

This is the first time I read about the USE=-png scenario. According to comment #3, the original report was USE=png. So I'd say that's yet another issue, and should be filed separately, with appropriate cross references. In particular as you've got a solution to offer, that might be resolved early.
Comment 23 Petr Zima 2011-04-03 16:07:42 UTC
(In reply to comment #22)
> > 2) Another segfault [...] happens only with USE=-png
> 
> This is the first time I read about the USE=-png scenario. According to comment
> #3, the original report was USE=png. So I'd say that's yet another issue, and
> should be filed separately, with appropriate cross references. In particular as
> you've got a solution to offer, that might be resolved early.

Okay, filed separate Bug 361803 for the USE="doc -png" issue.
Comment 24 Jeroen Roovers (RETIRED) gentoo-dev 2011-04-04 11:15:43 UTC
*** Bug 361803 has been marked as a duplicate of this bug. ***
Comment 25 Kobboi 2011-04-04 14:59:52 UTC
Not sure if I'm talking about the same thing but my gegl-0.1.6 fails with

make[4]: Entering directory `/var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/docs/gallery'
--[Updating sample compositions]--
./OpenRaster-00.xml
/bin/sh: line 1: 21309 Killed                  GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations ../../bin/gegl OpenRaster-00.xml -o `echo OpenRaster-00.png | sed s?./??` > `echo OpenRaster-00.png | sed s?./?? | sed -e s/png/txt/`
make[5]: *** [OpenRaster-00.png] Error 137
./OpenRaster-01.xml
/bin/sh: line 1: 21336 Killed                  GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations ../../bin/gegl OpenRaster-01.xml -o `echo OpenRaster-01.png | sed s?./??` > `echo OpenRaster-01.png | sed s?./?? | sed -e s/png/txt/`
make[5]: *** [OpenRaster-01.png] Error 137
./OpenRaster-04.xml
/bin/sh: line 1: 21363 Killed                  GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations ../../bin/gegl OpenRaster-04.xml -o `echo OpenRaster-04.png | sed s?./??` > `echo OpenRaster-04.png | sed s?./?? | sed -e s/png/txt/`
make[5]: *** [OpenRaster-04.png] Error 137
./clones.xml
/bin/sh: line 1: 21390 Killed                  GEGL_DEBUG_TIME=yes GEGL_SWAP=RAM GEGL_PATH=../../operations ../../bin/gegl clones.xml -o `echo clones.png | sed s?./??` > `echo clones.png | sed s?./?? | sed -e s/png/txt/`
make[5]: *** [clones.png] Error 137
make[4]: *** [images.stamp] Error 2
make[4]: Leaving directory `/var/tmp/portage/media-libs/gegl-0.1.6/work/gegl-0.1.6/docs/gallery'
make[3]: *** [all-recursive] 

and the kernel traces show oom at work

Error 1[60089.396962] gegl invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
[60089.396967] Pid: 6490, comm: gegl Tainted: P            2.6.36-gentoo-r5 #1
[60089.396973] Call Trace:
[60089.396978]  [<ffffffff810b742a>] dump_header.clone.8+0x7a/0x200
[60089.396982]  [<ffffffff815fa472>] ? _raw_spin_unlock_irqrestore+0x12/0x40
[60089.396985]  [<ffffffff812183f2>] ? ___ratelimit+0x92/0x120
[60089.396988]  [<ffffffff810b779e>] oom_kill_process.clone.9+0x7e/0x1f0
[60089.396990]  [<ffffffff810b7ba3>] out_of_memory+0x113/0x3d0
[60089.396993]  [<ffffffff810bb902>] __alloc_pages_nodemask+0x6a2/0x6c0
[60089.396996]  [<ffffffff810edc78>] alloc_page_vma+0x68/0x130
[60089.396999]  [<ffffffff810d491b>] handle_mm_fault+0x73b/0xa60
[60089.397002]  [<ffffffff8100386e>] ? invalidate_interrupt1+0xe/0x20
[60089.397005]  [<ffffffff8102b115>] do_page_fault+0x115/0x280
[60089.397007]  [<ffffffff815fafcf>] page_fault+0x1f/0x30
Comment 26 Tomáš Chvátal (RETIRED) gentoo-dev 2011-04-04 19:20:32 UTC
Due to serious issues with the doc useflag it was dropped thus closing this bug as fixed. If somebody is interested in doc features they need to open feature request with relyable working approach for them.