278760 – x11-base/xorg-server-1.6.2-r1: SIGSEGV with sys-apps/hal-0.5.13

Bug 278760 - x11-base/xorg-server-1.6.2-r1: SIGSEGV with sys-apps/hal-0.5.13

Summary: x11-base/xorg-server-1.6.2-r1: SIGSEGV with sys-apps/hal-0.5.13

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages (show other bugs)
Hardware:	All Linux

Importance:	High minor (vote)
Assignee:	Gentoo X packagers

URL:
Whiteboard:
Keywords:

Duplicates (1):	271833 (view as bug list)
Depends on:
Blocks:

Reported:	2009-07-22 20:06 UTC by Martin von Gagern
Modified:	2009-07-28 15:41 UTC (History)
CC List:	1 user (show)

See Also:
Package list:
Runtime testing required:	---

Attachments
Handle errors for libhal_find_device_by_capability (gentoo278760.patch,1.15 KB, patch) 2009-07-22 21:01 UTC, Martin von Gagern	Details \| Diff
Handle errors for libhal_find_device_by_capability (gentoo278760.patch,1.17 KB, patch) 2009-07-22 21:08 UTC, Martin von Gagern	Details \| Diff
Handle and free errors for libhal_find_device_by_capability (gentoo278760c.patch,1.59 KB, patch) 2009-07-23 09:04 UTC, Martin von Gagern	Details \| Diff
0001-config-add-HAL-error-checks.patch (0001-config-add-HAL-error-checks.patch,3.15 KB, patch) 2009-07-27 12:06 UTC, Rémi Cardona (RETIRED)	Details \| Diff
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin von Gagern 2009-07-22 20:06:09 UTC

This is related to bug #278642, but my request is a different one.

After updating hal to hal-0.5.13, my Xorg server kept crashing with a segmentation fault. I ran it through gdb and got the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7fce6c0 (LWP 7387)]
0x080a7f5b in connect_and_register (connection=<value optimized out>, info=0x81d9424) at hal.c:511
511     hal.c: No such file or directory.
        in hal.c
(gdb) bt
#0  0x080a7f5b in connect_and_register (connection=<value optimized out>, info=0x81d9424) at hal.c:511
#1  0x080a8062 in connect_hook (connection=0x9a94a20, data=0x81d9424) at hal.c:623
#2  0x080a665f in reconnect_timer (timer=0x98255e8, time=2142542, arg=0x0) at dbus-core.c:172
#3  0x081261a7 in DoTimer (timer=0x98255e8, now=0, prev=0x81dca24) at WaitFor.c:419
#4  0x08126786 in WaitForSomething (pClientsReady=0x9a93ea8) at WaitFor.c:276
#5  0x0808972b in Dispatch () at dispatch.c:367
#6  0x0807011d in main (argc=1, argv=0xbfc2c424, envp=0x9a96aa0) at main.c:397
(gdb) c
Continuing.

Backtrace:
0: /usr/bin/Xorg(xorg_backtrace+0x3c) [0x812881c]
1: /usr/bin/Xorg(xf86SigHandler+0x4e) [0x80b3b80]
2: [0xffffe400]
3: /usr/bin/Xorg [0x80a8062]
4: /usr/bin/Xorg [0x80a665f]
5: /usr/bin/Xorg [0x81261a7]
6: /usr/bin/Xorg(WaitForSomething+0x4ee) [0x8126786]
7: /usr/bin/Xorg(Dispatch+0x7e) [0x808972b]
8: /usr/bin/Xorg(main+0x3b5) [0x807011d]
9: /lib/libc.so.6(__libc_start_main+0xe5) [0x4c547a65]
10: /usr/bin/Xorg [0x806f601]

Fatal server error:
Caught signal 11.  Server aborting

Then I found the report for bug #278642, downgraded hal with its config file, upgraded hal but kept the old config, and at least the crashes went away (though I still have other troubles I'm currently investigating).

So while the immediate issue for me was resolved by fiddeling hal, the fact remains that a broken configuration can cause Xorg to segfault. This should never happen, in my opinion. Configurations should be checked before they are relied upon. An error message and failure to start are acceptable, but a crash probably indicates a bug.

So where bug #278642 provides a solution for X failing to start, I here ask for an investigation and fix to the crash.

Feel free to delegate this issue upsream, or ask me to do so, if you consider the diference between crash and fatal error message to bee too small to warrant addressing at the distro level.

Comment 1 Samuli Suominen (RETIRED) gentoo-dev

2009-07-22 20:09:41 UTC

Don't use HAL 0.5.13 _at all_, upgrade to 0.5.13-r1 and make sure the config files are updated as well. I didn't package.mask it without reason.

Comment 2 Martin von Gagern 2009-07-22 21:01:13 UTC

Created attachment 198837 [details, diff]
Handle errors for libhal_find_device_by_capability

I wrote this bug not because hal-0.5.13 is broken, but because I consider x11-base/xorg-server-1.6.2-r1 to be broken as well.

Yes, updating to hal-0.5.13-r1 including the new config does solve the issue.
Yes, if I had synced the pmask to my system, I'd never emerged 0.5.13 and therefore never encountered this problem.

BUT applications should in my opinion not crash, whatever copnfiguration files say. They may fail to start, but they should complain and give an indication as to what went wrong.

The only difference between hal 0.5.13 and 0.5.13-r1 is the config, so I could update and then manually (and deliberately) edit the config in order to reproduce this issue. The backtrace occurs in the X server itself, not some dynamically linked library, so the hal version doesn't directly affect the instructions around the point of failure.

Loking at http://cgit.freedesktop.org/xorg/xserver/tree/config/hal.c?id=6f1aff5a2b45bc2985081abc240a8fed37170386#n506 I see these lines:

    devices = libhal_find_device_by_capability(info->hal_ctx, "input",
                                               &num_devices, &error);
    /* FIXME: Get default devices if error is set. */
    for (i = 0; i < num_devices; i++)
        device_added(info->hal_ctx, devices[i]);

So xorg-server is completely ignoring any error indicated by the hal library. The HAL library, on the other hand, seems to return NULL in some error scenarios, setting the error but not num_devices. Unless num_devices is accidentially initialized to a non-positive number, the loop body actually does get executed in case of an error, even though the array start is NULL.

The FIXME comment does indicate that there is at least some awareness upstream that this does require some more work. So resolve this UPSTREAM if you consider this a bug to be addressed by xorg devs, or resolve it WONTFIX if you don't care about crashes due to broken configs, or apply my patch and mark it FIXED.

Comment 3 Martin von Gagern 2009-07-22 21:04:09 UTC

REOPENing and awaiting a resolution different from the current TEST-REQUEST.
Alternatives UPSTREAM, WONTFIX or FIXED as outlined above.

Comment 4 Martin von Gagern 2009-07-22 21:08:48 UTC

Created attachment 198841 [details, diff]
Handle errors for libhal_find_device_by_capability

I should have used dbus_error_is_set, as far as I can tell from other, similar code. I'll recompile my xorg server with this patch tomorrow, and see if I can get useful error messages from it even with broken hal config.

Comment 5 Rémi Cardona (RETIRED) gentoo-dev

2009-07-23 08:31:23 UTC

Indeed, I'll see what upstream thinks about this.

Thanks

Comment 6 Martin von Gagern 2009-07-23 09:04:15 UTC

Created attachment 198883 [details, diff]
Handle and free errors for libhal_find_device_by_capability

OK, tested this, and got the following two messages in case of a broken config:

(EE) config/hal: couldn't find input device: org.freedesktop.DBus.Error.AccessDenied (Rejected send message, 1 matched rules; type="method_call", sender=":1.11" (uid=0 pid=1122 comm="/var/tmp/portage/x11-base/xorg-server-1.6.2-r1/ima") interface="org.freedesktop.Hal.Manager" member="FindDeviceByCapability" error name="(unset)" requested_reply=0 destination="org.freedesktop.Hal" (uid=0 pid=901 comm="/usr/sbin/hald --use-syslog --verbose=no ")))

process 1122: arguments to dbus_move_error() were incorrect, assertion "(dest) == NULL || !dbus_error_is_set ((dest))" failed in file dbus-errors.c line 278.
This is normally a bug in some application using the D-Bus library.
  D-Bus not built with -rdynamic so unable to print a backtrace

After this Xorg aborted using SIGABRT.

So the first is just the kind of error message I'd have expected all along. The second is probably due to the fact that a dbus error has to be freed before it may be passed to receive another error message. My coded didn't do this, but neither did the code I copied from.

So here is another patch, preventively freeing any errors that might have occurred. As an alternative, one could free errors directly after catching them, which would seem slightly cleaner to me, but would largely increase the footprint of this patch. So I wrote it this way.

With this patch in place, my Xorg no longer crashed in the presence, but continued running after printing the first error message about the denied dbus access. So one ends up with a running X server without any input devices at all. This might be bad, as it might leave many users without means to recover the situation, as Ctrl+Alt+F# don't seem to work either. On the other hand, this might be intended behaviour in cases where input devices will be plugged in and autoconfigured at a later time. From a usability point of view I'd wish for a fatal error message as the default, but an option to disable it. I guess all of this "no input devices" handling is a diferent issue, though.

Comment 7 Tibor Vago 2009-07-23 09:09:20 UTC

(In reply to comment #2)
> Yes, updating to hal-0.5.13-r1 including the new config does solve the issue.
> 
> BUT applications should in my opinion not crash, whatever copnfiguration files
> say. They may fail to start, but they should complain and give an indication as
> to what went wrong.
> 
> The only difference between hal 0.5.13 and 0.5.13-r1 is the config, so I could
> update and then manually (and deliberately) edit the config in order to
> reproduce this issue. The backtrace occurs in the X server itself, not some
> dynamically linked library, so the hal version doesn't directly affect the
> instructions around the point of failure.

Upgrading to hal-0.15.3-r1 was solve the Backtrace error about X on my laptop.

Thx.

Comment 8 Rémi Cardona (RETIRED) gentoo-dev

2009-07-27 12:06:45 UTC

Created attachment 199329 [details, diff]
0001-config-add-HAL-error-checks.patch

Martin, could I get you to try this patch?

Thanks :)

Comment 9 Martin von Gagern 2009-07-27 12:50:05 UTC

(In reply to comment #8)
> Created an attachment (id=199329) [edit]
> 0001-config-add-HAL-error-checks.patch
> 
> Martin, could I get you to try this patch?

Tried it, gives me a nive error message and no crash, neither segfault nor abort. So it works for me. Is this patch intended for both Gentoo patchset and upstream?

Comment 10 Martin von Gagern 2009-07-27 12:50:52 UTC

(In reply to comment #9)
> gives me a nive error message and no crash

s/nive/nice/

Comment 11 Rémi Cardona (RETIRED) gentoo-dev

2009-07-27 12:52:41 UTC

Yes, I'm going to send this patch to upstream (and hopefully, they'll ack it).

Thanks for testing

Comment 12 Rémi Cardona (RETIRED) gentoo-dev

2009-07-28 13:06:10 UTC

I've pushed the patches to git master and I've nominated them for inclusion in the 1.6 branch. In the mean time, they're in 1.6.2.901.

Closing. Thanks

Comment 13 Rémi Cardona (RETIRED) gentoo-dev

2009-07-28 15:41:28 UTC

*** Bug 271833 has been marked as a duplicate of this bug. ***