Created attachment 351696 [details]
WARN/stack dump/oops in the bluetooth rfcomm code introduced in raw kernel 3.8.x

This is an *UPSTREAM* bug; this report also collects what is known about it.

A bug introduced upstream by the bluetooth developers in 3.8.x, and still present in 3.9.x, crashes the machine with an oops when rfcomm is disconnected while a tty is still connected. In 3.10-rc5 the behavior changed, but the bug still exists.

The initial method to trigger this bug was listed in http://forums.gentoo.org/viewtopic-t-961421-highlight-.html

In brief, set up any bluetooth rfcomm connection and then tear down the bluetooth connection (/etc/init.d/bluetooth stop, rfcomm release, or use blueman to disconnect Dial Up Networking/Serial). (I believe that unplugging the bluetooth USB device will also trigger this issue, but I'd call that unnatural behavior.) The kernel will then stomp over another kernel structure and corrupt kernel memory, making other subsystems oops.

As Gentoo does not appear to have bluetooth set up for networkmanager, it should not be affected unless someone is using rfcomm directly to communicate with a bluetooth serial device, say with minicom, or using pppd directly to access a bluetooth modem. I hit the bug because I have a /etc/portage/patches/net-misc/networkmanager patch file to allow bluetooth rfcomm links.

As far as I can tell, and from reports/tests upstream, this is probably because bluetooth rfcomm does not follow the standard tty procedures for detaching connected applications: the bluetooth link is torn down without cleaning up the tty.

A patch to expose the bad rfcomm behavior was posted on LKML on 2013 May 15. It also keeps the machine from hanging/crashing by stopping the memory corruption. It does not fix the problem, it merely instruments it (and keeps other subsystems from dying, which could otherwise cause data loss).
The patch that Peter Hurley wrote was:

diff --git a/drivers/tty/tty_port.c b/drivers/tty/tty_port.c
index 6d9e0b2..a4f4fa9 100644
--- a/drivers/tty/tty_port.c
+++ b/drivers/tty/tty_port.c
@@ -140,6 +140,10 @@ EXPORT_SYMBOL(tty_port_destroy);
 static void tty_port_destructor(struct kref *kref)
 {
 	struct tty_port *port = container_of(kref, struct tty_port, kref);
+
+	/* check if last port ref was dropped before tty release */
+	if (WARN_ON(port->itty))
+		return;
 	if (port->xmit_buf)
 		free_page((unsigned long)port->xmit_buf);
 	tty_port_destroy(port);

Attached are the warnings and errors generated when I disable rfcomm from blueman with the above patch applied, showing the correct trace. Without the patch, the corruption tends to make other functions show incorrect information, and the machine tends to crash/hang completely shortly after disconnection.
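For anyone who wants to see the idea behind that guard outside the kernel, here is a minimal userspace sketch. Everything in it (fake_port, fake_tty, fake_port_put, demo_teardown) is a hypothetical stand-in I made up for illustration, not the kernel's struct tty_port or kref API. It only models the ordering problem: the last port reference being dropped while a tty still points at the port, and the destructor refusing to free in that case.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins, NOT the kernel structures. */
struct fake_tty;

struct fake_port {
	int refs;              /* plays the role of the kref count */
	struct fake_tty *itty; /* non-NULL while a tty is still attached */
	char *xmit_buf;        /* resource the destructor would free */
	int destroyed;         /* set once the destructor really ran */
};

struct fake_tty {
	struct fake_port *port;
};

/* Mirrors the shape of the patched destructor: warn and bail out
 * if the tty has not been released yet, otherwise free resources. */
static void fake_port_destructor(struct fake_port *port)
{
	if (port->itty) {
		fprintf(stderr, "WARNING: last port ref dropped before tty release\n");
		return; /* deliberately leak instead of freeing memory still in use */
	}
	free(port->xmit_buf);
	port->xmit_buf = NULL;
	port->destroyed = 1;
}

/* Drop one reference; run the destructor when the count hits zero. */
static void fake_port_put(struct fake_port *port)
{
	assert(port->refs > 0);
	if (--port->refs == 0)
		fake_port_destructor(port);
}

/* Returns 1 if the guard fired (port left alone), 0 if it was destroyed. */
int demo_teardown(int release_tty_first)
{
	struct fake_port port = { .refs = 1, .xmit_buf = malloc(64) };
	struct fake_tty tty = { .port = &port };
	int guarded;

	port.itty = &tty;         /* a tty is attached to the port */
	if (release_tty_first)
		port.itty = NULL; /* correct order: tty goes away first */

	fake_port_put(&port);     /* link teardown drops the last ref */
	guarded = !port.destroyed;

	free(port.xmit_buf);      /* demo cleanup; free(NULL) is a no-op */
	return guarded;
}
```

Calling demo_teardown(0) models the buggy order (link torn down while the tty is attached) and trips the warning; demo_teardown(1) models the correct order and lets the destructor run. As with the real patch, the guard only trades a use-after-free for a leak and a warning. It does not fix the ordering itself.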
It seems the upstream patch has been rejected, and people on Launchpad are waiting too.
There seems to have been a flame war on LKML about what to do when the illegal situation arises. Alex's original patch, which called BUG() when the problem occurs, was rejected, but Peter suggested using WARN() instead. Either should be fine for putting a notice in the syslog/console trace when the tty is torn down improperly. Either way, it's still just instrumenting the bug. There still appears to be no true fix in sight.

On June 25 there was a reply on LKML under the subject "BUG: tty: memory corruption through tty_release/tty_ldisc_release" indicating that the bug can also be triggered by suspending/resuming the machine while the link is open (ouch).

I suppose it's only because not many people use BT; otherwise this would be considered a fairly serious bug...
Saw a message come by on the linux-bluetooth/linux-serial mailing lists, dated Jul 6 2013, from Gianluca Anzolin, who submitted a patch against 3.10 which may fix this issue. The same doubts exist, though: people aren't sure how this piece of software really works :(

It was suggested that the patch stopped the crash from happening.
(In reply to Ben from comment #3)
> Saw a message come by on linux-bluetooth/linux-serial mail lists, dated Jul
> 6 2013 from Gianluca Anzolin who submitted a patch versus 3.10 which may fix
> this issue, but the same doubts exist - people aren't sure how this piece of
> software really works :(
>
> Though it was suggested that it stopped the crash from happening.

Do you have a link to this particular message? Thanks in advance for tracking it down.
Here's the latest thread I saw on the mailing list. I hope Peter's mail indicates that this will be committed to a kernel soon:
http://marc.info/?l=linux-bluetooth&m=137511052832458&w=2
Gianluca Anzolin's patches have now been merged to bluetooth-next. Eventually they should find their way to mainline and stable kernels.

http://marc.info/?l=linux-bluetooth&m=137699050920055&w=2

https://git.kernel.org/cgit/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=1f088c00f11cd5b09e215cf31010ed3854f62b9a
https://git.kernel.org/cgit/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=befa7d049165e6d47859fb827ee5671354f30284
https://git.kernel.org/cgit/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=33040aa77f9ba8f0e3120f2e15917a74aef7ee07
https://git.kernel.org/cgit/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=e5e5db0dcfb07cf40cbec7e198443a8f67a844c2
https://git.kernel.org/cgit/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=77f577d52aefb92c350f65c4228958415a05510f
https://git.kernel.org/cgit/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=288f2fc4203559d225d84f1a0308198ad7a06c65
(In reply to Jussi Saarinen from comment #6)
> Gianluca Anzolin's patches have now been merged to bluetooth-next.
> Eventually they should find their way to mainline and stable kernels.
>
> ...

Cool, will see if these apply on top of genpatches.
The patch series is apparently "too extensive to consider for -stable" [1]. So another solution is required for stable kernels. Gianluca's fix should eventually end up in mainline though (3.12 hopefully).

[1] http://marc.info/?l=linux-bluetooth&m=137762583515880&w=2
Gianluca Anzolin writes on the linux-bluetooth mailing list that although his tty refcount patch series is needed, more work is required to fix the problem. If I understood his message correctly, the system still locks up when the device is released, even with his patches applied.

Source: http://marc.info/?l=linux-bluetooth&m=137788497602145&w=2
Thank you for reporting the status of the patches, we will await them.
Gianluca Anzolin's patches were merged to net-next the day before yesterday, and yesterday they were merged to Linus' master branch. So the patches will be in 3.12-rc1.
Sorry, forgot the links to the commits:

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e7abfe40928f4f8c1aa908477c36c13843bd1a57
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cc998ff8811530be521f6b316f37ab7676a07938
I've tried 3.12-rc2. As far as I can tell, just opening and closing the rfcomm in BT no longer seems to crash the box. However, some characteristics changed and NetworkManager no longer accepts /dev/rfcomm* as a valid communications device as it did before, so I can't fully test it.

A bit tricky. Using the same userland:

Linux-3.5.7-gentoo - works flawlessly
Linux-3.8.13-gentoo - crashes when BT rfcomm is closed
Linux-3.12-rc2 (raw from kernel.org) - blueman is able to set up rfcomm but networkmanager does not notice that rfcomm was set up. Supposedly NM should notify blueman via dbus that it acknowledges the device, but blueman times out waiting and NM does not recognize the device.

In trying to diagnose the problem I tried using busybox microcom on /dev/rfcomm0 directly. I was able to send my modem AT commands like on previous kernels, indicating the bluetooth link indeed works. I was also able to shut down the rfcomm link in blueman after sending those bytes through. No crash, which is a positive sign... but it may depend on the number of bytes sent.

I'll need to see why NM does not like the rfcomm device now, but I am no NM or dbus expert...
It seems that commit is indeed in all 3.12 related versions.

# git tag --contains cc998ff8811530be521f6b316f37ab7676a07938
v3.12
v3.12-rc1
v3.12-rc2
v3.12-rc3
v3.12-rc4
v3.12-rc5
v3.12-rc6
v3.12-rc7

Can you try to see if the rest of the odd behavior has since been fixed?
Drat. 3.12-release still reports that the "connection is unusable" in blueman whereas 3.6.11 (last kernel I have built that works)... still works... hmm...needs more debug now...
(In reply to Ben from comment #15)
> Drat. 3.12-release still reports that the "connection is unusable" in
> blueman whereas 3.6.11 (last kernel I have built that works)... still
> works...
>
> hmm...needs more debug now...

Aw, then this isn't the right patch to backport; can you check if this still happens in more recent versions? (gentoo-sources and git-sources)
Yes, 3.12-release has the same behavior as the release candidates - they no longer crash the machine but it was not fixed correctly/completely (meaning that behavior is not quite correct.) Will have to check future versions to find one that behaves correctly. Debugging networkmanager will be ugly... Sigh.
I think we finally have a winning patchset here. Two patches showed up on linux-bluetooth that, when applied against 3.12.6, seem to completely fix the longstanding problem. I don't know when these will show up in mainline. The patch names are "rfc3.patch" and "modman.patch", from Gianluca Anzolin. I'll attach the patches here.

Thanks to all of the Gentoo staff for tolerating bugs like this. I've been posting in the Ubuntu forums about this and all I get is flak.
Created attachment 366926 [details, diff] part one of 3.12.6 userspace bug of rfcomm
Created attachment 366928 [details, diff] part two of 3.12.6 patch to fix userspace differences in rfcomm
Created attachment 368038 [details, diff]
Patch for inclusion - Part 1

Ben, can you test the 4 part series of which this is part 1? I backported the four from upstream.
Created attachment 368044 [details, diff] Patch for inclusion 2/4
Created attachment 368046 [details, diff] Patch for inclusion 3/4
Created attachment 368048 [details, diff] Patch for inclusion 4/4
What kernel version are these patches against? I'm not sure which of the patches that went into 3.12.6 fixed the largest part of the issue (i.e. the crashing) ...
Sorry, please apply against 3.12.8.
The patches apply against 3.12.8 fine and NetworkManager finds the rfcomm interface just fine once more. (Now I just need to move to systemd/gnome3...ugh...) Thanks!
Ben, it's been a while and I can't imagine the appropriate patches aren't in 3.14. Please comment if that is not the case and there are still issues.