Created attachment 278227 [details] manual config 2.6.39-r1

I have had this problem for 2 days now -.- If I copy files over the network (tested with an nfs mount and scp), the system becomes unresponsive and freezes on me.

My testing with gentoo-sources-2.6.39-r2 (genkernel and my own config):

KDE - nfs mounted via autofs (Dolphin) - complete freeze of the desktop (every now and then I could move the mouse 1-2cm)
KDE - nfs mounted via autofs (Konsole via cp) - complete freeze of the desktop (every now and then I could move the mouse 1-2cm)
XFCE4 - nfs mounted via autofs (Thunar) - complete freeze of the desktop (every now and then I could move the mouse 1-2cm)
XFCE4 - nfs mounted via autofs (Terminal via cp) - complete freeze of the desktop (every now and then I could move the mouse 1-2cm)
XFCE4 - scp command in Terminal - complete freeze of the desktop (every now and then I could move the mouse 1-2cm)
TTY1 (console with cp) - nfs mounted via autofs - Alt+F2 (to switch tty) doesn't respond anymore, or only every now and then; keyboard input is delayed or not taken

No problems with gentoo-sources-2.6.39-r1 ^^

I attach emerge --info and my current manual config, which works with gentoo-sources-r1.
Created attachment 278229 [details] dmesg
Created attachment 278231 [details] emerge --info
The patch contains a lot of changes to intel-iommu. Maybe DMA is broken on my system with this kernel? This would explain why all the input devices stop working properly, or respond only with a long delay, when another device is sending a constant stream to memory: the input device just doesn't get an interrupt ^^
I disabled GART IOMMU and then unticked all supported processor vendors except Intel, and it's still the same issue. I can freeze my system by trying to copy about 3GB of data over the network... this disables GART iommu as well: [*] Supported processor vendors ---> [*] Support Intel processors only for AMD processors: [*] GART IOMMU support (NEW)
Which network driver do you use? I can confirm such problems here on my laptop using the jme LAN driver. And it does not matter whether I use GBit or 100 MBit/s. I will try to revert to r1 too.

-Marc
jme here too... but there was nothing about wired network drivers in the patch. :/
Can one of you guys do a bisect between vanilla-sources-2.6.29.1 and vanilla-sources-2.6.29.2 ? Also, can you test the latest git-sources to see if this has been addressed?
Mike, I am in contact with the upstream author of the jme driver and I will report back here if we find something. If I find the time I will bisect in between too. (.39 btw not .29 ;))
I can reproduce the problem with sys-kernel/git-sources-3.0_rc5. The same happens using cp via nfs in Terminal running the xfce4 desktop: input devices become unresponsive.
I am sorry: I tested the first bisect build and marked it as bad, and with the next kernel I get a kernel panic...
My fault: I keep telling git 'bad' on the first bisect step but always end up with 3.0.0-rc5-0063-g0d72c6f. Should I get any output from git bisect bad?
Created attachment 278727 [details] bisect 2.6.391 and 2.6.39.2 sorry, I had too much beer yesterday :( This worked very well this time, I tested by attempting to copy a ~10GB folder over the network.
here is a nice log ^^

disi-bigtop linux-2.6.39 # git bisect log
git bisect start
# bad: [62b218cb13724881b5314f10ac0f177f4fdef8b6] Linux 2.6.39.2
git bisect bad 62b218cb13724881b5314f10ac0f177f4fdef8b6
# good: [cf29f916c310c9b13c19514b496700c549597e11] Linux 2.6.39.1
git bisect good cf29f916c310c9b13c19514b496700c549597e11
# good: [cf29f916c310c9b13c19514b496700c549597e11] Linux 2.6.39.1
git bisect good cf29f916c310c9b13c19514b496700c549597e11
# bad: [a4d37345244dea111a49dda25cc30b2ae7dab05c] x86/amd-iommu: Use only per-device dma_ops
git bisect bad a4d37345244dea111a49dda25cc30b2ae7dab05c
# bad: [0db9466ed48263ab2951e89240b482912695c4a6] iwl4965: fix 5GHz operation
git bisect bad 0db9466ed48263ab2951e89240b482912695c4a6
# bad: [646543453327a2b85083f4012d3bbeb5dabdabb8] arch/tile: allocate PCI IRQs later in boot
git bisect bad 646543453327a2b85083f4012d3bbeb5dabdabb8
# good: [3a2bc9ae5ee092a0db8aa07d695e15b14a3fe2a4] intel-iommu: Speed up processing of the identity_mapping function
git bisect good 3a2bc9ae5ee092a0db8aa07d695e15b14a3fe2a4
# bad: [b8f794de1463ab32ed90c97ad6edbcecd931abed] intel-iommu: Remove Host Bridge devices from identity mapping
git bisect bad b8f794de1463ab32ed90c97ad6edbcecd931abed
# bad: [80ebe0ace73cb376f66bdeeb92f4e7b5d4a3f8fb] intel-iommu: Use coherent DMA mask when requested
git bisect bad 80ebe0ace73cb376f66bdeeb92f4e7b5d4a3f8fb
# bad: [87cc4d1e3e05af38c7c51323a3d86fe2572ab033] intel-iommu: Dont cache iova above 32bit
git bisect bad 87cc4d1e3e05af38c7c51323a3d86fe2572ab033
Ok, so reverting this commit resolves your problem? Marc can you confirm that?
(In reply to comment #14)
> Ok, so reverting this commit resolves your problem?
>
> Marc can you confirm that?

I haven't tried to remove a single patch and build 2.6.39.2, but this is what bisect did for me, and I only said good if I was able to copy the complete 10GB folder over the network on the command line without a keyboard freeze. As you can see in the attached log, that worked 3 times during the bisect.

When it was bad, the input devices were interrupted after about 1min of copying over a Gigabit network. Constantly hitting Ctrl+C stopped the copying after ~30 seconds and the input devices (keyboard) slowly regained control, so I could do the next bisect step.

Actually, I would have to read up on how to do this with patch -? blubb.diff etc. :)

Still running 2.6.39.1; yesterday I copied (in Terminal running the xfce4 desktop) the whole CentOS 6.0 DVD release (~5GB) over the network, no problems.
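On the question of how to take a single commit back out: when the tree is a git checkout, `git revert` is usually simpler than hand-applying a reversed diff. The sketch below shows the idea on a throwaway repository; the file names and contents are invented for illustration and only stand in for the kernel sources. In the kernel tree the equivalent would be checking out the release tag and reverting the suspect commit hash before rebuilding.

```shell
#!/bin/sh
# Toy example: revert one commit in the middle of history, keep later ones.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"
git config user.name "revert demo"

# Baseline state (stands in for the last known-good release).
echo "cache any iova" > policy.txt
git add policy.txt
git commit -q -m "baseline"

# The suspect change.
echo "do not cache iova above 32bit" > policy.txt
git commit -q -am "suspect change"
suspect=$(git rev-parse HEAD)

# A later, unrelated change that must survive the revert.
echo "unrelated" > other.txt
git add other.txt
git commit -q -m "later unrelated change"

# Create a new commit that undoes only the suspect change.
git revert --no-edit "$suspect" > /dev/null

result=$(cat policy.txt)
echo "policy.txt after revert: $result"
```

The history keeps both the suspect commit and its revert, which makes it easy to state exactly what was tested when reporting upstream.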
(In reply to comment #14)
> Ok, so reverting this commit resolves your problem?
>
> Marc can you confirm that?

Yes.

To be sure, I bisected the kernels myself: v2.6.39.1 vs. v2.6.39.2
Result: same bisect log

Then I checked out v2.6.39.2 again, reverted commit 87cc4d1e3e05af38c7c51323a3d86fe2572ab033, rebuilt the kernel and tested again. So, by reverting this single commit the kernel is working again.

My test case:
0) be in console only, no X
1) make sure eth0 is at 1 GBit/s (which is often 100MB/s here...)
2) mount an nfsv3 share
3) pv /path/to/big/file/on/nfs > /dev/null

For working kernels this worked fine for a 3GB file.
For broken kernels this always stopped at about 1.3GB and the machine was frozen after that (only SysRq reboot possible).

So the bad commit is:

commit 87cc4d1e3e05af38c7c51323a3d86fe2572ab033
Author: Chris Wright <chrisw@sous-sol.org>
Date:   Sat May 28 13:15:04 2011 -0500

    intel-iommu: Dont cache iova above 32bit

    commit 1c9fc3d11b84fbd0c4f4aa7855702c2a1f098ebb upstream.

    Mike Travis and Mike Habeck reported an issue where iova allocation would return a range that was larger than a device's dma mask.

    https://lkml.org/lkml/2011/3/29/423

    The dmar initialization code will reserve all PCI MMIO regions and copy those reservations into a domain specific iova tree. It is possible for one of those regions to be above the dma mask of a device. It is typical to allocate iovas with a 32bit mask (despite device's dma mask possibly being larger) and cache the result until it exhausts the lower 32bit address space. Freeing the iova range that is >= the last iova in the lower 32bit range when there is still an iova above the 32bit range will corrupt the cached iova by pointing it to a region that is above 32bit. If that region is also larger than the device's dma mask, a subsequent allocation will return an unusable iova and cause dma failure.

    Simply don't cache an iova that is above the 32bit caching boundary.

    Reported-by: Mike Travis <travis@sgi.com>
    Reported-by: Mike Habeck <habeck@sgi.com>
    Acked-by: Mike Travis <travis@sgi.com>
    Tested-by: Mike Habeck <habeck@sgi.com>
    Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

:040000 040000 fdd2ca77df8333e2888f326c7ea26b6d7dbcc2c1 fe5353f31fc5d54a5068517e07c533c2e59d9f42 M drivers
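The read test from the test case above can be wrapped in a small helper so that a stalled transfer shows up as a failure instead of a hanging shell. This is only a sketch: `read_test` is a hypothetical name, it swaps `pv` for plain `cat` (so there is no progress display), and the 600 second default limit is an arbitrary assumption. On a kernel that freezes the whole machine the timeout obviously cannot fire; it only helps where the copy stalls while the system stays responsive.

```shell
#!/bin/sh
# Read a file end to end and discard the data; report FAIL if the read
# errors out or takes longer than the limit.
# Usage: read_test FILE [TIMEOUT_SECONDS]
read_test() {
    file=$1
    limit=${2:-600}   # assumed default, adjust to file size and link speed
    if timeout "$limit" cat "$file" > /dev/null; then
        echo "PASS"
    else
        echo "FAIL"
        return 1
    fi
}
```

A run against the NFS share would then look like `read_test /path/to/big/file/on/nfs 300`, repeated a few times per kernel before marking it good.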
Ok, Marcus, do you want to notify upstream about the guilty commit? Either report at bugzilla.kernel.org, and/or send email to linux-kernel@vger.kernel.org and stable@kernel.org. Be sure to include your dmesg, config, and bisect log, and CC kernel@gentoo.org. Thanks.
(In reply to comment #17)
> Ok, Marcus do you want to notify upstream about the guilty commit? Either
> report at bugzilla.kernel.org, or/and send email at
> linux-kernel@vger.kernel.org, and stable@kernel.org.
>
> Be sure to include your dmesg, config, and bisect log, and CC
> kernel@gentoo.org.
>
> Thanks.

Done and totally scared now, don't want any trouble with Linus :(

https://bugzilla.kernel.org/show_bug.cgi?id=39312

I also sent an email as you suggested...
Thank you guys for pinpointing the root cause! :)
Marcus, can you please add kernel@gentoo.org to the CC list at bugzilla.kernel.org?
We're going to follow the upstream bug, and reflect any updates/changes here.
Created attachment 280345 [details, diff] Do not use DMA address over 32bit range

On second thought, I haven't heard of any other bug report suggesting that high addresses make other hardware (which also uses high addresses) unstable. I suspect there might be something wrong with the JMicron Ethernet hardware.

This patch tries not to use addresses over the 32bit range, to see if that works.

Could anyone help me test this patch? I cannot reproduce the issue here.
I can confirm this problem with my laptop (Clevo P150HM), which uses the JMicron JMC250 PCIE GigE Controller (rev 05). Any large copies (using KDE and Dolphin) cause the sluggish system behavior described, and the copy starts off at ~70-80mb/s but after a couple of seconds throttles down to about ~25k/s. This was with kernel 2.6.39-r2 previously and is with 2.6.39-r3 currently, and the jme 1.0.8 driver, which appears to have the "Do not use DMA.." patch code already incorporated in the 2.6.39-r3 tree.

The strange thing is that if I turn off the wired ethernet, just use WiFi (Intel 6230), and try to copy the same files the same way, the network copy performance is poor (about 700K), but there is no hang of the overall system. That behavior only happens with the JMC250.
(In reply to comment #22)
> Created attachment 280345 [details, diff]
> Do not use DMA address over 32bit range
>
> At second thought, I haven't heard any other bug report
> that suggest the high address cause other hardware(which
> also use high address) unstable. I suspect there might be
> something wrong with the JMicron Ethernet hardware.
>
> Trying not to use the Address over 32bit range see if it works.
>
> Could anyone help me testing this patch?
> I can not reproduce the issue here.

Hi Guo-Fu,

thanks for the patch. I tested it against the 2.6.39-r3 gentoo kernel.

I am sorry, this does not fix the issue, but the behavior of the system during failure is different:

The machine does not freeze anymore: input stays responsive all the time while doing the test.

But: the network copy still stops at the same point, after about 1.3G have been copied. Hitting Ctrl-C then makes the blocked process stop after about a minute or so, but I can switch to another tty and log in without problems all the time.

The interesting part now may be the kernel messages that I am seeing then (complete file attached after this post):

reg 3
DMAR:[DMA Read] Request device [03:00.0] fault addr 0
DMAR:[fault reason 06] PTE Read access is not set
DRHD: handling fault status reg 3
[... many times]
------------[ cut here ]------------
WARNING: at drivers/pci/intel-iommu.c:2761 intel_unmap_page+0x14c/0x180()
Hardware name: P150HMx
Driver unmaps unmatched page at PFN 0
Modules linked in: tun ip6_tables iptable_filter ip_tables x_tables coretemp nfsd ipv6 microcode snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss sha256_generic aesni_intel cryptd aes_x86_64 aes_generic cbc kvm_intel kvm acpi_cpufreq mperf snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device snd_hda_codec_hdmi snd_hda_codec_realtek arc4 snd_hda_intel ecb snd_hda_codec snd_hwdep snd_pcm snd_timer iwlagn mac80211 snd cfg80211 firewire_ohci sdhci_pci sdhci mmc_core rfkill uvcvideo videodev pcspkr video backlight snd_page_alloc intel_agp intel_gtt i2c_i801 media firewire_core battery agpgart processor rtc_cmos jme mii ac thermal button xhci_hcd v4l2_compat_ioctl32 i2c_core rtc_core rtc_lib scsi_transport_iscsi fuse nfs nfs_acl auth_rpcgss lockd sunrpc zlib_deflate raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech usbhid ohci_hcd uhci_hcd usb_storage ehci_hcd usbcore sg ata_piix ahci libahci pata_pcmcia pcmcia pcmcia_core pata_mpiix libata
Pid: 0, comm: swapper Tainted: G W 2.6.39-gentoo-r3 #1
Call Trace:
 <IRQ> [<ffffffff8104239b>] ? warn_slowpath_common+0x7b/0xc0
 [<ffffffff81042495>] ? warn_slowpath_fmt+0x45/0x50
 [<ffffffff813025ac>] ? intel_unmap_page+0x14c/0x180
 [<ffffffffa0156f79>] ? jme_free_rx_resources+0x69/0x1a0 [jme]
 [<ffffffffa015a3f3>] ? jme_link_change_tasklet+0x583/0xec0 [jme]
 [<ffffffff810488de>] ? tasklet_action+0x5e/0x100
 [<ffffffff81048f40>] ? __do_softirq+0xa0/0x1b0
 [<ffffffff8109d39f>] ? handle_irq_event_percpu+0x9f/0x1f0
 [<ffffffff8149840c>] ? call_softirq+0x1c/0x30
 [<ffffffff810048dd>] ? do_softirq+0x4d/0x80
 [<ffffffff810492b6>] ? irq_exit+0x96/0xb0
 [<ffffffff8100455c>] ? do_IRQ+0x5c/0xd0
 [<ffffffff81496c13>] ? common_interrupt+0x13/0x13
 <EOI> [<ffffffff81496c0e>] ? common_interrupt+0xe/0x13
 [<ffffffff813d1b27>] ? poll_idle+0x17/0x70
 [<ffffffff813d1b1a>] ? poll_idle+0xa/0x70
 [<ffffffff813d1c2b>] ? cpuidle_idle_call+0xab/0x1f0
 [<ffffffff81001216>] ? cpu_idle+0x96/0xe0
 [<ffffffff816d0b64>] ? start_kernel+0x394/0x39f
 [<ffffffff816d040f>] ? x86_64_start_kernel+0xf4/0xfa
---[ end trace 7b8527fe8e683c20 ]---
jme 0000:03:00.0: eth0: Link is down
jme 0000:03:00.0: eth0: Link is up at ANed: 1000 Mbps, Full-Duplex, MDI
Allocating 1-page iova for 0000:03:00.0 failed
Device 0000:03:00.0 request: 1@225583840 dir 2 --- failed
Allocating 1-page iova for 0000:03:00.0 failed
Device 0000:03:00.0 request: 1@225583040 dir 2 --- failed
Allocating 1-page iova for 0000:03:00.0 failed
Device 0000:03:00.0 request: 1@224969840 dir 2 --- failed
Allocating 1-page iova for 0000:03:00.0 failed
[... many times]
jme: Allocating resources for TX error, Device STOPPED!
jme 0000:03:00.0: eth0: Link is down
jme 0000:03:00.0: eth0: Link is up at ANed: 1000 Mbps, Full-Duplex, MDI
Allocating 1-page iova for 0000:03:00.0 failed
Device 0000:03:00.0 request: 1@225e51040 dir 2 --- failed
Allocating 1-page iova for 0000:03:00.0 failed
Device 0000:03:00.0 request: 1@1bef6a840 dir 2 --- failed
[... and so on]
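Since these messages repeat many times, it can help to count them instead of reading through the log. The helper below is a rough sketch: `count_iommu_errors` is a hypothetical name, and the two patterns are taken only from the message texts quoted in this comment, so other kernel versions may word them differently.

```shell
#!/bin/sh
# Count DMAR fault lines and failed iova allocations in a saved kernel log.
# Usage: count_iommu_errors LOGFILE
count_iommu_errors() {
    log=$1
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the count without aborting under "set -e".
    faults=$(grep -c 'DMAR:\[fault reason' "$log" || true)
    alloc=$(grep -c 'Allocating .* iova for .* failed' "$log" || true)
    echo "dmar_faults=$faults iova_alloc_failures=$alloc"
}
```

For example, `dmesg > dmesg.txt` followed by `count_iommu_errors dmesg.txt` gives two numbers that can be compared between kernels or patch versions.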
Created attachment 280361 [details] dmesg with patch applied kernel messages that appeared during test with applied patch from comment #22
Created attachment 280391 [details] /var/log/messages with patch applied

I shortened the attachment (it was 40mb :)). The patch didn't lock up the input but flooded the logfile with those errors, and I lost the device twice during the copy. It automatically brought the device back up and acquired an IP via dhcp. Then it wouldn't get an IP any more and I gave up...
(In reply to comment #24)
> (In reply to comment #22)
> > Created attachment 280345 [details, diff]
> > Do not use DMA address over 32bit range
> >
> > At second thought, I haven't heard any other bug report
> > that suggest the high address cause other hardware(which
> > also use high address) unstable. I suspect there might be
> > something wrong with the JMicron Ethernet hardware.
> >
> > Trying not to use the Address over 32bit range see if it works.
> >
> > Could anyone help me testing this patch?
> > I can not reproduce the issue here.
>
> Hi Guo-Fu,
>
> thanks for the patch. I tested it against the 2.6.39-r3 gentoo kernel.
>
> I am sorry, but this does not fix the issue, but I the behavior of the system
> during failure is different:
>
> The machine does not freeze anymore: Input responsive all the time while doing
> the test.

Right, this is because the scan is both smaller and starting from a cached point.

> But: The network copy still stops at the same time after about 1.3G have been
> copied. Hitting Ctrl-C then makes the blocked process stop after about a minute
> or so but I can switch to another tty and login without problems all the time.
>
> The interesting part now may be the kernel messages that I am seeing then:
> (complete file attached after this post)
>
> reg 3
> DMAR:[DMA Read] Request device [03:00.0] fault addr 0

This is showing that the driver failed to allocate a dma mapping in the IOMMU. The driver told the device to DMA to address 0, but there is no mapping for that address. The driver can catch this by checking the return value of pci_map_page() with pci_dma_mapping_error().

However, this is just a symptom. I believe the cause is the driver not unmapping dma descriptors correctly.

Guo-Fu Tseng, can you review the unmapping path carefully? I think we're missing one descriptor per tx unmap cycle.
Created attachment 280423 [details] unmap fiirst descriptor, not just the frags This is an example of what I'm referring to.
Thank you Chris! Your information is very useful. I should check the return value, and I did miss an unmap!

Michał Mirosław sent a patch to lkml-netdev on Jul 11 which also pointed out the unmap issue: "[PATCH v2 10/46] net: jme: convert to generic DMA API"

I'll soon format a patch and submit it to lkml-netdev.

Thank you all for the help! :)
I can confirm that the Michał Mirosław patch, entitled "[PATCH v2 10/46] net: jme: convert to generic DMA API", referenced by Guo-Fu, that was sent to the lkml-netdev mailing list on 07/11/11, shown here:

http://www.spinics.net/lists/netdev/msg169620.html

does indeed fix this problem for my system. I can now do full rate copies with no system sluggishness.

Thanks all..
(In reply to comment #30)
> I can confirm that the Michał Mirosław patch, entitled "[PATCH v2 10/46] net:
> jme: convert to generic DMA API", referenced by Guo-Fu, that was sent to to
> lkml-netdev mailing list on 07/11/11, shown here;
>
> http://www.spinics.net/lists/netdev/msg169620.html
>
> does indeed fix this problem for my system. I can now do full rate copies with
> no system sluggishness.
>
> Thanks all..

Thank you for testing. However, Michał Mirosław's patch is _NOT_CORRECT_. I'll post another one soon.
Created attachment 280475 [details, diff] DMA unmap fix I haven't got time to run the basic test. Kind of busy recently. But according to the report, I believe this patch should fix the issue. Could anyone kindly help me test it?
Created attachment 280477 [details, diff] DMA unmap fix Just adding compiler hint against last patch.
(In reply to comment #33)
> Created attachment 280477 [details, diff]
> DMA unmap fix
>
> Just adding compiler hint against last patch.

Stupid question :) could you create the diff against gentoo-sources-r2 or 3 or will it go into r4?
(In reply to comment #33)
> Created attachment 280477 [details, diff]
> DMA unmap fix
>
> Just adding compiler hint against last patch.

Hi Guo-Fu,

The patch applied with one hunk for me against vanilla 2.6.39.3

The issue seems to be fixed for me with that patch.

I copied the 3GB file via NFS several times now without any problem and at full speed (~88 MB/s) which did not work a single time before.

Thanks!

-Marc
Only submitted upstream patches from Linus' tree go into gentoo-sources. http://dev.gentoo.org/~mpagano/genpatches/faq.htm
(In reply to comment #35)
> (In reply to comment #33)
> > Created attachment 280477 [details, diff]
> > DMA unmap fix
> >
> > Just adding compiler hint against last patch.
>
> Hi Guo-Fu,
>
> The patch applied with one hunk for me against vanilla 2.6.39.3
>
> The issue seems to be fixed for me with that patch.
>
> I copied the 3GB file via NFS several times now without any problem and at full
> speed (~88 MB/s) which did not work a single time before.
>
> Thanks!
>
> -Marc

Thank you Marc! I'll submit the patch to the upstream kernel today. :)
I've submitted it to netdev. Here is the status of the patch: http://patchwork.ozlabs.org/patch/105878/
I can also confirm that Guo-Fu's patch, from; http://patchwork.ozlabs.org/patch/105878/ also applied to gentoo-sources-2.6.39-r3 without issue. It has resolved any issues that I had with both performance and system lag, when doing large copies. Thanks Guo.
(In reply to comment #39)
> I can also confirm that Guo-Fu's patch, from;
>
> http://patchwork.ozlabs.org/patch/105878/
>
> also applied to gentoo-sources-2.6.39-r3 without issue. It has resolved any
> issues that I had with both performance and system lag, when doing large
> copies.
>
> Thanks Guo.

same here with clean gentoo-sources-2.6.39-r3 sources, copied ~6GB over the network via nfs, no problems. Thanks :)