After exchanging the motherboard and CPU to upgrade from an X2 6000+ to a Phenom 9850, I get an error on the Phenom when copying large files from the internet (ftp, http) to a local disk. After some time the copy stops, I can't write anything to any local disk, and the network is inaccessible. That is, I can run ifconfig eth0, but I can't send anything over it.

dmesg reveals this error:

"PCI-DMA: Out of IOMMU space for 7222 bytes at device 0000:03:06"

This line is listed many times, with one or two variations in the device number. It seems that the IOMMU address space runs out at some point, and any device that then needs it for IO triggers an error and is left read-only.

Since it happens when copying big files from the net to the drive, that leaves eth0 and /dev/sda as the main suspects. I have successfully run dd if=/dev/zero of=/some_file_on_disk bs=1048576 count=4096, and it works fine. That leaves eth0 and/or the IOMMU infrastructure.

The error happens with the onboard Realtek 8111C as well as with an R8169 in a PCI slot. Both use the same driver, r8169.

I have also tried various kernel parameters, like iommu=memaper=X,allowdac,merge with X being 4, 5 or 6, and with or without allowdac and merge. I also tried to use the size 134217728 (=128M) as the first parameter, with no positive change.

I have also noticed that the AGP aperture eats away my RAM (I have 4 GB). Does it have to be this way? Can't I move it somewhere above 4 GB (and the rest of the IOMMU stuff too)?

My machine is:
CPU: Phenom 9850
Board: MSI K9A2GM-FIH ( http://global.msi.com.tw/index.php?func=proddesc&prod_no=1436&maincat_no=1&cat2_no=171 )
Graphics card: 8800GT with 1 GB RAM
RAM: 4x1GB
Disks: main disk on first SATA port, DVD and Windows ATA disk on the first/only IDE port.
Onboard ethernet (RTL 8111C) has died on me, so I have an RTL-8169 card in a PCI slot. Both use the same driver (r8169).
BIOS: latest, v1.2
Onboard LAN: disabled in BIOS
The machine is not overclocked.

Reproducible: Always

Steps to Reproduce:
1. Get a quad-core Phenom and a board for it with a Realtek LAN chip (if that is the cause)
2. Install Gentoo on it
3. Copy a large file through the network

Actual Results:
dmesg: ... "PCI-DMA: Out of IOMMU space for 7222 bytes at device 0000:03:06"

Expected Results:
Successful copy.

I will attach the output of emerge --info.
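Editorial sketch, not part of the original report: since the same message repeats many times with a small number of device addresses, a short script can tally which devices are hitting the limit. The regular expression is modelled only on the single message quoted above, and the function name `summarize_iommu_errors` is made up for illustration; other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the one message quoted in the report. Treat the exact
# wording as an assumption; other kernels may format it differently.
IOMMU_ERR = re.compile(
    r"PCI-DMA: Out of IOMMU space for (?P<size>\d+) bytes at device (?P<dev>\S+)"
)


def summarize_iommu_errors(log_text):
    """Count 'Out of IOMMU space' messages per (device, request size)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IOMMU_ERR.search(line)
        if match:
            key = (match.group("dev"), int(match.group("size")))
            counts[key] += 1
    return counts
```

One possible use is to save the dmesg output to a file, pass its contents to the function, and print `counts.most_common()` to see whether the errors come mostly from the network card or from other devices as well.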
Created attachment 156331 emerge --info
Since it is about kernel, I have attached its config file...
Created attachment 156337 kernel config
Could you attach the dmesg and /usr/sbin/lspci output, please?
Created attachment 156361 dmesg output
Created attachment 156363 lspci
I have just replaced the r8169 card with an R8139 card (100 Mbit Realtek). So far it is working fine, which points to the driver as the probable cause. The data transfer rate in the test transfer is about the same as before, since the bottleneck is my DSL line, which limits transfers to some 900 KB/s in both cases. One notable difference is that the old 8139 card has MTU 1500 while the newer 8169 had MTU 7200. The server/gateway through which this machine is connected has MTU 9000 on all internal interfaces. I'll try the 8169 with MTU 1500 later...
The error is being triggered by the ethernet driver (at least in the dmesg supplied) but that doesn't necessarily mean it is to blame, as such. You may be just running out of IOMMU space because of the amount of in-flight IO. If that is the case changing the mtu may well make a difference. You could also try increasing the iommu size again. From the looks of things your BIOS has some problems, the AGP aperture not being set up properly and hence eating into your memory being one of them. Does it have any options regarding the IOMMU? What about memory hole mapping? You may want to try playing with them, if so.
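Editorial sketch, not part of the original comment: a rough back-of-the-envelope check of the "in-flight IO" idea. It assumes the aperture size for iommu=memaper=<order> is 32 MB << order (as the kernel's x86-64 boot-options documentation describes it), a 4096-byte page, page-aligned buffers, and a receive ring of 256 entries. The ring size is purely illustrative and was not read from the r8169 driver, and all function names are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; the usual x86-64 page size
MIB = 1024 * 1024


def memaper_bytes(order):
    """Aperture size for iommu=memaper=<order>, taken as 32 MB << order."""
    return (32 * MIB) << order


def pages_spanned(offset_in_page, length):
    """Number of pages touched by a buffer starting at the given page offset."""
    return (offset_in_page + length + PAGE_SIZE - 1) // PAGE_SIZE


def in_flight_bytes(buffer_len, ring_entries, offset_in_page=0):
    """IOMMU space held if every ring entry has exactly one mapped buffer."""
    pages = pages_spanned(offset_in_page, buffer_len)
    return ring_entries * pages * PAGE_SIZE


# Illustrative numbers only: 7222-byte buffers (the size in the error
# message) and an assumed ring of 256 entries.
estimate = in_flight_bytes(7222, 256)
```

Under these assumptions the estimate comes to 2 MiB, against 512 MiB for memaper=4. If the assumptions are roughly right, buffers that are legitimately in flight would need only a small fraction of the aperture, which would fit the observation that enlarging it does not help and would make a leak of mappings worth checking.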
The Realtek 8169 seems to work fine with MTU 1500. The largest MTU I have successfully tried is 3600. 4070 is the smallest non-working value found, so the border is somewhere in between.

(In reply to comment #8)
> just running out of IOMMU space because of the amount of in-flight IO. If that
> is the case changing the mtu may well make a difference. You could also try
> increasing the iommu size again.

I have tried using 128M for IOMMU mappings as well as up to 1G for the AGP aperture, and it did not make any difference. If it was due to inadequate IOMMU space, shouldn't enlarging it help things?

This is the second Phenom-capable board I have tried and also the second that shows this behaviour. Also, on the previous board the error disappeared when I replaced just the CPU and plugged in the X2 6000+ instead of the new Phenom. I did not try this on the new board, but I can if needed...
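Editorial sketch, not part of the original comment: the gap between the last working MTU (3600) and the first failing one (4070) could be narrowed by bisection. The callback `works` is hypothetical; in practice it stands for a manual test (set the MTU, copy a large file, check dmesg), and a failed run may require a reboot. The sketch also assumes the behaviour is monotone, meaning every MTU below the threshold works and every MTU above it fails, which the thread does not establish.

```python
def find_mtu_threshold(works, lo, hi):
    """Bisect between a known-good MTU (lo) and a known-bad MTU (hi).

    Assumes works(lo) is True, works(hi) is False, and that the outcome
    is monotone in between. Returns (largest_good, smallest_bad).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid  # mid is fine, so the threshold lies above it
        else:
            hi = mid  # mid fails, so the threshold lies at or below it
    return lo, hi
```

Starting from 3600 and 4070, this needs at most nine test runs to pin the border down to a single byte.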
(In reply to comment #8)
> it have any options regarding the IOMMU? What about memory hole mapping? You
> may want to try playing with them, if so.

Nothing that I can find. Neither the last Gigabyte board nor this MSI has any...
> If it was due to inadequate IOMMU space, shouldn't enlarging it help things?

Yes, it should. It might be worth enabling IOMMU_DEBUG in your config and booting with iommu=leak. That might help determine what is leaking DMA memory, if anything.

> This is the second Phenom-capable board I have tried and also the second that
> shows this behaviour. Also, on the previous board the error disappeared when I
> replaced just the CPU and plugged in the X2 6000+ instead of the new Phenom. I
> did not try this on the new board, but I can if needed...

Bizarre. Well, you have a work-around now with the MTU option, anyway. The next step after trying leak tracing is probably posting on lkml.

I've found a similar bug report on the kernel bugzilla which also implicates r8169. The reporter has different hardware otherwise, though:

http://bugzilla.kernel.org/show_bug.cgi?id=9468

If you want to report it to lkml you may want to include a link to that report, and perhaps CC the r8169 maintainer.
(In reply to comment #11)
> > If it was due to inadequate IOMMU space, shouldn't enlarging it help things?
>
> Yes, it should.

But it doesn't. I'll try with IOMMU_DEBUG...
Anything to report from your testing?
(In reply to comment #13)
> Anything to report from your testing?

Nope. Whatever I did, once I had triggered the error it was too late for any kind of diagnostics, and I could only reboot, as the system was refusing to write anywhere. So I settled for a pragmatic approach and simply lowered the MTU.

But since the kernel has seen many changes since then, maybe it's time to try it with the newest vanilla-2.6.26... Wait a minute...
(In reply to comment #14)

It behaves the same way as before. Copying a bunch of photos (some 6 GB in total) over the network with MTU=7200 froze the machine at some 10-20% of the total, while with MTU=2048 the machine works without a problem...

The machine is the same as at the time of filing the bug, but Gentoo has been updated, so gcc is now 4.3.1 and the kernel is vanilla-sources-2.6.26. I also updated the BIOS of the board to the latest version at the moment, v1.4...
Are you able to set up a serial console? This would be one way of logging the information when the system fails. Alternatively, you could perhaps take a photo of the screen or write down the important parts on paper, assuming the output is small enough. (I have no idea what the output looks like, so this might be a silly suggestion) Either way, I think the next steps are to reproduce this on the latest development kernel (currently v2.6.28-rc7) and then file a bug upstream against r8169. Please can you test that development kernel?
(In reply to comment #16)
> Are you able to set up a serial console? This would be one way of logging the
> information when the system fails.

Not really, at least not without considerable hassle, which I can't afford right now. Also, I don't think it would make any difference. When the system fails, _everything_ becomes read-only and so basically unusable. I don't think that having a console on the serial port would make a substantial difference...

Besides that, I have swapped the "old" MSI K9A2GM-FIH board for a newer Foxconn (790GX+SB750) A7DA-S board, which has a Broadcom BCM5784 chip instead of the Realtek 8111.

Nevertheless, I have a Realtek R8169 NIC in a PCI slot, so just to test this on my newest vanilla-2.6.27.7 kernel, I set it up as my main NIC, set MTU = 7200 and copied some 6 GB worth of files through it without a lockup.

I guess I should mark the bug as "fixed" now, although I'm not 100% convinced. Data transfer is awfully slow (some ~2 MB/s), but I gather this might be due to something being misconfigured in samba on the server or on the client side...

> Alternatively, you could perhaps take a photo of the screen or write down the
> important parts on paper, assuming the output is small enough. (I have no idea
> what the output looks like, so this might be a silly suggestion)

I did, but the output was not much to look at. After intensive googling I came across an nVidia + BIOS bug as a possible cause, through IOMMU misconfiguration. There was some option in the nVidia driver to reserve some IOMMU area as unused that was supposed to help, but it did not make much of a difference for me.

So, now things seem to be working. What the exact cause was and when precisely it started, I don't know. I suspect the kernel, the nVidia driver and the BIOS, in roughly that order of probability...