Bug 225863 - gentoo-sources-2.6.25-r4 fails on Phenom when doing intensive IO through eth0 with IOMMU error and marks every accessed device as readonly...
Summary: gentoo-sources-2.6.25-r4 fails on Phenom when doing intensive IO through eth0...
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: High normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
Reported: 2008-06-11 10:02 UTC by brankob
Modified: 2008-12-03 07:45 UTC

Attachments
emerge --info (eminfo.txt, 4.68 KB, text/plain), 2008-06-11 10:03 UTC, brankob
kernel config (Brane25.cfg, 64.06 KB, text/plain), 2008-06-11 11:46 UTC, brankob
dmesg output (dmesg_out.txt, 56.11 KB, text/plain), 2008-06-11 14:03 UTC, brankob
lspci (lspci_out.txt, 1.79 KB, text/plain), 2008-06-11 14:04 UTC, brankob

Description brankob 2008-06-11 10:02:30 UTC
After replacing the motherboard and CPU to upgrade from an X2 6000+ to a Phenom 9850, I get an error when copying large files from the internet (ftp, http) to a local disk.

Some time into the copy, it stops, I can't write anything to any local disk, and the network is inaccessible. That is, I can run ifconfig eth0, but I can't send anything over it.

dmesg reveals error:

"PCI-DMA: Out of IOMMU space for 7222 bytes at device 0000:03:06"

This line is repeated many times, with one or two variations in the device number.

It seems that the IOMMU address space runs out at some point, and any device that then needs it for IO triggers an error and gets marked read-only.
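
To put numbers on "listed many times", a small script could tally the messages per device from saved dmesg output. This is only a sketch: the pattern is modeled on the single line quoted above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one message quoted in this report (an assumption
# about the exact wording of other occurrences).
IOMMU_ERR = re.compile(
    r"PCI-DMA: Out of IOMMU space for (\d+) bytes at device ([0-9a-fA-F:.]+)"
)

def summarize_iommu_errors(dmesg_text):
    """Return {device: (count, largest_request_bytes)} for matching lines."""
    counts = Counter()
    largest = {}
    for match in IOMMU_ERR.finditer(dmesg_text):
        size, device = int(match.group(1)), match.group(2)
        counts[device] += 1
        largest[device] = max(largest.get(device, 0), size)
    return {dev: (counts[dev], largest[dev]) for dev in counts}

sample = (
    "PCI-DMA: Out of IOMMU space for 7222 bytes at device 0000:03:06\n"
    "PCI-DMA: Out of IOMMU space for 7222 bytes at device 0000:03:06\n"
)
print(summarize_iommu_errors(sample))
```

Running it over the attached dmesg output would show which devices hit the limit and how large the failing requests were.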

Since it happens when copying big files from the net to the drive, that leaves eth0 and /dev/sda as the main suspects.

I have successfully tried dd if=/dev/zero of=/some_file_on_disk bs=1048576 count=4096. It works fine.

That leaves eth0 and/or the IOMMU infrastructure. This error happens with the onboard Realtek 8111C as well as with an R8169 in a PCI slot. Both use the same driver, r8169.

I have also tried various kernel parameters, like iommu=memaper=X,allowdac,merge with X being 4, 5, 6, and with or without allowdac and merge. I also tried to use a size of 134217728 (=128M) as the first parameter, with no positive change.
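
For reference, the sizes behind those parameters can be worked out as below. The sketch assumes the rule from the kernel's x86_64 boot-options notes that memaper=<order> allocates an aperture of 32MB << order; that rule is an assumption here and should be checked against the documentation of the kernel in use.

```python
MIB = 1024 * 1024

def memaper_bytes(order):
    """Aperture size for iommu=memaper=<order>, ASSUMING 32 MiB << order."""
    return (32 * MIB) << order

# The three orders tried in the report.
for order in (4, 5, 6):
    print(f"memaper={order}: {memaper_bytes(order) // MIB} MiB")

# The explicit byte size tried as the first iommu= parameter.
print(f"iommu=134217728: {134217728 // MIB} MiB")
```

Under that assumption the tested orders correspond to 512 MiB, 1024 MiB and 2048 MiB, and the explicit size is 128 MiB.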

I have also noticed that the AGP aperture eats away at my RAM (I have 4GB). Does it have to be this way? Can't I move it somewhere above 4GB (and the rest of the IOMMU stuff too)?

My machine is:
CPU: Phenom 9850
Board: MSI K9A2GM-FIH (http://global.msi.com.tw/index.php?func=proddesc&prod_no=1436&maincat_no=1&cat2_no=171)
Graphics card: 8800GT with 1GB RAM
RAM: 4x1GB
Disks: main on first SATA, DVD and Windows ATA disk on first/only IDE port.
Onboard ethernet (RTL 8111C) has died on me, so I have an RTL-8169 card in a PCI slot. Both use the same driver (r8169).
BIOS: Latest, v1.2
Onboard LAN: Disabled in BIOS
Machine is not overclocked

Reproducible: Always

Steps to Reproduce:
1. Get a quad-core Phenom and a board for it with a Realtek LAN chip (if that is the cause)
2. Install Gentoo on it
3. Copy a large file through the network

Actual Results:  
dmesg: 

...
"PCI-DMA: Out of IOMMU space for 7222 bytes at device 0000:03:06"

Expected Results:  
Successful copy

I will attach the output of emerge --info.
Comment 1 brankob 2008-06-11 10:03:25 UTC
Created attachment 156331
emerge --info
Comment 2 brankob 2008-06-11 11:44:46 UTC
Since this is about the kernel, I have attached its config file...

Comment 3 brankob 2008-06-11 11:46:21 UTC
Created attachment 156337
kernel config
Comment 4 Duane Griffin 2008-06-11 13:25:29 UTC
Could you attach dmesg and /usr/sbin/lspci output, please.
Comment 5 brankob 2008-06-11 14:03:42 UTC
Created attachment 156361
dmesg output
Comment 6 brankob 2008-06-11 14:04:00 UTC
Created attachment 156363
lspci
Comment 7 brankob 2008-06-11 14:54:56 UTC
I have just replaced the R8169 card with an R8139 card (100 Mbit Realtek).

So far it is working fine, which points to the driver as a probable cause.

The data transfer rate of the test transfer is about the same as before, since the bottleneck is my DSL, which limits transfers to some 900 KB/s in both cases.

One notable difference is that the old 8139 card has mtu 1500 while the newer 8169 had mtu 7200. The server/gateway through which this machine is connected has mtu 9000 on all internal interfaces.
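
One arithmetic observation, offered as a hint rather than a diagnosis: the 7222 bytes in the error message equals the 7200-byte MTU plus 22 bytes of Ethernet framing (header, optional VLAN tag, frame check sequence). Whether the r8169 driver actually sizes its DMA buffers this way is an assumption not confirmed anywhere in this report.

```python
ETH_HLEN = 14      # Ethernet header: two MAC addresses plus EtherType
VLAN_TAG_LEN = 4   # optional 802.1Q VLAN tag
ETH_FCS_LEN = 4    # frame check sequence (CRC)

def max_frame_bytes(mtu):
    """Largest on-wire frame for a given MTU, including a VLAN tag."""
    return mtu + ETH_HLEN + VLAN_TAG_LEN + ETH_FCS_LEN

print(max_frame_bytes(7200))  # 7222, the size in the dmesg error
print(max_frame_bytes(1500))  # 1522
```

If the assumption holds, the failing requests are single full-size receive buffers, which would fit with the MTU setting changing the behaviour.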

I'll try the 8169 with mtu 1500 later...
Comment 8 Duane Griffin 2008-06-11 15:57:16 UTC
The error is being triggered by the ethernet driver (at least in the dmesg supplied) but that doesn't necessarily mean it is to blame, as such. You may be just running out of IOMMU space because of the amount of in-flight IO. If that is the case changing the mtu may well make a difference. You could also try increasing the iommu size again.

From the looks of things your BIOS has some problems, the AGP aperture not being set up properly and hence eating into your memory being one of them. Does it have any options regarding the IOMMU? What about memory hole mapping? You may want to try playing with them, if so.
Comment 9 brankob 2008-06-11 16:11:54 UTC
The Realtek 8169 seems to work fine with MTU 1500. The largest MTU I have successfully tried is 3600, and 4070 is the smallest non-working one I found, so the border is somewhere in between.
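
The gap between 3600 and 4070 could be narrowed by bisection, which needs at most nine more test transfers. The sketch below assumes the behaviour is monotonic (every MTU below the border works, every one above fails); `works` is a hypothetical callback that would set the MTU, run the test transfer and report success.

```python
def find_mtu_threshold(works, good, bad):
    """Narrow (good, bad) until the two MTU values are adjacent.

    `works(mtu)` is a hypothetical callback: it should set the MTU, run the
    test transfer, and return True if the transfer completed.
    """
    if not good < bad:
        raise ValueError("need good < bad")
    while bad - good > 1:
        mid = (good + bad) // 2
        if works(mid):
            good = mid   # border is above mid
        else:
            bad = mid    # border is at or below mid
    return good, bad

# Stand-in for a real test: pretend the border is at 4000 (made-up value).
print(find_mtu_threshold(lambda mtu: mtu <= 4000, 3600, 4070))
```

In practice each call to `works` means a reboot after a failure, so the callback would more likely be a person writing down results than an automated function.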



(In reply to comment #8)
> just running out of IOMMU space because of the amount of in-flight IO. If that
> is the case changing the mtu may well make a difference. You could also try
> increasing the iommu size again.

I have tried using 128M for IOMMU mappings as well as up to 1G for AGP aperture and it did not make any difference.

If it was due to inadequate IOMMU space, shouldn't enlarging it help things?

This is the second Phenom-capable board I have tried and also the second that shows this behaviour. Also, on the previous board the error disappeared when I replaced just the CPU and plugged in the X2 6000+ instead of the new Phenom. I did not try this on the new one, but I can if needed...

Comment 10 brankob 2008-06-11 16:15:41 UTC
(In reply to comment #8)
>
> it have any options regarding the IOMMU? What about memory hole mapping? You
> may want to try playing with them, if so.
> 

Nothing that I can find.
Neither on the last Gigabyte board nor on this MSI...
Comment 11 Duane Griffin 2008-06-11 17:07:29 UTC
> If it was due to inadequate IOMMU space, shouldn't enlarging it help things?

Yes, it should.

It might be worth enabling IOMMU_DEBUG in your config and booting with iommu=leak. That might help determine what is leaking DMA memory, if anything.
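
Before rebooting with iommu=leak, the kernel config can be checked for the relevant options. The sketch below parses config text for options set to y; CONFIG_IOMMU_DEBUG comes from the suggestion above, while the idea that leak tracing additionally needs CONFIG_IOMMU_LEAK is my understanding of the kernel's boot-options notes and should be treated as an assumption.

```python
def enabled_options(config_text, names):
    """Map each CONFIG_ name to True if the config text sets it to 'y'."""
    set_to_y = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if value == "y":
            set_to_y.add(key)
    return {name: name in set_to_y for name in names}

# Made-up config fragment for illustration, not taken from the attachment.
sample = """\
CONFIG_GART_IOMMU=y
CONFIG_IOMMU_DEBUG=y
# CONFIG_IOMMU_LEAK is not set
"""
print(enabled_options(sample, ["CONFIG_IOMMU_DEBUG", "CONFIG_IOMMU_LEAK"]))
```

The same function could be pointed at the contents of the attached kernel config to see which of the two options still need enabling.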

> This is the second Phenom-capable board I have tried and also the second that
> shows this behaviour. Also, on the previous board the error disappeared when I
> replaced just the CPU and plugged in the X2 6000+ instead of the new Phenom. I
> did not try this on the new one, but I can if needed...

Bizarre. Well, you have a work-around now with the MTU option, anyway.

The next step after trying leak tracing is probably posting on lkml. I've found a similar bug report on the kernel bugzilla which also implicates r8169. The reporter has different hardware otherwise, though:
http://bugzilla.kernel.org/show_bug.cgi?id=9468

If you want to report it to lkml you may want to include a link to that report, and perhaps CC the r8169 maintainer.
Comment 12 brankob 2008-06-11 17:20:40 UTC
(In reply to comment #11)
> > If it was due to inadequate IOMMU space, shouldn't enlarging it help things?
> 
> Yes, it should.

But it doesn't. 

I'll try with IOMMU_DEBUG...
Comment 13 Mike Pagano gentoo-dev 2008-07-14 13:57:17 UTC
Anything to report from your testing?
Comment 14 brankob 2008-07-14 14:11:19 UTC
(In reply to comment #13)
> Anything to report from your testing?
> 

Nope. Whatever I did, once I triggered the error it was too late for any kind of diagnostics, and I could only reboot, as the system was refusing to write anywhere.

So I settled for a pragmatic approach and simply lowered the MTU.

But since the kernel has changed a lot since then, maybe it's time to try it with the newest vanilla-2.6.26...

Wait a minute...


Comment 15 brankob 2008-07-14 14:40:55 UTC
(In reply to comment #14)

It behaves the same way as before. Copying a bunch of photos (some 6GB in total) over the network with MTU=7200 froze the machine at some 10-20% of the total, while with MTU=2048 the machine works without a problem...

The machine is the same as at the time of filing the bug, but Gentoo was updated, so gcc is now 4.3.1 and the kernel is vanilla-sources-2.6.26. I also updated the BIOS of the board to the latest at the moment, v1.4...
Comment 16 Daniel Drake (RETIRED) gentoo-dev 2008-12-02 18:00:11 UTC
Are you able to set up a serial console? This would be one way of logging the information when the system fails.

Alternatively, you could perhaps take a photo of the screen or write down the important parts on paper, assuming the output is small enough. (I have no idea what the output looks like, so this might be a silly suggestion)


Either way, I think the next steps are to reproduce this on the latest development kernel (currently v2.6.28-rc7) and then file a bug upstream against r8169. Please can you test that development kernel?
Comment 17 brankob 2008-12-03 07:45:32 UTC
(In reply to comment #16)
> Are you able to set up a serial console? This would be one way of logging the
> information when the system fails.

Not really, at least not without considerable hassle, which I can't afford right now. Also, I don't think it would make any difference. When the system fails, _everything_ becomes read-only and so basically unusable.
I don't think that having a console on the serial port would make a substantial difference...

Besides that, I have swapped the "old" MSI K9A2GM-FIH board for a newer Foxconn A7DA-S board (790GX+SB750), which has a Broadcom BCM5784 chip instead of the Realtek 8111.

Nevertheless, I have Realtek R8169 NIC in PCI slot, so just to test this out on my newest vanilla-2.6.27.7 kernel, I have set it up as my main NIC, set up MTU = 7200 and copied some 6GB worth of files through it without a lockup.

I guess I should mark the bug as "fixed" now, although I'm not 100% convinced.
Data transfer is awfully slow (some ~2MB/s), but I guess this might be due to something being misconfigured in samba on the server or on the client side...

> 
> Alternatively, you could perhaps take a photo of the screen or write down the
> important parts on paper, assuming the output is small enough. (I have no idea
> what the output looks like, so this might be a silly suggestion)
>

I did, but the output was not much to look at. After intensive googling around, I came to an nVidia + BIOS bug as a possible cause, through IOMMU misconfiguration.

There was some option in the nVidia driver to reserve some IOMMU area as unused that was supposed to help, but it did not make much of a difference for me.

So, now things seem to be working. What the exact cause was and when precisely it started, I don't know. I suspect the kernel, the nVidia driver and the BIOS, in roughly that order of probability...