Bug 301282

Summary:	xorg-server-1.7.4 + intel driver have "batch buffer" problems
Product:	Gentoo Linux	Reporter:	Robert Bradbury <robert.bradbury>
Component:	[OLD] Unspecified	Assignee:	Gentoo X packagers <x11>
Status:	RESOLVED UPSTREAM
Severity:	critical	CC:	arthapex, michael.pihlblad, throw_away_2002, vcunat
Priority:	High
Version:	unspecified
Hardware:	All
OS:	Linux
URL:	[too many to list here]
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	Excerpts of various log files outlining problem

Description Robert Bradbury 2010-01-17 14:40:20 UTC

It isn't clear what the problem is here. The glaring symptom is that all X terminals become completely unresponsive (no mouse activity, no keyboard activity).

Errors accumulate in the log files of the form:
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

Reproducible: Always

Steps to Reproduce:
1. Boot linux-2.6.32.
2. Start X 1.7.4 on Intel i915 hardware.
3. Work with the system for a while (1-2 days); X terminals eventually hang.

Actual Results:
This problem began on or around Jan 11 2010, around the time that I installed linux-2.6.32 and upgraded to xorg-server-1.7.4. It was never seen before then.

The obvious result is all X terminals hang and error messages start to appear in Xorg.0.log and kern.log.

Expected Results:
Kernel + Xorg-server should work as reliably as they did the last several months.

I have not tried to revert to earlier versions. As pointed out in Bug #301274, it's nearly impossible to fall back from xorg-server-1.7.4 until the portage ebuilds are fixed. I could boot linux-2.6.31, but it was easier to simply reconfigure X to run off the Radeon video card (which supports xrandr virtual desktops across 2 screens).

It's a glaring enough problem that someone else is bound to trip over it, so it's simply easier to let Google find the problem.

The problem is repeatable. I think I had to reboot the system 4 times over the last week until I reconfigured to use the Radeon card.

Attachment will have distilled log file information.

Comment 1 Robert Bradbury 2010-01-17 14:47:17 UTC

Created attachment 216736 [details]
Excerpts of various log files outlining problem

It looks like I first booted linux-2.6.32-gentoo on 091220, so I ran for 20+ days on 2.6.32 without encountering the problem.  It looks like the problem only came up after the upgrade to xorg-1.7.4.

Comment 2 Rémi Cardona (RETIRED) gentoo-dev 2010-01-19 00:30:21 UTC

Please try a newer git kernel RC.

Thanks

Comment 3 Robert Bradbury 2010-01-19 11:37:35 UTC

The problems take place under 2.6.32 and 2.6.32-r1.  Version 2.6.32-r2 was only released tonight. But it is unclear whether you are requesting working with that or some 2.6.33 variant (or some cutting edge from kernel.org).

I don't normally know how to use "git" (I'm a lot more comfortable with xyzzy.tar.gz; tar xzf xyzzy.tar.gz) and before I go down that road (potentially encountering more bugs), I'll spend my time backing off from 1.7.4 and freezing it there (even without the ebuilds) until Google searches indicate that others have encountered the bug and that it is fixed either in the Kernel code or the 1.7.4 code.

The endless parade of Xorg (and xf86-video-intel or gentoo-sources) upgrades aren't of much use if they are only generating more bugs for me to deal with.  Since the kernel KMS upgrades started working around 2.6.29 and the radeonhd driver got support for DRI on the R600 there isn't much point for me to pursue this until the following conditions are met:
1) DRI switching between VTs on the same hardware driver (e.g. radeonhd) -- right now DRI is tied to a single VT (usually the last Xorg (e.g. :2 or :3) started by Xserver and the first to grab the DRI interface).
2) DRI support and Xrandr support across two different video interfaces (e.g. Intel i915 on the motherboard (VGA) plus Radeon X3450 DVI + VGA to allow 3 monitor sessions all of which have DRI capabilities).

Until those are implemented there is no upside on my end (nor do I really have the time/interest in monitoring the Kernel/Xorg/Hardware-driver mailing lists/blogs to understand when these might become available).

Comment 4 Rémi Cardona (RETIRED) gentoo-dev 2010-01-19 20:42:42 UTC

(In reply to comment #3)
> The problems take place under 2.6.32 and 2.6.32-r1.  Version 2.6.32-r2 was only
> released tonight. But it is unclear whether you are requesting working with
> that or some 2.6.33 variant (or some cutting edge from kernel.org).
> 
> I don't normally know how to use "git" (I'm a lot more comfortable with
> xyzzy.tar.gz; tar xzf xyzzy.tar.gz) and before I go down that road (potentially
> encountering more bugs), I'll spend my time backing off from 1.7.4 and freezing
> it there (even without the ebuilds) until Google searches indicate that others
> have encountered the bug and that it is fixed either in the Kernel code or the
> 1.7.4 code.

emerge sys-kernel/git-sources

> The endless parade of Xorg (and xf86-video-intel or gentoo-sources) upgrades
> aren't of much use if they are only generating more bugs for me to deal with. 
> Since the kernel KMS upgrades started working around 2.6.29 and the radeonhd
> driver got support for DRI on the R600 there isn't much point for me to pursue
> this until the following conditions are met:
> 1) DRI switching between VTs on the same hardware driver (e.g. radeonhd) --
> right now DRI is tied to a single VT (usually the last Xorg (e.g. :2 or :3)
> started by Xserver and the first to grab the DRI interface).
> 2) DRI support and Xrandr support across two different video interfaces (e.g.
> Intel i915 on the motherboard (VGA) plus Radeon X3450 DVI + VGA to allow 3
> monitor sessions all of which have DRI capabilities).
> 
> Until those are implemented there is no upside on my end (nor do I really have
> the time/interest in monitoring the Kernel/Xorg/Hardware-driver mailing
> lists/blogs to understand when these might become available).

If you don't want to help debug, that's your decision. Your bug may or may not get fixed, there's just no way we can tell.

If you indeed want to see this fixed, you'll have to file a bug at FreeDesktop and upstream Intel devs _will_ ask you to test their latest code. That's just how things work because of the development speed. I'm not going to tell upstream to slow down...

So do as you please.

Thanks

Comment 5 Robert Bradbury 2010-01-20 14:14:08 UTC

I've investigated the kernel aspect of this a bit further and it seems to involve several of the drm/i915 source files (i915_dma.c, i915_gem.c, i915_irq.c) in the kernel. Interestingly the first error in kern.log is
drm:i915_hangcheck_elapsed...Hangcheck time elapsed... GPU hung.
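
For anyone trying to see which message shows up first and how often, a minimal sketch of a log scanner follows. It only looks for substrings taken from the two messages quoted in this report; the function names and the pattern labels are made up for illustration, and the log file locations are left to the caller.

```python
import re
from collections import Counter

# Substrings taken from the two error messages quoted in this report.
# The dictionary keys are illustrative labels, not driver terminology.
PATTERNS = {
    "batch_buffer": re.compile(r"Failed to submit batch buffer"),
    "gpu_hang": re.compile(r"i915_hangcheck_elapsed"),
}


def count_errors(lines):
    """Count how many lines match each known error pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def count_errors_in_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as handle:
        return count_errors(handle)
```

Calling count_errors_in_file() on the X server log and on the kernel log gives one count per pattern for each file.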

Looking back through my kernel sources, it appears that timers on the GPU, metered by DRM_I915_HANGCHECK_PERIOD, appeared in the 2.6.32 release (the files date back to Dec 2, 2009). The hangcheck period is 75 jiffies, which leaves the actual duration unclear. On systems where a jiffy is 10 ms, that would work out to 750 ms. But I've got my system CONFIG_HZ set to 300 (I think x86 systems normally use 100 Hz), so on my system 75*3.333 ms is only going to work out to ~250 ms. (At least if jiffies scale with HZ; that part I'm not clear about yet in my brief review of the documentation.)
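
The arithmetic above can be checked with a small sketch. A jiffy is one kernel timer tick, lasting 1/HZ seconds, so a period given in jiffies does scale with CONFIG_HZ. The value 75 is the one quoted in this comment and is assumed to be in jiffies; the function name is made up for illustration.

```python
def jiffies_to_ms(jiffies, hz):
    """Convert a jiffy count to milliseconds for a kernel tick rate of hz."""
    return jiffies * 1000.0 / hz


# Value quoted in the comment above, assumed to be expressed in jiffies.
HANGCHECK_PERIOD_JIFFIES = 75

for hz in (100, 250, 300, 1000):
    period_ms = jiffies_to_ms(HANGCHECK_PERIOD_JIFFIES, hz)
    print(f"HZ={hz:4d}: {period_ms:.0f} ms")
```

With HZ=100 this gives 750 ms and with HZ=300 it gives 250 ms, matching the figures above; at HZ=1000 the same 75 jiffies would be only 75 ms.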

Now, one thing which did attract my attention was the possibility of what else I was doing on the machine when the errors took place and had to be rebooted.
I don't have a precise recollection of these but the leading possibilities are:
1) Switching VTs.
2) Coming back to the computer after the screen had gone to black (power saving, either by gnome-screensaver or the hardware defaults)
3) System builds (large sequences of emerges).
4) Restarting a large chromium session (takes 15-20 minutes of maxed CPU combined with 10-15 minutes of maxed network I/O -- though it's a 10 Mbit Ethernet connected to a 1.5 Mbit DSL line, so it shouldn't be that bad).

So 1/2 may be an interaction in the driver with other code while 3/4 would suggest that high machine loads and whatever the DRM_I915_HANGCHECK_PERIOD turns out to be on my machine don't play well together.

But my next step would be to drop back to 2.6.31 which doesn't have the current HANGCHECK code in it and see if I have any problems in that environment. I probably won't do that until I see some indication that Xorg/Xrandr supports multiple screens across 2 hardware drivers or there is some significant improvement in the kernel i915/radeon driver that bears investigation. As the radeon driver works with xrandr across 2 screens it makes a bit more sense to work with that for now.

You may want to watch for people having problems with the HANGCHECK code in 2.6.32+ systems.

Comment 6 Rémi Cardona (RETIRED) gentoo-dev 2010-01-20 22:04:10 UTC

Again, this bug might already be fixed in 2.6.33...

Thanks

Comment 7 Robert Bradbury 2010-01-23 12:33:18 UTC

Remi, I have downloaded the "git-sources" (nice ebuild, that one), and as it turns out there are differences between some of the critical files, e.g. i915_irq.c and i915_gem.c, so the batch buffer / HANGCHECK_TIMER problems may have been fixed in 2.6.33. But I do not consider 2.6.33 to be a "lightweight" upgrade: a link check (find -links 1 after running my own cmp-and-link-if-equal script) suggests that 8000+ Linux source files differ between 2.6.32 and 2.6.33. That is *way* beyond a typical Linux upgrade, where one expects a few dozen to a few hundred changes (normally almost all Linux source files can be linked to identical previous versions).
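
The cmp-and-link script itself is not included in this report. As a rough stand-in, the sketch below counts differing files between two source trees by direct byte comparison instead of hard-linking identical files and running find -links 1. The function name is made up for illustration and the tree paths are whatever the caller supplies.

```python
import filecmp
import os


def count_changed_files(old_root, new_root):
    """Return (changed, added) counts for files in new_root relative to old_root."""
    changed = 0
    added = 0
    for dirpath, _dirnames, filenames in os.walk(new_root):
        rel_dir = os.path.relpath(dirpath, new_root)
        for name in filenames:
            new_path = os.path.join(dirpath, name)
            if not os.path.isfile(new_path):
                continue  # skip broken symlinks and other non-files
            old_path = os.path.join(old_root, rel_dir, name)
            if not os.path.isfile(old_path):
                added += 1
            elif not filecmp.cmp(old_path, new_path, shallow=False):
                # shallow=False forces a byte-by-byte content comparison
                changed += 1
    return changed, added
```

Called with the two unpacked kernel trees, it reports how many files changed and how many are new. Files deleted in the newer tree are not counted.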

I grew curious about this, and it looks like some of it may be due to simple function name changes, e.g. spin_lock_irqsave(...) -> raw_spin_lock_irqsave(...). But I am not about to go through 8000 diff listings to see which changes are substantive and which are not.

I would only suggest that testing the Intel drivers under a complex Firefox or Chrome session restore (which is going to push X very hard, which in turn will push the Intel driver very hard) is the best way to see whether the driver's parameters for telling a real hang apart from a CPU/GPU that is simply busy are set correctly. Normal desktop window operations are not going to really "stress" the window manager or the Linux hardware driver. They do not push the software/hardware enough. A complex browser session restore on the other hand (until the browser developers wise up enough to do a restore "lightly" -- about which I've filed several bug reports at Mozilla/Google) is the best way to really push on an X system and its associated drivers. A complex session restore (hundreds of tabs) will max the CPU for 10-20 minutes and the network for 5-15 minutes -- most of which time it is pushing on Xorg and indirectly pushing on the graphics driver.

Comment 8 Rémi Cardona (RETIRED) gentoo-dev 2010-01-23 15:52:19 UTC

No, it's definitely not a lightweight upgrade, but since upstream is moving really fast and working hard to push things into the Linux kernel, that's where fixes will be.

If you want to debug this further, you might even have to try the "drm-intel-next" kernel directly from Intel developers.

So again, there's nothing I can do to help here. I'm just a packager. I triage bugs and suggest new versions/options where the bugs may be gone. _Nothing_ more.

So if you want to get results, file a bug _upstream_ like I've told you many times already.

Thanks

Comment 9 Robert Bradbury 2010-01-25 00:44:54 UTC

Remi, understood. Since ...33 isn't supposed to go public until March (at least according to public discussions I have read), and given that I've already determined that the source of the problems may be questionable "enhancements" in the i915 driver as of ...32 and that my radeon driver works reasonably well, I'll refrain from exploring the cutting edge for now.

Comment 10 Rémi Cardona (RETIRED) gentoo-dev 2010-01-26 22:36:44 UTC

Just for the record, I know testing upstream git kernel sounds scary but in my experience, once -rc1 is done, things go a *lot* better. And after -rc2, I don't remember having any bugs compared to the final release.

Of course, this is just me on one laptop, but really, testing a new kernel is much, much safer than it looks.

Anyhow, feel free to ping me back once you test .33 (when it's out or before, your pick :) ).

Thanks

Comment 11 Michael Pihlblad 2010-03-18 00:38:23 UTC

I don't know if this helps much at this point, but anyway:

Problem persists in xf86-video-intel-2.10.0-1.

Problem solved by updating to the latest Intel driver version from git://anongit.freedesktop.org/git/xorg/driver/xf86-video-intel (as of 2010-03-18).

Driver homepage: http://intellinuxgraphics.org

Good luck!

regards Michael

Comment 12 Vladimír Čunát 2010-04-05 21:03:01 UTC

I've got kernel 2.6.33, xserver 1.7.6 and intel 2.11.0 driver and I sometimes get that batch-buffer error message as well.
I'll be glad to help to resolve this, because I'm getting too many X troubles lately, mostly rendering errors and blank screens (usually when scrolling a lot) and some freezes, too.

Comment 13 Rémi Cardona (RETIRED) gentoo-dev 2010-04-06 18:02:42 UTC

There are a bunch of buffer errors with some of the older chips (855 through 945). There are various bugs upstream to try to tackle this, but I fear this is going to be hard to fix. After discussing my own issues (855GM chip), it looks like upstream is using those chips far more than the Windows driver ever did (it doesn't have kernel memory management among other things) and Intel is slow to release the actual programming guide.

So please file bugs upstream and let the Intel devs know you have hardware you care about; that's pretty much all we can do here.

Thanks

Comment 14 Rémi Cardona (RETIRED) gentoo-dev 2010-04-06 18:03:40 UTC

Closing with the proper resolution. If you do open upstream bugs, don't hesitate to paste URLs here so I can track them.

Thanks

Comment 15 Carsten Lohrke (RETIRED) gentoo-dev 2010-04-09 22:45:02 UTC

*** Bug 314159 has been marked as a duplicate of this bug. ***

Comment 16 dan blum 2010-06-14 00:12:06 UTC

I got to this bug after I got a batch buffer error. The current driver release is 2.11. Is this driver good? If so, how do you install it? What does "resolved upstream" mean?

Does Gentoo have a suggestion on what to do?
(In reply to comment #13)
> There are a bunch of buffer errors with some of the older chips (855 through
> 945). There are various bugs upstream to try to tackle this, but I fear this is
> going to be hard to fix. After discussing my own issues (855GM chip), it looks
> like upstream is using those chips far more than the Windows driver ever did
> (it doesn't have kernel memory management among other things) and Intel is slow
> to release the actual programming guide.
> 
> So please file bugs upstream, let Intel devs you have hardware you care about,
> that's pretty much all we can do here.
> 
> Thanks
>