451660 – =sys-kernel/gentoo-sources-3.7.2 - i915: render error detected, EIR: 0x00000010

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:
Whiteboard:	linux-3.7.x-regression
Keywords:

Depends on:
Blocks:

Reported:	2013-01-13 11:15 UTC by johan janez
Modified:	2013-07-04 13:29 UTC
CC List:	2 users

See Also:
Package list:
Runtime testing required:	---

Attachments
In the attached dmesg (kms-error.txt), the error seems to start at approximately line 700. (kms-error.txt, 49.06 KB, text/plain) 2013-01-13 11:15 UTC, johan janez

Description johan janez 2013-01-13 11:15:20 UTC

Created attachment 335442
In the attached dmesg (kms-error.txt), the error seems to start at approximately line 700.

With kernels >=sys-kernel/gentoo-sources-3.7.1, with i915 and KMS enabled in the config, the screen goes blank about a second after boot, with what seems to be full backlight. Removing KMS solves the issue, but then the resolution is off. Building i915 as a module and blacklisting it at boot also avoids the problem, but running X or modprobing i915 then produces the same blank screen with full backlight. This happens on a MacBook with a Core 2 Duo CPU and the integrated Intel GM45. The issue is not present on <=sys-kernel/gentoo-sources-3.6.6.

Comment 1 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-01-13 11:22:18 UTC

As discussed on chat, the most efficient way of finding this would be a git bisect. Before we do that, you might want to try 3.6.11 to see whether it works; that would narrow the range for the bisect and tell us more exactly between which two release versions the issue lies.

From there on you can do a git bisect between a "working" and a "broken" version:

http://wiki.gentoo.org/wiki/Kernel_git-bisect

Comment 2 Andreas Sturmlechner gentoo-dev

2013-02-24 15:57:02 UTC

Do you see that on every boot?

A very similar error, albeit without a kernel crash, happens now and then on my system. It is very hard to bisect because it occurs randomly, anywhere from 12 seconds to 12 hours of uptime, and I have no idea how to trigger it myself yet.

https://bugs.freedesktop.org/show_bug.cgi?id=53385

Comment 3 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-04-16 16:46:26 UTC

That bug looks unrelated, since it does not result in a NULL pointer dereference as far as I can see. Let's look at the end of the stack trace in this bug (RIP):

> void drm_mode_copy(struct drm_display_mode *dst, const struct drm_display_mode *src)
> {
>     int id = dst->base.id;
>
>     *dst = *src;
>     dst->base.id = id;
>     INIT_LIST_HEAD(&dst->head);
> }

The only way to get a NULL pointer dereference in the above code is for "dst" or "src" to be NULL: "dst" is dereferenced by "->" and by the assignment, and "src" is read by "*dst = *src". So we need to check both.
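To make the point about both pointers concrete, here is a minimal user-space sketch. It is not kernel code: "struct display_mode" and "mode_copy_checked" are hypothetical, simplified stand-ins that only mirror the shape of drm_mode_copy and add an explicit check of each pointer.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mode_object {
    int id;
};

struct display_mode {
    struct mode_object base;
    int hdisplay;
    int vdisplay;
};

/*
 * Same shape as drm_mode_copy, plus an explicit check of both pointers.
 * Returns 0 on success and -1 if either pointer is NULL.
 */
static int mode_copy_checked(struct display_mode *dst,
                             const struct display_mode *src)
{
    int id;

    if (dst == NULL || src == NULL)
        return -1;

    id = dst->base.id;  /* reads through dst */
    *dst = *src;        /* reads through src, writes through dst */
    dst->base.id = id;  /* keep the destination's own id */
    return 0;
}
```

Without the check, either a NULL "dst" or a NULL "src" would fault in the function body, which is why both callers' arguments are worth tracing.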

> struct drm_display_mode *drm_mode_duplicate(struct drm_device *dev,
>                                        const struct drm_display_mode *mode)
> {
>    struct drm_display_mode *nmode;
> 
>    nmode = drm_mode_create(dev);
>    if (!nmode)
>            return NULL;
> 
>    drm_mode_copy(nmode, mode);
> 
>    return nmode;
> }

So, we see "nmode" being passed into drm_mode_copy here. However, there is an "if (!nmode)" which ensures that "nmode" is not NULL. That leaves "mode", which is not checked at all; it appears that one of the parent functions is passed a NULL "mode".

There seems to be some inlining going on here: drm_mode_duplicate (which simply passes on "mode") is called by intel_modeset_adjusted_mode (which also simply passes on "mode"), which in turn is called by intel_set_mode.

Interesting, in "intel_set_mode" we see this:

> /* Hack: Because we don't (yet) support global modeset on multiple
>  * crtcs, we don't keep track of the new mode for more than one crtc.
>  * Hence simply check whether any bit is set in modeset_pipes in all the
>  * pieces of code that are not yet converted to deal with mutliple crtcs
>  * changing their mode at the same time. */

The code we are crashing in is considered a hack; there is probably something that has not been converted yet.

> /* Compute whether we need a full modeset, only an fb base update or no
>  * change at all. In the future we might also check whether only the
>  * mode changed, e.g. for LVDS where we only change the panel fitter in
>  * such cases. */
> intel_set_config_compute_mode_changes(set, config);
> 
> ret = intel_modeset_stage_output_state(dev, set, config);
> if (ret)
>         goto fail;
> 
> if (config->mode_changed) {
>         if (set->mode) {
>                 DRM_DEBUG_KMS("attempting to set mode from"
>                                 " userspace\n");
>                 drm_mode_debug_printmodeline(set->mode);
>         }
> 
>         if (!intel_set_mode(set->crtc, set->mode,
>                             set->x, set->y, set->fb)) {
>                 DRM_ERROR("failed to set mode on [CRTC:%d]\n",
>                           set->crtc->base.id);
>                 ret = -EINVAL;
>                 goto fail;
>         }
> } else if (config->fb_changed) {
>         ret = intel_pipe_set_base(set->crtc,
>                                   set->x, set->y, set->fb);
> }

We see "intel_set_mode" in the above code (it's called through some kind of cpumask function we can likely ignore). The first things called with "set" before it are "intel_set_config_compute_mode_changes" and "intel_modeset_stage_output_state". There's also a comment, and from the stack trace we can conclude that an "fb base update" is probably happening, since "fb_restore" is in there.

I don't see anything that modifies "set" in "intel_set_config_compute_mode_changes", nor in "intel_modeset_stage_output_state", so the NULL "mode" probably comes from higher up in the stack trace.

> bool drm_fb_helper_restore_fbdev_mode(struct drm_fb_helper *fb_helper)
> {
>         bool error = false;
>         int i, ret;
>         for (i = 0; i < fb_helper->crtc_count; i++) {
>                 struct drm_mode_set *mode_set = &fb_helper->crtc_info[i].mode_set;
>                 ret = mode_set->crtc->funcs->set_config(mode_set);
>                 if (ret)
>                         error = true;
>         }
>         return error;
> }

So mode_set is just passed on here again. One level further up?
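One way to narrow this down would be to check which entry of crtc_info carries a mode_set without a mode before set_config is called. The following is only a user-space sketch under stated assumptions: the structures are hypothetical, cut-down stand-ins for drm_fb_helper and drm_mode_set, and "first_crtc_without_mode" is an illustrative helper, not an existing DRM function. Whether a NULL mode is legitimate at this point (for example, for an unused CRTC) is not established in this report.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the DRM structures. */
struct display_mode {
    int hdisplay;
    int vdisplay;
};

struct mode_set {
    const struct display_mode *mode;
};

struct crtc_info {
    struct mode_set mode_set;
};

struct fb_helper {
    int crtc_count;
    struct crtc_info *crtc_info;
};

/*
 * Walk the helper's CRTC list in the same order as the restore loop.
 * Returns the index of the first entry whose mode_set has no mode,
 * or -1 if every entry has one (or the helper itself is unusable).
 */
static int first_crtc_without_mode(const struct fb_helper *helper)
{
    int i;

    if (helper == NULL || helper->crtc_info == NULL)
        return -1;

    for (i = 0; i < helper->crtc_count; i++) {
        if (helper->crtc_info[i].mode_set.mode == NULL)
            return i;
    }
    return -1;
}
```

In a real kernel, the equivalent check would be a debug print inside the loop of drm_fb_helper_restore_fbdev_mode; the sketch only shows the logic.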

> void intel_fb_restore_mode(struct drm_device *dev)
> {
>         int ret;
>         drm_i915_private_t *dev_priv = dev->dev_private;
>         struct drm_mode_config *config = &dev->mode_config;
>         struct drm_plane *plane;
> 
>         mutex_lock(&dev->mode_config.mutex);
> 
>         ret = drm_fb_helper_restore_fbdev_mode(&dev_priv->fbdev->helper);
>
>         // ...
> }

We're now following the helper, which comes from the device; again, it is just passed on.

> /**
>  * Take down the DRM device.
>  * 
>  * \param dev DRM device structure.
>  * 
>  * Frees every resource in \p dev.
>  *
>  * \sa drm_device
>  */
> int drm_lastclose(struct drm_device * dev)

The comment here says enough: it tries to take down the device for one reason or another. But why? We can't tell; the rest of the stack trace gives no hints regarding that:

> [   12.337009]  [<ffffffff812c8a46>] ? drm_release+0x4fa/0x52b
> [   12.337009]  [<ffffffff810b51b2>] ? __fput+0xe5/0x1a9
> [   12.337009]  [<ffffffff8103bf51>] ? task_work_run+0x76/0x8d
> [   12.337009]  [<ffffffff8156015a>] ? int_signal+0x12/0x17

The release is called as part of a task that is run on the int_signal path; this task probably comes from something like a thread pool (a list of tasks to be run). Therefore, we can't see from the dmesg what put it there.

So, long story short, we know what's happening but don't know why. The only two interesting things we came across are the "hack" comment and the comment below it stating "in the future". The rest is just code passing values along the stack trace (which we walked from top to bottom).

There's still one more interesting thing in the dmesg, that is this:

> [   12.337009] Process X (pid: 2459, threadinfo ffff880079a4e000, task ffff88007adc4e00)

X may be the process that is releasing the device; it may be unloading the module for one reason or another. To know why, we need more information:

1) Could you please attach your /var/log/Xorg.0.log? :)
2) Could you try =sys-kernel/gentoo-sources-3.8.7 to see if it is still present?

Comment 4 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-06-08 12:18:16 UTC

> 1) Could you please attach your /var/log/Xorg.0.log? :)
> 2) Could you try =sys-kernel/gentoo-sources-3.8.7 to see if it is still present?

Please attach /var/log/Xorg.0.log and try =sys-kernel/gentoo-sources-3.8.13.

Comment 5 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-07-04 13:29:34 UTC

(In reply to Tom Wijsman (TomWij) from comment #4) 
> Please attach /var/log/Xorg.0.log and try =sys-kernel/gentoo-sources-3.8.13.