Bug 577554 - sys-kernel/gentoo-sources drm:i915_hangcheck_elapsed
Summary: sys-kernel/gentoo-sources drm:i915_hangcheck_elapsed
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: Normal critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-16 15:25 UTC by Michael Weiss (primeos)
Modified: 2016-03-23 15:04 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
/sys/class/drm/card0/error (sys.class.drm.card0.error,539.80 KB, text/plain)
2016-03-16 15:28 UTC, Michael Weiss (primeos)
My custom configuration that caused the system to freeze. (config-4.4.5-freezing,104.08 KB, text/plain)
2016-03-16 15:39 UTC, Michael Weiss (primeos)
My current configuration that somehow prevents the system from freezing. (config-4.4.5-working,104.39 KB, text/plain)
2016-03-16 15:40 UTC, Michael Weiss (primeos)
Diff between the freezing and non-freezing configuration. (config-diff,1.53 KB, text/plain)
2016-03-16 15:47 UTC, Michael Weiss (primeos)
Output from: lspci -vvv -s 2 (lspci.txt,1.05 KB, text/plain)
2016-03-20 17:26 UTC, Michael Weiss (primeos)

Description Michael Weiss (primeos) 2016-03-16 15:25:16 UTC
I first ran into this bug after =sys-kernel/gentoo-sources-4.0.9 (not completely sure, though). With both my configuration and genkernel's default configuration, my whole system suddenly freezes after a random amount of uptime. Initially it took over 10 days (4.0.9), but currently it happens after less than 4-5 h (4.4.5); I'm not sure whether it is really related to the kernel version. By "freeze" I mean that the power stays on and the screen keeps showing the last frame, but nothing else happens (no SysRq, no Caps Lock, no LEDs). The only thing I can do is a hard reset. It seems that not even a panic is triggered, since I have loaded a panic kernel (and it works with echo c > /proc/sysrq-trigger).
With my current config it freezes for only ~2 s and the following line shows up in dmesg:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... render ring idle

Reproducible: Couldn't Reproduce

Steps to Reproduce:
It seems to happen at random (currently only once, less than 4-5 h after booting), but it appears to be related to Chromium somehow: as far as I remember, it never happened without Chromium running. It should nevertheless be possible to reproduce it without Chromium.
Actual Results:  
Everything freezes for ~2 s and the following line shows up in dmesg:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... render ring idle


Some other lines that might be related (dmesg):
...
[    0.283757] [drm] Initialized i915 1.6.0 20151010 for 0000:00:02.0 on minor 0
...
[    1.982637] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
...
[   63.349412] WARNING: CPU: 0 PID: 5113 at drivers/gpu/drm/i915/intel_uncore.c:619 hsw_unclaimed_reg_debug+0x64/0x7c()
[   63.349414] Unclaimed register detected before reading register 0x22380
[   63.349416] Modules linked in: iwlmvm snd_hda_codec_hdmi iTCO_wdt asus_nb_wmi iTCO_vendor_support asus_wmi x86_pkg_temp_thermal iwlwifi lpc_ich mfd_core wmi efivarfs
[   63.349428] CPU: 0 PID: 5113 Comm: BrowserBlocking Tainted: G        W       4.4.5-gentoo #3
[   63.349429] Hardware name: ASUSTeK COMPUTER INC. UX303LAB/UX303LAB, BIOS UX303LAB.210 08/25/2015
[   63.349431]  0000000000000000 ffff88021ec03d18 ffffffff812c5e8c ffff88021ec03d60
[   63.349434]  0000000000000009 ffff88021ec03d50 ffffffff810753a5 ffffffff813f11fd
[   63.349437]  ffff880212da0000 ffff880212da0000 ffff880212da0000 ffff880212da0080
[   63.349439] Call Trace:
[   63.349442]  <IRQ>  [<ffffffff812c5e8c>] dump_stack+0x4d/0x63
[   63.349449]  [<ffffffff810753a5>] warn_slowpath_common+0x9a/0xb3
[   63.349453]  [<ffffffff813f11fd>] ? hsw_unclaimed_reg_debug+0x64/0x7c
[   63.349456]  [<ffffffff81075401>] warn_slowpath_fmt+0x43/0x4b
[   63.349458]  [<ffffffff813f4ee8>] ? fw_domains_get_with_thread_status+0xd/0x58
[   63.349461]  [<ffffffff813f11fd>] hsw_unclaimed_reg_debug+0x64/0x7c
[   63.349464]  [<ffffffff813f2369>] gen6_read32+0x43/0xae
[   63.349467]  [<ffffffff813e9c5a>] intel_lrc_irq_handler+0x96/0x1ae
[   63.349470]  [<ffffffff813b3784>] gen8_gt_irq_handler+0x75/0x1d8
[   63.349473]  [<ffffffff813b3952>] gen8_irq_handler+0x6b/0x520
[   63.349476]  [<ffffffff810a86b1>] handle_irq_event_percpu+0x78/0x1a7
[   63.349478]  [<ffffffff810a8806>] handle_irq_event+0x26/0x46
[   63.349481]  [<ffffffff810ab1d7>] handle_edge_irq+0xa1/0xbe
[   63.349484]  [<ffffffff8100630b>] handle_irq+0x104/0x10c
[   63.349486]  [<ffffffff81005c76>] do_IRQ+0x46/0xb5
[   63.349490]  [<ffffffff8175eebf>] common_interrupt+0x7f/0x7f
[   63.349491]  <EOI> 
[   63.349493] ---[ end trace be1d80709bc4e6de ]---
[  411.849338] perf interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[  759.268856] kworker/dying (6) used greatest stack depth: 12472 bytes left
[ 1255.449707] perf interrupt took too long (5036 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 4729.323445] kworker/dying (34) used greatest stack depth: 12448 bytes left
...
[24005.351739] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... render ring idle
[26561.371210] asus_wmi: Unknown key cf pressed
...
[58269.439309] [drm] stuck on render ring
[58269.441141] [drm] GPU HANG: ecode 8:0:0xac277ffe, in chrome [5110], reason: Ring hung, action: reset
[58269.441143] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[58269.441144] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[58269.441145] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[58269.441146] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[58269.441147] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[58269.526788] drm/i915: Resetting chip after gpu hang
...
[61122.440706] [drm] stuck on render ring
[61122.441982] [drm] GPU HANG: ecode 8:0:0x84dfbffe, in chrome [3967], reason: Ring hung, action: reset
[61122.443699] drm/i915: Resetting chip after gpu hang
...
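
To see how long after boot each hang occurred, the relevant lines can be pulled out of a saved dmesg log. The following is a minimal sketch, not part of the original report: the function name `find_hang_events` and the choice of marker strings are assumptions, with the markers copied from the messages in the excerpt above.

```python
import re
import sys

# Matches the bracketed dmesg timestamp (seconds since boot) and the message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings copied from the log excerpt above; this list is an assumption
# and may need extending for other kernel versions.
MARKERS = (
    "i915_hangcheck_elapsed",
    "GPU HANG",
    "Resetting chip after gpu hang",
)


def find_hang_events(dmesg_text):
    """Return (seconds_since_boot, message) for every i915 hang-related line."""
    events = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip "..." separators and lines without a timestamp
        seconds, message = float(match.group(1)), match.group(2)
        if any(marker in message for marker in MARKERS):
            events.append((seconds, message))
    return events


if __name__ == "__main__":
    # Reads dmesg output from stdin and prints the uptime in hours per event.
    for seconds, message in find_hang_events(sys.stdin.read()):
        print(f"{seconds / 3600:6.2f} h  {message}")
```

It could be run as `dmesg | python3 find_hangs.py`, where the script name is hypothetical.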


Boot-Options (my setup sucks (initramfs+boot-options in kernel) due to some UEFI problems):
[    0.000000] Linux version 4.4.5-gentoo (root@jarvis) (gcc version 4.9.3 (Gentoo 4.9.3 p1.5, pie-0.6.4) ) #3 SMP Mon Mar 14 16:59:19 CET 2016
[    0.000000] Command line: BOOT_IMAGE=/michael-emergency-kernel-4.4.5 crashkernel=128M
[    0.000000] efi: EFI v2.40 by American Megatrends
[    0.000000] efi:  ESRT=0xdce2cd98  ACPI=0xdb71c000  ACPI 2.0=0xdb71c000  SMBIOS=0xdce2c918
[    0.000000] DMI: ASUSTeK COMPUTER INC. UX303LAB/UX303LAB, BIOS UX303LAB.210 08/25/2015
[    0.000000] Kernel command line: root=/dev/mapper/vg_jarvis-lv_root dolvm rootfstype=ext4 init=/usr/lib/systemd/systemd BOOT_IMAGE=/michael-emergency-kernel-4.4.5 crashkernel=128M
Comment 1 Michael Weiss (primeos) 2016-03-16 15:28:05 UTC
Created attachment 428360 [details]
/sys/class/drm/card0/error

[58269.439309] [drm] stuck on render ring
[58269.441141] [drm] GPU HANG: ecode 8:0:0xac277ffe, in chrome [5110], reason: Ring hung, action: reset
[58269.441143] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[58269.441144] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[58269.441145] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[58269.441146] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[58269.441147] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[58269.526788] drm/i915: Resetting chip after gpu hang
Comment 2 Michael Weiss (primeos) 2016-03-16 15:39:19 UTC
Created attachment 428362 [details]
My custom configuration that caused the system to freeze.
Comment 3 Michael Weiss (primeos) 2016-03-16 15:40:05 UTC
Created attachment 428364 [details]
My current configuration that somehow prevents the system from freezing.
Comment 4 Michael Weiss (primeos) 2016-03-16 15:47:28 UTC
Created attachment 428366 [details]
Diff between the freezing and non-freezing configuration.

This seems to be the relevant change to me (not sure, though):
> CONFIG_HANGCHECK_TIMER=m

Things that could be related:
1187a1188,1190
> # CONFIG_INTEL_MEI is not set
> # CONFIG_INTEL_MEI_ME is not set
> # CONFIG_INTEL_MEI_TXE is not set

< # CONFIG_WATCHDOG_CORE is not set
< # CONFIG_WATCHDOG_NOWAYOUT is not set
---
> CONFIG_WATCHDOG_CORE=y
> CONFIG_WATCHDOG_NOWAYOUT=y
2288c2291
< # CONFIG_SOFT_WATCHDOG is not set
---
> CONFIG_SOFT_WATCHDOG=m
2304c2307
< # CONFIG_I6300ESB_WDT is not set
---
> CONFIG_I6300ESB_WDT=m
2306c2309,2310
< # CONFIG_ITCO_WDT is not set
---
> CONFIG_ITCO_WDT=m
> CONFIG_ITCO_VENDOR_SUPPORT=y
2351c2355
< # CONFIG_MFD_CORE is not set
---
> CONFIG_MFD_CORE=m
2366c2370
< # CONFIG_LPC_ICH is not set
---
> CONFIG_LPC_ICH=m


Mistakes that should be unrelated (I reconfigured a lot and forgot to change these after oldconfig in this version; other versions crashed without these mistakes):

< CONFIG_INITRAMFS_SOURCE="/boot/initramfs/4.1.15-r1.cpio"

< # CONFIG_LOCKUP_DETECTOR is not set
< # CONFIG_DETECT_HUNG_TASK is not set
< # CONFIG_PANIC_ON_OOPS is not set
< CONFIG_PANIC_ON_OOPS_VALUE=0

< # CONFIG_DEBUG_RT_MUTEXES is not set
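
A plain `diff` of two kernel configs depends on line order, so a per-option comparison can be easier to read. The sketch below is an illustration only and makes two assumptions: an option that is missing from a file is treated the same as "is not set" (shown as `n`), and the helper names `parse_config` and `diff_configs` are made up for this example.

```python
import re
import sys

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_ option to its value; 'is not set' becomes 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old, new):
    """Return {option: (old_value, new_value)} for options that differ.

    Assumption: an option absent from one file counts as 'n' there.
    """
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes


if __name__ == "__main__":
    # Usage: python3 config_diff.py <old-config> <new-config>
    with open(sys.argv[1]) as old_file, open(sys.argv[2]) as new_file:
        old_cfg = parse_config(old_file.read())
        new_cfg = parse_config(new_file.read())
    for name, (before, after) in diff_configs(old_cfg, new_cfg).items():
        print(f"{name}: {before} -> {after}")
```

With the two attachments this would be called with config-4.4.5-freezing and config-4.4.5-working as arguments; the script name itself is hypothetical.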
Comment 5 Michael Weiss (primeos) 2016-03-16 15:54:38 UTC
FYI (bugs from the kernel bug tracker that could be related):
https://bugzilla.kernel.org/buglist.cgi?bug_status=__all__&content=drm%3Ai915_hangcheck_elapsed
Comment 6 Michael Weiss (primeos) 2016-03-16 16:42:26 UTC
(In reply to Michael Weiss from comment #5)
> FYI (bugs from the kernel bug tracker that could be related):
> https://bugzilla.kernel.org/buglist.cgi?bug_status=__all__&content=drm%3Ai915_hangcheck_elapsed

After looking at them more closely, I found the following:
https://bugs.freedesktop.org/buglist.cgi?bug_status=__open__&content=drm%3Ai915_hangcheck_elapsed

They seem to be closely related (and are still open). Did I report this in the wrong place?

I'm new to all of this (just trying to help here) - Please let me know what I could improve, change, etc. - thx :)
Comment 7 Michael Weiss (primeos) 2016-03-20 17:11:00 UTC
Experienced three more crashes - one yesterday and two today.

The one yesterday froze everything, i.e. I have no idea what happened that time.
The first one today happened after exiting i3 (shutting down the X server), but it didn't completely freeze the system: I couldn't switch to another VT, but the SysRq keys still worked, i.e. I have a dump (I'll look into that later).
The second one today happened most likely due to one of the following parameters: "drm.debug=0x06 i915.semaphores=1".

My 4.4.5-Setup (forgot to include that - semi-working i.e. one crash so far):
=sys-kernel/gentoo-sources-4.4.5
=media-libs/mesa-11.0.6
=x11-drivers/xf86-video-intel-2.99.917-r2
=x11-libs/libdrm-2.4.65

My current 4.5.0-Setup (since the third crash - testing now without the parameters):
=sys-kernel/gentoo-sources-4.5.0
=media-libs/mesa-11.1.2-r1
=x11-drivers/xf86-video-intel-2.99.917_p20160316
=x11-libs/libdrm-2.4.67
=x11-base/xorg-server-1.18.2
=x11-base/xorg-drivers-1.18-r1
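
To list which packages actually changed between the two setups above, the version atoms can be compared mechanically. This is a rough sketch under a simplifying assumption: the regular expression only handles `=category/name-version` atoms of the shape used in this report (optional `_p...` suffix and `-rN` revision), not the full Gentoo atom syntax. The names `parse_atoms` and `changed_versions` are made up for illustration.

```python
import re

# Simplified atom pattern: "=category/name-version[-rN]".
# The version is assumed to start with a digit right after a hyphen.
ATOM_RE = re.compile(r"^=([\w+.-]+/[\w+.-]+?)-(\d[\w.]*(?:-r\d+)?)$")


def parse_atoms(lines):
    """Map 'category/name' to its version string for each recognised atom."""
    packages = {}
    for line in lines:
        match = ATOM_RE.match(line.strip())
        if match:
            packages[match.group(1)] = match.group(2)
    return packages


def changed_versions(old, new):
    """Return {package: (old_version_or_None, new_version)} where they differ."""
    return {
        name: (old.get(name), new[name])
        for name in sorted(new)
        if old.get(name) != new[name]
    }


if __name__ == "__main__":
    old_setup = parse_atoms([
        "=sys-kernel/gentoo-sources-4.4.5",
        "=media-libs/mesa-11.0.6",
        "=x11-drivers/xf86-video-intel-2.99.917-r2",
        "=x11-libs/libdrm-2.4.65",
    ])
    new_setup = parse_atoms([
        "=sys-kernel/gentoo-sources-4.5.0",
        "=media-libs/mesa-11.1.2-r1",
        "=x11-drivers/xf86-video-intel-2.99.917_p20160316",
        "=x11-libs/libdrm-2.4.67",
        "=x11-base/xorg-server-1.18.2",
        "=x11-base/xorg-drivers-1.18-r1",
    ])
    for name, (before, after) in changed_versions(old_setup, new_setup).items():
        print(f"{name}: {before} -> {after}")
```

Packages that were not listed for the 4.4.5 setup show `None` as the old version, which only means they were not mentioned there, not that they were absent from the system.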

I'll now ask the devs at #intel-gfx@freenode.net what I should do about this (move it over to freedesktop.org, etc.), how I could help, etc.
Comment 8 Michael Weiss (primeos) 2016-03-20 17:26:07 UTC
Created attachment 428656 [details]
Output from: lspci -vvv -s 2

My CPU: Intel Core i7-5500U
My GPU: Intel HD Graphics 5500
My Notebook: ASUS ZenBook UX303LA
Comment 9 Michael Weiss (primeos) 2016-03-23 15:04:13 UTC
Fixed with the new setup:
=sys-kernel/gentoo-sources-4.5.0
=x11-libs/libdrm-2.4.67
=media-libs/mesa-11.1.2-r1
=x11-base/xorg-server-1.18.2
=x11-drivers/xf86-input-synaptics-1.8.2
=x11-drivers/xf86-video-intel-2.99.917_p20160316
=x11-drivers/xf86-input-evdev-2.10.1
=x11-base/xorg-drivers-1.18-r1

It has been working flawlessly for 3 days now :)
All issues are gone and no i915-related errors show up in dmesg anymore.