When switching to another user within KDE and then switching back to the previous user while leaving the new login alive, the X server always crashes.
Reproducible: Always
Steps to Reproduce:
1. Log in to KDE.
2. Switch to a new user.
3. Switch back.
Actual Results: System is completely locked; the screen shows some colored patterns.
Expected Results: ahem -- it should not crash, imho
Created attachment 151418 [details] Xorg log with version intel 2.3.0
Portage 2.1.5_rc6 (default/linux/x86/2008.0/desktop, gcc-4.2.3, glibc-2.7-r2, 2.6.25-gentoo-r1 i686)
=================================================================
System uname: 2.6.25-gentoo-r1 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
Timestamp of tree: Wed, 30 Apr 2008 09:32:01 +0000
app-shells/bash: 3.2_p33
dev-lang/python: 2.5.2-r2
sys-apps/baselayout: 2.0.0
sys-apps/openrc: 0.2.3
sys-apps/sandbox: 1.2.18.1-r2
sys-devel/autoconf: 2.13, 2.62
sys-devel/automake: 1.5, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.1
sys-devel/binutils: 2.18-r1
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool: 1.5.26
virtual/os-headers: 2.6.25-r1
ACCEPT_KEYWORDS="x86 ~x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=pentium4 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config"
CONFIG_PROTECT_MASK="/etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo /etc/texmf/web2c /etc/udev/rules.d"
CXXFLAGS="-O2 -march=pentium4 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="distlocks parallel-fetch sandbox sfperms strict unmerge-orphans user-fetch userfetch"
GENTOO_MIRRORS="ftp://ftp.wh2.tu-dresden.de/pub/mirrors/gentoo "
LANG="de_DE.utf8"
LC_ALL="de_DE.utf8"
LDFLAGS=""
LINGUAS="de en_GB fr"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local/layman/sci-medicine /usr/portage/local/layman/modulix /usr/portage/local/layman/sci-libs"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X acl acpi alsa arts berkdb bluetooth branding bzip2 cairo cdr cli consolekit cracklib crypt cups curl dbus doc dri dvd dvdr dvdread eds emboss encode esd evo exif fam firefox gdbm gif gimp gphoto2 gpm gstreamer gtk hal iconv isdnlog jpeg kde kerberos ldap libnotify mad mbox midi mikmod mmx mp3 mpeg mudflap ncurses nls nptl nptlonly ogg opengl openmp pam pcre pdf perl png ppds pppd python qt3 qt3support qt4 quicktime readline reflection scanner sdl session spell spl sse sse2 ssl startup-notification svg tcpd tiff truetype unicode usb v4l v4l2 vorbis win32codecs x86 xml xorg xv zlib"
ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci"
ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol"
APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias"
CAMERAS="*"
ELIBC="glibc"
INPUT_DEVICES="keyboard mouse evdev"
KERNEL="linux"
LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text"
LINGUAS="de en_GB fr"
USERLAND="GNU"
VIDEO_CARDS="fbdev glint i810 mach64 mga neomagic nv r128 radeon savage sis tdfx trident vesa vga via vmware voodoo"
Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
This bug seems to be related to xf86-video-intel-2.3.0. Compiling and installing xf86-video-intel-2.2.1 from http://xorg.freedesktop.org/archive/individual/driver/xf86-video-intel-2.2.1.tar.gz solves this problem
There are some naming/versioning issues: http://gitweb.freedesktop.org/?p=xorg/driver/xf86-video-intel.git;a=summary According to git, xf86-video-i810 was renamed to xf86-video-intel long ago. Therefore it is difficult to tell whether xf86-video-i810-2.3.0 is actually newer than xf86-video-intel-2.2.1. xf86-video-intel-2.2.1 is from the upcoming X11R7.4 release, while the problematic version comes with the Gentoo package x11-drivers/xf86-video-i810 2.3.0.
Created attachment 151420 [details] Xorg log with working intel driver version 2.2.1
Some clarification.
1.) Gentoo uses xf86-video-i810 as the package name, but the driver name is xf86-video-intel.
2.) The current git version works OK.
3.) If I modify the xf86-video-i810-2.2.3.ebuild to leave out the Gentoo patches, the problem is gone.
4.) Consequently, the problem was introduced by one of these patches found in /usr/portage/x11-drivers/xf86-video-i810/files:
0001-fixup-pciaccess-version-detect.patch
1.6.5-fix_no_dri.patch
xf86-video-i810-2.1.1-fix_build_without_dri.patch
xf86-video-i810-2.2.99.902-enable_center_panel_fitting_on_i8xx.patch
xf86-video-i810-2.2.99.903-fix-panel-resize-on-i8xx.patch
i810.xinf
In spite of what I wrote in the last comment, xf86-video-i810-2.3.0.ebuild has the same error even without the Gentoo patches. (I must have forgotten to run ebuild with sudo, a typical post-hoc fallacy.) But I narrowed it down further: xf86-video-i810-2.2.99.901.ebuild still runs fine (as does xf86-video-i810-2.2.1.ebuild), while installing xf86-video-i810-2.2.99.902.ebuild triggers the bug, which in turn is __not__ induced by xf86-video-i810-2.2.99.902-enable_center_panel_fitting_on_i8xx.patch. I know this because I commented out the PATCH statement and ran ebuild manifest afterwards before merging. A diff between .901 and .902 will be attached. Is there no git access for Gentoo x11? It's a bit hard to emulate the git bisect game completely manually.
Created attachment 151443 [details] diff must contain the change which introduced the session switch bug
Playing the git bisect game between origin/xf86-video-intel-2.2-branch and origin/xf86-video-intel-2.3-branch, I finally found the culprit:
>git bisect bad
>Bisecting: 0 revisions left to test after this
>[0af692e9ee5857e41ffdbaf760752a37737b21b7] Revert "Use mprotect on unbound AGP memory to attempt to catch use while unbound."
The related entry in gitweb reads: While I still like the idea, the mprotect calls themselves are failing on Linux and causing more trouble than they're worth. This reverts commit a1612b7728d4153499fe86b6713a13c8702cc7d9.
So the __revert__ of the mprotect commit a1612b7728d4153499fe86b6713a13c8702cc7d9 introduced the buggy behaviour. What I find irritating is that the commit with hash 0af692e9ee5857e41ffdbaf760752a37737b21b7 (which really is the revert by Eric Anholt, see here: http://article.gmane.org/gmane.comp.freedesktop.xorg.cvs/8151) is listed on gitweb.freedesktop.org as c02ab432dd7058c700c35eecf6215daf5f262c51, while the headline stays the same.
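For readers unfamiliar with the bisect workflow used in this comment, here is a minimal sketch on a throwaway repository. Everything in it is made up for illustration (the files `state` and `counter`, the commit subjects, the grep-based check); it is not the driver tree. Against the real repository, the endpoints would presumably be the two branches named above, and each step would be judged by hand: build the driver, try the session switch, then answer `git bisect good` or `git bisect bad`.

```shell
#!/bin/sh
# Toy demonstration of git bisect on a temporary repository.
# All names below are hypothetical; only the git commands are real.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "bisect@example.invalid"
git config user.name "Bisect Demo"

# Create eight commits; from the fifth one on, the "crash" is present.
for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -ge 5 ]; then echo "crash" > state; else echo "ok" > state; fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "change $i"
done

# The root commit is known good, HEAD is known bad.
good=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$good" > /dev/null

# Let git drive the search: exit status 0 marks a revision good, 1 marks it bad.
git bisect run sh -c 'grep -q ok state' > /dev/null

# After the run, refs/bisect/bad points at the first bad commit.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset > /dev/null 2>&1
echo "first bad commit: $first_bad"
```

With an automated check, `git bisect run` replaces the manual good/bad answers; a check that cannot judge a revision (for example because it does not build) can exit with status 125 to have that revision skipped.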
Ok, I'll try to answer all of this in one go :) 1) Yes, we know about the name change. We still keep the old name because we're a bit lazy, it takes a lot of checking to actually do it, and it's only a name... 2) Since you've successfully run git bisect on the code, I strongly urge you to open a bug in FreeDesktop's bugzilla (after reading this guide [1]) so that it can be fixed in master and for the next 2.3 release. If you do open a bug, please add "remi@gentoo.org" as a CC there. Thanks for taking the time to debug the whole thing so thoroughly.
Ok, I will open a new entry at FreeDesktop's bugzilla.
For the sake of completeness: attached you will find a symbolic backtrace indicating a segmentation fault in XkbEnableDisableControls(). Actually this was generated from the git master branch, but 2.3.0 gives the same results.
But this is only a bug of third order. The error of second order seems to be that someone placed a direct call to close_device() within CloseDownDevices(), disregarding the notice in dix/devices.c saying "Don't call this function directly, use RemoveDevice() instead". The first-order problem is a ring buffer stall in I830WaitLpRing() lasting more than 2000 ms and constantly reporting ~100k of used, ~3k of free and ~130k of needed bytes of memory.
If anyone had ever documented the environment setting "INTEL_DEBUG=fall" and had also taken care of syncing log output to disk, all this could have been found far more easily.
Yet while debugging X I also noticed that Gentoo does not provide a package for libpciaccess, which might be a reason for the stall.
Besides: Is there an official way to instruct emerge not to delete the sources after installing the binaries? I would think that at least setting FEATURES="nostrip" should imply this effect. Furthermore, does it really make any sense to specify "debug" in USE flags and nevertheless get debug symbols stripped and sources deleted? Whoever uses Gentoo in an embedded environment should know how to strip symbols manually.
Created attachment 151584 [details] symbolic stack dump
(In reply to comment #11)
> Ok, I will open a new entry at FreeDesktop's bugzilla.
>
> For the sake of completeness: attached you will find a symbolic backtrace
> indicating a segmentation fault in XkbEnableDisableControls(). Actually this
> was generated from git master branch, but 2.3.0 gives the same results.
>
> But this is only a bug of third order. The error of second order seems to be,
> that someone placed a direct call to close_device() within CloseDownDevices(),
> disregarding the notice in dix/devices.c saying "Don't call this function
> directly, use RemoveDevice() instead". The first order problem is a ring buffer
> stall in I830WaitLpRing() lasting more than 2000 ms and indicating constantly ~
> 100k of used, ~3k of free and ~130k needed bytes of memory.
That's great stuff. I'm sure that'll greatly help upstream to find the proper fix for that.
> If anyone some day had documented the environment setting "INTEL_DEBUG=fall"
> and in addition would have taken care of syncing log output to disk, all this
> could have been found way easier.
It's upstream's job to properly document how their software works. Xorg is an incredibly complex piece of software, so please remind them to think of their users.
> Yet while debugging X I also noticed that Gentoo does not provide a package for
> libpciaccess -- which might be a reason for the stall.
xorg-server 1.4 does not need libpciaccess. That's why it's only in the X11 overlay for now. All Xorg drivers (at least for a little while) are supposed to build and work properly with and without a libpciaccess'ed xserver. If something breaks, that's a bug :)
> Besides: Is there an official way to instruct emerge not to delete the sources
> after installing the binaries?
What you are probably looking for is FEATURES="splitdebug installsources". Take a look at /etc/make.conf.example for what these do exactly.
> I would think that at least setting
> FEATURES="nostrip" should imply this effect.
No need to bloat "nostrip" with that as there are many other options available.
> Furthermore, does it really make
> any sense to specify "debug" in USE flags, and nevertheless get debug symbols
> stripped and sources deleted? Whoever uses Gentoo in an embedded environment
> should know how to strip symbols manually.
# euse -i debug
global use flags (searching: debug)
************************************************************
[- ] debug - Enable extra debug codepaths, like asserts and extra output. If you want to get meaningful backtraces see http://www.gentoo.org/proj/en/qa/backtraces.xml
So no, the debug useflag does not do what you think it does. But as you can see everything is already properly documented :) Thanks
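As a rough sketch of what the settings mentioned in this reply could look like in /etc/make.conf: the CFLAGS line is the one from the emerge --info output above with -ggdb appended, which is an assumption about how debug information would be requested here. Portage spells the first feature `splitdebug`. According to make.conf(5), installsources only works if debugedit is installed and CFLAGS request debug information, so treat this as a starting point rather than a verified configuration.

```shell
# /etc/make.conf (excerpt, illustrative only)

# Ask the compiler for debug information; -ggdb is an assumed choice.
CFLAGS="-O2 -march=pentium4 -pipe -ggdb"
CXXFLAGS="${CFLAGS}"

# splitdebug:     keep debug symbols in separate files under /usr/lib/debug
# installsources: keep the matching sources under /usr/src/debug
FEATURES="splitdebug installsources"
```

Packages built before the change keep their stripped binaries, so xorg-server and the driver would have to be re-emerged before a backtrace shows symbols.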
Created attachment 151630 [details] xorg log output after #define I810_DEBUG (-1) Seems as if the driver debug environment variable must be named I810_DEBUG, not INTEL_DEBUG ... As far as one can tell from this trace, the GPU takes an after-lunch nap.
(In reply to comment #13)
> xorg-server 1.4 does not need libpciaccess. That's why it's only in the X11
> overlay for now. All Xorg drivers (at least for a little while) are supposed to
> build and work properly with and without a libpciaccess'ed xserver. If
> something breaks, that's a bug :)
If you look at the sources: king's road now takes libpciaccess as an abstraction layer to pass over the hardware everglades. Other distributions appear to use it already.
> What you are probably looking for is FEATURES="splitdebug installsources".
> http://www.gentoo.org/proj/en/qa/backtraces.xml
> But as you can see everything is already properly documented
At least nearly everything. Thank you for the hint. As I see, I need to study more of these documents.
(In reply to comment #15) > If you look at the sources: king's road now takes libpciaccess as an > abstraction layer to pass over the hardware everglades. Other distributions > appear as if they would already use it. For git master yes, for 1.4 (the most recent in portage) no. Donnie has plans for xorg-server 1.5 and he's the guy that will be handling the transition. If you have any questions about how 1.5 will be handled in Gentoo, please ask him. Now back to this bug :) Did you open a bug in FreeDesktop's bugzilla? Don't forget to add me CC there. Thanks
Added this to http://bugs.freedesktop.org/show_bug.cgi?id=15807, and named it "[865G rev 02] Xserver crashes during VT switch" (In reply to comment #16).
Thanks for opening the bug :) Let's see how things go with Intel devs in the loop.
Tracking upstream
It is interesting to see that the workaround (or fix?) I proposed on Xorg's bugzilla in http://bugs.freedesktop.org/show_bug.cgi?id=15807#c11, which consists of commenting out exactly this:
+    /* Emit a flush of the rendering cache, or on the 965 and beyond
+     * rendering results may not hit the framebuffer until significantly
+     * later. In the direct rendering case this is already done just
+     * after the page flipping updates, so there's no need to duplicate
+     * the effort here.
+     */
+    if (!pI830->noAccel && !pI830->directRenderingEnabled)
+        I830EmitFlush(pScrn);
+
-- is a change which was also committed between xf86-video-i810-2.2.99.901 and xf86-video-i810-2.2.99.902. See the diff I attached earlier here: http://bugs.gentoo.org/attachment.cgi?id=151443. This was commit '8cdbd55f8075cd18b563badde35815665d7d053e':
-------------------------------------------
Fix 965+ rendering issues with DRI disabled. The new chips no longer automatically flush the rendering cache, so if we don't flush the RC at blockhandler, the last rendering done may not appear on the screen. This was particularly noticeable with a bare Xorg with some missing root weave, and terminals where the last character wouldn't appear until the cursor blinked. A flush in the DRI blockhandler path had hidden this issue for most people.
--------------------------------------------
A peculiarity of the crash I encountered is that I830EmitFlush is only called in this context if direct rendering is not enabled, and it was not enabled only for the second session. I would like to know whether it was ever enabled for the second session.
Honestly, you've closed the bug WORKSFORME so fast, I didn't really get to see what your issue _really_ was, and how you managed to fix it :) If you want/need me to backport anything from the master branch of the intel driver, please let me know (with a simple explanation). Cheers
(In reply to comment #21)
Remi, I found a workaround, not a solution. I now see that it was strategically wrong to set the status to "works-for-me", since this seems to imply something like "fixed, resolved, ok". If this should mean that nobody else cares now, it will mean that many other users of the intel driver will care in future, when anything after 2.2.99.901 goes into production.
I described my issue very clearly: 1.) log into KDE, 2.) log into a new session, 3.) switch back and crash at least the newly created session.
I used the current master branch to track down the issue completely, for several reasons: 1.) It shows the same, or even a more severe, problem. 2.) It's easier for the people who manage it to work on the master branch.
I wrote one bug report here, and one on Xorg after you told me to do so.
Concerning the master branch: I could have written a dozen reports, seeing how much really goes wrong if DRI initialization fails at the start of the second session. This is the main point of failure within master, which causes all the rest. In other words: DRI initialization failure is the basic illness, and the crash is a symptom showing up on an otherwise ill X framework. I only cured the symptom.
Concerning backports: As I said, reverting commit 8cdbd55f8075cd18b563badde35815665d7d053e and commit 0af692e9ee5857e41ffdbaf760752a37737b21b7 makes session switching work again, but it does not fix the real cause, which is that DRI initialization fails for the second login. This is the root of the problem. All the other hassle would be easy to fix if an engaged X developer who knows the code picked up the case. And honestly, the DRI issue is a very complex one. At the moment I do not know whether DRI should normally succeed for the second session.
The reason for the DRI initialization failure is that at least the current master DRM kernel driver does not allow a second call to drmSetBusid() to succeed, thus making the DRI initialization of the second session fail (and crash). So far I have not had the time to test whether that issue is also in the 2.3.0 branch.
The mentioned commits have many things in common: 1.) Both make session switching fail. 2.) Both were committed between .901 and .902. 3.) Both have the same author. 4.) Both were committed in sequence.
http://gitweb.freedesktop.org/?p=xorg/driver/xf86-video-intel.git;a=shortlog;h=21783ec9dfad9aae0837fd2d8eb313a77f031046
2008-03-26 Eric Anholt Fix 965+ rendering issues with DRI disabled.
2008-03-25 Eric Anholt Revert "Use mprotect on unbound AGP memory to attempt ...
So, should I reopen it here and upstream?
A couple of things: 1) Yes, closing WORKSFORME still means "closed" in pretty much all Bugzillas across the web. So please do reopen it if you want upstream to fix it. 2) As for my view on this, you dove straight into the code and gave me way too much information for me to follow what was going on :) But that's just me. I don't know the driver's code. I'm just a maintainer. Cheers
(In reply to comment #23) Moving from a whistle stop to a more global village, I reopen this bug report here and at FreeDesktop's bugzilla.
Remi, could you just add Eric Anholt's commit '36ec93300926084fb2951d69b001e4c67bc6ff79', which would also fix Gentoo's xf86-video-i810-2.3.0 ebuild, and then close this issue? I have the driver running now, and switching works.
<SNIP>
--- a/src/i830_driver.c
+++ b/src/i830_driver.c
@@ -2411,7 +2411,7 @@ I830BlockHandler(int i,
      * after the page flipping updates, so there's no need to duplicate
      * the effort here.
      */
-    if (!pI830->noAccel && !pI830->directRenderingEnabled)
+    if (pScrn->vtSema && !pI830->noAccel && !pI830->directRenderingEnabled)
 	I830EmitFlush(pScrn);
 
     I830VideoBlockHandler(i, blockData, pTimeout, pReadmask);
</SNIP>
Two points remain open:
1.) Why didn't git bisect identify the buggy patch exactly, but instead blamed its immediate predecessor (which had removed the mprotect() calls)?
2.) DRI will currently not work for any session other than the first; however, this is a global issue not directly related to the intel driver.
(In reply to comment #25)
> could you just add Eric Anholt's commit
> '36ec93300926084fb2951d69b001e4c67bc6ff79' which would fix also gentoo's
> xf86-video-i810-2.3.0 ebuild and then close this issue? I have the driver
> running now, and switching works.
I've just committed 2.3.1 to portage, sorry I took so long. 2.3.1 has that commit so it should be fine. Please let me know if that's not the case.
> Two points remain open:
> 1.) Why didn't git bisect identify the buggy patch exactly, but blamed
> instead its immediate predecessor (which had removed the mprotect() calls).
I guess that's just bad luck. Hardware drivers are not exactly the easiest piece of software to debug :)
> 2.) DRI will currently not work for any other than the first session; however,
> this is a global issue not directly related to the intel driver.
That's been an open issue for a very long time. Hopefully the DRI2 work should solve this for good. I think all the pieces should be available in the x11 overlay. If you're interested, you should ping Donnie (dberkholz) about it; he probably knows more about it than I do.
Thanks for reporting the bug, and again sorry for the delay. Closing fixed :)