168954 – >=x11-base/xorg-server-1.1.1 is crashing on start

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Server
Hardware:	x86 Linux

Importance:	High major
Assignee:	Gentoo X packagers

URL:	http://forums.gentoo.org/viewtopic-t-...

Reported:	2007-03-02 00:18 UTC by Ian Hands
Modified:	2007-04-06 20:41 UTC
CC List:	1 user

Attachments
output of cat on requested files... (logs, 28.41 KB, text/plain) 2007-03-02 03:18 UTC, Ian Hands
Disassembler dump (0xb7f61160, 45.37 KB, text/plain) 2007-03-12 03:33 UTC, Ian Hands
Disassemble on 0x80c4264 (0x80c4264, 7.59 KB, text/plain) 2007-03-12 22:01 UTC, Ian Hands

Description Ian Hands 2007-03-02 00:18:44 UTC

With the nv, vesa, or nvidia driver, /usr/bin/X crashes with:

Backtrace:
0: X(xf86SigHandler+0x84) [0x80c4294]
1: [0xb7f3d420]

Fatal server error:
Caught signal 4.  Server aborting

Aborted

I have tried xorg-server 1.1.1-r1, 1.1.1-r4, and 1.2.0-r1.

Reproducible: Always

Steps to Reproduce:
1. Install Gentoo 2006.1 via the handbook
2. Reboot and emerge gnome
3. emerge the proper NVIDIA drivers (vesa and nv fail as well)
4. /usr/bin/X

Actual Results:  
X failed to launch.
When using the binary NVIDIA drivers I did see the NVIDIA splash.

Expected Results:  
X should work.

Comment 1 Jakub Moc (RETIRED) gentoo-dev

2007-03-02 02:48:52 UTC

Not baselayout.

Comment 2 Jakub Moc (RETIRED) gentoo-dev

2007-03-02 02:50:26 UTC

Don't waste people's time by referring to forums.g.o. for bug descriptions. Reopen w/ nvidia-drivers version, Xorg.0.log, xorg.conf and emerge --info output.

Comment 3 Ian Hands 2007-03-02 03:18:16 UTC

Created attachment 111761 [details]
output of cat on requested files...

Comment 4 Ian Hands 2007-03-02 03:18:48 UTC

Requested info attached.

Not to be a prick but if you had WASTED 2 seconds looking at my FORUM POST..... you would see that this stuff is already posted there! (I figured it would be more readable on the forum)

And keep in mind it fails the same way with nv, vesa, and nvidia.

Comment 5 Jakub Moc (RETIRED) gentoo-dev

2007-03-02 09:11:15 UTC

(In reply to comment #4)
> Not to be a prick but if you had WASTED 2 seconds looking at my FORUM POST.....
> you would see that this stuff is already posted there! (I figured it would be
> more readable on the forum)

Yeah, and is completely useless there when someone's searching bugzilla.

Comment 6 Ian Hands 2007-03-02 17:54:58 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > Not to be a prick but if you had WASTED 2 seconds looking at my FORUM POST.....
> > you would see that this stuff is already posted there! (I figured it would be
> > more readable on the forum)
> 
> Yeah, and is completely useless there when someone's searching bugzilla.
> 

Good point..... Well any ideas on what is up here?

Comment 7 Cornelius Weig 2007-03-03 00:28:44 UTC

how come you have a pentium processor, but the 3dnow USE flag is set? This is an AMD-specific optimization.

Comment 8 Ian Hands 2007-03-03 03:52:41 UTC

(In reply to comment #7)
> how come you have a pentium processor, but the 3dnow USE flag is set? This is
> an AMD-specific optimization.
> 
Good eye! This must have slipped in the make.conf I copied over from another box.... Hmmm I'd bet my other P4 box has this also... Whoa sse2 doesn't belong here either. Well I'm rebuilding the world file as I type this... Thanks. 

Could this be the cause?? Does xorg-server use these flags? Either way I'm rebuilding.... I'll see in a day or two I guess.

Just when I thought I had a grip on things.... The noob monster strikes!

Comment 9 Joshua Baergen (RETIRED) gentoo-dev

2007-03-03 15:29:15 UTC

(In reply to comment #8)
> Could this be the cause?? Does xorg-server use these flags? Either way I'm
> rebuilding.... I'll see in a day or two I guess.

It could be the cause, and xorg-server doesn't have to use these flags to be affected by them.

Comment 10 Ian Hands 2007-03-04 15:40:09 UTC

> 
> It could be the cause, and xorg-server doesn't have to use these flags to be
> affected by them.
> 
I figured that might be the case. Unfortunately I recompiled the system and world file and still get the same error.

Comment 11 Kevin Pyle 2007-03-11 18:34:43 UTC

Signal 4 is SIGILL, illegal instruction.  This is seen only in two cases:

(1) A wild jump caused the process to try to execute content that is not actually code.
(2) The process is using a function built for a newer processor.  For instance, the "conditional move" instruction introduced in the i686 family will cause this type of crash when executed on an i586 or below.

Based on the observation in comment #7, I am leaning toward the latter.  Please attach the contents of "cat /proc/cpuinfo".  I note from your emerge --info that you are using a Celeron processor, but there is insufficient information to tell which Celeron.  The notes on http://gentoo-wiki.com/Safe_Cflags indicate that some Celerons need a -march of pentium2, whereas others can take higher values.  You are using pentium3.

Also, if you have sys-devel/gdb installed, please use it to disassemble the faulting instruction.  Invoke gdb as: "gdb /usr/bin/X <path-to-core-file>".  When gdb prompts, enter "disassemble 0x80c4294 0x80c42b4".  Then post the resulting output.
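
As a quick cross-check of the signal number mentioned at the top of this comment, the mapping can be looked up with Python's standard signal module. This is only a sketch: describe_signal is a hypothetical helper name, and the result reflects the platform the script runs on (on Linux, signal 4 is SIGILL).

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    return signal.Signals(number).name


# On Linux this prints SIGILL for the "Caught signal 4" in the server log.
print(describe_signal(4))
```

The same lookup works for any other number that shows up in an X server "Caught signal N" message.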

Comment 12 Ian Hands 2007-03-11 23:58:01 UTC

kidsbox13 ~ # cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Celeron(TM) CPU                1300MHz
stepping        : 1
cpu MHz         : 1292.674
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 2586.07
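
To connect this output with comment #7, here is a rough sketch that compares CPU-related USE flags against the flags line above. The list in original_use is hypothetical, reconstructed only from what comments #7 and #8 mention (3dnow and sse2 set on this box); the helper name and the assumption that these USE flag names match the /proc/cpuinfo flag names are illustrative.

```python
# The "flags" line posted above, copied verbatim.
CPUINFO_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov "
    "pat pse36 mmx fxsr sse"
)

# USE flags that name CPU instruction-set extensions (illustrative subset).
CPU_USE_FLAGS = {"mmx", "sse", "sse2", "3dnow"}


def unsupported_use_flags(use_flags, cpu_flags_line):
    """Return CPU-related USE flags that the processor does not advertise."""
    cpu_flags = set(cpu_flags_line.split())
    return sorted(
        flag for flag in use_flags
        if flag in CPU_USE_FLAGS and flag not in cpu_flags
    )


# Hypothetical USE list, based only on comments #7 and #8.
original_use = ["X", "3dnow", "mmx", "sse", "sse2"]
print(unsupported_use_flags(original_use, CPUINFO_FLAGS))  # ['3dnow', 'sse2']
```

Flags that are not CPU extensions (such as X) are ignored, so the result only lists candidates for SIGILL-type trouble.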

I have always assumed that it was Coppermine (P3) based... I think the fastest pre-Coppermine Celerons were in the 500 MHz range. (though I would be happy if I was wrong!)

I'm headed out to eat but as soon as I get back I'll try and run gdb as described.

"gdb /usr/bin/X <path-to-core-file>". 

I am assuming that <path-to-core-file> is the path to a file that dumps the debug info(??). Please elaborate on that.

Comment 13 Ian Hands 2007-03-12 03:33:55 UTC

Created attachment 113034 [details]
Disassembler dump

Comment 14 Ian Hands 2007-03-12 03:36:51 UTC

Ok I'm not sure about the <path-to-core-file> part (excuse my ignorance).. but I am seeing some new output when using gdb.

First off I am getting:
Program received signal SIGILL, Illegal instruction.
[Switching to Thread -1211009360 (LWP 21743)]
0xb7f32160 in FontParseXLFDName () from /usr/lib/libXfont.so.1

That is from just running X under gdb. I have attached a disassembler dump of the address reported above (well, not the same address, but a dump from the same error).

Comment 15 Kevin Pyle 2007-03-12 04:38:12 UTC

Your disassembler dump does not match up with any of the addresses you have posted as faulting. Are you sure you disassembled the right area?

When a program crashes on Unix, the kernel may (depending on various settings) save a memory dump of the program as it was at the time of the crash. This is referred to as "dumping core" and the resulting file is usually called a "core file." The X server may not be dumping core, depending on how it dies and whether its signal handlers are executing correctly. For this type of failure, it is possible that the server is taking a secondary fault in its signal handler, which would most likely kill it completely.

Running gdb on a live X server is just as good for our purposes, though it can be dangerous if the debugger breaks in while your console is not in a usable state. Your particular crash seems to be at a safe time. You could reduce the risk by connecting to the system over ssh and running gdb in the resulting session.

So far, I have not found any definitive information on what Pentium family your Celeron belongs in. The closest hit is on the Talk page of the Gentoo Safe_Cflags document referenced in comment #11. Under the banner <http://gentoo-wiki.com/Talk:Safe_Cflags#Suggested_CFLAGS_not_safe_for_my_Intel.28R.29_Celeron.28R.29_processor>, someone writes that he had to back down to -mcpu=i686 to get a working build. This is probably an overly cautious setting, since -mcpu affects tuning, but does not grant gcc the liberty to use newer instructions. However, lacking any clear answer as to what instructions your Celeron really handles, I would suggest to change the -march to something very safe, like -march=pentium. Then rebuild the package which provides the library which gdb identifies as failing. If that works, you should probably re-emerge everything else to clean out any other uses of unsafe instructions.
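
One way to reason about which -march values are safe is to compare what each setting lets gcc assume against the flags line from comment #12. The sketch below does that with a deliberately simplified table; the table contents are an assumption based on the GCC x86 option descriptions (only a handful of extensions are tracked), and missing_features is a hypothetical helper name.

```python
# The "flags" line from comment #12, copied verbatim.
CPUINFO_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov "
    "pat pse36 mmx fxsr sse"
)

# Simplified assumption: extensions gcc may use for each -march value.
MARCH_REQUIRES = {
    "pentium": set(),
    "pentium-mmx": {"mmx"},
    "i686": {"cmov"},
    "pentium2": {"cmov", "mmx", "fxsr"},
    "pentium3": {"cmov", "mmx", "fxsr", "sse"},
    "pentium4": {"cmov", "mmx", "fxsr", "sse", "sse2"},
}


def missing_features(march, cpu_flags_line):
    """Return the extensions a -march value assumes but the CPU lacks."""
    cpu_flags = set(cpu_flags_line.split())
    return sorted(MARCH_REQUIRES[march] - cpu_flags)


for march in MARCH_REQUIRES:
    print(march, missing_features(march, CPUINFO_FLAGS))
```

By this table, the posted flags cover everything up to -march=pentium3, and only pentium4 would need something the CPU lacks (sse2). That says nothing about how the installed binaries were actually built, so it does not rule out leftovers from an earlier, more aggressive setting.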

Comment 16 Ian Hands 2007-03-12 21:31:09 UTC

(In reply to comment #15)
> Your disassembler dump does not match up with any of the addresses you have
> posted as faulting.  Are you sure you disassembled the right area?
The dump I posted came from the line that reads:
0xb7f32160 in FontParseXLFDName () from /usr/lib/libXfont.so.1

But the address changes each time it runs... (allocating a new spot in mem, right?)

Also running /usr/bin/X gives the same output but two different memory addresses (but they stay the same with consecutive runs):

Could not init font path element /usr/share/fonts/misc/, removing from list!

Backtrace:
0: X(xf86SigHandler+0x84) [0x80c4264]
1: [0xb7fde420]

Fatal server error:
Caught signal 4.  Server aborting

I guess you are looking for disassemble dumps from these two? I'll post them.

> 
> When a program crashes on Unix, the kernel may (depending on various settings)
> save a memory dump of the program as it was at the time of the crash.  This is
> referred to as "dumping core" and the resulting file is usually called a "core
> file."  The X server may not be dumping core, depending on how it dies and
> whether its signal handlers are executing correctly.  For this type of failure,
> it is possible that the server is taking a secondary fault in its signal
> handler, which would most likely kill it completely.

Great stuff... Thanks! I'm at a loss as to where the core is being dumped (if it is). Is it dumping to a plain text log file? Sorry again for the noob questions.

> 
> Running gdb on a live X server is just as good for our purposes, though it can
> be dangerous if the debugger breaks in while your console is not in a usable
> state.  Your particular crash seems to be at a safe time.  You could reduce the
> risk by connecting to the system over ssh and running gdb in the resulting
> session.
> 
That works for me... The box is in Louisiana and I'm in North Carolina. (It's for some kids to play on, so it does need X.) SSH is the only way I have access to it!

> So far, I have not found any definitive information on what Pentium family your
> Celeron belongs in.  The closest hit is on the Talk page of the Gentoo
> Safe_Cflags document referenced in comment #11.  Under the banner
> <http://gentoo-wiki.com/Talk:Safe_Cflags#Suggested_CFLAGS_not_safe_for_my_Intel.28R.29_Celeron.28R.29_processor>,
> someone writes that he had to back down to -mcpu=i686 to get a working build. 
> This is probably an overly cautious setting, since -mcpu affects tuning, but
> does not grant gcc the liberty to use newer instructions.  However, lacking any
> clear answer as to what instructions your Celeron really handles, I would
> suggest to change the -march to something very safe, like -march=pentium.  Then
> rebuild the package which provides the library which gdb identifies as failing.
>  If that works, you should probably re-emerge everything else to clean out any
> other uses of unsafe instructions.
> 

With that I'm gonna recompile with -march=i686 and be done with that mess (I'm fairly certain it's a P3-based Celeron though; either way i686 is good for now.)

Let me know if I'm misinterpreting anything you say... Or if there is anything I can post that would help you help me.

Thanks,
-Ian

Comment 17 Ian Hands 2007-03-12 22:01:18 UTC

Created attachment 113113 [details]
Disassemble on 0x80c4264

I have done a:
disassemble 0x80c4264 0xb7fde420

also but it generated a 16 MB file... I think I'm getting the syntax wrong on that one. Am I really supposed to be disassembling from 0x80c4264 to 0xb7fde420? That is what is being dumped.

Thanks Again,
-Ian

Comment 18 Kevin Pyle 2007-03-14 00:34:53 UTC

In order of posting:

The address may change each time the kernel maps it into a new process. To reliably locate the same region, you will need to refer to it symbolically, which will have gdb find where the function is this time around.

The output from /usr/bin/X is probably consistent because /usr/bin/X is not a position-independent executable, so it is mapped at the same address with every run. The security people frown on non-PIEs, but that does not affect your current problem.
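
The two library addresses that appear in this bug illustrate the point. The sketch below assumes 4 KiB pages (the usual size on x86 Linux) and assumes that the address in the title of the first disassembler attachment is the fault address from a different run; page_offset is a hypothetical helper name.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as is usual on x86 Linux


def page_offset(address: int) -> int:
    """Return the offset of an address within its memory page."""
    return address % PAGE_SIZE


run_a = 0xb7f32160  # faulting address reported in comment #14
run_b = 0xb7f61160  # address in the title of the first disassembler attachment

print(hex(page_offset(run_a)), hex(page_offset(run_b)))  # 0x160 0x160
print(hex(run_b - run_a))  # 0x2f000
```

The addresses differ by a whole number of pages while the offset within the page is identical, which is what one would expect if the same instruction in libXfont was simply mapped at a different base address. A symbolic name such as FontParseXLFDName stays valid across runs for exactly that reason.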

Core dumping depends on the X server configuration. I don't recall if the default is to dump core or not. For programs which dump core, it is traditionally named "core" and placed in the current working directory. It is not a text file, but rather a binary file representing the program's CPU and memory state. You need a dedicated tool, such as gdb, to analyze it.
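
Apart from the X server's own configuration, one of the settings that decides whether a core file can be written is the per-process core size limit (what the shell reports via "ulimit -c"). A minimal sketch for inspecting it with Python's standard resource module follows; the helper names are hypothetical, and a non-zero limit is only a necessary condition, not a guarantee that X will dump core.

```python
import resource


def core_dump_limits():
    """Return the (soft, hard) core file size limits for this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def core_dumps_possible() -> bool:
    """A soft limit of 0 means the kernel will not write a core file."""
    soft, _hard = core_dump_limits()
    return soft != 0


soft, hard = core_dump_limits()
print("soft limit:", soft, "hard limit:", hard)
print("core dumps possible:", core_dumps_possible())
```

A value equal to resource.RLIM_INFINITY means the size is unlimited. The limit is inherited by child processes, so it has to be checked in the shell or script that actually starts the X server.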

You're following along quite well. The only information you haven't provided is the disassembly of the faulting instruction, but that's not that important for dealing with this.

When you run disassemble with two arguments, it treats the arguments as a [start,stop] pair and disassembles all memory between them. I specified the two operand form to get around gdb trying to disassemble the entire containing function. I intended that the second operand be ~32 bytes higher than the first, rather than covering a huge range of the address space. I requested a disassembly primarily to try to identify which instruction was causing the SIGILL, in hopes that would hint which gcc setting had caused the unusable binary.
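
The arithmetic behind the intended range is simple enough to sketch. The helper names below are hypothetical; the command string follows the space-separated syntax used in comment #11 (newer gdb releases, as far as I know, expect a comma between the two addresses).

```python
def disassemble_command(fault_address: int, span: int = 32) -> str:
    """Build a gdb 'disassemble START STOP' command covering `span` bytes."""
    return f"disassemble {fault_address:#x} {fault_address + span:#x}"


def range_size(start: int, stop: int) -> int:
    """Return the number of bytes between two addresses."""
    return stop - start


print(disassemble_command(0x80c4294))  # disassemble 0x80c4294 0x80c42b4
print(disassemble_command(0x80c4264))  # disassemble 0x80c4264 0x80c4284
print(range_size(0x80c4264, 0xb7fde420))
```

The first line reproduces the command from comment #11. The last line shows why the attempt in comment #17 produced such a large file: the two backtrace addresses are well over 2 GiB apart.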

If the problem goes away after you rebuild with less aggressive settings, then don't worry about getting the disassembly. Also, please resolve this bug as invalid if the problem goes away with the -march changes.

Comment 19 Ian Hands 2007-03-16 21:37:42 UTC

Please advise on further debugging.
I've rebuilt system twice and world twice! X is still a no-go.

kidsbox13 rebuildlogs # emerge --info
Portage 2.1.2.2 (default-linux/x86/2006.1, gcc-4.1.1, glibc-2.5-r0, 2.6.18-gentoo-r6 i686)
=================================================================
System uname: 2.6.18-gentoo-r6 i686 Intel(R) Celeron(TM) CPU                1300MHz
Gentoo Base System release 1.12.9
Timestamp of tree: Fri, 09 Mar 2007 11:50:01 +0000
distcc 2.18.3 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
dev-java/java-config: 1.3.7, 2.0.31
dev-lang/python:     2.4.3-r4
dev-python/pycrypto: 2.0.1-r5
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.61
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10
sys-devel/binutils:  2.16.1-r3
sys-devel/gcc-config: 1.3.14
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.17-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=i686 -O2 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/X11/xkb"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/java-config/vms/ /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-march=i686 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks metadata-transfer nostrip sandbox sfperms strict"
GENTOO_MIRRORS="http://gentoo.osuosl.org/ http://distro.ibiblio.org/pub/linux/distributions/gentoo/"
LINGUAS="en en_GB"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --filter=H_**/files/digest-*"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://joshlap/gentoo-portage"
USE="X aac aalib alsa asf asm berkdb bitmap-fonts cairo cdr cli cracklib crypt cups dedicated dri esd ffmpeg firefox flac fortran gdbm gnome gpm gtk gtk2 hal howl iconv ipv6 isdnlog java jpeg libg++ lm_sensors mad midi mmx mpeg ncurses nfs nls no-nptl nptl nptlonly nsplugin nvidia ogg opelgl pam pcre pdf perl png ppds pppd python readline reflection rouge samba sdl session spl sse ssl tcpd teamarena theora tiff truetype-fonts type1-fonts unicode usb v4l vorbis win32codecs x86 xatrix xine xorg zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" ELIBC="glibc" INPUT_DEVICES="keyboard mouse" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en en_GB" USERLAND="GNU" VIDEO_CARDS="vesa nv nvidia"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, PORTAGE_RSYNC_EXTRA_OPTS, PORTDIR_OVERLAY

Comment 20 Donnie Berkholz (RETIRED) gentoo-dev

2007-03-21 18:02:52 UTC

To make X dump, add this (pasted from xorg.conf man page) to a ServerFlags section of xorg.conf:

       Option "NoTrapSignals"  "boolean"
              This  prevents  the  Xorg  server from trapping a range of unex-
              pected fatal signals and exiting  cleanly.   Instead,  the  Xorg
              server  will  die  and  drop core where the fault occurred.  The
              default behaviour is for the Xorg server to  exit  cleanly,  but
              still  drop  a core file.  In general you never want to use this
              option unless you are debugging an Xorg server problem and  know
              how to deal with the consequences.
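
As a sketch only, a ServerFlags section that turns this option on might look like the fragment below. The value "true" is an assumption about how the boolean is written; the man page excerpt above gives only the option name and its type.

```
Section "ServerFlags"
    Option "NoTrapSignals" "true"
EndSection
```

If xorg.conf already contains a ServerFlags section, the Option line would go inside that existing section rather than in a second one.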

Comment 21 Ian Hands 2007-03-22 01:28:23 UTC

(In reply to comment #20)
> To make X dump, add this (pasted from xorg.conf man page) to a ServerFlags
> section of xorg.conf:
> 
>        Option "NoTrapSignals"  "boolean"
>               This  prevents  the  Xorg  server from trapping a range of unex-
>               pected fatal signals and exiting  cleanly.   Instead,  the  Xorg
>               server  will  die  and  drop core where the fault occurred.  The
>               default behaviour is for the Xorg server to  exit  cleanly,  but
>               still  drop  a core file.  In general you never want to use this
>               option unless you are debugging an Xorg server problem and  know
>               how to deal with the consequences.
> 

Oh no.... Apparently the box is now gone! It was for my fiancée's little sister, and her mother (in the middle of moving) might have gotten rid of it. If it is not gone I would love to continue finding a solution to this bug, but I'll have to ask... It is not responding to my ssh attempts.

I guess I'll leave it marked as NEW, but if anyone feels the need to change the status to more appropriately reflect the situation please do so.

I'll call there sometime and see if they still have it; chances are slim though.

Comment 22 Ian Hands 2007-03-22 01:31:41 UTC

Oh and thanks for all of the help! You guys/gals are great. This wonderful distro would be nothing without you all.

Thanks for your time,
-Ian

Comment 23 Joshua Baergen (RETIRED) gentoo-dev

2007-04-06 20:41:43 UTC

(In reply to comment #22)
> Oh and thanks for all of the help! You guys/gals are great. This wonderful
> distro would be nothing without you all.

Thanks :)

We'll close the bug as NEEDINFO for now.