Bug 360513 - [4.6/4.7] sys-boot/grub-0.97 fails to boot when built w/ >=sys-devel/gcc-4.6.0
Summary: [4.6/4.7] sys-boot/grub-0.97 fails to boot when built w/ >=sys-devel/gcc-4.6.0
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system (show other bugs)
Hardware: AMD64 Linux
Importance: Normal blocker with 2 votes (vote)
Assignee: Gentoo Toolchain Maintainers
URL:
Whiteboard:
Keywords: Bug
Duplicates: 375651 (view as bug list)
Depends on:
Blocks: gcc-4.6
Reported: 2011-03-26 07:58 UTC by Ryan Hill (RETIRED)
Modified: 2012-10-11 11:57 UTC (History)
35 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
012_all_grub-0.97-gcc46.patch (012_all_grub-0.97-gcc46.patch,453 bytes, patch)
2011-06-25 07:49 UTC, Ryan Hill (RETIRED)
Details | Diff
grub gcc4.6.2 test compile (test,134.78 KB, text/plain)
2011-11-28 18:15 UTC, godmachine (Lance Poore)
Details
compil log with 4.5.3-r1 (sys-boot:grub-0.97-r10:20111128-183834_gcc4-6-2.log,145.23 KB, text/plain)
2011-11-28 18:45 UTC, bdouxx
Details
compil log with 4.6.2 (sys-boot:grub-0.97-r10:20111128-183834_gcc4-6-2.log,145.23 KB, text/plain)
2011-11-28 18:46 UTC, bdouxx
Details
Build.log (Build.log,5.72 KB, text/plain)
2012-01-06 16:45 UTC, Piotr Szymaniak
Details
automake.out (automake.out,588 bytes, text/plain)
2012-01-06 16:45 UTC, Piotr Szymaniak
Details
build.log with 4.6.2 (build.log,147.28 KB, text/plain)
2012-01-22 18:00 UTC, Jason Lynch
Details
Ubuntu patch for this problem from GRUB HEAD r803 (803_802.diff,1.33 KB, patch)
2012-01-30 01:05 UTC, Richard Yao (RETIRED)
Details | Diff
test script (test.sh,1.85 KB, text/plain)
2012-05-04 06:41 UTC, SpanKY
Details
905_all_grub-0.97-revert_1tb_limit_gcc46.patch (905_all_grub-0.97-revert_1tb_limit_gcc46.patch,4.25 KB, patch)
2012-05-21 19:23 UTC, Ryan Hill (RETIRED)
Details | Diff
906_all_grub-0.97-gcc46.patch (906_all_grub-0.97-gcc46.patch,17.98 KB, patch)
2012-05-26 02:15 UTC, Ryan Hill (RETIRED)
Details | Diff
fix infinite boot with >=gcc-4.6 (905_all_grub-0.97-gcc46.patch,1.68 KB, patch)
2012-05-26 09:09 UTC, Kacper Kowalik (Xarthisius) (RETIRED)
Details | Diff

Description Ryan Hill (RETIRED) gentoo-dev 2011-03-26 07:58:44 UTC
After building grub-0.97-r10 with GCC 4.6 I get:

# grub-install /dev/sda


    GNU GRUB  version 0.97  (640K lower / 9216K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename. ]
grub> root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
grub> setup  --stage2=/boot/grub/stage2 --prefix=/grub (hd0)
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  24 sectors are embedded.
succeeded
 Running "install --stage2=/boot/grub/stage2 /grub/stage1 (hd0) (hd0)1+24 p (hd0,0)/grub/stage2 /grub/menu.lst"... failed

Error 6: Mismatched or corrupt version of stage1/stage2
grub> quit


Copying over stage2 from a good build fixes it.  

If I use -fno-inline-small-functions it works, which makes me believe this is http://gcc.gnu.org/PR39333 and that option just papers over whatever the real problem is.  As this is difficult to debug and upstream seems uninterested, we might want to just pass one of the options that make it work.
Comment 1 Ryan Hill (RETIRED) gentoo-dev 2011-03-27 12:44:37 UTC
After running through all the differences between -O1 and -O2 it looks like the root cause is -freorder-functions.
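
A rough sketch of how that comparison can be repeated. It assumes a GCC that supports `-Q --help=optimizers` (which prints each optimizer flag together with its enabled/disabled state); the helper name `changed_flags` and the file names are made up for illustration.

```shell
#!/bin/sh
# changed_flags FILE_A FILE_B
# Print every -f flag whose state differs between two saved listings
# produced by "gcc -Q -O<level> --help=optimizers".
changed_flags() {
    awk '
        # First file: remember the state of each -f flag.
        NR == FNR { if ($1 ~ /^-f/) before[$1] = $2; next }
        # Second file: report flags whose state changed.
        $1 ~ /^-f/ && before[$1] != $2 { print $1, before[$1], "->", $2 }
    ' "$1" "$2"
}

# Typical use (hypothetical file names):
#   gcc -Q -O1 --help=optimizers > O1.txt
#   gcc -Q -O2 --help=optimizers > O2.txt
#   changed_flags O1.txt O2.txt
```

Each flag reported this way can then be toggled on its own to narrow down which one changes the result.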
Comment 2 Billy DeVincentis 2011-04-09 20:13:06 UTC
I had the same problem upon trying the new gcc. Couldn't boot into the system - couldn't do anything.
Comment 3 Billy DeVincentis 2011-04-18 20:43:28 UTC
Could you please add a patch for the current ebuild or a modified ebuild as this is probably the main reason I am not updating to gcc-4.6.0 

Also, is grub2 affected by this?
Comment 4 Ryan Hill (RETIRED) gentoo-dev 2011-04-18 21:42:25 UTC
it's not that easy to do.  we have to unset CFLAGS when building grub.
Comment 5 Philipp 2011-04-18 21:55:02 UTC
(In reply to comment #3)
> Could you please add a patch for the current ebuild or a modified ebuild as
> this is probably the main reason I am not updating to gcc-4.6.0 
> 
> Also, is grub2 affected by this?

No, grub2 works fine.
Comment 6 Billy DeVincentis 2011-04-19 00:51:20 UTC
Just switched to grub2 and will soon try system rebuild with 4.6.0
Comment 7 Mathias 2011-06-20 06:33:26 UTC
grub-static-0.97-r10 works fine
Comment 8 Xake 2011-06-20 18:14:51 UTC
(In reply to comment #7)
> grub-static-0.97-r10 works fine

That may be because grub-static is something pre-compiled, and thus would not have any problems with your currently installed version of GCC.
Comment 9 Ryan Hill (RETIRED) gentoo-dev 2011-06-25 07:49:59 UTC
Created attachment 278079 [details, diff]
012_all_grub-0.97-gcc46.patch

This works around the problem without messing up all the -fno-stack-protector crap.  -freorder-functions isn't the root cause, but it's the only flag I could find that worked with both -Os and -O2.
Comment 10 Rafał Mużyło 2011-07-19 14:32:04 UTC
*** Bug 375651 has been marked as a duplicate of this bug. ***
Comment 11 Ryan Hill (RETIRED) gentoo-dev 2011-09-07 04:00:32 UTC
Would anyone freak out if i applied this?
Comment 12 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2011-09-07 04:21:11 UTC
dirtyepic: go ahead and commit it, without a revbump please. I've got some other grub1 changes for a final release before I start to direct users to GRUB2 instead.
Comment 13 Ryan Hill (RETIRED) gentoo-dev 2011-09-10 02:33:25 UTC
Thanks.  I spun patchset 1.11 and stuck it on the mirrors (i'm not using devspace for a one-off like this) and updated 0.97-r10 to use it.
Comment 14 Robert Cabrera 2011-09-15 06:51:32 UTC
I tried this patch and later the latest improved ebuild. It now compiles and installs, but it won't boot. I'm using GCC-4.6.1, so there is something still wrong here.

Instead, on my ~amd64 multilib Dell laptop, I get the Dell boot-up splash screen, then a brief flicker where grub mentions something about stage 1.5, then it cycles back to the Dell screen. It will repeat this continuously until I manually turn off the machine.

I've chrooted in and rebuilt Grub with CFLAGS="-march=native -O2 -pipe", CFLAGS="-march=native -Os -pipe", CFLAGS="-march=native -O1 -pipe", CFLAGS="-march=native -O0 -pipe", and finally CFLAGS="". All with the same result.

Finally, I gave up and just installed grub-static, which has at least allowed me to boot my machine without needing to chroot into it.

My  emerge --info
Portage 2.1.10.16 (default/linux/amd64/10.0/desktop/kde, gcc-4.6.1, glibc-2.13-r4, 3.0.2-pf x86_64)
=================================================================
System uname: Linux-3.0.2-pf-x86_64-Intel-R-_Core-TM-2_CPU_T7200_@_2.00GHz-with-gentoo-2.0.3
Timestamp of tree: Wed, 14 Sep 2011 08:15:01 +0000
app-shells/bash:          4.2_p10
dev-java/java-config:     2.1.11-r3
dev-lang/python:          2.7.2-r2, 3.2-r2
dev-util/cmake:           2.8.5-r2
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.0.3
sys-apps/openrc:          0.9.3-r1
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.13, 2.68
sys-devel/automake:       1.9.6-r3, 1.10.3, 1.11.1-r1
sys-devel/binutils:       2.21.1-r1
sys-devel/gcc:            4.6.1-r1
sys-devel/gcc-config:     1.5-r1
sys-devel/libtool:        2.4-r1
sys-devel/make:           3.82-r1
sys-kernel/linux-headers: 2.6.39 (virtual/os-headers)
sys-libs/glibc:           2.13-r4
Repositories: gentoo
ACCEPT_KEYWORDS="amd64 ~amd64"
ACCEPT_LICENSE="*"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="!* /etc /usr/share/config /usr/share/gnupg/qualified.txt /usr/share/themes/oxygen-gtk/gtk-2.0 /var/lib/hsqldb"
CONFIG_PROTECT_MASK="!* /etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/splash /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="assume-digests binpkg-logs candy collision-protect distlocks fail-clean fixlafiles fixpackages multilib-strict news parallel-fetch parallel-install protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
FFLAGS="-march=native -O2 -pipe"
GENTOO_MIRRORS="http://gentoo.netnitco.net http://gentoo.osuosl.org/ http://gentoo.mirrors.tds.net/gentoo http://mirror.csclub.uwaterloo.ca/gentoo-distfiles/ http://gentoo.wetzlmayr.com/ http://osmirrors.cerias.purdue.edu/pub/gentoo/ http://www.cyberuse.com/gentoo/ http://gentoo.mirrors.hoobly.com/ ftp://gentoo.imj.fr/pub/gentoo/ http://130.59.10.35/ftp/mirror/gentoo/"
LANG="en_US.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,-O1 -Wl,--as-needed"
LINGUAS="en en_US"
MAKEOPTS="-j3 -s"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa amd64 attica avahi berkdb bidi bittorrent bluetooth bluray branding bzip2 cairo cdda cddb cdr chm cli consolekit cracklib crypt cups curl cxx dbus declarative dell designer-plugin desktopglobe djvu dri dts dvd dvdr ebook emboss emovix encode exif fam fbcondecor fbsplash ffmpeg firefox fits flac fortran gdbm gdu gif glibc-omitfp gnutls gphoto2 gps groupwise httpd iconv ieee1394 imagemagick indi ipv6 java java6 javascript jce jpeg jpeg2k kde kipi lame laptop latex lcms ldap libnotify live lm_sensors lzma mad matroska mdnsresponder-compat meanwhile mms mmx mng modplug modules mp3 mp4 mpeg msn mudflap multilib musepack musicbrainz ncurses nls nptl nptlonly nsplugin ntp nvidia ogg openexr opengl openmp oscar otr pam pango parse-clocks pcre pdf perl plasma pm-utils pmu png policykit ppds pppd ps python python3 qalculate qt3support qt4 qwt rdesktop readline samba scanner schroedinger scim sdl semantic-desktop session skype smp sms sndfile solver sox spell sqlite sse sse2 sse3 sse5 ssl ssse3 startup-notification stream svg sysfs taglib tcpd templates theora thumbnail tidy tiff timidity truetype twolame udev unicode upnp usb vcd vcdx video vlc vlm vnc vorbis wavpack webpresence wicd wifi winpopup x264 xcb xcomposite xine xinerama xml xorg xscreensaver xulrunner xv xvid xvmc yahoo zeroconf zlib" ALSA_CARDS="hda-intel" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" 
CALLIGRA_FEATURES="kexi words flow plan stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="evdev keyboard mouse synaptics joystick" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en en_US" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18" SANE_BACKENDS="epson epson2" USERLAND="GNU" VIDEO_CARDS="nvidia nv" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Comment 15 Ryan Hill (RETIRED) gentoo-dev 2011-09-15 17:03:12 UTC
And 4.5 works?
Comment 16 Robert Cabrera 2011-09-15 18:57:35 UTC
(In reply to comment #15)
> And 4.5 works?

It did before, but I was running ~x86. I'm running ~amd64 now.

I'm running a fresh installation I began installing 2 weeks ago. This laptop contains a core2 duo processor. Previously, I just imaged the Gentoo installation I've been running since 2004 and just recompiled to this processor and rebuilt my kernel to support the different drivers. So even though my laptop is five years old my installation is over 8.

I upgraded to 4 gigs of memory 3 weeks ago to begin preparing for this change in CHOST.

With this install I took my old world file, my make.conf, the files I had in /etc/portage and all other pertinent configuration files as guides. Backed up my old installation then began a fresh from scratch one using them as references. 

After bootstrapping I decided that since 4.6.x would probably be in the tree by year's end, I'd save myself a bunch of compiling later by just making the switch now. Looking through the bug lists and in the forum I didn't see any show stoppers that affected me. So I took a chance.

In fact, other than this grub issue it's been a smooth transition. I've had no compilation failures and every app seems to be behaving as it should.

I hope this helps
Comment 17 Ryan Hill (RETIRED) gentoo-dev 2011-09-16 03:11:38 UTC
I will need you to try with 4.5 so we know if this is the same issue or an unrelated bug.
Comment 18 Robert Cabrera 2011-09-17 02:08:44 UTC
(In reply to comment #17)
> I will need you to try with 4.5 so we know if this is the same issue or an
> unrelated bug.

Ryan, FYI prior to using the new patches I was getting the stage mismatch error mentioned here and in the other grub w/GCC-4.6 bug reports. I was able to compile grub but it wouldn't install.

As I said previously, now I can get it to compile and install, but won't boot properly.

And as I mentioned, it's now booting fine with grub-static.

I hope this gives you a little clarification. TIA

Rob
Comment 19 Robert Cabrera 2011-09-17 02:15:16 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > I will need you to try with 4.5 so we know if this is the same issue or an
> > unrelated bug.
> 
> Ryan, FYI prior to using the new patches I was getting the stage mismatch error
> mentioned here and in the other grub w/GCC-4.6 bug reports. I was able to
> compile grub but it wouldn't install.
> 
> As I said previously, now I can get it to compile and install, but won't boot
> properly.
> 
> And as I mentioned, it's now booting fine with grub-static.
> 
> I hope this gives you a little clarification.TIA
> 
> Rob

Sorry, but I didn't think through to ask what is the proper way to do this. I'm a truck driver not a programmer.

If I install GCC-4.5.X do I have to rebuild my whole toolchain and system against it? Or can I just install it and some way tell portage to use it to compile grub?

TIA

Rob
Comment 20 Ryan Hill (RETIRED) gentoo-dev 2011-09-22 00:25:51 UTC
No, you can install it then switch to 4.5 using gcc-config.  emerge grub, then switch back to 4.6 with gcc-config.  Run grub-install with your boot drive as the argument (eg. for me it's grub-install /dev/sda), the same as you did when you first installed grub.  If the problem is the same as this bug you might not have to go any further than that.  As you can see in the first comment, I got a "Mismatched or corrupt version of stage1/stage2" error when I tried it.  If you get it too let me know and switch back to grub-static for the meantime.  If not, try rebooting.
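
The steps above, as a minimal sketch. The profile names in the usage comment are examples only (check `gcc-config -l` for the ones on your system), and the `run` wrapper, the `rebuild_grub` function and the `DRY_RUN` variable are made up here so the sequence can be previewed without changing anything.

```shell
#!/bin/sh
# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rebuild_grub BUILD_PROFILE USUAL_PROFILE BOOT_DISK
# Build grub with one compiler, switch back, then reinstall the boot loader.
# Stops at the first step that fails.
rebuild_grub() {
    run gcc-config "$1" &&
    run emerge --oneshot sys-boot/grub &&
    run gcc-config "$2" &&
    run grub-install "$3"
}

# Preview (example profile names, not taken from this bug):
#   DRY_RUN=1 rebuild_grub x86_64-pc-linux-gnu-4.5.3 x86_64-pc-linux-gnu-4.6.1 /dev/sda
```

Keep grub-static or a bootable CD at hand before running the real thing, in case the rebuilt grub does not boot.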
Comment 21 Robert Cabrera 2011-11-06 08:18:33 UTC
Sorry it has taken so long to respond but I've been extremely busy the last 6 weeks with work.

I installed gcc-4.5.X and grub compiled, installed, and booted as expected.

So my problem is definitely related to grub being compiled with 4.6.x.

I've tried grub2 but couldn't figure out the proper syntax in grub.conf to get it to boot, so I gave up.

I switched back to grub-static and all appears to be working well with it. I'll keep using it until grub2 is unmasked and better instructions on its proper configuration are posted.

Thanks
Comment 22 Ryan Hill (RETIRED) gentoo-dev 2011-11-09 01:46:35 UTC
Reopen - other people are also seeing this but not bothering to comment.
Comment 23 Ryan Hill (RETIRED) gentoo-dev 2011-11-19 06:02:00 UTC
Can I get a build log from someone still having this problem?  One for 4.5 and one for 4.6 would be helpful.

This bug is the primary reason 4.6 isn't unmasked yet.  If we don't find a solution soon we might have to resort to something ugly like die with a message to use grub-static with >=4.6.
Comment 24 Oleh 2011-11-28 17:41:19 UTC
Works with gcc-4.6.2
Comment 25 godmachine (Lance Poore) 2011-11-28 18:15:02 UTC
Created attachment 294085 [details]
grub gcc4.6.2 test compile
Comment 26 bdouxx 2011-11-28 18:45:34 UTC
Created attachment 294087 [details]
compil log with 4.5.3-r1

gcc 4.5.3-r1: compiles OK, installs OK, and reboots OK.
gcc 4.6.2:    compiles OK, installs OK, but doesn't reboot.

Compile logs:
sys-boot:grub-0.97-r10:20111128-183834_gcc4-6-2.log
sys-boot:grub-0.97-r10:20111128-183702_gcc4-5-3.log

Do you need more info?
Comment 27 bdouxx 2011-11-28 18:46:11 UTC
Created attachment 294089 [details]
compil log with 4.6.2

gcc 4.5.3-r1: compiles OK, installs OK, and reboots OK.
gcc 4.6.2:    compiles OK, installs OK, but doesn't reboot.

Compile logs:
sys-boot:grub-0.97-r10:20111128-183834_gcc4-6-2.log
sys-boot:grub-0.97-r10:20111128-183702_gcc4-5-3.log

Do you need more info?
Comment 28 Ryan Hill (RETIRED) gentoo-dev 2011-11-29 05:48:25 UTC
So it was broken with 4.6.1 and 4.6.2 works?  I can't reproduce with either.

I did track down an Ubuntu bug report explaining that the original error was caused by a couple of functions being put into .text.unlikely and reordered before _start.  So -fno-reorder-functions was the correct fix.  From the GCC 4.6 release notes:


  * A new inter-procedural static profile estimation pass detects functions that are executed once or unlikely to be executed. Unlikely executed functions are optimized for size. Functions executed once are optimized for size except for the inner loops.

  * On most targets with named section support, functions used only at startup (static constructors and main), functions used only at exit and functions detected to be cold are placed into separate text segment subsections. This extends the -freorder-functions feature and is controlled by the same switch.


binutils-2.22 might also be enough to fix it as support was added in 2.21.51.
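
One way to check a build for this kind of reordering, sketched under assumptions: the helper name `entry_is_first` and the file name in the usage comment are illustrative, and the check only looks at symbol order as reported by `nm -n` (which sorts symbols by address).

```shell
#!/bin/sh
# entry_is_first SYMBOL
# Reads "nm -n" output on stdin and succeeds only if SYMBOL sits at the
# lowest address of all text (T/t) symbols, i.e. nothing was placed before it.
entry_is_first() {
    awk -v sym="$1" '
        $2 ~ /^[Tt]$/ {
            if (!seen) { lowest = $1; seen = 1 }   # nm -n lists the lowest address first
            if ($3 == sym) addr = $1
        }
        END {
            # Appending "" forces a string comparison of the hex addresses.
            if (addr != "" && (addr "") == (lowest "")) exit 0
            exit 1
        }
    '
}

# Typical use (hypothetical file name):
#   nm -n some_stage.exec | entry_is_first _start && echo "nothing precedes _start"
```

A failing check does not prove the image is unbootable; it only flags that some function landed ahead of the expected entry symbol.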
Comment 29 Andreas Sturmlechner gentoo-dev 2011-12-04 12:47:22 UTC
Works with gcc-4.6.2; only now did I dare to test this. (well, a security boot stick would have been in place). On an unrelated note, stage file sizes have increased 27% on average since last re-install on Oct 1st, 2010, whatever gcc was at work at that time (gcc-4.5.1 according to splat).
Comment 30 Ryan Hill (RETIRED) gentoo-dev 2012-01-05 04:33:13 UTC
I guess I should say what I'm looking for.  To recap, grub built w/ 4.6 seems to install fine but booting fails (loops?).  If you want to test it's probably a good idea to make a backup by installing to a memory stick first.  I need emerge --info and build logs for failing systems to start with.  I also want to know how many people are actually hitting this, so speak up.
Comment 31 Paolo Pedroni 2012-01-05 15:03:53 UTC
(In reply to comment #30)
> I guess I should say what I'm looking for.  To recap, grub built w/ 4.6 seems
> to install fine but booting fails (loops?).  If you want to test it's probably
> a good idea to make a backup by installing to a memory stick first.  I need
> emerge --info and build logs for failing systems to start with.  I also want to
> know how many people are actually hitting this, so speak up.

Here it seems to work, do you want emerge --info and build logs anyway?
Comment 32 Ryan Hill (RETIRED) gentoo-dev 2012-01-06 02:36:18 UTC
Nope, thanks though.
Comment 33 SpanKY gentoo-dev 2012-01-06 04:47:56 UTC
(In reply to comment #30)

or just a bootable cd from gentoo.org

fwiw, i just built grub with 4.6.2, ran grub-install, and it rebooted fine
Comment 34 Piotr Szymaniak 2012-01-06 16:45:17 UTC
Created attachment 298111 [details]
Build.log

Now, I was just about to say that it works fine for me, but… it seems to fail a bit differently than before.

Not sure if this is related to that bug, but i'm posting it here.


~ # emerge --info =sys-boot/grub-0.97-r10
Portage 2.2.0_alpha84 (default/linux/x86/10.0/desktop, gcc-4.6.2, glibc-2.13-r4, 3.2.0 i686)
=================================================================
                        System Settings
=================================================================
System uname: Linux-3.2.0-i686-Pentium-R-_Dual-Core_CPU_E5400_@_2.70GHz-with-gentoo-2.1
Timestamp of tree: Thu, 05 Jan 2012 18:00:02 +0000
distcc 3.1 i686-pc-linux-gnu [disabled]
ccache version 3.1.6 [disabled]
app-shells/bash:          4.2_p20
dev-java/java-config:     2.1.11-r3
dev-lang/python:          2.7.2-r3, 3.2.2
dev-util/ccache:          3.1.6
dev-util/cmake:           2.8.6-r4
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.1
sys-apps/openrc:          0.9.7
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.68
sys-devel/automake:       1.9.6-r3, 1.11.2
sys-devel/binutils:       2.22-r1
sys-devel/gcc:            4.6.2
sys-devel/gcc-config:     1.5-r2
sys-devel/libtool:        2.4.2
sys-devel/make:           3.82-r3
sys-kernel/linux-headers: 3.1 (virtual/os-headers)
sys-libs/glibc:           2.13-r4
Repositories: gentoo multimedia x11 sunrise mgorny roslin gamerlay-stable
Installed sets: @system
ACCEPT_KEYWORDS="x86 ~x86"
ACCEPT_LICENSE="*"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-march=native -O2 -pipe -fomit-frame-pointer"
DISTDIR="/var/tmp/distfiles"
EMERGE_DEFAULT_OPTS="--quiet-build=n"
FEATURES="assume-digests binpkg-logs distlocks ebuild-locks fixlafiles news parallel-fetch preserve-libs protect-owned sandbox sfperms sign strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
FFLAGS=""
GENTOO_MIRRORS="http://distfiles.ift.uni.wroc.pl/"
LANG="pl_PL.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
LINGUAS="pl"
MAKEOPTS="-j3"
PKGDIR="/home/p/binpkgs"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/var/lib/layman/multimedia /var/lib/layman/x11 /var/lib/layman/sunrise /var/lib/layman/mgorny /home/lazy_bum/uberlay/roslin /home/lazy_bum/uberlay/gamerlay"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa bash-completion bzip2 cairo cdda cdr cli consolekit cracklib crypt cups custom-cflags custom-cxxflags cxx dbus dri dts dvd dvdr emboss encode exif fam firefox flac fortran gdbm gdu gif gpm gtk iconv ipv6 jabber jpeg lcms libnotify mad mmx mmxext mng modules mp3 mp4 mpeg mudflap ncurses nls nptl nptlonly ogg opengl openmp pam pango pcre pdf png policykit ppds pppd qt3support qt4 readline sdl session spell sse sse2 ssl ssse3 svg sysfs tcpd tiff truetype udev unicode usb vim-syntax vorbis x264 x86 xcb xml xorg xulrunner xv xvid zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog cpu cpufreq disk hddtemp network uptime users" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="pl" PHP_TARGETS="php5-3" QEMU_SOFTMMU_TARGETS="i386" 
QEMU_USER_TARGETS="i386" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="nouveau" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

=================================================================
                        Package Settings
=================================================================

sys-boot/grub-0.97-r10 was built with the following:
USE="ncurses -custom-cflags -netboot -static"
CFLAGS=""
Comment 35 Piotr Szymaniak 2012-01-06 16:45:35 UTC
Created attachment 298113 [details]
automake.out
Comment 36 Ryan Hill (RETIRED) gentoo-dev 2012-01-07 00:34:34 UTC
That was bug #396683.
Comment 37 Piotr Szymaniak 2012-01-07 09:16:09 UTC
(In reply to comment #36)
> That was bug #396683.

I guess it should be reopened then. Posted both logs there already.

As for this bug, I've downgraded automake to 1.11.1, emerged grub and ran grub-install. Rebooted without any issues.
Comment 38 Ryan Hill (RETIRED) gentoo-dev 2012-01-07 19:44:13 UTC
Try syncing your tree.
Comment 39 Jason Lynch 2012-01-22 17:53:59 UTC
I am also experiencing this problem. Built with gcc 4.6.2, grub will install fine, but upon booting, it simply reboots over and over again.

I'm attaching the build log, and here's the emerge --info:

Portage 2.2.0_alpha84 (default/linux/amd64/10.0/desktop/gnome, gcc-4.6.2, glibc-2.14.1-r2, 3.2.1-00688-g6bacd7f x86_64)
=================================================================
                        System Settings
=================================================================
System uname: Linux-3.2.1-00688-g6bacd7f-x86_64-Intel-R-_Core-TM-2_CPU_T7200_@_2.00GHz-with-gentoo-2.1
Timestamp of tree: Sun, 22 Jan 2012 13:00:01 +0000
app-shells/bash:          4.2_p20
dev-java/java-config:     2.1.11-r3
dev-lang/python:          2.7.2-r3, 3.2.2
dev-util/cmake:           2.8.7-r1
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.1
sys-apps/openrc:          0.9.8.1
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.13, 2.68
sys-devel/automake:       1.10.3, 1.11.2-r1
sys-devel/binutils:       2.22-r1
sys-devel/gcc:            4.6.2
sys-devel/gcc-config:     1.5-r2
sys-devel/libtool:        2.4.2
sys-devel/make:           3.82-r3
sys-kernel/linux-headers: 3.2 (virtual/os-headers)
sys-libs/glibc:           2.14.1-r2
Repositories: gentoo
Installed sets: 
ACCEPT_KEYWORDS="amd64 ~amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/php/apache2-php5.4/ext-active/ /etc/php/cgi-php5.4/ext-active/ /etc/php/cli-php5.4/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
EMERGE_DEFAULT_OPTS="--with-bdeps=y"
FEATURES="assume-digests binpkg-logs distlocks ebuild-locks fixlafiles news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
FFLAGS="-march=native -O2 -pipe"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LANG="en_US.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
LINGUAS="en_US en"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa amd64 archive avahi berkdb bluetooth branding bzip2 cairo caps cdda cdr cjk cli colord consolekit cracklib crypt cups curl cxx dbus dri dts dvd dvdr eds emboss encode evo exif expat fam firefox flac fontconfig fortran gd gdbm gdu gif glade gmp gnome gnome-keyring gnome-online-accounts gpm gstreamer gtk gtk3 iconv idn imagemagick ipv6 java jpeg lcms libnotify lua lzma mad mmx mng modules mono mp3 mp4 mpeg msn mudflap multilib nautilus ncurses networkmanager nls nptl nptlonly ogg opengl openmp pam pango pcre pdf perl playlist png policykit ppds pppd pulseaudio python qt3support qt4 readline samba sasl sdl session sndfile socialweb speex spell sqlite sse sse2 ssl startup-notification subversion svg sysfs syslog tcl tcpd theora tiff tk truetype udev unicode upnp usb v4l vala vorbis wifi x264 xcb xml xmp xorg xpm xulrunner xv xvid zeroconf zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver 
oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="evdev synaptics" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en_US en" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18 ruby19" USERLAND="GNU" VIDEO_CARDS="nvidia" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

=================================================================
                        Package Settings
=================================================================

sys-boot/grub-0.97-r10 was built with the following:
USE="(multilib) ncurses -custom-cflags -netboot -static"
Comment 40 Jason Lynch 2012-01-22 18:00:47 UTC
Created attachment 299559 [details]
build.log with 4.6.2
Comment 41 Ryan Hill (RETIRED) gentoo-dev 2012-01-23 03:54:49 UTC
Thanks.  Did you run grub-install afterwards?  Are you installing to an SSD?  Do you have a separate boot partition, and is the partition table MBR or GPT?
Comment 42 Jason Lynch 2012-01-23 05:23:33 UTC
Yes, I did run grub-install after compiling. It's a regular 2.5" SATA hard drive (it is a laptop, however). It's a standard MBR partition type, with the following layout:

sda1: Windows (NTFS)
sda2: /boot (ext2)
sda3: swap
sda5: / (ext4)
sda6: /home (as a dm-crypt partition)

Filesystem type is probably irrelevant, but who knows, maybe the fact that there's a dual-boot scenario is.
Comment 43 Ryan Hill (RETIRED) gentoo-dev 2012-01-24 03:12:12 UTC
Maybe.  Did you install to the MBR (sda) or the boot partition (sda2)?  Can I see `fdisk -l` and your grub.conf?
Comment 44 Jason Lynch 2012-01-24 05:10:52 UTC
Installed to the MBR (/dev/sda).

phobos ~ # fdisk -l

Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders, total 312581808 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xbcc6bcc6

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *          63   137114774    68557356    7  HPFS/NTFS/exFAT
/dev/sda2       137114775   137243294       64260   83  Linux
/dev/sda3       137243295   145243664     4000185   82  Linux swap / Solaris
/dev/sda4       145243665   312581807    83669071+   5  Extended
/dev/sda5       145243728   205246439    30001356   83  Linux
/dev/sda6       205246503   312581807    53667652+  83  Linux

phobos ~ # cat /boot/grub/grub.conf 
default 0
timeout 30
splashimage=(hd0,1)/boot/grub/splash.xpm.gz

title Windows Vista Professional
rootnoverify (hd0,0)
makeactive
chainloader +1

title Gentoo Linux (current)
root (hd0,1)
kernel /boot/vmlinuz root=/dev/sda5

title Gentoo Linux (previous)
root (hd0,1)
kernel /boot/vmlinuz.old root=/dev/sda5
Comment 45 Richard Yao (RETIRED) gentoo-dev 2012-01-30 01:05:41 UTC
Created attachment 300329 [details, diff]
Ubuntu patch for this problem from GRUB HEAD r803

It appears that GRUB HEAD has a fix for this from Ubuntu:

http://bzr.savannah.gnu.org/lh/grub/trunk/grub-legacy/changes/803?start_revid=803

I have attached the diff. I don't have time to test it, but if someone else here does, feel free to try it out.
Comment 46 Ryan Hill (RETIRED) gentoo-dev 2012-01-30 01:18:37 UTC
We already apply the same.
Comment 47 Ryan Hill (RETIRED) gentoo-dev 2012-02-08 04:09:15 UTC
Well other than being dual-boot your setup is almost identical to mine.  I played with a couple other systems I have here but I still can't reproduce it.  I really don't know enough about grub to figure out where the problem is.

Maybe we should just tell people to use grub-static with 4.6.
Comment 48 Ryan Hill (RETIRED) gentoo-dev 2012-02-08 04:12:13 UTC
Jason, can you tar up your /boot/grub and email it to me?  Maybe we can rule out hardware/disk layout.
Comment 49 Andreas K. Hüttel archtester gentoo-dev 2012-02-09 19:36:46 UTC
Have hit this too. Willing to test and debug, but only in ~10 days (one-week work trip and no time before it...)
Comment 50 Ryan Hill (RETIRED) gentoo-dev 2012-02-10 00:32:13 UTC
Jason's /boot/grub is identical to mine, byte for byte.  I went through a dump of his MBR with a hex editor and everything looks sane (though I'm no expert, at least the stage1 portion matched mine and the jmps pointed to the right addresses).
Comment 51 Andreas K. Hüttel archtester gentoo-dev 2012-02-18 19:33:24 UTC
(In reply to comment #50)
> Jason's /boot/grub is identical to mine, byte for byte.  I went through a dump
> of his MBR with a hex editor and everything looks sane (though I'm no expert,
> at least the stage1 portion matched mine and the jmps pointed to the right
> addresses).

OK... before I overwrite the bootloader on my box with grub-static, what should I save for debugging and/or what else could I try to help?
Comment 52 Ryan Hill (RETIRED) gentoo-dev 2012-02-20 04:33:04 UTC
I don't really know.  I've pretty much exhausted my limited abilities. :/

Could you try the svn build of 4.6 from the toolchain overlay?  It installs into a separate SLOT so it won't overwrite your existing install.  You'll have to select it with gcc-config.

Do you dual-boot?

The only other thing I can think of is to disable grub patches one by one to see if any of them are the cause.  I don't see any reports of this in other distros.
Comment 53 Richard Yao (RETIRED) gentoo-dev 2012-02-20 22:32:00 UTC
(In reply to comment #51)
> (In reply to comment #50)
> > Jason's /boot/grub is identical to mine, byte for byte.  I went through a dump
> > of his MBR with a hex editor and everything looks sane (though I'm no expert,
> > at least the stage1 portion matched mine and the jmps pointed to the right
> > addresses).
> 
> OK... before I overwrite the bootloader on my box with grub static, what should
> I save for debugging and/or what else could I try to help?

I suggest fiddling with this in QEMU. It should save you some trouble and produce an environment that others can examine.
Comment 54 Alexey Shvetsov archtester gentoo-dev 2012-02-21 09:41:10 UTC
Maybe it would be better to move to grub2? grub1 is unmaintained by grub upstream.
Comment 55 Richard Yao (RETIRED) gentoo-dev 2012-02-21 10:01:51 UTC
(In reply to comment #54)
> Maybe it would be better to move to grub2? grub1 is unmaintained by grub upstream.

I posted another suggestion in the GCC-4.6 discussion on the gentoo-dev mailing list that I will repost here:

> I took a look at the problem cited in your bug report. I suggest
> compiling sys-boot/grub with CFLAGS="-O0 -ggdb3", attaching gdb to
> grub-install and then watching what happens in the debugger. If you
> compare runs with a GCC 4.5.3 built stage2 and a GCC 4.6.2 built
> stage2, you should be able to find the bug.

That technique enabled me to solve a similar issue that I encountered when exploring the possibility of sys-boot/grub-illumos as part of my work on ZFS. That was in ZFS specific code, so my patch for that issue likely will not help us here.

I do not have time to look into this issue right now, but I will have time before the end of March. If Ryan does not fix it by then, I will look into it myself.
Comment 56 Richard Yao (RETIRED) gentoo-dev 2012-02-21 10:25:05 UTC
Ignore my previous comment. I just read through the bug and it seems that the issue involved is not in comment #1, but instead in comment #14.

My comment about looking into this still stands. I will be patching sys-boot/grub as part of my ZFS work and it will be hard to ignore this, but I do not have time to look into it this month.
Comment 57 Richard Yao (RETIRED) gentoo-dev 2012-02-21 11:02:02 UTC
I spoke to ajmitch in #ubuntu-devel, who told me that the published build log from their build bot suggested the use of GCC 4.6:

https://launchpad.net/ubuntu/precise/+source/grub/+builds

Given that Ubuntu does not seem to be affected by this issue, I suggest someone download their binary package, extract the appropriate stage1_5/stage2 and then dd it into place on an affected system:

dd if=/path/to/stage2 of=/dev/<boot-device> bs=512 seek=1

Be certain that you use the stage1_5/stage2 that is analogous to the one you use on your system. Here is the url to where Ubuntu has their binary packages:

http://packages.ubuntu.com/precise/grub

If this works, then the next step would be to compile the Ubuntu GRUB fork with our toolchain and repeat with the created binary to determine whether or not this is a toolchain bug.
Comment 58 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-03-08 13:03:06 UTC
Just to keep track of what I've done so far:
 * gcc-4.6.2[vanilla] also causes the infinite loop
 * it's sufficient to replace stage2 and e2fs_stage1_5: I installed copies of them (compiled on another box with gcc-4.5.x) to the MBR using grub compiled with gcc-4.6.2, and my laptop boots fine. I've checked that *both* files cause the described problem
Comment 59 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-03-08 14:27:57 UTC
OK, the good news is that =grub-0.97-r2 works like a charm. All that's left now is to bisect the bastard. If anyone else suffering from the infinite loop could confirm that ^^, we would be very happy.
You'll of course need something like:

        sed -i stage2/Makefile.am \
            -e 's:STAGE2_CFLAGS):& -fno-reorder-functions :' || die

to get rid of the original gcc-4.6 problem. Go hunt it down, I won't be angry if you beat me to it ;)
Comment 60 SpanKY gentoo-dev 2012-03-08 17:09:37 UTC
if all it takes for grub-0.97 to work is adding -fno-reorder-functions, then that's perfectly acceptable i think.  the codebase is dead, so attempting to rework the optimization/linking handling is a waste of time.
Comment 61 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-03-08 17:20:16 UTC
(In reply to comment #60)
> if all it takes for grub-0.97 to work is adding -fno-reorder-functions, then
> that's perfectly acceptable i think.  the codebase is dead, so attempting to
> rework the optimization/linking handling is a waste of time.
It only fixes one of two bugs. BTW it's already InCVS.

I've nailed the second one. It's caused by 820_all_grub-0.97-cvs-sync.patch
To be more specific:
+2008-03-28  Robert Millan  <rmh@aybabtu.com>
+
+       Surpass 1 TiB disk addressing limit.  Note: there are no plans to handle
+       the 2 TiB disk limit in GRUB Legacy, since that would need considerable
+       rework.  If you have >2TiB disks, use GRUB 2 instead.
+
+       * grub/asmstub.c (biosdisk): Add unsigned qualifier to `sector'.
+       * stage2/bios.c (biosdisk): Likewise.
+       * stage2/disk_io.c (rawread, devread, rawwrite, devwrite): Likewise.
+       * stage2/shared.h (rawread, devread, rawwrite, devwrite): Likewise.
+       * lib/device.c (get_drive_geometry): Replace BLKGETSIZE with
+       BLKGETSIZE64.

Reverting that by
  sed -e "s:unsigned int sector: int sector:g" -i 820_all_grub-0.97-cvs-sync.patch
gives me a bootable system. I'd very much appreciate it if somebody could confirm it.

@toolchain: any hints how should I proceed from here?
Comment 62 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2012-03-08 22:01:24 UTC
(In reply to comment #61)
> Reverting that by
>   sed -e "s:unsigned int sector: int sector:g" -i
> 820_all_grub-0.97-cvs-sync.patch
> gives me a bootable system. I'd very much appreciate it if somebody could
> confirm it.
> 
> @toolchain: any hints how should I proceed from here?
<hat type="grub1-maintainer">
Ugh, that's going to be problematic I think. What if the sector used is high enough to need that bit?

Why is that change actually needed?
</hat>
Comment 63 Richard Yao (RETIRED) gentoo-dev 2012-03-08 22:39:13 UTC
This sounds like a GCC 4.6 regression. Someone should bisect GCC.
Comment 64 Jason Lynch 2012-03-09 00:22:51 UTC
(In reply to comment #61)
> Reverting that by
>   sed -e "s:unsigned int sector: int sector:g" -i
> 820_all_grub-0.97-cvs-sync.patch
> gives me a bootable system. I'd very much appreciate it if somebody could
> confirm it.

I can confirm that this results in a working grub installation on my previously broken machine.
Comment 65 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2012-03-09 00:41:30 UTC
Can one of the testers please put a /boot partition ABOVE the 1TiB point, and try that with the unsigned -> signed change?
Comment 66 Doug Goldstein (RETIRED) gentoo-dev 2012-03-09 04:26:29 UTC
A valid option could be for us to document in the Handbook that if you are putting /boot above the 1TiB mark, you need to use grub2. You already have to use grub2 for anything above the 2TiB mark. We're also looking to start transitioning to grub2.
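
For readers following the size limits, here is a rough sketch of where the 1 TiB and 2 TiB marks come from. It assumes 512-byte sectors (SECTOR_SIZE is 0x200 in the GRUB Legacy source) and a 32-bit sector number. The macro and function names are made up for illustration and do not appear in GRUB.

```c
#include <stdint.h>

/* Sector size used by GRUB Legacy (0x200 = 512 bytes). */
#define SECTOR_SIZE 0x200ULL

/* Number of distinct sector numbers a 32-bit counter can hold.
   Hypothetical names, for illustration only. */
#define SIGNED_SECTOR_COUNT   ((uint64_t) INT32_MAX + 1)   /* 2^31: "int sector" */
#define UNSIGNED_SECTOR_COUNT ((uint64_t) UINT32_MAX + 1)  /* 2^32: "unsigned int sector" */

/* Bytes reachable with the given number of addressable sectors. */
uint64_t addressable_bytes(uint64_t sector_count)
{
    return sector_count * SECTOR_SIZE;
}

/* 2^31 * 512 = 2^40 bytes (1 TiB); 2^32 * 512 = 2^41 bytes (2 TiB). */
_Static_assert(SIGNED_SECTOR_COUNT * SECTOR_SIZE == (1ULL << 40), "signed limit is 1 TiB");
_Static_assert(UNSIGNED_SECTOR_COUNT * SECTOR_SIZE == (1ULL << 41), "unsigned limit is 2 TiB");
```

So switching sector from int to unsigned int buys exactly one extra bit, which is the "bit" in question for anything placed between the 1 TiB and 2 TiB marks.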
Comment 67 Ryan Hill (RETIRED) gentoo-dev 2012-03-10 01:02:53 UTC
(In reply to comment #63)
> This sounds like a GCC 4.6 regression. Someone should bisect GCC.

If that were the case then wouldn't Ubuntu be seeing this too?  

Kacper, did you happen to test all the patches in that patchset or did you stop when you found one that worked?  I wonder if something else is interacting badly.
Comment 68 Ryan Hill (RETIRED) gentoo-dev 2012-03-10 01:28:55 UTC
(In reply to comment #67)
> (In reply to comment #63)
> > This sounds like a GCC 4.6 regression. Someone should bisect GCC.
> 
> If that were the case then wouldn't Ubuntu be seeing this too?  

Never mind.  I just looked at their "patchset" and the cvs update patch isn't included.

http://patches.ubuntu.com/g/grub/grub_0.97-29ubuntu65.patch
Comment 69 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-03-10 09:10:44 UTC
(In reply to comment #67)
> (In reply to comment #63)
> > This sounds like a GCC 4.6 regression. Someone should bisect GCC.
> 
> If that were the case then wouldn't Ubuntu be seeing this too?  
> 
> Kacper, did you happen to test all the patches in that patchset or did you
> stop when you found one that worked?  I wonder if something else is
> interacting badly.
I moved forward from a working revbump of grub (-r6) until I found the one that hangs (-r8). Then I compared the patchsets and removed patches one by one. After locating the culprit I trimmed it as much as possible to find the specific changeset that was causing the infinite reboot loop.
BTW I've just checked that Fedora doesn't apply this bit[1] and still has "int sector", that's why we're the only ones seeing it ;)
I'm out of town for the weekend, so I won't be able to continue till Sunday evening. 

[1] http://pkgs.fedoraproject.org/gitweb/?p=grub.git
Comment 70 Robert Cabrera 2012-03-22 23:46:20 UTC
Any update? Progress? TIA
Comment 71 ncahill_alt 2012-04-06 19:02:18 UTC
Hi, did the "unsigned int sector" update in 820_all_grub-0.97-cvs-sync.patch ever work?  I assume it must have, and unfortunately I don't have a 64-bit machine handy to check on, but at least on 32bit the code does not make sense.

Specifically, in grub/asmstub.c, line 1066, we have:

   off_t offset = (off_t) sector * (off_t) SECTOR_SIZE;

I don't know what off_t is on 64bit, but at least on mine, it is long int, which means sector, having changed to unsigned int, can overflow.

Neil.
Comment 72 Richard Yao (RETIRED) gentoo-dev 2012-04-06 22:53:27 UTC
(In reply to comment #71)
> Hi, did the "unsigned int sector" update in 820_all_grub-0.97-cvs-sync.patch
> ever work?  I assume it must have, and unfortunately I don't have a 64-bit
> machine handy to check on, but at least on 32bit the code does not make
> sense.
> 
> Specifically, in grub/asmstub.c, line 1066, we have:
> 
>    off_t offset = (off_t) sector * (off_t) SECTOR_SIZE;
> 
> I don't know what off_t is on 64bit, but at least on mine, it is long int,
> which means sector, having changed to unsigned int, can overflow.
> 
> Neil.

Nice catch. To add to this, the size of off_t changes depending on your machine architecture. Here is a brief excerpt of the build on amd64:

x86_64-pc-linux-gnu-gcc -m32 -DHAVE_CONFIG_H -I. -I..  -DGRUB_UTIL=1 -DFSYS_EXT2FS=1 -DFSYS_FAT=1 -DFSYS_FFS=1 -DFSYS_ISO9660=1 -DFSYS_JFS=1 -DFSYS_MINIX=1 -DFSYS_REISERFS=1 -DFSYS_UFS2=1 -DFSYS_VSTAFS=1 -DFSYS_XFS=1 -DUSE_MD5_PASSWORDS=1 -DSUPPORT_HERCULES=1 -DSUPPORT_SERIAL=1  -I../stage2 -I../stage1 -I../lib -Wall -Wmissing-prototypes -Wunused -Wshadow -Wpointer-arith -falign-jumps=1 -falign-loops=1 -falign-functions=1 -Wundef -O2 -fno-strict-aliasing -g -MT asmstub.o -MD -MP -MF .deps/asmstub.Tpo -c -o asmstub.o asmstub.c

The build system is compiling asmstub.o with -m32, so this is definitely using a 32-bit size. Additionally, this is then passed to lseek, which expects a value of type loff_t, which is guaranteed to always be 64-bit, no matter what.

It looks like we should replace off_t with loff_t on line 1066. Unfortunately, I am not in a position to test this to see if this will have an impact on the issue.
Comment 73 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-04-07 08:22:13 UTC
(In reply to comment #72)
> It looks like we should replace off_t with loff_t on line 1066.
> Unfortunately, I am not in a position to test this to see if this will have
> an impact on the issue.

I did:
    sed -e "/off_t offset =/ s/off_t/loff_t/g" \
        -i grub/asmstub.c lib/device.c || die
sadly it's not enough. However, that was expected since that line comes from patch 826-* which I had previously applied.

I've also checked that grub compiled with gcc-4.6.3 and gcc-4.7.0 still exhibits the infinite loop without the hack from comment #61.
Comment 74 Ryan Hill (RETIRED) gentoo-dev 2012-04-08 02:58:36 UTC
I ran through the code a couple weeks ago and it seemed to me that there were several spots where changing sector to unsigned int could cause an overflow, and one place where sector was still being declared as signed int.  I couldn't confirm any of it however as I still can't reproduce this.  I have a system with three drives, two 1TB and one 256GB, and tried setting it up in several different combinations but had no luck.  I don't have anything larger than 1TB so I can't test as Robin asked in comment #65.

Adding -Wconversion or -Wstrict-overflow=# may highlight some places to start investigating.  I also wonder if anyone has tried -fno-strict-overflow?

Since this code is in upstream's repo, it might be worth it to open a bug in Savannah.  Being much more familiar with the code base than we are, they can probably narrow down the problem faster.
Comment 75 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2012-04-08 03:55:16 UTC
Somebody come up with a quick/easy way to run test cases, and I'll throw a 6TB volume at you for testing.
Comment 76 Ryan Hill (RETIRED) gentoo-dev 2012-05-03 02:32:18 UTC
Is anyone willing to work on this?  If not I'll add a die to the ebuild for amd64 telling the user to use grub-static until this is resolved.  We're seven months past my worst-case estimate for unmasking 4.6.
Comment 77 Tolga Dalman 2012-05-03 06:36:37 UTC
IMHO it is sensible to remove sys-boot/grub and replace it entirely by the static version.
Comment 78 SpanKY gentoo-dev 2012-05-03 15:46:36 UTC
(In reply to comment #77)

uhh, where do you think grub-static comes from ?  it isn't magically created from wishes.

considering we have an idea of where things are going wrong, it should be easy to create a patch from there ...
Comment 79 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-05-03 20:15:45 UTC
(In reply to comment #76)
> Is anyone willing to work on this?  If not I'll add a die to the ebuild for
> amd64 telling the user to use grub-static until this is resolved.  We're
> seven months past my worst-case estimate for unmasking 4.6.

Unfortunately, I'm able to reproduce it on my laptop, where I have only a 128GB HDD :/
Comment 80 SpanKY gentoo-dev 2012-05-04 05:26:57 UTC
(In reply to comment #71)

i've looked at the code in question.  i don't follow your logic.

at the top of grub/asmstub.c and lib/device.c we have:
#define _LARGEFILE_SOURCE   1
#define _FILE_OFFSET_BITS   64

that means the headers define off_t as off64_t which is the same as int64_t.  this can be verified by adding some simple code:
  char f[sizeof(off_t) - 8];
if building that fails, it means sizeof(off_t) is smaller than 8.  but it works fine on my system.

looking at the function, we have:
  off_t offset = (off_t) sector * (off_t) SECTOR_SIZE;
where sector is an unsigned int.  that means a 32bit unsigned value is cast up to a signed 64bit value.  this can't overflow.  SECTOR_SIZE is defined as 0x200.  that also means multiplying that 32bit value (now held in a 64bit type) by 0x200 will not result in overflow.

the call after that:
  if (lseek (fd, offset, SEEK_SET) != offset)
because of the defines at the top, this is turned into lseek64().  this can be verified with `objdump -dr`:

00000d2b <biosdisk>:
...
     d95:       e8 fc ff ff ff          call   d96 <biosdisk+0x6b>
                        d96: R_386_PC32 lseek64
...

so i don't see how changing off_t to loff_t in these two files would make any difference whatsoever.
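
The argument above can be checked at compile time. The following is a minimal sketch, assuming a glibc-style system where the two feature macros select the 64-bit file interface; sector_to_offset is a hypothetical helper that only mirrors the expression in biosdisk and is not a function from GRUB.

```c
/* As in grub/asmstub.c, these must come before any system header. */
#define _LARGEFILE_SOURCE   1
#define _FILE_OFFSET_BITS   64

#include <sys/types.h>

#define SECTOR_SIZE 0x200

/* Compilation stops here if off_t is not 64 bits wide. */
_Static_assert(sizeof(off_t) == 8, "off_t is expected to be 64-bit");

/* Hypothetical helper: the same widening multiply as in biosdisk().
   sector is converted to a 64-bit off_t before the multiplication,
   so the largest input (2^32 - 1) yields a product below 2^41. */
off_t sector_to_offset(unsigned int sector)
{
    return (off_t) sector * (off_t) SECTOR_SIZE;
}
```

With these defines in place, sector_to_offset(0xFFFFFFFF) evaluates to 0x1FFFFFFFE00, just under 2 TiB, with no wraparound.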
Comment 81 SpanKY gentoo-dev 2012-05-04 06:41:05 UTC
Created attachment 310751 [details]
test script

wrt testing, i don't think you need actual hardware here.  use sparse files, loop devices, device mapper, gptfdisk, and a little fudging.

check out the attached hack of a script.  note, you will need:
 - loop device and device mapper support in the kernel
 - sys-apps/gptfdisk
 - sys-fs/device-mapper
 - sys-fs/multipath-tools

then run this as root like so:
# ./test.sh auto

this will create all the images and try to run grub on it.  by default, it'll create a ~3TB disk, put a partition at the end, and try to do grub stuff on that.  it'll locate a loop device automatically, and use /dev/mapper/grubtest, so hopefully it shouldn't randomly eat your machine :P.

you can quickly switch between large and small first partitions:
# ./test.sh auto +200M
# ./test.sh auto +2T

for more info, read the source.  i'd note that while this seems to fail for me with large disks, it fails for me whether grub has been built with gcc-4.5 or gcc-4.6.  i'm inclined to think this size issue isn't really gcc related.
Comment 82 Robert Cabrera 2012-05-19 08:00:06 UTC
Any resolution to this problem yet? It's been 14 months since this bug was first reported, yet as of the last update here, it appears we are no closer to a solution.

How are the other distros that use grub-0.97 dealing with this? Or have they all made the switch to grub-2?

Is this bug the only thing keeping GCC-4.6 and/or 4.7 from the tree?
Comment 83 Markus 2012-05-19 12:53:13 UTC
I upgraded to sys-devel/gcc-4.6.3 and rebuilt system and world (most of it is stable amd64).
So I rebuilt sys-boot/grub-0.97-r10 as well and installed it into the MBR without errors. The system booted fine.
Comment 84 Ryan Hill (RETIRED) gentoo-dev 2012-05-21 06:03:54 UTC
The other distros don't apply the cvs update we do that causes the problem.  Yes this is the only thing keeping 4.6 unkeyworded.  I plan to go ahead with what I said in comment #76 tomorrow.
Comment 85 Pacho Ramos gentoo-dev 2012-05-21 07:49:07 UTC
(In reply to comment #84)
> The other distros don't apply the cvs update we do that causes the problem. 
> Yes this is the only thing keeping 4.6 unkeyworded.  I plan to go ahead with
> what I said in comment #76 tomorrow.

But why don't we drop the cvs update that causes the problem? If I remember correctly, people affected by that could go to grub2 instead...
Comment 86 SpanKY gentoo-dev 2012-05-21 15:01:17 UTC
(In reply to comment #85)

the cvs update brings in functionality too, not just bugs.  so we're dropping support for one thing to make something else work.  we could just as easily say "use grub2 if you want to use 1TB devices".
Comment 87 Andreas K. Hüttel archtester gentoo-dev 2012-05-21 15:42:35 UTC
(In reply to comment #86)
> (In reply to comment #85)
> 
> the cvs update brings in functionality too, not just bugs.  so we're
> dropping support for one thing to make something else work.  we could just
> as easily say "use grub2 if you want to use 1TB devices".

The question is, which condition is defined more easily (i.e. can be presented easier to the user as a decision guide). So, to the experts:

Without the CVS commit:
"Use grub2 if ..."

With the CVS commit:
"Use grub2 if ..."

???
Comment 88 Richard Yao (RETIRED) gentoo-dev 2012-05-21 17:34:48 UTC
(In reply to comment #86)
> (In reply to comment #85)
> 
> the cvs update brings in functionality too, not just bugs.  so we're
> dropping support for one thing to make something else work.  we could just
> as easily say "use grub2 if you want to use 1TB devices".

Does that issue not only manifest if the user wants to place /boot past a 1TB offset or forgo a /boot directory entirely on a >1TB drive? Do any users actually do that?
Comment 89 Ryan Hill (RETIRED) gentoo-dev 2012-05-21 19:23:23 UTC
Created attachment 312587 [details, diff]
905_all_grub-0.97-revert_1tb_limit_gcc46.patch

Unless I'm mistaken it affects anyone installing grub on any drive larger than 1TB.  We're talking total # of sectors here.

This reverts the offending patch.
Comment 90 Robert Cabrera 2012-05-24 09:21:14 UTC
Now it's been quite a long time since I tried recompiling and installing grub-0.97, however it affected me on my ~amd64 laptop and my drive is under 500GB. 

So I believe the issue goes beyond just having a drive over 1TB.

As mentioned previously, I never had an issue when I was using GCC < 4.6.x
Comment 91 Samuli Suominen (RETIRED) gentoo-dev 2012-05-24 09:46:03 UTC
(In reply to comment #90)
> Now it's been quite a long time since I tried recompiling and installing
> grub-0.97, however it affected me on my ~amd64 laptop and my drive is under
> 500GB. 
> 
> So I believe the issue goes beyond just having a drive over 1TB.
> 
> As mentioned previously, I never had an issue when I was using GCC < 4.6.x

I believe you are confused. The current ebuild in Portage is broken for everyone on ~amd64 and GCC 4.6.x. 
But if we revert this one patch, which hasn't been reverted yet, only users of >1TB drives are broken.
Please read the bug again entirely before commenting.
Comment 92 Pacho Ramos gentoo-dev 2012-05-24 10:04:08 UTC
Looking at Debian patchset 66, it seems that they were including that 1TB support patch:
http://patch-tracker.debian.org/patch/misc/view/grub/0.97-66/ChangeLog
http://packages.debian.org/changelogs/pool/main/g/grub/grub_0.97-66/changelog

But it looks like they don't have this gcc-4.6 issue; the problem is that I haven't found out why :(
Comment 93 Paolo Pedroni 2012-05-24 14:08:03 UTC
(In reply to comment #91)
> I believe you are confused. The current ebuild in Portage is broken for
> everyone on ~amd64 and GCC 4.6.x. 

Ehm, no. I have three amd64 machines with grub-0.97-r10 and gcc 4.6.2 or 4.6.3 and they all work correctly.
Comment 94 Samuli Suominen (RETIRED) gentoo-dev 2012-05-24 14:30:47 UTC
(In reply to comment #93)
> (In reply to comment #91)
> > I believe you are confused. The current ebuild in Portage is broken for
> > everyone on ~amd64 and GCC 4.6.x. 
> 
> Ehm, no. I have three amd64 machines with grub-0.97-r10 and gcc 4.6.2 or
> 4.6.3 and they all work correctly.

Just having it emerged with GCC 4.6.x doesn't do anything by itself; the fun begins after you run `grub-install` to the Master Boot Record.

Can we please not spam this bug :-/
Comment 95 Paolo Pedroni 2012-05-24 14:46:23 UTC
(In reply to comment #94)
> (In reply to comment #93)
> > (In reply to comment #91)
> > > I believe you are confused. The current ebuild in Portage is broken for
> > > everyone on ~amd64 and GCC 4.6.x. 
> > 
> > Ehm, no. I have three amd64 machines with grub-0.97-r10 and gcc 4.6.2 or
> > 4.6.3 and they all work correctly.
> 
> > Just having it emerged with GCC 4.6.x doesn't do anything by itself; the fun
> > begins after you run `grub-install` to the Master Boot Record.

While I'm not as smart as a developer, I still know what I'm doing with my systems, and if I tell you that I have three working machines it means that I've installed grub in the MBR of the three machines, and they still work. I'm writing this with one of those. Please don't dismiss power users like this.

> 
> Can we please not spam this bug :-/

If presenting further data points is spamming, I'll surely refrain from doing so in the future. Your (and Gentoo's) loss.
Comment 96 Richard Yao (RETIRED) gentoo-dev 2012-05-24 20:23:58 UTC
(In reply to comment #95)
> (In reply to comment #94)
> > (In reply to comment #93)
> > > (In reply to comment #91)
> > > > I believe you are confused. The current ebuild in Portage is broken for
> > > > everyone on ~amd64 and GCC 4.6.x. 
> > > 
> > > Ehm, no. I have three amd64 machines with grub-0.97-r10 and gcc 4.6.2 or
> > > 4.6.3 and they all work correctly.
> > 
> > > Just having it emerged with GCC 4.6.x doesn't do anything by itself; the fun
> > > begins after you run `grub-install` to the Master Boot Record.
> 
> While I'm not as smart as a developer, I still know what I'm doing with my
> systems, and if I tell you that I have three working machines it means that
> I've installed grub in the MBR of the three machines, and they still work.
> I'm writing this with one of those. Please don't dismiss power users like
> this.

What is the output of `uname -m` on your system?
Comment 97 Ryan Hill (RETIRED) gentoo-dev 2012-05-25 00:53:09 UTC
I know this is a very long and convoluted bug report so let's sum it up:

- the bug manifests as a continuous reboot cycle
- this bug affects /very/ few people
- we haven't determined what they have in common, other than they're all on amd64
- i have a pretty much identical system to one reporter's, yet i can't reproduce this on multiple drives and configurations
- Kacper found that running sed -e "s:unsigned int sector: int sector:g" -i 820_all_grub-0.97-cvs-sync.patch fixed the bug, which narrowed down the commit that triggered it to http://bzr.savannah.gnu.org/lh/grub/trunk/grub-legacy/revision/791
- reverting that commit might break things for people with >1TiB drives.  No one has actually tested this, but Mike wrote a script to do so.
- i posted the patch to revert the commit so people could do the testing, not because i want it added.  i forgot to mention that :P
- nobody has actually investigated why making sector unsigned causes things to go wonky.

People saying "but it works for me!" aren't helping.  It works for me too.  Maybe we could get together and start a club and wear funny hats or something.  Contact me -> _OFF LIST_ <-.  If you're not hitting this bug, don't comment.  If you are, please let us know so we can get some real testing done.
Comment 98 Ryan Hill (RETIRED) gentoo-dev 2012-05-25 01:31:41 UTC
Okay Staples finally got something bigger than 1TiB in stock.  Gimme a couple days to set this thing up.
Comment 99 Paolo Pedroni 2012-05-25 06:51:20 UTC
(In reply to comment #96)
> What is the output of `uname -m` on your system?

You think I don't know if I run amd64 or not?

# uname -m
x86_64
Comment 100 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-05-25 08:32:38 UTC
(In reply to comment #97)

> - Kacper found that running sed -e "s:unsigned int sector: int sector:g" -i
> 820_all_grub-0.97-cvs-sync.patch fixed the bug, which narrowed down the
> commit that triggered it to
> http://bzr.savannah.gnu.org/lh/grub/trunk/grub-legacy/revision/791

It gets better: using your patch, I've found out that only this:

--- stage2/bios.c
+++ stage2/bios.c
@@ -47,9 +47,10 @@
    return the error number. Otherwise, return 0.  */
 int
 biosdisk (int read, int drive, struct geometry *geometry,
-    unsigned int sector, int nsec, int segment)
+    unsigned int sector_, int nsec, int segment)
 {
   int err;
+  int sector = sector_;

   if (geometry->flags & BIOSDISK_FLAG_LBA_EXTENSION)
     {

is required to make grub work. That narrows it down to one function that doesn't like sector as unsigned int.
Comment 101 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-05-25 08:57:34 UTC
Nailed it down to one line:

--- stage2/bios.c
+++ stage2/bios.c
@@ -73,7 +73,7 @@
       /* FIXME: sizeof (DAP) must be 0x10. Should assert that the compiler
    can't add any padding.  */
       dap.length = sizeof (dap);
-      dap.block = sector;
+      dap.block = (int)sector;
       dap.blocks = nsec;
       dap.reserved = 0;
       /* This is undocumented part. The address is formated in

Reading the "FIXME", my guess is that starting from gcc-4.6 the assignment
(unsigned long long) = (unsigned int) adds some sort of padding :D

Now, somebody please fix it :)
Comment 102 Rafał Mużyło 2012-05-25 18:09:11 UTC
Actually, could this be a case of bad code ?
      struct disk_address_packet
      {
	unsigned char length;
	unsigned char reserved;
	unsigned short blocks;
	unsigned long buffer;
	unsigned long long block;
      } __attribute__ ((packed)) dap;

This indeed gives sizeof(dap)=0x10 on x86, but (mind you, I don't have a working amd64 machine yet, so I can't check myself) doesn't that give sizeof(dap)=0x14 on amd64?
Comment 103 Duncan Exon Smith 2012-05-25 18:15:49 UTC
(In reply to comment #102)
> Actually, could this be a case of bad code ?
>       struct disk_address_packet
>       {
> 	unsigned char length;
> 	unsigned char reserved;
> 	unsigned short blocks;
> 	unsigned long buffer;
> 	unsigned long long block;
>       } __attribute__ ((packed)) dap;
> 
> This indeed gives sizeof(dap)=0x10 on x86, but (mind you, I don't have a
> working amd64 machine yet, so I can't check myself) doesn't that give
> sizeof(dap)=0x14 on amd64?

Wrote a test program to confirm... yes, it's 0x14.

I.e., this prints out "20":

int main(int argc, char *argv[])
{ printf("size = %lu\n", sizeof(dap)); return 0; }
Comment 104 Richard Yao (RETIRED) gentoo-dev 2012-05-25 21:06:21 UTC
(In reply to comment #103)
> Wrote a test program to confirm... yes, it's 0x14.
> 
> I.e., this prints out "20":
> 
> int main(int argc, char *argv[])
> { printf("size = %lu\n", sizeof(dap)); return 0; }

The build system passes -m32, which changes it to 0x10 on amd64.

With that said, can anyone reproduce this in QEMU? If we can reproduce this in QEMU, it should be possible to attach gdb, load debugging symbols and get a backtrace.

Those that want to take the initiative to do this on their own can look at the following for information on how this is done:

http://www.cs.stonybrook.edu/~porter/courses/cse506/f11/lab1.html

Note that you will need to get a checkout of the JOS source code and read the build system. Also, DO NOT email Donald Porter with questions on how to do that. If you must email someone with questions about how debugging JOS' bootloader works, email me.
Comment 105 Ryan Hill (RETIRED) gentoo-dev 2012-05-26 02:15:27 UTC
Created attachment 313069 [details, diff]
906_all_grub-0.97-gcc46.patch

Give this a spin (apply the 905 patch first).
Comment 106 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-05-26 07:51:37 UTC
(In reply to comment #105)
> Created attachment 313069 [details, diff]
> 906_all_grub-0.97-gcc46.patch
> 
> Give this a spin (apply the 905 patch first).
Still exhibits infinite loop with it :/
Comment 107 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-05-26 09:09:09 UTC
Created attachment 313099 [details, diff]
fix inifinite boot with >=gcc-4.6

I've googled around a bit and found the grub4dos project[1]. They seem to do some magic in the relevant part of the code. Snatching their changes fixed the issue for me.

Changes look non-invasive enough, so hopefully we're close to fixing this...

[1] http://code.google.com/p/grub4dos-chenall/
Comment 108 Ryan Hill (RETIRED) gentoo-dev 2012-05-29 02:43:27 UTC
Nice work!
Comment 109 Rafał Mużyło 2012-05-29 07:45:11 UTC
@comment 107: does that patch mean the problem lay in grub code or is it still a gcc bug ?
Comment 110 Richard Yao (RETIRED) gentoo-dev 2012-05-29 07:52:39 UTC
(In reply to comment #109)
> @comment 107: does that patch mean the problem lay in grub code or is it
> still a gcc bug ?

That patch suggests that the issue involves buggy BIOS implementations. If that is the case, GCC should play no role in whether or not GRUB works, but for some reason, it does. That suggests to me that GCC is doing something wrong.

I am not sure if anyone wants to debug this further, but as food for thought, `dap = (struct disk_address_packet *)0x580;` suggests to me that the value of the dap option is changing. That might provide a clue as to what GCC could be doing differently internally, but again, I am not sure if anyone wants to debug this further.
Comment 111 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-05-29 09:06:00 UTC
(In reply to comment #109)
> @comment 107: does that patch mean the problem lay in grub code or is it
> still a gcc bug ?

Honestly I have no idea. But it looks like the assembler function that feeds on dap strongly relies upon proper alignment of struct members in memory. I haven't investigated how they changed and whether it's a "feature" or a bug in gcc. It'd be super cool if we could reduce the code outside of grub. Then a simple bisect would answer your question. But since asm is black magic to me, I cannot evaluate whether the grub code is right or wrong :/
I won't be pursuing it any further... Help yourself though ;)
Comment 112 Richard Yao (RETIRED) gentoo-dev 2012-05-29 09:53:44 UTC
I have committed Xarthisius' patch in -r11. It was reviewed by myself and jdhore and it was approved by Chainsaw in IRC.

I am marking this as IN_PROGRESS. I will let the toolchain team mark it as FIXED.
Comment 113 Pacho Ramos gentoo-dev 2012-05-29 10:10:37 UTC
Nice, thanks a lot :D, I guess gcc-4.6 stabilization is nearer now, no? ;)
Comment 114 Richard Yao (RETIRED) gentoo-dev 2012-05-29 10:15:53 UTC
(In reply to comment #113)
> Nice, thanks a lot :D, I guess gcc-4.6 stabilization is nearer now, no? ;)

Someone in the toolchain team needs to keyword GCC 4.6 on amd64 first.
Comment 115 Bartosz Brachaczek 2012-05-29 16:25:34 UTC
I  compared asm output from GCC 4.5.3 and 4.6.3 and I can tell you that nothing changed in alignment in the generated code. Actually the only difference in the relevant code is that GCC 4.6.3 output is slightly reordered in the assignment code (though still perfectly valid) and GCC 4.6.3 uses an SSE register to do the `dap.block = sector' assignment (also perfectly valid, at least when SSE is enabled). Casting `sector' to int, as well as using that patch from grub4dos, causes GCC 4.6.3 to emit asm code more similar to that from GCC 4.5.3, without SSE registers usage.

And I think GRUB cannot use any SSE code as it neither enters long mode on x86_64 capable hardware (AFAIK GRUB2 does that) nor enables SSE in protected mode (which needs to be done explicitly). So, if I'm right (and I'm not 100% sure about that SSE enablement), the proper solution would be to unconditionally append something like "-march=i686" to CFLAGS.

And if anyone wonders why GCC emitted SSE code at all, please remember that it runs on x86_64, where it knows for sure that SSE and SSE2 are available, without the need for any -march= or -msse flags. BTW, I saw also one SSE usage in GRUB XFS code generated by GCC 4.5.3.
Comment 116 Richard Yao (RETIRED) gentoo-dev 2012-05-29 20:30:13 UTC
(In reply to comment #115)
> I  compared asm output from GCC 4.5.3 and 4.6.3 and I can tell you that
> nothing changed in alignment in the generated code. Actually the only
> difference in the relevant code is that GCC 4.6.3 output is slightly
> reordered in the assignment code (though still perfectly valid) and GCC
> 4.6.3 uses an SSE register to do the `dap.block = sector' assignment (also
> perfectly valid, at least when SSE is enabled). Casting `sector' to int, as
> well as using that patch from grub4dos, causes GCC 4.6.3 to emit asm code
> more similar to that from GCC 4.5.3, without SSE registers usage.
> 
> And I think GRUB cannot use any SSE code as it neither enters long mode on
> x86_64 capable hardware (AFAIK GRUB2 does that) nor enables SSE in protected
> mode (which needs to be done explicitly). So, if I'm right (and I'm not 100%
> sure about that SSE enablement), the proper solution would be to
> unconditionally append something like "-march=i686" to CFLAGS.

Thank you for your analysis. You are right in blaming the SSE instructions, which likely are not legal in real mode. Since -m32 is passed, GCC should restrict itself to i386 code unless explicitly told that instruction set extensions are available, which makes this a regression in GCC.

> And if anyone wonders why GCC emitted SSE code at all, please remember that
> it runs on x86_64, where it knows for sure that SSE and SSE2 are available,
> without the need for any -march= or -msse flags. BTW, I saw also one SSE
> usage in GRUB XFS code generated by GCC 4.5.3.

Has anyone tried booting with XFS in a way that executes that code? Unless it is executed, people would not see it in practice.
Comment 117 Bartosz Brachaczek 2012-05-29 21:45:55 UTC
(In reply to comment #116)
> Thank you for your analysis. You are right in blaming the SSE instructions,
> which likely are not legal in real mode. Since -m32 is passed, GCC should
> restrict itself to i386 code unless explicitly told that instruction set
> extensions are available, which makes this a regression in GCC.

Yes, you're right, it's a GCC bug. Earlier I didn't check what man page has to say about -m32:
> Generate code for a 32-bit (...) environment.  The 32-bit
> environment sets int, long and pointer to 32 bits and _generates
> code that runs on any i386 system_.

So the patch in -r11 is irrelevant to this issue, it only hides it. To be absolutely sure that SSE is to blame here, I tried running grub in qemu with an SSE instruction inserted via inline asm, and it failed. Then I put SSE enablement code[1] just before that instruction and it started working.

[1] http://wiki.osdev.org/SSE#Adding_support
Comment 118 Richard Yao (RETIRED) gentoo-dev 2012-05-29 22:34:14 UTC
(In reply to comment #117)
> (In reply to comment #116)
> > Thank you for your analysis. You are right in blaming the SSE instructions,
> > which likely are not legal in real mode. Since -m32 is passed, GCC should
> > restrict itself to i386 code unless explicitly told that instruction set
> > extensions are available, which makes this a regression in GCC.
> 
> Yes, you're right, it's a GCC bug. Earlier I didn't check what man page has
> to say about -m32:
> > Generate code for a 32-bit (...) environment.  The 32-bit
> > environment sets int, long and pointer to 32 bits and _generates
> > code that runs on any i386 system_.
> 
> So the patch in -r11 is irrelevant to this issue, it only hides it. To be
> absolutely sure that SSE is to blame here, I tried running grub in qemu
> with an SSE instruction inserted via inline asm, and it failed. Then I put
> SSE enablement code[1] just before that instruction and it started working.
> 
> [1] http://wiki.osdev.org/SSE#Adding_support

Nice find. I suspect that this might also be what is happening in bug #408019.
Comment 119 Fred Krogh 2012-05-29 22:49:06 UTC
I am running an amd64~ system with gcc-4.6.3 and just installed grub-0.97-r11.  The install instructions seem to indicate that bad things may happen if I don't run grub-install.  Comments here don't seem to make clear that this is safe.  Could someone please clarify the situation.  Thanks.
Comment 120 Richard Yao (RETIRED) gentoo-dev 2012-05-29 23:15:56 UTC
(In reply to comment #119)
> I am running an amd64~ system with gcc-4.6.3 and just installed
> grub-0.97-r11.  The install instructions seem to indicate that bad things
> may happen if I don't run grub-install.  Comments here don't seem to make
> clear that this is safe.  Could someone please clarify the situation. 
> Thanks.

grub-install is needed to install the bootloader to your disk. In particular, it places stage1 in the MBR and stage1_5 right after the MBR. It is needed for the boot process to work.
Comment 121 Bartosz Brachaczek 2012-05-29 23:23:57 UTC
(In reply to comment #119)
> I am running an amd64~ system with gcc-4.6.3 and just installed
> grub-0.97-r11.  The install instructions seem to indicate that bad things
> may happen if I don't run grub-install.  Comments here don't seem to make
> clear that this is safe.  Could someone please clarify the situation. 
> Thanks.

Please make sure you emerged the fixed (with PATCHVER="1.13") -r11 ebuild, as the original -r11 ebuild lacked the patch; it was fixed 15 minutes later. Then you'll be safe.
Comment 122 Kacper Kowalik (Xarthisius) (RETIRED) gentoo-dev 2012-05-30 07:19:59 UTC
(In reply to comment #115)
> I  compared asm output from GCC 4.5.3 and 4.6.3 and I can tell you that
> nothing changed in alignment in the generated code. Actually the only
> difference in the relevant code is that GCC 4.6.3 output is slightly
> reordered in the assignment code (though still perfectly valid) and GCC
> 4.6.3 uses an SSE register to do the `dap.block = sector' assignment (also
> perfectly valid, at least when SSE is enabled). Casting `sector' to int, as
> well as using that patch from grub4dos, causes GCC 4.6.3 to emit asm code
> more similar to that from GCC 4.5.3, without SSE registers usage.
> 
> And I think GRUB cannot use any SSE code as it neither enters long mode on
> x86_64 capable hardware (AFAIK GRUB2 does that) nor enables SSE in protected
> mode (which needs to be done explicitly). So, if I'm right (and I'm not 100%
> sure about that SSE enablement), the proper solution would be to
> unconditionally append something like "-march=i686" to CFLAGS.

What about -mno-sse?

Could you also reduce that to a simple testcase outside of grub?
Comment 123 Bartosz Brachaczek 2012-05-30 10:10:39 UTC
(In reply to comment #122)
> What about -mno-sse?

It works. But I think telling the compiler "do not generate anything that i686 does not support" with -march=i686 is more natural than telling it "do not generate SSE code" with -mno-sse. Who knows if GCC won't emit MMX code?

> 
> Could you also reduce that to a simple testcase outside of grub?

It seems that this particular C code results in SSE usage on -O1 and -Os optimization levels. I reduced it to -ftree-ter. Here is the testcase:

$ echo 'void f(unsigned int u) { unsigned long long llu; llu = u; }' | \
  x86_64-pc-linux-gnu-gcc-4.6.3 -m32 -ftree-ter -x c -S -o - - | grep xmm

If `grep xmm' outputs anything, it means that SSE registers were used. For me it's the output:
        movd    8(%ebp), %xmm0
        movq    %xmm0, -8(%ebp)
Comment 124 SpanKY gentoo-dev 2012-05-30 21:37:21 UTC
we probably should add -march=i686 to the STAGE1 and STAGE2 CFLAGS variables
Comment 125 Richard Yao (RETIRED) gentoo-dev 2012-05-30 22:40:14 UTC
(In reply to comment #124)
> we probably should add -march=i686 to the STAGE1 and STAGE2 CFLAGS variables

It should be -march=i386. Otherwise, we could break backwards compatibility.
Comment 126 SpanKY gentoo-dev 2012-05-30 23:04:03 UTC
this makes it build with -march according to your CHOST:
http://sources.gentoo.org/gentoo/src/patchsets/grub/0.97/908_all_grub-0.97-no-sse.patch?rev=1.1

(In reply to comment #125)

i don't understand what you mean.  how exactly stage1 and stage2 and the kernel and the host grub utils are compiled are not interdependent.  there are specific hand off rules that they each follow, and the insns executed to implement that are irrelevant.

the only thing i can think you're referring to is that building with -march=i686 might generate code that doesn't work on i386/i486/i586 systems.  but i handled that case already.
Comment 127 Richard Yao (RETIRED) gentoo-dev 2012-05-30 23:42:31 UTC
(In reply to comment #126)
> this makes it build with -march according to your CHOST:
> http://sources.gentoo.org/gentoo/src/patchsets/grub/0.97/908_all_grub-0.97-
> no-sse.patch?rev=1.1
> 
> (In reply to comment #125)
> 
> i don't understand what you mean.  how exactly stage1 and stage2 and
> the kernel and the host grub utils are compiled are not interdependent. 
> there are specific hand off rules that they each follow, and the insns
> executed to implement that are irrelevant.
> 
> the only thing i can think you're referring to is that building with
> -march=i686 might generate code that doesn't work on i386/i486/i586 systems.
> but i handled that case already.

It might be a problem when cross compiling code for i[345]86 on an x86_64 system.

With that said, we really should patch GCC to assume -march=i386 when -m32 is passed unless -march= is explicitly passed to override it. Otherwise, we will encounter this problem in other packages. Specifically, I suspect that bug #408019 is another instance of this issue.

If my suspicion is correct, this issue started as early as GCC 4.4, but GCC's optimization passes did not improve enough to trigger it in GRUB until GCC 4.6. I still need to do disassembly to verify my suspicion, but the symptoms are there.
Comment 128 SpanKY gentoo-dev 2012-05-31 00:01:08 UTC
(In reply to comment #127)

if you're cross-compiling code, then you build a cross-compiler.  if you want to target an i386, then you don't use a target tuple like i686-pc-linux-gnu, you use i386-pc-linux-gnu.

patching gcc to assume -march=i386 when -m32 is used is wrong imo.  we specifically fixed gcc for ARCH=x86 to default to the -march of the tuple it's using.  since you're using an x86_64-pc-linux-gnu tuple, -march=i686 is the acceptable default.

if you want to utilize your existing x86_64-pc-linux-gnu to build with -m32 but target an i386, then you have to explicitly specify -march=i386.  these are the defaults we want.

i've pushed out grub-0.97-r12 with the -march patch.  closing out because this bug has already done way more than it should -- it's covered 3 independent issues at this point.
Comment 129 Richard Yao (RETIRED) gentoo-dev 2012-05-31 00:12:43 UTC
(In reply to comment #128)
> patching gcc to assume -march=i386 when -m32 is used is wrong imo.  we
> specifically fixed gcc for ARCH=x86 to default to the -march of the tuple
> it's using.  since you're using an x86_64-pc-linux-gnu tuple, -march=i686 is
> the acceptable default.

If our patch to use the host tuple affects -m32, then it needs to be changed. -m32 on amd64 is supposed to be -march=i386 according to the man page. See comment #117:

(In reply to comment #117)
> Yes, you're right, it's a GCC bug. Earlier I didn't check what man page has
> to say about -m32:
> > Generate code for a 32-bit (...) environment.  The 32-bit
> > environment sets int, long and pointer to 32 bits and _generates
> > code that runs on any i386 system_.
Comment 130 SpanKY gentoo-dev 2012-05-31 00:37:44 UTC
(In reply to comment #129)

we aren't patching anything.  the -m32 behavior you describe is from upstream gcc and has always been that way.  i don't think it's a bug in gcc, just poor wording in the manual.  feel free to file a bug report in upstream gcc bugzilla asking for it to be clarified.

for ARCH=x86, we pass --with-arch=${CTARGET%%-*} when configuring.
Comment 131 Robert Cabrera 2012-05-31 13:38:02 UTC
Congratulations!  It worked! I'm stoked! For the first time since I made the jump to GCC-4.6.x I'm up and running with a self-compiled version of grub installed and working as expected.

Great job! Thanks

Robert
Comment 132 Nikolaj Šujskij 2012-06-01 06:44:56 UTC
I can confirm that ~amd64 grub:0 works all right compiled with GCC 4.6. Yay!
Comment 133 nihil39 2012-10-11 11:49:18 UTC
So, that said, what is keeping gcc 4.6.x from being marked as stable?
Comment 134 Bartosz Brachaczek 2012-10-11 11:57:08 UTC
(In reply to comment #133)
> So, that said, what is keeping gcc 4.6.x from being marked as stable?

Certainly not this issue. Look at bug 418383. It can be accessed with 2 mouse clicks from this bugzilla page... This bug blocks "gcc-4.6", which blocks "gcc-4.6-stable".