505026 – net-libs/libnids CFLAGS="-O2 -fstrict-aliasing" on amd64 miscompiles libnids

Bug 505026 - net-libs/libnids CFLAGS="-O2 -fstrict-aliasing" on amd64 miscompiles libnids

Summary: net-libs/libnids CFLAGS="-O2 -fstrict-aliasing" on amd64 miscompiles libnids

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Library (show other bugs)
Hardware:	All Linux

Importance:	Normal normal (vote)
Assignee:	Gentoo Netmon project

URL:
Whiteboard:
Keywords:	UPSTREAM

Depends on:
Blocks:	915000 517420
	Show dependency tree

Reported:	2014-03-18 22:50 UTC by David
Modified:	2023-10-01 06:04 UTC (History)
CC List:	2 users (show)

See Also:	https://launchpad.net/bugs/1072650 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60581
Package list:
Runtime testing required:	---

Attachments
libnids-ebuild-amd64-optimisation.patch (libnids-ebuild-amd64-optimisation.patch,834 bytes, patch) 2014-03-18 22:51 UTC, David	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description David 2014-03-18 22:50:00 UTC

I am the maintainer of the Perl module Net::LibNIDS.  I have two mostly identical Gentoo virtual machines, one is x86 and one is amd64.  My simple tests fail on the amd64 machine.

I have tracked this down to compiling libnids with -O2.  Another post on Stack Overflow appears to concur with my results: http://stackoverflow.com/questions/13431038/libnids-64-bits-systems

If I use ebuild to unpack, compile with -O1 then install and qmerge everything is fine and my client tests pass.  Using flag-o-matic to strip -O* flags and add -O1 in the ebuild is also successful (patch attached).

Conclusion: on amd64 the net-libs/libnids package should only be compiled with -O1 at maximum.

Reproducible: Always

Steps to Reproduce:
1. Compile libnids-1.24-r3
2. Use any code that registers a callback, loads from a pcap and expects data to be passed
3. No callback occurs

Comment 1 David 2014-03-18 22:51:42 UTC

Created attachment 372978 [details, diff]
libnids-ebuild-amd64-optimisation.patch

Comment 2 David 2014-03-18 22:59:28 UTC

All the gory details available in the Ubuntu bugtracker - this was reported by the author and maintainer of the libnids upstream package:

https://bugs.launchpad.net/ubuntu/+source/gcc-defaults/+bug/1072650

Comment 3 Jeroen Roovers (RETIRED) gentoo-dev

2014-03-18 23:33:31 UTC

Removing -O2 is a rather sweeping change: it bundles many optimisation flags, one of which at a time is likely cause for miscompilation. Which is it? And is it really a compiler issue or is libnids doing things wrong? In the former case, this should be a bug against sys-devel/gcc, not net-libs/libnids. In the latter case, which specific bit of libnids code is miscompiled?

Comment 4 Rafał Mużyło 2014-03-19 04:04:01 UTC

First, on a funny note: it seems that while on amd64 the test program from the launchpad bug produces random results on each run, on x86 is produces same result on each run, but still a wrong one.

Anyway, this seems to come from "Revenge of the Compiler" series.

It's most likely a bug in both gcc *and* libnids.

The 'bug' in gcc is ... not printing "dereferencing type-punned pointer will break strict-aliasing rules" warning.

Yes, '-fno-strict-aliasing' seems to make the test program print '0xffffc200' on x86 and amd64. On that note: I think the program should use 'unsigned short', not 'short', cause the result is a bit confusing.

The bug in libnids is obviously breaking the strict aliasing.
Simply wrapping the struct in in the test program in an union with 'short[6]' seems to fix the result.

Comment 5 Rafał Mużyło 2014-03-19 05:31:10 UTC

I'd say that for Gentoo, the gcc part of this problem might be more important, even it it seems to be 'only' a failure to print a warning - for quite obvious reasons.

Comment 6 Rafał Mużyło 2014-03-19 06:03:56 UTC

On a (perhaps) important note, gcc 4.8.2 doesn't print that warning either.

Comment 7 SpanKY gentoo-dev

2014-03-19 06:48:29 UTC

(In reply to Rafał Mużyło from comment #5)

i'd report the bug to upstream gcc and be done with it.

if libnids is violating aliasing rules, then obviously it's broken and whether the compiler complains about it is immaterial.  you can disable just aliasing optimizations by using the -fno-strict-aliasing.  it's a much smaller hammer than trying to force -O1 or -O0 everywhere.

Comment 8 Rafał Mużyło 2014-03-19 07:11:40 UTC

As a confirmation:
in that test program, following loop:

for (i = 0; i < sizeof(hdr); i++)
  sum += *((unsigned char *)(&hdr) + i) << ((i%2)*8);

on a little-endian machine prints '0x1c200', which is the expected result.

Comment 9 Rafał Mużyło 2014-03-19 07:14:45 UTC

My point about the warning and Gentoo is obviously that if there is no warning *here*, there might not be in other packages either. So, it's not quite 'be done with it' case.

Comment 10 Rafał Mużyło 2014-03-19 09:09:10 UTC

...but let's see what gcc upstream will say about this.

Comment 11 Joshua Kinard gentoo-dev

2014-04-07 19:08:56 UTC

This is actually pretty interesting.  Though I spot that libnids is incorrectly using a 'char' for the IP protocol number.  It should be using an unsigned char (or a uint8_t), since the IP protocol number is 0-255.  Separate issue, though...

That said, patching out the use of -O2 is a band-aid issue.  This is either a problem w/ libnids' code or a quirk in gcc.  Looking more like a gcc quirk that needs to be resolved upstream.

Snort uses similar constructs in its code, and it compiles/runs fine w/ -O2.  Anyone looked to see if there's something different they do that seems to dodge around this?

Comment 12 David 2014-04-07 19:15:41 UTC

According to the libnids author there is no release planned at this time and no patch for this issue, which is most likely due to libnids violating the strict aliasing rules.  

Lack of gcc warning aside, the ebuild could perhaps be updated to include -fno-strict-aliasing (per Comment 7) which fixes the issue and will ensure that libnids is built correctly.

Comment 13 Joshua Kinard gentoo-dev

2014-04-07 20:14:39 UTC

(In reply to David from comment #12)
> According to the libnids author there is no release planned at this time and
> no patch for this issue, which is most likely due to libnids violating the
> strict aliasing rules.  
> 
> Lack of gcc warning aside, the ebuild could perhaps be updated to include
> -fno-strict-aliasing (per Comment 7) which fixes the issue and will ensure
> that libnids is built correctly.

If we go that route, we'll want to remember to go back and remove that flag filter once libnids or gcc is fixed.  So this bug will need to remain open in this case.

I also tested the testcase on my SGI O2 (MIPS big-endian, Linux) under the o32 and n32 ABI's, as well as on a Gentoo/FreeBSD 64-bit VM.  Only under my Linux/amd64 system, compiled w/ -O2, does re-running the program multiple times print a random result.  Under the other two systems, -O2 produces a result that doesn't change, but the size sum does appear to be wrong.  Changing to -O1 or using -fno-strict-aliasing changes the printed-out value to what appears to be correct for that arch.

Ditto on a Gentoo/FreeBSD 64-bit system, too, with gcc.  clang-3.4 on that system produces the correct output regardless of -O1 or -O2.

It definitely suggests a bug in gcc somewhere.  But the libnids code definitely needs a second look as well, if they're using those data types that way.

Comment 14 Joshua Kinard gentoo-dev

2014-04-07 20:16:39 UTC

toolchain: Re-adding back as I think this is suggesting a bug in gcc somewhere, since clang on a G/FBSD system produces correct output regardless of -O1 or -O2.  Obviously, the libnids code needs to fixed up, too, but that is a separate issue.

Comment 15 Rafał Mużyło 2014-04-07 20:28:29 UTC

...well, while even Linus did complain about what gcc is doing w.r. to strict aliasing, technically they're right - I've already said, the bug is not printing the warning, not what they do with the code (if Linus didn't convinced them, it's unlikely any of us can, anyway).

clang simply doesn't takes strict aliasing optimizations that far.

Comment 16 SpanKY gentoo-dev

2014-04-07 22:17:02 UTC

(In reply to Joshua Kinard from comment #14)

if you're violating strict aliasing, then gcc is free to generate whatever random code it likes really.  the fact that clang works isn't really relevant.

a gcc bug has been opened on the topic (awesome) which means there isn't anything left for the toolchain to cover.

Comment 17 Jeroen Roovers (RETIRED) gentoo-dev

2014-04-08 14:02:43 UTC

I have set -fno-strict-aliasing for:
=net-libs/libnids-1.18-r4
=net-libs/libnids-1.24-r5