I am the maintainer of the Perl module Net::LibNIDS. I have two mostly identical Gentoo virtual machines, one is x86 and one is amd64. My simple tests fail on the amd64 machine.
I have tracked this down to compiling libnids with -O2. Another post on Stack Overflow appears to concur with my results: http://stackoverflow.com/questions/13431038/libnids-64-bits-systems
If I use ebuild to unpack, compile with -O1 then install and qmerge everything is fine and my client tests pass. Using flag-o-matic to strip -O* flags and add -O1 in the ebuild is also successful (patch attached).
Conclusion: on amd64 the net-libs/libnids package should only be compiled with -O1 at maximum.
Steps to Reproduce:
1. Compile libnids-1.24-r3
2. Use any code that registers a callback, loads from a pcap and expects data to be passed
3. No callback occurs
Created attachment 372978 [details, diff]
All the gory details available in the Ubuntu bugtracker - this was reported by the author and maintainer of the libnids upstream package:
Removing -O2 is a rather sweeping change: it bundles many optimisation flags, one of which at a time is likely cause for miscompilation. Which is it? And is it really a compiler issue or is libnids doing things wrong? In the former case, this should be a bug against sys-devel/gcc, not net-libs/libnids. In the latter case, which specific bit of libnids code is miscompiled?
First, on a funny note: it seems that while on amd64 the test program from the launchpad bug produces random results on each run, on x86 is produces same result on each run, but still a wrong one.
Anyway, this seems to come from "Revenge of the Compiler" series.
It's most likely a bug in both gcc *and* libnids.
The 'bug' in gcc is ... not printing "dereferencing type-punned pointer will break strict-aliasing rules" warning.
Yes, '-fno-strict-aliasing' seems to make the test program print '0xffffc200' on x86 and amd64. On that note: I think the program should use 'unsigned short', not 'short', cause the result is a bit confusing.
The bug in libnids is obviously breaking the strict aliasing.
Simply wrapping the struct in in the test program in an union with 'short' seems to fix the result.
I'd say that for Gentoo, the gcc part of this problem might be more important, even it it seems to be 'only' a failure to print a warning - for quite obvious reasons.
On a (perhaps) important note, gcc 4.8.2 doesn't print that warning either.
(In reply to Rafał Mużyło from comment #5)
i'd report the bug to upstream gcc and be done with it.
if libnids is violating aliasing rules, then obviously it's broken and whether the compiler complains about it is immaterial. you can disable just aliasing optimizations by using the -fno-strict-aliasing. it's a much smaller hammer than trying to force -O1 or -O0 everywhere.
As a confirmation:
in that test program, following loop:
for (i = 0; i < sizeof(hdr); i++)
sum += *((unsigned char *)(&hdr) + i) << ((i%2)*8);
on a little-endian machine prints '0x1c200', which is the expected result.
My point about the warning and Gentoo is obviously that if there is no warning *here*, there might not be in other packages either. So, it's not quite 'be done with it' case.
...but let's see what gcc upstream will say about this.
This is actually pretty interesting. Though I spot that libnids is incorrectly using a 'char' for the IP protocol number. It should be using an unsigned char (or a uint8_t), since the IP protocol number is 0-255. Separate issue, though...
That said, patching out the use of -O2 is a band-aid issue. This is either a problem w/ libnids' code or a quirk in gcc. Looking more like a gcc quirk that needs to be resolved upstream.
Snort uses similar constructs in its code, and it compiles/runs fine w/ -O2. Anyone looked to see if there's something different they do that seems to dodge around this?
According to the libnids author there is no release planned at this time and no patch for this issue, which is most likely due to libnids violating the strict aliasing rules.
Lack of gcc warning aside, the ebuild could perhaps be updated to include -fno-strict-aliasing (per Comment 7) which fixes the issue and will ensure that libnids is built correctly.
(In reply to David from comment #12)
> According to the libnids author there is no release planned at this time and
> no patch for this issue, which is most likely due to libnids violating the
> strict aliasing rules.
> Lack of gcc warning aside, the ebuild could perhaps be updated to include
> -fno-strict-aliasing (per Comment 7) which fixes the issue and will ensure
> that libnids is built correctly.
If we go that route, we'll want to remember to go back and remove that flag filter once libnids or gcc is fixed. So this bug will need to remain open in this case.
I also tested the testcase on my SGI O2 (MIPS big-endian, Linux) under the o32 and n32 ABI's, as well as on a Gentoo/FreeBSD 64-bit VM. Only under my Linux/amd64 system, compiled w/ -O2, does re-running the program multiple times print a random result. Under the other two systems, -O2 produces a result that doesn't change, but the size sum does appear to be wrong. Changing to -O1 or using -fno-strict-aliasing changes the printed-out value to what appears to be correct for that arch.
Ditto on a Gentoo/FreeBSD 64-bit system, too, with gcc. clang-3.4 on that system produces the correct output regardless of -O1 or -O2.
It definitely suggests a bug in gcc somewhere. But the libnids code definitely needs a second look as well, if they're using those data types that way.
toolchain: Re-adding back as I think this is suggesting a bug in gcc somewhere, since clang on a G/FBSD system produces correct output regardless of -O1 or -O2. Obviously, the libnids code needs to fixed up, too, but that is a separate issue.
...well, while even Linus did complain about what gcc is doing w.r. to strict aliasing, technically they're right - I've already said, the bug is not printing the warning, not what they do with the code (if Linus didn't convinced them, it's unlikely any of us can, anyway).
clang simply doesn't takes strict aliasing optimizations that far.
(In reply to Joshua Kinard from comment #14)
if you're violating strict aliasing, then gcc is free to generate whatever random code it likes really. the fact that clang works isn't really relevant.
a gcc bug has been opened on the topic (awesome) which means there isn't anything left for the toolchain to cover.
I have set -fno-strict-aliasing for: