Bug 263723 - <=sys-apps/coreutils-7.1: sort -b -kPOS1,POS2 doesn't work as expected
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High normal
Assignee: Gentoo's Team for Core System packages
URL: http://lists.gnu.org/archive/html/bug...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-25 13:03 UTC by ferret
Modified: 2009-03-31 20:16 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
patch I used to debug key positioning in coreutils-7.1 sort (debug-sort-keys.patch,470 bytes, patch)
2009-03-25 13:04 UTC, ferret

Description ferret 2009-03-25 13:03:13 UTC
ferret@jupiter ~ $ sort -u -b -k1,1 <<<$'a b c\na  b c'
a b c
a  b c

I believe that here the blanks between 'a' and 'b' are part of field 2, not field 1, and so should not make the lines different.  The correct output should therefore be 'a b c\n'.
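
The expectation above can be sketched in Python. This is only a model of the documented behaviour, not coreutils code; `first_field_key` and `sort_unique` are hypothetical names invented for illustration, and blanks are assumed to be just space and tab.

```python
import re

def first_field_key(line):
    """Return field 1 with leading blanks skipped (the expected -b -k1,1 key)."""
    return re.match(r"[ \t]*([^ \t]*)", line).group(1)

def sort_unique(lines, key):
    """Stable sort by key, keeping the first line of each run of equal keys."""
    result = []
    for line in sorted(lines, key=key):
        if not result or key(result[-1]) != key(line):
            result.append(line)
    return result

# Both lines have the key 'a', so only one of them should survive.
print(sort_unique(["a b c", "a  b c"], first_field_key))  # prints ['a b c']
```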

Explaining why this is wrong takes some doing.  Here is all the relevant text I could find from info sort.

------- START OF TEXT FROM INFO SORT --------

`-b'
`--ignore-leading-blanks'
     Ignore leading blanks when finding sort keys in each line.  By
     default a blank is a space or a tab, but the `LC_CTYPE' locale can
     change this.

......

`-t SEPARATOR'
`--field-separator=SEPARATOR'
     Use character SEPARATOR as the field separator when finding the
     sort keys in each line.  By default, fields are separated by the
     empty string between a non-blank character and a blank character.
     By default a blank is a space or a tab, but the `LC_CTYPE' locale
     can change this.

     That is, given the input line ` foo bar', `sort' breaks it into
     fields ` foo' and ` bar'.  The field separator is not considered
     to be part of either the field preceding or the field following,
     so with `sort -t " "' the same input line has three fields: an
     empty field, `foo', and `bar'.  However, fields that extend to the
     end of the line, as `-k 2', or fields consisting of a range, as
     `-k 2,3', retain the field separators present between the
     endpoints of the range.

......

   A position in a sort field specified with `-k' may have any of the
option letters `Mbdfinr' appended to it, in which case the global
ordering options are not used for that particular field.  The `-b'
option may be independently attached to either or both of the start and
end positions of a field specification, and if it is inherited from the
global options it will be attached to both.  If input lines can contain
leading or adjacent blanks and `-t' is not used, then `-k' is typically
combined with `-b', `-g', `-M', or `-n'; otherwise the varying numbers
of leading blanks in fields can cause confusing results.

   If the start position in a sort field specifier falls after the end
of the line or after the end field, the field is empty.  If the `-b'
option was specified, the `.C' part of a field specification is counted
from the first nonblank character of the field.

......

[in the examples section]

     The inheritance works in this case because `-k 5b,5b' and `-k
     5b,5' are equivalent, as the location of a field-end lacking a `.C'
     character position is not affected by whether initial blanks are
     skipped.

------- END OF TEXT FROM INFO SORT --------
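
The field-splitting rules quoted above can be illustrated with a small Python sketch. It is a simplified reading of the info text rather than the real implementation: `split_fields` and `range_key` are hypothetical helper names, and blanks are assumed to be only space and tab (ignoring the `LC_CTYPE` caveat).

```python
import re

def split_fields(line, sep=None):
    """Split a line into fields the way the info text describes.

    Without a separator, a new field starts at each transition from a
    non-blank to a blank, so later fields keep their leading blanks.
    With a separator, the separator belongs to no field.
    """
    if sep is not None:
        return line.split(sep)
    # Optional blanks followed by non-blanks, or a final run of blanks.
    return re.findall(r"[ \t]*[^ \t]+|[ \t]+$", line)

def range_key(line, sep, first, last):
    """Key for -k first,last with -t sep: separators inside the range are kept."""
    return sep.join(line.split(sep)[first - 1:last])

print(split_fields(" foo bar"))       # prints [' foo', ' bar']
print(split_fields(" foo bar", " "))  # prints ['', 'foo', 'bar']
print(range_key("x:y:z", ":", 2, 3))  # prints y:z
```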

First we need to establish where exactly sort is choosing to split keys.  To do this I added a debug print to src/sort.c that puts parentheses around what it thinks the first key is (the patch is attached in the first comment, so you can try this yourself).  Here's what we get with a few different options:

ferret@jupiter ~/coreutils-7.1~/src $ ./sort -u -k1,1 <<<$' a b c'
debug: ( a) b c
 a b c

ferret@jupiter ~/coreutils-7.1~/src $ ./sort -u -k1b,1 <<<$' a b c'
debug:  (a) b c
 a b c

ferret@jupiter ~/coreutils-7.1~/src $ ./sort -b -u -k1,1 <<<$' a b c'
debug:  (a )b c
 a b c

ferret@jupiter ~/coreutils-7.1~/src $ ./sort -u -k1b,1b <<<$' a b c'
debug:  (a )b c
 a b c

The first two seem obviously correct given the documentation.  By default the first field includes its leading space.  With the b option added to POS1, that leading space has been excluded from the key.  In the third and fourth runs, the TRAILING space has been tacked onto the key.  Why?

As far as I can tell from the code, it is doing just what it does with POS1: it takes whatever position it would normally take, and then pushes forward past any blanks.  The thing is, this is a totally bizarre thing to do at the end of a key!  It's a limit, not a start point, so by pushing forward from there you are ADDING trailing blanks, not removing leading blanks.  While it makes the code seem orthogonal, the behaviour is unhelpful, confusing, and not as documented (see the last paragraph of the INFO section above).
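
The behaviour described above can be modelled in a few lines of Python. This is a deliberately simplified sketch, not the code from src/sort.c: it handles only whole-field positions with no `-t` and no `.C` offsets, and the helper names and the `buggy` flag are hypothetical.

```python
BLANKS = " \t"

def skip_blanks(line, pos):
    """Advance pos past any run of blanks."""
    while pos < len(line) and line[pos] in BLANKS:
        pos += 1
    return pos

def skip_fields(line, pos, count):
    """Advance pos past `count` fields (blanks, then non-blanks)."""
    for _ in range(count):
        pos = skip_blanks(line, pos)
        while pos < len(line) and line[pos] not in BLANKS:
            pos += 1
    return pos

def key_bounds(line, field, start_b=False, end_b=False, buggy=True):
    """Return (start, end) of the key for -k field,field."""
    start = skip_fields(line, 0, field - 1)
    if start_b:
        start = skip_blanks(line, start)  # drops leading blanks, as documented
    end = skip_fields(line, 0, field)
    if end_b and buggy:
        end = skip_blanks(line, end)      # reported behaviour: ADDS trailing blanks
    return start, end

def show(line, **opts):
    """Mark the first key with parentheses, like the debug patch output."""
    s, e = key_bounds(line, 1, **opts)
    return line[:s] + "(" + line[s:e] + ")" + line[e:]

print(show(" a b c"))                                         # prints ( a) b c
print(show(" a b c", start_b=True))                           # prints  (a) b c
print(show(" a b c", start_b=True, end_b=True))               # prints  (a )b c
print(show(" a b c", start_b=True, end_b=True, buggy=False))  # prints  (a) b c
```

In this model the two lines from the original report get the keys 'a ' and 'a  ' when `buggy` is true, which is why `-u` keeps both; with `buggy=False` both keys are 'a'.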

As an aside, and I hesitate to mention this in case it adds to the confusion, sort -b -k1,1.0 works just fine.  This is in spite of the documentation stating that a missing '.C' character specifier in POS2 shall act as if '.0' had been specified.
Comment 1 ferret 2009-03-25 13:04:43 UTC
Created attachment 186225
patch I used to debug key positioning in coreutils-7.1 sort
Comment 2 ferret 2009-03-25 13:52:25 UTC
I've checked some other sort implementations.

Sort on HPUX 10.20 and 11.11 behaves the same way as this one.

Sort on Solaris 9 and Solaris 10 behaves "correctly" (my definition of correctly, from the start of the bug report).

Sort from an old version of coreutils (5.2.1) works "correctly".
Comment 3 Pádraig Brady 2009-03-27 10:23:51 UTC
Good analysis of the bug!
The bug was in fact always present in coreutils I think
and fixed 1 week after coreutils-7.1 was released :(

That debug patch is very useful BTW :) 
I wonder would a --key-debug option be useful for sort
to output something like: ⌈ a⌋ b c
Comment 4 ferret 2009-03-30 11:54:26 UTC
Thanks for the comment.  I tried the latest coreutils (which was a thoroughly harrowing experience, I can tell you), and this bug is indeed fixed in that version!  So now we just have to wait for the next release and I can close this bug.
Comment 5 Pádraig Brady 2009-03-30 12:27:43 UTC
Note the next release of coreutils (7.2 due very soon now) will depend on a released version of automake (1.10b), which hopefully will ease the dependencies somewhat.
Comment 6 SpanKY gentoo-dev 2009-03-31 20:16:23 UTC
coreutils-7.2 added to the tree