Bug 567014 - app-shells/bash: =~ regex matching problem
Status: RESOLVED INVALID
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo's Team for Core System packages
Reported: 2015-11-28 08:20 UTC by Ulrich Müller
Modified: 2016-11-27 03:14 UTC
Description Ulrich Müller gentoo-dev 2015-11-28 08:20:23 UTC
$ [[ 12345 =~ ^([0-9]*|[a-z]*)(.*)$ ]]
$ echo "1:'${BASH_REMATCH[1]}' 2:'${BASH_REMATCH[2]}'"
1:'12345' 2:''

This is as expected. However, changing the order inside the first subexpression should not make any difference:

$ [[ 12345 =~ ^([a-z]*|[0-9]*)(.*)$ ]]
$ echo "1:'${BASH_REMATCH[1]}' 2:'${BASH_REMATCH[2]}'"
1:'' 2:'12345'

Expected result:
1:'12345' 2:''
Comment 1 SpanKY gentoo-dev 2016-11-27 01:33:29 UTC
bash's man page states:
An additional binary operator, =~, is available, with the same precedence as == and !=.  When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).

and when you look at the bash source, it's just a thin pass-through layer on top of the regex(3) API.  the BASH_REMATCH elements are mapped directly from regexec's regmatch_t results.

if you take that regex and run it through the regex(3) API, you get the same results.  sample code:
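#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    regex_t preg;
    regmatch_t *matches;
    int ret;
    size_t i, nmatch;

    /* the regex is mandatory; reading argv[1] without it would be undefined */
    if (argc < 2) {
        fprintf(stderr, "usage: %s <regex> [string]\n", argv[0]);
        return 2;
    }
    const char *regex = argv[1];
    /* default to the string from the bug report */
    const char *string = argc < 3 ? "12345" : argv[2];

    ret = regcomp(&preg, regex, REG_EXTENDED);
    printf("regcomp = %i\n", ret);
    if (ret != 0)
        return 1;   /* preg is not usable if compilation failed */

    /* slot 0 is the whole match, slots 1..re_nsub are the subexpressions */
    nmatch = preg.re_nsub + 1;
    matches = calloc(nmatch, sizeof(*matches));
    if (matches == NULL) {
        regfree(&preg);
        return 1;
    }
    ret = regexec(&preg, string, nmatch, matches, 0);
    printf("regexec = %i\n", ret);
    /* offsets are only meaningful when regexec reported a match */
    if (ret == 0)
        for (i = 0; i < nmatch; ++i)
            printf("%zu: %i %i\n", i, (int)matches[i].rm_so, (int)matches[i].rm_eo);
    free(matches);
    regfree(&preg);
    return ret == 0 ? 0 : 1;
}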

when you run your test case:
$ gcc -Wall test.c
$ ./a.out '^([0-9]*|[a-z]*)(.*)$'
regcomp = 0
regexec = 0
0: 0 5
1: 0 5
2: 5 5
$ ./a.out '^([a-z]*|[0-9]*)(.*)$'
regcomp = 0
regexec = 0
0: 0 5
1: 0 0
2: 0 5
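
A rough bash counterpart to the C program, for checking the same patterns without leaving the shell. This is only a sketch: show_rematch is a hypothetical helper name, it assumes bash 3.2 or later, and bash exposes the matched text rather than the rm_so/rm_eo offsets. The result for the reordered pattern depends on the regex(3) implementation bash is linked against, so no output is claimed for it here.

```shell
#!/usr/bin/env bash
# Requires bash (3.2 or later) for [[ string =~ regex ]].

# show_rematch REGEX [STRING]
# Hypothetical helper: prints every BASH_REMATCH element, one per line.
# STRING defaults to 12345, the string from the bug report.
show_rematch() {
    local regex=$1 string=${2-12345} i
    # The pattern variable must stay unquoted; a quoted pattern is
    # matched as a literal string instead of as an ERE.
    if [[ $string =~ $regex ]]; then
        # Index 0 is the whole match, 1..n are the parenthesized groups.
        for i in "${!BASH_REMATCH[@]}"; do
            printf "%d: '%s'\n" "$i" "${BASH_REMATCH[i]}"
        done
    else
        echo "no match"
        return 1
    fi
}

# The two patterns from the report; compare the lines for group 1 and 2.
show_rematch '^([0-9]*|[a-z]*)(.*)$'
show_rematch '^([a-z]*|[0-9]*)(.*)$'
```

Keeping the pattern in a variable also avoids having to escape the parentheses and the | for the shell parser.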

both results are valid according to the POSIX extended regular expressions (ERE):
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04

i think it just comes down to POSIX regexes not supporting greedy modifiers.  there's not really anything bash, the C library (e.g. glibc), or Gentoo can or should do here.  maybe if someone were suitably motivated, they could see about pushing extensions from PCRE/etc. into POSIX, in which case it'd all trickle back down.
Comment 2 nvinson234 2016-11-27 03:05:37 UTC
> i think it just comes down to POSIX regexes not supporting greedy modifiers.  

Not at all.  Regex modifiers are greedy by default and are greedy here as well.  The problem is that the order of the alternatives in ([0-9]*|[a-z]*) does matter.

([a-z]*|[0-9]*) and ([0-9]*|[a-z]*) can both match the empty string, and each captures a non-empty string only when its first alternative matches a non-empty string.

If I wrote a third version:
    (@*|[0-9]*|[a-z]*)

then the only time this version would return a non-empty match is when the string starts with one or more '@'.  In any other case the result is '', because @* is satisfied by the empty string and the other branches don't have to be checked.

In fact, if you ignore the capture groups, the above expression can be simplified to .*

One way to "fix" this regex is to write it as ^([a-z]+|[0-9]+)?(.*)$.
With this regex, the capture groups behave as expected and it should match everything the original one did.
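
A quick way to try the rewritten regex from bash, as a sketch only: split_prefix is a hypothetical helper name, and the test strings other than 12345 are made up for illustration. Because each alternative now needs at least one character, neither branch can be satisfied by the empty string, so their order should no longer decide what group 1 captures.

```shell
#!/usr/bin/env bash
# Requires bash (3.2 or later) for [[ string =~ regex ]].

# split_prefix STRING
# Hypothetical helper wrapping the regex proposed in this comment.
# Prints group 1 (leading run of lowercase letters or of digits, possibly
# empty) and group 2 (the rest) in the format used in the bug description.
split_prefix() {
    local regex='^([a-z]+|[0-9]+)?(.*)$'
    # Pattern variable left unquoted so it is treated as an ERE.
    [[ $1 =~ $regex ]] || return 1
    printf "1:'%s' 2:'%s'\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

split_prefix 12345    # 1:'12345' 2:''
split_prefix abc123   # 1:'abc' 2:'123'
split_prefix @@       # 1:'' 2:'@@'
```

When the optional group does not take part in the match, bash leaves BASH_REMATCH[1] as an empty string, which is why the last call still prints a value for group 1.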
Comment 3 SpanKY gentoo-dev 2016-11-27 03:14:59 UTC
(In reply to nvinson234 from comment #2)

i meant that, w/out greedy modifiers, you can't write a more deterministic regex regardless of ordering in the subgroup