Bug 567014

Summary: app-shells/bash: =~ regex matching problem
Product: Gentoo Linux
Reporter: Ulrich Müller <ulm>
Component: [OLD] Core system
Assignee: Gentoo's Team for Core System packages <base-system>
Status: RESOLVED INVALID    
Severity: normal    
Priority: Normal    
Version: unspecified   
Hardware: All   
OS: Linux   
Whiteboard:
Package list:
Runtime testing required: ---

Description Ulrich Müller gentoo-dev 2015-11-28 08:20:23 UTC
$ [[ 12345 =~ ^([0-9]*|[a-z]*)(.*)$ ]]
$ echo "1:'${BASH_REMATCH[1]}' 2:'${BASH_REMATCH[2]}'"
1:'12345' 2:''

This is as expected. However, changing the order inside the first subexpression should not make any difference:

$ [[ 12345 =~ ^([a-z]*|[0-9]*)(.*)$ ]]
$ echo "1:'${BASH_REMATCH[1]}' 2:'${BASH_REMATCH[2]}'"
1:'' 2:'12345'

Expected result:
1:'12345' 2:''
Comment 1 SpanKY gentoo-dev 2016-11-27 01:33:29 UTC
bash's man page states:
An additional binary operator, =~, is available, with the same precedence as == and !=.  When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).

and when you look at the bash source, it's just a thin pass thru layer on top of the regex(3) API.  the BASH_REMATCH vars are directly mapped to regexec's regmatch_t results.

if you take that regex and run it through the regex(3) API, you get the same results.  sample code:
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    regex_t preg;
    regmatch_t *matches;
    const char *regex, *string;
    int ret;
    size_t i, nmatch;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <regex> [string]\n", argv[0]);
        return 1;
    }
    regex = argv[1];
    /* default to the string from the bug report */
    string = argc < 3 ? "12345" : argv[2];

    ret = regcomp(&preg, regex, REG_EXTENDED);
    printf("regcomp = %i\n", ret);
    if (ret != 0)
        return 1;

    /* slot 0 is the whole match, slots 1..re_nsub are the subexpressions */
    nmatch = preg.re_nsub + 1;
    matches = calloc(nmatch, sizeof(*matches));
    if (matches == NULL) {
        regfree(&preg);
        return 1;
    }
    ret = regexec(&preg, string, nmatch, matches, 0);
    printf("regexec = %i\n", ret);
    /* offsets are only meaningful when regexec reports a match */
    for (i = 0; ret == 0 && i < nmatch; ++i)
        printf("%zu: %i %i\n", i, (int)matches[i].rm_so, (int)matches[i].rm_eo);
    free(matches);
    regfree(&preg);
    return 0;
}

when you run your test case:
$ gcc -Wall test.c
$ ./a.out '^([0-9]*|[a-z]*)(.*)$'
regcomp = 0
regexec = 0
0: 0 5
1: 0 5
2: 5 5
$ ./a.out '^([a-z]*|[0-9]*)(.*)$'
regcomp = 0
regexec = 0
0: 0 5
1: 0 0
2: 0 5

both results are valid according to the POSIX extended regular expressions (ERE) spec:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04
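
One property that holds for either ordering can be checked from bash itself: whatever the split point, groups 1 and 2 together cover the whole input. A minimal sketch, assuming bash's [[ =~ ]]; check_split is a hypothetical helper name, not something from the report, and the sketch deliberately does not print where the split falls, since that depends on the regex library in use:

```shell
#!/usr/bin/env bash
# check_split is a hypothetical helper, not taken from the bug report.
# It matches $1 against the ERE in $2 and succeeds only if capture
# groups 1 and 2, glued together, reproduce the whole input.
check_split() {
    local input=$1 re=$2
    [[ $input =~ $re ]] || return 1
    [[ "${BASH_REMATCH[1]}${BASH_REMATCH[2]}" == "$input" ]]
}

for re in '^([0-9]*|[a-z]*)(.*)$' '^([a-z]*|[0-9]*)(.*)$'; do
    if check_split 12345 "$re"; then
        echo "consistent split: $re"
    fi
done
```

Both patterns should report a consistent split; they differ only in how much of the input lands in group 1.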

i think it just comes down to POSIX regexes not supporting greedy modifiers.  there's not really anything bash, the C library (e.g. glibc), or Gentoo can or should do here.  maybe if someone was suitably motivated, they could see about pushing extensions from PCRE/etc... into POSIX, in which case it'd all trickle back down.
Comment 2 nvinson234 2016-11-27 03:05:37 UTC
> i think it just comes down to POSIX regexes not supporting greedy modifiers.  

Not at all.  Regex modifiers are greedy by default and are greedy here as well.  The problem lies in that the order of the subexpressions in ([0-9]*|[a-z]*) does matter.

([a-z]*|[0-9]*) and ([0-9]*|[a-z]*) can both match the empty string, and each will only capture a non-empty string when its first alternative matches something non-empty.

If I wrote a third version:
    (@*|[0-9]*|[a-z]*)

then the only time this version would return a non-empty match is when the string starts with 1 or more '@'.  Any other case and the result is '' because @* is satisfied and the other branches don't have to be checked.

In fact, if you ignore the capture groups, the above expression can be simplified to .*
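
The claim that the pattern accepts the same inputs as .* can be spot-checked in bash. This is only a sketch under assumptions: it reads the claim as applying to the full anchored pattern, and same_verdict and the sample strings are hypothetical choices, not from the thread:

```shell
#!/usr/bin/env bash
# Hypothetical check: both patterns should accept every input, since each
# alternative may match the empty string and (.*) absorbs the rest.
three_way='^(@*|[0-9]*|[a-z]*)(.*)$'
catch_all='^.*$'

# Succeeds when both patterns agree on whether $1 matches.
same_verdict() {
    local s=$1 a=1 b=1
    [[ $s =~ $three_way ]] && a=0
    [[ $s =~ $catch_all ]] && b=0
    (( a == b ))
}

for s in '@@x' '12345' 'abc' '' 'A-Z!'; do
    same_verdict "$s" && echo "same verdict for '$s'"
done
```

A handful of samples is not a proof, but it illustrates that the alternation only affects where the capture boundary falls, not whether the string matches.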

One way to "fix" this regex is to write it as ^([a-z]+|[0-9]+)?(.*)$.
With this regex, the capture groups behave as expected and it should match everything the original one did.
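
A short bash sketch of that rewritten regex in use; split_prefix is a hypothetical helper name and the sample inputs other than 12345 are made up for illustration:

```shell
#!/usr/bin/env bash
# split_prefix is a hypothetical helper name; the regex is the one
# suggested in the comment above.
split_prefix() {
    local re='^([a-z]+|[0-9]+)?(.*)$'
    [[ $1 =~ $re ]] || return 1
    printf "1:'%s' 2:'%s'\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

split_prefix 12345    # 1:'12345' 2:''
split_prefix abc123   # 1:'abc' 2:'123'
split_prefix @home    # 1:'' 2:'@home'
```

Because each alternative now needs at least one character, an empty match of the first alternative can no longer win, and the trailing ? keeps the group optional for inputs such as @home.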
Comment 3 SpanKY gentoo-dev 2016-11-27 03:14:59 UTC
(In reply to nvinson234 from comment #2)

i meant that, w/out greedy modifiers, you can't write a more deterministic regex regardless of ordering in the subgroup