$ [[ 12345 =~ ^([0-9]*|[a-z]*)(.*)$ ]]
$ echo "1:'${BASH_REMATCH[1]}' 2:'${BASH_REMATCH[2]}'"
1:'12345' 2:''

This is as expected. However, changing the order inside the first subexpression should not make any difference:

$ [[ 12345 =~ ^([a-z]*|[0-9]*)(.*)$ ]]
$ echo "1:'${BASH_REMATCH[1]}' 2:'${BASH_REMATCH[2]}'"
1:'' 2:'12345'

Expected result:

1:'12345' 2:''
bash's man page states:

  An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).

and when you look at the bash source, it's just a thin pass-through layer on top of the regex(3) API. the BASH_REMATCH vars are directly mapped to regexec's regmatch_t results.

if you take that regex and run it through the regex(3) API, you get the same results. sample code:

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    regex_t preg;
    regmatch_t *matches;
    int ret;
    size_t i, nmatch;
    const char *regex, *string;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <regex> [string]\n", argv[0]);
        return 1;
    }
    regex = argv[1];
    /* the string to match defaults to the one from the report */
    string = argc < 3 ? "12345" : argv[2];

    ret = regcomp(&preg, regex, REG_EXTENDED);
    printf("regcomp = %i\n", ret);
    if (ret)
        return 1;

    /* slot 0 is the whole match, slots 1..re_nsub are the subexpressions */
    nmatch = preg.re_nsub + 1;
    matches = calloc(nmatch, sizeof(*matches));
    if (!matches)
        return 1;

    ret = regexec(&preg, string, nmatch, matches, 0);
    printf("regexec = %i\n", ret);
    /* the offsets are only meaningful when regexec() succeeded */
    for (i = 0; ret == 0 && i < nmatch; ++i)
        printf("%zu: %i %i\n", i, (int)matches[i].rm_so, (int)matches[i].rm_eo);

    free(matches);
    regfree(&preg);
    return 0;
}

when you run your test case:

$ gcc -Wall test.c
$ ./a.out '^([0-9]*|[a-z]*)(.*)$'
regcomp = 0
regexec = 0
0: 0 5
1: 0 5
2: 5 5
$ ./a.out '^([a-z]*|[0-9]*)(.*)$'
regcomp = 0
regexec = 0
0: 0 5
1: 0 0
2: 0 5

both results are valid according to the POSIX extended regular expressions (ERE) spec:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04

i think it just comes down to POSIX regexes not supporting greedy modifiers. there's not really anything bash, the C library (e.g. glibc), or Gentoo can or should do here. maybe if someone was suitably motivated, they could see about pushing extensions from PCRE/etc... into POSIX, in which case it'd all trickle back down.
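For comparison, the same check can be done from the bash side. This is only a sketch: dump_rematch is a made-up helper name (not a bash builtin), and the 12345 default just mirrors the C program above. It prints every element of BASH_REMATCH, i.e. the substrings that the regmatch_t offsets select.

```bash
#!/usr/bin/env bash
# dump_rematch is a hypothetical helper, not part of bash itself.
# Usage: dump_rematch REGEX [STRING]   (STRING defaults to 12345)
dump_rematch() {
    local re=$1 str=${2-12345} i
    # The regex is kept in a variable and expanded unquoted so that
    # bash treats it as a regex rather than a literal string.
    if [[ $str =~ $re ]]; then
        # Index 0 is the whole match, 1..n are the subexpressions.
        for i in "${!BASH_REMATCH[@]}"; do
            printf "%d:'%s'\n" "$i" "${BASH_REMATCH[i]}"
        done
    else
        echo "no match" >&2
        return 1
    fi
}

dump_rematch '^([0-9]*|[a-z]*)(.*)$'
dump_rematch '^([a-z]*|[0-9]*)(.*)$'
```

On the reporter's system the second call would show 1:'' and 2:'12345', which lines up with the "1: 0 0" and "2: 0 5" offsets from the C program. Other regex(3) implementations may split the string differently.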
> i think it just comes down to POSIX regexes not supporting greedy modifiers.

Not at all. Regex modifiers are greedy by default and are greedy here as well. The problem is that the order of the alternatives in ([0-9]*|[a-z]*) does matter. ([a-z]*|[0-9]*) and ([0-9]*|[a-z]*) can both match the empty string, and each will only capture a non-empty string when its first alternative matches something non-empty.

If I wrote a third version, (@*|[0-9]*|[a-z]*), then the only time it would return a non-empty match is when the string starts with one or more '@'. In any other case the result is '', because @* is satisfied and the other branches don't have to be checked. In fact, if you ignore the capture groups, the above expression can be simplified to .*

One way to "fix" this regex is to write it as ^([a-z]+|[0-9]+)?(.*)$. With this regex, the capture groups behave as expected and it should match everything the original one did.
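A quick way to check the rewritten regex, as a sketch only: split_prefix is a hypothetical helper name, and the three inputs are just examples picked to hit the digit branch, the letter branch and the no-prefix case.

```bash
#!/usr/bin/env bash
# split_prefix is a hypothetical helper name, not part of bash.
# It applies the rewritten regex from the comment above and prints both
# capture groups in the same 1:'...' 2:'...' format used in this report.
split_prefix() {
    local re='^([a-z]+|[0-9]+)?(.*)$'
    if [[ $1 =~ $re ]]; then
        printf "1:'%s' 2:'%s'\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

split_prefix 12345    # all digits
split_prefix abc123   # letters followed by digits
split_prefix '@@@'    # neither letters nor digits at the start
```

These should print 1:'12345' 2:'', then 1:'abc' 2:'123', then 1:'' 2:'@@@'. The [a-z]+ and [0-9]+ alternatives can no longer match the empty string, so the first group is only empty when the string starts with neither a lowercase letter nor a digit.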
(In reply to nvinson234 from comment #2)
i meant that, without greedy modifiers, you can't write a more deterministic regex regardless of the ordering in the subgroup.