Pages

Friday, October 28, 2011

The Lazy Quantifier Bug

Try the following following expression to find a pattern of: "a" (optional) followed by up to two words, followed by "b", followed by "c".

The input is "x1 x2 x3 a b c" and therefore the match is "x3 a b c", since it satisfied the condition of up to two words before the "b".

Match match = Regex.Match("x1 x2 x3 a b c", @"((a\s+)?(\w+\s+){0,2}b\s+c)");


However,  turn the condition of up to two words to be lazy, and you get the match "x2 x3 a b c", which does not qualify the expression at all (lazy or not).

Match match = Regex.Match("x1 x2 x3 a b c", @"((a\s+)?(\w+\s+){0,2}?b\s+c)");


So why do we get "x2 x3 a b c" as a result of this regex?


This appears to be a Regex bug.  There are 2 groups in the match, this non-greedy pattern is producing.  The first group is the entire string "a x1 x2 x3 b c".  The second group is "x3".  But examine the second group's "Captures" collection and you will see 3 captures, namely x1, x2, x3.  Therefore, (\w+\s+){0,2}? captures 3 instead of at most two words.  Hence, I believe it's a bug.

Given the pattern, "a\s(\w+\s+){0,2}?b\s+c", the following strings should produce the following results

"a a b c" should match "a b c" at index=2 (but it matches "a a b c" instead)
"a x1 b c" should match "a x1 b c"
"a x1 x2 b c"  should match "a x1 x2 b c"
"a x1 x2 x3 b c" should not match (but it does)
"a a x1 b c" should match "a x1 b c" at Index=2  (but it matches "a a x1 b c" instead)
"a a a x1 b c" should match "a x1 b c" at index=4 (but it matches "a a a x1 b c" instead)
"a a x1 x2 b c"  should match "a x1 x2 b c" at index=2 (but it matches "a a x1 x2 b c" instead)
"a a a x1 x2 b c" should match "a x1 x2 b c" at index=4 (but it matches "a a x1 x2 b c" at index=2 instead)

The other pattern ("a\s(\w+\s+){0,2}b\s+c") works properly, i.e.,

"a a b c" should match "a a b c"
"a x1 b c" should match "a x1 b c"
"a x1 x2 b c"  should match "a x1 x2 b c"
"a x1 x2 x3 b c" should not match
"a a x1 b c" should match "a a x1 b c"
"a a a x1 b c" should match "a a x1 b c" at index=2
"a a x1 x2 b c"  should match "a x1 x2 b c" at index=2
"a a a x1 x2 b c" should match "a x1 x2 b c" at index=4

At first, I thought {0,2}? made no sense, but given the strings like "a a x1 b c", there is certainly a use for a non-greedy {n,m}.

No comments:

Post a Comment