Friday, October 28, 2011

The Lazy Quantifier Bug

Try the following expression to find this pattern: an optional "a", followed by up to two words, followed by "b", followed by "c".

The input is "x1 x2 x3 a b c", and the match is "x3 a b c", since it satisfies the condition of at most two words before the "b".

Match match = Regex.Match("x1 x2 x3 a b c", @"((a\s+)?(\w+\s+){0,2}b\s+c)");


However, make the "up to two words" condition lazy, and you get the match "x2 x3 a b c", which does not satisfy the expression at all (lazy or not).

Match match = Regex.Match("x1 x2 x3 a b c", @"((a\s+)?(\w+\s+){0,2}?b\s+c)");
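
As a point of comparison, here is a small sketch that runs both patterns through Python's re module. Python is an assumption here, picked only as a second backtracking engine; the helper name first_match is made up for the illustration, and the result says nothing about .NET's internals.

```python
import re

TEXT = "x1 x2 x3 a b c"
GREEDY = r"((a\s+)?(\w+\s+){0,2}b\s+c)"
LAZY = r"((a\s+)?(\w+\s+){0,2}?b\s+c)"


def first_match(pattern, text):
    """Return (index, matched text) for the first match, or None."""
    m = re.search(pattern, text)
    return (m.start(), m.group(0)) if m else None


greedy_result = first_match(GREEDY, TEXT)
lazy_result = first_match(LAZY, TEXT)

print(greedy_result)
print(lazy_result)
```

In Python both calls report "x3 a b c" at index 6, so the lazy quantifier does not change the match for this input.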


So why do we get "x2 x3 a b c" as a result of this regex?


This appears to be a Regex bug. Matching "a x1 x2 x3 b c" against the simplified non-greedy pattern a\s(\w+\s+){0,2}?b\s+c produces a match with two groups. The first group is the entire string "a x1 x2 x3 b c". The second group is "x3". But examine the second group's Captures collection and you will see three captures: x1, x2, and x3. So (\w+\s+){0,2}? captured three words instead of at most two. Hence, I believe it's a bug.

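Given the pattern "a\s(\w+\s+){0,2}?b\s+c", the following strings should produce the following results:

"a a b c" should match "a b c" at index=2 (but it matches "a a b c" instead)
"a x1 b c" should match "a x1 b c"
"a x1 x2 b c" should match "a x1 x2 b c"
"a x1 x2 x3 b c" should not match (but it does)
"a a x1 b c" should match "a x1 b c" at index=2 (but it matches "a a x1 b c" instead)
"a a a x1 b c" should match "a x1 b c" at index=4 (but it matches "a a a x1 b c" instead)
"a a x1 x2 b c" should match "a x1 x2 b c" at index=2 (but it matches "a a x1 x2 b c" instead)
"a a a x1 x2 b c" should match "a x1 x2 b c" at index=4 (but it matches "a a x1 x2 b c" at index=2 instead)

One caveat about these expectations: a regex engine reports the leftmost match, and laziness only chooses among matches that start at the same position. Matching "a a b c" and "a a x1 b c" from index 0 is therefore normal behavior. The telltale cases are the ones where more than two words are consumed between "a" and "b", such as "a x1 x2 x3 b c" and "a a a x1 b c".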

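To see what another backtracking engine returns for the same inputs, the sketch below runs the lazy pattern through Python's re module. Python is an assumption made for comparison only, and the names INPUTS, describe, and results are made up for the illustration; the output describes Python's engine, not .NET's.

```python
import re

LAZY = re.compile(r"a\s(\w+\s+){0,2}?b\s+c")

INPUTS = [
    "a a b c",
    "a x1 b c",
    "a x1 x2 b c",
    "a x1 x2 x3 b c",
    "a a x1 b c",
    "a a a x1 b c",
    "a a x1 x2 b c",
    "a a a x1 x2 b c",
]


def describe(text):
    """Return (index, matched text) for the leftmost match, or None."""
    m = LAZY.search(text)
    return (m.start(), m.group(0)) if m else None


results = {text: describe(text) for text in INPUTS}

for text in INPUTS:
    print(repr(text), "->", results[text])
```

In Python, "a x1 x2 x3 b c" does not match, and no result has more than two words between the leading "a" and the "b".
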
The greedy pattern ("a\s(\w+\s+){0,2}b\s+c") works properly, i.e.,

"a a b c" should match "a a b c"
"a x1 b c" should match "a x1 b c"
"a x1 x2 b c" should match "a x1 x2 b c"
"a x1 x2 x3 b c" should not match
"a a x1 b c" should match "a a x1 b c"
"a a a x1 b c" should match "a a x1 b c" at index=2
"a a x1 x2 b c" should match "a x1 x2 b c" at index=2
"a a a x1 x2 b c" should match "a x1 x2 b c" at index=4

At first, I thought {0,2}? made no sense, but given strings like "a a x1 b c", there is certainly a use for a non-greedy {n,m}.
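
One situation where the lazy and greedy forms give different answers is when "b c" can be reached both with zero words and with two. The input "a b c b c" in this sketch is a hypothetical example, not one from the lists above, and Python's re module stands in for the engine.

```python
import re

# Hypothetical input: "b c" appears twice, so both quantifier forms can succeed.
TEXT = "a b c b c"

lazy = re.search(r"a\s(\w+\s+){0,2}?b\s+c", TEXT)
greedy = re.search(r"a\s(\w+\s+){0,2}b\s+c", TEXT)

# Both matches start at index 0; only the number of words consumed differs.
print(lazy.start(), repr(lazy.group(0)))
print(greedy.start(), repr(greedy.group(0)))
```

Here the lazy form stops at "a b c" while the greedy form runs on to "a b c b c", which is the kind of difference a non-greedy {n,m} is meant to express.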

Thursday, October 27, 2011

Python string formatting: % vs. .format


Python 2.6 introduced the str.format() method with a slightly different syntax from the existing % operator. Which is better and for what situations?

The following script uses each method and produces the same output, so what is the difference?

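#!/usr/bin/python
sub1 = "python string!"
sub2 = "an arg"

# positional substitution
a = "i am a %s" % sub1
b = "i am a {0}".format(sub1)

# named substitution
c = "with %(kwarg)s!" % {'kwarg': sub2}
d = "with {kwarg}!".format(kwarg=sub2)

# print() with a single argument behaves the same on Python 2.6+ and 3
print(a)
print(b)
print(c)
print(d)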

To answer the question... .format just seems more sophisticated in many ways. You can do stuff like re-use arguments, which positional % arguments can't do (only the %(name)s mapping form can). Another annoying thing about % is that it takes either a single value or a tuple. You'd think the following would always work:

"hi there %s" % name


Yet if name happens to be (1, 2, 3), it raises a TypeError, because % treats the tuple as three separate arguments for a single %s. To guarantee that it always prints, you'd need to do

"hi there %s" % (name,)   # supply the single argument as a single-item tuple


which is just ugly. .format doesn't have those issues. Also in the second example you gave, the .format example is much cleaner looking.
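
The tuple pitfall and the argument re-use mentioned above can be sketched as follows. The variable names and the "spam" value are made up for the illustration; the snippet only uses syntax that should run on Python 2.6+ and Python 3.

```python
name = (1, 2, 3)

# % treats a tuple on the right-hand side as the full argument list,
# so a 3-tuple fed to a single %s raises TypeError.
try:
    "hi there %s" % name
    percent_error = None
except TypeError as exc:
    percent_error = str(exc)

# Wrapping the value in a one-item tuple makes % treat it as one argument.
wrapped = "hi there %s" % (name,)

# .format receives the tuple as one positional argument, no wrapping needed.
formatted = "hi there {0}".format(name)

# A positional argument can be referenced more than once.
reused = "{0} and {0} again".format("spam")

print(percent_error)
print(wrapped)
print(formatted)
print(reused)
```

Both wrapped and formatted come out as "hi there (1, 2, 3)", but only the % version needed the extra tuple.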

Why would you not use it?

  • not knowing about it (me before reading this)
  • having to be compatible with Python 2.5