Trouble with regular expressions

MRAB google at mrabarnett.plus.com
Sat Feb 7 20:38:44 EST 2009


John Machin wrote:
> On Feb 8, 10:15 am, MRAB <goo... at mrabarnett.plus.com> wrote:
>> John Machin wrote:
>>> On Feb 8, 1:37 am, MRAB <goo... at mrabarnett.plus.com> wrote:
>>>> LaundroMat wrote:
>>>>> Hi,
>>>>> I'm quite new to regular expressions, and I wonder if anyone here
>>>>> could help me out.
>>>>> I'm looking to split strings that ideally look like this: "Update: New
>>>>> item (Household)" into a group.
>>>>> This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
>>>>> ("Update", "New item", "(Household)")
>>>>> Some strings will look like this however: "Update: New item (item)
>>>>> (Household)". The expression above still does its job, as it returns
>>>>> ("Update", "New item (item)", "(Household)").
>>> Not quite true; it actually returns
>>>     ('Update:', ' New item (item) ', '(Household)')
>>> However ignoring the difference in whitespace, the OP's intention is
>>> clear. Yours returns
>>>     ('Update:', ' New item ', '(item) (Household)')
>> The OP said it works OK, which I took to mean that the OP was OK with
>> the extra whitespace, which can be easily stripped off. Close enough!
> 
> As I said, the whitespace difference [between what the OP said his
> regex did and what it actually does] is not the problem. The problem
> is that the OP's "works OK" included (item) in the 2nd group, whereas
> yours includes (item) in the 3rd group.
> 
Ugh, right again!

That just shows what happens when I try to post while debugging! :-)
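
For anyone following along, here is a minimal sketch that reproduces the group results discussed above. The first two patterns are the ones from this thread. The third, alt_rx, is not from the thread: it is only an assumption about what the OP wants (surrounding whitespace dropped, only the last parenthesised chunk split off, None when there is none), and it assumes that the last chunk contains no nested parentheses. The names samples, groups and alt_rx are made up for illustration.

```python
import re

samples = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
]

# The OP's original pattern: greedy middle group, mandatory last group.
op_rx = re.compile(r'^(Update:)?(.*)(\(.*\))$')

# The pattern suggested earlier in the thread: lazy middle group,
# optional last group.
lazy_rx = re.compile(r'^(Update:)?(.*?)(?:(\(.*\)))?$')

# Hypothetical alternative (an assumption, not from the thread): the last
# group may not contain parentheses itself, so only the final "(...)" is
# split off. No ^ is needed because it is only used with match().
alt_rx = re.compile(r'(Update:)?\s*(.*?)\s*(\([^()]*\))?$')


def groups(rx, text):
    """Return the tuple of groups, or None if the pattern does not match."""
    m = rx.match(text)
    return m.groups() if m else None


for text in samples:
    print(repr(text))
    for name, rx in (('op', op_rx), ('lazy', lazy_rx), ('alt', alt_rx)):
        print('  %-5s %r' % (name, groups(rx, text)))
```

On the two-parentheses string the lazy pattern puts "(item) (Household)" into the last group, which is the difference John pointed out; the OP's pattern does not match "Update: new item" at all.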

>>>>> It does not work however when there is no text in parentheses (eg
>>>>> "Update: new item"). How can I get the expression to return a tuple
>>>>> such as ("Update:", "new item", None)?
>>>> You need to make the last group optional and also make the middle group
>>>> lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
>>> Why do you perpetuate the redundant ^ anchor?
>> The OP didn't say whether search() or match() was being used. With the ^
>> it doesn't matter.
> 
> It *does* matter. re.search() is suboptimal; after failing at the
> first position, it's not smart enough to give up if the pattern has a
> front anchor.
> 
> [win32, 2.6.1]
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
> 1000000 loops, best of 3: 1.17 usec per loop
> 
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
> 1000000 loops, best of 3: 1.17 usec per loop
> 
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
> 100000 loops, best of 3: 4.37 usec per loop
> 
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
> 10000 loops, best of 3: 34.1 usec per loop
> 
> Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9
> 
On my PC the numbers for Python 2.6 are:

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.02 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.04 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100000 loops, best of 3: 3.69 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
10000 loops, best of 3: 28.6 usec per loop

I'm currently working on the re module and I've fixed that problem:

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.28 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.23 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
1000000 loops, best of 3: 1.21 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
1000000 loops, best of 3: 1.21 usec per loop

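Hmm. search() no longer slows down as the string gets longer, but match() is now a little slower than it was in 2.6 (about 1.25 usec against 1.03). Needs more tweaking...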
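
The command-line timings above can also be run from a script, which avoids the shell quoting and line-wrapping trouble. This is only a sketch: the helper name time_call and the loop count are arbitrary choices of mine, and the figures will differ by machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

# Same front-anchored pattern as in the timings above; it can never match
# a string made only of 'x' characters.
rx = re.compile('^frobozz')


def time_call(method_name, length, number=10000):
    """Return the average time in microseconds of rx.<method_name>(txt)."""
    txt = length * 'x'
    func = getattr(rx, method_name)  # rx.match or rx.search
    total = timeit.timeit(lambda: func(txt), number=number)
    return total / number * 1e6


results = {}
for method_name in ('match', 'search'):
    for length in (100, 1000):
        usec = time_call(method_name, length)
        results[(method_name, length)] = usec
        print('%-6s len=%-4d %.3f usec per call' % (method_name, length, usec))
```

If search() does not give up after failing at the first position, its time should grow with the string length while the time for match() stays flat.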


