Relative performance of comparable regular expressions

Tue Jan 13 11:14:49 EST 2009

On Tue, Jan 13, 2009 at 6:16 AM, Barak, Ron <Ron.Barak at lsi.com> wrote:
> Hi John,
>
> Thanks for the below - teaching me how to fish (instead of just giving
> me a fish :-)
> Now I could definitely get the answers for myself, and also be a bit more
> enlightened.
>
> As for your (2) remark below (on my question: Which would give better
> performance, matching with "^[a-zA-Z]{3}", or with "^\S{3}" ?):
> (2) I think you mean "^\s{3}" not "^\S{3}",
> I actually did mean to use \S, namely a character that is not
> whitespace.

(A) Please don't top-post, it makes replying to you more awkward and
makes it harder for readers to follow the conversation.

(B) But "^[a-zA-Z]{3}" and "^\S{3}" aren't even equivalent! \S matches
any non-whitespace character, so it allows *digits* and *punctuation*
too, whereas the former *only* matches letters.
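
To make the difference concrete, here is a minimal sketch (assuming
Python 3; the sample lines are made up for illustration and are not
taken from your logs) that runs both patterns over a few line starts:

```python
import re

# The two patterns from the question, used with match() so no "^" is needed.
letters = re.compile(r"[a-zA-Z]{3}")
non_space = re.compile(r"\S{3}")

# Hypothetical sample lines, invented for this illustration.
samples = ["Jan 13 11:14:49", "123 error", "--- marker", "  indented"]

# Map each sample to (letters matched?, non-whitespace matched?).
results = {
    s: (bool(letters.match(s)), bool(non_space.match(s)))
    for s in samples
}

for s, (is_letters, is_non_space) in results.items():
    print(f"{s!r:20} [a-zA-Z]{{3}}: {is_letters!s:5}  \\S{{3}}: {is_non_space}")
```

Only the first sample satisfies both patterns; the digit and
punctuation samples get past \S{3} but not [a-zA-Z]{3}. So pick the
pattern by what you need to accept, and only then worry about speed.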

Cheers,
Chris
-- 
Follow the path of the Iguana...
http://rebertia.com

>
> -----Original Message-----
> From: John Machin
> Sent: Tuesday, January 13, 2009 11:15
> To: python-list at python.org
> Subject: Re: Relative performance of comparable regular expressions
>
> On Jan 13, 7:24 pm, "Barak, Ron" <Ron.Ba... at lsi.com> wrote:
>> Hi,
>>
>> I have a question about relative performance of comparable regular
>> expressions.
>>
>> I have large log files that start with three-letter month names
>> (non-unicode).
>>
>> Which would give better performance, matching with  "^[a-zA-Z]{3}", or
>> with "^\S{3}" ?
>
> (1) If you want to match at the start of a line, use re.match()
> *without* the pointless "^". Don't use re.search with a pattern starting
> with "^" -- it won't be any faster, and it could be a lot worse;
> re.search doesn't know to stop if the first match attempt fails:
>
> command-prompt>\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*100" "rx.match(text)"
> 1000000 loops, best of 3: 1.15 usec per loop
>
> command-prompt>\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*100" "rx.search(text)"
> 100000 loops, best of 3: 4.47 usec per loop
>
> command-prompt>\python26\python -m timeit -s"import re;rx=re.compile('^AB');text='Z'*1000" "rx.search(text)"
> 10000 loops, best of 3: 34.1 usec per loop
>
> (2) I think you mean "^\s{3}" not "^\S{3}"
>
> (3) Now that you've seen how to do timings, over to you :-)
>
>> Also, which is better (if different at all): "\d\d" or "\d{2}" ?
>> Also, would matching "." be different (performance-wise) than matching the
>> actual character, e.g. matching ":" ?
>> And lastly, at the end of a line, is there any performance difference
>> between "(.+)$" and "(.+)" ?
>
> Cheers,
> John
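
For anyone who would rather run John's timings from a script than from
the command line, here is a rough sketch using the timeit module
(assuming Python 3; the helper name per_call_usec and the loop counts
are my own choices, not anything from the thread). It repeats the
match-versus-search comparison and adds the "\d\d" versus "\d{2}"
question. No results are shown because they depend on the Python
version and the machine.

```python
import re
import timeit

def per_call_usec(func, number=10000, repeat=3):
    """Best-of-`repeat` time per call of `func`, in microseconds.

    Hypothetical helper for this sketch; calling through a function
    adds a small constant overhead to every measurement alike.
    """
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e6

# John's comparison: an anchored pattern against text that cannot match.
anchored = re.compile(r"^AB")
text = "Z" * 1000

# Ron's follow-up: two spellings of "two digits".
two_literal = re.compile(r"\d\d")
two_counted = re.compile(r"\d{2}")
stamp = "11:14:49"

timings = {
    "match  '^AB'": per_call_usec(lambda: anchored.match(text)),
    "search '^AB'": per_call_usec(lambda: anchored.search(text)),
    r"match  '\d\d'": per_call_usec(lambda: two_literal.match(stamp)),
    r"match  '\d{2}'": per_call_usec(lambda: two_counted.match(stamp)),
}

for label, usec in timings.items():
    print(f"{label:16} {usec:8.3f} usec per call")
```

The same helper works for the "." versus ":" and "(.+)$" versus "(.+)"
questions: compile both patterns, confirm they match the same thing on
your real data, then time them on lines from your actual logs.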