Regular expression to match a #

John Machin sjmachin at lexicon.net
Fri Aug 12 01:07:04 CEST 2005


John Machin wrote:
> Devan L wrote:
> 
>> John Machin wrote:
>>
>>> Aahz wrote:
>>>
>>>> In article <42fb45d7$1 at news.eftel.com>,
>>>> John Machin  <sjmachin at lexicon.net> wrote:
>>>>
>>>>
>>>>> Search for r'^something' can never be better/faster than match for
>>>>> r'something', and with a dopey implementation of search [which
>>>>> Python's re is NOT] it could be much worse. So please don't tell
>>>>> newbies to search for r'^something'.
>>>>
>>>>
>>>>
>>>> You're somehow getting mixed up in thinking that "^" is some kind of
>>>> "not" operator -- it's the start of line anchor in this context.
>>>
>>>
>>> I can't imagine where you got that idea from.
>>>
>>> If I change "[which Python's re is NOT]" to "[Python's re's search() is
>>> not dopey]", does that help you?
>>>
>>> The point was made in a context where the OP appeared to be reading a
>>> line at a time and parsing it, and re.compile(r'something').match()
>>> would do the job; re.compile(r'^something').search() will do the job too
>>> -- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
>>> very inefficiently in the failing case with dopey implementations of
>>> search() (which apply match() at offsets 0, 1, 2, .....).
>>
>>
>>
>> I don't see much difference.
> 
> 
> and I didn't expect that you would -- like I wrote above: "Python's re's 
> search() is not dopey".

*ahem*

C:\junk>python
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
 >>> import timeit
 >>> t1 = timeit.Timer('re.search("^\w"," will not work")','import re')
 >>> t2 = timeit.Timer('re.match("\w"," will not work")','import re')
 >>> t3 = timeit.Timer('obj(" will not work")','import re;obj=re.compile("^\w").search')
 >>> t4 = timeit.Timer('obj(" will not work")','import re;obj=re.compile("\w").match')
 >>> t5 = timeit.Timer('obj(" will not work qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq")','import re;obj=re.compile("^\w").search')
 >>> t6 = timeit.Timer('obj(" will not work qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq")','import re;obj=re.compile("\w").match')
 >>> ["%.3f" % t.timeit() for t in t1, t2, t3, t4]
['5.510', '4.835', '1.588', '1.178']
 >>> ["%.3f" % t.timeit() for t in t1, t2, t3, t4]
['5.512', '4.808', '1.584', '1.170']

Observation: factoring out the compile step makes the difference much 
more apparent.

 >>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]
['1.578', '1.175', '2.283', '1.174']
 >>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]
['1.582', '1.179', '2.284', '1.172']
 >>>

Conclusion: when the anchored pattern fails, search() time grows with the
length of the searched string (t5 vs t3), while match() time does not
(t6 vs t4).
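
For anyone wanting to repeat the measurement on a current interpreter, here is a minimal Python 3 sketch of the same comparison. The helper name time_call, the repeat count, and the length of the run of q's are assumptions made for illustration; they are not from the session above. No output is shown because timings vary by machine and by Python version, and a newer re engine may not show the same gap.

```python
import re
import timeit

# Precompiled bound methods, as in t3..t6 of the session above.
anchored_search = re.compile(r"^\w").search
plain_match = re.compile(r"\w").match

# Both strings start with a space, so both patterns fail on them.
short_text = " will not work"
long_text = short_text + " " + "q" * 34  # padding length is an assumption


def time_call(func, text, number=100_000):
    """Return the total seconds taken by `number` calls of func(text)."""
    return timeit.timeit(lambda: func(text), number=number)


for label, func in (("search", anchored_search), ("match", plain_match)):
    t_short = time_call(func, short_text)
    t_long = time_call(func, long_text)
    print(f"{label:6s} short={t_short:.3f}s long={t_long:.3f}s")
```

If search() still scans the whole string before giving up, the "search" row should show a larger long/short difference than the "match" row.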

Meta-conclusion: Either I have to retract my 
based-on-hope-rather-than-on-experimentation assertion, or redefine "not 
dopey" to mean "surely nobody would search for ^x when match x would do, 
so it would be dopey to optimise re for that" :-)
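
For reference, the "dopey" search() described earlier in the thread (match() applied at offsets 0, 1, 2, ...) can be sketched roughly as below. The function dopey_search and its attempt counter are hypothetical and written only for illustration; this is not how Python's re module is implemented.

```python
import re


def dopey_search(pattern, text):
    """Hypothetical naive search: try pattern.match() at every offset.

    Returns (match_or_None, number_of_match_attempts).
    """
    attempts = 0
    for pos in range(len(text) + 1):
        attempts += 1
        # With a pos argument, '^' still only matches at the real start
        # of the string, so an anchored pattern fails at every pos > 0.
        m = pattern.match(text, pos)
        if m is not None:
            return m, attempts
    return None, attempts


anchored = re.compile(r"^\w")
text = " will not work"
result, attempts = dopey_search(anchored, text)
print(result, attempts)
```

In the failing case the anchored pattern is retried once per offset, so the work grows with the length of the string, whereas a single match() call gives up after the first attempt.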

So, back to the original point:

If re.match("something") does the job you want, don't use 
re.search("^something") instead.
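
A short sketch of the equivalence behind that advice, assuming line-at-a-time input and no re.MULTILINE flag. The sample strings and the names by_match and by_search are made up for illustration.

```python
import re

by_match = re.compile(r"something").match
by_search = re.compile(r"^something").search

# Hypothetical sample lines: some start with the text, some do not.
samples = ["something here", "not something", "", "something"]

# For each sample, both approaches agree on whether there is a match.
agree = [
    (by_match(s) is not None) == (by_search(s) is not None)
    for s in samples
]
print(agree)
```

Note that with re.MULTILINE, '^' also matches just after each newline, so search() with an anchored pattern is no longer equivalent to match() on multi-line strings.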


