How to make regexes faster? (Python v. OmniMark)

Fri Apr 19 09:36:38 EDT 2002

In article <3CBF9168.AAE0FC46 at engcorp.com>,
Peter Hansen  <peter at engcorp.com> wrote:
>"Frederick H. Bartlett" wrote:
>> 
>> I was recently introduced to OmniMark. One of our exercises was to take
>> a plain text file of Hamlet and convert it to SGML.
>> 
>> So I did it in Python, too. But the best time I could get from Python
>> was .57 sec, while OmniMark came in at .20 sec. What's the most
>> efficient technique for Pythonesque regex-based text processing?
>
>Hmmm... how fast do you need it to be?  Sounds to me like 0.57 seconds
>is pretty darned fast.
>
>Do you have specific goals, or are you just on a search for 
>something faster?  Remember, "better is the enemy of good"
>and the grass is always greener.
>
>(See http://www.seds.org/~chrisl/akin.html )
>
>-Peter

I was counting on Peter to write this.

Because it's correct, of course.  Moreover, you should
know that Peter has payroll responsibilities.  He writes
from a more practical perspective than other acquaintances
might afford you.

Suppose, for a moment, that we make a serious attempt to
investigate "the most efficient technique for Pythonesque
regex-based text processing".  While your question might
seem a reasonable one, in fact it leaves too many degrees
of freedom to allow for a precise response.  First, if I
truly needed speed in big SGML work from Python, I'd use
the freely available SGML-specific modules.
Next, I'd determine whether your test examples are indeed
regex-bound (it might well be I/O which constrains your
performance).  After that ... well, part of the charm of
regexes for some people is that they're so flexible that
different techniques are superior in different circumstances.
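
One way to check whether a job is regex-bound is to time the file
read and the substitution separately.  What follows is only a
minimal sketch in present-day Python: the SPEAKER pattern, the
<speaker> tag, and the function names are made up for illustration
and are not taken from the original exercise.

```python
import re
import time

# Hypothetical pattern: a speaker name alone on a line, e.g. "HAMLET."
# Compiled once, outside any loop, so compilation cost is not re-paid.
SPEAKER = re.compile(r"^([A-Z][A-Z ]+)\.$", re.MULTILINE)


def mark_speakers(text):
    """Wrap each speaker line in an illustrative <speaker> element."""
    return SPEAKER.sub(r"<speaker>\1</speaker>", text)


def time_phases(path, repeat=5):
    """Return (best_read_seconds, best_regex_seconds) over `repeat` runs."""
    read_times = []
    regex_times = []
    for _ in range(repeat):
        # Phase 1: I/O only.
        start = time.perf_counter()
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        read_times.append(time.perf_counter() - start)

        # Phase 2: regex work only, on text already in memory.
        start = time.perf_counter()
        mark_speakers(text)
        regex_times.append(time.perf_counter() - start)
    return min(read_times), min(regex_times)
```

Calling time_phases("hamlet.txt") (the file name is an assumption)
gives two numbers to compare.  If the read time dominates, tuning
the pattern will not buy much; if the regex time dominates, then
it is worth experimenting with the expressions themselves.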
-- 

Cameron Laird <Cameron at Lairds.com>
Business:  http://www.Phaseit.net
Personal:  http://starbase.neosoft.com/~claird/home.html