[Python-Dev] Better text processing support in py2k?

Tim Peters tim_one@email.msn.com
Fri, 31 Dec 1999 17:53:49 -0500


[Fredrik Lundh, whose very nice eMatter book is on sale until
  the end of the 20th century (as real people think of it),
  although the eMatter distribution scheme has lots of problems
  [just an editorial note from a bot who has to -- for unknown
   reasons Fatbrain "is working on" -- delete the Fatbrain
   registry tree and reregister the book almost every time he
   tries to open it <wink>
  ]
]

> we have something called SIO which uses memory mapping
> where possible, and just a more aggressive read-ahead for
> other cases.  on a windows box, a traditional while/readline
> loop runs 3-5 times faster than before.  with SRE instead of
> re, a while/readline/match loop runs up to 10 times faster
> than before.
>
> note that this is without *any* changes to the Python
> source code...

If so, there's potential for significantly more speed.  Python does its
line-at-a-time input with a character-at-a-time macro-in-a-loop, the same
way naive vendors (read "almost all vendors") implement fgets.  It's
replacing that inner loop with direct peeking into the FILE buffer that gets
Perl its dramatic speed -- even though Perl has fancier input functionality
(the oft-requested automagical "input record separator").  So it sounds like
the Perl trick is orthogonal to SIO's tricks; Perl isn't doing mmaps or
read-aheads or anything else fancy under the covers -- it only optimizes the
inner loop!
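
As a rough illustration of that difference (written in Python rather than C,
and not a copy of what Perl or Python actually do internally), the sketch
below contrasts a character-at-a-time line reader with one that searches its
own buffer for the newline.  The names readline_char_at_a_time,
BufferedLineReader and read_all_lines, and the 8192-byte default buffer size,
are made up for the example.

```python
import io


def readline_char_at_a_time(stream):
    """Naive style: fetch one byte per call until newline or EOF."""
    pieces = []
    while True:
        c = stream.read(1)
        if not c:          # EOF
            break
        pieces.append(c)
        if c == b"\n":     # end of line
            break
    return b"".join(pieces)


class BufferedLineReader:
    """Hypothetical reader that peeks into its own buffer for the newline."""

    def __init__(self, stream, bufsize=8192):
        self.stream = stream
        self.bufsize = bufsize
        self.buf = b""

    def readline(self):
        while True:
            # One search over the buffered bytes replaces the per-byte loop.
            i = self.buf.find(b"\n")
            if i >= 0:
                line, self.buf = self.buf[:i + 1], self.buf[i + 1:]
                return line
            chunk = self.stream.read(self.bufsize)
            if not chunk:  # EOF: hand back whatever is left (maybe empty)
                line, self.buf = self.buf, b""
                return line
            self.buf += chunk


def read_all_lines(readline):
    """Call readline() until it returns an empty result; collect the lines."""
    lines = []
    while True:
        line = readline()
        if not line:
            break
        lines.append(line)
    return lines
```

Both readers return the same lines; the second just does far fewer calls per
line, which is the only point being made here.  A C version would scan the
stdio buffer in place instead of slicing byte strings.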

> ...
> with a little luck, the new module will replace both pcre
> and regex...

If something more tangible than luck would help to make this come true, feel
free to mention it <wink>.

> not to mention that it's fairly easy to write your own front-
> end to the matching engine -- the expression parser and the
> compiler are both written in good old python.

Ah, good news / bad news.  Perl refugees aren't accustomed to "precompiling"
regexp objects, so they write code that causes regexps to get recompiled over
& over.  Even if you cache the compiled results under the covers, the overhead
of the Python call to the regexp compiler will likely take as long as the
engine takes to search.

Personally, in such cases, I think they should learn how to use the language
<0.5 wink>.
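
For anyone unsure what "precompiling" means here, a minimal sketch using
today's re module (not the SRE front end under discussion) follows.  The
function names, the sample pattern and the sample lines are invented for the
example.  The first function hands the pattern string to the module on every
iteration, so each pass pays for a Python-level call and lookup before any
matching starts; the second compiles once, outside the loop.

```python
import re

LINES = ["spam 1", "eggs", "spam 22"]  # made-up sample input


def count_recompiling(lines):
    """Perl-refugee style: pattern string passed in on every iteration."""
    n = 0
    for line in lines:
        # Each call must at least look the pattern up again before searching.
        if re.search(r"spam \d+", line):
            n += 1
    return n


def count_precompiled(lines):
    """Compile once, then reuse the compiled object's bound method."""
    search = re.compile(r"spam \d+").search
    n = 0
    for line in lines:
        if search(line):
            n += 1
    return n
```

Both functions give the same count; how much time the second one saves
depends on the input and the implementation, so measure before believing
anyone, including this note.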