
Tim Peters is back from his vacation:
While I don't want to turn Python into Perl, I would like to see it do a better job of what most people probably use the language for. Here is a very short list of things I think need attention:
1. [*A* clear way to do memory- and time-efficient textfile input]
I agree, but I'm unsure how to fix it. The best way to write this now is
    # f is some open file object.
    while 1:
        lines = f.readlines(BUFSIZE)
        if not lines:
            break
        for line in lines:
            process(line)
and it's not something anyone figures out on their own -- or enjoys typing or explaining afterwards.
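For anyone reading this on a present-day Python, the idiom can be wrapped once and reused. The following is only a sketch under stated assumptions: `chunked_lines` is a hypothetical helper name, the `BUFSIZE` value is arbitrary, and generators (which the loop above did not have available) do the bookkeeping.

```python
import io

BUFSIZE = 8192  # size hint in bytes; an arbitrary choice for this sketch


def chunked_lines(f, sizehint=BUFSIZE):
    """Yield lines from f, fetching roughly sizehint bytes' worth per readlines() call."""
    while True:
        lines = f.readlines(sizehint)
        if not lines:
            break
        for line in lines:
            yield line


# io.StringIO stands in for a real open file here.
sample = io.StringIO("alpha\nbeta\ngamma\n")
collected = [line.rstrip("\n") for line in chunked_lines(sample, sizehint=4)]
print(collected)
```

The caller then writes a plain `for line in chunked_lines(f):` loop, and the batching stays hidden in one place.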
Perl gets its line-at-a-time speed by peeking and poking C FILE structs directly in compiler- and platform-specific ways -- ways that vendors *should* have done in their own fgets implementations, but almost never do. I have no idea whether it works well with Perl's nascent notions of threading, but in the absence of that "the system" doesn't know Perl is cheating (i.e., as far as libc+friends are concerned, Perl *is* reading one line at a time -- even mixing in C-level ungetc calls works (well, sometimes <0.1 wink -- they don't always peek and poke enough fields>)).
The Python QIO extension module is much easier to port but less compatible (it doesn't use stdio, so QIO-opened files don't play well with others) and slower (although that's likely repairable -- he's got two passes over the buffer where one hairier pass should suffice).
we have something called SIO which uses memory mapping where possible, and just a more aggressive read-ahead for other cases. on a windows box, a traditional while/readline loop runs 3-5 times faster than before. with SRE instead of re, a while/readline/match loop runs up to 10 times faster than before. note that this is without *any* changes to the Python source code...
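To give a feel for the memory-mapping idea, here is a minimal sketch using the standard `mmap` module of a present-day Python. It is not SIO's implementation, which is not shown in this thread; `mmap_lines` is a hypothetical name, and the temporary file exists only to make the example self-contained.

```python
import mmap
import os
import tempfile


def mmap_lines(path):
    """Yield lines (as bytes) from path by reading through a memory map."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm.readline() returns b"" once the end of the map is reached.
            for line in iter(mm.readline, b""):
                yield line


# Write a small throwaway file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
    demo_path = tmp.name
try:
    demo_lines = list(mmap_lines(demo_path))
finally:
    os.remove(demo_path)
print(demo_lines)
```

A fallback to ordinary buffered reads would still be needed for things that cannot be mapped (pipes, sockets), which matches the "where possible" caveat above.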
2. The re module needs to be sped up, if not to catch up with Perl, then to catch up with the deprecated regex module.
The irony here is that the re engine is very often unboundedly faster than the regex engine -- provided you're chewing over large strings. Some tests /F ran showed that the length-independent *overhead* of invoking re is about 10x higher than for regex. Presumably the bulk of that is due to re.py, i.e. that you get to the re engine via going thru Python layers on your way in and out, while regex was pure C.
I've attached some old benchmarks. I think the current code base is a bit faster, but you get the idea.
In any case, /F is working on a new engine (for Unicode), and I believe he has this all well in hand.
with a little luck, the new module will replace both pcre and regex... not to mention that it's fairly easy to write your own front-end to the matching engine -- the expression parser and the compiler are both written in good old python.

</F>

$ python sre_bench.py
             0       5      50     250    1000    5000   25000
-----    -----   -----   -----   -----   -----   -----   -----

search for Python|Perl in Perl ->
sre8     0.007   0.008   0.010   0.010   0.020   0.073   0.349
sre16    0.007   0.007   0.008   0.010   0.020   0.075   0.353
re       0.097   0.097   0.101   0.103   0.118   0.175   0.480
regex    0.007   0.007   0.009   0.020   0.059   0.271   1.320

search for (Python|Perl) in Perl ->
sre8     0.007   0.007   0.007   0.010   0.020   0.074   0.344
sre16    0.007   0.007   0.008   0.010   0.020   0.074   0.347
re       0.110   0.104   0.111   0.115   0.125   0.184   0.559
regex    0.006   0.006   0.009   0.019   0.057   0.285   1.432

search for Python in Python ->
sre8     0.007   0.007   0.007   0.011   0.021   0.072   0.387
sre16    0.007   0.007   0.008   0.010   0.022   0.082   0.365
re       0.107   0.097   0.105   0.102   0.118   0.175   0.511
regex    0.009   0.008   0.010   0.018   0.036   0.139   0.708

search for .*Python in Python ->
sre8     0.008   0.007   0.008   0.011   0.021   0.079   0.379
sre16    0.008   0.008   0.008   0.011   0.022   0.075   0.402
re       0.102   0.108   0.119   0.183   0.400   1.545   7.284
regex    0.013   0.019   0.072   0.318   1.231   8.035  45.366

search for .*Python.* in Python ->
sre8     0.008   0.008   0.008   0.011   0.021   0.080   0.383
sre16    0.008   0.008   0.008   0.011   0.021   0.079   0.395
re       0.103   0.108   0.119   0.184   0.418   1.685   8.378
regex    0.013   0.020   0.073   0.326   1.264   9.961  46.511

search for .*(Python) in Python ->
sre8     0.007   0.008   0.008   0.011   0.021   0.077   0.378
sre16    0.007   0.008   0.008   0.011   0.021   0.077   0.444
re       0.108   0.107   0.134   0.240   0.637   2.765  13.395
regex    0.026   0.112   3.820  87.322 (skipped)

search for .*P.*y.*t.*h.*o.*n.* in Python ->
sre8     0.010   0.010   0.014   0.031   0.093   0.419   2.212
sre16    0.010   0.011   0.014   0.030   0.093   0.419   2.292
re       0.112   0.121   0.195   0.521   1.747   8.298  40.877
regex    0.026   0.048   0.248   1.148   4.550  24.720 ...
(searching for patterns in padded strings; sre8 is the sre engine compiled for 8-bit characters, sre16 is the same engine compiled for 16-bit characters)
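For readers who want to try a similar measurement, the following is a rough sketch of a padded-string benchmark using only the standard `re` and `timeit` modules of a present-day Python. It is not `sre_bench.py`: the padding character, where the padding goes, and the repeat count are all assumptions, and `padded` and `bench` are hypothetical names.

```python
import re
import timeit

SIZES = (0, 5, 50, 250, 1000, 5000, 25000)


def padded(target, size):
    """Return target preceded by size filler characters (padding scheme is assumed)."""
    return "-" * size + target


def bench(pattern, target, sizes=SIZES, repeat=100):
    """Return total seconds for `repeat` re.search calls at each padding size."""
    results = []
    for size in sizes:
        text = padded(target, size)
        # Sanity check: the pattern must actually match the padded string.
        assert re.search(pattern, text) is not None
        # Go through the module-level function so the Python-level call
        # overhead is part of what gets timed.
        results.append(timeit.timeit(lambda: re.search(pattern, text), number=repeat))
    return results


timings = bench(r".*Python", "Python", repeat=20)
for size, seconds in zip(SIZES, timings):
    print(f"{size:>6} {seconds:.6f}")
```

The size-0 column is the interesting one for the overhead discussion: with no padding to scan, nearly all of the time is the cost of getting into and out of the engine.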