
It just occurred to me as I was replying to a request on the main list that Python's text handling capabilities could be a bit better than they are. This will probably not come as a revelation to many of you, but I finally put it together with the standard argument against beefing things up:

    One fix would be to add regular expressions to the language core and
    have special syntax for them, as Perl has done.  However, I don't like
    this solution because Python is a general-purpose language, and regular
    expressions are used for the single application domain of text
    processing.  For other application domains, regular expressions may be
    of no interest, and you might want to remove them to save memory and
    code size.

and the observation that Python does support some builtin objects and syntax that are fairly specific to some much more restricted application domains than text processing. I stole the above quote from Andrew Kuchling's Python Warts page, which I also happened to read earlier today.

What AMK says makes perfect sense until you examine some of the other things that are in the language, like the Ellipsis object and complex numbers. If I recall correctly, both were added as a result of the NumPy package development. I have nothing against ellipses or complex numbers. They are fine first-class objects that should remain in the language. But I have never used either one in my day-to-day work. On the other hand, I read files and manipulate them with regular expressions all the time. I rather suspect that more people use Python for some sort of text processing than for any other single application domain. Python should be good at it. While I don't want to turn Python into Perl, I would like to see it do a better job of what most people probably use the language for.

Here is a very short list of things I think need attention:

1. When using something like the simple file i/o idiom

       for line in f.readlines():
           dofunstuff(line)

   the programmer should not have to care how big the file is. It should just work in a reasonably efficient manner without gobbling up all of memory. I realize this may require some change to the syntax of the common idiom.

2. The re module needs to be sped up, if not to catch up with Perl, then to catch up with the deprecated regex module. Depending on how far people want to go with things, adding some language syntax to support regular expressions might be in order. I don't see that as compelling as adding complex numbers, however. Another possibility, now that Barry Warsaw has opened the floodgates, is to add regular expression methods to strings.

3. I've not yet used it, but I am told the pattern matching in Marc-Andre Lemburg's mxTextTools (http://starship.python.net/crew/lemburg/) is both powerful and efficient (though it certainly appears complex). Perhaps it deserves consideration for incorporation into the core Python distribution.

I'm sure other people will come up with other suggestions.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...

Skip Montanaro writes:
True, but note that you can compile Python with WITHOUT_COMPLEX defined to remove complex numbers.
What about 'for line in fileinput.input()', which already exists? (Hmmm... if you have an already open file object, I don't think you can pass it to fileinput.input(); maybe that should be fixed.)

On a vaguely related note, since there are many things like parser generators and XML stuff and mxTextTools, I've been speculating about a text processing topic guide. If you know of Python packages related to text processing, please send me a private e-mail with a link.

--
A.M. Kuchling                   http://starship.python.net/crew/amk/
Constraints often boost creativity.
    -- Jim Hugunin, 11 Feb 1999

Andrew> True, but note that you can compile Python with WITHOUT_COMPLEX
Andrew> defined to remove complex numbers.

That's true, but that wasn't my point. I'm not arguing for or against space efficiency, just that the rather timeworn argument about not doing anything special to support text processing because Python is a general-purpose language is a red herring.

>> 1. When using something like the simple file i/o idiom
>>        for line in f.readlines():
>>            dofunstuff(line)
>>    the programmer should not have to care how big the file is.

Andrew> What about 'for line in fileinput.input()', which already
Andrew> exists? (Hmmm... if you have an already open file object, I
Andrew> don't think you can pass it to fileinput.input(); maybe that
Andrew> should be fixed.)

Well, a couple reasons jump to mind:

1. fileinput.FileInput isn't particularly efficient. At its heart, its __getitem__ method makes a simple readline() call instead of buffering some amount of readlines(sizehint) bytes. This can be fixed, but I'm not sure what would happen to its semantics.

2. As you pointed out, it's not all that general.

My point, not at all well stated, is that the programmer shouldn't have to worry (much?) about the conditions under which he does file i/o. Right now, if I know the file is small(ish), I can do

    for line in f.readlines():
        dofunstuff(line)

but I have to know that the file won't be big, because readlines() will behave badly (perhaps even generate a MemoryError exception) if the file is large. In that case, I have to fall back to the safer (and slower)

    line = f.readline()
    while line:
        dofunstuff(line)
        line = f.readline()

or the more efficient, but more cumbersome

    lines = f.readlines(sizehint)
    while lines:
        for line in lines:
            dofunstuff(line)
        lines = f.readlines(sizehint)

That's three separate idioms the programmer has to be aware of when writing code to read a text file, based upon the perceived need for speed, memory usage and desired clarity:

    fast/memory-intensive/clear
    slow/memory-conserving/not-as-clear
    fast/memory-conserving/fairly-muddy

Any particular reason that the readline method can't return an iterator that supports __getitem__ and buffers input? (Again, remember this is for py2k, so the potential breakage such a change might cause is a consideration, but not a showstopper.)

Andrew> On a vaguely related note, since there are many things like
Andrew> parser generators and XML stuff and mxTextTools, I've been
Andrew> speculating about a text processing topic guide. If you know of
Andrew> Python packages related to text processing, please send me a
Andrew> private e-mail with a link.

This sounds like a good idea to me.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...
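
To make the question concrete, here is a minimal sketch (hypothetical code, not an existing module) of the kind of buffering __getitem__ wrapper being asked for; it lets the plain for-loop idiom work on arbitrarily large files by refilling a readlines(sizehint) buffer on demand:

    class BufferedLines:
        """Hypothetical helper: wraps an open file and hands out its lines
        one at a time via the __getitem__ iteration protocol, refilling an
        internal buffer with readlines(sizehint) as needed."""
        def __init__(self, file, sizehint=65536):
            self.file = file
            self.sizehint = sizehint
            self.buffer = []
            self.next = 0

        def __getitem__(self, index):
            # "index" is supplied by the for-loop machinery; only the
            # buffer position matters here.
            if self.next >= len(self.buffer):
                self.buffer = self.file.readlines(self.sizehint)
                self.next = 0
                if not self.buffer:
                    raise IndexError    # end of file ends the for loop
            line = self.buffer[self.next]
            self.next = self.next + 1
            return line

With that, the memory-conserving loop looks just like the simple idiom:

    f = open("huge_logfile.txt")
    for line in BufferedLines(f):
        dofunstuff(line)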

[Skip Montanaro, wants nicer text facilities]
Hmm. You're probably right, but I'm an exception.
Python should be good at it.
And I guess I'm an exception mostly *because* Perl is better at easy text crunching and Icon is better at hard text-crunching -- that is, I use the right tool for the job <wink>.
I agree, but unsure how to fix it. The best way to write this now is

    # f is some open file object.
    while 1:
        lines = f.readlines(BUFSIZE)
        if not lines:
            break
        for line in lines:
            process(line)

and it's not something anyone figures out on their own -- or enjoys typing or explaining afterwards.

Perl gets its line-at-a-time speed by peeking and poking C FILE structs directly in compiler- and platform-specific ways -- ways that vendors *should* have done in their own fgets implementations, but almost never do. I have no idea whether it works well with Perl's nascent notions of threading, but in the absence of that "the system" doesn't know Perl is cheating (i.e., as far as libc+friends are concerned, Perl *is* reading one line at a time -- even mixing in C-level ungetc calls works (well, sometimes <0.1 wink -- they don't always peek and poke enough fields>)).

The Python QIO extension module is much easier to port but less compatible (it doesn't use stdio, so QIO-opened files don't play well with others) and slower (although that's likely repairable -- he's got two passes over the buffer where one hairier pass should suffice).
2. The re module needs to be sped up, if not to catch up with Perl, then to catch up with the deprecated regex module.
The irony here is that the re engine is very often unboundedly faster than the regex engine -- provided you're chewing over large strings. Some tests /F ran showed that the length-independent *overhead* of invoking re is about 10x higher than for regex. Presumably the bulk of that is due to re.py, i.e. that you get to the re engine via going thru Python layers on your way in and out, while regex was pure C. In any case, /F is working on a new engine (for Unicode), and I believe he has this all well in hand.
It's not complex, it's complicated -- and *that's* what makes it un-Pythonic <wink>. Tony Ibbs has written a friendly wrapper around mxTextTools that suppresses much of the non-essential complication. OTOH, if you go into this with a regexp mindset, it will run much slower than a real regexp package, because the bulk of the latter is devoted to doing optimization; mxTextTools is WYSIWYG (it screams if you code to its strengths, but crawls if you e.g. try to implement naive backtracking).

You should go to the REBOL site and look at the description of REBOL's PARSE verb in the FAQ ... mumble, mumble ... at http://www.rebol.com/faq.html#11550948

Here's an example pulled from that page (this is a REBOL code fragment):

    digit: charset "0123456789"
    expr:    [term ["+" | "-"] expr | term]
    term:    [factor ["*" | "/"] term | factor]
    factor:  [primary "**" factor | primary]
    primary: [value | "(" expr ")"]
    value:   [digit value | digit]

    parse "1 + 2 ** 9" expr

There hasn't been a pattern scheme this clean, convenient or powerful since SNOBOL4. It exploits REBOL's Forth-like (lack of!) syntax, and Smalltalk-like penchant for passing around thunks (anonymous closures -- "[...]" in REBOL builds a lexically-scoped entity called "a block", which can be treated as code (executed) or data (manipulated like a Python list) at will). Now the example doesn't show this, but you can freely mix computations into the middle of the patterns; only *some* of the words in the blocks have special meaning to PARSE. The fragment above is already way beyond what can be accomplished with regexps, but that's just the start of it. Perl too is slamming in more & more ways to get user code to interact with its regexp engine.

So REBOL has a *very* nice approach to this; I believe it's unreasonably clumsy to mimic in Python primarily because of forward references (note e.g. that the block attached to "expr" above refers to "term" before the latter has been bound -- but the stuff inside [...] is just a closure, so that doesn't matter -- it only matters that term gets bound before expr is *executed*). I hit a similar snag years ago when trying to mimic SNOBOL4's approach in Python. Perl's endless abuse of regexps is making that language more absurd by the month.

The other major approach to mixing patterns with computation is due to Icon, another language where a regexp mindset is fatal. On a whim, I whipped up the attached, which illustrates a bit of the Icon approach in Pythonic terms (but without language support for generators, the *heart* of it can't really be captured). Here's an example of how this could be used to implement (the simplest form of) string.split:

    def mysplit(str):
        s = Searcher(str)
        white = CharSet(" \t\n")
        result = []
        s.many(white)               # consume initial whitespace
        while s.notmany(white):     # consume non-whitespace
            result.append(s.get_match())
            s.many(white)
        return result

    >>> mysplit(" \t Hey, that's\tpretty\n\n neat! ")
    ['Hey,', "that's", 'pretty', 'neat!']

The primary thing to note is that there's no seam between analyzing the string and doing computation on the partial results -- "the program is the pattern". This is what Icon does to perfection, Perl is moving toward, and REBOL is arriving at from a different direction. It's The Future <0.9 wink>.

Without generators it's difficult to work backtracking into the Searcher class, but, as above, in my experience the backtracking feature of regexps is rarely *needed*! For example, at various points "split" wants to suck up all the whitespace characters, and that's *it* -- the backtracking possibility in the regexp \s+ is often a bug just waiting for unexpected *context* to trigger it. A hairy regexp is pure hell; but what simpler regexps can do don't require all that funky regexp machinery.

BTW, the mxTextTools engine could be used to get blazing implementations of the primary Searcher methods (it excels at simple analysis). OTOH, making lots of calls to analyze short strings is slow. The only clean solutions to that are Perl's and Icon's (build everything into one language so the compiler can optimize stuff away), and REBOL's (make no distinction between code and data, so that code can be analyzed & optimized at runtime -- and build the entire implementation around making closures and calls supernaturally fast).

the-less-you-use-regexps-the-less-you-miss-'em<wink>-ly y'rs  - tim


class CharSet:
    def __init__(self, seq):
        self.seq = seq
        d = {}
        for ch in seq:
            d[ch] = 1
        self.haskey = d.has_key

    def __call__(self, ch):
        return self.haskey(ch)

    def __add__(self, other):
        if isinstance(other, CharSet):
            other = other.seq
        return CharSet(self.seq + other)

def _normalize_index(i, n):
    assert n >= 0
    if i >= 0:
        return min(i, n)
    elif n == 0:
        return 0
    # want smallest q s.t. i + q*n >= 0
    # <-> q*n >= -i
    # <-> q >= -i/n
    # so q = ceiling(-i/n) = -floor(i/n)
    return i - (i/n)*n

class Searcher:
    def __init__(self, str, lo=0, hi=None):
        """Create object to search in str[lo:hi].

        lo defaults to 0.
        hi defaults to len(str).
        len(str) is repeatedly added to negative lo or hi until reaching
        a number >= 0.
        If lo > hi, a uselessly empty slice will be searched.
        The search cursor is initialized to lo.
        """
        self.s = str
        self.lo = _normalize_index(lo, len(str))
        if hi is None:
            self.hi = len(str)
        else:
            self.hi = _normalize_index(hi, len(str))
        if self.lo > self.hi:
            self.hi = self.lo
        self.i = self.lo
        self.lastmatch = None, None

    def any(self, charset, consume=1):
        """Try to match single character in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """
        i = self.i
        if i < self.hi and charset(self.s[i]):
            if consume:
                self.__consume(i+1)
            return 1
        return 0

    def notany(self, charset, consume=1):
        """Try to match single character not in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """
        i = self.i
        if i < self.hi and not charset(self.s[i]):
            if consume:
                self.__consume(i+1)
            return 1
        return 0

    def many(self, charset, consume=1):
        """Try to match one or more characters in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """
        i, n, s = self.i, self.hi, self.s
        j = i
        while j < n and charset(s[j]):
            j = j+1
        if i < j:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def notmany(self, charset, consume=1):
        """Try to match one or more characters not in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """
        i, n, s = self.i, self.hi, self.s
        j = i
        while j < n and not charset(s[j]):
            j = j+1
        if i < j:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def match(self, str, consume=1):
        """Try to match string "str".

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """
        i = self.i
        j = i + len(str)
        if self.s[i:j] == str:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def get_str(self):
        """Return subject string."""
        return self.s

    def get_lo(self):
        """Return low slice bound."""
        return self.lo

    def get_hi(self):
        """Return high slice bound."""
        return self.hi

    def get_pos(self):
        """Return current value of search cursor."""
        return self.i

    def get_match_indices(self):
        """Return slice indices of last "consumed" match."""
        return self.lastmatch

    def get_match(self):
        """Return last "consumed" matching substring."""
        i, j = self.lastmatch
        if i is None:
            raise ValueError("no match to return!")
        return self.s[i:j]

    def set_pos(self, pos, consume=1):
        """Set search cursor to new value.  No return value.

        If optional arg "consume" is true, the last match is set to the
        slice between pos and the current cursor position.
        """
        p = _normalize_index(pos, len(self.s))
        if not self.lo <= p <= self.hi:
            raise ValueError("pos out of bounds: " + `pos`)
        if consume:
            self.__consume(p)
        else:
            self.i = p

    def move_pos(self, incr, consume=1):
        """Move the cursor by incr characters.  No return value.

        If the new value is outside the slice bounds, it's clipped.
        If optional arg "consume" is true, the last match is set to the
        slice between the old and new cursor positions.
        """
        newi = self.i + incr
        if newi < self.lo:
            newi = self.lo
        elif newi > self.hi:
            newi = self.hi
        if consume:
            self.__consume(newi)
        else:
            self.i = newi

    def __consume(self, newi):
        i, j = self.i, newi
        if i > j:
            i, j = j, i
        self.lastmatch = i, j
        self.i = newi

Tim Peters is back from his vacation:
we have something called SIO which uses memory mapping where possible, and just a more aggressive read-ahead for other cases. on a windows box, a traditional while/readline loop runs 3-5 times faster than before. with SRE instead of re, a while/readline/match loop runs up to 10 times faster than before. note that this is without *any* changes to the Python source code...
I've attached some old benchmarks. I think the current code base is a bit faster, but you get the idea.
In any case, /F is working on a new engine (for Unicode), and I believe he has this all well in hand.
with a little luck, the new module will replace both pcre and regex... not to mention that it's fairly easy to write your own front-end to the matching engine -- the expression parser and the compiler are both written in good old python.

</F>

    $ python sre_bench.py
                0      5     50    250   1000   5000  25000
            -----  -----  -----  -----  -----  -----  -----  -----
    search for Python|Perl in Perl ->
    sre8    0.007  0.008  0.010  0.010  0.020  0.073  0.349
    sre16   0.007  0.007  0.008  0.010  0.020  0.075  0.353
    re      0.097  0.097  0.101  0.103  0.118  0.175  0.480
    regex   0.007  0.007  0.009  0.020  0.059  0.271  1.320

    search for (Python|Perl) in Perl ->
    sre8    0.007  0.007  0.007  0.010  0.020  0.074  0.344
    sre16   0.007  0.007  0.008  0.010  0.020  0.074  0.347
    re      0.110  0.104  0.111  0.115  0.125  0.184  0.559
    regex   0.006  0.006  0.009  0.019  0.057  0.285  1.432

    search for Python in Python ->
    sre8    0.007  0.007  0.007  0.011  0.021  0.072  0.387
    sre16   0.007  0.007  0.008  0.010  0.022  0.082  0.365
    re      0.107  0.097  0.105  0.102  0.118  0.175  0.511
    regex   0.009  0.008  0.010  0.018  0.036  0.139  0.708

    search for .*Python in Python ->
    sre8    0.008  0.007  0.008  0.011  0.021  0.079  0.379
    sre16   0.008  0.008  0.008  0.011  0.022  0.075  0.402
    re      0.102  0.108  0.119  0.183  0.400  1.545  7.284
    regex   0.013  0.019  0.072  0.318  1.231  8.035 45.366

    search for .*Python.* in Python ->
    sre8    0.008  0.008  0.008  0.011  0.021  0.080  0.383
    sre16   0.008  0.008  0.008  0.011  0.021  0.079  0.395
    re      0.103  0.108  0.119  0.184  0.418  1.685  8.378
    regex   0.013  0.020  0.073  0.326  1.264  9.961 46.511

    search for .*(Python) in Python ->
    sre8    0.007  0.008  0.008  0.011  0.021  0.077  0.378
    sre16   0.007  0.008  0.008  0.011  0.021  0.077  0.444
    re      0.108  0.107  0.134  0.240  0.637  2.765 13.395
    regex   0.026  0.112  3.820 87.322  (skipped)

    search for .*P.*y.*t.*h.*o.*n.* in Python ->
    sre8    0.010  0.010  0.014  0.031  0.093  0.419  2.212
    sre16   0.010  0.011  0.014  0.030  0.093  0.419  2.292
    re      0.112  0.121  0.195  0.521  1.747  8.298 40.877
    regex   0.026  0.048  0.248  1.148  4.550 24.720  ...

(searching for patterns in padded strings; sre8 is the sre engine compiled for 8-bit characters, sre16 is the same engine compiled for 16-bit characters)

[Fredrik Lundh, whose very nice eMatter book is on sale until the end of the 20th century (as real people think of it), although the eMatter distribution scheme has lots of problems [just an editorial note from a bot who has to-- for unknown reasons Fatbrain "is working on" --delete the Fatbrain registry tree and reregister the book almost every time he tries to open it <wink> ] ]
If so, there's potential for significantly more speed. Python does its line-at-a-time input with a character-at-a-time macro-in-a-loop, the same way naive vendors (read "almost all vendors") implement fgets. It's replacing that inner loop with direct peeking into the FILE buffer that gets Perl its dramatic speed -- despite that Perl has fancier input functionality (the oft-requested automagical "input record separator"). So it sounds like the Perl trick is orthogonal to SIO's tricks; Perl isn't doing mmaps or read-aheads or anything else fancy under the covers -- it only optimizes the inner loop!
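
For concreteness, here is a pure-Python caricature of the buffer-peeking idea -- scan for the newline in a big in-memory buffer instead of testing one character at a time. This is illustrative only; Perl and SIO do their versions of this below the Python level, and the names here are made up:

    import string

    class ReadAheadFile:
        """Sketch of a read-ahead line reader: pull in large chunks with
        read(), then slice lines out of the in-memory buffer."""
        def __init__(self, file, chunksize=65536):
            self.file = file            # any object with a read(size) method
            self.chunksize = chunksize
            self.buf = ""
            self.pos = 0

        def readline(self):
            while 1:
                i = string.find(self.buf, "\n", self.pos)
                if i >= 0:
                    # a complete line is already buffered
                    line = self.buf[self.pos:i+1]
                    self.pos = i + 1
                    return line
                chunk = self.file.read(self.chunksize)
                if not chunk:
                    # EOF: hand back whatever is left (may be "")
                    line = self.buf[self.pos:]
                    self.buf = ""
                    self.pos = 0
                    return line
                # drop consumed data, append the new chunk, and retry
                self.buf = self.buf[self.pos:] + chunk
                self.pos = 0

The usual while/readline loop then works unchanged on top of it:

    f = ReadAheadFile(open("somefile.txt"))
    while 1:
        line = f.readline()
        if not line:
            break
        process(line)
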
If something more tangible than luck would help to make this come true, feel free to mention it <wink>.
Ah, good news / bad news. Perl refugees aren't accustomed to "precompiling" regexp objects, so write code that will cause regexps to get recompiled over & over. Even if you cache the results under the covers, the overhead of the Python call to the regexp compiler will likely take as long as the engine takes to search. Personally, in such cases, I think they should learn how to use the language <0.5 wink>.
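
For concreteness, "caching under the covers" might look something like this hypothetical wrapper (the names are made up); the point being that even on a cache hit, the extra dictionary lookup and Python-level call are pure overhead compared with hoisting re.compile() out of the loop yourself:

    import re

    _pattern_cache = {}

    def cached_search(pattern, text):
        # compile each textual pattern once; reuse the compiled object
        # on subsequent calls with the same pattern
        compiled = _pattern_cache.get(pattern)
        if compiled is None:
            compiled = re.compile(pattern)
            _pattern_cache[pattern] = compiled
        return compiled.search(text)

versus the precompiled form: call pat = re.compile(pattern) once, then use pat.search(text) inside the loop.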

Tim Peters wrote:
What is QIO ?
All true. mxTextTools provides the tools, not the magic. But this is also its strength: you can optimize the hell out of your particular parsing requirement without having to think about how the RE optimizer works.
Looks nice indeed, but how does executable code fit into that definition? (mxTextTools allows you to write your own parsing elements in Python, BTW; it should be possible to use those mechanisms to achieve a similar integration.)
That's why mxTextTools converts these search idioms into byte codes which it executes at C level. Some future version will even "precompile" the tuple input and then omit the type checks during the search... that should give another noticeable speedup. Note that recursion etc. can be done at C level too -- Python function calls are not needed.
Just for kicks, here is the mysplit() function using mxTextTools:

    from mx.TextTools import *

    table = (
        # Match all whitespace
        (None,AllInSet,whitespace_set,+1),
        # Match and tag all non-whitespace
        ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
        # Loop until EOF
        (None,EOF,Here,-2),
    )

    def mysplit(text):
        return tag(text,table)[1]

The timings:

    mysplit:       5.84 sec.
    string.split:  3.62 sec.

Note that you can customize the above to split text at any character set you like, not just whitespace... without compiling or writing C code. The function mx.TextTools.setsplit() provides this functionality as a pure C function.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: Get ready to party !
Business:      http://www.lemburg.com/
Python Pages:  http://www.lemburg.com/python/
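
The timings above came from MAL's own setup; a crude harness along these lines (the input file name and repeat count are placeholders) is enough to reproduce that kind of comparison on your own data:

    import time
    import string

    def timeit(func, text, repeat=100):
        # run func(text) repeat times and return elapsed CPU seconds
        start = time.clock()
        for i in range(repeat):
            func(text)
        return time.clock() - start

    sample = open("sample.txt").read()      # placeholder input file
    print "mysplit:       %5.2f sec." % timeit(mysplit, sample)
    print "string.split:  %5.2f sec." % timeit(string.split, sample)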

[M.-A. Lemburg]
What is QIO ?
See DejaNews (I don't save URLs). "Quick" line-oriented text input adapted from INN. Someone rewrote that as a Python extension module.
Looks nice indeed, but how does executable code fit into that definition ?
See the URL above I didn't save <wink>. PARSE's "pattern" argument is a block. Blocks can be (& often are) nested. Whether any given block is code or data is all the same to REBOL, so passing nested code blocks in PARSE's pattern argument is easy. Because blocks are lexically scoped, assignments (etc) inside a block are (well, can be) visible to its context; etc. It's a very Lispish approach. REBOL is essentially Scheme under the covers, but with syntax much more like Forth's (whitespace-separated strings of arbitrary non-whitespace characters, with few pre-assigned meanings or restrictions -- in fact, it's impossible for a compiler to determine where a REBOL function call begins or ends! can't be known until runtime).
It can't capture the flavor -- although I don't know that it needs to <wink>. There's no distinction between "the pattern language" and "the computational language" in REBOL or Icon, and it's hard to explain what a maddening distinction that can be once you've lived without it. mxTextTools embedding would feel more like Icon, where the matching engine is fully exposed to the programmer (REBOL hides it, allowing only "approved" interactions).
OTOH, making lots of calls to analyze short strings is slow.
That's also the curse of having distinct languages; e.g., Python already had recursion, but you needed to reimplement it in a different way with different syntax and different rules in your pattern language. In Icon etc, there's no difference between a recursive pattern and a recursive function, except in *what* it computes. The machinery is all the same, and both more powerful and easier to learn because of that.
That's equally true of the example I posted <wink>. Now what if I wanted to stop splitting right after I find a keyword, recognized as such because it's a key in some passed-in dictionary? In my example, I make an obvious local code change, from

    while s.notmany(white):     # consume non-whitespace
        result.append(s.get_match())
        s.many(white)

to

    while s.notmany(white):     # consume non-whitespace
        word = s.get_match()
        result.append(word)
        if dictionary.has_key(word):
            break
        s.many(white)

What does it do to your example? Or what if the target string isn't "a string" (the code I posted only assumes the "str" object responds to indexing and slicing -- any buffer object is fine -- so my example doesn't change at all)? Or what if you need to pass the tokens on as they're found, pipeline style? Etc.

This is why I do complex string processing in Icon <0.9 wink>. OTOH, at what it does well, mxTextTools runs quicker than Icon. Its biggest problem has always been that e.g. nobody knows what the hell

    (None,EOF,Here,-2),

*means* at first glance -- or third <wink>.

an-extreme-on-the-transparency-vs-speed-curve-ly y'rs  - tim
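
The pipeline variant, for instance, is just one more local edit to the same Python code -- a sketch (mine, not from the original post) using the Searcher and CharSet classes attached earlier, handing each token to a caller-supplied function the moment it's recognized:

    def split_pipeline(str, consumer):
        # like mysplit, but feeds tokens downstream as they are found
        # instead of accumulating them all in a list
        s = Searcher(str)
        white = CharSet(" \t\n")
        s.many(white)                   # consume initial whitespace
        while s.notmany(white):         # consume non-whitespace
            consumer(s.get_match())
            s.many(white)

    def show(word):
        print word

    split_pipeline("pass each token along as soon as it is found", show)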

Tim Peters wrote:
Ok, thanks.
If I understand the concept correctly, I think Python could do pretty much the same thing. The bummer is of course the need for new keywords and byte codes (although these could be split out into a separate text scanning engine). Using Python function calls would slow down things to an extent that would render the added functionality useless, well IMHO anyways ;-)
Of course it's hard for a Turing Machine to capture the flavor of any high level language :-) When you're programming the mxTextTools Tagging Engine directly you feel like you're writing assembler... but things are moving in the right direction: Tony Ibbs has a nice meta-language and M.C. Fletcher his SimpleParse to cover up these insufficiencies.
Agreed.
You'd replace the 'text' tagobj with a callable object and write AllInSet + CallTag as command. The Tagging Engine will then call the object with arguments (taglist,text,l,r,subtags) and let it decide what to do. In your example it would check the dictionary and raise an exception in case a keyword is found to stop any further scanning. If it's not a keyword, it would simply append the found string to the taglist and return None. Here's the code:

    from mx.TextTools import *
    import exceptions

    stoplist = {'abc':1, 'def':1}

    class KeywordFound(exceptions.StandardError):
        def __init__(self, taglist):
            self.taglist = taglist

    def callable(taglist,text,l,r,subtags):
        taglist.append(text[l:r])
        if stoplist.has_key(text[l:r]):
            raise KeywordFound(taglist)

    table = (
        # Match all whitespace
        (None,AllInSet,whitespace_set,+1),
        # Match and tag all non-whitespace
        (callable,AllInSet + CallTag,nonwhitespace_set,+1),
        # Loop until EOF
        (None,EOF,Here,-2),
    )

    def mysplitex(text):
        try:
            return tag(text,table)[1]
        except KeywordFound, data:
            return data.taglist
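
Given the stoplist above ('abc' and 'def' are the keywords), the intended behavior would presumably be something like this (my untested reading of the code, not verified output):

    >>> mysplitex("foo abc bar baz")
    ['foo', 'abc']

i.e. scanning stops as soon as the keyword 'abc' is tagged, and everything after it is ignored.
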
The current version only handles string objects, but I am already beginning to convert all the APIs in mxTextTools to "s#" or "t#" style (can't decide which to use... "s#" is great for processing raw data, while "t#" more closely refers to text processing).
You can have all that extra magic via callable tag objects or callable matching functions. It's not exactly nice to write, but I'm sure that a meta-language could do the conversions for you.
The structure of those tag tables is very simple:

    (tagobject, command, argument[,
     jump offset in case of failure[,
     jump offset in case of success]])

Please remember that this is byte code, not some higher level abstraction. The design is very much inverted from what you'd usually do: design a nice language and then try to find a suitable set of byte codes to make it work as intended.

Anyway, I'll keep focussing on the speed aspect of mxTextTools; others can focus on abstractions, so that eventually everybody will be happy :-)

Happy New Year,

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: Get ready to party !
Business:      http://www.lemburg.com/
Python Pages:  http://www.lemburg.com/python/
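
Lined up against the whitespace-splitting table posted earlier in the thread, the fields read like this (the column labels and per-entry comments are added here for orientation only; the optional jump-on-success offset is omitted in all three entries):

    table = (
        #  tagobject   command                   argument            jump on failure
        (  None,       AllInSet,                 whitespace_set,     +1),  # "Match all whitespace"
        (  'text',     AllInSet + AppendMatch,   nonwhitespace_set,  +1),  # "Match and tag all non-whitespace"
        (  None,       EOF,                      Here,               -2),  # "Loop until EOF"
    )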

This is why I do complex string processing in Icon <0.9 wink>.
[MAL]
That wasn't my point: I do it in Icon because it *is* "exactly nice to write", and doesn't require any yet-another meta-language. It's all straightforward, in a way that separate schemes pasted together can never be (simply because they *are* "separate schemes pasted together" <wink>). The point of my Python examples wasn't that they could do something mxTextTools can't do, but that they were *Python* examples: every variation I mentioned (or that you're likely to think of) was easy to handle for any Python programmer, because the "control flow" and "data type" etc aspects could be handled exactly the way they always are in *non* pattern-matching Python code too, rather than recoded in pattern-scheme-specific different ways (e.g., where I had a vanilla "if/break", you set up a special exception to tickle the matching engine).

I'm not attacking mxTextTools, so don't feel compelled to defend it -- people using regexps in those examples are dead in the water. mxTextTools is very good at what it does; if we have a real disagreement, it's probably that I'm less optimistic about the prospects for higher-level wrappers (e.g., MikeF's SimpleParse is much slower than "a real" BNF parsing system (ARBNFPS), in part because he isn't doing all the optimizations ARBNFPS does, but also in part because ARBNFPS uses an underlying engine more optimized to its specific task than mxTextTools' more-general engine *can* be). So I don't see mxTextTools as being the answer to everything -- and if you hadn't written it, you would agree with that on first glance <wink>.
You and I will be, anyway <wink>.

Tim Peters wrote:
Oh, I wasn't defending it -- I know that it is cryptic and sometimes a pain to use. But given that you don't have to invoke a C compiler to get raw speed, I find it a rather useful alternative for coding fast utility functions which would otherwise have to be written in C. The other reason it exists is simply because I don't like the recursive style of regexps too much. mxTextTools is simple and straightforward. Backtracking is still possible, but not recommended.
Oh, I'm sure it *is* the answer to all our problems ;-)

    def main(*dummy):
        from mx.TextTools import *
        tag("",((main, Skip + CallTag, 0),))
Happy New Year :-)

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: Happy New Century !
Business:      http://www.lemburg.com/
Python Pages:  http://www.lemburg.com/python/

participants (5)

- Andrew M. Kuchling
- Fredrik Lundh
- M.-A. Lemburg
- Skip Montanaro
- Tim Peters