May I have developer status on the SourceForge CVS, please? I maintain two standard-library modules (shlex and netrc) and have been involved with the development of several others (including Cmd, smtp, httplib, and multifile).

My only immediate plan for what to do with developer access is to add the browser-launch capability previously discussed on this list. My general interest is in improving the standard class library, especially in the areas of Internet-protocol support (urllib, ftp, telnet, pop, imap, smtp, nntplib, etc.) and mini-language toolkits and frameworks (shlex, netrc, Cmd, ConfigParser).

If the Internet-protocol support in the library were broken out as a development category, I would be willing to fill the patch-handler slot for it.

--
Eric S. Raymond <http://www.tuxedo.org/~esr>

See, when the GOVERNMENT spends money, it creates jobs; whereas when the money is left in the hands of TAXPAYERS, God only knows what they do with it. Bake it into pies, probably. Anything to avoid creating jobs. -- Dave Barry
"Eric S. Raymond" wrote:
...
My only immediate plan for what to do with developer access is to add the browser-launch capability previously discussed on this list. My general interest is in improving the standard class library, especially in the areas of Internet-protocol support (urllib, ftp, telnet, pop, imap, smtp, nntplib, etc.) and mini-language toolkits and frameworks (shlex, netrc, Cmd, ConfigParser).
As an aside: I would be pumped about getting a generic lexer into the Python distribution. Greg Ewing was working on one and there are various others out there.

http://www.cosc.canterbury.ac.nz/~greg/python/Plex/

--
Paul Prescod - Not encumbered by corporate consensus

The calculus and the rich body of mathematical analysis to which it gave rise made modern science possible, but it was the algorithm that made the modern world possible. - The Advent of the Algorithm, by David Berlinski
Paul Prescod <paul@prescod.net>:
As an aside: I would be pumped about getting a generic lexer into the Python distribution. Greg Ewing was working on one and there are various others out there. http://www.cosc.canterbury.ac.nz/~greg/python/Plex/
Yes, this would be a good thing. I'm also talking with John Aycock about his elegant SPARK toolkit for generating Earley-algorithm parsers. Once that comes out of beta, I would consider it a core-library candidate.

--
Eric S. Raymond <http://www.tuxedo.org/~esr>

No matter how one approaches the figures, one is forced to the rather startling conclusion that the use of firearms in crime was very much less when there were no controls of any sort and when anyone, convicted criminal or lunatic, could buy any type of firearm without restriction. Half a century of strict controls on pistols has ended, perversely, with a far greater use of this weapon in crime than ever before. -- Colin Greenwood, in the study "Firearms Control", 1972
"Eric S. Raymond" wrote:
...
Yes, this would be a good thing. I'm also talking with John Aycock about his elegant SPARK toolkit for generating Earley-algorithm parsers. Once that comes out of beta, I would consider it a core-library candidate.
I pointed one of my co-workers at Spark and he loved the lexer, but said that the parser ended up being too slow to be useful. I didn't know enough about the Earley algorithm to suggest how he could reorganize his grammar to optimize for it. If naive Python programmers cannot generate usable parsers, then it may not be appropriate for the standard library.

--
Paul Prescod - Not encumbered by corporate consensus

The calculus and the rich body of mathematical analysis to which it gave rise made modern science possible, but it was the algorithm that made the modern world possible. - The Advent of the Algorithm (pending), by David Berlinski
Paul Prescod <paul@prescod.net>:
I pointed one of my co-workers at Spark and he loved the lexer, but said that the parser ended up being too slow to be useful. I didn't know enough about the Earley algorithm to suggest how he could reorganize his grammar to optimize for it. If naive Python programmers cannot generate usable parsers, then it may not be appropriate for the standard library.
I'm using a SPARK-generated parser plus shlex in CML2. This does not seem to create a speed problem.

--
Eric S. Raymond <http://www.tuxedo.org/~esr>

No kingdom can be secured otherwise than by arming the people. The possession of arms is the distinction between a freeman and a slave. -- "Political Disquisitions", a British republican tract of 1774-1775
paul wrote:
As an aside: I would be pumped about getting a generic lexer into the Python distribution.
how about this quick and dirty proposal:

- add a new primitive to SRE: (?P#n), where n is a small integer. this
  primitive sets the match object's "index" variable to n when the engine
  stumbles upon it.

- given a list of "phrases", combine them into a single regular expression
  like this:

      (?:phrase1(?P#1))|(?:phrase2(?P#2))|...

- apply match repeatedly to the input string. for each match, use the
  index attribute to figure out what phrase we matched.

see below for a slightly larger example.

</F>

import sre

class Scanner:
    def __init__(self, lexicon):
        self.lexicon = lexicon
        p = []
        for phrase, action in lexicon:
            p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
        self.scanner = sre.compile("|".join(p))
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.match
        i = 0
        while 1:
            m = match(string, i)
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.index][1]
            if callable(action):
                self.match = match
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]

def s_ident(scanner, token): return token

def s_operator(scanner, token): return "operator%s" % token

def s_float(scanner, token): return float(token)

def s_int(scanner, token): return int(token)

scanner = Scanner([
    (r"[a-zA-Z_]\w*", s_ident),
    (r"\d+\.\d*", s_float),
    (r"\d+", s_int),
    (r"=|\+|-|\*|/", s_operator),
    (r"\s+", None),
    ])

tokens, tail = scanner.scan("sum = 3*foo + 312.50 + bar")

print tokens
if tail:
    print "syntax error at", tail

## prints:
## ['sum', 'operator=', 3, 'operator*', 'foo', 'operator+',
##  312.5, 'operator+', 'bar']
Fredrik Lundh wrote:
...
- add a new primitive to SRE: (?P#n), where n is a small integer. this primitive sets the match object's "index" variable to n when the engine stumbles upon it.
How about an interned string instead?
- given a list of "phrases", combine them into a single regular expression like this:
(?:phrase1(?P#1))|(?:phrase2(?P#2))|...
Will sre do anything about optimizing common prefixes and so forth?

Overall, I like your proposal.

--
Paul Prescod - Not encumbered by corporate consensus

The calculus and the rich body of mathematical analysis to which it gave rise made modern science possible, but it was the algorithm that made the modern world possible. - The Advent of the Algorithm (pending), by David Berlinski
[Paul Prescod]
As an aside: I would be pumped about getting a generic lexer into the Python distribution.
[Fredrik Lundh]
how about this quick and dirty proposal:
- add a new primitive to SRE: (?P#n), where n is a small integer. this primitive sets the match object's "index" variable to n when the engine stumbles upon it.
Note that the lack of "something like this" is one of the real barriers to speeding SPARK's lexing, and the speed of a SPARK lexer now (well, last I looked into this) can be wildly dependent on the order in which you define your lexing methods (partly because there's no way to figure out which lexing method matched without iterating through all the groups to find the first that isn't None).

The same kind of irritating iteration is needed in IDLE and pyclbr too (disguised as unrolled if/elif/elif/... chains), and in tokenize.py (there *really* disguised in a convoluted way, by doing more string tests on the matched substring to *infer* which of the regexp pattern chunks must have matched).

OTOH, arbitrary small integers are not Pythonic. Your example *generates* them in order to guarantee they're unique, which is a bad sign (it implies users can't do this safely by hand, and I believe that's the truth of it too):
for phrase, action in lexicon:
    p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
How about instead enhancing existing (?P<name>pattern) notation, to set a new match object attribute to name if & when pattern matches? Then arbitrary info associated with a named pattern can be gotten at via dicts via the pattern name, & the whole mess should be more readable.

On the third hand, I'm really loathe to add more gimmicks to stinking regexps. But, on the fourth hand, no alternative yet has proven popular enough to move away from those suckers.

if-you-can't-get-a-new-car-at-least-tune-up-the-old-one-ly y'rs - tim
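To make the "irritating iteration" concrete, here is a small, hypothetical illustration (not code from IDLE, pyclbr, or tokenize.py): with only plain groups, the caller has to probe each group in turn to learn which alternative matched.

import re

pattern = re.compile(r"(\d+\.\d*)|(\d+)|([a-zA-Z_]\w*)")

def kind_of(m):
    # No direct way to ask "which alternative matched?" --
    # walk the groups until one is not None.
    if m.group(1) is not None:
        return "float"
    elif m.group(2) is not None:
        return "int"
    elif m.group(3) is not None:
        return "name"

print kind_of(pattern.match("312.50"))   # prints: float
print kind_of(pattern.match("foo"))      # prints: name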
On the third hand, I'm really loathe to add more gimmicks to stinking regexps. But, on the fourth hand, no alternative yet has proven popular enough to move away from those suckers.
if-you-can't-get-a-new-car-at-least-tune-up-the-old-one-ly y'rs - tim
Right. Actually, if it helps, i'm working on porting re2c to python. Because it was written properly, it's rather simple (in fact, i've only needed to modify one file, add some if's to see if we want python generation, and output the python code instead of c code). The lexers it generates for c/C++ are much faster than flex lexers, because they are directly coded. I haven't benchmarked it against SPARK yet, but i would imagine it would blow it away, for the same reason it blows away flex.

--Dan
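For those who haven't seen re2c output, here is a rough, hypothetical sketch of what "directly coded" means (it is not actual re2c output, and no speed claim is attached to it): the recognizer is ordinary branching code, so the scanner's states live in the control flow itself rather than in a transition table that a generic engine interprets.

def scan_one(s, i):
    # Directly coded recognizer for two token kinds.
    c = s[i]
    if c.isdigit():
        j = i + 1
        while j < len(s) and s[j].isdigit():
            j = j + 1
        return "number", s[i:j], j
    elif c.isalpha() or c == "_":
        j = i + 1
        while j < len(s) and (s[j].isalnum() or s[j] == "_"):
            j = j + 1
        return "name", s[i:j], j
    else:
        raise ValueError("unexpected character %r at index %d" % (c, i))

print scan_one("foo42 bar", 0)   # prints: ('name', 'foo42', 5)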
Daniel Berlin wrote:
Actually, if it helps, i'm working on porting re2c to python.
Pointers ?
Because it was written properly, it's rather simple (in fact, i've only needed to modify one file, add some if's to see if we want python generation, and output the python code instead of c code). The lexers it generates for c/C++ are much faster than flex lexers, because they are directly coded. I haven't benchmarked it against SPARK yet, but i would imagine it would blow it away, for the same reason it blows away flex.
Perhaps you should also look at the tagging engine in mxTextTools (you know where...)?! It's very low-level, but it makes a nice target for optimizing parser generators since it provides a Python interface to raw C speed.

--
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
tim wrote:
OTOH, arbitrary small integers are not Pythonic. Your example *generates* them in order to guarantee they're unique, which is a bad sign.
this feature itself has been on the todo list for quite a while; the (?P#n) syntax just exposes the inner workings (the "small integer" is simply something that fits in a SRE_CODE word). as you say, it's probably a good idea to hide it a bit better...
for phrase, action in lexicon:
    p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
How about instead enhancing existing (?P<name>pattern) notation, to set a new match object attribute to name if & when pattern matches? Then arbitrary info associated with a named pattern can be gotten at via dicts via the pattern name, & the whole mess should be more readable.
good idea. and fairly easy to implement, I think. on the other hand, that means creating more real groups. and groups don't come for free... maybe this functionality should only be available through the scanner class? it can compile the patterns separately, and combine the data structures before passing them to the code generator. a little bit more code to write, but less visible oddities.
On the third hand, I'm really loathe to add more gimmicks to stinking regexps. But, on the fourth hand, no alternative yet has proven popular enough to move away from those suckers.
if-you-can't-get-a-new-car-at-least-tune-up-the-old-one-ly y'rs - tim
hey, SRE is a new car. same old technology, though. only smaller ;-) btw, if someone wants to play with this, I just checked in a new SRE snapshot. a little bit of documentation can be found here: http://hem.passagen.se/eff/2000_07_01_bot-archive.htm#416954 </F>
tim wrote:
for phrase, action in lexicon:
    p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
How about instead enhancing existing (?P<name>pattern) notation, to set a new match object attribute to name if & when pattern matches? Then arbitrary info associated with a named pattern can be gotten at via dicts via the pattern name, & the whole mess should be more readable.
I just added "lastindex" and "lastgroup" attributes to the match object. "lastindex" is the integer index of the last matched capturing group, "lastgroup" the corresponding name (or None, if the group didn't have a name). both attributes are None if no group was matched.

</F>
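A quick demonstration of how the new attributes behave (the pattern here is made up for the example):

import re

p = re.compile(r"(?P<number>\d+)|(?P<name>[a-zA-Z_]\w*)")

m = p.match("312")
print m.lastgroup, m.lastindex   # prints: number 1

m = p.match("foo")
print m.lastgroup, m.lastindex   # prints: name 2

m = re.match(r"\d+", "312")      # no capturing groups at all
print m.lastgroup, m.lastindex   # prints: None None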
Might it be worth taking the lexer discussion to the String-SIG? The more public the discussion, the better, and it is why the SIG is there... --amk
Blast from the past! [/F]
for phrase, action in lexicon:
    p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
[Tim]
How about instead enhancing existing (?P<name>pattern) notation, to set a new match object attribute to name if & when pattern matches? Then arbitrary info associated with a named pattern can be gotten at via dicts via the pattern name, & the whole mess should be more readable.
[/F Sent: Sunday, July 02, 2000 6:35 PM]
I just added "lastindex" and "lastgroup" attributes to the match object.
"lastindex" is the integer index of the last matched capturing group, "lastgroup" the corresponding name (or None, if the group didn't have a name). both attributes are None if no group were matched.
Reviewing this before 2.0 has been on my todo list for 3+ months, and finally got to it. Good show! I converted some of my by-hand scanners to use lastgroup, and like it a whole lot. I know you understand why this is Good, so here's a simple example of an "after" tokenizer for those who don't (this one happens to tokenize REXX-like PARSE stmts):

import re

_token = re.compile(r"""
    (?P<space>    \s+)
|   (?P<var>      [a-zA-Z_]\w*)
|   (?P<dontcare> \.)
|   (?P<number>   \d+)
|   (?P<punc>     [-+=()])
|   (?P<string>   " [^"\\\n]* (?: \\. [^"\\\n]*)* "
              |   ' [^'\\\n]* (?: \\. [^'\\\n]*)* '
    )
""", re.VERBOSE).match
del re

(T_SPACE,
 T_VAR,
 T_DONTCARE,
 T_NUMBER,
 T_PUNC,
 T_STRING,
 T_EOF,
) = range(7)

# For debug output.
_enum2name = ["T_SPACE", "T_VAR", "T_DONTCARE", "T_NUMBER",
              "T_PUNC", "T_STRING", "T_EOF",
]

_group2action = {
    "space":    (T_SPACE,    None),
    "var":      (T_VAR,      None),
    "dontcare": (T_DONTCARE, None),
    "number":   (T_NUMBER,   int),
    "punc":     (T_PUNC,     None),
    "string":   (T_STRING,   eval),
}

def tokenize(s, tokeneater):
    i, n = 0, len(s)
    while i < n:
        m = _token(s, i)
        if not m:
            raise ParseError(s, i)
        group = m.lastgroup
        enum, action = _group2action[group]
        val = m.group(group)
        if action is not None:
            val = action(val)
        tokeneater(enum, val)
        i = m.end()
    tokeneater(T_EOF, None)

The tokenize function here used to be a mass of if/elif stmts trying to figure out which group had matched. Now it's all table-driven: easier to write, reuse & maintain, and quicker to boot. +1.

the-aged-may-be-slow-but-they-never-forget<wink>-ly y'rs - tim
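As a usage sketch (this driver is illustrative, not part of Tim's post), a tokeneater is just a callable taking (enum, value):

def show(enum, val):
    print _enum2name[enum], repr(val)

tokenize("sum = 3 + 'abc'", show)

## prints (T_SPACE lines omitted):
## T_VAR 'sum'
## T_PUNC '='
## T_NUMBER 3
## T_PUNC '+'
## T_STRING 'abc'
## T_EOF None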
Tim Peters wrote:
...
Reviewing this before 2.0 has been on my todo list for 3+ months, and finally got to it. Good show! I converted some of my by-hand scanners to use lastgroup, and like it a whole lot. I know you understand why this is Good, so here's a simple example of an "after" tokenizer for those who don't (this one happens to tokenize REXX-like PARSE stmts):
Is there a standard technique for taking a regexp like this and applying it to data fed in a little at a time? Other than buffering the data forever? That's something else I would like in a "standard Python lexer", if that's the goal.

--
Paul Prescod - Not encumbered by corporate consensus

Simplicity does not precede complexity, but follows it. - http://www.cs.yale.edu/homes/perlis-alan/quotes.html
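One common approach, sketched here as a hypothetical class (reusing _token, _group2action, and T_EOF from Tim's example above, and assuming no single token outgrows the buffered tail), is to emit every complete match except one that reaches the very end of the buffer, since the next chunk might extend it, and to carry only the unconsumed remainder over to the next feed():

class StreamTokenizer:
    def __init__(self, tokeneater):
        self.buffer = ""
        self.tokeneater = tokeneater

    def feed(self, data):
        self.buffer = self.buffer + data
        i, n = 0, len(self.buffer)
        while i < n:
            m = _token(self.buffer, i)
            if not m or m.end() == n:
                # No match, or the match touches the end of the buffer;
                # wait for more data in case the token continues.
                break
            group = m.lastgroup
            enum, action = _group2action[group]
            val = m.group(group)
            if action is not None:
                val = action(val)
            self.tokeneater(enum, val)
            i = m.end()
        self.buffer = self.buffer[i:]   # keep only the unconsumed tail

    def close(self):
        self.feed(" ")                  # whitespace terminates any pending token
        self.tokeneater(T_EOF, None)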
[Eric S. Raymond]
May I have developer status on the SourceForge CVS, please? I maintain two standard-library modules (shlex and netrc) and have been involved with the development of several others (including Cmd, smtp, httplib, and multifile).
My only immediate plan for what to do with developer access is to add the browser-launch capability previously discussed on this list. My general interest is in improving the standard class library, especially in the areas of Internet-protocol support (urllib, ftp, telnet, pop, imap, smtp, nntplib, etc.) and mini-language toolkits and frameworks (shlex, netrc, Cmd, ConfigParser).
If the Internet-protocol support in the library were broken out as a development category, I would be willing to fill the patch-handler slot for it.
Eric, I just added you -- go nuts! Don't forget your docstrings, and try hard not to add new modules Guido will hate <0.9 wink -- but new modules do merit python-dev discussion first>.

Ah, one more: the layout of the "Edit Member Permissions" admin page on SF is completely screwed up for me, so you got whatever the default permissions are. This looked fine to me a few days ago, but we've added several members since then.

Would one of the admins using Netscape please check that page for sane display? I can't yet tell whether it's an IE5 or SF problem.
Tim Peters writes:
Ah, one more: the layout of the "Edit Member Permissions" admin page on SF is completely screwed up for me, so you got whatever the default permissions are. This looked fine to me a few days ago, but we've added several members since then.
Would one of the admins using Netscape please check that page for sane display? I can't yet tell whether it's an IE5 or SF problem.
It looks fine to me -- you may be plagued by a slow network connection. ;)

I've updated Eric's permissions so he can use the patch manager properly.

-Fred

--
Fred L. Drake, Jr. <fdrake at beopen.com>
BeOpen PythonLabs Team Member
participants (8)
- Andrew Kuchling
- Daniel Berlin
- Eric S. Raymond
- Fred L. Drake, Jr.
- Fredrik Lundh
- M.-A. Lemburg
- Paul Prescod
- Tim Peters