Re: [Python-ideas] Where/how to propose an addition to a standard module?

On Oct 13, 2008, at 8:46 AM, pruebauno@latinmail.com wrote:
Well, I suppose if you're already used to RE, then maybe it's not obvious that to an RE newbie, this:

    regex = re.compile("The (?P<object>.*?) in (?P<location>.*?) falls mainly in the (?P<subloc>.*?).")
    d = regex.match(text).groupdict()

is far harder to read and type correctly than this:

    templ = Template("The $object in $location falls mainly in the $subloc")
    d = templ.match(text)

Any other example would show the same simplification.

Of course, if you're the sort of person who uses RE, you probably don't use Template.substitute either, since you probably like and are comfortable with the string % operator. But Template.substitute was introduced to make it easier to handle the common, simple substitution operations, and I believe adding a Template.match method would do the same thing for common, simple matching operations.

Here's a more fleshed-out proposal, with rationale and references -- see if this makes it any clearer why I think this would be a fine addition to the Template class.

Abstract

Introduces a new method on the string.Template [1] class, match(), to perform the approximate inverse of the existing substitute() method. That is, it attempts to match an input string against a template, and if successful, returns a dictionary providing the matched text for each template field.

Rationale

PEP 292 [2] added a simplified string substitution feature, allowing users to easily substitute text for named fields in a template string. The inverse operation is also useful: given a template and an input string, one wishes to find the text in the input string matching the fields in the template. However, Python currently has no easy way to do it.

While this named matching operation can be accomplished using regular expressions, the constructions required are somewhat complex and error prone. It can also be done using third-party modules such as pyparsing, but again the setup requires more code and is not obvious to programmers inexperienced with that module.

In addition, the Template class already has all the data needed to perform this operation, so it is a natural fit to simply add a new method on this class to perform a match, in addition to the existing method to perform a substitution.

Proposal

Proposed is the addition of one new method on the existing Template class, as follows:

    def match(self, text, greedy=False)

'match' is a new method which accepts one required parameter, an input string; and one optional parameter, 'greedy', which determines whether matches should be done in a greedy manner, equivalent to regex pattern '(.*)'; or in a non-greedy manner, equivalent to '(.*?)'.

If the input string can be matched to the template pattern (respecting the 'greedy' flag), then match returns a dictionary, where each field in the pattern maps to the corresponding part of the input string.

If the input string cannot be matched to the template pattern, then match returns None.

Examples:

    >>> from string import Template
    >>> s = Template('$name was born in ${country}')
    >>> print s.match('Guido was born in the Netherlands')
    {'name':'Guido', 'country':'the Netherlands'}
    >>> print s.match('Spam was born as a canned ham')
    None

Note that when the match is successful, the resulting dictionary could be passed through Template.substitute to reconstitute the original input string. Conversely, any string created by Template.substitute could be matched by Template.match (though in unusual cases, the resulting dictionary might not exactly match the original, e.g. if the string could be matched in multiple ways). Thus, .match and .substitute are approximately inverse operations.

References

[1] Template Strings
    http://www.python.org/doc/2.5.2/lib/node40.html

[2] PEP 292: Simpler String Substitutions
    http://www.python.org/dev/peps/pep-0292/
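A minimal sketch of how the proposed method might behave, written as a subclass because string.Template has no match() method. The class name MatchTemplate, the use of the documented Template.pattern attribute to locate fields, and the choice to require a match against the whole input string are all assumptions made for illustration, not part of the proposal as posted.

```python
import re
from string import Template


class MatchTemplate(Template):
    """Hypothetical subclass; string.Template itself has no match()."""

    def match(self, text, greedy=False):
        # Each field becomes a named group: '(.*)' if greedy, else '(.*?)'.
        wildcard = '.*' if greedy else '.*?'
        parts = []
        seen = set()
        pos = 0
        # Template.pattern is the regex Template uses to find placeholders.
        for mo in self.pattern.finditer(self.template):
            # Literal text between placeholders must match exactly.
            parts.append(re.escape(self.template[pos:mo.start()]))
            pos = mo.end()
            name = mo.group('named') or mo.group('braced')
            if name is not None:
                if name in seen:
                    # A repeated field must match the same text again.
                    parts.append('(?P=%s)' % name)
                else:
                    seen.add(name)
                    parts.append('(?P<%s>%s)' % (name, wildcard))
            elif mo.group('escaped') is not None:
                # '$$' in the template stands for a literal '$'.
                parts.append(re.escape(self.delimiter))
            else:
                raise ValueError('invalid placeholder in template')
        parts.append(re.escape(self.template[pos:]))
        # Assumption: the whole input must fit the template.
        found = re.fullmatch(''.join(parts), text, re.DOTALL)
        return found.groupdict() if found else None


s = MatchTemplate('$name was born in ${country}')
d = s.match('Guido was born in the Netherlands')
# d == {'name': 'Guido', 'country': 'the Netherlands'}
# s.substitute(d) gives back the original input string.
# s.match('Spam was born as a canned ham') is None
```

The 'greedy' flag only matters when the input can be split in more than one way: with the template '$a-$b' and the input 'x-y-z', the non-greedy match assigns 'x' to a, while the greedy match assigns 'x-y'.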

On Mon, Oct 13, 2008 at 10:16 AM, Joe Strout <joe@strout.net> wrote:
If I were proposing something like this, I'd be using the new formatting syntax that's supposed to become the standard in Python 3.0:

    http://docs.python.org/dev/3.0/library/string.html#format-string-syntax

That would mean something like::

    '{name} was born in {country}'

Steve
--
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a tiny blip on the distant coast of sanity.
        --- Bucky Katt, Get Fuzzy

On Mon, Oct 13, 2008 at 12:51 PM, Steven Bethard <steven.bethard@gmail.com> wrote:
I agree, but it seems something more is needed here. I would like to be able to parse something that isn't separated by whitespace, and I'd like to be able to tell Python to turn it into an int if I need to. We could keep a similar syntax to the 3.0 formatting syntax, and just change the semantics:

    "First, thou shalt count to {n!i}"

(or {n!d}) could parse an integer out of the string, and

    "{!x:[0-9A-Fa-f]*}" + (":{!x:[0-9A-Fa-f]*}" * 7)

could be used to parse MAC addresses into a list of 8 hex values.

I could easily see this getting way too complex, so maybe you all should just ignore me.

--
Cheers,
Leif
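A rough sketch of the typed-field idea, assuming one possible reading of the suggested syntax: the character after '!' picks a converter ('i' or 'd' for decimal integers, 'x' for hexadecimal), and the text after ':' is a regular expression that overrides the converter's default pattern. The function name parse_fields, the converter table, and the decision to return None on a mismatch are hypothetical; no such facility exists in the standard library.

```python
import re

# Hypothetical conversion codes: default regex plus a cast for each.
_CONVERTERS = {
    'i': (r'[+-]?\d+', int),
    'd': (r'[+-]?\d+', int),
    'x': (r'[0-9A-Fa-f]+', lambda s: int(s, 16)),
}

# Matches fields such as {n}, {n!i} or {!x:[0-9A-Fa-f]*}.
_FIELD = re.compile(
    r'\{(?P<name>\w*)(?:!(?P<conv>\w))?(?::(?P<spec>[^{}]*))?\}')


def parse_fields(template, text):
    """Return the converted field values as a list, or None on mismatch."""
    pattern = []
    casts = []
    pos = 0
    for index, mo in enumerate(_FIELD.finditer(template)):
        # Literal text between fields must match exactly.
        pattern.append(re.escape(template[pos:mo.start()]))
        pos = mo.end()
        default_re, cast = _CONVERTERS.get(mo.group('conv'), ('.*?', str))
        # An explicit spec, if given, replaces the converter's default regex.
        field_re = mo.group('spec') or default_re
        pattern.append('(?P<f%d>%s)' % (index, field_re))
        casts.append(cast)
    pattern.append(re.escape(template[pos:]))
    found = re.fullmatch(''.join(pattern), text)
    if found is None:
        return None
    try:
        return [cast(found.group('f%d' % i)) for i, cast in enumerate(casts)]
    except ValueError:
        # The regex accepted text the cast rejects (e.g. an empty hex field).
        return None


count = parse_fields("First, thou shalt count to {n!i}",
                     "First, thou shalt count to 3")
# count == [3]

mac_template = "{!x:[0-9A-Fa-f]*}" + (":{!x:[0-9A-Fa-f]*}" * 7)
octets = parse_fields(mac_template, "00:1A:2B:3C:4D:5E:6F:70")
# octets == [0, 26, 43, 60, 77, 94, 111, 112]
```

Even this small version shows where the complexity creeps in: the regex in a field and its cast can disagree, which is why the sketch has to catch ValueError after a successful match.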

Leif Walsh wrote:
Given that I have never used the Template class and am using 3.0 str.format, that I believe the former was the bridge that led to the latter, and am not enamored of writing REs, I think adding str.match would be a great idea. To the extent possible, s == form.format(form.match(s)) and args == form.match(form.format(args)). Note that args can either be a sequence or mapping. This might encourage more people to switch faster from % to .format. Given that Guido wants this to happen, he might look favorably
Let us not make the perfect be the enemy of the possible.
> so maybe you all should just ignore me.
Nope. Terry Jan Reedy

On Oct 13, 2008, at 2:43 PM, Terry Reedy wrote:
Well, I'm all for that, but I've never used the 3.0 str.format, and find the documentation on it somewhat unenlightening. Can you mock up an example of how this might work?

In simple form it looks very similar to Template.substitute, but it leaves lots of room for getting not-so-simple. Does that present a problem, in that a great number of format strings would not be useable in reverse? Or would we simply say: if you want to use the reverse (match) operation, then you have to restrict yourself to simple named fields? Or, would we define a slightly different syntax for use with match, that lets you specify numeric conversions or whatever, and give up the idea that these are truly inverse operations?

Best,
- Joe

Joe Strout wrote:
You probably read too much of it. I probably read it all but only learned the basics, intending to review more on a need-to-know basis.
> Can you mock up an example of how this might work?
My ideas and opinions.

1. Start with the simplest possible tasks.

    temp_p1 = 'Testing {0}'
    temp_k1 = 'Testing {word}'
    p1 = ('matches',)  # or perhaps use list instead of tuple
    k1 = {'word':'matches'}
    text1 = 'Testing matches'
    form_p1 = temp_p1.format(*p1)
    form_k1 = temp_k1.format(**k1)
    print(form_p1 == form_k1 == text1)  # prints True today with 3.0c1

    # tests
    temp_p1.match(text1) == p1
    temp_k1.match(text1) == k1

(Come to think of it, easiest would have no literal text, but I already wrote the above.)

Now, write the simplest code that makes these tests pass. Easy. Use str.startswith() to match the literal, non-field text. Add text1e='Bad matches' and consider what exception to raise on non-literal match.

2. Now complicate the task. Add text after the substitution field. Write test and then code with str.endswith. Still do not need re, but might also want to do re version.

3. Add more fields: two positional, two keyword, one of each. Probably need to 'refactor' and use re.

4,5,6. Add field attributes, int formatting, and float formatting, depending on what translated to re's.
Not to me. The most common substitute is straight string interpolation. I suspect that that and int and float field formatting will cover 80% of use cases. For others... 'use the re module'.
If necessary, but I suspect more is reasonably possible. But yes, 'you have to use a subset of possible formats'.

> Or, would we define a slightly different syntax for use with match, that lets you specify numeric conversions or whatever, and give up the idea that these are truly inverse operations?
No, not yet another syntax ;-). Terry Jan Reedy
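The first three steps of the plan above can be sketched as a free function, since str has no match() method. This jumps straight to the re version, uses the documented string.Formatter.parse() to split the template, and supports only plain '{0}' and '{name}' fields. The name format_match, the choice of ValueError for a failed match, and the tuple-versus-dict return convention are assumptions made for illustration.

```python
import re
from string import Formatter


def format_match(template, text):
    """Hypothetical inverse of str.format for plain '{0}' / '{name}' fields.

    Returns a tuple when every field is positional, otherwise a dict.
    Raises ValueError when text does not fit the template.
    """
    pattern = []
    names = []
    for literal, field, spec, conversion in Formatter().parse(template):
        # Literal text must match exactly.
        pattern.append(re.escape(literal))
        if field is None:
            continue  # trailing literal text with no field after it
        if spec or conversion or not re.fullmatch(r'\d+|[A-Za-z_]\w*', field):
            # Steps 4-6 (attributes, int and float formats) are not covered.
            raise NotImplementedError('only simple fields are supported')
        pattern.append('(.*?)')
        names.append(field)
    found = re.fullmatch(''.join(pattern), text, re.DOTALL)
    if found is None:
        raise ValueError('text does not match template')
    values = found.groups()
    if names and all(name.isdigit() for name in names):
        # Positional fields: order the values by their index.
        by_index = {int(name): value for name, value in zip(names, values)}
        return tuple(by_index.get(i) for i in range(max(by_index) + 1))
    # Named (or mixed) fields; a repeated name keeps its last match.
    return dict(zip(names, values))


temp_p1 = 'Testing {0}'
temp_k1 = 'Testing {word}'
text1 = 'Testing matches'

p1 = format_match(temp_p1, text1)   # ('matches',)
k1 = format_match(temp_k1, text1)   # {'word': 'matches'}

# The round trip described earlier in the thread holds for these cases.
assert temp_p1.format(*p1) == text1
assert temp_k1.format(**k1) == text1
```

Raising NotImplementedError for anything beyond a bare field is one way to express "you have to use a subset of possible formats" without silently returning a wrong answer.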

participants (4): Joe Strout, Leif Walsh, Steven Bethard, Terry Reedy