a feature i'd like to see in python #2: indexing of match objects
this one is fairly simple. if `m' is a match object, i'd like to be able to write m[1] instead of m.group(1). (similarly, m[:] should return the same as list(m.groups()).) this would remove some of the verbosity of regexp code, with probably a net gain in readability; certainly no loss. ben
Ben Wing schrieb:
this one is fairly simple. if `m' is a match object, i'd like to be able to write m[1] instead of m.group(1). (similarly, m[:] should return the same as list(m.groups()).) this would remove some of the verbosity of regexp code, with probably a net gain in readability; certainly no loss.
Please post a patch to sf.net/projects/python (or its successor). Several issues need to be taken into account:
- documentation and test cases must be updated to integrate the new API
- for slicing, you need to consider not only omitted indices, but also "true" slices (e.g. m[1:5])
- how should you deal with negative indices?
- should len(m) be supported?
Regards, Martin
Martin v. Löwis wrote:
Several issues need to be taken into account:
the most important issue is that if you want an object to behave as a sequence of something, you need to decide what that something is before you start tinkering with the syntax. under Ben's simple proposal, m[:][1] and m[1] would be two different things. I'm not sure that's a good idea, really. </F>
Fredrik Lundh schrieb:
the most important issue is that if you want an object to behave as a sequence of something, you need to decide what that something is before you start tinkering with the syntax.
under Ben's simple proposal, m[:][1] and m[1] would be two different things. I'm not sure that's a good idea, really.
Ah, right; I misread his proposal as saying that m[:] should return [m[0]] + list(m.groups()) (rather, I expected that m.groups() would include m.group(0)). To answer your first question: it is clearly groups that you want to index, just as the .group() method indexes groups. The typical equivalences should hold, of course, e.g. m[1:5][1] == m[2] etc. Regards, Martin
Martin v. Löwis wrote:
Ah, right; I misread his proposal as saying that m[:] should return [m[0]] + list(m.groups()) (rather, I expected that m.groups() would include m.group(0)).
match groups are numbered 1..N, not 0..(N-1), in both the API and in the RE syntax (and we don't have much control over the latter).
To answer your first question: it is clearly groups that you want to index, just as the .group() method indexes groups.
so what should len(m) do? </F>
Fredrik Lundh schrieb:
Ah, right; I misread his proposal as saying that m[:] should return [m[0]] + list(m.groups()) (rather, I expected that m.groups() would include m.group(0)).
match groups are numbered 1..N, not 0..(N-1), in both the API and in the RE syntax (and we don't have much control over the latter).
py> m = re.match("a(b)", "ab")
py> m.group(0)
'ab'
py> m.group(1)
'b'
To answer your first question: it is clearly groups that you want to index, just as the .group() method indexes groups.
so what should len(m) do?
That's a question: should len be supported at all? If so, it's clear that len(m) == len(m[:]). Regards, Martin
Martin v. Löwis wrote:
match groups are numbered 1..N, not 0..(N-1), in both the API and in the RE syntax (and we don't have much control over the latter).
py> m = re.match("a(b)", "ab")
py> m.group(0)
'ab'
py> m.group(1)
'b'
0 isn't a group, it's an alias for the full match. </F>
Fredrik Lundh schrieb:
match groups are numbered 1..N, not 0..(N-1), in both the API and in the RE syntax (and we don't have much control over the latter).

py> m = re.match("a(b)", "ab")
py> m.group(0)
'ab'
py> m.group(1)
'b'
0 isn't a group, it's an alias for the full match.
So what is the proper term for the things that the .group() method returns? According to http://docs.python.org/lib/match-objects.html it returns "subgroups of the match". So the things to be indexed in this proposal are subgroups of the match. Regards, Martin
Martin v. Löwis wrote:
Fredrik Lundh schrieb:
match groups are numbered 1..N, not 0..(N-1), in both the API and in the RE syntax (and we don't have much control over the latter).

py> m = re.match("a(b)", "ab")
py> m.group(0)
'ab'
py> m.group(1)
'b'

0 isn't a group, it's an alias for the full match.
So what is the proper term for the things that the .group() method returns? According to
http://docs.python.org/lib/match-objects.html
it returns "subgroups of the match".
So the things to be indexed in this proposal are subgroups of the match.
Precisely. But your example had only one group "(b)" in it, which is retrieved using m.group(1). So the subgroups are numbered starting from 1 and subgroup 0 is a special case which returns the whole match.

I know what the Zen says about special cases, but in this case the rules were apparently broken with impunity.

regards
Steve

--
Steve Holden  +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd  http://www.holdenweb.com
Skype: holdenweb  http://holdenweb.blogspot.com
Recent Ramblings  http://del.icio.us/steve.holden
Steve Holden schrieb:
Precisely. But your example had only one group "(b)" in it, which is retrieved using m.group(1). So the subgroups are numbered starting from 1 and subgroup 0 is a special case which returns the whole match.
I know what the Zen says about special cases, but in this case the rules were apparently broken with impunity.
Well, the proposal was to interpret m[i] as m.group(i), for all values of i. I can't see anything confusing with that. Regards, Martin
Martin v. Löwis wrote:
I know what the Zen says about special cases, but in this case the rules were apparently broken with impunity.
Well, the proposal was to interpret m[i] as m.group(i), for all values of i. I can't see anything confusing with that.
it can quickly become rather confusing if you also interpret m[:] as m.groups(), not to mention if you add len() and arbitrary slicing to the mix. what about m[] and m[i,j,k], btw? </F>
Fredrik Lundh wrote:
Martin v. Löwis wrote:
I know what the Zen says about special cases, but in this case the rules were apparently broken with impunity.
Well, the proposal was to interpret m[i] as m.group(i), for all values of i. I can't see anything confusing with that.
it can quickly become rather confusing if you also interpret m[:] as m.groups(), not to mention if you add len() and arbitrary slicing to the mix. what about m[] and m[i,j,k], btw?
What about them? They aren't supposed to be supported by every object that allows subscript, are they? And why not just not implement len()? As for the [:] <-> groups() issue, [:] would have to be consistent with indexing and return the whole match and the subgroups. (Or, the API could be overhauled completely of course, remember it's Py3k.) Georg
Fredrik Lundh schrieb:
it can quickly become rather confusing if you also interpret m[:] as m.groups(), not to mention if you add len() and arbitrary slicing to the mix. what about m[] and m[i,j,k], btw?
I take it that you are objecting to that feature, then? Regards, Martin
Martin v. Löwis wrote:
it can quickly become rather confusing if you also interpret m[:] as m.groups(), not to mention if you add len() and arbitrary slicing to the mix. what about m[] and m[i,j,k], btw?
I take it that you are objecting to that feature, then?
I haven't seen a complete and self-consistent proposal yet, so that's not easy to say. </F>
Fredrik Lundh wrote:
Martin v. Löwis wrote:
it can quickly become rather confusing if you also interpret m[:] as m.groups(), not to mention if you add len() and arbitrary slicing to the mix. what about m[] and m[i,j,k], btw?
I take it that you are objecting to that feature, then?
I haven't seen a complete and self-consistent proposal yet, so that's not easy to say.
</F>
my current proposal can be summarized:

1. m[x] == m.group(x) for x an integer >= 0.
2. all other sequence properties should be consistent with this numbering and with the view of `m' as basically an array.
3. m[name] == m.group(name) for name a string; names are aliases for group numbers.

this implies, for example, that negative indices count from the end, that len(m) == 1 + m.lastindex, that the expression `m[1:]' should be the same as `m.groups()', that `foo in m' is true if `foo' is equal to any group in m or to the whole string, etc. property 3 should also probably imply that names should be allowed as slice indices -- a name is just an alias for a group number, and should behave the same way.

an alternative would be to view a match object as a hash table. then, slices would presumably be disallowed, and `foo in m' would be true if `foo' is a group number in range, or a name of a group. but i don't like this as much; for example, it's not clear what len(m) should return in the case of a named group -- does it count the group once (since a name is just an alias), or twice?

(btw i never really thought until now about the inconsistency in the 'in' operator between arrays and hash tables.)

ben
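In modern Python syntax, Ben's proposal can be sketched as a thin wrapper around a match object (the class name MatchSeq is hypothetical; note that len() here is computed from the pattern's group count, whereas Ben wrote 1 + m.lastindex, which differs when trailing groups fail to match):

```python
import re

class MatchSeq:
    """Hypothetical wrapper sketching the proposal above: the match is
    viewed as an array of 1 + N entries, where entry 0 is the whole
    match and entries 1..N are the groups; names alias group numbers."""

    def __init__(self, m):
        self._m = m
        # entry 0 is the whole match, entries 1..N are the groups
        self._items = (m.group(0),) + m.groups()

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, str):      # point 3: names alias numbers
            return self._m.group(index)
        return self._items[index]       # ints, negatives, and slices

    def __contains__(self, value):
        return value in self._items

m = MatchSeq(re.match("(?P<first>a)(b)(c)", "abc"))
assert len(m) == 4                      # 1 + number of groups
assert m[0] == "abc" and m[-1] == "c"   # negative indices count from the end
assert m[1:] == ("a", "b", "c")         # same as m.groups()
assert m["first"] == m[1] == "a"        # a name is just an alias
assert "b" in m
```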
On Sun, 3 Dec 2006, Fredrik Lundh wrote:
Martin v. Löwis wrote:
Well, the proposal was to interpret m[i] as m.group(i), for all values of i. I can't see anything confusing with that.
it can quickly become rather confusing if you also interpret m[:] as m.groups(), not to mention if you add len() and arbitrary slicing to the mix. what about m[] and m[i,j,k], btw?
I'd say, don't pretend m is a sequence. Pretend it's a mapping. Then the conceptual issues go away. -- ?!ng
Ka-Ping Yee wrote:
I'd say, don't pretend m is a sequence. Pretend it's a mapping. Then the conceptual issues go away.
almost; that would mean raising KeyError instead of IndexError for groups that don't exist, which means that the common pattern

    a, b, c = m.groups()

cannot be rewritten as

    _, a, b, c = m

which would, perhaps, be a bit unfortunate.

taking everything into account, I think we should simply map __getitem__ to group, and stop there. no len(), no slicing, no sequence or mapping semantics. if people want full sequence behaviour with len and slicing and iterators and whatnot, they can do list(m) first.

</F>
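A Python-level sketch of this minimal variant (MinimalMatch is a hypothetical stand-in for the C change): because group() raises IndexError rather than KeyError past the last group, iteration and tuple unpacking work via the old-style sequence fallback even though only __getitem__ is defined:

```python
import re

class MinimalMatch:
    """Hypothetical sketch of the minimal proposal: __getitem__ simply
    delegates to group(), and nothing else is added."""

    def __init__(self, m):
        self._m = m

    def __getitem__(self, index):
        # group() raises IndexError (not KeyError) past the last group,
        # which is exactly what the old-style iteration protocol expects
        return self._m.group(index)

m = MinimalMatch(re.match("(a)(b)(c)", "abc"))
assert m[0] == "abc" and m[2] == "b"

# iteration and unpacking work "for free" via the fallback protocol
_, a, b, c = m
assert (a, b, c) == ("a", "b", "c")
assert list(m) == ["abc", "a", "b", "c"]
```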
Fredrik Lundh wrote:
Ka-Ping Yee wrote:
I'd say, don't pretend m is a sequence. Pretend it's a mapping. Then the conceptual issues go away.
almost; that would mean raising KeyError instead of IndexError for groups that don't exist, which means that the common pattern
a, b, c = m.groups()
cannot be rewritten as
_, a, b, c = m
which would, perhaps, be a bit unfortunate.
taking everything into account, I think we should simply map __getitem__ to group, and stop there. no len(), no slicing, no sequence or mapping semantics. if people want full sequence behaviour with len and slicing and iterators and whatnot, they can do list(m) first.
i'm ok either way -- that is, either with the proposal i previously published, or with this restricted idea. ben
Ben Wing wrote:
i'm ok either way -- that is, either with the proposal i previously published, or with this restricted idea.
ok, I'll whip up a patch for the minimal version of the proposal, if nobody beats me to it (all that's needed is an as_sequence struct with an item slot that basically just calls match_getslice_by_index). </F>
On 5 Dec 2006, at 09:02, Ben Wing wrote:
Fredrik Lundh wrote:
Ka-Ping Yee wrote:
taking everything into account, I think we should simply map __getitem__ to group, and stop there. no len(), no slicing, no sequence or mapping semantics. if people want full sequence behaviour with len and slicing and iterators and whatnot, they can do list(m) first.
i'm ok either way -- that is, either with the proposal i previously published, or with this restricted idea.
I prefer your previous version. It matches my expectations as a user of regular expression matching and as someone with experience of other regexp implementations.

(The current groups() method *doesn't* match those expectations, incidentally. I know I've been tripped up in the past because it didn't include the full match as element 0.)

Basically, I don't see the advantage in the restrictions Fredrik is proposing (other than possibly being simpler to implement, though not actually all that much, I think). Yes, it's a little unusual in that you'd be able to index the match "array" with either integer indices or using names, but I don't view that as a problem, and I don't see how not supporting len() or other list features like slicing and iterators helps.

What's more, I think it will be confusing for Python newbies because they'll see someone doing

    m[3]

and assume that m is a list-like object, then complain when things like

    for match in m: print match

or

    m[3:4]

fail to do what they expect. Yes, you might say "it's a match object, not a list". But, it seems to me, that's really in the same vein as "don't type quit or exit, press Ctrl-D".

Kind regards,
Alastair.

--
http://alastairs-place.net
Alastair Houghton wrote:
(The current groups() method *doesn't* match those expectations, incidentally. I know I've been tripped up in the past because it didn't include the full match as element 0.)
that's because there is no "group 0" in a regular expression; that's just a historical API convenience thing. groups are numbered from 1 and upwards, and "groups()" returns all the actual groups.
What's more, I think it will be confusing for Python newbies because they'll see someone doing
m[3]
and assume that m is a list-like object, then complain when things like
for match in m: print match
that'll work, of course, which might be confusing for people who think they understand how for-in works but don't ;)
or
m[3:4]
fail to do what they expect.
the problem with slicing is that people may 1) expect a slice to return a new object *of the same type* (which opens up a *gigantic* can of worms, both on the implementation level and on the wtf-is-this-thing-really level), and 2) expect things like [::-1] to work, which opens up another can of worms.

I prefer the "If the implementation is easy to explain, it may be a good idea." design principle over the "can of worms" design principle. </F>
Fredrik Lundh wrote:
the problem with slicing is that people may 1) expect a slice to return a new object *of the same type* (which opens up a *gigantic* can of worms, both on the implementation level and on the wtf-is-this-thing-really level), and 2) expect things like [::-1] to work, which opens up another can of worms. I prefer the "If the implementation is easy to explain, it may be a good idea." design principle over the "can of worms" design principle.
This is a good point - I know I consider "m[0:0] == type(m)()" to be a property a well-behaved sequence should preserve. Since match objects can't really do that, better not to pretend to be a sequence at all.

With slicing out of the equation, that only leaves the question of whether or not len(m) should work. I believe it would be nice for len(m) to be supported, so that reversed(m) works along with iter(m).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
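A toy illustration of Nick's point about len(): iter() can fall back on __getitem__ alone, but reversed() refuses to work without __len__ (the class names here are made up for the demonstration):

```python
class WithLen:
    """Toy sequence: __getitem__ plus __len__."""
    def __getitem__(self, i):
        if not 0 <= i < 3:
            raise IndexError(i)
        return "abc"[i]
    def __len__(self):
        return 3

class WithoutLen:
    """The same thing minus __len__."""
    def __getitem__(self, i):
        if not 0 <= i < 3:
            raise IndexError(i)
        return "abc"[i]

# iter() works in both cases via the old-style sequence protocol...
assert list(iter(WithLen())) == ["a", "b", "c"]
assert list(iter(WithoutLen())) == ["a", "b", "c"]

# ...but reversed() needs __len__ to know where to start
assert list(reversed(WithLen())) == ["c", "b", "a"]
try:
    reversed(WithoutLen())
except TypeError:
    pass
else:
    raise AssertionError("reversed() should require __len__")
```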
On 5 Dec 2006, at 15:51, Fredrik Lundh wrote:
Alastair Houghton wrote:
What's more, I think it will be confusing for Python newbies because they'll see someone doing
m[3]
and assume that m is a list-like object, then complain when things like
for match in m: print match
that'll work, of course, which might be confusing for people who think they understand how for-in works but don't ;)
Or (as in my case) guessed at how it works because they can't be bothered to check the code and can't remember from the last time they looked. I don't spend a great deal of time in the guts of Python. But I do use it and have a couple of extensions that I've written for it (one of which I was contemplating releasing publicly and that is impacted by this change---it provides, amongst other things, an alternate implementation of the "re" API, so I'm going to want to implement this too).
or
m[3:4]
fail to do what they expect.
the problem with slicing is that people may 1) expect a slice to return a new object *of the same type*
What I would have expected is that it supported a similar set of sequence methods---that is, that it returned something with a similar signature. I don't see why code would care about it being the exact same type. Anyway, clearly what people will expect here (talking about the match object API) is that m[3:4] would give them a list (or some equivalent sequence object) containing groups 3 and 4. Why do you think someone would expect a match object?
2) expect things like [::-1] to work, which opens up another can of worms.
As long as they aren't expecting it to return the same type of object, is there a can of worms here?
I prefer the "If the implementation is easy to explain, it may be a good idea." design principle over "can of worms" design principle.
As someone who is primarily a *user* of Python, I prefer the idea that sequence objects should operate consistently to the idea that there might be some that don't. By which I mean that anything that supports indexing using integer values should ideally support slicing (including things like [::-1]). Kind regards, Alastair. -- http://alastairs-place.net
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 5 Dec 2006, at 15:51, Fredrik Lundh wrote:
Alastair Houghton wrote:
or
m[3:4]
fail to do what they expect.
the problem with slicing is that people may 1) expect a slice to return a new object *of the same type*
What I would have expected is that it supported a similar set of sequence methods---that is, that it returned something with a similar signature. I don't see why code would care about it being the exact same type.
The problem is that either we return a list (easy), or we return something that is basically another match object (not quite so easy). Either way, we would be confusing one set of users or another. By not including slicing functionality by default, we sidestep the confusion.
Anyway, clearly what people will expect here (talking about the match object API) is that m[3:4] would give them a list (or some equivalent sequence object) containing groups 3 and 4. Why do you think someone would expect a match object?
Because that is what all other slicing operations in base Python do. List, tuple, string, unicode, array, buffer, ... Even extension writers preserve the functionality with Numeric, etc. When you slice a sequence, you get back a slice of that sequence, of the same type you started out with.
I prefer the "If the implementation is easy to explain, it may be a good idea." design principle over "can of worms" design principle.
As someone who is primarily a *user* of Python, I prefer the idea that sequence objects should operate consistently to the idea that there might be some that don't. By which I mean that anything that supports indexing using integer values should ideally support slicing (including things like [::-1]).
You are being inconsistent. You want list, tuple, etc. to be consistent, but you don't want match objects to be consistent. Sorry, but that is silly. Better to not support slices than to confuse the hell out of people by returning a tuple or list from a match slicing.

If people want slicing, they can do list(m)[x:y]. If their matches are of sufficient size where that is a "slow" operation, then they can do [m[i] for i in xrange(x,y)].

- Josiah
On 6 Dec 2006, at 20:29, Josiah Carlson wrote:
The problem is that either we return a list (easy), or we return something that is basically another match object (not quite so easy). Either way, we would be confusing one set of users or another. By not including slicing functionality by default, we sidestep the confusion.
But I don't believe that *anyone* will find it confusing that it returns a list. It's much more likely to be confusing to people that they have to write list(m)[x:y] or [m[i] for i in xrange(x,y)] when m[x] and m[y] work just fine.
As someone who is primarily a *user* of Python, I prefer the idea that sequence objects should operate consistently to the idea that there might be some that don't. By which I mean that anything that supports indexing using integer values should ideally support slicing (including things like [::-1]).
You are being inconsistent. You want list, tuple, etc. to be consistent, but you don't want match objects to be consistent. Sorry, but that is silly. Better to not support slices than to confuse the hell out of people by returning a tuple or list from a match slicing.
That's not true *and* I object to your characterisation of the idea as "silly". What I'm saying is that the idea of slicing always returning the same exact type of object is pointless consistency, because nobody will care *provided* the thing that is returned supports a sensible set of operations given the original type.

Look, I give in. There's no point trying to convince any of you further, and I don't have the time or energy to press the point. Implement it as you will. If necessary it can be an extension of my "re" replacement that slicing is supported on match objects.

Kind regards,
Alastair.

--
http://alastairs-place.net
On 12/6/06, Alastair Houghton <alastair@alastairs-place.net> wrote: [from previous message]:
Anyway, clearly what people will expect here (talking about the match object API) is that m[3:4] would give them a list (or some equivalent sequence object) containing groups 3 and 4. Why do you think someone would expect a match object?
It's much more likely to be confusing to people that they have to write
list(m)[x:y] or [m[i] for i in xrange(x,y)] when m[x] and m[y] work just fine.
<>
Look, I give in. There's no point trying to convince any of you further, and I don't have the time or energy to press the point. Implement it as you will. If necessary it can be an extension of my "re" replacement that slicing is supported on match objects.
Keep in mind when implementing that m[3:4] should contain only the element at index 3, not both 3 and 4, as you've seemed to imply twice. cheers, -Mike
On 7 Dec 2006, at 00:39, Mike Klaas wrote:
Keep in mind when implementing that m[3:4] should contain only the element at index 3, not both 3 and 4, as you've seemed to imply twice.
Yes, you're quite right. I was writing off the top of my head and I'm still a relative newbie to Python coding. Kind regards, Alastair. -- http://alastairs-place.net
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 6 Dec 2006, at 20:29, Josiah Carlson wrote:
The problem is that either we return a list (easy), or we return something that is basically another match object (not quite so easy). Either way, we would be confusing one set of users or another. By not including slicing functionality by default, we sidestep the confusion.
But I don't believe that *anyone* will find it confusing that it returns a list.
We'll have to agree to disagree.
As someone who is primarily a *user* of Python, I prefer the idea that sequence objects should operate consistently to the idea that there might be some that don't. By which I mean that anything that supports indexing using integer values should ideally support slicing (including things like [::-1]).
You are being inconsistent. You want list, tuple, etc. to be consistent, but you don't want match objects to be consistent. Sorry, but that is silly. Better to not support slices than to confuse the hell out of people by returning a tuple or list from a match slicing.
That's not true *and* I object to your characterisation of the idea as "silly". What I'm saying is that the idea of slicing always returning the same exact type of object is pointless consistency, because nobody will care *provided* the thing that is returned supports a sensible set of operations given the original type.
In Python 2.5:

>>> a
<_sre.SRE_Match object at 0x008F6020>
>>> dir(a)
['__copy__', '__deepcopy__', 'end', 'expand', 'group', 'groupdict',
 'groups', 'span', 'start']
Not including end, expand, group, groupdict, groups, span, and start may be confusing to some number of users. Why? Because of the historical invariant already present in the standard library (with the exception of buffer, I was wrong about that one). *We* may not be confused, but it's not about us (I'm personally happy to use the .group() interface); it's about relative newbies who, generally speaking, desire/need consistency (see [1] for a paper showing that certain kinds of inconsistencies are bad - at least in terms of grading - for new computer science students).

Being inconsistent because it's *easy* is what I consider silly. We've got the brains, we've got the time; if we want slicing, let's produce a match object. If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.

I honestly don't care about whether slicing should go in or not; I use (?=) when I don't want a group. What I really don't want is someone coming in days after 2.6 is released complaining about match slicing not supporting things they think they need. Better to just tell them: use list(m)[x:y] or islice(iterable, [start,] stop [, step]) (both of which should work on arbitrary iterables, the latter of which works on *infinite* iterables) or produce a match object. All or nothing. Half-assing it is a waste.
Look, I give in. There's no point trying to convince any of you further, and I don't have the time or energy to press the point. Implement it as you will. If necessary it can be an extension of my "re" replacement that slicing is supported on match objects.
I'm sorry to see you give up so easily. One thing to realize/remember is that basically everyone who frequents python-dev has their own "make life easier" function/class library for those things that have been rejected for general inclusion in Python.

- Josiah

[1] http://www.cs.mdx.ac.uk/research/PhDArea/saeed/paper1.pdf
On 7 Dec 2006, at 01:01, Josiah Carlson wrote:
*We* may not be confused, but it's not about us (I'm personally happy to use the .group() interface); it's about relative newbies who, generally speaking, desire/need consistency (see [1] for a paper showing that certain kinds of inconsistencies are bad - at least in terms of grading - for new computer science students). Being inconsistent because it's *easy* is what I consider silly. We've got the brains, we've got the time; if we want slicing, let's produce a match object.
Oh, it isn't that I don't want to produce a match object; I think you've mistaken my intention in that respect. I'd be equally happy for it to be a match object, *but*...
If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.
...if you return match objects from slicing, you have problems like m[::-1].groups(). *I* don't know what that should return.

What led me to think that a tuple or list would be appropriate is the idea that slicing was a useful operation and that I felt it was unlikely that anyone would want to call the match object methods on a slice, coupled with the fact that slices clearly have problems with some of the match object methods. A match object, plus sequence functionality, minus match object methods, is basically just a sequence.

If you're worried about types, you could do something like this:

          generic match object
                   |
        +----------+----------+
        |                     |
  real match object    match object slice

where the "generic match object" perhaps doesn't have all the methods that a "real match object" would have. (In the extreme case, generic match object might basically just be a sequence type.) Then slicing something that was a "generic match object" always gives you a "generic match object", but it might not support all the methods that the original match object supported.
Half-assing it is a waste.
Sure. We're agreed there :-)
Look, I give in. There's no point trying to convince any of you further, and I don't have the time or energy to press the point. Implement it as you will. If necessary it can be an extension of my "re" replacement that slicing is supported on match objects.
I'm sorry to see you give up so easily. One thing to realize/remember is that basically everyone who frequents python-dev has their own "make life easier" function/class library for those things that have been rejected for general inclusion in Python.
It's just that I'm tired and have lots of other things that need doing as well. Maybe I do have a bit more time to talk about it, we'll see. Kind regards, Alastair. -- http://alastairs-place.net
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 7 Dec 2006, at 01:01, Josiah Carlson wrote:
*We* may not be confused, but it's not about us (I'm personally happy to use the .group() interface); it's about relative newbies who, generally speaking, desire/need consistency (see [1] for a paper showing that certain kinds of inconsistancies are bad - at least in terms of grading - for new computer science students). Being inconsistant because it's *easy*, is what I consider silly. We've got the brains, we've got the time, if we want slicing, lets produce a match object.
Oh, it isn't that I don't want to produce a match object; I think you've mistaken my intention in that respect. I'd be equally happy for it to be a match object, *but*...
If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.
...if you return match objects from slicing, you have problems like m[::-1].groups(). *I* don't know what that should return.
I would argue that any 'step' != 1 has no semantically correct result for slicing on a match object, so we shouldn't support it. In that sense, buffer also doesn't support step != 1, but that's because its __getitem__ method doesn't accept slice objects, and uses the (I believe deprecated or removed in Py3k) __getslice__ method. We can easily check for such things in the __getitem__ method (to also support the removal? of __getslice__) and raise an exception. For those who want reversed slices, they can use reversed(m[x:y]), etc.
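A sketch of that rule in Python (SliceableMatch is a hypothetical name): __getitem__ accepts slice objects directly, so no __getslice__ is involved, and any step other than 1 is rejected with an exception:

```python
import re

class SliceableMatch:
    """Hypothetical sketch of the rule above: integer indexing maps to
    group(), contiguous slices are allowed, any other step is rejected."""

    def __init__(self, m):
        self._m = m
        self._items = (m.group(0),) + m.groups()

    def __getitem__(self, index):
        if isinstance(index, slice):
            if index.step not in (None, 1):
                raise ValueError("match slices must be contiguous")
            return self._items[index]
        return self._m.group(index)

m = SliceableMatch(re.match("(A)(B)(C)(D)(E)", "ABCDE"))
assert m[2:5] == ("B", "C", "D")
try:
    m[::-1]
except ValueError:
    pass
else:
    raise AssertionError("step != 1 should be rejected")

# reversed slices go through an explicit conversion instead
assert list(reversed(m[2:5])) == ["D", "C", "B"]
```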
What led me to think that a tuple or list would be appropriate is the idea that slicing was a useful operation and that I felt it was unlikely that anyone would want to call the match object methods on a slice, coupled with the fact that slices clearly have problems with some of the match object methods. A match object, plus sequence functionality, minus match object methods, is basically just a sequence.
Explicit is better than implicit. Refuse the temptation to guess. Let us give them the full match object. If they want to do something silly with slicing (reversed, skipping some, etc.), make them use list() or islice().
If you're worried about types, you could do something like this:
          generic match object
                   |
        +----------+----------+
        |                     |
  real match object    match object slice
I believe the above is unnecessary. Slicing a match could produce another match. It's all internal data semantics. - Josiah
On 7 Dec 2006, at 02:01, Josiah Carlson wrote:
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 7 Dec 2006, at 01:01, Josiah Carlson wrote:
If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.
...if you return match objects from slicing, you have problems like m[::-1].groups(). *I* don't know what that should return.
I would argue that any 'step' != 1 has no semantically correct result for slicing on a match object, so we shouldn't support it.
OK, but even then, if you're returning a match object, how about the following:
>>> m = re.match('(A)(B)(C)(D)(E)', 'ABCDE')
>>> print m[0]
ABCDE
>>> n = m[2:5]
>>> print list(n)
['B', 'C', 'D']
>>> print n[0]
B
>>> print n.group(0)
B
The problem I have with it is that it's violating the invariant that match objects should return the whole match in group(0). It's these kinds of things that make me think that slices shouldn't have all of the methods of a match object. I think that's probably why various others have suggested not supporting slicing, but I don't think it's necessary to avoid it as long as it has clearly specified behaviour.
If you're worried about types, you could do something like this:
         generic match object
                  |
     +------------+------------+
     |                         |
real match object     match object slice
I believe the above is unnecessary. Slicing a match could produce another match. It's all internal data semantics.
Sure. My point, though, was that you could view (from an external perspective) all results as instances of "generic match object", which might not have as many methods. Interestingly, at present, the match object type itself is an implementation detail; e.g. for SRE, it's an _sre.SRE_Match object. It's only the API that's documented, not the type. Kind regards, Alastair. -- http://alastairs-place.net
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 7 Dec 2006, at 02:01, Josiah Carlson wrote:
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 7 Dec 2006, at 01:01, Josiah Carlson wrote:
If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.
...if you return match objects from slicing, you have problems like m[::-1].groups(). *I* don't know what that should return.
I would argue that any 'step' != 1 has no semantically correct result for slicing on a match object, so we shouldn't support it.
OK, but even then, if you're returning a match object, how about the following:
>>> m = re.match('(A)(B)(C)(D)(E)', 'ABCDE')
>>> print m[0]
ABCDE
>>> n = m[2:5]
>>> print list(n)
['B', 'C', 'D']
>>> print n[0]
B
>>> print n.group(0)
B
The problem I have with it is that it's violating the invariant that match objects should return the whole match in group(0).
If we were going to go with slicing, then it would be fairly trivial to include the whole match range. Some portion of the underlying structure knows where the start of group 2 is, and knows where the end of group 5 is, so we can slice or otherwise use that for subsequent sliced groups.
Interestingly, at present, the match object type itself is an implementation detail; e.g. for SRE, it's an _sre.SRE_Match object. It's only the API that's documented, not the type.
I believe that is the case with all built in cPython structures. - Josiah
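The positions Josiah mentions are indeed available today through start() and end(), so the "whole match" of a hypothetical slice could be computed from them. This is a sketch of the idea only, not a proposed implementation:

```python
import re

m = re.match('(A)(B)(C)(D)(E)', 'ABCDE')

# A hypothetical m[2:5] would cover groups 2..4; its "whole match"
# could span from the start of group 2 to the end of group 4:
whole = m.string[m.start(2):m.end(4)]
parts = m.group(2, 3, 4)
```

Here `whole` comes out as 'BCD' and `parts` as ('B', 'C', 'D'), matching the semantics Josiah describes.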
On 7 Dec 2006, at 21:47, Josiah Carlson wrote:
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 7 Dec 2006, at 02:01, Josiah Carlson wrote:
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 7 Dec 2006, at 01:01, Josiah Carlson wrote:
If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.
...if you return match objects from slicing, you have problems like m[::-1].groups(). *I* don't know what that should return.
I would argue that any 'step' != 1 has no semantically correct result for slicing on a match object, so we shouldn't support it.
OK, but even then, if you're returning a match object, how about the following:
>>> m = re.match('(A)(B)(C)(D)(E)', 'ABCDE')
>>> print m[0]
ABCDE
>>> n = m[2:5]
>>> print list(n)
['B', 'C', 'D']
>>> print n[0]
B
>>> print n.group(0)
B
The problem I have with it is that it's violating the invariant that match objects should return the whole match in group(0).
If we were going to go with slicing, then it would be fairly trivial to include the whole match range. Some portion of the underlying structure knows where the start of group 2 is, and knows where the end of group 5 is, so we can slice or otherwise use that for subsequent sliced groups.
But then you're proposing that this thing (which looks like a tuple, when you're indexing it) should slice in a funny way. i.e.

>>> m = re.match('(A)(B)(C)(D)(E)', 'ABCDE')
>>> print m[0]
ABCDE
>>> print list(m)
['ABCDE', 'A', 'B', 'C', 'D', 'E']
>>> n = m[2:5]
>>> print list(n)
['BCD', 'B', 'C', 'D']
>>> print len(n)
4
>>> p = list(m)[2:5]
>>> print p
['B', 'C', 'D']
>>> print len(p)
3

Or are you saying that m[2:5][0] != m[2:5].group(0) but m[0] == m.group(0)? Either way I think that's *really* counter-intuitive. Honestly, I don't think that slicing should be supported if it's going to have to result in match objects, because I can't see a way to make them make sense. I think that's Fredrik's objection also, but unlike me he doesn't feel that the slice operation should return something different (e.g. a tuple). Kind regards, Alastair. -- http://alastairs-place.net
Alastair Houghton <alastair@alastairs-place.net> wrote:
On 7 Dec 2006, at 21:47, Josiah Carlson wrote:
If we were going to go with slicing, then it would be fairly trivial to include the whole match range. Some portion of the underlying structure knows where the start of group 2 is, and knows where the end of group 5 is, so we can slice or otherwise use that for subsequent sliced groups.
But then you're proposing that this thing (which looks like a tuple, when you're indexing it) should slice in a funny way. i.e.
Let me be clear: I'm not proposing anything. I have little to no interest in seeing slices available to match objects, and as I said in a message 20 minutes prior to the message you are replying to, "Make the slice return a list, don't allow slicing, or make it a full on group variant. I don't really care at this point." My statement in the email you replied to above was to say that if we wanted it to return a group, then we could include subsequent .group(0) with the same semantics as the original match object. At this point it doesn't matter, Fredrik will produce what he wants to produce, and I'm sure most of us will be happy with the outcome. Those that are unhappy will need to write their own patch or deal with being unhappy. - Josiah
On 8 Dec 2006, at 16:38, Josiah Carlson wrote:
My statement in the email you replied to above was to say that if we wanted it to return a group, then we could include subsequent .group(0) with the same semantics as the original match object.
And my reply was simply to point out that that's not workable.
At this point it doesn't matter, Fredrik will produce what he wants to produce, and I'm sure most of us will be happy with the outcome. Those that are unhappy will need to write their own patch or deal with being unhappy.
I believe I've already conceded that twice. Kind regards, Alastair. -- http://alastairs-place.net
On 12/6/06, Josiah Carlson <jcarlson@uci.edu> wrote:
*We* may not be confused, but it's not about us (I'm personally happy to use the .group() interface); it's about relative newbies who, generally speaking, desire/need consistency (see [1] for a paper showing that certain kinds of inconsistencies are bad - at least in terms of grading - for new computer science students). Being inconsistent because it's *easy* is what I consider silly. We've got the brains, we've got the time; if we want slicing, let's produce a match object. If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.
The idea that slicing a match object should produce a match object sounds like a foolish consistency to me. It's a useful invariant of lists that slicing them returns lists. It's not a useful invariant of sequences in general. This is similar to how it's a useful invariant that indexing a string returns a string; indexing a list generally does not return a list. I only found a couple __getslice__ definitions in a quick perusal of stdlib. ElementTree.py's _ElementInterface class returns a slice from a contained list; whereas sre_parse.py's SubPattern returns another SubPattern. UserList and UserString also define __getslice__ but I don't consider them representative of the standards of non-string/list classes. As an aside, if you're trying to show that inconsistencies in a language are bad by referencing a paper showing that people who used consistent (if incorrect) mental models scored better than those who did not, you may have to explain further; I don't see the connection. -- Michael Urman http://www.tortall.net/mu/blog
"Michael Urman" <murman@gmail.com> wrote:
On 12/6/06, Josiah Carlson <jcarlson@uci.edu> wrote:
*We* may not be confused, but it's not about us (I'm personally happy to use the .group() interface); it's about relative newbies who, generally speaking, desire/need consistency (see [1] for a paper showing that certain kinds of inconsistencies are bad - at least in terms of grading - for new computer science students). Being inconsistent because it's *easy* is what I consider silly. We've got the brains, we've got the time; if we want slicing, let's produce a match object. If we don't want slicing, or if producing a slice would produce a semantically questionable state, then let's not do it.
The idea that slicing a match object should produce a match object sounds like a foolish consistency to me. It's a useful invariant of lists that slicing them returns lists. It's not a useful invariant of sequences in general. This is similar to how it's a useful invariant that indexing a string returns a string; indexing a list generally does not return a list.
The string and unicode case for S[i] is special. Such has already been discussed ad-nauseum. As for seq[i:j] returning an object of the same type, if it was "foolish consistency", then why is it consistent across literally the entire standard library (except for buffer), and (in my experience) many 3rd party libraries?
I only found a couple __getslice__ definitions in a quick perusal of stdlib. ElementTree.py's _ElementInterface class returns a slice from a contained list; whereas sre_parse.py's SubPattern returns another SubPattern. UserList and UserString also define __getslice__ but I don't consider them representative of the standards of non-string/list classes.
As an aside, if you're trying to show that inconsistencies in a language are bad by referencing a paper showing that people who used consistent (if incorrect) mental models scored better than those who did not, you may have to explain further; I don't see the connection.
The idea is that those who were consistent in their behavior, regardless of whether they were incorrect, can be trained to do things the correct way. That is to say, people who understand that X = Y will behave consistently regardless of context tend to do better than those who believe that it will do different things. Introducing inconsistencies because it is *easy* for the writer of an API makes it more difficult to learn said API. In this context, the assumption that one makes when slicing in Python (as stated by someone else whom I can't remember in this thread): X[0:0] == type(X)(). That works _everywhere_ in Python where slices are allowed (except for buffers, which are generally rarely used except by certain crazies (like myself)). By not making it true here, we would be adding an exception to the rule. Special cases aren't special enough to break the rules. I'm not going to go all gloom and doom on you; maybe no one will ever have a situation where it is necessary. But implementing "slice of match returns a slice" isn't impossible, whether it is done via subclass, or by direct manipulation of the match struct. And not implementing the functionality because we are *lazy* isn't a terribly good excuse to give someone if/when they run into this. - Josiah
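The invariant appealed to above is easy to spot-check against the built-in sequence types (buffer excepted, as noted):

```python
# type(X)() constructs the empty sequence of X's type, and slicing
# a built-in sequence preserves its type:
for X in ([1, 2, 3], (1, 2, 3), 'abc'):
    assert X[0:0] == type(X)()
    assert type(X[1:3]) is type(X)
```

The question in this thread is whether a match object should be made to honor the same invariant, given that it is not obviously a sequence at all.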
On 12/6/06, Josiah Carlson <jcarlson@uci.edu> wrote:
Special cases aren't special enough to break the rules.
Sure, but where is this rule that would be broken? I've seen it invoked, but I've never felt it myself. I seriously thought of slicing as returning a list of elements per range(start, stop, skip), with str's (and later unicode's) type preservation being the special case. This is done because a list of characters is a pain to work with in most contexts, so there's an implicit ''.join on the list. And because, assuming that the joined string is the desired result, it's much faster to have just built it in the first place. A pure practicality-beats-purity argument. We both arrive at the same place in that we have a model describing the behavior for list/str/unicode, but they're different models when extended outside. Now that I see the connection you're drawing between your argument and the paper, I don't believe it's directly inspired by the paper. I read the paper to say those who could create and work with a set of rules could learn to work with the correct rules. Consistency in Python makes things easier on everyone because there's less to remember, not because it makes us better learners of the skills necessary for programming well. The arguments I saw in the paper only addressed the second point. -- Michael Urman http://www.tortall.net/mu/blog
"Michael Urman" <murman@gmail.com> wrote:
On 12/6/06, Josiah Carlson <jcarlson@uci.edu> wrote:
Special cases aren't special enough to break the rules.
Sure, but where is this rule that would be broken? I've seen it invoked, but I've never felt it myself. I seriously thought of slicing as returning a list of elements per range(start, stop, skip), with str's (and later unicode's) type preservation being the special case.
Tuple slicing doesn't return lists. Array slicing doesn't return lists. None of Numarray, Numeric, or Numpy array slicing returns lists. Only list slicing returns lists in current stdlib and major array package Python. Someone please correct me if I am wrong.
This is done because a list of characters is a pain to work with in most contexts, so there's an implicit ''.join on the list. And because assuming that the joined string is the desired result, it's much faster to have just built it in the first place. A pure practicality beats purity argument.
Python returns strings from string slices not because of a "practicality beats purity" argument, but because not returning a string from string slicing is *insane*. The operations on strings tend to not be of the kind that is done with lists (insertion, sorting, etc.), they are typically linguistic, parsing, or data related (chop, append, prepend, scan for X, etc.). Also, typically, each item in a list is a singular item. Whereas in a string, typically _blocks of characters_ represent a single item: spaces, words, short, long, float, double, etc. (the latter being related to packed representations of data structures, as is the case in some socket protocols). Semantically, lists differ from strings in various *substantial* ways, which is why you will never see a typical user of Python asking for list.partition(lst). Strings have the sequence interface out of *convenience*, not because strings are like a list. Don't try to combine the two ideas. Also, when I said that strings/unicode were special and "has already been discussed ad-nauseum", I wasn't kidding. Take string and unicode out of the discussion and search google for the thousands of other threads that talk about why string and unicode are the way they are. They don't belong in this conversation.
We both arrive at the same place in that we have a model describing the behavior for list/str/unicode, but they're different models when extended outside.
But this isn't about str/unicode (or buffer). For all other types available in Python with a slice operation, slicing them returns the same type as the original sequence. Expand that to 3rd party modules, and the case still holds for every package (at least those that I have seen and used). If you can point out an example (not str/unicode/buffer) for which this rule is broken in the stdlib or even any *major* 3rd party library, I'll buy you a cookie if we ever meet.
Now that I see the connection you're drawing between your argument and the paper, I don't believe it's directly inspired by the paper. I read the paper to say those who could create and work with a set of rules, could learn to work with the correct rules. Consistency in Python makes things easier on everyone because there's less to remember, not because it makes us better learners of the skills necessary for programming well. The arguments I saw in the paper only addressed the second point.
Right, but if you have a set of rules:

1. RULE 1
2. RULE 2
3. RULE 3

The above will be easier to understand than:

1. RULE 1
2. a. RULE 2 if X
   b. RULE 2a otherwise
3. RULE 3

This is the case even ignoring the paper. Special cases are more difficult to learn than no special cases. Want a real-world example? English. The English language is so fraught with special cases that it is the only language for which dyslexia is known to exist (or was known to exist for many years; I haven't kept up on it). In the context of the paper, their findings suggested that those who could work with a *consistent* set of rules could be taught the right consistent rules. Toss in an inconsistency? Who knows if in this case it will make *any* difference in Python; regular expressions are already confusing for many people. This is a special case. There's a zen. Does the practicality-beats-purity zen apply? I don't know. At this point I've just about stopped caring. Make the slice return a list, don't allow slicing, or make it a full-on group variant. I don't really care at this point. Someone write a patch and let's go with it. Adding slicing producing a list should be easy after the main patch is done, and we can emulate it in Python if necessary. - Josiah
Michael Urman wrote:
The idea that slicing a match object should produce a match object sounds like a foolish consistency to me.
well, the idea that adding m[x] as a convenience alias for m.group(x) automatically turns m into a list-style sequence that also has to support full slicing sounds like an utterly foolish consistency to me. the OP's original idea was to make a common use case slightly easier to use. if anyone wants to argue for other additions to the match object API, they should at least come up with use cases based on real existing code. (and while you guys are waiting, I suggest you start a new thread where you discuss some other inconsistency that would be easy to solve with more code in the interpreter, like why "-", "/", and "**" don't work for strings, lists don't have a "copy" method, sets and lists have different APIs for adding things, we have hex() and oct() but no bin(), str.translate and unicode.translate take different arguments, etc. get to work!) </F>
On 7 Dec 2006, at 07:15, Fredrik Lundh wrote:
Michael Urman wrote:
The idea that slicing a match object should produce a match object sounds like a foolish consistency to me.
well, the idea that adding m[x] as a convenience alias for m.group(x) automatically turns m into a list-style sequence that also has to support full slicing sounds like an utterly foolish consistency to me.
How about we remove the word "foolish" from the debate?
the OP's original idea was to make a common use case slightly easier to use. if anyone wants to argue for other additions to the match object API, they should at least come up with use cases based on real existing code.
An example where it might be useful:

m = re.match('(?:([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) (?P<rect>rect)'
             '|([0-9]+) ([0-9]+) ([0-9]+) (?P<circle>circle))',
             lineFromFile)

if m['rect']:
    drawRectangle(m[1:5])
elif m['circle']:
    drawCircle(m[1:3], m[3])

Is that really so outlandish? I'm not saying that this is necessarily the best way, but why force people to write list(m)[1:5] or [m[i] for i in xrange(1,5)]? If the only reason is that some of the match object APIs, which I maintain are very unlikely to be wanted on a slice anyway, can't possibly produce consistent results, then why not just do away with the APIs and return a tuple or something instead? That way you can treat the match object as if it were just a tuple (which it could easily have been).
(and while you guys are waiting, I suggest you start a new thread where you discuss some other inconsistency that would be easy to solve with more code in the interpreter, like why "-", "/", and "**" don't work for strings, lists don't have a "copy" method, sets and lists have different APIs for adding things, we have hex() and oct() but no bin(), str.translate and unicode.translate take different arguments, etc. get to work!)
Oh come on! Comparing this with exponentiating strings is just not helpful. Kind regards, Alastair. -- http://alastairs-place.net
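For comparison, the same dispatch spelled with the current .group() API. This is a self-contained sketch: a literal line stands in for lineFromFile, and a coords variable stands in for the hypothetical draw calls:

```python
import re

line = '10 20 30 40 rect'
m = re.match('(?:([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) (?P<rect>rect)'
             '|([0-9]+) ([0-9]+) ([0-9]+) (?P<circle>circle))', line)

if m.group('rect'):                  # today's spelling of m['rect']
    coords = m.group(1, 2, 3, 4)     # today's spelling of m[1:5]
elif m.group('circle'):
    coords = m.group(6, 7)
```

For this input, coords ends up as the tuple ('10', '20', '30', '40'); the proposal is essentially about whether the subscript spellings are worth having alongside this.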
Alastair Houghton schrieb:
How about we remove the word "foolish" from the debate?
We should table the debate. If you really want that feature, write a PEP. You want it, some people are opposed; a PEP is the procedure to settle the difference. Regards, Martin
On 7 Dec 2006, at 18:54, Martin v. Löwis wrote:
Alastair Houghton schrieb:
How about we remove the word "foolish" from the debate?
We should table the debate. If you really want that feature, write a PEP. You want it, some people are opposed; a PEP is the procedure to settle the difference.
As I said a couple of e-mails back, I don't really have the time (I have lots of other things to do, most of them more important [to me, anyway]). If someone else agrees and wants to do it, great. If not, as I said before, I'm happy to let whoever do whatever. I might not agree, but that's my problem. Kind regards, Alastair. -- http://alastairs-place.net
On Thu, Dec 07, 2006, Alastair Houghton wrote:
An example where it might be useful:
m = re.match('(?:([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) (?P<rect>rect)'
             '|([0-9]+) ([0-9]+) ([0-9]+) (?P<circle>circle))',
             lineFromFile)

if m['rect']:
    drawRectangle(m[1:5])
elif m['circle']:
    drawCircle(m[1:3], m[3])
Is that really so outlandish?
Likely; normally I would expect that drawRectangle would break on string arguments instead of ints. I think that the amount of usefulness compared to problems doesn't really make this worth it. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Member of the Groucho Marx Fan Club
On 12/7/06, Fredrik Lundh <fredrik@pythonware.com> wrote:
(and while you guys are waiting, I suggest you start a new thread where you discuss some other inconsistency that would be easy to solve with more code in the interpreter, like why "-", "/", and "**" don't work for strings, lists don't have a "copy" method, sets and lists have different APIs for adding things, we have hex() and oct() but no bin(), str.translate and unicode.translate take different arguments, etc. get to work!)
Personally I'd love a way to get an unbound method that handles either str or unicode instances. Perhaps py3k's unicode realignment will effectively give me that. (And agreed on there being no reason that supporting indexing requires supporting slicing. But also agreed that match slicing could be as useful as indexing. Really I don't use regexps enough in Python to have a position; I was more interested in figuring out where the type(m) == type(m[:]) idea had come from, as I had never formed it.) -- Michael Urman http://www.tortall.net/mu/blog
"Michael Urman" <murman@gmail.com> wrote:
On 12/7/06, Fredrik Lundh <fredrik@pythonware.com> wrote:
(and while you guys are waiting, I suggest you start a new thread where you discuss some other inconsistency that would be easy to solve with more code in the interpreter, like why "-", "/", and "**" don't work for strings, lists don't have a "copy" method, sets and lists have different APIs for adding things, we have hex() and oct() but no bin(), str.translate and unicode.translate take different arguments, etc. get to work!)
Personally I'd love a way to get an unbound method that handles either str or unicode instances. Perhaps py3k's unicode realignment will effectively give me that.
Immutable byte strings won't exist in Py3k, and the mutable byte strings (bytes) won't support very many, if any current string/unicode methods. No bytes.replace, bytes.split, bytes.partition, etc. So no, Py3k's unicode change won't get you that. All it will get you is that every string you interact with; literals, file.read, etc., will all be text (equivalent to Python 2.x unicode). - Josiah
Fredrik Lundh wrote:
Michael Urman wrote:
The idea that slicing a match object should produce a match object sounds like a foolish consistency to me.
well, the idea that adding m[x] as a convenience alias for m.group(x) automatically turns m into a list-style sequence that also has to support full slicing sounds like an utterly foolish consistency to me.
Maybe instead of considering a match object to be a sequence, a match object should be considered a map? After all, we do have named, as well as numbered, groups...? -- Talin
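The map-like view Talin suggests already half-exists via named groups; for example (pattern made up for illustration):

```python
import re

m = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})', '2006-12')

# group() already accepts names, and groupdict() exposes the
# full name-to-value mapping:
d = m.groupdict()
```

Here d is {'year': '2006', 'month': '12'} and m.group('year') is '2006', so m['year'] would be a natural map-style spelling.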
Talin wrote:
Maybe instead of considering a match object to be a sequence, a match object should be considered a map?
sure, except for one small thing. from earlier in this thread:
Ka-Ping Yee wrote:
I'd say, don't pretend m is a sequence. Pretend it's a mapping. Then the conceptual issues go away.
to which I replied:
almost; that would mean returning KeyError instead of IndexError for groups that don't exist, which means that the common pattern
a, b, c = m.groups()
cannot be rewritten as
_, a, b, c = m
which would, perhaps, be a bit unfortunate.
</F>
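Fredrik's unpacking point hinges on the old sequence-iteration protocol, where IndexError (and only IndexError) cleanly terminates iteration; a minimal sketch with made-up stand-in classes:

```python
class SeqStyle(object):
    """Raises IndexError past the end, like a sequence."""
    def __init__(self, items):
        self.items = items
    def __getitem__(self, i):
        return self.items[i]          # list raises IndexError for us

class MapStyle(SeqStyle):
    """Raises KeyError past the end, like a mapping."""
    def __getitem__(self, i):
        if i >= len(self.items):
            raise KeyError(i)
        return self.items[i]

# IndexError ends iteration, so exact unpacking works:
_, a, b, c = SeqStyle(['abc', 'a', 'b', 'c'])

broke = False
try:
    # KeyError leaks out instead of signalling the end:
    _, x, y, z = MapStyle(['abc', 'a', 'b', 'c'])
except KeyError:
    broke = True
```

This is why a map-flavored match object would break the `_, a, b, c = m` idiom.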
Fredrik Lundh wrote:
Talin wrote:
Maybe instead of considering a match object to be a sequence, a match object should be considered a map?
sure, except for one small thing. from earlier in this thread:
Ka-Ping Yee wrote:
I'd say, don't pretend m is a sequence. Pretend it's a mapping. Then the conceptual issues go away.
to which I replied:
almost; that would mean returning KeyError instead of IndexError for groups that don't exist, which means that the common pattern
a, b, c = m.groups()
cannot be rewritten as
_, a, b, c = m
which would, perhaps, be a bit unfortunate.
I think the confusion lies in the difference between 'group' (which takes either an integer or string argument, and behaves like a map) and 'groups' (which returns a tuple of the numbered arguments, and behaves like a sequence). The original proposal was to make m[n] a synonym for m.group(n). "group()" is clearly map-like in its behavior. It seems to me that there are exactly three choices:

-- Match objects behave like 'group'
-- Match objects behave like 'groups'
-- Match objects behave like 'group' some of the time, and like 'groups' some of the time, depending on how you refer to it

In case 1, a match object is clearly a map; in case 2, it's clearly a sequence; in case 3, it's neither, and all talk of consistency with either map or sequence is irrelevant. -- Talin
Talin wrote:
The original proposal was to make m[n] a synonym for m.group(n). "group()" is clearly map-like in its behavior.
so have you checked what exception m.group(n) raises when you try to access a group that doesn't exist ? frankly, speaking as the original SRE author, I will now flip the bikeshed bit on all contributors to this thread, and consider it closed. I'll post a patch shortly. </F>
Fredrik Lundh wrote:
Talin wrote:
The original proposal was to make m[n] a synonym for m.group(n). "group()" is clearly map-like in its behavior.
so have you checked what exception m.group(n) raises when you try to access a group that doesn't exist ?
The KeyError vs IndexError distinction is unreliable enough that I'll typically just catch LookupError if I don't know exactly what type I'm dealing with. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
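Nick's trick can be sketched as a small helper (the function name is hypothetical):

```python
def lookup(container, key, default=None):
    # LookupError is the shared base class of KeyError and IndexError,
    # so this works whether `container` indexes like a map or a sequence:
    try:
        return container[key]
    except LookupError:
        return default
```

lookup({'a': 1}, 'b') and lookup([1, 2], 5) both fall back to the default, which is exactly the "don't care which exception" behavior described above.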
Maybe instead of considering a match object to be a sequence, a match object should be considered a map? After all, we do have named, as well as numbered, groups...?
To me, that makes a lot more sense. Bill
Martin v. Löwis wrote:
Steve Holden schrieb:
Precisely. But your example had only one group "(b)" in it, which is retrieved using m.group(1). So the subgroups are numbered starting from 1 and subgroup 0 is a special case which returns the whole match.
I know what the Zen says about special cases, but in this case the rules were apparently broken with impunity.
Well, the proposal was to interpret m[i] as m.group(i), for all values of i. I can't see anything confusing with that.
I don't suppose that would be any more confusing than the present case. regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Skype: holdenweb http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden
Steve Holden wrote:
So the subgroups are numbered starting from 1 and subgroup 0 is a special case which returns the whole match.
But the subgroups can be nested too, so it's not really as special as all that. -- Greg
On Dec 3, 2006, at 9:22 AM, Martin v. Löwis wrote:
Ben Wing schrieb:
this one is fairly simple. if `m' is a match object, i'd like to be able to write m[1] instead of m.group(1). (similarly, m[:] should return the same as list(m.groups()).) this would remove some of the verbosity of regexp code, with probably a net gain in readability; certainly no loss.
Please post a patch to sf.net/projects/python (or its successor).
Several issues need to be taken into account: - documentation and test cases must be updated to integrate the new API - for slicing, you need to consider not only omitted indices, but also "true" slices (e.g. m[1:5]) - how should you deal with negative indices? - should len(m) be supported?
what about m['named_group_1'] etc? -Barry
Barry Warsaw schrieb:
Several issues need to be taken into account: - documentation and test cases must be updated to integrate the new API - for slicing, you need to consider not only omitted indices, but also "true" slices (e.g. m[1:5]) - how should you deal with negative indices? - should len(m) be supported?
what about m['named_group_1'] etc?
That should also be taken into consideration; I suggest to support it. Regards, Martin
On Sun, Dec 03, 2006, "Martin v. Löwis" wrote:
Ben Wing schrieb:
this one is fairly simple. if `m' is a match object, i'd like to be able to write m[1] instead of m.group(1). (similarly, m[:] should return the same as list(m.groups()).) this would remove some of the verbosity of regexp code, with probably a net gain in readability; certainly no loss.
Please post a patch to sf.net/projects/python (or its successor).
Given the list of issues and subsequent discussion so far, I think a PEP will be required. This needs more documentation than the typical patch. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Member of the Groucho Marx Fan Club
Aahz schrieb:
this one is fairly simple. if `m' is a match object, i'd like to be able to write m[1] instead of m.group(1). (similarly, m[:] should return the same as list(m.groups()).) this would remove some of the verbosity of regexp code, with probably a net gain in readability; certainly no loss. Please post a patch to sf.net/projects/python (or its successor).
Given the list of issues and subsequent discussion so far, I think a PEP will be required. This needs more documentation than the typical patch.
I disagree. So far, nobody has spoken against the proposed feature. It's really a small addition of a new method to an existing type. Entire classes have been added to the standard library without a PEP. People can still criticize the patch when it's posted (and it's not clear that the OP is even willing to produce a patch). Regards, Martin
Martin v. Löwis wrote:
Aahz schrieb:
this one is fairly simple. if `m' is a match object, i'd like to be able to write m[1] instead of m.group(1). (similarly, m[:] should return the same as list(m.groups()).) this would remove some of the verbosity of regexp code, with probably a net gain in readability; certainly no loss.
Please post a patch to sf.net/projects/python (or its successor).
Given the list of issues and subsequent discussion so far, I think a PEP will be required. This needs more documentation than the typical patch.
I disagree. So far, nobody has spoken against the proposed feature. It's really a small addition of a new method to an existing type. Entire classes have been added to the standard library without a PEP. People can still criticize the patch when it's posted (and it's not clear that the OP is even willing to produce a patch).
i've never worked up a python patch before, but i imagine this wouldn't be too hard. it seems that m[1] should be m.group(1), and everything else should follow.

i forgot about m[0] when making my slice proposal; i suppose then that m[:] should just do what we expect, and m[1:] == list(m.groups()). len(m) == 1 + number of groups, m['name'] == m.group('name').

the only strangeness here is the numbering of groups starting at 1, and making 0 be a special case. this isn't any more (or less) of a problem for the indexing form than it is for m.group(), and it's well known from various other languages. we could always consider making groups start at 0 for python 3000, but this seems to me like a gratuitous incompatibility with the rest of the world.

ben
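[Editor's note: the semantics Ben spells out above can be sketched as a thin wrapper class. `IndexableMatch` is a hypothetical name for illustration only; it is not part of the `re` module.]

```python
import re

class IndexableMatch:
    """Sketch of the proposed match-object indexing semantics."""

    def __init__(self, match):
        self._m = match

    def __len__(self):
        # group 0 (the whole match) plus the numbered subgroups
        return 1 + len(self._m.groups())

    def __getitem__(self, key):
        if isinstance(key, slice):
            # m[:] covers group 0 through the last group; m[1:] == groups()
            groups = [self._m.group(i) for i in range(len(self))]
            return groups[key]
        return self._m.group(key)  # integer index or group name

m = IndexableMatch(re.match(r"(?P<first>\w+) (?P<second>\w+)", "hello world"))
print(m[0])         # the whole match: 'hello world'
print(m[1])         # first group: 'hello'
print(m['second'])  # named group: 'world'
print(m[1:])        # ['hello', 'world']
print(len(m))       # 3
```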
Ben Wing wrote:
the only strangeness here is the numbering of groups starting at 1, and making 0 be a special case. this isn't any more (or less) of a problem for the indexing form than it is for m.group(), and it's well known from various other languages. we could always consider making groups start at 0 for python 3000, but this seems to me like a gratuitous incompatibility with the rest of the world.
As Greg pointed out, this is just a special case of the fact that subgroups can be nested, with the ordering governed by the location of the left parenthesis:

>>> import re
>>> m = re.match("a(b(c))", "abc")
>>> m.group(0)
'abc'
>>> m.group(1)
'bc'
>>> m.group(2)
'c'

That said, I like the definitions in your last message:

len(m) == 1 + len(m.groups())
m[:] == [m.group(0)] + list(m.groups())
all(m[i] == m.group(i) for i in range(len(m)))
all(m[k] == m.group(k) for k in m.groupdict().keys())

The internally inconsistent* m.group() and m.groups() methods could even be slated for removal in Py3k (replaced by the subscript operations).

Cheers,
Nick.

*The inconsistency being that group() considers the whole match to be group 0, while groups() does not.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
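[Editor's note: Nick's equivalences can be checked today against the existing group()/groups() API. The `idx()` helper below is an illustrative stand-in for the proposed m[key] subscript, nothing more.]

```python
import re

def idx(m, key):
    # stand-in for the proposed m[key]; simply delegates to group()
    return m.group(key)

m = re.match(r"(?P<first>\w+) (?P<second>\w+)", "hello world")
n = 1 + len(m.groups())  # the proposed len(m)

# the equivalences from Nick's message, as executable assertions
assert n == 3
assert [idx(m, i) for i in range(n)] == [m.group(0)] + list(m.groups())
assert all(idx(m, k) == m.group(k) for k in m.groupdict())
print("equivalences hold")
```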
Nick Coghlan wrote:
*The inconsistency being that group() considers the whole match to be group 0, while groups() does not.
The real inconsistency seems to be that the groups are being treated as an array when they're really a tree. Maybe a different API altogether would be better, e.g.

m.range --> the whole match
m.subgroups[i] --> another match object with its own range and subgroups attributes

-- Greg
On Sun, Dec 03, 2006 at 07:38:21PM +0100, "Martin v. Löwis" wrote:
Aahz schrieb:
this one is fairly simple. if `m' is a match object, i'd like to be able to write m[1] instead of m.group(1). (similarly, m[:] should return the same as list(m.groups()).) this would remove some of the verbosity of regexp code, with probably a net gain in readability; certainly no loss. Please post a patch to sf.net/projects/python (or its successor).
Given the list of issues and subsequent discussion so far, I think a PEP will be required. This needs more documentation than the typical patch.
I disagree. So far, nobody has spoken against the proposed feature. It's really a small addition of a new method to an existing type. Entire classes have been added to the standard library without a PEP. People can still criticize the patch when it's posted (and it's not clear that the OP is even willing to produce a patch).
Agreed. Just implement it, including test cases that exercise and demo the corner cases. Making match objects have sequence and dict behaviour for groups is imnsho just common sense. -greg
participants (17)
- "Martin v. Löwis"
- Aahz
- Alastair Houghton
- Barry Warsaw
- Ben Wing
- Bill Janssen
- Fredrik Lundh
- Georg Brandl
- Greg Ewing
- Gregory P. Smith
- Josiah Carlson
- Ka-Ping Yee
- Michael Urman
- Mike Klaas
- Nick Coghlan
- Steve Holden
- Talin