[Python-bugs-list] [ python-Bugs-672491 ] 2.3a1 computes lastindex incorrectly

SourceForge.net noreply@sourceforge.net
Sat, 19 Apr 2003 17:59:48 -0700


Bugs item #672491, was opened at 2003-01-22 15:28
Message generated for change (Comment added) made by niemeyer
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=672491&group_id=5470

Category: Regular Expressions
Group: Python 2.3
>Status: Closed
>Resolution: Fixed
Priority: 6
Submitted By: Martin v. Löwis (loewis)
>Assigned to: Gustavo Niemeyer (niemeyer)
Summary: 2.3a1 computes lastindex incorrectly

Initial Comment:
In Python 2.[012], the code

import re
exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)")
print exp.match("namespace").lastgroup

prints "NCName". In Python 2.3a1, it prints "None". The
problem is that last index is 2, instead of 1, as it
should be.

----------------------------------------------------------------------

>Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-20 00:59

Message:
Logged In: YES 
user_id=7887

I backed out the changes made in 2.84 which changed the
behavior, and maintained the changes which fixed the bug.
Applied as:

Modules/_sre.c: 2.90

I'm also including some examples in the lastindex
documentation, explaining how it works.

Sorry about the trouble this may have caused.


----------------------------------------------------------------------

Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 23:33

Message:
Logged In: YES 
user_id=7887

Concluding:

- I agree that my implementation is uncompliant with the
previous behavior, and I think it should be modified to
behave like before.

- There was a bug which was fixed by my implementation, and
should be fixed when porting to the old behavior.

- The documentation is very misleading, and seems closer to
my implementation than the old behavior. This should be fixed.


----------------------------------------------------------------------

Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 23:26

Message:
Logged In: YES 
user_id=7887

> As you can see, it reports the correct value for lastindex
given
> the 2.22 definition.

The problem here is defining what's the 2.2 definition.

I've checked with other examples, and it looks like your
definition is correct in all cases where the shown bug is
not acting.

If that's the case, the documentation is *very* misleading,
and should be fixed.


----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2003-04-19 22:24

Message:
Logged In: YES 
user_id=86307

I'll just add my two cents here.  Gustavo, I think given your 
defintion of lastindex, there's no reason for the state to track a 
seperate lastindex field; the correct value could easily be 
determined by examining the mark array after matching is 
completed (in the process of initializing the Match object).  On 
the other hand, I don't think there is an obvious way of 
determining lastindex given the 2.22 definition without either 
examining the compiled pattern or tracking it as the pattern 
executes.

I also agree that it is an incompatible change.  Although the 
implementation was partly broken, I think the clear intent of the 
2.22 code was to report the matching group with the last closing 
parens.

FYI, I posted a patch here to revert back to the previous behavior:

  http://www.python.org/sf/712900

You two may want to look at it to see if it looks like it's on the right 
track.  Here is what a python with this patch reports on this 
example from Gustavo:

>>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex
1

As you can see, it reports the correct value for lastindex given 
the 2.22 definition.  


----------------------------------------------------------------------

Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 22:02

Message:
Logged In: YES 
user_id=7887

Why do you think lastindex is incorrect? Isn't 2 the lastindex?

>>> import re
>>> exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)")
>>> match = exp.match("namespace")
>>> match.group(0)
'namespace'
>>> match.group(1)
'namespace'
>>> match.group(2)
'e'

It works like this in all Python versions. Also, if you
agree that group 2 is unnamed, shouldn't  lastindex be 2? It
actually *is* the lastindex, as you see above.

It is incompatible with old versions because old versions
were broken, and didn't follow what was documented. It was
indeed a side effect of a bug. For example, is it logical
that if we remove the second group, lastindex is still 1 (in
Python 2.2)?

>>> exp = re.compile("(?P<NCName>[a-zA-Z_])")
>>> print exp.match("namespace").lastindex
1

> It is illogical because group 1 ends *after* group 2 ends,
> as the closing parenthesis of group 1 is after the closing
> parenthesis of group 2.

How this changes anything? As we agreed, groups are numbered
by the opening parenthesis.

Hummm.. perhaps you think that the old behavior was to show
what group had the last closing parenthesis? No.. that
wasn't the case either. Look at this example, in Python 2.2:

>>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex
4
>>> re.compile("(a(b)?)((c)d)?").match("abce").groups()
('ab', 'b', None, None)

In Python 2.3:

>>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex
2
>>> re.compile("(a(b)?)((c)d)?").match("abce").groups()
('ab', 'b', None, None)

> It is unhelpful because one of the primary purposes of
> lastgroup is to find efficiently the entire match if you
> have a list of top-level alternatives

I don't understand this. If you want the entire match, just
use group 0. Did I misunderstand it? Can you show me some
example?

Can you show me how I've broken the Scanner?

If PyXML is broken, it trusted in an undocumented and broken
feature. Should we maintain it broken to avoid breaking PyXML?


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-04-19 21:25

Message:
Logged In: YES 
user_id=21627

Gustavo, I agree that the numbering of groups is and should
be by opening parentheses, i.e. that group 1 is <NCName>,
and group 2 is unnamed.

I also agree that *if* lastindex is 2, lastgroup should be None.

I still think this value is incorrect, though. It is
illogical, unhelpful, and incompatible with earlier
versions. lastindex should be 1, and lastgroup should be
"NCName".

It is illogical because group 1 ends *after* group 2 ends,
as the closing parenthesis of group 1 is after the closing
parenthesis of group 2.

It is unhelpful because one of the primary purposes of
lastgroup is to find efficiently the entire match if you
have a list of top-level alternatives, such as you do in a
scanner. With Python <2.3, you can put named alternatives
into a regex, and use lastindex to find out the alternative.
Indeed, this is what sre.Scanner does; I believe you have
broken that class as well.

It is incompatible because earlier Python versions behaved
differently. As a result of this change, PyXML does not work
with Python 2.3.

----------------------------------------------------------------------

Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 21:03

Message:
Logged In: YES 
user_id=7887

Martin, the lastgroup/lastindex handling was quite broken in
Python < 2.3. We can easily see this if testing your example
in the following way:

>>> import re
>>> exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)")
>>> match = exp.match("namespace")
>>> match.groups()
('namespace', 'e')
>>> match.groupdict()
{'NCName': 'namespace'}

This has the same result in any python you execute. This
also demonstrates that the group numbering has always been
assigned by the opening parenthesis.

About the None result, that's also correct. In the example,
we notice that the second unnamed group is correctly
matching 'e'. Accordingly to the sre module documentation,
that's how lastgroups should behave:

lastgroup
    The name of the last matched capturing group, or None if
the group didn't have a name, or if no group was matched at all.

In the case above, the group didn't have a name. If we check
lastindex, we'll see it's correctly set to "2".

Greg, your example is correctly showing one of the bugs in
lastgroup/index handling in Python < 2.3. OTOH, the current
result of 2 is right, given the "count on open" rule, which
was always used. Here's an example in 2.3:

>>> re.match('((x))y(?:(a)b|ac)', 'xyac').groups()
('x', 'x', None)
>>> re.match('((x))y(?:(a)b|ac)', 'xyac').lastindex
2

(notice that groups always start in 1, as group 0 is the
"whole match" group)

Martin, if you agree, please close the bug. If you have any
doubts, it'll be a pleasure to explain.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-04-19 07:32

Message:
Logged In: YES 
user_id=21627

Assigning to Gustavo, since he wrote 2.84. Gustavo, can you
confirm that your change broke this feature? Can you also
agree that it should be fixed?



----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2003-02-05 01:56

Message:
Logged In: YES 
user_id=86307

I believe the discrepancy was deliberately introduced in revision 2.84 of _sre.c.  I agree with you that lastindex should return the the index of the matching group with the rightmost closing parenthesis (perhaps some elaboration in the docs is also in order).  If this is the correct interpretation, two places need to be patched: 1) the handling of SRE_OP_MARK needs to be reverted to the 2.22 code and 2) the code in the lastmark_restore function needs to be tweaked so that lastindex is not accidentally set to the last matched group entered.

Thinking further though, given a (contrived) pattern like this: 

  re.match('((x))y(?:(a)b|ac)', 'xyac')

what should lastindex be?  I assume 1, given the definition above (lastindex = matching group with rightmost close parens).  In 2.22 it is 3, since group 3 matched before the branch failed at the 'b'.  In 2.3a1 it is 2, since lastindex is restored (after the branch fails) using the saved lastmark.  Anyway, if it should be 1, then I think _sre.c will have to save lastindex as well as lastmark when processing the three opcodes which may end up calling lastmark_restore.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=672491&group_id=5470