[New-bugs-announce] [issue19536] MatchObject should offer __getitem__()

Brandon Rhodes report at bugs.python.org
Sat Nov 9 16:15:27 CET 2013


New submission from Brandon Rhodes:

Regular expression re.MatchObject objects are conceptually sequences.
They contain at least one “group” string, possibly more,
which are integer-indexed starting at zero.
Today, groups can be accessed in one of two ways.

(1) You can call the method match.group(N).

(2) You can call glist = match.groups()
    and then access each group as glist[N-1].
    Note the obvious off-by-one discrepancy:
    .groups() does not include “group zero”,
    which contains the entire match,
    and therefore its indexes are off by one
    from the values you would pass to .group().
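
A minimal sketch of the two access styles and the resulting
index shift. The sample input string is an assumption chosen
for illustration:

```python
import re

# Sample pattern and input, chosen for illustration only.
match = re.match(r'(\d\d\d\d)/(\d\d)/(\d\d)', '2013/11/09')

# (1) .group(N) counts from zero, where zero is the entire match.
whole = match.group(0)
year = match.group(1)

# (2) .groups() omits group zero, so index N-1 holds group N.
glist = match.groups()
assert glist[1 - 1] == match.group(1)
assert glist[3 - 1] == match.group(3)
```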

I propose that MatchObject gain a __getitem__(N) method
whose return value for every N is the same as .group(N),
since I think that match[N] is a quite obvious syntax for
asking for one particular group of an RE match.
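
The intended behaviour can be sketched with a small wrapper.
The class name IndexableMatch and the sample input are
hypothetical; this is not the proposed patch itself, only an
illustration of the delegation to .group():

```python
import re


class IndexableMatch:
    """Hypothetical wrapper illustrating the proposed behaviour."""

    def __init__(self, match):
        self._match = match

    def __getitem__(self, n):
        # Delegate to .group(), so wrapper[N] == match.group(N).
        return self._match.group(n)


raw = re.match(r'(\d\d\d\d)/(\d\d)/(\d\d)', '2013/11/09')
m = IndexableMatch(raw)

# Indexing returns exactly what .group() returns for the same N.
for n in range(raw.re.groups + 1):
    assert m[n] == raw.group(n)
```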

The only objection I can see to this proposal
is the obvious asymmetry between Group Zero and all
subsequent groups of a regular expression pattern:
zero means “the whole thing” whereas each of the others
holds the content of a particular explicit set of parens.
Looping over the elements match[0], match[1], ... of a
pattern like this:

    r'(\d\d\d\d)/(\d\d)/(\d\d)'

will give you *first* the *entire* match, and only then
turn its attention to the three parenthesized substrings.
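
As a rough illustration of that ordering, using only the
existing .group() method and an input string that is assumed
for the example:

```python
import re

# The input string is an assumption chosen to fit the pattern above.
match = re.match(r'(\d\d\d\d)/(\d\d)/(\d\d)', '2013/11/09')

# match.re.groups is the number of parenthesized groups (3 here);
# adding one accounts for group zero.
items = [match.group(n) for n in range(match.re.groups + 1)]

# The entire match comes first, then the three parenthesized groups.
print(items)
```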

My retort is that concentric groups can happen anyway:
that Group Zero, holding the entire match, is not really
as special as the newcomer might suspect, because you can
always wind up with groups inside of other groups; it is
simply part of the semantics of regular expressions that
groups might overlap or might contain one another, as in:

    r'((\d\d)/(\d\d)) Description: (.*)'

Here, we see that concentricity is not a special property
of Group Zero, but in fact something that can happen quite
naturally with other groups.
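
A short sketch of this nesting, with an input line invented
for illustration. It checks that groups 2 and 3 fall inside
the span of group 1:

```python
import re

# The input line is invented for illustration.
match = re.match(r'((\d\d)/(\d\d)) Description: (.*)',
                 '11/09 Description: bug day')

outer = match.span(1)    # span of the group enclosing groups 2 and 3
for n in (2, 3):
    start, end = match.span(n)
    # Each inner group lies within the outer group's span.
    assert outer[0] <= start and end <= outer[1]

groups = match.groups()
```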

The caller simply needs to imagine every regular expression
being surrounded by an “automatic set of parentheses” to
understand where Group Zero comes from, and how it will be
ordered in the resulting sequence of groups relative to
the subordinate groups within the pattern.
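
One way to check this mental model, as a sketch with an
assumed sample input: wrap the pattern in explicit parentheses
and compare. The former Group Zero becomes group 1, and every
other group shifts up by one:

```python
import re

pattern = r'(\d\d\d\d)/(\d\d)/(\d\d)'
text = '2013/11/09'   # sample input, chosen for illustration

plain = re.match(pattern, text)
wrapped = re.match('(' + pattern + ')', text)

count = plain.re.groups + 1   # parenthesized groups plus group zero

# With explicit outer parentheses, the former group N is group N+1.
original = [plain.group(n) for n in range(count)]
shifted = [wrapped.group(n + 1) for n in range(count)]
assert shifted == original
```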

If one or two people voice agreement here in this issue,
I will be very happy to offer a patch.

----------
components: Regular Expressions
messages: 202480
nosy: brandon-rhodes, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: MatchObject should offer __getitem__()
type: enhancement
versions: Python 3.5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue19536>
_______________________________________

