[Python-Dev] Possibly inconsistent behavior in re groupdict

Gordon R. Burgess gordon at parasamgate.com
Wed Sep 28 18:03:38 EDT 2016


Hi Guido - thanks for your thoughts on this.

This came up for me when writing an HL7 library, where the raw data is
all bytes - it seemed a little odd that the names went in as bytes and
came out as str - especially given the way the re library expects
consistency between the patterns and targets - but I also appreciate
the point about breaking code.  (Including mine, which has a comment on
it that says, "match.groupdict returns a dict with str keys in Python
3.5" :D)
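
For anyone who wants bytes keys today, here is a minimal sketch of a workaround. The helper name groupdict_as_bytes and the sample HL7-like input are made up for illustration, and it assumes the group names are plain ASCII:

```python
import re


def groupdict_as_bytes(match, encoding="ascii"):
    """Hypothetical helper: copy match.groupdict() with keys encoded to bytes."""
    return {name.encode(encoding): value
            for name, value in match.groupdict().items()}


# Illustrative bytes input; not taken from a real HL7 message.
m = re.match(rb"(?P<segment>\w{3})\|(?P<rest>.*)", b"MSH|^~\\&")

print(m.groupdict())          # keys are str, values are bytes
print(groupdict_as_bytes(m))  # keys and values are both bytes
```

This leaves re's own behavior alone and only converts at the boundary of the calling code.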

Cheers,

Gordon


-----Original Message-----
From: Guido van Rossum <guido at python.org>
Reply-to: guido at python.org
To: Gordon R. Burgess <gordon at parasamgate.com>
Cc: Python-Dev <python-dev at python.org>
Subject: Re: [Python-Dev] Possibly inconsistent behavior in re groupdict
Date: Sun, 25 Sep 2016 21:36:20 -0700

Hi Gordon,

You pose an interesting question that I don't think anyone has posed
before. Having thought about it, I think that the keys in the group
dict are similar to the names of variables or attributes, and I think
treating them always as strings makes sense. For example, I might
write a function that allows passing in a pattern and a search string,
both either str or bytes, where the function would expect fixed keys
in the group dict:

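import re

def extract_key_value(pattern, target):
    m = re.match(pattern, target)  # pattern and target: both str or both bytes
    # groupdict is a method; return None when there is no match
    return (m.groupdict()['key'], m.groupdict()['value']) if m else None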

There might be a problem with decoding the group name from the
pattern, so sticking to ASCII group names would be wise.
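
As a small illustration of that point (a sketch only; the sample pattern and input are made up), with ASCII group names the same str keys come back whether the pattern is str or bytes, and only the values differ in type:

```python
import re

# Same pattern text and ASCII group names, once as str and once as bytes.
str_match = re.match(r"(?P<key>\w+)=(?P<value>\w+)", "color=blue")
bytes_match = re.match(rb"(?P<key>\w+)=(?P<value>\w+)", b"color=blue")

print(str_match.groupdict())    # str keys, str values
print(bytes_match.groupdict())  # str keys, bytes values

# The key sets are identical, so code using fixed str keys works for both.
assert set(str_match.groupdict()) == set(bytes_match.groupdict())
```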

There's also the backwards compatibility concern: even if we did want
to change this, would we want to break existing code (like the above)
that might currently work?

--Guido

On Sun, Sep 25, 2016 at 5:25 PM, Gordon R. Burgess
<gordon at parasamgate.com> wrote:
> 
> I've been lurking for a couple of months, working up the confidence to
> ask the list about this behavior - I've searched through the PEPs but
> couldn't find any specific reference to it.
> 
> In a nutshell, in the Python 3.5 re library, patterns and search buffers
> must both be either unicode strings or byte strings - but the keys in
> the groupdict are always returned as str in either case.
> 
> I don't know whether or not this is by design, but it would make more
> sense to me if when searching a bytes object with a bytes pattern the
> keys returned in the groupdict were bytes as well.
> 
> I reworked the example a little just now so it would run on 2.7 as
> well; on 2.7 the keys in the dictionary correspond to the mode of the
> pattern as expected (and bytes and unicode are interconverted silently)
> - code and output are inline below.
> 
> Thanks for your time,
> 
> Gordon
> 
> [Code]
> 
> import sys
> import re
> from datetime import datetime
> 
> data = (u"first string (unicode)",
>          b"second string (bytes)")
> 
> pattern = [re.compile(u"(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)"),
>            re.compile(b"(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)")]
> 
> print("*** re consistency check ***\nRun: %s\nVersion: Python %s\n" %
>       (datetime.now(), sys.version))
> for p in pattern:
>     for d in data:
>         try:
>             result = "groupdict: %s" % (p.match(d) and
>                                         p.match(d).groupdict())
>         except Exception as e:
>             result = "error: %s" % e.args[0]
>         print("mode: %s\npattern: %s\ndata: %s\n%s\n" %
>               (type(p.pattern).__name__, p.pattern, d, result))
> 
> [Output]
> 
> gordon at w540:~/workspace/regex_demo$ python3 regex_demo.py
> *** re consistency check ***
> Run: 2016-09-25 20:06:29.472332
> Version: Python 3.5.2+ (default, Sep 10 2016, 10:24:58)
> [GCC 6.2.0 20160901]
> 
> mode: str
> pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
> data: first string (unicode)
> groupdict: {'ordinal': 'first', 'type': 'unicode'}
> 
> mode: str
> pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
> data: b'second string (bytes)'
> error: cannot use a string pattern on a bytes-like object
> 
> mode: bytes
> pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
> data: first string (unicode)
> error: cannot use a bytes pattern on a string-like object
> 
> mode: bytes
> pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
> data: b'second string (bytes)'
> groupdict: {'ordinal': b'second', 'type': b'bytes'}
> 
> gordon at w540:~/workspace/regex_demo$ python regex_demo.py
> *** re consistency check ***
> Run: 2016-09-25 20:06:23.375322
> Version: Python 2.7.12+ (default, Sep  1 2016, 20:27:38)
> [GCC 6.2.0 20160822]
> 
> mode: unicode
> pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
> data: first string (unicode)
> groupdict: {u'ordinal': u'first', u'type': u'unicode'}
> 
> mode: unicode
> pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
> data: second string (bytes)
> groupdict: {u'ordinal': 'second', u'type': 'bytes'}
> 
> mode: str
> pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
> data: first string (unicode)
> groupdict: {'ordinal': u'first', 'type': u'unicode'}
> 
> mode: str
> pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
> data: second string (bytes)
> groupdict: {'ordinal': 'second', 'type': 'bytes'}
> 