Possibly inconsistent behavior in re groupdict

I've been lurking for a couple of months, working up the confidence to ask the list about this behavior - I've searched through the PEPs but couldn't find any specific reference to it.

In a nutshell, in the Python 3.5 re library, patterns and search buffers both need to be either unicode or byte strings - but the keys in the groupdict are always returned as str in either case.

I don't know whether or not this is by design, but it would make more sense to me if, when searching a bytes object with a bytes pattern, the keys returned in the groupdict were bytes as well.

I reworked the example a little just now so it would run on 2.7 as well; on 2.7 the keys in the dictionary correspond to the mode of the pattern as expected (and bytes and unicode are interconverted silently) - code and output are inline below.

Thanks for your time,
Gordon

[Code]
import sys
import re
from datetime import datetime

data = (u"first string (unicode)", b"second string (bytes)")
pattern = [re.compile(u"(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)"),
           re.compile(b"(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)")]

print("*** re consistency check ***\nRun: %s\nVersion: Python %s\n" % (datetime.now(), sys.version))
for p in pattern:
    for d in data:
        try:
            result = "groupdict: %s" % (p.match(d) and p.match(d).groupdict())
        except Exception as e:
            result = "error: %s" % e.args[0]
        print("mode: %s\npattern: %s\ndata: %s\n%s\n" % (type(p.pattern).__name__, p.pattern, d, result))

[Output]
gordon@w540:~/workspace/regex_demo$ python3 regex_demo.py
*** re consistency check ***
Run: 2016-09-25 20:06:29.472332
Version: Python 3.5.2+ (default, Sep 10 2016, 10:24:58)
[GCC 6.2.0 20160901]

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {'ordinal': 'first', 'type': 'unicode'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: b'second string (bytes)'
error: cannot use a string pattern on a bytes-like object

mode: bytes
pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
data: first string (unicode)
error: cannot use a bytes pattern on a string-like object

mode: bytes
pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
data: b'second string (bytes)'
groupdict: {'ordinal': b'second', 'type': b'bytes'}

gordon@w540:~/workspace/regex_demo$ python regex_demo.py
*** re consistency check ***
Run: 2016-09-25 20:06:23.375322
Version: Python 2.7.12+ (default, Sep 1 2016, 20:27:38)
[GCC 6.2.0 20160822]

mode: unicode
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {u'ordinal': u'first', u'type': u'unicode'}

mode: unicode
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: second string (bytes)
groupdict: {u'ordinal': 'second', u'type': 'bytes'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {'ordinal': u'first', 'type': u'unicode'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: second string (bytes)
groupdict: {'ordinal': 'second', 'type': 'bytes'}
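The behavior described above can be reproduced in a few lines. This is only a minimal sketch for Python 3 (the original script also targets 2.7); the variable names are illustrative and not taken from the script.

```python
import re

# The same named groups, once as a str pattern and once as a bytes pattern.
str_pattern = re.compile(r"(?P<ordinal>\w+) .*\((?P<type>\w+)\)")
bytes_pattern = re.compile(rb"(?P<ordinal>\w+) .*\((?P<type>\w+)\)")

str_groups = str_pattern.match("first string (unicode)").groupdict()
bytes_groups = bytes_pattern.match(b"second string (bytes)").groupdict()

# On Python 3 the values follow the type of the searched data ...
value_types = (type(str_groups["type"]).__name__,
               type(bytes_groups["type"]).__name__)
# ... while the keys are str for both patterns.
key_types = {type(k).__name__ for k in list(str_groups) + list(bytes_groups)}

print(value_types)
print(key_types)
```

Comparing `value_types` with `key_types` isolates the asymmetry in question: the values track the input type, the keys do not.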

Hi Gordon,

You pose an interesting question that I don't think anyone has posed before. Having thought about it, I think that the keys in the group dict are similar to the names of variables or attributes, and I think treating them always as strings makes sense. For example, I might write a function that allows passing in a pattern and a search string, both either str or bytes, where the function would expect fixed keys in the group dict:

    def extract_key_value(pattern, target):
        m = re.match(pattern, target)
        return m and (m.groupdict()['key'], m.groupdict()['value'])

There might be a problem with decoding the group name from the pattern, so sticking to ASCII group names would be wise.

There's also the backwards compatibility concern: even if we did want to change this, would we want to break existing code (like the above) that might currently work?

--Guido

On Sun, Sep 25, 2016 at 5:25 PM, Gordon R. Burgess <gordon@parasamgate.com> wrote:
I've been lurking for a couple of months, working up the confidence to ask the list about this behavior - I've searched through the PEPs but couldn't find any specific reference to it.
In a nutshell, in the Python 3.5 re library, patterns and search buffers both need to be either unicode or byte strings - but the keys in the groupdict are always returned as str in either case.
I don't know whether or not this is by design, but it would make more sense to me if when searching a bytes object with a bytes pattern the keys returned in the groupdict were bytes as well.
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org
--
--Guido van Rossum (python.org/~guido)
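The point in Guido's reply about fixed keys can be sketched as follows. This is one possible reading of `extract_key_value`, not a definitive version: the explicit `None` check and the sample inputs are assumptions added for illustration.

```python
import re


def extract_key_value(pattern, target):
    """Return (key, value) from the named groups, or None if nothing matches."""
    m = re.match(pattern, target)
    if m is None:
        return None
    groups = m.groupdict()
    # The lookups use str keys whether the pattern is str or bytes.
    return groups["key"], groups["value"]


# Hypothetical sample inputs: one str pair, one bytes pair, one non-match.
text_result = extract_key_value(r"(?P<key>\w+)=(?P<value>\w+)", "colour=blue")
bytes_result = extract_key_value(rb"(?P<key>\w+)=(?P<value>\w+)", b"colour=blue")
no_match = extract_key_value(r"(?P<key>\w+)=(?P<value>\w+)", "???")
```

Because the keys are always str on Python 3, the function body needs no branch on the type of `pattern`; if the keys followed the pattern type, the two lookups would have to be written twice or the names converted first.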

Hi Guido - thanks for your thoughts on this.

This came up for me when writing an HL7 library, where the raw data is all bytes - it seemed a little odd that the names went in as bytes and came out as str, especially given the way the re library expects consistency between the patterns and targets - but I also appreciate the point about breaking code. (Including mine, which has a comment on it that says, "match.groupdict returns a dict with str keys in Python 3.5" :D)

Cheers,
Gordon

-----Original Message-----
From: Guido van Rossum <guido@python.org>
Reply-to: guido@python.org
To: Gordon R. Burgess <gordon@parasamgate.com>
Cc: Python-Dev <python-dev@python.org>
Subject: Re: [Python-Dev] Possibly inconsistent behavior in re groupdict
Date: Sun, 25 Sep 2016 21:36:20 -0700

Hi Gordon,

You pose an interesting question that I don't think anyone has posed before. Having thought about it, I think that the keys in the group dict are similar to the names of variables or attributes, and I think treating them always as strings makes sense. For example, I might write a function that allows passing in a pattern and a search string, both either str or bytes, where the function would expect fixed keys in the group dict:

    def extract_key_value(pattern, target):
        m = re.match(pattern, target)
        return m and (m.groupdict()['key'], m.groupdict()['value'])

There might be a problem with decoding the group name from the pattern, so sticking to ASCII group names would be wise.

There's also the backwards compatibility concern: even if we did want to change this, would we want to break existing code (like the above) that might currently work?

--Guido

On Sun, Sep 25, 2016 at 5:25 PM, Gordon R. Burgess <gordon@parasamgate.com> wrote:
I've been lurking for a couple of months, working up the confidence to ask the list about this behavior - I've searched through the PEPs but couldn't find any specific reference to it.
In a nutshell, in the Python 3.5 re library, patterns and search buffers both need to be either unicode or byte strings - but the keys in the groupdict are always returned as str in either case.
I don't know whether or not this is by design, but it would make more sense to me if when searching a bytes object with a bytes pattern the keys returned in the groupdict were bytes as well.
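For code that handles bytes throughout, as in the HL7 case Gordon mentions, one workaround is to encode the group names after matching. The sketch below is only an illustration: the helper name `groupdict_with_bytes_keys`, the pattern, and the sample record are hypothetical, and it assumes ASCII group names as suggested earlier in the thread.

```python
import re


def groupdict_with_bytes_keys(match):
    """Return match.groupdict() with each str key encoded to bytes."""
    # Assumes ASCII group names; other names would raise UnicodeEncodeError.
    return {name.encode("ascii"): value
            for name, value in match.groupdict().items()}


# Hypothetical pipe-delimited record, loosely in the style of an HL7 segment.
segment_pattern = re.compile(rb"(?P<segment>\w+)\|(?P<field>[^|]*)")
m = segment_pattern.match(b"PID|12345")
fields = groupdict_with_bytes_keys(m)
print(fields)
```

This keeps the conversion in one place, so the rest of the code can look up `fields[b"segment"]` without caring that `re` itself returns str keys.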
participants (2)
- Gordon R. Burgess
- Guido van Rossum