Mailman 3 Possibly inconsistent behavior in re groupdict - Python-Dev

26 Sep 2016

I've been lurking for a couple of months, working up the confidence to
ask the list about this behavior - I've searched through the PEPs but
couldn't find any specific reference to it.

In a nutshell, in the Python 3.5 re module a pattern and the string it
searches must both be str or both be bytes - but the keys in the
groupdict are always returned as str in either case.

I don't know whether or not this is by design, but it would make more
sense to me if when searching a bytes object with a bytes pattern the
keys returned in the groupdict were bytes as well.
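
For anyone who wants to see the behavior in isolation, here is a minimal
sketch for Python 3. The helper name bytes_keyed_groupdict is hypothetical
(it is not part of the re module), and it assumes the group names are
plain ASCII; it only illustrates re-keying the result by hand.

```python
import re

# Bytes pattern with named groups, matching the one used in the report.
pattern = re.compile(rb"(?P<ordinal>\w+) .*\((?P<type>\w+)\)")
match = pattern.match(b"second string (bytes)")

groups = match.groupdict()
# Values follow the type of the searched data; keys are str regardless.
key_types = {type(k).__name__ for k in groups}
value_types = {type(v).__name__ for v in groups.values()}
print(key_types, value_types)


def bytes_keyed_groupdict(m, encoding="ascii"):
    """Hypothetical helper: return m.groupdict() re-keyed with bytes keys.

    Assumes every group name can be encoded with the given encoding.
    """
    return {name.encode(encoding): value
            for name, value in m.groupdict().items()}


bytes_groups = bytes_keyed_groupdict(match)
print(bytes_groups)
```

This leaves the match object untouched and builds a new dict, so code
that expects str keys keeps working.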

I reworked the example a little just now so it would run on 2.7 as
well; on 2.7 the keys in the dictionary correspond to the mode of the
pattern as expected (and bytes and unicode are interconverted silently)
- code and output are inline below.

Thanks for your time,

Gordon

[Code]

import sys
import re
from datetime import datetime

data = (u"first string (unicode)",
         b"second string (bytes)")

pattern = [re.compile(u"(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)"),
           re.compile(b"(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)")]

print("*** re consistency check ***\nRun: %s\nVersion: Python %s\n" %
      (datetime.now(), sys.version))
for p in pattern:
    for d in data:
        try:
            result = "groupdict: %s" % (p.match(d) and
                                        p.match(d).groupdict())
        except Exception as e:
            result = "error: %s" % e.args[0]
        print("mode: %s\npattern: %s\ndata: %s\n%s\n" %
              (type(p.pattern).__name__, p.pattern, d, result))

[Output]

gordon@w540:~/workspace/regex_demo$ python3 regex_demo.py 
*** re consistency check ***
Run: 2016-09-25 20:06:29.472332
Version: Python 3.5.2+ (default, Sep 10 2016, 10:24:58) 
[GCC 6.2.0 20160901]

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {'ordinal': 'first', 'type': 'unicode'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: b'second string (bytes)'
error: cannot use a string pattern on a bytes-like object

mode: bytes
pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
data: first string (unicode)
error: cannot use a bytes pattern on a string-like object

mode: bytes
pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
data: b'second string (bytes)'
groupdict: {'ordinal': b'second', 'type': b'bytes'}

gordon@w540:~/workspace/regex_demo$ python regex_demo.py 
*** re consistency check ***
Run: 2016-09-25 20:06:23.375322
Version: Python 2.7.12+ (default, Sep  1 2016, 20:27:38) 
[GCC 6.2.0 20160822]

mode: unicode
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {u'ordinal': u'first', u'type': u'unicode'}

mode: unicode
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: second string (bytes)
groupdict: {u'ordinal': 'second', u'type': 'bytes'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {'ordinal': u'first', 'type': u'unicode'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: second string (bytes)
groupdict: {'ordinal': 'second', 'type': 'bytes'}
