[Python-Dev] Misc re.match() complaint

Tue Jul 16 04:18:34 CEST 2013

On 16/07/2013 01:25, Guido van Rossum wrote:
> On Mon, Jul 15, 2013 at 5:10 PM, MRAB <python at mrabarnett.plus.com> wrote:
>> On 16/07/2013 00:30, Gregory P. Smith wrote:
>>>
>>>
>>> On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum <guido at python.org> wrote:
>>>
>>>     In a discussion about mypy I discovered that the Python 3 version of
>>>     the re module's Match object behaves subtly differently from the Python
>>>     2 version when the target string (i.e. the haystack, not the needle)
>>>     is a buffer object.
>>>
>>>     In Python 2, the type of the return value of group() is always either
>>>     a Unicode string or an 8-bit string, and the type is determined by
>>>     looking at the target string -- if the target is unicode, group()
>>>     returns a unicode string, otherwise, group() returns an 8-bit string.
>>>     In particular, if the target is a buffer object, group() returns an
>>>     8-bit string. I think this is the appropriate behavior: otherwise
>>>     using regular expression matching to extract a small substring from a
>>>     large target string would unnecessarily keep the large target string
>>>     alive as long as the substring is alive.
>>>
>>>     But in Python 3, the behavior of group() has changed so that its
>>>     return type always matches that of the target string. I think this is
>>>     bad -- apart from the lifetime concern, it means that if your target
>>>     happens to be a bytearray, the return value isn't even hashable!
>>>
>>>     Does anyone remember whether this was a conscious decision? Is it too
>>>     late to fix?
>>>
>>>
>>> Hmm, that is not what I'd expect either. I would never expect it to
>>> return a bytearray; I'd normally assume that .group() returned a bytes
>>> object if the input was binary data and a str object if the input was
>>> unicode data (str), regardless of the specific type that holds the
>>> target data.
>>>
>>> I'm going to hazard a guess that not much, if anything, would be
>>> depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and
>>> earlier users are stuck with an extra bytes() call and data copy in
>>> these cases I guess.
>>>
>> I'm not sure I understand the complaint.
>>
>> I get this for Python 2.7:
>>
>> Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> import array
>>>>> import re
>>>>> re.match(r"a", array.array("b", "a")).group()
>> array('b', [97])
>>
>> It's the same even in Python 2.4.
>
> Ah, but now try it with buffer():
>
>>>> re.search('yz+', buffer('xyzzy')).group()
> 'yzz'
>>>>
>
> The equivalent in Python 3 (using memoryview) returns a memoryview:
>
>>>> re.search(b'yz+', memoryview(b'xyzzy')).group()
> <memory at 0x10d03a688>
>>>>
>
> And I still think that any return type for group() except bytes or str
> is wrong. (Except possibly a subclass of these.)
>
On the other hand, I think that it's not unreasonable that the output
is the same type as the input. You could reason that what it's doing is
returning a slice of the input, and that slice should be the same type
as its source.
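
To illustrate the "slice of the input" reasoning, here is a minimal Python 3
sketch. It only shows how plain slicing behaves for the three binary types
discussed in this thread; it is not a statement about how the re module is
implemented, and the variable names are made up for the example.

```python
# Slicing each binary type yields an object of that same type.
data = b"xyzzy"
targets = [data, bytearray(data), memoryview(data)]

# Take the slice covering "yzz" from each target and record its type name.
slice_types = [type(t[1:4]).__name__ for t in targets]
print(slice_types)

# The memoryview slice still refers to the original buffer, which is the
# lifetime concern raised above; bytes() makes an independent copy.
view_slice = memoryview(data)[1:4]
copied = bytes(view_slice)
print(copied)
```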

Incidentally, the regex module does what Python 3's re module currently
does, even in Python 2. Nobody's complained!
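
For anyone who needs a hashable result no matter what group() returns, the
extra bytes() call mentioned earlier in the thread can be wrapped up as
follows. This is only a sketch: normalized_group is a hypothetical helper
name, not part of re or regex, and it assumes the target is either str or an
object that bytes() can copy.

```python
import re


def normalized_group(match, group=0):
    """Hypothetical helper: return the group as str or as an immutable bytes copy."""
    value = match.group(group)
    if value is None or isinstance(value, str):
        # Unmatched groups and text matches are passed through unchanged.
        return value
    # bytes() copies bytearray/memoryview data, so the result is hashable
    # and does not keep the original target alive.
    return bytes(value)


target = bytearray(b"xyzzy")
m = re.search(b"yz+", target)
result = normalized_group(m)

# The normalized result can be used as a set member or dict key.
seen = {result}
print(result in seen)
```

The copy costs time proportional to the size of the match, not of the target,
so it stays cheap when a small substring is extracted from a large buffer.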