[Python-Dev] Misc re.match() complaint

Tue Jul 16 02:25:11 CEST 2013

On Mon, Jul 15, 2013 at 5:10 PM, MRAB <python at mrabarnett.plus.com> wrote:
> On 16/07/2013 00:30, Gregory P. Smith wrote:
>>
>>
>> On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum <guido at python.org>
>> wrote:
>>
>>     In a discussion about mypy I discovered that the Python 3 version of
>>     the re module's Match object behaves subtly differently from the Python
>>     2 version when the target string (i.e. the haystack, not the needle)
>>     is a buffer object.
>>
>>     In Python 2, the type of the return value of group() is always either
>>     a Unicode string or an 8-bit string, and the type is determined by
>>     looking at the target string -- if the target is unicode, group()
>>     returns a unicode string, otherwise, group() returns an 8-bit string.
>>     In particular, if the target is a buffer object, group() returns an
>>     8-bit string. I think this is the appropriate behavior: otherwise
>>     using regular expression matching to extract a small substring from a
>>     large target string would unnecessarily keep the large target string
>>     alive as long as the substring is alive.
>>
>>     But in Python 3, the behavior of group() has changed so that its
>>     return type always matches that of the target string. I think this is
>>     bad -- apart from the lifetime concern, it means that if your target
>>     happens to be a bytearray, the return value isn't even hashable!
>>
>>     Does anyone remember whether this was a conscious decision? Is it too
>>     late to fix?
>>
>>
>> Hmm, that is not what I'd expect either. I would never expect it to
>> return a bytearray; I'd normally assume that .group() returned a bytes
>> object if the input was binary data and a str object if the input was
>> unicode data (str) regardless of specific types containing the input
>> target data.
>>
>> I'm going to hazard a guess that not much, if anything, would be
>> depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and
>> earlier users are stuck with an extra bytes() call and data copy in
>> these cases I guess.
>>
> I'm not sure I understand the complaint.
>
> I get this for Python 2.7:
>
> Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import array
> >>> import re
> >>> re.match(r"a", array.array("b", "a")).group()
> array('b', [97])
>
> It's the same even in Python 2.4.

Ah, but now try it with buffer():

>>> re.search('yz+', buffer('xyzzy')).group()
'yzz'
>>>

The equivalent in Python 3 (using memoryview) returns a memoryview:

>>> re.search(b'yz+', memoryview(b'xyzzy')).group()
<memory at 0x10d03a688>
>>>

And I still think that any return type for group() except bytes or str
is wrong. (Except possibly a subclass of these.)
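
For anyone on 3.3 or earlier, the extra bytes() call mentioned earlier in
the thread could look roughly like the sketch below. It is only an
illustration of the workaround, not a proposed patch to re; the helper name
group_as_bytes is hypothetical and not part of the re module. It copies the
matched slice into a bytes object, so the result is hashable and no longer
keeps the large target buffer alive.

```python
import re


def group_as_bytes(match, group=0):
    """Hypothetical helper: return a match group as bytes or str.

    If group() already returned bytes or str (or None for a group that
    did not participate), it is passed through unchanged. Any other
    buffer-like result (memoryview, bytearray, ...) is copied into bytes.
    """
    value = match.group(group)
    if value is None or isinstance(value, (bytes, str)):
        return value
    return bytes(value)  # one extra copy of just the matched slice


# Target is a memoryview, as in the example above.
m = re.search(b"yz+", memoryview(b"xyzzy"))
result = group_as_bytes(m)
print(type(result).__name__, result)  # bytes b'yzz'

# Target is a bytearray; the converted result can be used in a set.
m2 = re.search(b"yz+", bytearray(b"xyzzy"))
seen = {group_as_bytes(m2)}
```

The isinstance check means the helper does nothing extra when group()
already returns bytes or str, so it is harmless to leave in place if the
behavior is fixed in a later release.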

-- 
--Guido van Rossum (python.org/~guido)