[Python-Dev] Misc re.match() complaint
Guido van Rossum
guido at python.org
Tue Jul 16 02:25:11 CEST 2013
On Mon, Jul 15, 2013 at 5:10 PM, MRAB <python at mrabarnett.plus.com> wrote:
> On 16/07/2013 00:30, Gregory P. Smith wrote:
>>
>>
>> On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum <guido at python.org> wrote:
>>
>> In a discussion about mypy I discovered that the Python 3 version of
>> the re module's Match object behaves subtly differently from the Python
>> 2 version when the target string (i.e. the haystack, not the needle)
>> is a buffer object.
>>
>> In Python 2, the type of the return value of group() is always either
>> a Unicode string or an 8-bit string, and the type is determined by
>> looking at the target string -- if the target is unicode, group()
>> returns a unicode string, otherwise, group() returns an 8-bit string.
>> In particular, if the target is a buffer object, group() returns an
>> 8-bit string. I think this is the appropriate behavior: otherwise
>> using regular expression matching to extract a small substring from a
>> large target string would unnecessarily keep the large target string
>> alive as long as the substring is alive.
>>
>> But in Python 3, the behavior of group() has changed so that its
>> return type always matches that of the target string. I think this is
>> bad -- apart from the lifetime concern, it means that if your target
>> happens to be a bytearray, the return value isn't even hashable!
>>
>> Does anyone remember whether this was a conscious decision? Is it too
>> late to fix?
>>
>>
>> Hmm, that is not what I'd expect either. I would never expect it to
>> return a bytearray; I'd normally assume that .group() returned a bytes
>> object if the input was binary data and a str object if the input was
>> unicode data (str) regardless of specific types containing the input
>> target data.
>>
>> I'm going to hazard a guess that not much, if anything, would be
>> depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and
>> earlier users are stuck with an extra bytes() call and data copy in
>> these cases I guess.
>>
> I'm not sure I understand the complaint.
>
> I get this for Python 2.7:
>
> Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import array
> >>> import re
> >>> re.match(r"a", array.array("b", "a")).group()
> array('b', [97])
>
> It's the same even in Python 2.4.
Ah, but now try it with buffer():
>>> re.search('yz+', buffer('xyzzy')).group()
'yzz'
>>>
The equivalent in Python 3 (using memoryview) returns a memoryview:
>>> re.search(b'yz+', memoryview(b'xyzzy')).group()
<memory at 0x10d03a688>
>>>
And I still think that any return type for group() except bytes or str
is wrong. (Except possibly a subclass of these.)
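
For code that has to cope with the behavior described above, the extra bytes() call mentioned earlier in the thread can be wrapped up once. What follows is only a minimal sketch: the helper name group_as_bytes_or_str is made up for illustration and is not part of the re module. It copies anything that is not already str or bytes (for example a bytearray or memoryview slice) into a bytes object, so the result is hashable and does not keep the target buffer alive.

```python
import re


def group_as_bytes_or_str(match, *groups):
    """Hypothetical helper (not part of re): like match.group(), but every
    returned value is a str, a bytes object, or None."""

    def normalize(value):
        # Unmatched groups are None; str and bytes are already fine.
        if value is None or isinstance(value, (str, bytes)):
            return value
        # bytearray, memoryview, array.array, ...: copy into bytes.
        return bytes(value)

    result = match.group(*groups)
    # group() returns a tuple when more than one group is requested.
    if isinstance(result, tuple):
        return tuple(normalize(v) for v in result)
    return normalize(result)


m = re.search(b'yz+', memoryview(b'xyzzy'))
found = group_as_bytes_or_str(m)
print(type(found).__name__, found)

# The normalized result can be used as a dict key, unlike a bytearray.
seen = {group_as_bytes_or_str(re.search(b'yz+', bytearray(b'xyzzy'))): 1}
```

The copy is the price of decoupling the substring from the target, and it is the same trade-off the Python 2 behavior made implicitly.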
--
--Guido van Rossum (python.org/~guido)