Misc re.match() complaint

In a discussion about mypy I discovered that the Python 3 version of the re module's Match object behaves subtly differently from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Does anyone remember whether this was a conscious decision? Is it too late to fix? -- --Guido van Rossum (python.org/~guido)
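A minimal sketch of the hashability concern raised above. The target and cache names are hypothetical. The explicit bytes() call is the workaround, and it should give the same result whichever type group() returns on the interpreter in use.

```python
import re

# Hypothetical target: a mutable buffer, as might be shared with C code.
target = bytearray(b"xyzzy")
match = re.search(b"yz+", target)

# Normalize the matched group to an immutable bytes object.
key = bytes(match.group())

# bytes is hashable, so the result can be used as a dict key.
cache = {key: "seen"}

# A bytearray, by contrast, cannot be hashed.
try:
    hash(bytearray(b"yzz"))
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False
```

The point of the sketch is only that a bytearray result cannot serve as a dict key, while a bytes result can.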

On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum <guido@python.org> wrote:
Hmm, that is not what I'd expect either. I would never expect it to return a bytearray; I'd normally assume that .group() returned a bytes object if the input was binary data and a str object if the input was unicode data (str) regardless of specific types containing the input target data. I'm going to hazard a guess that not much, if anything, would be depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases I guess. -gps

Ok, created http://bugs.python.org/issue18468. On Mon, Jul 15, 2013 at 4:30 PM, Gregory P. Smith <greg@krypto.org> wrote:
-- --Guido van Rossum (python.org/~guido)

On 16/07/2013 00:30, Gregory P. Smith wrote:
I'm not sure I understand the complaint. I get this for Python 2.7:
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
It's the same even in Python 2.4.

On Mon, Jul 15, 2013 at 5:10 PM, MRAB <python@mrabarnett.plus.com> wrote:
Ah, but now try it with buffer():
>>> re.search('yz+', buffer('xyzzy')).group()
'yzz'
The equivalent in Python 3 (using memoryview) returns a memoryview:
>>> re.search(b'yz+', memoryview(b'xyzzy')).group()
<memory at 0x10d03a688>
And I still think that any return type for group() except bytes or str is wrong. (Except possibly a subclass of these.) -- --Guido van Rossum (python.org/~guido)

Guido van Rossum writes:
And I still think that any return type for group() except bytes or str is wrong. (Except possibly a subclass of these.)
I'm not sure I understand. Do you mean in the context of the match object API, where constructing "(target, match.start(), match.end())" to get a group-like object that refers to the target rather than copying the text is simple? (Such objects are very useful in the restricted application of constructing a programmable text editor.) Or is this something deeper, that a group *is* a new object in principle?

On Mon, Jul 15, 2013 at 7:03 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm not sure I understand you. :-( The group() method on the match object returned by re.match() and re.search() returns a string-ish object representing the matched substring. (I'm using "string-ish" to allow for both unicode and bytes, which are exactly the two matching modes supported by the re module.) In most contexts (text editors excluded) the program will use this string just as it would use any other string, perhaps using it to open a file, perhaps as a key into some cache, and so on. I can clearly see the reasons why you want the target string to allow other types besides str and bytes, in particular other things that are known to represent sequences of bytes, such as bytearray and memoryview. These reasons primarily have to do with optimizing the representation of the target string in case it takes up a large amount of memory, or other situations where we'd like to reduce the number of times each byte is copied before we see it. But I don't see as much of a use case for group() returning an object of the same type as the target string. In particular in the case of a target string that is a bytearray, group() has to copy the bytes regardless of whether it creates a bytes or a bytearray instance. And I do see a use case for group() returning an immutable object.
Or is this something deeper, that a group *is* a new object in principle?
No, I just think of it as returning "a string" and I think it's most useful if that is always an immutable object, even if the target string is some other bytes buffer. FWIW, it feels as if the change in behavior is probably just due to how slices work. -- --Guido van Rossum (python.org/~guido)

On 16 July 2013 12:20, Guido van Rossum <guido@python.org> wrote:
I took a look at the way the 2.7 re code works, and the change does indeed appear to be due to the difference in the way slices work for buffer and memoryview objects: Slicing a buffer creates an 8-bit string:
>>> buffer(b"abc")[0:1]
'a'
Slicing a memoryview creates another memoryview:
>>> memoryview(b"abc")[0:1]
<memory at 0x7f3320541b98>
Unfortunately, memoryview doesn't currently allow subclasses, so it isn't easy to create a derivative that coerces to bytes on slicing :( Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
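A short Python 3 sketch of the slicing behaviour described above. The names data, piece, small and BytesOnSlice are illustrative only. It shows that a memoryview slice still refers to the whole underlying object, that bytes() makes an independent copy, and that an attempt to subclass memoryview is rejected.

```python
# Stand-in for a large target buffer.
data = b"abc" * 1000
view = memoryview(data)

# Slicing a memoryview produces another memoryview over the same buffer.
piece = view[0:1]

# Converting to bytes copies only the sliced region.
small = bytes(piece)

# memoryview is not an acceptable base type, so this class statement fails.
try:
    class BytesOnSlice(memoryview):  # hypothetical subclass
        pass
    subclassable = True
except TypeError:
    subclassable = False
```

Because piece.obj is still the original data object, holding on to the slice keeps the whole buffer alive, which is the lifetime concern raised earlier in the thread.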

Hm. I'd still like to change this, but I understand it's debatable... Is the group() method written in C or Python? If it's in C it should be simple enough to let it just do a little bit of pointer math and construct a bytes object from the given area of memory -- after all, it must have a pointer to that memory area in order to do the matching in the first place (although I realize the code may be separated by a gulf of abstraction :-). --Guido On Mon, Jul 15, 2013 at 8:03 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 16 July 2013 14:53, Guido van Rossum <guido@python.org> wrote:
It shouldn't be too bad - I tracked it down through sre_compile, and everything seems to funnel into match_getslice_by_index [1], so it should be possible to detect the non-bytes, non-strings there and coerce them to bytes. OTOH, you can already get the same effect by explicitly wrapping the input in memoryview before passing it to re, and then converting the output to bytes to release the reference to the underlying data, and doing that doesn't raise ugly backwards compatibility concerns.... Cheers, Nick. [1] http://hg.python.org/cpython/file/daf9ea42b610/Modules/_sre.c#l3198 -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le Mon, 15 Jul 2013 21:53:42 -0700, Guido van Rossum <guido@python.org> a écrit :
Hm. I'd still like to change this, but I understand it's debatable... Is the group() method written in C or Python?
Is there a strong enough use case to change it? I can't say the current behaviour seems very useful either, but some people may depend on it. I already find it a bit weird that you're passing a bytearray or memoryview to re.match(), to be honest :-) Regards Antoine.

On Tue, Jul 16, 2013 at 12:55 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Is there a strong enough use case to change it? I can't say the current behaviour seems very useful either, but some people may depend on it.
This is the crucial question. I personally see the current behavior as an artifact of the (lack of) design process, not as a conscious decision. Given that we also have m.string, m.start(grp) and m.end(grp), those who need something matching the original type (or even something that is known to be a reference into the original object) can use that API; for most use cases, all you care about is the selected group as a string, and it is more useful if that is always an immutable string (bytes or str). The situation is most egregious if the target string is a bytearray, where there is currently no way to get the result as an immutable bytes object without an extra copy. (There's no API that lets you create a bytes object directly from a slice of a bytearray.) In terms of backwards compatibility, I wouldn't want to do this in a bugfix release, but for a feature release I think it's fine -- the number of applications that could be bitten by this must be extremely small (and the work-around is backward-compatible: just use m.string[m.start() : m.end()]).
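A sketch of the work-around mentioned above, built from m.string and the group's span. The helper name group_as_target_type is hypothetical and not part of the re module. It returns a slice with the same type as the original target, or None for a group that did not participate in the match.

```python
import re

def group_as_target_type(match, group=0):
    """Slice the matched span out of the original target object.

    Hypothetical helper; returns None if the group did not participate.
    """
    start, end = match.span(group)
    if start == -1:
        return None
    return match.string[start:end]

target = bytearray(b"key=value")
m = re.match(rb"(\w+)=(\w+)", target)

same_type = group_as_target_type(m, 2)  # a bytearray slice of the target
immutable = bytes(same_type)            # explicit copy into immutable bytes

# A group that did not take part in the match yields None.
m2 = re.match(b"(a)|(b)", bytearray(b"a"))
missing = group_as_target_type(m2, 2)
```

Code that depends on getting the target's own type back can use this kind of slice, leaving group() free to return an immutable string.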
I already find it a bit weird that you're passing a bytearray or memoryview to re.match(), to be honest :-)
Yes, this is somewhat of an odd corner, but actually most built-in APIs taking bytes also take anything else that can be coerced to bytes (io.open() seems to be the exception, and it feels like an accident -- os.open() *does* accept bytearray and friends). This is quite useful for code that interacts with C code or system calls -- often you have a large buffer shared between C and Python code for efficiency, and being able to do pretty much anything to the buffer that you can do to a bytes object (apart from using it as a dict key) helps a lot. -- --Guido van Rossum (python.org/~guido)

Terry Reedy writes:
The problem is that IIUC '"a string"' is intentionally *not* referring to the usual "str or bytes objects" (at least that's one of the standard uses for scare quotes, to indicate an unusual usage). Either the docstring is using "string" in a similarly ambiguous way, or else it's incorrect under the interpretation that buffer objects are *not* "strings", so they should be inadmissible as targets. Something should be fixed, and I suppose it should be the return type of group(). BTW, I suggest that Terry's usage of "string" (to mean "str or bytes" in 3.x, "unicode or str" in 2.x) be adopted, and Guido's "stringish" be given expanded meaning, including buffer objects. Then we can say informally that in searching and matching a target is a stringish, the pattern is a stringish (?) or compiled re, but the group method returns a string. Steve

On 7/17/2013 12:15 AM, Stephen J. Turnbull wrote:
There are no 'scare quotes' in the doc. I put quote marks on things to indicate that I was quoting. I do not know how Guido regarded his marks.
Saying that input arguments can be "Unicode strings as well as 8-bit strings" (the wording is from 2.x, carried over to 3.x) does not necessarily exclude other inputs. CPython is sometimes more permissive than the doc requires. If the doc said str, bytes, bytearray, or memoryview, then other implementations would have to do the same to be conforming. I do not know if that is intended or not. The question is whether CPython should be just as permissive as to the output types of .group(). (And what, if any, requirement should be imposed on other implementations.)
This word is an adjective, not a noun.
Guido's idea to fix (tighten up) the output in 3.4 is fine with me. -- Terry Jan Reedy

On 17/07/13 19:05, Terry Reedy wrote:
Saying that input arguments can be "Unicode strings as well as 8-bit strings" (the wording is from 2.x, carried over to 3.x) does not necessarily exclude other inputs.
"8-bit strings" seems somewhat ambiguous to me. In UTF-8, many Unicode strings are 8-bit, as they can be with Python 3.3's flexible string format. I prefer to stick to Unicode or text string, versus byte string. Pedants who point out that "byte" does not necessarily mean 8-bits, and therefore we should talk about octets, will be slapped with a large halibut :-) -- Steven

When precision is needed I say things like 'a str object' or 'a bytes object'. There is no shame in a bit of verbosity around such issues, especially in the reference docs (tutorials are a different issue). On Wed, Jul 17, 2013 at 4:50 AM, Steven D'Aprano <steve@pearwood.info> wrote:
-- --Guido van Rossum (python.org/~guido)

On 17/07/2013 05:15, Stephen J. Turnbull wrote:
Instead of "stringish", how about "stringoid"? To me, "stringish" is an adjective, but "stringoid" can be a noun or an adjective. According to http://dictionary.reference.com: """ -oid —suffix forming adjectives, —suffix forming nouns indicating likeness, resemblance, or similarity: anthropoid """

Hi, On Wed, Jul 17, 2013 at 6:15 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
"string" means "str", "bytes" means "bytes", "bytes-like object" means "any object that supports the buffer protocol" [0] (including bytes). "string and bytes-like object" includes all of them. I don't think we need to introduce new terms. Best Regards, Ezio Melotti [0]: http://docs.python.org/3/glossary.html#term-bytes-like-object

On 16/07/2013 01:25, Guido van Rossum wrote:
On the other hand, I think that it's not unreasonable that the output is the same type as the input. You could reason that what it's doing is returning a slice of the input, and that slice should be the same type as its source. Incidentally, the regex module does what Python 3's re module currently does, even in Python 2. Nobody's complained!

On Mon, Jul 15, 2013 at 7:18 PM, MRAB <python@mrabarnett.plus.com> wrote:
By now I'm pretty sure that is why it changed. But I am challenging how useful that is, compared to always returning something immutable.
Incidentally, the regex module does what Python 3's re module currently does, even in Python 2. Nobody's complained!
Well, you'd only see complaints from folks who (a) use the regex module, (b) use it with a buffer object as the target string, and (c) try to use the group() return value as a dict key. Each of these is probably a small minority of all users. -- --Guido van Rossum (python.org/~guido)

On 16 Jul 2013 09:17, "Guido van Rossum" <guido@python.org> wrote:
Does anyone remember whether this was a conscious decision?
I doubt it was a conscious decision - an unfortunate amount of the standard library's handling of the text model change falls into the category of "implementation accident" :(
Is it too late to fix?
Like Greg, I'm comfortable with the idea of calling "bug" on this one, fixing it in 3.4 and making a note in the "Porting to Python 3.4" section of the What's New guide. Cheers, Nick.

On 7/15/2013 7:14 PM, Guido van Rossum wrote:
In both Python 2 and Python 3, the second sentence of the docs is "Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings." The Python 3 version goes on to say that patterns and targets must match. "However, Unicode strings and 8-bit strings cannot be mixed." I normally consider '8-bit string' to mean 'bytes'. It certainly meant that in Python 2. We use 'buffer object' or 'object satisfying the buffer protocol' to mean 'bytes, bytearrays, or memoryviews'. I wonder if the change was an artifact of changing the code to prohibit mixing Unicode and bytes. Going on "match.group([group1, ...]) Returns one or more subgroups of the match. If there is a single argument, the result is a single string;" In both 2.x and 3.x docs, I usually understand generic 'string' to mean 'Unicode or bytes'. In any case, the sentence and a half from 'Returns' to 'string' is *exactly the same* as in the 2.x docs. As near as I could tell by looking, the rest of the entry for match.group is unchanged from 2.x to 3.x. So it is easy to think that the behavior change is an unintended regression. -- Terry Jan Reedy
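One way to make "object satisfying the buffer protocol" concrete in Python 3 is to try to take a memoryview of the object. This is a sketch only: the helper name is_bytes_like is hypothetical, and it treats any object that memoryview() accepts as qualifying, which may be broader than what a particular API is prepared to accept.

```python
import array

def is_bytes_like(obj):
    """Return True if memoryview() accepts obj (hypothetical helper)."""
    try:
        # The with-block releases the view again immediately.
        with memoryview(obj):
            return True
    except TypeError:
        return False

# bytes, bytearray, memoryview and array.array all export a buffer;
# str and int do not.
results = {
    "bytes": is_bytes_like(b"x"),
    "bytearray": is_bytes_like(bytearray(b"x")),
    "memoryview": is_bytes_like(memoryview(b"x")),
    "array": is_bytes_like(array.array("B", [1, 2, 3])),
    "str": is_bytes_like("x"),
    "int": is_bytes_like(3),
}
```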

On 16 July 2013 19:18, Terry Reedy <tjreedy@udel.edu> wrote:
I wonder if the change was an artifact of changing the code to prohibit mixing Unicode and bytes.
I'm pretty sure the only thing we changed in 3.x is to migrate re to the PEP 3118 buffer API, and the behavioural change Guido is seeing is actually the one between the 2.x buffer (which returns 8-bit strings when sliced) and other types (including memoryview) which return instances of themselves. Getting the old buffer behaviour in 3.x without an extra copy operation should just be a matter of wrapping the input with memoryview (to avoid copying the group elements in the match object) and the output with bytes (to avoid keeping the entire original object alive just to reference a few small pieces of it that were matched by the regex):
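The example that originally followed is not preserved in this archive. The block below is a reconstruction sketch of the recipe as described, not the author's code; the data and pattern are made up for illustration.

```python
import re

# Hypothetical stand-in for a large buffer.
data = b"aaabbbcccc"

# Wrap the input in a memoryview before handing it to re.
m = re.match(b"(a*)b*(c*)", memoryview(data))

# Convert each matched group to bytes so the results do not depend on the
# original object; None marks a group that did not participate.
groups = tuple(None if g is None else bytes(g) for g in m.groups())
```

The bytes() call is harmless if groups() already yields bytes objects, so the same code should behave identically whichever type the interpreter returns.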
Given that, I'm inclined to keep the existing behaviour on backwards compatibility grounds. To make the above code work on both 2.x *and* 3.x without making an extra copy, it's possible to keep the bytes call (it should be a no-op on 2.x) and dynamically switch the type used to wrap the input between buffer in 2.x and memoryview in 3.x (unfortunately, the 2.x memoryview doesn't work for this case, as the 2.x re API doesn't accept it as valid input). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

