[Python-Dev] Disabling Unicode readbuffer interface

Thu, 21 Sep 2000 12:58:57 +0200

"Martin v. Loewis" wrote:
> 
> I just tried to disable the getreadbufferproc on Unicode objects. Most
> of the test suite continues to work.

Martin, haven't you read my last post to Guido?

Completely disabling getreadbuf is not a solution worth considering --
it breaks far too much code that the test suite doesn't even exercise.
For example, MarkH's win32 stuff produces tons of Unicode objects, which
can then get passed to potentially all of the stdlib.

Here's another possible solution to the problem:

    Special-case Unicode in getargs.c's code for "s#" only and leave
    getreadbuf enabled. "s#" could then return the default-encoded
    value for the Unicode object, while SRE et al. could still use
    PyObject_AsReadBuffer() to get at the raw data.
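
As a rough illustration of the two access paths, here is a small model
in present-day Python rather than the C code under discussion. The
function names, the "ascii" default encoding, and the use of native-order
UTF-16 as a stand-in for the internal Py_UNICODE buffer are all
assumptions made for this sketch; the real change would live in getargs.c.

```python
import sys

# Assumption: stand-in for the interpreter-wide default encoding.
DEFAULT_ENCODING = "ascii"

# Assumption: model the internal Py_UNICODE buffer as native-order UTF-16.
INTERNAL_CODEC = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def parse_s_hash(obj):
    """Hypothetical model of "s#" with the proposed Unicode special case."""
    if isinstance(obj, str):
        # Special case: hand back the default-encoded value, not raw data.
        return obj.encode(DEFAULT_ENCODING)
    # Any other buffer-compatible object is passed through unchanged.
    return bytes(obj)


def as_read_buffer(obj):
    """Hypothetical model of PyObject_AsReadBuffer(): always the raw data."""
    if isinstance(obj, str):
        return obj.encode(INTERNAL_CODEC)
    return bytes(obj)
```

In this model, parse_s_hash("abc") yields three bytes while
as_read_buffer("abc") yields six, which is the split the proposal relies on.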

> test_unicode fails, which is caused by "s#" not working anymore in
> readbuffer_encode when testing the unicode_internal encoding. That
> could be fixed (*).

True. It currently relies on the fact that "s#" returns the internal
raw data representation for Unicode.

> More concerning, sre fails when matching a unicode string. sre uses
> the getreadbufferproc to get to the internal representation. If it has
> sizeof(Py_UNICODE) times as many bytes as it is long, we got a unicode
> buffer (?!?).
> 
> I'm not sure what the right solution would be in this case: I *think*
> sre should have more specific knowledge of Unicode objects, so it
> should support objects with a buffer interface representing a 1-byte
> character string, or Unicode objects. Actually, is there anything
> wrong with sre operating on string and unicode objects only? It
> requires that the buffer has a single segment, anyway...

Ouch... but then again, it's a (documented?) feature of re and
sre that they work on getreadbuf-compatible objects, e.g.
mmap'ed files, so they'll have to use "s#" for accessing the
data.
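
For readers who have not seen this feature in use, the sketch below
searches a memory-mapped file directly with re. It is written against
present-day Python, where patterns over buffer objects must be bytes
patterns; the helper name search_mapped_file and the temporary file are
assumptions for illustration only.

```python
import mmap
import os
import re
import tempfile


def search_mapped_file(pattern, data):
    """Write data to a temporary file, mmap it, and search the mapping."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w+b") as f:
            f.write(data)
            f.flush()
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                # re reads the mapping through the buffer interface;
                # no intermediate string copy of the file is made here.
                match = re.search(pattern, mapped)
                # Copy the matched bytes out before the mapping is closed.
                return match.group(0) if match else None
    finally:
        os.remove(path)
```

Calling search_mapped_file(rb"b+", b"aaabbbccc") searches the mapped
file contents rather than an in-memory string.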

Of course, with the above solution, SRE could use the 
PyObject_AsReadBuffer() API to get at the binary data.
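
Once the raw bytes are in hand, the length comparison Martin describes
in the quoted text is what would tell the engine how wide each character
is. A minimal sketch of that check in present-day Python follows; the
helper name guess_char_width is hypothetical, and memoryview stands in
for the C-level buffer call.

```python
import array


def guess_char_width(obj):
    """Guess bytes per item by comparing buffer size to len(obj).

    Hypothetical helper: returns None for empty objects or when the
    byte size is not a whole multiple of the length.
    """
    n_items = len(obj)
    n_bytes = memoryview(obj).nbytes
    if n_items == 0 or n_bytes % n_items:
        return None
    return n_bytes // n_items


# An array of unsigned shorts passes the same test a wide-character
# buffer would, so the check alone cannot tell the two apart.
shorts = array.array("H", [1, 2, 3])
```

Here guess_char_width(b"abc") reports one byte per item, while
guess_char_width(shorts) reports the array's item size, which shows why
a size-based guess says nothing about what the bytes actually mean.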

> Regards,
> Martin
> 
> (*) The 'internal encoding' function should directly get to the
> representation of the unicode object, and readbuffer_encode could
> become Python:
> 
> def readbuffer_encode(o,errors="strict"):
>   b = buffer(o)
>   return str(b),len(b)
> 
> or be removed altogether, as it would (rightfully) stop working on
> unicode objects.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/