Mailman 3 Use unbound bytes methods with objects supporting the buffer protocol - Python-ideas


Use unbound bytes methods with objects supporting the buffer protocol


Serhiy Storchaka

July 13, 2016

4:40 p.m.

Unbound methods can be used as functions in python. bytes.lower(b) is the same as b.lower() if b is an instance of bytes.

Many functions and methods that work with bytes accept not just bytes, but arbitrary objects that support the buffer protocol. Including bytes methods:

>>> b'a:b'.split(memoryview(b':'))
[b'a', b'b']

But the first argument of unbound bytes method can be only a bytes instance.

>>> bytes.split(memoryview(b'a:b'), b':')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'split' requires a 'bytes' object but received a 'memoryview'

I think it would be helpful to allow using unbound bytes methods with arbitrary objects that support the buffer protocol as the first argument. This would allow to avoid unneeded copying (the primary purpose of the buffer protocol).

>>> bytes.split(memoryview(b'a:b'), b':')
[b'a', b'b']
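For readers who want to reproduce the situation described above, here is a minimal sketch. The helper name split_buffer is hypothetical and not part of any proposal in this thread. It shows the workaround available today, which copies the buffer into a bytes object first, and that copy is exactly what the proposal wants to avoid.

```python
def split_buffer(data, sep=None, maxsplit=-1):
    """Hypothetical workaround: copy the buffer into bytes, then split."""
    # memoryview() rejects objects without the buffer protocol (e.g. str),
    # and tobytes() makes the copy that the proposal would like to avoid.
    return memoryview(data).tobytes().split(sep, maxsplit)


view = memoryview(b'a:b')

# The unbound method rejects a first argument that is not a bytes instance.
try:
    bytes.split(view, b':')
    unbound_accepts_view = True
except TypeError:
    unbound_accepts_view = False

parts = split_buffer(view, b':')
print(unbound_accepts_view, parts)
```

The extra copy is harmless for small inputs but becomes noticeable when the underlying buffer is large.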


Terry Reedy

July 2016

9:09 p.m.

On 7/13/2016 12:40 PM, Serhiy Storchaka wrote:

...

Unbound methods can be used as functions in python.

According to my naive understanding, in Python 3, there are not supposed to be 'unbound methods'. Functions accessed as a class attribute are supposed to *be* functions, and not just usable as a function. This is true, at least for Python-coded classes.

class C():
    def f(self, other):
        return self + other

print(C.f)  # <function C.f at 0x000001BAFC97D598>
print(C.f(1,2))  # 3
print(C.f('a', 'b'))  # ab

This works because the Python code is generic and being accessed as a class attribute does not impose additional restrictions on inputs beyond those inherent in the code itself.

>>> C.f('a', 2)
Traceback (most recent call last):
  File "<pyshell#29>", line 1, in <module>
    C.f('a', 2)
  File "<pyshell#5>", line 3, in f
    return self + other
TypeError: must be str, not int

>>> 'a'+2
Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    'a'+2
TypeError: must be str, not int

If the situation is different for C-coded functions and classes, then it would seem impossible to write a drop-in replacement for Python-coded classes.

...

bytes.lower(b) is the same as b.lower() if b is an instance of bytes. Many functions and methods that work with bytes accept not just bytes, but arbitrary objects that support the buffer protocol. Including bytes methods:

>>> b'a:b'.split(memoryview(b':'))
[b'a', b'b']

But the first argument of unbound bytes method can be only a bytes instance.

>>> bytes.split(memoryview(b'a:b'), b':')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'split' requires a 'bytes' object but received a 'memoryview'

My naive expectation was that bytes.split should be a built-in function, like open -- but it is not

>>> open
<built-in function open>
>>> bytes.split
<method 'split' of 'bytes' objects>

and that the C coded split function would type check both args for being bytes-like, as it does with the second.

>>> b'a:b'.split(':')
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    b'a:b'.split(':')
TypeError: a bytes-like object is required, not 'str'

Assuming that the descriptor check is not just an unintended holdover from 2.x, it seems that for C-coded functions used as methods, type-checking the first arg was conceptually factored out and replaced by a generic check in the descriptor mechanism. In this case, the descriptor check is stricter than you would like. Is it stricter than necessary? If the memoryview were passed to the code for bytes.split, would the code successfully run to conclusion? Is it sufficiently generic at the machine bytes level?

...

I think it would be helpful to allow using unbound bytes methods with arbitrary objects that support the buffer protocol as the first argument. This would allow to avoid unneeded copying (the primary purpose of the buffer protocol).

>>> bytes.split(memoryview(b'a:b'), b':')
[b'a', b'b']

If the descriptor check cannot be selectively loosened, a possible solution might be a base class for all bytes-like buffer protocol classes that would have all method functions that work with all bytes-like objects.

--
Terry Jan Reedy
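As a rough illustration of the base-class idea, the sketch below uses a hypothetical Python-coded class named ByteOps, which is an assumption for illustration and not an existing or proposed API. Because its methods are plain Python functions, accessing them through the class imposes no type check on the first argument, which is the behaviour described earlier for Python-coded classes. This pure-Python stand-in still copies the data with tobytes(); a C implementation could presumably avoid that.

```python
class ByteOps:
    """Hypothetical base class holding methods that work on any buffer."""

    def split(self, sep=None, maxsplit=-1):
        # memoryview() accepts any object supporting the buffer protocol.
        return memoryview(self).tobytes().split(sep, maxsplit)

    def lower(self):
        return memoryview(self).tobytes().lower()


# Accessed through the class, these are ordinary functions, so the first
# argument is only constrained by what the function body itself needs.
split_result = ByteOps.split(memoryview(b'a:b'), b':')
lower_result = ByteOps.lower(bytearray(b'ABC'))
print(split_result, lower_result)
```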

Nick Coghlan

8:52 a.m.

On 14 July 2016 at 07:09, Terry Reedy <tjreedy@udel.edu> wrote:

...

It's intentional - the default C level descriptors typecheck their first argument, since getting that wrong may cause a segfault in most cases.

...

A custom wrapper descriptor that checks for "supports the buffer protocol" rather than "is a bytes-like object" is certainly possible, so I believe Serhiy's question here is more a design question around "Should they?" than it is a technical question around "Can they?".

Given the way this would behave if "bytes" was implemented in Python rather than C (i.e. unbound methods would rely on ducktyping, even for the first argument), +1 from me for making the unbound methods for bytes compatible with arbitrary objects supporting the buffer protocol.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
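To make the wrapper-descriptor idea concrete, here is a pure-Python sketch. The class names BufferCheckedMethod and MyBytes are hypothetical, and the real change would live in C, so treat this only as a model of the intended type check: the first argument must support the buffer protocol, not necessarily be a bytes instance. Unlike a C implementation, this stand-in copies the data before delegating to the real bytes method.

```python
import functools


class BufferCheckedMethod:
    """Hypothetical descriptor: first argument may be any buffer object."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class access, e.g. MyBytes.split
        return functools.partial(self, instance)  # instance access

    def __call__(self, first, *args, **kwargs):
        try:
            view = memoryview(first)
        except TypeError:
            raise TypeError(
                f"descriptor {self.name!r} requires an object supporting the "
                f"buffer protocol but received a {type(first).__name__!r}"
            ) from None
        # Pure-Python stand-in: copy, then delegate to the real bytes method.
        return getattr(bytes, self.name)(view.tobytes(), *args, **kwargs)


class MyBytes(bytes):
    split = BufferCheckedMethod('split')


unbound_result = MyBytes.split(memoryview(b'a:b'), b':')
bound_result = MyBytes(b'x,y').split(b',')

try:
    MyBytes.split('a:b', ':')
    str_rejected = False
except TypeError:
    str_rejected = True
```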

eryk sun

4:31 p.m.

On Thu, Jul 14, 2016 at 8:52 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

The buffer protocol is a bit generic for duck typing. Instead the bytes methods could check for a memoryview with a format that's "B" or "b".

>>> import numpy as np
>>> a = np.array([1,2,3,4], dtype='int16')
>>> b = np.array([1,2,3,4], dtype='uint8')
>>> memoryview(a).format
'h'
>>> memoryview(b).format
'B'

It's possible to cast if necessary, e.g. memoryview(a).cast('B'). No copy of the data is made, so it's still reasonably efficient. This preserves raising a TypeError for operations that are generally nonsensical, such as attempting to split() an array of short integers as if it's just bytes.
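The format check can be sketched without NumPy by using the standard library's array module. The helper name split_bytes_only is hypothetical, and the copy made by tobytes() is only there because this is a Python-level stand-in for what would be a C-level check.

```python
from array import array


def split_bytes_only(data, sep=None, maxsplit=-1):
    """Hypothetical helper: accept only buffers whose items are single bytes."""
    view = memoryview(data)  # TypeError if there is no buffer protocol
    if view.format not in ('B', 'b'):
        raise TypeError(
            f"byte-oriented buffer required, got format {view.format!r}"
        )
    return view.tobytes().split(sep, maxsplit)


shorts = array('h', [1, 2, 3, 4])  # memoryview format 'h'
octets = array('B', b'a:b')        # memoryview format 'B'

octet_parts = split_bytes_only(octets, b':')

# An array of short integers is rejected unless it is cast explicitly.
try:
    split_bytes_only(shorts, b':')
    shorts_rejected = False
except TypeError:
    shorts_rejected = True

# cast('B') reinterprets the same memory as bytes without copying it.
as_bytes = memoryview(shorts).cast('B')
cast_parts = split_bytes_only(as_bytes, b'\x00')
```

The contents of cast_parts depend on the platform's byte order, which is one reason an explicit cast seems preferable to silently treating short integers as bytes.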