Re: [Python-Dev] Adding bytes.frombuffer() constructor to PEP 467 (was: [Python-ideas] Adding bytes.frombuffer() constructor

12 Oct 2016

I don't think it makes sense to add any more ideas to PEP 467. That
needed to be a PEP because it proposed breaking backwards
compatibility in a couple of areas, and because of the complex history
of Python 3's "bytes-as-tuple-of-ints" and Python 2's "bytes-as-str"
semantics.

Other enhancements to the binary data handling APIs in Python 3 can be
considered on their own merits.

On 12 October 2016 at 14:08, INADA Naoki <songofacandy@gmail.com> wrote:
...
Memoryview problem
=================
To avoid the redundant copy in `line = bytes(buf)[:n]`, the current
solution is to use memoryview.
The first code I wrote was: `line = bytes(memoryview(buf)[:n])`.
On CPython, it works fine.  But `del buf[:n+2]` on the next line may
fail on other Python implementations, because resizing a bytearray is
prohibited while a memoryview on it is alive.
So the right code is:
    with memoryview(buf) as m:
        line = bytes(m[:n])
The problems with the memoryview approach are:
* Overhead: creating a temporary memoryview, plus __enter__ and
  __exit__. (see below)
* It isn't the "one obvious way": developers, including me, may forget
  to use the context manager.
  And since it works on CPython, the mistake is hard to spot.
To add to the confusion, there's also
https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes
giving:

    line = memoryview(buf)[:n].tobytes()

However, folks *do* need to learn that many mutable data types will
lock themselves against modification while you have a live memory view
on them, so it's important to release views promptly and reliably when
we don't need them any more.
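
A minimal sketch of that locking behaviour, as seen on CPython: resizing
a bytearray while a memoryview on it is still live raises BufferError,
and releasing the view (explicitly, or via the context manager) makes
the resize legal again. The buffer contents and the `resize_error` name
are only for illustration.

```python
buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")
n = buf.index(b"\r\n")      # length of the first line, excluding CRLF

# A live view locks the bytearray against resizing.
view = memoryview(buf)
try:
    del buf[:n + 2]         # resize attempted while the view is alive
except BufferError as exc:
    resize_error = exc
else:
    resize_error = None
finally:
    view.release()          # release promptly, whatever happened

# With the view released on exit from the block, the same delete works.
with memoryview(buf) as m:
    line = bytes(m[:n])
del buf[:n + 2]
```

On CPython the temporary `m[:n]` view is dropped as soon as `bytes()`
returns, so only `m` itself needs managing here; other implementations
may keep that temporary alive for longer.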
...
Quick benchmark:
(temporary bytes)
$ python3 -m perf timeit -s 'buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- 'bytes(buf)[:3]'
....................
Median +- std dev: 652 ns +- 19 ns
(temporary memoryview without "with")
$ python3 -m perf timeit -s 'buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- 'bytes(memoryview(buf)[:3])'
....................
Median +- std dev: 886 ns +- 26 ns
(temporary memoryview with "with")
$ python3 -m perf timeit -s 'buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- '
with memoryview(buf) as m:
    bytes(m[:3])
'
....................
Median +- std dev: 1.11 us +- 0.03 us
This is normal though, as memory views trade lower O(N) costs (reduced
data copying) for higher O(1) setup costs (creating and managing the
view, indirection for data access).
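
A rough way to see that trade-off is to repeat the comparison at several
buffer sizes. The sketch below uses the stdlib timeit module rather than
perf, and the helper names (`copy_then_slice`, `view_then_copy`,
`compare`) and the sizes are my own choices, not anything from the
benchmark above. The absolute numbers will vary by machine and
interpreter; the expectation is only that the copy-then-slice cost grows
with the buffer size while the memoryview cost stays roughly flat.

```python
import timeit


def copy_then_slice(buf, n):
    # Copies the whole buffer, then copies the first n bytes again.
    return bytes(buf)[:n]


def view_then_copy(buf, n):
    # Pays a fixed cost to create and release the view, then copies
    # only the n bytes that are actually wanted.
    with memoryview(buf) as m:
        return bytes(m[:n])


def compare(size, n=3, number=2_000):
    """Time both approaches on a bytearray of the given size."""
    buf = bytearray(b"x" * size)
    t_copy = timeit.timeit(lambda: copy_then_slice(buf, n), number=number)
    t_view = timeit.timeit(lambda: view_then_copy(buf, n), number=number)
    return t_copy, t_view


if __name__ == "__main__":
    for size in (15, 1_000, 100_000):
        t_copy, t_view = compare(size)
        print(f"{size:>7} bytes: copy+slice {t_copy:.4f}s, "
              f"memoryview {t_view:.4f}s")
```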
...
Proposed solution
===============
Adding one more constructor to bytes:
    # when length=-1 (default), use until the end of *byteslike*.
    bytes.frombuffer(byteslike, length=-1, offset=0)
With this API,
    with memoryview(buf) as m:
        line = bytes(m[:n])
becomes
    line = bytes.frombuffer(buf, n)
Does that need to be a method on the builtin rather than a separate
helper function, though? Once you define:

    def snapshot(buf, length=None, offset=0):
        with memoryview(buf) as m:
            return m[offset:length].tobytes()

then that can be replaced by a more optimised C implementation without
users needing to care about the internal details.
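
One detail worth pinning down in such a helper: as written above,
`m[offset:length]` treats `length` as an end index once `offset` is
non-zero. A variant that treats `length` as a byte count (which is how I
read the proposed frombuffer() signature, so take that as an assumption)
and that also releases the sliced view explicitly might look like this:

```python
def snapshot(buf, length=None, offset=0):
    """Return a bytes copy of buf[offset:offset + length].

    Assumption: length is a byte count, and None means "up to the end
    of buf". This is an illustrative sketch, not a proposed API.
    """
    stop = None if length is None else offset + length
    # Manage both the full view and the sliced view, so that neither
    # is still holding the buffer when the caller gets the result.
    with memoryview(buf) as m, m[offset:stop] as part:
        return part.tobytes()
```

The caller can then resize the bytearray straight away, e.g.
`line = snapshot(buf, n)` followed by `del buf[:n+2]`, without relying
on CPython's reference counting to clean up the temporary view.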

That is, getting back to a variant on one of Serhiy's suggestions in
the last PEP 467 discussion, it may make sense for us to offer a
"buffertools" library that's specifically aimed at supporting
efficient buffer manipulation operations that minimise data copying.
The pure Python implementations would work entirely through
memoryview, but we could also have selected C accelerated operations
if that showed a noticeable improvement on asyncio's benchmarks.

Regards,
Nick.

P.S. The length/offset API design is also problematic due to the way
it differs from range() & slice(), but I don't think it makes sense to
get into that kind of detail before discussing the larger question of
adding a new helper module for working efficiently with memory buffers
vs further widening the method API for the builtin bytes type.

-- 
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia