[Python-ideas] struct.unpack should support open files

eryk sun eryksun at gmail.com
Tue Dec 25 22:52:44 EST 2018

On 12/25/18, Steven D'Aprano <steve at pearwood.info> wrote:
> On Tue, Dec 25, 2018 at 04:51:18PM -0600, eryk sun wrote:
>> Alternatively, we can memory-map the file via mmap. An important
>> difference is that the mmap buffer interface is low-level (e.g. no
>> file pointer and the offset has to be page aligned), so we have to
>> slice out bytes for the given offset and size. We can avoid copying
>> via memoryview slices.
> Seems awfully complicated. How do we do all these things, and what
> advantage does it give?

Refer to the mmap and memoryview docs. It is more complex, though not
significantly so, and it's not something I'd suggest to a novice. Anyway,
another disadvantage is that this requires a real OS file, not just a
file-like interface. One possible advantage is that we can work
naively and rely on the OS to move pages of the file to and from
memory on demand. However, making this really convenient requires the
ability to access memory directly with on-demand conversion, as is
possible with ctypes (records & arrays) or numpy (arrays).
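
To make the mmap/memoryview approach concrete, here is a minimal sketch.
The sample file, the '<i4s' record layout and the variable names are
assumptions made up for illustration; only the mmap, memoryview and
struct calls are standard library APIs.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: little-endian int followed by 4 bytes.
RECORD = struct.Struct('<i4s')

# Create a small sample file holding two packed records.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(RECORD.pack(1, b'spam'))
    f.write(RECORD.pack(2, b'eggs'))

with open(path, 'rb') as f:
    # Length 0 maps the whole file; the mapping offset stays at 0.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as view:
            # unpack_from takes a byte offset into the buffer.
            first = RECORD.unpack_from(view, 0)
            # Slicing a memoryview does not copy the underlying bytes.
            with view[RECORD.size:2 * RECORD.size] as second_view:
                second = RECORD.unpack(second_view)

os.remove(path)
print(first, second)
```

The memoryview objects are released (here via the with statements)
before the mapping is closed, since mmap refuses to close while a
buffer export is still alive.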

Out of the box, multiprocessing provides this kind of direct access
for shared memory. For example:

    import ctypes
    import multiprocessing

    class Record(ctypes.LittleEndianStructure):
        _pack_ = 1
        _fields_ = (('a', ctypes.c_int),
                    ('b', ctypes.c_char * 4))

    a = multiprocessing.Array(Record, 2)
    a[0].a = 1
    a[0].b = b'spam'
    a[1].a = 2
    a[1].b = b'eggs'

    >>> a._obj
    <multiprocessing.sharedctypes.Record_Array_2 object at 0x7f96974c9f28>

Shared values and arrays are accessed out of a heap that uses arenas
backed by mmap instances:

    >>> a._obj._wrapper._state
    ((<multiprocessing.heap.Arena object at 0x7f96991faf28>, 0, 16), 16)
    >>> a._obj._wrapper._state[0][0].buffer
    <mmap.mmap object at 0x7f96974c4d68>

The two records are stored in this shared memory:

    >>> a._obj._wrapper._state[0][0].buffer[:16]
    b'\x01\x00\x00\x00spam\x02\x00\x00\x00eggs'
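
The attributes above (_obj, _wrapper, _state) are private
implementation details. As a rough sketch, assuming we only want to
inspect the bytes, the same layout can be checked through the public
get_obj() method and compared against a hand-packed struct value:

```python
import ctypes
import multiprocessing
import struct

class Record(ctypes.LittleEndianStructure):
    _pack_ = 1
    _fields_ = (('a', ctypes.c_int),
                ('b', ctypes.c_char * 4))

a = multiprocessing.Array(Record, 2)
a[0].a = 1
a[0].b = b'spam'
a[1].a = 2
a[1].b = b'eggs'

# get_obj() returns the wrapped ctypes array; bytes() copies its memory.
raw = bytes(a.get_obj())

# The same 16 bytes, built by hand with struct for comparison.
expected = struct.pack('<i4si4s', 1, b'spam', 2, b'eggs')
print(raw == expected)
```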

>> We can also use ctypes instead of
>> memoryview/struct.
> Only if you want non-portable code.

ctypes has good support for at least Linux and Windows, but it's an
optional package in CPython's standard library and not necessarily
available with other implementations.

> What advantage over struct is ctypes?

If it's available, I find that ctypes is often more convenient than
the manual pack/unpack approach of struct. If we're writing to the
file, ctypes lets us directly assign data to arrays and the fields of
records on disk (the ctypes instance knows the address and its data
descriptors handle converting values implicitly). The tradeoff is that
defining structures in ctypes can be tedious (_pack_, _fields_)
compared to the simple format strings of the struct module. With
ctypes it helps to already be fluent in C.
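
As an illustration of assigning directly to records on disk, here is a
minimal sketch that overlays a ctypes array on a memory-mapped file.
The temporary file and the Record layout are assumptions for the
example; from_buffer() is the standard ctypes class method for sharing
a writable buffer.

```python
import ctypes
import mmap
import os
import struct
import tempfile

class Record(ctypes.LittleEndianStructure):
    _pack_ = 1
    _fields_ = (('a', ctypes.c_int),
                ('b', ctypes.c_char * 4))

# Create a sample file with room for two zeroed records.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(bytes(2 * ctypes.sizeof(Record)))

with open(path, 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        # from_buffer() shares the mapping's memory instead of copying it.
        records = (Record * 2).from_buffer(mm)
        records[0].a = 1
        records[0].b = b'spam'
        records[1].a = 2
        records[1].b = b'eggs'
        # Drop the ctypes view before the mapping closes; this relies on
        # CPython's reference counting to release the buffer promptly.
        del records
        mm.flush()

# Read the file back the struct way to confirm what was written.
with open(path, 'rb') as f:
    data = f.read()
os.remove(path)

unpacked = list(struct.iter_unpack('<i4s', data))
print(unpacked)
```

The field assignments go straight into the mapped pages, so there is no
explicit pack or write step. The struct format string that reads the
same data back is a single short line, which is the tradeoff described
above.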
