Use __bytes__ to access buffer-protocol from "user-land"
I am working on a toolbox for computer-archaeology where old data media are "excavated" and presented on a web-page. (https://github.com/Datamuseum-DK/AutoArchaeologist for anybody who cares).

Since these data-media can easily add up to tens of gigabytes, mmap and virtual memory are my weapons of choice, and that has brought me into an obscure corner of python where few people seem to venture: I want to access the buffer-protocol from "userland".

The fundamental problem is that if I have an image of a disk and it has 2 partitions, I end up with the "mmap.mmap" object that mapped the raw disk image, and two "bytes" or "bytearray" objects, each containing one partition, for a total memory footprint of twice the size of the disk. As the tool dives into the filesystems in the partitions and creates objects for the individual files in the filesystem, that grows to three times the size of the disk etc.

To avoid this, I am writing a "bytes-like" scatter-gather class (not yet committed), and that is fine as far as it goes.

If I want to write one of my scatter-gather objects to disk, I have to:

    fd.write(bytes(myobj))

As a preliminary point, I think that is just wrong: A class with a __bytes__ method should satisfy any needs the buffer-protocol might have, so this should work:

    fd.write(myobj)

But taking this a little bit further, I think __bytes__ should be allowed to be an iterator, provided the object also offers __len__, so that this would work:

    class bar():

        def __len__(self):
            return 3

        def __bytes__(self):
            yield b'0'
            yield b'1'
            yield b'2'

    open("/tmp/_", "wb").write(bar())

This example is of course trivial, but have the yield statements hand out hundreds of megabytes, and the savings in time and malloc-space become very tangible.

Poul-Henning
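For readers following along, here is a minimal sketch of the behaviour the message objects to. The class name `Bar` is a hypothetical stand-in for the scatter-gather class, and an in-memory `io.BytesIO` is used in place of a real file so the example has no side effects. It only illustrates that defining `__bytes__` does not make an object acceptable to `write()`.

```python
import io

class Bar:
    """Hypothetical stand-in: defines __bytes__ but not the buffer protocol."""

    def __len__(self):
        return 3

    def __bytes__(self):
        return b"012"

buf = io.BytesIO()

# The explicit conversion works, at the cost of materialising a full copy.
buf.write(bytes(Bar()))

# Passing the object directly is rejected, because write() asks for a
# buffer-protocol ("bytes-like") object and never consults __bytes__.
try:
    buf.write(Bar())
    accepted = True
except TypeError:
    accepted = False
```

The same `TypeError` is what a binary file opened with `open(..., "wb")` raises, which is why the `bytes(myobj)` copy is currently unavoidable with `write()`.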
I don't think __bytes__ is necessarily a bad idea, but I want to point out a couple of things you may be unaware of. First, slicing a memoryview object creates a subview, so you can wrap your mmap object in a memoryview and then create slices for each partition, cluster run, etc. without wasting any memory (but not for fragmented files). Second, your iterator example works in Python as it stands if you substitute __iter__ for __bytes__ and writelines for write.
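The first suggestion might look roughly like the sketch below. The two-partition layout, the `SECTOR` constant and the temporary file standing in for a disk image are assumptions made for illustration; only `mmap.mmap`, `memoryview` and slicing come from the message itself.

```python
import mmap
import tempfile

SECTOR = 512  # hypothetical sector size, for illustration only

with tempfile.TemporaryFile() as image:
    # Fake "disk image": two partitions of two sectors each.
    image.write(b"A" * (2 * SECTOR) + b"B" * (2 * SECTOR))
    image.flush()

    disk = mmap.mmap(image.fileno(), 0, access=mmap.ACCESS_READ)
    whole = memoryview(disk)

    # Slicing a memoryview yields sub-views onto the same mapping;
    # no partition data is copied here.
    part1 = whole[: 2 * SECTOR]
    part2 = whole[2 * SECTOR :]

    sizes = (len(part1), len(part2))
    first_byte_p2 = part2[0]

    # A memoryview is a bytes-like object, so write() accepts it directly.
    with tempfile.TemporaryFile() as out:
        out.write(part2)
        out.seek(0)
        copied = out.read()

    # Every view must be released before the mapping can be closed.
    for view in (part1, part2, whole):
        view.release()
    disk.close()
```

As the message notes, this covers contiguous ranges only: a fragmented file would still need several views, one per run.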
On 15.11.20 21:19, Ben Rudiak-Gould wrote:
> I don't think __bytes__ is necessarily a bad idea, but I want to point out a couple of things you may be unaware of. First, slicing a memoryview object creates a subview, so you can wrap your mmap object in a memoryview and then create slices for each partition, cluster run, etc. without wasting any memory (but not for fragmented files). Second, your iterator example works in Python as it stands if you substitute __iter__ for __bytes__ and writelines for write.
I was going to write the same.
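The second suggestion, swapping `__iter__` for `__bytes__` and `writelines` for `write`, can be sketched as follows. The `ScatterGather` class and its list-of-chunks layout are hypothetical, and `io.BytesIO` again stands in for a file opened in binary mode.

```python
import io

class ScatterGather:
    """Hypothetical container that hands out its chunks one at a time."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def __len__(self):
        # Total payload size in bytes, not the number of chunks.
        return sum(len(chunk) for chunk in self._chunks)

    def __iter__(self):
        # Each yielded item must itself be bytes-like (bytes, memoryview, ...).
        yield from self._chunks

obj = ScatterGather([b"0", b"1", b"2"])

out = io.BytesIO()
# writelines() accepts any iterable of bytes-like objects and writes them
# back to back; despite the name, it adds no line separators.
out.writelines(obj)
result = out.getvalue()
```

Because the chunks may be `memoryview` slices of an mmap, the two suggestions combine: a fragmented file can be represented as a list of views and written out without first being joined into one `bytes` object.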
participants (3): Ben Rudiak-Gould, phk@freebsd.dk, Serhiy Storchaka