struct.unpack should support open files

The struct unpack API is inconvenient to use with files. I must do:

    struct.unpack(fmt, file.read(struct.calcsize(fmt)))

every time I want to read a struct from the file. I ended up having to create a utility function for this due to how frequently I was using struct.unpack with files:

    def unpackStruct(fmt, frm):
        if isinstance(frm, io.IOBase):
            return struct.unpack(fmt, frm.read(struct.calcsize(fmt)))
        else:
            return struct.unpack(fmt, frm)

This seems like something that should be built into the default implementation -- struct.unpack already has all the information it needs from just the struct format and an open binary file. The current behavior is an error, since struct.unpack only supports bytes-like objects, so this should be backwards compatible except in the case where a developer is relying on that error in a try block instead of verifying the buffer type beforehand.
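For concreteness, here is a minimal sketch of how the proposed call would compare with the current idiom; the file name "data.bin" and the "<HH" format are made up for illustration, and the file-accepting form is of course hypothetical at this point:

    import struct

    with open("data.bin", "rb") as f:
        # Today: compute the size, read exactly that many bytes, then unpack.
        x, y = struct.unpack("<HH", f.read(struct.calcsize("<HH")))

        # As proposed: pass the open binary file and let unpack do the read.
        # x, y = struct.unpack("<HH", f)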

Handling files overcomplicates both implementation and mental space for API saving. Files can be opened in text mode, what to do in this case? What exception should be raised? How to handle OS errors? On Mon, Dec 24, 2018 at 1:11 PM Drew Warwick <dwarwick96@gmail.com> wrote:
-- Thanks, Andrew Svetlov

On Mon, Dec 24, 2018 at 03:01:07PM +0200, Andrew Svetlov wrote:
Handling files overcomplicates both implementation and mental space for API saving.
Perhaps. Although the implementation doesn't seem that complicated, and the mental space for the API not that much more difficult:

    unpack from bytes, or read from a file

versus

    unpack from bytes, which you might read from a file

Seems about the same to me, except that with the proposal you don't have to calculate the size of the struct before reading. I haven't thought about this very deeply, but at first glance, I like Drew's idea of being able to just pass an open file to unpack and have it read from the file.
Files can be opened in text mode, what to do in this case? What exception should be raised?
That is easy to answer: the same exception you get if you pass text to unpack() when it is expecting bytes:

    py> struct.unpack(fmt, "a")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: a bytes-like object is required, not 'str'

There should be no difference whether the text comes from a literal, a variable, or is read from a file.
How to handle OS errors?
unpack() shouldn't try to handle them. If an OS error occurs, raise an exception, exactly the same way file.read() would raise an exception. -- Steve

On 12/24/18 7:33 AM, Steven D'Aprano wrote:
On Mon, Dec 24, 2018 at 03:01:07PM +0200, Andrew Svetlov wrote:
Handling files overcomplicates both implementation and mental space for API saving.
The json module has load for files, and loads for bytes and strings. That said, JSON is usually read and decoded all at once, but I can see lots of use cases for ingesting "unpackable" data in little chunks. Similarly (but not really), print takes an optional destination that overrides the default destination of stdout. Ironically, StringIO adapts strings so that they can be used in places that expect open files.

What about something like gzip.GzipFile (call it struct.StructFile?), which is basically a specialized file-like class that packs data on writes and unpacks data on reads?

Dan
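As a rough illustration of that idea only: the class name StructFile and its read_record/write_record methods below are invented for this sketch, not an existing or proposed API.

    import struct

    class StructFile:
        """Wrap a binary file object; unpack on reads, pack on writes (sketch only)."""

        def __init__(self, fileobj, fmt):
            self._file = fileobj
            self._struct = struct.Struct(fmt)

        def read_record(self):
            # Read exactly one struct's worth of bytes and unpack it.
            data = self._file.read(self._struct.size)
            if len(data) != self._struct.size:
                raise EOFError('short read: expected %d bytes, got %d'
                               % (self._struct.size, len(data)))
            return self._struct.unpack(data)

        def write_record(self, *values):
            # Pack the values and write them out.
            self._file.write(self._struct.pack(*values))

Usage would then look something like StructFile(open("data.bin", "rb"), "<HH").read_record(), again with a hypothetical file and format.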

Here's a snippet of semi-production code we use:

    def read_and_unpack(handle, fmt):
        size = struct.calcsize(fmt)
        data = handle.read(size)
        if len(data) < size:
            return None
        return struct.unpack(fmt, data)

which was originally something like:

    def read_and_unpack(handle, fmt, offset=None):
        if offset is not None:
            handle.seek(*offset)
        size = struct.calcsize(fmt)
        data = handle.read(size)
        if len(data) < size:
            return None
        return struct.unpack(fmt, data)

until we pulled file seeking up out of the function. Having struct.unpack and struct.unpack_from support files would seem straightforward and be a nice quality of life change, imo.

On Mon, Dec 24, 2018 at 9:36 AM Dan Sommers < 2QdxY4RzWzUUiLuE@potatochowder.com> wrote:

On 24Dec2018 10:19, James Edwards <jheiv@jheiv.com> wrote:
These days I go the other way. I make it easy to get bytes from what I'm working with and _expect_ to parse from a stream of bytes.

I have a pair of modules cs.buffer (for getting bytes from things) and cs.binary (for parsing structures from binary data). (See PyPI.)

cs.buffer primarily offers a CornuCopyBuffer which manages access to any iterable of bytes objects. It has a suite of factories to make these from binary files, bytes, bytes[], a mmap, etc. Once you've got one of these you have access to a suite of convenient methods. Particularly for grabbing structs, there's a .take() method which obtains a precise number of bytes. (Think that looks like a file read? Yes, and it offers a basic file-like suite of methods too.)

Anyway, cs.binary is based on a PacketField base class oriented around pulling a binary structure from a CornuCopyBuffer. Obviously, structs are very common, and cs.binary has a factory:

    def structtuple(class_name, struct_format, subvalue_names):

which gets you a PacketField subclass whose parse methods read a struct and return it to you in a nice namedtuple. Also, PacketFields self transcribe: you can construct one from its values and have it write out the binary form.

Once you've got these, the tendency is just to make PacketField subclasses from that function for the structs you need and then to just grab things from a CornuCopyBuffer providing the data. And you no longer have to waste effort on different code for bytes or files.

Example from cs.iso14496:

    PDInfo = structtuple('PDInfo', '>LL', 'rate initial_delay')

Then you can just use PDInfo.from_buffer() or PDInfo.from_bytes() to parse out your structures from then on.

I used to have tedious duplicated code for bytes and files in various places; I'm ripping it out and replacing it with this as I encounter it. Far more reliable, not to mention smaller and easier.

Cheers, Cameron Simpson <cs@cskk.id.au>

On Mon, 24 Dec 2018 at 13:39, Steven D'Aprano <steve@pearwood.info> wrote:
One difference is that with a file, it's (as far as I can see) impossible to determine whether or not you're going to get bytes or text without reading some data (and so potentially affecting the state of the file object). This might be considered irrelevant (personally, I don't see a problem with a function definition that says "parameter fd must be an object that has a read(length) method that returns bytes" - that's basically what duck typing is all about) but it *is* a distinguishing feature of files over in-memory data.

There is also the fact that read() is only defined to return *at most* the requested number of bytes. Non-blocking reads and objects like pipes that can return additional data over time add extra complexity. Again, not insoluble, and potentially simple enough to handle with "read N bytes, if you got something other than bytes or fewer than N of them, raise an error", but still enough that the special cases start to accumulate.

The suggestion is a nice convenience method, and probably a useful addition for the majority of cases where it would do exactly what was needed, but still not completely trivial to actually implement and document (if I were doing it, I'd go with the naive approach, and just raise a ValueError when read(N) returns anything other than N bytes, for what it's worth).

Paul

On Mon, Dec 24, 2018 at 03:36:07PM +0000, Paul Moore wrote:
Here are two ways: look at the type of the file object, or look at the mode of the file object:

    py> f = open('/tmp/spam.binary', 'wb')
    py> g = open('/tmp/spam.text', 'w')
    py> type(f), type(g)
    (<class '_io.BufferedWriter'>, <class '_io.TextIOWrapper'>)
    py> f.mode, g.mode
    ('wb', 'w')
This might be considered irrelevant
Indeed :-)
But it's not a distinguishing feature between the proposal, and writing:

    unpack(fmt, f.read(size))

which will also read from the file and affect the file state before failing. So it's a difference that makes no difference.
How do they add extra complexity?

According to the proposal, unpack() attempts the read. If it returns the correct number of bytes, the unpacking succeeds. If it doesn't, you get an exception, precisely the same way you would get an exception if you manually did the read and passed it to unpack(). It's the caller's responsibility to provide a valid file object. If your struct needs 10 bytes, and you provide a file that returns 6 bytes, you get an exception.

There's no promise made that unpack() should repeat the read over and over again, hoping that it's a pipe and more data becomes available. It either works with a single read, or it fails. Just like the similar APIs provided by pickle, json etc, which provide load() and loads() functions.

In hindsight, the precedent set by pickle, json, etc suggests that we ought to have an unpack() function that reads from files and an unpacks() function that takes a string, but that ship has sailed.
I can understand the argument that the benefit of this is trivial over

    unpack(fmt, f.read(calcsize(fmt)))

Unlike reading from a pickle or json record, it's pretty easy to know how much to read, so there is an argument that this convenience method doesn't gain us much convenience.

But I'm just not seeing where all the extra complexity and special case handling is supposed to be, except by having unpack make promises that the OP didn't request:

- read partial structs from non-blocking files without failing
- deal with file system errors without failing
- support reading from text files when bytes are required without failing
- if an exception occurs, the state of the file shouldn't change

Those promises *would* add enormous amounts of complexity, but I don't think we need to make those promises. I don't think the OP wants them, I don't want them, and I don't think they are reasonable promises to make.
Indeed. Except that we should raise precisely the same exception type that struct.unpack() currently raises in the same circumstances:

    py> struct.unpack("ddd", b"a")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    struct.error: unpack requires a bytes object of length 24

rather than ValueError.

-- Steve

The proposal can generate cryptic messages like `a bytes-like object is required, not 'NoneType'`. To produce more informative exception text all mentioned cases should be handled:
When a user calls unpack(fmt, f.read(calcsize(fmt))) the user is responsible for handling all edge cases (or, most likely, ignoring them). If it is a part of a library -- robustness is the library's responsibility.

On Mon, Dec 24, 2018 at 11:23 PM Steven D'Aprano <steve@pearwood.info> wrote:
-- Thanks, Andrew Svetlov

On Tue, Dec 25, 2018 at 01:28:02AM +0200, Andrew Svetlov wrote:
The proposal can generate cryptic messages like `a bytes-like object is required, not 'NoneType'`
How will it generate such a message? That's not obvious to me. The message doesn't seem cryptic to me. It seems perfectly clear: a bytes-like object is required, but you provided None instead. The only thing which is sub-optimal is the use of "NoneType" (the name of the class) instead of None.
To produce more informative exception text all mentioned cases should be handled:
Why should they? How are the standard exceptions not good enough? The standard library is full of implementations which use ducktyping, and if you pass a chicken instead of a duck you get errors like

    AttributeError: 'Chicken' object has no attribute 'bill'

Why isn't that good enough for this function too?

We already have a proof-of-concept implementation, given by the OP. Here it is again:

    import io, struct

    def unpackStruct(fmt, frm):
        if isinstance(frm, io.IOBase):
            return struct.unpack(fmt, frm.read(struct.calcsize(fmt)))
        else:
            return struct.unpack(fmt, frm)

Here's the sort of exceptions it generates. For brevity, I have cut the tracebacks down to only the final line:

    py> unpackStruct("ddd", open("/tmp/spam", "w"))
    io.UnsupportedOperation: not readable

Is that not clear enough? (This is not a rhetorical question.) In what way do you think that exception needs enhancing? It seems perfectly fine to me.

Here's another exception that may be fine as given. If the given file doesn't contain enough bytes to fill the struct, you get this:

    py> __ = open("/tmp/spam", "wb").write(b"\x10")
    py> unpackStruct("ddd", open("/tmp/spam", "rb"))
    struct.error: unpack requires a bytes object of length 24

It might be *nice*, but hardly *necessary*, to re-word the error message to make it more obvious that we're reading from a file, but honestly that should be obvious from context. There are certainly worse error messages in Python.

Here is one exception which should be reworded:

    py> unpackStruct("ddd", open("/tmp/spam", "r"))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 3, in unpackStruct
    TypeError: a bytes-like object is required, not 'str'

For production use, that should report that the file needs to be opened in binary mode, not text mode. Likewise similar type errors should report "bytes-like or file-like" object.

These are minor enhancements to exception reporting, and aren't what I consider to be adding complexity in any meaningful sense. Of course we should expect that library-quality functions will have more error checking and better error reporting than a simple utility function for your own use. The OP's simple implementation is a five line function. Adding more appropriate error messages might, what? Triple it? That surely is an argument for *doing it right, once* in the standard library, rather than having people re-invent the wheel over and over.

    def unpackStruct(fmt, frm):
        if isinstance(frm, io.IOBase):
            if isinstance(frm, io.TextIOBase):
                raise TypeError('file must be opened in binary mode, not text')
            n = struct.calcsize(fmt)
            value = frm.read(n)
            assert isinstance(value, bytes)
            if len(value) != n:
                raise ValueError(
                    'expected %d bytes but only got %d' % (n, len(value))
                    )
            return struct.unpack(fmt, value)
        else:
            return struct.unpack(fmt, frm)

I think this is a useful enhancement to unpack(). If we were designing the struct module from scratch today, we'd surely want unpack() to read from files and unpacks() to read from a byte-string, mirroring the API of json, pickle, and similar. But given the requirement for backwards compatibility, we can't change the fact that unpack() works with byte-strings. So we can either add a new function, unpack_from_file(), or simply make unpack() a generic function that accepts either a bytes-like interface or a file-like interface. I vote for the generic function approach. (Or do nothing, of course.)

So far, I'm not seeing any substantial arguments for why this isn't useful, or too difficult to implement.
If anything, the biggest argument against it is that it is too simple to bother with (but that argument would apply equally to the pickle and json APIs). "Not every ~~one~~ fifteen line function needs to be in the standard library." -- Steve

On Wed, Dec 26, 2018 at 7:12 AM Steven D'Aprano <steve@pearwood.info> wrote:
The perfect demonstration of io objects' complexity. `stream.read(N)` can return None by spec if the file is non-blocking and has no ready data. Confusing, but still possible and documented behavior.
`.read(N)` can return less bytes by definition; that's true starting from the very low-level read(2) syscall. Otherwise there's a (low) chance of broken code with a very non-obvious error message.
-- Thanks, Andrew Svetlov

On Wed, Dec 26, 2018 at 09:48:15AM +0200, Andrew Svetlov wrote:
https://docs.python.org/3/library/io.html#io.RawIOBase.read

Regardless, my point doesn't change. That has nothing to do with the behaviour of unpack. If you pass a non-blocking file-like object which returns None, you get exactly the same exception as if you wrote

    unpack(fmt, f.read(size))

and the call to f.read returned None. Why is it unpack's responsibility to educate the caller that f.read can return None?

Let's see what other functions with similar APIs do.

    py> class FakeFile:
    ...     def read(self, n=-1):
    ...         return None
    ...     def readline(self):
    ...         return None
    ...
    py> pickle.load(FakeFile())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: a bytes-like object is required, not 'NoneType'

    py> json.load(FakeFile())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/json/__init__.py", line 268, in load
        parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
      File "/usr/local/lib/python3.5/json/__init__.py", line 312, in loads
        s.__class__.__name__))
    TypeError: the JSON object must be str, not 'NoneType'

If it is good enough for pickle and json load() functions to report a TypeError like this, it is good enough for unpack(). Not every exception needs a custom error message.
You need to repeat reads until collecting the value of enough size.
That's not what the OP has asked for, it isn't what the OP's code does, and it's not what I've suggested.

Do pickle and json block and repeat the read until they have a complete object? I'm pretty sure they don't -- the source for json.load() that I have says:

    return loads(fp.read(), ... )

so it definitely doesn't repeat the read. I think it is so unlikely that pickle blocks waiting for extra input that I haven't even bothered to look.

Looping and repeating the read is a clear case of YAGNI. Don't over-engineer the function, and then complain that the over-engineered function is too complex. There is no need for unpack() to handle streaming input which can output anything less than a complete struct per read.
`.read(N)` can return less bytes by definition,
Yes, we know that. And if it returns fewer bytes, then you get a nice, clear exception. -- Steve

On Wed, Dec 26, 2018 at 11:26 AM Steven D'Aprano <steve@pearwood.info> wrote:
Restricting fp to BufferedIOBase looks viable though, but it is not a file-like object.

Also I'm thinking about type annotations in typeshed. Now the type is

    Union[array[int], bytes, bytearray, memoryview]

Should it be

    Union[io.BinaryIO, array[int], bytes, bytearray, memoryview]

?

What is the behavior of unpack_from(fp, offset=120)? Should iter_unpack() read the whole buffer from the file into memory before emitting the first value?
-- Thanks, Andrew Svetlov

On Wed, Dec 26, 2018, 02:19 Andrew Svetlov <andrew.svetlov@gmail.com> wrote:
Yeah, trying to support both buffers and file-like objects in the same function seems like a clearly bad idea. If we do this at all it should be by adding new convenience functions/methods that take file-like objects exclusively, like the ones several people posted on the thread.

I don't really have an opinion on whether this is worth doing at all. I guess I can think of some arguments against: Packing/unpacking multiple structs to the same file-like object may be less efficient than using a single buffer + a single call to read/write. And it's unfortunate that the obvious pack_into/unpack_from names are already taken. And it's only 2 lines of code to write your own helpers. But none of these are particularly strong arguments either, and clearly some people would find them handy.

-n
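For what such file-only helpers might look like, here is a minimal sketch; the names unpack_file and pack_file are invented for illustration and are not part of any existing or proposed API:

    import struct

    def unpack_file(fmt, fp):
        """Read exactly one struct of format fmt from the binary file fp and unpack it."""
        size = struct.calcsize(fmt)
        data = fp.read(size)
        if len(data) != size:
            raise EOFError('expected %d bytes, got %d' % (size, len(data)))
        return struct.unpack(fmt, data)

    def pack_file(fmt, fp, *values):
        """Pack values with format fmt and write them to the binary file fp."""
        fp.write(struct.pack(fmt, *values))

These sit alongside, rather than change, the existing bytes-only unpack()/pack().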

On Wed, Dec 26, 2018 at 12:18:23PM +0200, Andrew Svetlov wrote: [...]
This is complexity that isn't the unpack() function's responsibility to care about. All it wants is to call read(N) and get back N bytes. If it gets back anything else, that's an error.
Restricting fp to BufferedIOBase looks viable though, but it is not a file-like object.
There is no need to restrict it to BufferedIOBase.

In hindsight, I am not even sure we should do an isinstance check at all. Surely all we care about is that the object has a read() method which takes a single argument, and returns that number of bytes?

Here's another proof-of-concept implementation which doesn't require any isinstance checks on the argument. The only type checking it does is to verify that the read returns bytes, and even that is only a convenience so it can provide a friendly error message.

    def unpackStruct(fmt, frm):
        try:
            read = frm.read
        except AttributeError:
            return struct.unpack(fmt, frm)
        n = struct.calcsize(fmt)
        value = read(n)
        if not isinstance(value, bytes):
            raise TypeError('read method must return bytes')
        if len(value) != n:
            raise ValueError('expected %d bytes but only got %d' % (n, len(value)))
        return struct.unpack(fmt, value)

[...]
What is the behavior of unpack_from(fp, offset=120)?
I don't know. What does the "offset" parameter do, and who requested it? I didn't, and neither did the OP Drew Warwick. James Edwards wrote that he too uses a similar function in production, one which originally did support file seeking, but they took it out. If you are suggesting an offset parameter to the unpack() function, it is up to you to propose what meaning it will have and justify why it should be part of unpack's API. Until then, YAGNI.
Should iter_unpack() read the whole buffer from the file into memory before emitting the first value?
Nobody has requested any changes to iter_unpack(). -- Steve

On 26Dec2018 12:18, Andrew Svetlov <andrew.svetlov@gmail.com> wrote:
Oh, it is better than that. At the low level, even blocking streams can return short reads - particularly serial streams like ttys and TCP connections.
And this is why I, personally, think augmenting struct.unpack and json.read and a myriad of other arbitrary methods to accept both file-like things and bytes is an open ended can of worms.

And it is why I wrote myself my CornuCopyBuffer class (see my other post in this thread). Its entire purpose is to wrap an iterable of bytes-like objects and do all that work via convenient methods. And which has factory methods to make these from files or other common things. Given a CornuCopyBuffer `bfr`:

    S = struct.Struct('spec-here...')
    sbuf = bfr.take(S.size)
    result = S.unpack(sbuf)

Under the covers `bfr` takes care of short "reads" (iteration values) etc in the underlying iterable. The return from .take is typically a memoryview from `bfr`'s internal buffer - it is _always_ exactly `size` bytes long if you don't pass short_ok=True, or it raises an exception. And so on.

The point here is: make a class to get what you actually need, and _don't_ stuff variable and hard to agree on extra semantics inside multiple basic utility classes like struct.

For myself, the CornuCopyBuffer is now my universal interface to byte streams (binary files, TCP connections, whatever) which need binary parsing, and it has the methods and internal logic to provide that, including presenting a simple read-only file-like interface with read and seek-forward, should I need to pass it to a file-expecting object.

Do it _once_, and don't mega-complicate all the existing utility classes.

Cheers, Cameron Simpson <cs@cskk.id.au>

And this is why I, personally, think augmenting struct.unpack and json.read and a myriad of other arbitrary methods to accept both file-like things and bytes is an open ended can of worms.
And it is why I wrote myself my CornuCopyBuffer class (see my other post in this thread).
Seems like that should be in the standard library then! / Anders

On 27Dec2018 02:53, Anders Hovmöller <boxed@killingar.net> wrote:
It is insufficiently used at present. The idea seems sound - a flexible adapter of bytes sources providing easy methods to aid parsing - based on how useful it has been to me. But it has rough edges and one needs to convince others of its utility before entry into the stdlib. So it is on PyPI for easy use. If you're in the binary I/O/parsing space, pip install it (and cs.binary, which utilises it) and see how they work for you. Complain to me about poor semantics or bugs. And then we'll see how general purpose it really is. The PyPI package pages for each have doco derived from the module docstrings. Cheers, Cameron Simpson <cs@cskk.id.au>

On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote: [...]
I presume you mean json.load(), not read, except that it already reads from files.

Nobody is talking about augmenting "a myriad of other arbitrary methods" except for you. We're talking about enhancing *one* function to be a simple generic function.

I assume you have no objection to the existence of json.load() and json.loads() functions. (If you do think they're a bad idea, I don't know what to say.) Have they led to "an open ended can of worms"?

If we wrote a simple wrapper:

    def load(obj, *args, **kwargs):
        if isinstance(obj, str):
            return json.loads(obj, *args, **kwargs)
        else:
            return json.load(obj, *args, **kwargs)

would that lead to "an open ended can of worms"?

These aren't rhetorical questions. I'd like to understand your objection. You have dismissed what seems to be a simple enhancement with a vague statement about hypothetical problems. Please explain in concrete terms what these figurative worms are.

Let's come back to unpack. Would you object to having two separate functions that matched (apart from the difference in name) the API used by json, pickle, marshal etc?

- unpack() reads from files
- unpacks() reads from strings

Obviously this breaks backwards compatibility, but if we were designing struct from scratch today, would this API open a can of worms? (Again, this is not a rhetorical question.)

Let's save backwards compatibility:

- unpack() reads from strings
- unpackf() reads from files

Does this open a can of worms?

Or we could use a generic function. There is plenty of precedent for generic files in the stdlib. For example, zipfile accepts either a file name, or an open file object.

    def unpack(fmt, frm):
        if hasattr(frm, "read"):
            return _unpack_file(fmt, frm)
        else:
            return _unpack_bytes(fmt, frm)

Does that generic function wrapper create "an open ended can of worms"? If so, in what way?

I'm trying to understand where the problem lies, between the existing APIs used by json etc (presumably they are fine) and the objections to using what seems to be a very similar API for unpack, offering the same functionality but differing only in spelling (a single generic function instead of two similarly-named functions).
That's exactly the proposed semantics for unpack, except there's no "short_ok" parameter. If the read is short, you get an exception.
And so on.
The point here is: make a class to get what you actually need
Do you know better than the OP (Drew Warwick) and James Edwards what they "actually need"?

How would you react if I told you that your CornuCopyBuffer class is an over-engineered, over-complicated, over-complex class that you don't need? You'd probably be pretty pissed off at my arrogance in telling you what you do or don't need for your own use-cases. (Especially since I don't know your use-cases.) Now consider that you are telling Drew and James that they don't know their own use-cases, despite the fact that they've been working successfully with this simple enhancement for years.

I'm happy for you that CornuCopyBuffer solves real problems for you, and if you want to propose it for the stdlib I'd be really interested to learn more about it. But this is actually irrelevant to the current proposal. Even if we had a CornuCopyBuffer in the stdlib, how does that help? We will still need to call struct.calcsize(format) by hand, still need to call read(size) by hand. Your CornuCopyBuffer does nothing to avoid that.

The point of this proposal is to avoid that tedious make-work, not increase it by having to wrap our simple disk files in a CornuCopyBuffer before doing precisely the same make-work we didn't want to do in the first place.

Drew has asked for a better hammer, and you're telling him he really wants a space shuttle.

-- Steve

I'm quoting Steve's post here but am responding more broadly to the whole thread too. On Thu, Dec 27, 2018 at 1:00 PM Steven D'Aprano <steve@pearwood.info> wrote:
Personally, I'd actually be -0 on json.load if it didn't already exist. It's just a thin wrapper around json.loads() - it doesn't actually add anything. This proposal is _notably better_ in that it will (attempt to) read the correct number of bytes. The only real reason to have json.load/json.loads is to match pickle etc. (Though pickle does things the other way around, at least in the Python source code I have handy - loads is implemented using BytesIO, so it's the file-based API that's fundamental, as opposed to JSON where the string-based API is fundamental. I guess maybe that's a valid reason? To allow either one to be implemented in terms of the other?) But reading a struct *and then leaving the rest behind* is, IMO, a more valuable feature.
Not in my opinion, but I also don't think it gains you anything much. It isn't consistent with other stdlib modules, and it isn't very advantageous over the OP's idea of just having the same function able to cope with files as well as strings. The only advantage that I can see is that unpackf() might be made able to accept a pathlike, which it will open, read from, and close. (Since a pathlike could be a string, the single function would technically be ambiguous.) And I'd drop that idea in the YAGNI basket.
FTR, I am +0.9 on this kind of proposal - basically "just make it work" within the existing API. It's a small amount of additional complexity to support a quite reasonable use-case.
Drew has asked for a better hammer, and you're telling him he really wants a space shuttle.
But but.... a space shuttle is very effective at knocking nails into wood... also, I just want my own space shuttle. Plz? Thx. Bye! :) ChrisA

On 27Dec2018 12:59, Steven D'Aprano <steve@pearwood.info> wrote:
Likely. Though the json module is string oriented (though if one has UTF-8 data, turning binary into that is easy).
Yes, but that is how the rot sets in. Some here want to enhance json.load/loads. The OP wants to enhance struct.unpack. Yay. Now let's also do csv.reader. Etc.

I think my point is twofold: once you start down this road you (a) start doing it to every parser in the stdlib and (b) we all start bikeshedding about semantics.

There are at least two roads to such enhancement: make the functions polymorphic, coping with files or bytes/strs (depending), or make a parallel suite of functions like json.load/loads. The latter is basically API bloat to little advantage. The former is rather slippery - I've a few functions myself with accept-str-or-file call modes, and _normally_ the "str" flavour is taken as a filename. But... if the function is a string parser, maybe it should parse the string itself? Already the choices are messy.

And both approaches have much bikeshedding. Some of us would like something like struct.unpack to pull enough data from the file even if the file returns short reads. You, I gather, generally like the shim to be very shallow and have a short read cause an exception through insufficient data. Should the file version support an optional seek/offset argument? The example from James suggests that such a thing would benefit him. And so on.

And this argument has to play out for _every_ parser interface you want to adapt for both files and direct bytes/str (again, depending).
On their own, no. The isolated example never starts that way. But really consistency argues that the entire stdlib should have file and str/bytes parallel functions across all parsers. And _that_ is a can of worms.
Less so. I've a decorator of my own called @strable, which wraps other functions; it intercepts the first positional argument if it is a str and replaces it with something derived from it. The default mode is an open file, with the str as the filename, but it is slightly pluggable. Such a decorator could reside in a utility stdlib module and become heavily used in places like json.load if desired.
I'm hoping my discussion above shows where I think the open ended side of the issue arises: once we do it to one function we sort of want to do it to all similar functions, and there are multiple defensible ways to do it.
Well, yeah. (Presuming you mean bytes rather than strings above in the Python 3 domain.) API bloat. They are essentially identical functions in terms of utility.
Only in that it opens the door to doing the same for every other similar function in the stdlib. And wouldn't it be nice to have a third form to take a filename and open it?
Let's save backwards compatibility:
Some degree of objection: API bloat requiring repeated bloat elsewhere. Let's set backwards compatibility aside: it halves the discussion and examples.
Indeed, and here we are with flavour #3: the string isn't a byte sequence to parse, it is now a filename. In Python 3 we can disambiguate if we parse bytes and treat str as a filename. But what if we're parsing str, as JSON does? Now we don't know and must make a policy decision.
If you were to rewrite the above in the form of my @strable decorator, provide it in a utility library, and _use_ it in unpack, I'd be +1, because the _same_ utility can be reused elsewhere by anyone for any API. Embedding it directly in unpack complicates unpack's semantics for what is essentially a shim.

Here's my @strable, minus its docstring:

    @decorator
    def strable(func, open_func=None):
        if open_func is None:
            open_func = open
        def accepts_str(arg, *a, **kw):
            if isinstance(arg, str):
                with Pfx(arg):
                    with open_func(arg) as opened:
                        return func(opened, *a, **kw)
            return func(arg, *a, **kw)
        return accepts_str

and an example library function:

    @strable
    def count_lines(f):
        count = 0
        for line in f:
            count += 1
        return count

and there's a function taking an open file or a filename. But suppose we want to supply a string whose lines need counting, not a filename. We could _either_ change our policy decision from "accepts a filename" to "accepts an input string", _or_ we can start adding a third mode on top of the existing two modes. All three modes are reasonable.
I'm trying to understand where the problem lies, between the existing APIs used by json etc (presumably they are fine)
They're historic. I think I'm -0 on having 2 functions. But only because it is so easy to hand file contents to loads.
I hope I've made it more clear above that my objection is to either approach (polymorphic or parallel functions) because one can write a general purpose shim and use it with almost anything, and then we can make things like json or struct accept _only_ str or bytes respectively, with _no_ complicating extra semantics. Because once we do it for these 2 we _should_ do it for every parser for consistency.

Yes, yes, stripping json _back_ to just loads would break backwards compatibility; I'm not proposing that for real. I'm proposing resisting extra semantic bloat in favour of a helper class or decorator. Consider:

    from shimutils import bytes_from_file
    from struct import unpack
    unpackf = bytes_from_file(unpack)

Make a bunch of shims for the common use cases and the burden on users of the various _other_ modules becomes very small, and we don't have to go to every parser API and bloat it out. Especially since we've seen the bikeshedding on semantics even on this small suggestion ("accept a file").
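One possible sketch of such a shim follows, purely for illustration: `shimutils` does not exist, and the optional size-computing argument is an addition of this sketch (not part of the one-argument call shown above) to resolve the "how much to read" question for struct.

    import struct

    def bytes_from_file(func, size_func=None):
        # Wrap func(fmt, data, ...) so that a file-like `data` argument is
        # read into bytes first.  size_func(fmt) says how many bytes to read;
        # without it, the whole file is read.  (Hypothetical helper, sketch only.)
        def wrapper(fmt, src, *args, **kwargs):
            if hasattr(src, 'read'):
                n = size_func(fmt) if size_func is not None else -1
                src = src.read(n)
            return func(fmt, src, *args, **kwargs)
        return wrapper

    unpackf = bytes_from_file(struct.unpack, struct.calcsize)

With that, unpackf(fmt, open_binary_file) reads exactly calcsize(fmt) bytes before unpacking, while unpackf(fmt, some_bytes) behaves like plain struct.unpack.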
And here we are. Bikeshedding already!

My CCB.take (for short) raises an exception on _insufficient_ data, not a short read. It does enough reads to get the data demanded. If I _want_ to know that a read was short I can pass short_ok=True and examine the result before use. Its whole point is to give the right data to the caller.

Let me give you some examples:

I run some binary protocols over TCP streams. They're not network packets; the logical packets can span IP packets, and of course conversely several small protocol packets may fit in a single network packet because they're assembled in a buffer at the sending end (via plain old file.write). Via a CCB the receiver _doesn't care_. Ask for the required data, the CCB gathers enough and hands it over.

I parse MP4 files. The ISO14496 packet structure has plenty of structures of almost arbitrary size, particularly the media data packet (MDAT) which can be gigabytes in size. You're _going_ to get a short read there. I'd be annoyed by an exception.
No, but I know what _I_ need. A flexible controller with several knobs to treat input in various common ways.
Some examples above. There's a _little_ over engineering, but it actually solves a _lot_ of problems, making everything else MUCH MUCH simpler.
I'm not. I'm _suggesting_ that _instead_ of embedding extra semantics, which we can't even all agree on, into parser libraries it is often better to make it easy to give the parser what its _current_ API accepts. And that the tool to do that should be _outside_ those parser modules, not inside, because it can be generally applicable.
Not yet. Slightly rough, and the user audience is basically me right now. But feel free to pip install cs.buffer and cs.binary and have a look.
No, but its partner cs.binary _does_. As described in my first post to this thread. Have a quick reread, particularly near the "PDInfo" example.
To my eye he asked to make unpack into a multitool (bytes and files), and I'm suggesting maybe he should get a screwdriver to go with his hammer (to use as a chisel, of course).

Anyway, I'm making 2 arguments:

- don't bloat the stdlib APIs to accommodate things much beyond their core
- offer a tool to make the things beyond the core _easily_ available for use in the core way

The latter can then _also_ be used with other APIs not yet extended.

Cheers, Cameron Simpson <cs@cskk.id.au>

On Wed, 26 Dec 2018 at 09:26, Steven D'Aprano <steve@pearwood.info> wrote:
Abstraction, basically - once the unpack function takes responsibility for doing the read, and hiding the fact that there's a read going on behind an API unpack(fmt, f), it *also* takes on responsibility for managing all of the administration of that read call. It's perfectly at liberty to do so by saying "we do a read() behind the scenes, so you get the same behaviour as if you did that read() yourself", but that's a pretty thin layer of abstraction (and people often expect something less transparent).

As I say, you *can* define the behaviour as you say, but it shouldn't be surprising if people expect a bit more (even if, as you've said a few times, "no-one has asked for that"). Designing an API that meets people's (often unstated) expectations isn't always as easy as just writing a convenience function.

Paul

PS I remain neutral on whether the OP's proposal is worth adding, but the conversation has drifted more into abstract questions about what "needs" to be in this API, so take the above on that basis.

On Wed, Dec 26, 2018 at 01:32:38PM +0000, Paul Moore wrote:
As I keep pointing out, the json.load and pickle.load functions don't take on all that added administration. Neither does marshal, or zipfile, and I daresay there are others. Why does "abstraction" apply to this proposal but not the others?

If you pass a file-like object to marshal.load that returns less than a full record, it simply raises an exception. There's no attempt to handle non-blocking streams and re-read until it has a full record:

    py> class MyFile:
    ...     def read(self, n=-1):
    ...         print("reading")
    ...         return marshal.dumps([1, "a"])[:5]
    ...
    py> marshal.load(MyFile())
    reading
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    EOFError: EOF read where object expected

The use-case for marshal.load is to read a valid, complete marshal record from a file on disk. Likewise for json.load and pickle.load. There's no need to complicate the implementation by handling streams from ttys and other exotic file-like objects.

Likewise there's zipfile, which also doesn't take on this extra responsibility. It doesn't try to support non-blocking streams which return None, for example. It assumes the input file is seekable, and doesn't raise a dedicated error for the case that it isn't. Nor does it support non-blocking streams by looping until it has read the data it expects.

The use-case for unpack with a file object argument is the same. Why should we demand that it alone take on this unnecessary, unwanted, unused extra responsibility?

It seems to me that the only people insisting that unpack() take on this extra responsibility are those who are opposed to the proposal. We're asking for a battery, and they're insisting that we actually need a nuclear reactor, and rejecting the proposal because nuclear reactors are too complex.

Here are some of the features that have been piled on to the proposal:

- you need to deal with non-blocking streams that return None;
- if you read an incomplete struct, you need to block and read in a loop until the struct is complete;
- you need to deal with OS errors in some unspecified way, apart from just letting them bubble up to the caller.

The response to all of these is: No, we don't need to do these things; they are all out of scope for the proposal, and other similar functions in the standard library don't do them.

These are examples of over-engineering and YAGNI. *If* (a very big if!) somebody requests these features in the future, then they'll be considered as enhancement requests. The effort required versus the benefit will be weighed up, and if the benefit exceeds the costs, then the function may be enhanced to support streams which return partial records. The benefit will need to be more than just "abstraction".

If there are objective, rational reasons for unpack() taking on these extra responsibilities, when other stdlib code doesn't, then I wish people would explain what those reasons are. Why does "abstraction" apply to struct.unpack() but not json.load()?

I'm willing to be persuaded, I can change my mind. When Andrew suggested that unpack would need extra code to generate better error messages, I tested a few likely exceptions, and ended up agreeing that at least one and possibly two such enhancements were genuinely necessary. Those better error messages ended up in my subsequent proof-of-concept implementations, tripling the size from five lines to fifteen. (A second implementation reduced it to twelve.)
But it irks me when people unnecessarily demand that new proposals are written to standards far beyond what the rest of the stdlib is written to. (I'm not talking about some of the venerable old, crufty parts of the stdlib dating back to Python 1.4, I'm talking about actively maintained, modern parts like json.) Especially when they seem unwilling or unable to explain *why* we need to apply such a high standard. What's so special about unpack() that it has to handle these additional use-cases?

If an objection to a proposal equally applies to parts of the stdlib that are in widespread use without actually being a problem in practice, then the objection is probably invalid. Remember the Zen:

    Now is better than never.
    Although never is often better than *right* now.

Even if we do need to deal with rare, exotic or unusual input, we don't need to deal with them *right now*. When somebody submits an enhancement request "support non-blocking streams", we can deal with it then. Probably by rejecting it.

-- Steve

On 12/24/18, Drew Warwick <dwarwick96@gmail.com> wrote:
The struct unpack API is inconvenient to use with files. I must do:
struct.unpack(fmt, file.read(struct.calcsize(fmt)))
Alternatively, we can memory-map the file via mmap. An important difference is that the mmap buffer interface is low-level (e.g. no file pointer and the offset has to be page aligned), so we have to slice out bytes for the given offset and size. We can avoid copying via memoryview slices. We can also use ctypes instead of memoryview/struct.
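A minimal sketch of that mmap approach, assuming a hypothetical "records.bin" file and a made-up "<HHL" record format; only standard mmap/struct/memoryview calls are used, and the exact layout is purely illustrative:

    import mmap
    import struct

    REC_FMT = "<HHL"                       # record format made up for illustration
    REC_SIZE = struct.calcsize(REC_FMT)

    with open("records.bin", "rb") as f:   # "records.bin" is a hypothetical file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # unpack_from reads straight out of the mapping at an arbitrary
            # offset; no intermediate bytes object is created.
            first = struct.unpack_from(REC_FMT, mm, 0)
            second = struct.unpack_from(REC_FMT, mm, REC_SIZE)

            # A memoryview slice also avoids copying; release it before the
            # mapping is closed, or mmap.close() will complain about exports.
            mv = memoryview(mm)
            third = struct.unpack(REC_FMT, mv[2 * REC_SIZE:3 * REC_SIZE])
            mv.release()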

On Tue, Dec 25, 2018 at 04:51:18PM -0600, eryk sun wrote:
Seems awfully complicated. How do we do all these things, and what advantage does it give?
We can also use ctypes instead of memoryview/struct.
Only if you want non-portable code. What advantage over struct is ctypes? -- Steve

On 12/25/18, Steven D'Aprano <steve@pearwood.info> wrote:
Refer to the mmap and memoryview docs. It is more complex, not significantly, but not something I'd suggest to a novice. Anyway, another disadvantage is that this requires a real OS file, not just a file-like interface.

One possible advantage is that we can work naively and rely on the OS to move pages of the file to and from memory on demand. However, making this really convenient requires the ability to access memory directly with on-demand conversion, as is possible with ctypes (records & arrays) or numpy (arrays). Out of the box, multiprocessing works like this for shared-memory access. For example:

    import ctypes
    import multiprocessing

    class Record(ctypes.LittleEndianStructure):
        _pack_ = 1
        _fields_ = (('a', ctypes.c_int),
                    ('b', ctypes.c_char * 4))

    a = multiprocessing.Array(Record, 2)
    a[0].a = 1
    a[0].b = b'spam'
    a[1].a = 2
    a[1].b = b'eggs'

    >>> a._obj
    <multiprocessing.sharedctypes.Record_Array_2 object at 0x7f96974c9f28>

Shared values and arrays are accessed out of a heap that uses arenas backed by mmap instances:

    >>> a._obj._wrapper._state
    ((<multiprocessing.heap.Arena object at 0x7f96991faf28>, 0, 16), 16)
    >>> a._obj._wrapper._state[0][0].buffer
    <mmap.mmap object at 0x7f96974c4d68>

The two records are stored in this shared memory:

    >>> a._obj._wrapper._state[0][0].buffer[:16]
    b'\x01\x00\x00\x00spam\x02\x00\x00\x00eggs'
ctypes has good support for at least Linux and Windows, but it's an optional package in CPython's standard library and not necessarily available with other implementations.
What advantage over struct is ctypes?
If it's available, I find that ctypes is often more convenient than the manual pack/unpack approach of struct. If we're writing to the file, ctypes lets us directly assign data to arrays and the fields of records on disk (the ctypes instance knows the address and its data descriptors handle converting values implicitly). The tradeoff is that defining structures in ctypes can be tedious (_pack_, _fields_) compared to the simple format strings of the struct module. With ctypes it helps to already be fluent in C.
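To illustrate that convenience, here is a minimal sketch of reading one fixed-size record straight into a ctypes structure; the Record layout reuses the earlier multiprocessing example, while the file name and the use of from_buffer_copy are assumptions added for this sketch:

    import ctypes

    class Record(ctypes.LittleEndianStructure):
        # Same layout as the multiprocessing example above.
        _pack_ = 1
        _fields_ = (('a', ctypes.c_int),
                    ('b', ctypes.c_char * 4))

    # "records.bin" is a hypothetical file containing packed Record data.
    with open("records.bin", "rb") as f:
        rec = Record.from_buffer_copy(f.read(ctypes.sizeof(Record)))

    print(rec.a, rec.b)   # fields convert to Python int / bytes on access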

Handling files overcomplicates both implementation and mental space for API saving. Files can be opened in text mode, what to do in this case? What exception should be raised? How to handle OS errors? On Mon, Dec 24, 2018 at 1:11 PM Drew Warwick <dwarwick96@gmail.com> wrote:
-- Thanks, Andrew Svetlov

On Mon, Dec 24, 2018 at 03:01:07PM +0200, Andrew Svetlov wrote:
Handling files overcomplicates both implementation and mental space for API saving.
Perhaps. Although the implementation doesn't seem that complicated, and the mental space for the API not that much more difficult: unpack from bytes, or read from a file; versus unpack from bytes, which you might read from a file Seems about the same to me, except that with the proposal you don't have to calculate the size of the struct before reading. I haven't thought about this very deeply, but at first glance, I like Drew's idea of being able to just pass an open file to unpack and have it read from the file.
Files can be opened in text mode, what to do in this case? What exception should be raised?
That is easy to answer: the same exception you get if you pass text to unpack() when it is expecting bytes: py> struct.unpack(fmt, "a") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: a bytes-like object is required, not 'str' There should be no difference whether the text comes from a literal, a variable, or is read from a file.
How to handle OS errors?
unpack() shouldn't try to handle them. If an OS error occurs, raise an exception, exactly the same way file.read() would raise an exception. -- Steve

On 12/24/18 7:33 AM, Steven D'Aprano wrote:
On Mon, Dec 24, 2018 at 03:01:07PM +0200, Andrew Svetlov wrote:
Handling files overcomplicates both implementation and mental space for API saving.
The json module has load for files, and loads for bytes and strings, That said, JSON is usually read and decoded all at once, but I can see lots of use cases for ingesting "unpackable" data in little chunks. Similarly (but not really), print takes an optional destination that overrides the default destination of stdout. Ironically, StringIO adapts strings so that they can be used in places that expect open files. What about something like gzip.GzipFile (call it struct.StructFile?), which is basically a specialized file-like class that packs data on writes and unpacks data on reads? Dan

Here's a snippet of semi-production code we use: def read_and_unpack(handle, fmt): size = struct.calcsize(fmt) data = handle.read(size) if len(data) < size: return None return struct.unpack(fmt, data) which was originally something like: def read_and_unpack(handle, fmt, offset=None): if offset is not None: handle.seek(*offset) size = struct.calcsize(fmt) data = handle.read(size) if len(data) < size: return None return struct.unpack(fmt, data) until we pulled file seeking up out of the function. Having struct.unpack and struct.unpack_from support files would seem straightforward and be a nice quality of life change, imo. On Mon, Dec 24, 2018 at 9:36 AM Dan Sommers < 2QdxY4RzWzUUiLuE@potatochowder.com> wrote:

On 24Dec2018 10:19, James Edwards <jheiv@jheiv.com> wrote:
These days I go the other way. I make it easy to get bytes from what I'm working with and _expect_ to parse from a stream of bytes. I have a pair of modules cs.buffer (for getting bytes from things) and cs.binary (for parsing structures from binary data). (See PyPI.) cs.buffer primarily offers a CornuCopyBuffer which manages access to any iterable of bytes objects. It has a suite of factories to make these from binary files, bytes, bytes[], a mmap, etc. Once you've got one of these you have access to a suite of convenient methods. Particularly for grabbing structs, these's a .take() method which obtains a precise number of bytes. (Think that looks like a file read? Yes, and it offers a basic file-like suite of methods too.) Anyway, cs.binary is based of a PacketField base class oriented around pulling a binary structure from a CornuCopyBuffer. Obviously, structs are very common, and cs.binary has a factory: def structtuple(class_name, struct_format, subvalue_names): which gets you a PacketField subclass whose parse methods read a struct and return it to you in a nice namedtuple. Also, PacketFields self transcribe: you can construct one from its values and have it write out the binary form. Once you've got these the tendency is just to make a PacketField instances from that function for the structs you need and then to just grab things from a CornuCopyBuffer providing the data. And you no longer have to waste effort on different code for bytes or files. Example from cs.iso14496: PDInfo = structtuple('PDInfo', '>LL', 'rate initial_delay') Then you can just use PDInfo.from_buffer() or PDInfo.from_bytes() to parse out your structures from then on. I used to have tedious duplicated code for bytes and files in various placed; I'm ripping it out and replacing with this as I encounter it. Far more reliable, not to mention smaller and easier. Cheers, Cameron Simpson <cs@cskk.id.au>

On Mon, 24 Dec 2018 at 13:39, Steven D'Aprano <steve@pearwood.info> wrote:
One difference is that with a file, it's (as far as I can see) impossible to determine whether or not you're going to get bytes or text without reading some data (and so potentially affecting the state of the file object). This might be considered irrelevant (personally, I don't see a problem with a function definition that says "parameter fd must be an object that has a read(length) method that returns bytes" - that's basically what duck typing is all about) but it *is* a distinguishing feature of files over in-memory data. There is also the fact that read() is only defined to return *at most* the requested number of bytes. Non-blocking reads and objects like pipes that can return additional data over time add extra complexity. Again, not insoluble, and potentially simple enough to handle with "read N bytes, if you got something other than bytes or fewer than N of them, raise an error", but still enough that the special cases start to accumulate. The suggestion is a nice convenience method, and probably a useful addition for the majority of cases where it would do exactly what was needed, but still not completely trivial to actually implement and document (if I were doing it, I'd go with the naive approach, and just raise a ValueError when read(N) returns anything other than N bytes, for what it's worth). Paul

On Mon, Dec 24, 2018 at 03:36:07PM +0000, Paul Moore wrote:
Here are two ways: look at the type of the file object, or look at the mode of the file object: py> f = open('/tmp/spam.binary', 'wb') py> g = open('/tmp/spam.text', 'w') py> type(f), type(g) (<class '_io.BufferedWriter'>, <class '_io.TextIOWrapper'>) py> f.mode, g.mode ('wb', 'w')
This might be considered irrelevant
Indeed :-)
But it's not a distinguishing feature between the proposal, and writing: unpack(fmt, f.read(size)) which will also read from the file and affect the file state before failing. So its a difference that makes no difference.
How do they add extra complexity? According to the proposal, unpack() attempts the read. If it returns the correct number of bytes, the unpacking succeeds. If it doesn't, you get an exception, precisely the same way you would get an exception if you manually did the read and passed it to unpack(). Its the caller's responsibility to provide a valid file object. If your struct needs 10 bytes, and you provide a file that returns 6 bytes, you get an exception. There's no promise made that unpack() should repeat the read over and over again, hoping that its a pipe and more data becomes available. It either works with a single read, or it fails. Just like similar APIs as those provided by pickle, json etc which provide load() and loads() functions. In hindsight, the precedent set by pickle, json, etc suggests that we ought to have an unpack() function that reads from files and an unpacks() function that takes a string, but that ship has sailed.
I can understand the argument that the benefit of this is trivial over unpack(fmt, f.read(calcsize(fmt)) Unlike reading from a pickle or json record, its pretty easy to know how much to read, so there is an argument that this convenience method doesn't gain us much convenience. But I'm just not seeing where all the extra complexity and special case handing is supposed to be, except by having unpack make promises that the OP didn't request: - read partial structs from non-blocking files without failing - deal with file system errors without failing - support reading from text files when bytes are required without failing - if an exception occurs, the state of the file shouldn't change Those promises *would* add enormous amounts of complexity, but I don't think we need to make those promises. I don't think the OP wants them, I don't want them, and I don't think they are reasonable promises to make.
Indeed. Except that we should raise precisely the same exception type that struct.unpack() currently raises in the same circumstances: py> struct.unpack("ddd", b"a") Traceback (most recent call last): File "<stdin>", line 1, in <module> struct.error: unpack requires a bytes object of length 24 rather than ValueError. -- Steve

The proposal can generate cryptic messages like `a bytes-like object is required, not 'NoneType'` To produce more informative exception text all mentioned cases should be handled:
When a user calls unpack(fmt, f.read(calcsize(fmt)) the user is responsible for handling all edge cases (or ignore them most likely). If it is a part of a library -- robustness is the library responsibility. On Mon, Dec 24, 2018 at 11:23 PM Steven D'Aprano <steve@pearwood.info> wrote:
-- Thanks, Andrew Svetlov

On Tue, Dec 25, 2018 at 01:28:02AM +0200, Andrew Svetlov wrote:
The proposal can generate cryptic messages like `a bytes-like object is required, not 'NoneType'`
How will it generate such a message? That's not obvious to me. The message doesn't seem cryptic to me. It seems perfectly clear: a bytes-like object is required, but you provided None instead. The only thing which is sub-optimal is the use of "NoneType" (the name of the class) instead of None.
To produce more informative exception text all mentioned cases should be handled:
Why should they? How are the standard exceptions not good enough? The standard library is full of implementations which use ducktyping, and if you pass a chicken instead of a duck you get errors like AttributeError: 'Chicken' object has no attribute 'bill' Why isn't that good enough for this function too? We already have a proof-of-concept implementation, given by the OP. Here is it again: import io, struct def unpackStruct(fmt, frm): if isinstance(frm, io.IOBase): return struct.unpack(fmt, frm.read(struct.calcsize(fmt))) else: return struct.unpack(fmt, frm) Here's the sort of exceptions it generates. For brevity, I have cut the tracebacks down to only the final line: py> unpackStruct("ddd", open("/tmp/spam", "w")) io.UnsupportedOperation: not readable Is that not clear enough? (This is not a rhetorical question.) In what way do you think that exception needs enhancing? It seems perfectly fine to me. Here's another exception that may be fine as given. If the given file doesn't contain enough bytes to fill the struct, you get this: py> __ = open("/tmp/spam", "wb").write(b"\x10") py> unpackStruct("ddd", open("/tmp/spam", "rb")) struct.error: unpack requires a bytes object of length 24 It might be *nice*, but hardly *necessary*, to re-word the error message to make it more obvious that we're reading from a file, but honestly that should be obvious from context. There are certainly worse error messages in Python. Here is one exception which should be reworded: py> unpackStruct("ddd", open("/tmp/spam", "r")) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 3, in unpackStruct TypeError: a bytes-like object is required, not 'str' For production use, that should report that the file needs to be opened in binary mode, not text mode. Likewise similar type errors should report "bytes-like or file-like" object. These are minor enhancements to exception reporting, and aren't what I consider to be adding complexity in any meaningful sense. Of course we should expect that library-quality functions will have more error checking and better error reporting than a simple utility function for you own use. The OP's simple implementation is a five line function. Adding more appropriate error messages might, what? Triple it? That surely is an argument for *doing it right, once* in the standard library, rather than having people re-invent the wheel over and over. def unpackStruct(fmt, frm): if isinstance(frm, io.IOBase): if isinstance(frm, io.TextIOBase): raise TypeError('file must be opened in binary mode, not text') n = struct.calcsize(fmt) value = frm.read(n) assert isinstance(value, bytes) if len(value) != n: raise ValueError( 'expected %d bytes but only got %d' % (n, len(value)) ) return struct.unpack(fmt, value) else: return struct.unpack(fmt, frm) I think this is a useful enhancement to unpack(). If we were designing the struct module from scratch today, we'd surely want unpack() to read from files and unpacks() to read from a byte-string, mirroring the API of json, pickle, and similar. But given the requirement for backwards compatibility, we can't change the fact that unpack() works with byte-strings. So we can either add a new function, unpack_from_file() or simply make unpack() a generic function that accepts either a byte-like interface or a file-like interface. I vote for the generic function approach. (Or do nothing, of course.) So far, I'm not seeing any substantial arguments for why this isn't useful, or too difficult to implement. 
If anything, the biggest argument against it is that it is too simple to bother with (but that argument would apply equally to the pickle and json APIs). "Not every ~~one~~ fifteen line function needs to be in the standard library." -- Steve

On Wed, Dec 26, 2018 at 7:12 AM Steven D'Aprano <steve@pearwood.info> wrote:
The perfect demonstration of io objects' complexity. `stream.read(N)` can return None by spec if the file is non-blocking and has no ready data. Confusing, but still possible and documented behavior.
`.read(N)` can return fewer bytes by definition; that's true starting from the very low-level read(2) syscall. Otherwise there is a (low) chance of broken code with a very non-obvious error message.
-- Thanks, Andrew Svetlov

On Wed, Dec 26, 2018 at 09:48:15AM +0200, Andrew Svetlov wrote:
https://docs.python.org/3/library/io.html#io.RawIOBase.read

Regardless, my point doesn't change. That has nothing to do with the behaviour of unpack. If you pass a non-blocking file-like object which returns None, you get exactly the same exception as if you wrote

    unpack(fmt, f.read(size))

and the call to f.read returned None. Why is it unpack's responsibility to educate the caller that f.read can return None?

Let's see what other functions with similar APIs do.

    py> class FakeFile:
    ...     def read(self, n=-1):
    ...         return None
    ...     def readline(self):
    ...         return None
    ...
    py> pickle.load(FakeFile())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: a bytes-like object is required, not 'NoneType'

    py> json.load(FakeFile())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/json/__init__.py", line 268, in load
        parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
      File "/usr/local/lib/python3.5/json/__init__.py", line 312, in loads
        s.__class__.__name__))
    TypeError: the JSON object must be str, not 'NoneType'

If it is good enough for pickle and json load() functions to report a TypeError like this, it is good enough for unpack(). Not every exception needs a custom error message.
You need to repeat reads until collecting the value of enough size.
That's not what the OP has asked for, it isn't what the OP's code does, and it's not what I've suggested.

Do pickle and json block and repeat the read until they have a complete object? I'm pretty sure they don't -- the source for json.load() that I have says:

    return loads(fp.read(), ... )

so it definitely doesn't repeat the read. I think it is so unlikely that pickle blocks waiting for extra input that I haven't even bothered to look.

Looping and repeating the read is a clear case of YAGNI. Don't over-engineer the function, and then complain that the over-engineered function is too complex. There is no need for unpack() to handle streaming input which can output anything less than a complete struct per read.
`.read(N)` can return less bytes by definition,
Yes, we know that. And if it returns fewer bytes, then you get a nice, clear exception. -- Steve
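For concreteness, the "repeat reads until you have enough" behaviour being debated above would look something like the sketch below. The helper names (read_exactly, unpack_from_stream) are made up for illustration and are not part of any proposal in this thread:

    import struct

    def read_exactly(fp, n):
        # Keep calling read() until n bytes have been collected, tolerating
        # short reads; a None result (non-blocking stream with no data ready)
        # or EOF before n bytes is treated as an error.
        chunks = []
        remaining = n
        while remaining:
            chunk = fp.read(remaining)
            if not chunk:
                raise ValueError('expected %d bytes but only got %d'
                                 % (n, n - remaining))
            chunks.append(chunk)
            remaining -= len(chunk)
        return b''.join(chunks)

    def unpack_from_stream(fmt, fp):
        return struct.unpack(fmt, read_exactly(fp, struct.calcsize(fmt)))

This is exactly the extra machinery the simple proposal deliberately leaves out.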

On Wed, Dec 26, 2018 at 11:26 AM Steven D'Aprano <steve@pearwood.info> wrote:
Restricting fp to BufferedIOBase looks viable, though that is not the same thing as a file-like object. Also, I'm thinking about type annotations in typeshed. Now the type is

    Union[array[int], bytes, bytearray, memoryview]

Should it be

    Union[io.BinaryIO, array[int], bytes, bytearray, memoryview]

? What is the behavior of unpack_from(fp, offset=120)? Should iter_unpack() read the whole buffer from the file into memory before emitting the first value?
-- Thanks, Andrew Svetlov

On Wed, Dec 26, 2018, 02:19 Andrew Svetlov <andrew.svetlov@gmail.com> wrote:
Yeah, trying to support both buffers and file-like objects in the same function seems like a clearly bad idea. If we do this at all it should be by adding new convenience functions/methods that take file-like objects exclusively, like the ones several people posted on the thread. I don't really have an opinion on whether this is worth doing at all. I guess I can think of some arguments against: Packing/unpacking multiple structs to the same file-like object may be less efficient than using a single buffer + a single call to read/write. And it's unfortunate that the obvious pack_into/unpack_from names are already taken. And it's only 2 lines of code to write your own helpers. But none of these are particularly strong arguments either, and clearly some people would find them handy. -n
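As a rough illustration of what such file-only convenience functions might look like, here is a sketch; the names unpack_file and pack_file are hypothetical, not an existing or proposed API:

    import struct

    def unpack_file(fmt, fp):
        # Read exactly one struct's worth of bytes from a binary file object.
        return struct.unpack(fmt, fp.read(struct.calcsize(fmt)))

    def pack_file(fp, fmt, *values):
        # Pack the values and write them to a binary file object.
        fp.write(struct.pack(fmt, *values))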

On Wed, Dec 26, 2018 at 12:18:23PM +0200, Andrew Svetlov wrote: [...]
This is complexity that isn't the unpack() function's responsibility to care about. All it wants is to call read(N) and get back N bytes. If it gets back anything else, that's an error.
Restriction fp to BufferedIOBase looks viable though, but it is not a file-like object.
There is no need to restrict it to BufferedIOBase.

In hindsight, I am not even sure we should do an isinstance check at all. Surely all we care about is that the object has a read() method which takes a single argument, and returns that number of bytes?

Here's another proof-of-concept implementation which doesn't require any isinstance checks on the argument. The only type checking it does is to verify that the read returns bytes, and even that is only a convenience so it can provide a friendly error message.

    def unpackStruct(fmt, frm):
        try:
            read = frm.read
        except AttributeError:
            return struct.unpack(fmt, frm)
        n = struct.calcsize(fmt)
        value = read(n)
        if not isinstance(value, bytes):
            raise TypeError('read method must return bytes')
        if len(value) != n:
            raise ValueError('expected %d bytes but only got %d'
                             % (n, len(value)))
        return struct.unpack(fmt, value)

[...]
What is behavior of unpack_from(fp, offset=120)?
I don't know. What does the "offset" parameter do, and who requested it? I didn't, and neither did the OP Drew Warwick. James Edwards wrote that he too uses a similar function in production, one which originally did support file seeking, but they took it out. If you are suggesting an offset parameter to the unpack() function, it is up to you to propose what meaning it will have and justify why it should be part of unpack's API. Until then, YAGNI.
Should iter_unpack() read the whole buffer from file into a memory before emitting a first value?
Nobody has requested any changes to iter_unpack(). -- Steve

On 26Dec2018 12:18, Andrew Svetlov <andrew.svetlov@gmail.com> wrote:
Oh, it is better than that. At the low level, even blocking streams can return short reads - particularly serial streams like ttys and TCP connections.
And this is why I, personally, think augmenting struct.unpack and json.read and a myriad of other arbitrary methods to accept both file-like things and bytes is an open ended can of worms.

And it is why I wrote myself my CornuCopyBuffer class (see my other post in this thread). Its entire purpose is to wrap an iterable of bytes-like objects and do all that work via convenient methods. And which has factory methods to make these from files or other common things.

Given a CornuCopyBuffer `bfr`:

    S = struct('spec-here...')
    sbuf = bfr.take(S.size)
    result = S.unpack(sbuf)

Under the covers `bfr` takes care of short "reads" (iteration values) etc. in the underlying iterable. The return from .take is typically a memoryview from `bfr`'s internal buffer - it is _always_ exactly `size` bytes long if you don't pass short_ok=True, or it raises an exception. And so on.

The point here is: make a class to get what you actually need, and _don't_ stuff variable and hard to agree on extra semantics inside multiple basic utility classes like struct.

For myself, the CornuCopyBuffer is now my universal interface to byte streams (binary files, TCP connections, whatever) which need binary parsing, and it has the methods and internal logic to provide that, including presenting a simple read only file-like interface with read and seek-forward, should I need to pass it to a file-expecting object.

Do it _once_, and don't megacomplicatise all the existing utility classes.

Cheers, Cameron Simpson <cs@cskk.id.au>

And this is why I, personally, think augmenting struct.unpack and json.read and a myriad of other arbitrary methods to accept both file-like things and bytes is an open ended can of worms.
And it is why I wrote myself my CornuCopyBuffer class (see my other post in this thread).
Seems like that should be in the standard library then! / Anders

On 27Dec2018 02:53, Anders Hovmöller <boxed@killingar.net> wrote:
It is insufficiently used at present. The idea seems sound - a flexible adapter of bytes sources providing easy methods to aid parsing - based on how useful it has been to me. But it has rough edges and one needs to convince others of its utility before entry into the stdlib. So it is on PyPI for easy use. If you're in the binary I/O/parsing space, pip install it (and cs.binary, which utilises it) and see how they work for you. Complain to me about poor semantics or bugs. And then we'll see how general purpose it really is. The PyPI package pages for each have doco derived from the module docstrings. Cheers, Cameron Simpson <cs@cskk.id.au>

On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote: [...]
I presume you mean json.load(), not read, except that it already reads from files.

Nobody is talking about augmenting "a myriad of other arbitrary methods" except for you. We're talking about enhancing *one* function to be a simple generic function.

I assume you have no objection to the existence of json.load() and json.loads() functions. (If you do think they're a bad idea, I don't know what to say.) Have they led to "an open ended can of worms"?

If we wrote a simple wrapper:

    def load(obj, *args, **kwargs):
        if isinstance(obj, str):
            return json.loads(obj, *args, **kwargs)
        else:
            return json.load(obj, *args, **kwargs)

would that lead to "an open ended can of worms"?

These aren't rhetorical questions. I'd like to understand your objection. You have dismissed what seems to be a simple enhancement with a vague statement about hypothetical problems. Please explain in concrete terms what these figurative worms are.

Let's come back to unpack. Would you object to having two separate functions that matched (apart from the difference in name) the API used by json, pickle, marshal etc?

- unpack() reads from files
- unpacks() reads from strings

Obviously this breaks backwards compatibility, but if we were designing struct from scratch today, would this API open a can of worms? (Again, this is not a rhetorical question.)

Let's save backwards compatibility:

- unpack() reads from strings
- unpackf() reads from files

Does this open a can of worms?

Or we could use a generic function. There is plenty of precedent for generic file arguments in the stdlib. For example, zipfile accepts either a file name, or an open file object.

    def unpack(fmt, frm):
        if hasattr(frm, "read"):
            return _unpack_file(fmt, frm)
        else:
            return _unpack_bytes(fmt, frm)

Does that generic function wrapper create "an open ended can of worms"? If so, in what way?

I'm trying to understand where the problem lies, between the existing APIs used by json etc (presumably they are fine) and the objections to using what seems to be a very similar API for unpack, offering the same functionality but differing only in spelling (a single generic function instead of two similarly-named functions).
That's exactly the proposed semantics for unpack, except there's no "short_ok" parameter. If the read is short, you get an exception.
And so on.
The point here is: make a class to get what you actually need
Do you know better than the OP (Drew Warwick) and James Edwards what they "actually need"?

How would you react if I told you that your CornuCopyBuffer class is an over-engineered, over-complicated, over-complex class that you don't need? You'd probably be pretty pissed off at my arrogance in telling you what you do or don't need for your own use-cases. (Especially since I don't know your use-cases.) Now consider that you are telling Drew and James that they don't know their own use-cases, despite the fact that they've been working successfully with this simple enhancement for years.

I'm happy for you that CornuCopyBuffer solves real problems for you, and if you want to propose it for the stdlib I'd be really interested to learn more about it. But this is actually irrelevant to the current proposal. Even if we had a CornuCopyBuffer in the std lib, how does that help? We will still need to call struct.calcsize(format) by hand, still need to call read(size) by hand. Your CornuCopyBuffer does nothing to avoid that.

The point of this proposal is to avoid that tedious make-work, not increase it by having to wrap our simple disk files in a CornuCopyBuffer before doing precisely the same make-work we didn't want to do in the first place.

Drew has asked for a better hammer, and you're telling him he really wants a space shuttle.

-- Steve

I'm quoting Steve's post here but am responding more broadly to the whole thread too. On Thu, Dec 27, 2018 at 1:00 PM Steven D'Aprano <steve@pearwood.info> wrote:
Personally, I'd actually be -0 on json.load if it didn't already exist. It's just a thin wrapper around json.loads() - it doesn't actually add anything. This proposal is _notably better_ in that it will (attempt to) read the correct number of bytes. The only real reason to have json.load/json.loads is to match pickle etc. (Though pickle does things the other way around, at least in the Python source code I have handy - loads is implemented using BytesIO, so it's the file-based API that's fundamental, as opposed to JSON where the string-based API is fundamental. I guess maybe that's a valid reason? To allow either one to be implemented in terms of the other?) But reading a struct *and then leaving the rest behind* is, IMO, a more valuable feature.
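A minimal sketch of the relationship described above - the bytes-based API written as a thin wrapper over the file-based one. This mirrors the idea rather than reproducing CPython's actual source:

    import io
    import pickle

    def loads(data):
        # Wrap the bytes in an in-memory file and delegate to the file API.
        return pickle.load(io.BytesIO(data))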
Not in my opinion, but I also don't think it gains you anything much. It isn't consistent with other stdlib modules, and it isn't very advantageous over the OP's idea of just having the same function able to cope with files as well as strings. The only advantage that I can see is that unpackf() might be made able to accept a pathlike, which it will open, read from, and close. (Since a pathlike could be a string, the single function would technically be ambiguous.) And I'd drop that idea in the YAGNI basket.
FTR, I am +0.9 on this kind of proposal - basically "just make it work" within the existing API. It's a small amount of additional complexity to support a quite reasonable use-case.
Drew has asked for a better hammer, and you're telling him he really wants a space shuttle.
But but.... a space shuttle is very effective at knocking nails into wood... also, I just want my own space shuttle. Plz? Thx. Bye! :) ChrisA

On 27Dec2018 12:59, Steven D'Aprano <steve@pearwood.info> wrote:
Likely. Though the json module is string oriented (though if one has UTF-8 data, turning binary into that is easy).
Yes, but that is how the rot sets in. Some here want to enhance json.load/loads. The OP wants to enhance struct.unpack. Yay. Now let's also do csv.reader. Etc.

I think my point is twofold: once you start down this road you (a) start doing it to every parser in the stdlib and (b) we all start bikeshedding about semantics.

There are at least two roads to such enhancement: make the functions polymorphic, coping with files or bytes/strs (depending), or make a parallel suite of functions like json.load/loads. The latter is basically API bloat to little advantage. The former is rather slippery - I've a few functions myself with accept-str-or-file call modes, and _normally_ the "str" flavour is taken as a filename. But... if the function is a string parser, maybe it should parse the string itself? Already the choices are messy.

And both approaches have much bikeshedding. Some of us would like something like struct.unpack to pull enough data from the file even if the file returns short reads. You, I gather, generally like the shim to be very shallow and have a short read cause an exception through insufficient data. Should the file version support an optional seek/offset argument? The example from James suggests that such a thing would benefit him. And so on.

And this argument has to play out for _every_ parser interface you want to adapt for both files and direct bytes/str (again, depending).
On their own, no. The isolated example never starts that way. But really consistency argues that the entire stdlib should have file and str/bytes parallel functions across all parsers. And _that_ is a can of worms.
Less so. I've a decorator of my own called @strable, which wraps other functions; it intercepts the first positional argument if it is a str and replaces it with something derived from it. The default mode is an open file, with the str as the filename, but it is slightly pluggable. Such a decorator could reside in a utility stdlib module and become heavily used in places like json.load if desired.
I'm hoping my discussion above shows where I think the open-ended side of the issue arises: once we do it to one function we sort of want to do it to all similar functions, and there are multiple defensible ways to do it.
Well, yeah. (Presuming you mean bytes rather than strings above in the Python 3 domain.) API bloat. They are essentially identical functions in terms of utility.
Only in that it opens the door to doing the same for every other similar function in the stdlib. And wouldn't it be nice to have a third form to take a filename and open it?
Let's save backwards compatibility:
Some degree of objection: API bloat requiring repeated bloat elsewhere. Let's set backwards compatibility aside: it halves the discussion and examples.
Indeed, and here we are with flavour #3: the string isn't a byte sequence to parse, it is now a filename. In Python 3 we can disambiguate if we parse bytes and treat str as a filename. But what if we're parsing str, as JSON does? Now we don't know and must make a policy decision.
If you were to rewrite the above in the form of my @strable decorator, provide it in a utility library, and _use_ it in unpack, I'd be +1, because the _same_ utility can be reused elsewhere by anyone for any API. Embedding it directly in unpack complicates unpack's semantics for what is essentially a shim.

Here's my @strable, minus its docstring:

    @decorator
    def strable(func, open_func=None):
        if open_func is None:
            open_func = open
        def accepts_str(arg, *a, **kw):
            if isinstance(arg, str):
                with Pfx(arg):
                    with open_func(arg) as opened:
                        return func(opened, *a, **kw)
            return func(arg, *a, **kw)
        return accepts_str

and an example library function:

    @strable
    def count_lines(f):
        count = 0
        for line in f:
            count += 1
        return count

and there's a function taking an open file or a filename. But suppose we want to supply a string whose lines need counting, not a filename. We could _either_ change our policy decision from "accepts a filename" to "accepts an input string", _or_ we can start adding a third mode on top of the existing two modes. All three modes are reasonable.
I'm trying to understand where the problem lies, between the existing APIs used by json etc (presumably they are fine)
They're historic. I think I'm -0 on having 2 functions. But only because it is so easy to hand file contents to loads.
I hope I've made it more clear above that my objection is to either approach (polymorphic or parallel functions) because one can write a general purpose shim and use it with almost anything, and then we can make things like json or struct accept _only_ str or bytes respectively, with _no_ complicating extra semantics. Because once we do it for these 2 we _should_ do it for every parser for consistency.

Yes, yes, stripping json _back_ to just loads would break backwards compatibility; I'm not proposing that for real. I'm proposing resisting extra semantic bloat in favour of a helper class or decorator. Consider:

    from shimutils import bytes_from_file
    from struct import unpack

    unpackf = bytes_from_file(unpack)

Make a bunch of shims for the common use cases and the burden on users of the various _other_ modules becomes very small, and we don't have to go to every parser API and bloat it out. Especially since we've seen the bikeshedding on semantics even on this small suggestion ("accept a file").
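One possible shape for such a shim, sketched here under the assumption that shimutils and bytes_from_file are just the hypothetical names used in the message above rather than existing code:

    import struct

    def bytes_from_file(func, calcsize=struct.calcsize):
        # Wrap a (fmt, bytes) function so it also accepts (fmt, file):
        # if the second argument has a read() method, read exactly one
        # struct's worth of bytes from it before delegating.
        def wrapper(fmt, frm):
            if hasattr(frm, 'read'):
                frm = frm.read(calcsize(fmt))
            return func(fmt, frm)
        return wrapper

    unpackf = bytes_from_file(struct.unpack)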
And here we are. Bikeshedding already!

My CCB.take (for short) raises an exception on _insufficient_ data, not a short read. It does enough reads to get the data demanded. If I _want_ to know that a read was short I can pass short_ok=True and examine the result before use. Its whole point is to give the right data to the caller.

Let me give you some examples:

I run some binary protocols over TCP streams. They're not network packets; the logical packets can span IP packets, and of course conversely several small protocol packets may fit in a single network packet because they're assembled in a buffer at the sending end (via plain old file.write). Via a CCB the receiver _doesn't care_. Ask for the required data, the CCB gathers enough and hands it over.

I parse MP4 files. The ISO14496 packet structure has plenty of structures of almost arbitrary size, particularly the media data packet (MDAT) which can be gigabytes in size. You're _going_ to get a short read there. I'd be annoyed by an exception.
No, but I know what _I_ need. A flexible controller with several knobs to treat input in various common ways.
Some examples above. There's a _little_ over engineering, but it actually solves a _lot_ of problems, making everything else MUCH MUCH simpler.
I'm not. I'm _suggesting_ that _instead_ of embedding extra semantics, which we can't even all agree on, into parser libraries, it is often better to make it easy to give the parser what its _current_ API accepts. And that the tool to do that should be _outside_ those parser modules, not inside, because it can be generally applicable.
Not yet. Slightly rough, and the user audience is basically me right now. But feel free to pip install cs.buffer and cs.binary and have a look.
No, but its partner cs.binary _does_. As described in my first post to this thread. Have a quick reread, particularly near the "PDInfo" example.
To my eye he asked to make unpack into a multitool (bytes and files), and I'm suggesting maybe he should get a screwdriver to go with his hammer (to use as a chisel, of course).

Anyway, I'm making 2 arguments:

- don't bloat the stdlib APIs to accommodate things much beyond their core
- offer a tool to make the things beyond the core _easily_ available for use in the core way

The latter can then _also_ be used with other APIs not yet extended.

Cheers, Cameron Simpson <cs@cskk.id.au>

On Wed, 26 Dec 2018 at 09:26, Steven D'Aprano <steve@pearwood.info> wrote:
Abstraction, basically - once the unpack function takes responsibility for doing the read, and hiding the fact that there's a read going on behind an API unpack(fmt, f), it *also* takes on responsibility for managing all of the administration of that read call. It's perfectly at liberty to do so by saying "we do a read() behind the scenes, so you get the same behaviour as if you did that read() yourself", but that's a pretty thin layer of abstraction (and people often expect something less transparent). As I say, you *can* define the behaviour as you say, but it shouldn't be surprising if people expect a bit more (even if, as you've said a few times, "no-one has asked for that"). Designing an API that meets people's (often unstated) expectations isn't always as easy as just writing a convenience function. Paul PS I remain neutral on whether the OP's proposal is worth adding, but the conversation has drifted more into abstract questions about what "needs" to be in this API, so take the above on that basis.

On Wed, Dec 26, 2018 at 01:32:38PM +0000, Paul Moore wrote:
As I keep pointing out, the json.load and pickle.load functions don't take on all that added administration. Neither does marshal, or zipfile, and I daresay there are others. Why does "abstraction" apply to this proposal but not the others?

If you pass a file-like object to marshal.load that returns less than a full record, it simply raises an exception. There's no attempt to handle non-blocking streams and re-read until it has a full record:

    py> class MyFile:
    ...     def read(self, n=-1):
    ...         print("reading")
    ...         return marshal.dumps([1, "a"])[:5]
    ...
    py> marshal.load(MyFile())
    reading
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    EOFError: EOF read where object expected

The use-case for marshal.load is to read a valid, complete marshal record from a file on disk. Likewise for json.load and pickle.load. There's no need to complicate the implementation by handling streams from ttys and other exotic file-like objects.

Likewise there's zipfile, which also doesn't take on this extra responsibility. It doesn't try to support non-blocking streams which return None, for example. It assumes the input file is seekable, and doesn't raise a dedicated error for the case that it isn't. Nor does it support non-blocking streams by looping until it has read the data it expects.

The use-case for unpack with a file object argument is the same. Why should we demand that it alone take on this unnecessary, unwanted, unused extra responsibility?

It seems to me that the only people insisting that unpack() take on this extra responsibility are those who are opposed to the proposal. We're asking for a battery, and they're insisting that we actually need a nuclear reactor, and rejecting the proposal because nuclear reactors are too complex.

Here are some of the features that have been piled on to the proposal:

- you need to deal with non-blocking streams that return None;
- if you read an incomplete struct, you need to block and read in a loop until the struct is complete;
- you need to deal with OS errors in some unspecified way, apart from just letting them bubble up to the caller.

The response to all of these is: No, we don't need to do these things; they are all out of scope for the proposal, and other similar functions in the standard library don't do them.

These are examples of over-engineering and YAGNI. *If* (a very big if!) somebody requests these features in the future, then they'll be considered as enhancement requests. The effort required versus the benefit will be weighed up, and if the benefit exceeds the costs, then the function may be enhanced to support streams which return partial records. The benefit will need to be more than just "abstraction".

If there are objective, rational reasons for unpack() taking on these extra responsibilities, when other stdlib code doesn't, then I wish people would explain what those reasons are. Why does "abstraction" apply to struct.unpack() but not json.load()?

I'm willing to be persuaded, I can change my mind. When Andrew suggested that unpack would need extra code to generate better error messages, I tested a few likely exceptions, and ended up agreeing that at least one and possibly two such enhancements were genuinely necessary. Those better error messages ended up in my subsequent proof-of-concept implementations, tripling the size from five lines to fifteen. (A second implementation reduced it to twelve.)
But it irks me when people unnecessarily demand that new proposals are written to standards far beyond what the rest of the stdlib is written to. (I'm not talking about some of the venerable old, crufty parts of the stdlib dating back to Python 1.4, I'm talking about actively maintained, modern parts like json.) Especially when they seem unwilling or unable to explain *why* we need to apply such a high standard. What's so special about unpack() that it has to handle these additional use-cases?

If an objection to a proposal equally applies to parts of the stdlib that are in widespread use without actually being a problem in practice, then the objection is probably invalid. Remember the Zen:

    Now is better than never.
    Although never is often better than *right* now.

Even if we do need to deal with rare, exotic or unusual input, we don't need to deal with them *right now*. When somebody submits an enhancement request "support non-blocking streams", we can deal with it then. Probably by rejecting it.

-- Steve

On 12/24/18, Drew Warwick <dwarwick96@gmail.com> wrote:
The struct unpack API is inconvenient to use with files. I must do:
struct.unpack(fmt, file.read(struct.calcsize(fmt))
Alternatively, we can memory-map the file via mmap. An important difference is that the mmap buffer interface is low-level (e.g. no file pointer and the offset has to be page aligned), so we have to slice out bytes for the given offset and size. We can avoid copying via memoryview slices. We can also use ctypes instead of memoryview/struct.
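For example, something along these lines is a minimal sketch of the mmap + memoryview approach; the file name, record format and offset are made up, and a suitably sized binary file is assumed to exist:

    import mmap
    import struct

    fmt = '<id4s'
    with open('records.bin', 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = memoryview(mm)          # zero-copy view of the mapping
            # unpack_from accepts an arbitrary offset, so only the mapping
            # itself has to respect the OS page-alignment rules.
            fields = struct.unpack_from(fmt, view, 128)
            view.release()                 # release before the mmap closes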

On Tue, Dec 25, 2018 at 04:51:18PM -0600, eryk sun wrote:
Seems awfully complicated. How do we do all these things, and what advantage does it give?
We can also use ctypes instead of memoryview/struct.
Only if you want non-portable code. What advantage over struct is ctypes? -- Steve

On 12/25/18, Steven D'Aprano <steve@pearwood.info> wrote:
Refer to the mmap and memoryview docs. It is more complex, not significantly, but not something I'd suggest to a novice. Anyway, another disadvantage is that this requires a real OS file, not just a file-like interface.

One possible advantage is that we can work naively and rely on the OS to move pages of the file to and from memory on demand. However, making this really convenient requires the ability to access memory directly with on-demand conversion, as is possible with ctypes (records & arrays) or numpy (arrays).

Out of the box, multiprocessing works like this for shared-memory access. For example:

    import ctypes
    import multiprocessing

    class Record(ctypes.LittleEndianStructure):
        _pack_ = 1
        _fields_ = (('a', ctypes.c_int),
                    ('b', ctypes.c_char * 4))

    a = multiprocessing.Array(Record, 2)
    a[0].a = 1
    a[0].b = b'spam'
    a[1].a = 2
    a[1].b = b'eggs'

    >>> a._obj
    <multiprocessing.sharedctypes.Record_Array_2 object at 0x7f96974c9f28>

Shared values and arrays are accessed out of a heap that uses arenas backed by mmap instances:

    >>> a._obj._wrapper._state
    ((<multiprocessing.heap.Arena object at 0x7f96991faf28>, 0, 16), 16)
    >>> a._obj._wrapper._state[0][0].buffer
    <mmap.mmap object at 0x7f96974c4d68>

The two records are stored in this shared memory:

    >>> a._obj._wrapper._state[0][0].buffer[:16]
    b'\x01\x00\x00\x00spam\x02\x00\x00\x00eggs'
ctypes has good support for at least Linux and Windows, but it's an optional package in CPython's standard library and not necessarily available with other implementations.
What advantage over struct is ctypes?
If it's available, I find that ctypes is often more convenient than the manual pack/unpack approach of struct. If we're writing to the file, ctypes lets us directly assign data to arrays and the fields of records on disk (the ctypes instance knows the address and its data descriptors handle converting values implicitly). The tradeoff is that defining structures in ctypes can be tedious (_pack_, _fields_) compared to the simple format strings of the struct module. With ctypes it helps to already be fluent in C.
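A small sketch of that convenience tradeoff, reusing the same Record layout as the earlier example (the data here is just illustrative):

    import ctypes
    import struct

    class Record(ctypes.LittleEndianStructure):
        _pack_ = 1
        _fields_ = (('a', ctypes.c_int), ('b', ctypes.c_char * 4))

    data = struct.pack('<i4s', 1, b'spam')

    # struct: a terse format string, but you get back an anonymous tuple.
    a, b = struct.unpack('<i4s', data)

    # ctypes: more verbose to declare, but fields are named and conversion
    # happens implicitly on attribute access.
    rec = Record.from_buffer_copy(data)
    assert (rec.a, rec.b) == (a, b)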
participants (11)
- Anders Hovmöller
- Andrew Svetlov
- Cameron Simpson
- Chris Angelico
- Dan Sommers
- Drew Warwick
- eryk sun
- James Edwards
- Nathaniel Smith
- Paul Moore
- Steven D'Aprano