[Python-ideas] struct.unpack should support open files
Cameron Simpson
cs at cskk.id.au
Thu Dec 27 01:14:35 EST 2018
On 27Dec2018 12:59, Steven D'Aprano <steve at pearwood.info> wrote:
>On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote:
>[...]
>> >Also I'm thinking about type annotations in typeshed.
>> >Now the type is Union[array[int], bytes, bytearray, memoryview]
>> >Should it be Union[io.BinaryIO, array[int], bytes, bytearray,
>> >memoryview] ?
>>
>> And this is why I, personally, think augumenting struct.unpack and
>> json.read and a myriad of other arbitrary methods to accept both
>> file-like things and bytes is an open ended can of worms.
>
>I presume you mean json.load(), not read, except that it already reads
>from files.
Likely. Though the json module is string oriented (though if one has
UTF-8 data, turning binary into that is easy).
>Nobody is talking about augmenting "a myriad of other arbitrary methods"
>except for you. We're talking about enhancing *one* function to be a
>simple generic function.
Yes, but that is how the rot sets in.
Some here want to enhance json.load/loads. The OP wants to enhance
struct.unpack. Yay. Now let's also do csv.reader. Etc.
I think my point is twofold: once you start down this road you (a) start
doing it to every parser in the stdlib and (b) we all start bikeshedding
about semantics.
There are at least two roads to such enhancement: make the functions
polymorphic, coping with files or bytes/strs (depending), or make a
parallel suite of functions like json.load/loads.
The latter is basicly API bloat to little advantage. The former is
rather slippery - I've a few functions myself with accept-str-or-file
call modes, and _normally_ the "str" flavour is taken as a filename.
But... if the function is a string parser, maybe it should parse the
string itself? Already the choices are messy.
And both approaches have much bikeshedding. Some of us would like
something like struct.unpack to pull enough data from the file even if
the file returns short reads. You, I gather, generally like the shim to
be very shallow and have a short read cause an exception through
insufficient data. Should the file version support an optional
seek/offset argument? The example from James suggests that such a thing
would benefit him. And so on.
And this argument has to play out for _every_ parser interface you want
to adapt for both files and direct bytes/str (again, depending).
>I assume you have no objection to the existence of json.load() and
>json.loads() functions. (If you do think they're a bad idea, I don't
>know what to say.) Have they lead to "an open ended can of worms"?
On their own, no. The isolated example never starts that way. But really
consistency argues that the entire stdlib should have file and str/bytes
parallel functions across all parsers. And _that_ is a can of worms.
>If we wrote a simple wrapper:
>
>def load(obj, *args, **kwargs):
> if isinstance(obj, str):
> return json.loads(obj, *args, **kwargs)
> else:
> return json.load(obj, *args, **kwargs)
>
>would that lead to "an open ended can of worms"?
Less so. I've a decorator of my own called @strable, which wraps other
functions; it intercepts the first positional argument if it is a str
and replaces it with something derived from it. The default mode is an
open file, with the str as the filename, but it is slightly pluggable.
Such a decorator could reside in a utility stdlib module and become
heavily used in places like json.load if desired.
>These aren't rhetoricial questions. I'd like to understand your
>objection. You have dismissed what seems to be a simple enhancement with
>a vague statement about hypothetical problems. Please explain in
>concrete terms what these figurative worms are.
I'm hoping my discussion above shows where I think the opn ended side of
the issue arises: once we do it to one function we sort of want to do it
to all similar functions, and there are multiple defensible ways to do
it.
>Let's come back to unpack. Would you object to having two separate
>functions that matched (apart from the difference in name) the API used
>by json, pickle, marshal etc?
>
>- unpack() reads from files
>- unpacks() reads from strings
Well, yeah. (Presuming you mean bytes rather than strings above in the
Python 3 domain.) API bloat. There are essentially identical functions
in terms of utility.
>Obviously this breaks backwards compatibility, but if we were designing
>struct from scratch today, would this API open a can of worms?
>(Again, this is not a rhetorical question.)
Only in that it opens the door to doing the same for every other similar
function in the stdlib. And wouldn't it be nice to have a third form to
take a filename and open it?
>Let's save backwards compatibility:
Some degree of objection: API bloat requiring repated bloat elsewhere.
Let's set backwards compatibility aside: it halves the discussion and
examples.
>Or we could use a generic function. There is plenty of precedent for
>generic files in the stdlib. For example, zipfile accepts either
>a file name, or an open file object.
Indeed, and here we are with flavour #3: the string isn't a byte
sequence to parse, it is now a filename. In Python 3 we can disambiuate
if we parse bytes and treat str as a filename. But what if we're parsing
str, as JSON does? Now we don't know and must make a policy decision.
>def unpack(fmt, frm):
> if hasattr(frm, "read"):
> return _unpack_file(fmt, frm)
> else:
> return _unpack_bytes(fmt, frm)
>
>Does that generic function wrapper create "an open ended can of worms"?
>If so, in what way?
If you were to rewrite the above in the form of my @strable decorator,
provide it in a utility library, and _use_ it in unpack, I'd be +1,
because the _same_ utility can be reused elsewhere by anyone for any
API. Embedding it directly in unpack complicates unpack's semantics for
what it essentially a shim.
Here's my @strable, minus its docstring:
@decorator
def strable(func, open_func=None):
if open_func is None:
open_func = open
def accepts_str(arg, *a, **kw):
if isinstance(arg, str):
with Pfx(arg):
with open_func(arg) as opened:
return func(opened, *a, **kw)
return func(arg, *a, **kw)
return accepts_str
and an example library function:
@strable
def count_lines(f):
count = 0
for line in f:
count += 1
return count
and there's a function taking an open file or a filename. But suppose we
want to supply a string whose lines need counting, not a filename. We
count _either_ change our policy decision from "accepts a filename" to
"accepts an input string", _or_ we can start adding a third mode on top
of the existing two modes. All three modes are reasonable.
>I'm trying to understand where the problem lies, between the existing
>APIs used by json etc (presumably they are fine)
They're historic. I think I'm -0 on having 2 functions. But only because
it is so easy to hand file contents to loads.
>and the objections to
>using what seems to be a very similar API for unpack, offerring the same
>functionality but differing only in spelling (a single generic function
>instead of two similarly-named functions).
I hope I've made it more clear above that my objection is to either
approach (polymorphic or parallel functions) because one can write a
general purpose shim and use it with almost anything, and then we can
make things like json or struct accept _only_ str or bytes respectively,
with _no_ complication extra semantics. Because once we do it for these
2 we _should_ do it for every parser for consistency.
Yes, yes, stripping json _back_ to just loads would break backwards
compatibility; I'm not proposing that for real. I'm proposing resisting
extra semantic bloat in favour of a help class or decorator. Consider:
from shimutils import bytes_from_file
from struct import unpack
unpackf = bytes_from_file(unpack)
Make a bunch of shims for the common use cases and the burden on users
of the various _other_ modules becomes very small, and we don't have to
go to every parser API and bloat it out. Especially since we've seen the
bikeshedding on semantics even on this small suggestion ("accept a
file").
>> And it is why I wrote myself my CornuCopyBuffer class (see my other
>> post in this thread).
>[...]
>> The return from .take is typically a
>> memoryview from `bfr`'s internal buffer - it is _always_ exactly `size`
>> bytes long if you don't pass short_ok=True, or it raises an exception.
>
>That's exactly the proposed semantics for unpack, except there's no
>"short_ok" parameter. If the read is short, you get an exception.
And here we are. Bikeshedding already!
My CCB.take (for short) raises an exception on _insufficient_ data, not
a short read. It does enough reads to get the data demanded. If I _want_
to know that a read was short I can pass short_ok=True and examine the
result before use. Its whole point is to give the right data to the
caller.
Let me give you some examples:
I run som binary protocols over TCP streams. They're not network
packets; the logical packets can span IP packets, and of course
conversely several small protocol packets may fit in a single network
packet because they're assembled in a buffer at the sending end (via
plain old file.write). Via a CCB the receiver _doesn't care_. Ask for
the required data, the CCB gathers enough and hands it over.
I parse MP4 files. The ISO14496 packet structure has plenty of
structures of almost arbitrary size, particularly the media data packet
(MDAT) which can be gigabytes in size. You're _going_ to get a short
read there. I'd be annoyed by an exception.
>> And so on.
>>
>> The point here is: make a class to get what you actually need
>
>Do you know better than the OP (Drew Warwick) and James Edwards what
>they "actually need"?
No, but I know what _I_ need. A flexible controller with several knobs
to treat input in various common ways.
>How would you react if I told you that your CornuCopyBuffer class, is an
>over-engineered, over-complicated, over-complex class that you don't
>need? You'd probably be pretty pissed off at my arrogance in telling you
>what you do or don't need for your own use-cases. (Especially since I
>don't know your use-cases.)
Some examples above. There's a _little_ over engineering, but it
actually solves a _lot_ of problems, making everything else MUCH MUCH
simpler.
>Now consider that you are telling Drew and James that they don't know
>their own use-cases, despite the fact that they've been working
>successfully with this simple enhancement for years.
I'm not. I'm _suggesting_ that _instead_ of embedded extra semantics
which we can't even all agree on into parser libraries it is often
better to make it easy to give the parser what their _current_ API
accepts. And that the tool to do that should be _outside_ those parser
modules, not inside, because it can be generally applicable.
>I'm happy for you that CornuCopyBuffer solves real problems for you,
>and if you want to propose it for the stdlib I'd be really interested
>to learn more about it.
Not yet. Slightly rough and the user audience is basicly me right now.
But feel free to pip install cs.buffer and cs.binary and have a look.
>But this is actually irrelevant to the current proposal. Even if we had
>a CornuCopyBuffer in the std lib, how does that help? We will still need
>to call struct.calcsize(format) by hand, still need to call read(size)
>by hand. Your CornuCopyBuffer does nothing to avoid that.
No, but its partner cs.binary _does_. As described in my first post to
this thread. Have a quick reread, particularly near the "PDInfo"
example.
>The point of this proposal is to avoid that tedious make-work, not
>increase it by having to wrap our simple disk files in a CornuCopyBuffer
>before doing precisely the same make-work we didn't want to do in the
>first case.
>
>Drew has asked for a better hammer, and you're telling him he really
>wants a space shuttle.
To my eye he asked to make unpack into a multitool (bytes and files),
and I'm suggesting maybe he should get a screwdriver to go with his
hammer (to use as a chisel, of course).
Anyway, I've making 2 arguments:
- don't bloat the stdlib APIs to accomodate thing much beyond their core
- offer a tool to make the things beyond the core _easily_ available for
use in the core way
The latter can then _also_ be used with other APIs not yet extended.
Cheers,
Cameron Simpson <cs at cskk.id.au>
More information about the Python-ideas
mailing list