[Python-ideas] struct.unpack should support open files

Cameron Simpson cs at cskk.id.au
Thu Dec 27 01:14:35 EST 2018


On 27Dec2018 12:59, Steven D'Aprano <steve at pearwood.info> wrote:
>On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote:
>[...]
>> >Also I'm thinking about type annotations in typeshed.
>> >Now the type is Union[array[int], bytes, bytearray, memoryview]
>> >Should it be Union[io.BinaryIO, array[int], bytes, bytearray,
>> >memoryview] ?
>>
>> And this is why I, personally, think augmenting struct.unpack and
>> json.read and a myriad of other arbitrary methods to accept both
>> file-like things and bytes is an open ended can of worms.
>
>I presume you mean json.load(), not read, except that it already reads
>from files.

Likely. Though the json module is string oriented (if one has UTF-8 
data, turning binary into a str is easy).

>Nobody is talking about augmenting "a myriad of other arbitrary methods"
>except for you. We're talking about enhancing *one* function to be a
>simple generic function.

Yes, but that is how the rot sets in.

Some here want to enhance json.load/loads. The OP wants to enhance 
struct.unpack. Yay. Now let's also do csv.reader. Etc.

I think my point is twofold: once you start down this road (a) you 
start doing it to every parser in the stdlib and (b) we all start 
bikeshedding about semantics.

There are at least two roads to such enhancement: make the functions 
polymorphic, coping with files or bytes/strs (depending), or make a 
parallel suite of functions like json.load/loads.

The latter is basically API bloat to little advantage. The former is 
rather slippery - I've a few functions myself with accept-str-or-file 
call modes, and _normally_ the "str" flavour is taken as a filename.  
But... if the function is a string parser, maybe it should parse the 
string itself? Already the choices are messy.
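For example, a sketch with placeholder names (_parse_stream standing in 
for the real parser):

    import io

    def _parse_stream(f):
        # Placeholder parser: count the lines of an open text file.
        return sum(1 for _ in f)

    def parse_a(src):
        # Policy 1: a str names a file to open and parse.
        if isinstance(src, str):
            with open(src) as f:
                return _parse_stream(f)
        return _parse_stream(src)

    def parse_b(src):
        # Policy 2: a str _is_ the text to parse.
        if isinstance(src, str):
            return _parse_stream(io.StringIO(src))
        return _parse_stream(src)

Same signature, incompatible meanings for the same argument.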

And both approaches invite much bikeshedding. Some of us would like 
something like struct.unpack to pull enough data from the file even if 
the file returns short reads. You, I gather, generally prefer the shim 
to be very shallow, with a short read causing an exception due to 
insufficient data. Should the file version support an optional 
seek/offset argument? The example from James suggests that such a thing 
would benefit him. And so on.
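Here's a minimal sketch of the deeper flavour, including the mooted 
seek/offset knob (the parameter is my invention, for illustration):

    def read_exactly(f, size, offset=None):
        # The "deep" flavour: loop until `size` bytes arrive, so short
        # reads from pipes or sockets don't cause spurious failures.
        if offset is not None:
            f.seek(offset)
        chunks = []
        needed = size
        while needed > 0:
            chunk = f.read(needed)
            if not chunk:
                # Genuinely insufficient data: _now_ an exception is right.
                raise EOFError(
                    "wanted %d bytes, got only %d" % (size, size - needed))
            chunks.append(chunk)
            needed -= len(chunk)
        return b''.join(chunks)

The shallow flavour is a single f.read(size) and a length check.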

And this argument has to play out for _every_ parser interface you want 
to adapt for both files and direct bytes/str (again, depending).

>I assume you have no objection to the existence of json.load() and
>json.loads() functions. (If you do think they're a bad idea, I don't
>know what to say.) Have they lead to "an open ended can of worms"?

On their own, no. The isolated example never starts that way. But really 
consistency argues that the entire stdlib should have file and str/bytes 
parallel functions across all parsers.  And _that_ is a can of worms.

>If we wrote a simple wrapper:
>
>def load(obj, *args, **kwargs):
>    if isinstance(obj, str):
>        return json.loads(obj, *args, **kwargs)
>    else:
>        return json.load(obj, *args, **kwargs)
>
>would that lead to "an open ended can of worms"?

Less so. I've a decorator of my own called @strable, which wraps other 
functions; it intercepts the first positional argument if it is a str 
and replaces it with something derived from it. The default mode is an 
open file, with the str as the filename, but it is slightly pluggable.

Such a decorator could reside in a utility stdlib module and become 
heavily used in places like json.load if desired.
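For instance, with the @strable decorator shown later in this message 
living in some hypothetical shimutils module:

    import json
    from shimutils import strable  # hypothetical utility module

    # json.load already accepts an open file; the shim adds "or a
    # filename" without touching the json module itself.
    load = strable(json.load)

Now load('config.json') and load(open('config.json')) both work.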

>These aren't rhetoricial questions. I'd like to understand your
>objection. You have dismissed what seems to be a simple enhancement with
>a vague statement about hypothetical problems. Please explain in
>concrete terms what these figurative worms are.

I'm hoping my discussion above shows where I think the open-ended side of 
the issue arises: once we do it to one function we sort of want to do it 
to all similar functions, and there are multiple defensible ways to do 
it.

>Let's come back to unpack. Would you object to having two separate
>functions that matched (apart from the difference in name) the API used
>by json, pickle, marshal etc?
>
>- unpack() reads from files
>- unpacks() reads from strings

Well, yeah. (Presuming you mean bytes rather than strings above in the 
Python 3 domain.) API bloat. They are essentially identical functions 
in terms of utility.

>Obviously this breaks backwards compatibility, but if we were designing
>struct from scratch today, would this API open a can of worms?
>(Again, this is not a rhetorical question.)

Only in that it opens the door to doing the same for every other similar 
function in the stdlib. And wouldn't it be nice to have a third form to 
take a filename and open it?

>Let's save backwards compatibility:

Some degree of objection: API bloat requiring repeated bloat elsewhere.  
Let's set backwards compatibility aside: it halves the discussion and 
examples.

>Or we could use a generic function. There is plenty of precedent for
>generic files in the stdlib. For example, zipfile accepts either
>a file name, or an open file object.

Indeed, and here we are with flavour #3: the string isn't a byte 
sequence to parse, it is now a filename. In Python 3 we can disambiguate 
if we parse bytes and treat str as a filename. But what if we're parsing 
str, as JSON does? Now we don't know and must make a policy decision.
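Concretely, both of these are strs and nothing in the type tells the 
callee which was meant:

    s1 = '{"x": 1}'   # JSON text to parse directly?
    s2 = 'data.json'  # or the name of a file holding JSON text?

A polymorphic loader must pick a policy, surprising somebody.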

>def unpack(fmt, frm):
>    if hasattr(frm, "read"):
>         return _unpack_file(fmt, frm)
>    else:
>         return _unpack_bytes(fmt, frm)
>
>Does that generic function wrapper create "an open ended can of worms"?
>If so, in what way?

If you were to rewrite the above in the form of my @strable decorator, 
provide it in a utility library, and _use_ it in unpack, I'd be +1, 
because the _same_ utility can be reused elsewhere by anyone for any 
API. Embedding it directly in unpack complicates unpack's semantics for 
what is essentially a shim.

Here's my @strable, minus its docstring:

    # `decorator` and `Pfx` come from my cs.deco and cs.pfx modules.
    @decorator
    def strable(func, open_func=None):
      if open_func is None:
        open_func = open
      def accepts_str(arg, *a, **kw):
        # A str first argument names something to open (a file by
        # default); anything else is passed through unchanged.
        if isinstance(arg, str):
          with Pfx(arg):  # report the name in any error messages
            with open_func(arg) as opened:
              return func(opened, *a, **kw)
        return func(arg, *a, **kw)
      return accepts_str

and an example library function:

    @strable
    def count_lines(f):
        count = 0
        for line in f:
            count += 1
        return count

and there's a function taking an open file or a filename. But suppose we 
want to supply a string whose lines need counting, not a filename. We 
could _either_ change our policy decision from "accepts a filename" to 
"accepts an input string", _or_ we can start adding a third mode on top 
of the existing two modes. All three modes are reasonable.
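For the record, the third mode is already reachable from the caller's 
side with a small wrapper:

    import io

    text = "one\ntwo\nthree\n"
    # Wrap the in-memory text as a file-like object and use the
    # existing file mode, rather than adding a third call mode.
    assert count_lines(io.StringIO(text)) == 3

which is exactly the kind of shim I'm arguing should live outside the 
parser.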

>I'm trying to understand where the problem lies, between the existing
>APIs used by json etc (presumably they are fine)

They're historic. I think I'm -0 on having 2 functions. But only because 
it is so easy to hand file contents to loads.

>and the objections to
>using what seems to be a very similar API for unpack, offerring the same
>functionality but differing only in spelling (a single generic function
>instead of two similarly-named functions).

I hope I've made it more clear above that my objection is to either 
approach (polymorphic or parallel functions) because one can write a 
general purpose shim and use it with almost anything, and then we can 
make things like json or struct accept _only_ str or bytes respectively, 
with _no_ complicating extra semantics. Because once we do it for these 
2 we _should_ do it for every parser for consistency.

Yes, yes, stripping json _back_ to just loads would break backwards 
compatibility; I'm not proposing that for real. I'm proposing resisting 
extra semantic bloat in favour of a helper class or decorator. Consider:

    from shimutils import bytes_from_file
    from struct import unpack
    unpackf = bytes_from_file(unpack)

Make a bunch of shims for the common use cases and the burden on users 
of the various _other_ modules becomes very small, and we don't have to 
go to every parser API and bloat it out. Especially since we've seen the 
bikeshedding on semantics even on this small suggestion ("accept a 
file").

>> And it is why I wrote myself my CornuCopyBuffer class (see my other 
>> post in this thread).
>[...]
>> The return from .take is typically a
>> memoryview from `bfr`'s internal buffer - it is _always_ exactly `size`
>> bytes long if you don't pass short_ok=True, or it raises an exception.
>
>That's exactly the proposed semantics for unpack, except there's no
>"short_ok" parameter. If the read is short, you get an exception.

And here we are. Bikeshedding already!

My CCB.take (for short) raises an exception on _insufficient_ data, not 
a short read. It does enough reads to get the data demanded. If I _want_ 
to know that a read was short I can pass short_ok=True and examine the 
result before use. Its whole point is to give the right data to the 
caller.
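In use, per those semantics (construction details aside - see 
cs.buffer):

    from cs.buffer import CornuCopyBuffer  # pip install cs.buffer

    def demo(bfr: CornuCopyBuffer):
        data = bfr.take(8)  # exactly 8 bytes, or an exception
        tail = bfr.take(8, short_ok=True)
        if len(tail) < 8:
            ...  # the caller chose to handle shortness explicitly
        return data, tail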

Let me give you some examples:

I run some binary protocols over TCP streams. They're not network 
packets; the logical packets can span IP packets, and of course 
conversely several small protocol packets may fit in a single network 
packet because they're assembled in a buffer at the sending end (via 
plain old file.write). Via a CCB the receiver _doesn't care_. Ask for 
the required data, the CCB gathers enough and hands it over.
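A sketch of the receiving side under that model (the length-prefix 
framing here is invented for illustration):

    from struct import unpack

    def read_packet(bfr):
        # One logical packet: a 4-byte big-endian payload length,
        # then the payload. bfr.take() gathers the bytes no matter
        # how the underlying network reads were chunked.
        length, = unpack('>L', bfr.take(4))
        return bfr.take(length)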

I parse MP4 files. The ISO14496 packet structure has plenty of 
structures of almost arbitrary size, particularly the media data packet 
(MDAT) which can be gigabytes in size. You're _going_ to get a short 
read there. I'd be annoyed by an exception.

>> And so on.
>>
>> The point here is: make a class to get what you actually need
>
>Do you know better than the OP (Drew Warwick) and James Edwards what
>they "actually need"?

No, but I know what _I_ need. A flexible controller with several knobs 
to treat input in various common ways.

>How would you react if I told you that your CornuCopyBuffer class, is an
>over-engineered, over-complicated, over-complex class that you don't
>need? You'd probably be pretty pissed off at my arrogance in telling you
>what you do or don't need for your own use-cases. (Especially since I
>don't know your use-cases.)

Some examples above. There's a _little_ over-engineering, but it 
actually solves a _lot_ of problems, making everything else MUCH MUCH 
simpler.

>Now consider that you are telling Drew and James that they don't know
>their own use-cases, despite the fact that they've been working
>successfully with this simple enhancement for years.

I'm not. I'm _suggesting_ that _instead_ of embedding extra semantics 
(which we can't even all agree on) into parser libraries, it is often 
better to make it easy to give the parser what its _current_ API 
accepts. And that the tool to do that should be _outside_ those parser 
modules, not inside, because it can be generally applicable.

>I'm happy for you that CornuCopyBuffer solves real problems for you,
>and if you want to propose it for the stdlib I'd be really interested
>to learn more about it.

Not yet. Slightly rough and the user audience is basically me right now.  
But feel free to pip install cs.buffer and cs.binary and have a look.

>But this is actually irrelevant to the current proposal. Even if we had
>a CornuCopyBuffer in the std lib, how does that help? We will still need
>to call struct.calcsize(format) by hand, still need to call read(size)
>by hand. Your CornuCopyBuffer does nothing to avoid that.

No, but its partner cs.binary _does_. As described in my first post to 
this thread. Have a quick reread, particularly near the "PDInfo" 
example.

>The point of this proposal is to avoid that tedious make-work, not
>increase it by having to wrap our simple disk files in a CornuCopyBuffer
>before doing precisely the same make-work we didn't want to do in the
>first case.
>
>Drew has asked for a better hammer, and you're telling him he really
>wants a space shuttle.

To my eye he asked to make unpack into a multitool (bytes and files), 
and I'm suggesting maybe he should get a screwdriver to go with his 
hammer (to use as a chisel, of course).

Anyway, I'm making 2 arguments:

- don't bloat the stdlib APIs to accommodate things much beyond their core

- offer a tool to make the things beyond the core _easily_ available for 
  use in the core way

The latter can then _also_ be used with other APIs not yet extended.

Cheers,
Cameron Simpson <cs at cskk.id.au>

