
True for yourself, I assume. But json.dumps is *not* why *the rest of us* do that. We do it because we've *always* done it. The Python objects we are serializing themselves lack units, precision, and pet's name! Until our Python programs become unit- and precision-aware, support for "lossless JSON" is necessarily going to be idiosyncratic, and mostly avoided.
As a more "casual" user of JSON, this part in particular resonated with me. For the majority of use cases, I can't imagine that most users have a need for the degree of precision desired by the OP of the previous topic. As far as I'm aware ``json.dumps()`` serves adequately as a JSON stream for most users. For context, a recent JSON file that I created using the json module: https://github.com/python/devguide/pull/517. It's quite simple and has a very small number of data types.
1. Of those who don't know, how many have need to know, and will acknowledge that need? (If they don't admit it, good luck getting them to change their programs!)
Prior to this discussion, I'll admit that I had no idea that decimals were converted to floats when JSON was deserialized. But as stated in the question above, I really never had a need to know. For pretty much any time I've ever used JSON, floats provided a "good enough" degree of precision. I'm not saying that because it isn't useful to me personally it shouldn't be added, but I wouldn't be surprised if the majority of users of the json module had no idea about this issue because it didn't affect them significantly.
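To make the issue concrete for anyone else in the same boat, here is a minimal illustration of the behaviour being discussed (nothing beyond what the json and decimal modules already do):

    import decimal
    import json

    # json.loads() parses every JSON number with a fractional part as a
    # Python float, so digits beyond float precision silently disappear:
    json.loads('3.141592653589793238462643383279')
    # -> 3.141592653589793

    # And a Decimal can't even be written out without extra work:
    # json.dumps(decimal.Decimal("0.1")) raises
    # TypeError: Object of type Decimal is not JSON serializable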
For almost all Python applications, a JSON-LD @context specific to Python's object model and standard builtin types would be enough.
I'm personally not knowledgeable enough about JSON-LD; I've only seen it mentioned a few times before. But, based on what I can tell from the examples on https://json-ld.org/playground/ (the "Place" example was particularly helpful), I could definitely imagine @context being useful. Speaking of examples, I think it would be helpful to provide a brief example of ``object.__json__()`` being used, for some added clarity. Would it be a direct alias of ``json.dumps()`` with the exact same parameters and usage, or would there be some substantial differences?

Thanks,
Kyle Stanley

On Fri, Aug 16, 2019 at 4:00 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Wes Turner writes:
Data interchange with structured types is worthwhile.
That's not what the main thread is about. It's about adding support for Decimal to the stdlib's json module. Even the OP has explicitly disclaimed pretty much everything else, although his preferred implementation is more general than that.
I'm +1 on that. I think the outline of how to do it has become pretty obvious, and that it should be restricted to automatically converting Decimals to a JSON number, perhaps under control of a use_decimal flag for both encoding and decoding.
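Roughly the kind of usage I have in mind; note that the use_decimal flag is the hypothetical part, while the parse_float hook at the end already exists:

    import decimal
    import json

    data = {"price": decimal.Decimal("19.99")}

    # Hypothetical, not current stdlib API:
    # s = json.dumps(data, use_decimal=True)   # emit 19.99 as a bare JSON number
    # obj = json.loads(s, use_decimal=True)    # get Decimal('19.99') back

    # The decoding half can already be approximated today with an existing hook:
    obj = json.loads('{"price": 19.99}', parse_float=decimal.Decimal)
    assert obj["price"] == decimal.Decimal("19.99")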
The rest should go into a separate thread. First let's dispose of this:
Streaming JSON is not possible without JSON lines support.
It is obvious to me that this should be handled in yet another thread, separate from "lossless JSON", because it can and should be independently implemented, if it's done at all. Given ``(obj, end) = JSONDecoder().raw_decode(s, idx=n)`` support in the json module, the difficulty in implementing it is all about buffering, and choosing where to do that buffering (in a separate module? in json.load? in a new json.load_stream generator?)
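For concreteness, a rough sketch of the kind of generator I mean, built only on the existing ``raw_decode``; the buffering question is exactly what it glosses over, and the name load_stream is just a placeholder:

    import json

    def load_stream(text):
        # Yield successive JSON documents from one string.  Real streaming
        # would have to buffer partial reads, which is the hard part.
        decoder = json.JSONDecoder()
        idx = 0
        while idx < len(text):
            # Skip whitespace between documents.
            while idx < len(text) and text[idx].isspace():
                idx += 1
            if idx >= len(text):
                break
            obj, idx = decoder.raw_decode(text, idx)
            yield obj

    # list(load_stream('{"a": 1}\n{"b": 2}\n[3, 4]'))
    # -> [{'a': 1}, {'b': 2}, [3, 4]]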
I will now argue that the __json__ protocol is nowhere near so obviously stdlib-able as Decimal and streaming JSON.
An object.__json__(**kwargs) protocol would inconvenience no-one so long as:

- decimal isn't imported unless used
- all existing code continues to work
I also think that JSON is widely enough used, and deserves better semantic support, that a protocol (specifically, the __json__ dunder) for serializer support and some form of complementary deserializer support are quite justifiable. But the __json__ dunder is the *easy* part. The complexity here is in that complementary deserializer.
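To be concrete about the easy part, here is a minimal sketch of the serializer side as I understand the proposal; the ``__json__`` name is the proposal itself, and ProtocolEncoder and Point are purely illustrative:

    import json

    class ProtocolEncoder(json.JSONEncoder):
        # Fall back to the proposed dunder for objects the base
        # encoder doesn't know how to handle.
        def default(self, o):
            method = getattr(o, "__json__", None)
            if method is not None:
                return method()
            return super().default(o)

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

        def __json__(self):
            return {"x": self.x, "y": self.y}

    json.dumps(Point(1, 2), cls=ProtocolEncoder)   # -> '{"x": 1, "y": 2}'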
Here's why. To your desiderata I would add
- no complex type's module is imported unless used (easy)
- the deserializer support for a type should be linked to its serializer support (something like the codecs registry, but more complicated because each entity will need to invoke support separately, unlike codecs where there's one codec for a whole text; a rough sketch of what I mean follows below)
- such object support should be automatically linked in to both the top level serializer and deserializer dispatching.
The latter two desiderata look *hard* to me. Without them, you've got the inverse of the current Decimal problem. This is going to require that somebody or somebodies spend many person-hours on design, implementation, and testing. Also
- the deserializer support may or may not want to be in json.loads()
because it may be preferable to deserialize to the primitive Python objects that correspond to the JSON types, and then allow the Python program to flexibly handle those. Eg, what to do about variable annotations? Should our deserializer automatically deal with those? What if a variable's value conflicts with its annotation? While there may be a clear answer to this question after somebody has thought about it for a bit, it's not obvious to me.
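To show the shape of the registry desideratum rather than solve it, a sketch; the @type/@value tagging, the REGISTRY mapping, and everything in it are invented for illustration:

    import datetime
    import decimal
    import json

    # Hypothetical registry pairing a tag with (encode, decode) callables;
    # the encode half would hang off a matching JSONEncoder.default().
    REGISTRY = {
        "decimal.Decimal": (str, decimal.Decimal),
        "datetime.datetime": (lambda d: d.isoformat(),
                              datetime.datetime.fromisoformat),
    }

    def decode_hook(d):
        # Called by json.loads() for every JSON object; rebuild tagged values.
        tag = d.get("@type")
        if tag in REGISTRY:
            return REGISTRY[tag][1](d["@value"])
        return d

    tagged = '{"price": {"@type": "decimal.Decimal", "@value": "19.99"}}'
    json.loads(tagged, object_hook=decode_hook)
    # -> {'price': Decimal('19.99')}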
The fundamental problem with your overall argument is that the usefulness to the community at large is unclear:
It is unfortunate that we all just use JSON and throw away decimals and float precision and datetimes because json.dumps is so easy.
True for yourself, I assume. But json.dumps is *not* why *the rest of us* do that. We do it because we've *always* done it. The Python objects we are serializing themselves lack units, precision, and pet's name! Until our Python programs become unit- and precision-aware, support for "lossless JSON" is necessarily going to be idiosyncratic, and mostly avoided.
How many people know that:
- You can or should use decimal to avoid float precision error, but then you have to annoyingly write a JSONEncoder to save that data, and then the type is lost because the value is cast to a float when it's deserialized? (A minimal sketch of that dance follows this list.)
- JSON-LD is the only non-ad-hoc solution to preserving precision, datetimes, and complex numbers and types with JSON
- JSON5 supports IEEE 754 ±Infinity and NaN
- Pickles do serialize arbitrary objects, but are not safe for data publishing because unmarshalling runs executable code in the pickle (this is in the docs now)
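As a minimal sketch of the dance the first point describes (DecimalEncoder is just an illustrative name; the loss at the end is the whole complaint):

    import decimal
    import json

    class DecimalEncoder(json.JSONEncoder):
        # The "annoying" part: every project writes some variant of this.
        def default(self, o):
            if isinstance(o, decimal.Decimal):
                return float(o)          # exactness is given up right here
            return super().default(o)

    s = json.dumps({"price": decimal.Decimal("0.1")}, cls=DecimalEncoder)
    json.loads(s)
    # -> {'price': 0.1}  -- a plain float; the Decimal type is gone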
Very few. But again, that's the wrong set of questions, for reasons similar to the above issue about "why we use json.dumps". The right questions are:
1. Of those who don't know, how many have need to know, and will acknowledge that need? (If they don't admit it, good luck getting them to change their programs!)
2. Of those who have need to know, how many would have "enough" of their serialization problems solved by any particular packaged set of features that might be added to the stdlib?
3. Is the number of programs in 2 "large enough" to justify the additional maintenance burden and the risk that better but conflicting solutions will be created in the future?
JSON-LD is the way to go for complex types in JSON.
It's worth specifying a JSON serialization protocol as a PEP that third-party and stdlib JSON implementations would use.
All of JSON-LD is way overkill for the examples of complex types you've given. We *do not need or want* a complete reimplementation of the Semantic Web in JSON in our stdlib. So what exactly are you talking about? Here's my idea:
I suspect your "serialization protocol" above really means *deserialization* protocol. object.__json__ is all the serialization protocol we need, because it will produce a standard JSON stream that can be deserialized (perhaps with different semantics!) by any standard JSON deserializer. Also, we don't need a PEP to specify the protocol for providing a more accurate deserialization; JSON-LD already did that work, and the parts we need are pretty trivial (definitely @context, maybe @id). So I interpret your word "protocol" to mean "JSON-LD @context". Is that close?
For almost all Python applications, a JSON-LD @context specific to Python's object model and standard builtin types would be enough. Since each type is itself a Python object, JSON-LD should be able to represent user-defined classes and their instances within that @context too. For those programs that provide more semantic information about their classes, they'd need additional, idiosyncratic @context anyway, and I have no idea what a "standard extended @context" would want to include. Each large external package (NumPy, Twisted) would want to implement its own @context, I think.
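Purely to illustrate the scale I'm imagining, something along these lines; the vocabulary URL and the term names are made up, not any standard that exists:

    import json

    doc = json.loads("""
    {
      "@context": {"@vocab": "https://example.org/python-builtins#"},
      "price":   {"@type": "decimal.Decimal",   "@value": "19.99"},
      "created": {"@type": "datetime.datetime", "@value": "2019-08-16T04:00:00"}
    }
    """)
    # A Python-aware consumer would map each @type back to the named class;
    # any other JSON-LD consumer still sees plain, valid JSON.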
We could imagine additional semantic information in this @context that would even tell you which modules you need to pip from PyPI to work with these data types, along with the developers' and auditors'[1] signatures so you can authenticate the modules and apply your trust model when deciding whether to import them.
Steve
Footnotes:
[1] Is this new? I know that software modules are frequently signed by their maintainers, and people decide to extend trust to particular maintainers. But in open source anybody can audit, so a list of auditors with signatures, dates, and a comment field for each audit might also be useful for maintainers who aren't famous, when the auditors are.