Wes Turner writes:
> Data interchange with structured types is worthwhile.
That's not what the main thread is about. It's about adding support for Decimal to the stdlib's json module. Even the OP has explicitly disclaimed pretty much everything else, although his preferred implementation is more general than that.
I'm +1 on that. I think the outline of how to do it has become pretty obvious, and that it should be restricted to automatically converting Decimals to a JSON number, perhaps under control of a use_decimal flag for both encoding and decoding.
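For concreteness, here's roughly how that might look. Only the decoding half (the parse_float hook) exists in the stdlib today; the use_decimal spelling is borrowed from simplejson, and the dumps half is hypothetical:

    from decimal import Decimal
    import json

    # Lossless decoding already works via the stdlib's parse_float hook:
    json.loads('{"price": 1.10}', parse_float=Decimal)
    # -> {'price': Decimal('1.10')}

    # A use_decimal flag would make the round trip symmetric:
    #   json.dumps({'price': Decimal('1.10')}, use_decimal=True)
    #       -> '{"price": 1.10}'
    #   json.loads('{"price": 1.10}', use_decimal=True)
    #       -> {'price': Decimal('1.10')}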
The rest should go into a separate thread. First let's dispose of this:
> Streaming JSON is not possible without JSON lines support.
It is obvious to me that this should be handled in yet another thread, separate from "lossless JSON", because it can and should be independently implemented, if it's done at all. Given the existing (obj, end) = JSONDecoder.raw_decode(s, idx) support in the json module, the difficulty in implementing it is all about buffering, and choosing where to do that buffering (in a separate module? in json.load? in a new json.load_stream generator?).
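To be concrete, here's a minimal sketch of such a generator (using the json.load_stream name from above), assuming the whole input is already in memory, which is precisely the assumption that dodges the buffering question:

    import json

    def load_stream(buf, _decoder=json.JSONDecoder()):
        """Yield each object from a string of concatenated JSON values."""
        idx = 0
        end = len(buf)
        while idx < end:
            # Skipping whitespace also skips newlines, so "JSON lines"
            # input needs no special support.
            while idx < end and buf[idx] in ' \t\r\n':
                idx += 1
            if idx >= end:
                break
            obj, idx = _decoder.raw_decode(buf, idx)
            yield obj

Refilling buf from a file or socket when raw_decode runs out of input is the hard part this sketch omits.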
I will now argue that the __json__ protocol is nowhere near so obviously stdlib-able as Decimal and streaming JSON.
> An object.__json__(**kwargs) protocol would inconvenience no-one so long as:
> - decimal isn't imported unless used
> - all existing code continues to work
I also think that JSON is widely enough used, and deserves better semantic support, that a protocol (specifically, the __json__ dunder) for serializer support and some form of complementary deserializer support are quite justifiable. But the __json__ dunder is the *easy* part. The complexity here is in that complementary deserializer.
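To show just how easy the serializer half is, here's a hypothetical hook, a few lines on top of the existing JSONEncoder.default machinery (the exact lookup semantics would of course need bikeshedding):

    import json

    class ProtocolEncoder(json.JSONEncoder):
        def default(self, o):
            # Look up the dunder on the type, as Python does for other
            # dunders, then fall back to the stock behavior.
            dunder = getattr(type(o), '__json__', None)
            if dunder is not None:
                return dunder(o)
            return super().default(o)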
Here's why. To your desiderata I would add:
- no complex type's module is imported unless used (easy)
- the deserializer support for a type should be linked to its serializer support (something like the codecs registry, but more complicated because each entity will need to invoke support separately, unlike codecs where there's one codec for a whole text)
- such object support should be automatically linked into both the top-level serializer and deserializer dispatching.
The latter two desiderata look *hard* to me. Without them, you've got the inverse of the current Decimal problem. This is going to require that somebody or somebodies spend many person-hours on design, implementation, and testing. Also
- the deserializer support may or may not want to be in json.loads()
because it may be preferable to deserialize to the primitive Python objects that correspond to the JSON types, and then allow the Python program to handle those flexibly. E.g., what to do about variable annotations? Should our deserializer automatically deal with those? What if a variable's value conflicts with its annotation? While there may be a clear answer to this question after somebody has thought about it for a bit, it's not obvious to me.
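To make the shape of the problem concrete, here's a hypothetical registry sketch; every name in it is invented. Note what it does *not* solve: nothing here links a type's encoder to its registry entry, and nothing wires the registry into json automatically, which are exactly the two hard desiderata above:

    import importlib
    import json

    # Hypothetical registry: a tag maps to a module name and a
    # constructor path.  The module is imported lazily, only when the
    # tag actually appears in the input (the easy desideratum).
    _DECODERS = {
        'Decimal': ('decimal', 'Decimal'),
        'datetime': ('datetime', 'datetime.fromisoformat'),
    }

    def _object_hook(d):
        tag = d.get('@type')
        if tag not in _DECODERS:
            return d
        mod_name, ctor_path = _DECODERS[tag]
        ctor = importlib.import_module(mod_name)
        for attr in ctor_path.split('.'):
            ctor = getattr(ctor, attr)
        return ctor(d['@value'])

    json.loads('{"@type": "Decimal", "@value": "1.10"}',
               object_hook=_object_hook)
    # -> Decimal('1.10')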
The fundamental problem with your overall argument is that the usefulness to the community at large is unclear:
> It is unfortunate that we all just use JSON and throw away decimals and float precision and datetimes because json.dumps is so easy.
True for yourself, I assume. But json.dumps is *not* why *the rest of us* do that. We do it because we've *always* done it. The Python objects we are serializing themselves lack units, precision, and pet's name! Until our Python programs become unit- and precision- aware, support for "lossless JSON" is necessarily going to be idiosyncratic, and mostly avoided.
> How many people know that:
> - You can or should use decimal to avoid float precision error, but then
>   you have to annoyingly write a JSONEncoder to save that data, and then
>   the type is lost anyway, because the value is parsed back to a float
>   when it's deserialized?
> - JSON-LD is the only non-ad-hoc solution to preserving precision,
>   datetimes, and complex numbers and types with JSON
> - JSON5 supports IEEE 754 ±Infinity and NaN
> - Pickles do serialize arbitrary objects, but are not safe for data
>   publishing because unmarshalling runs executable code in the pickle
>   (this is in the docs now)
Very few. But again, that's the wrong set of questions, for reasons similar to the above issue about "why we use json.dumps". The right questions are:
1. Of those who don't know, how many have need to know, and will acknowledge that need? (If they don't admit it, good luck getting them to change their programs!)
2. Of those who have need to know, how many would have "enough" of their serialization problems solved by any particular packaged set of features that might be added to the stdlib?
3. Is the number of programs in 2 "large enough" to justify the additional maintenance burden and the risk that better but conflicting solutions will be created in the future?
> JSON-LD is the way to go for complex types in JSON.
>
> It's worth specifying a JSON serialization protocol as a PEP that third-party and stdlib JSON implementations would use.
All of JSON-LD is way overkill for the examples of complex types you've given. We *do not need or want* a complete reimplementation of the Semantic Web in JSON in our stdlib. So what exactly are you talking about? Here's my idea:
I suspect your "serialization protocol" above really means a *deserialization* protocol. object.__json__ is all the serialization protocol we need, because it will produce a standard JSON stream that can be deserialized (perhaps with different semantics!) by any standard JSON deserializer. Also, we don't need a PEP to specify the protocol for providing a more accurate deserialization: JSON-LD already did that work, and the parts we need are pretty trivial (definitely @context, maybe @id). So I interpret your word "protocol" to mean "JSON-LD @context". Is that close?
For almost all Python applications, a JSON-LD @context specific to Python's object model and standard builtin types would be enough. Since each type is itself a Python object, JSON-LD should be able to represent user-defined classes and their instances within that @context too. For those programs that provide more semantic information about their classes, they'd need additional, idiosyncratic @context anyway, and I have no idea what a "standard extended @context" would want to include. Each large external package (NumPy, Twisted) would want to implement its own @context, I think.
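For illustration, an instance serialized under such a @context might look something like this; everything here is invented except the three JSON-LD keywords:

    # Hypothetical serialization of Decimal('1.10'); the vocabulary
    # URL is made up.
    doc = {
        "@context": "https://example.org/python-builtins.jsonld",
        "@type": "decimal.Decimal",
        "@value": "1.10",
    }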
We could imagine additional semantic information in this @context that would even tell you which modules you need to pip install from PyPI to work with these data types, along with the developers' and auditors' signatures, so that you can authenticate the modules and apply your trust model in deciding whether to import them.

Footnote: Is this new? I know that software modules are frequently signed by their maintainers, and people decide to extend trust to particular maintainers. But in open source, anybody can audit, so a list of auditors with signatures, dates, and a comment field for each audit might also be useful for maintainers who aren't famous, when the auditors are.