Adding support for a "raw output" in the JSON serializer
I have found myself in an awkward situation with the current (Python 3.7) JSON module. Basically it boils down to how it handles floats. I was hit by this particular case:

```
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
```

but I guess it really does not matter. What matters is that I did not find a way to fix it with the standard `json` module. I have a JSON file generated by another program (C++ code, which uses the nlohmann/json library), which serializes one of the floats to the value above. When reading this JSON file in my Python code, I can get either a decimal.Decimal object (when specifying `parse_float=decimal.Decimal`) or a float. If I use the latter, the least significant digit is lost in deserialization. If I use Decimal, the value is preserved, but there seems to be no way to "serialize it back". Writing a custom serializer:

```python
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # <- This becomes quoted in the serialized output
        return super().default(o)
```

seems to only allow returning a "string" value, but then it serializes it as a string, i.e. with the double quotes! What seems to be missing is the ability to return a "raw textual representation" of the serialized object which will not get mangled further by the `json` module.

I noticed that `simplejson` provides an explicit option for its standard serializing function, called `use_decimal`, which basically solves my problem, but I would like to use the standard module. So the question is whether something like `use_decimal` has been considered for the standard module, and if yes, why it was not implemented. The other option could be to support "raw output" in the serializer, e.g.
something like:

```python
class DecimalEncoder(json.JSONEncoder):
    def raw(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # <- This is a raw representation of the object
        return super().raw(o)
```

where the returned values would be passed directly to the output stream without adding any additional characters. Then I could write my own Decimal serializer with the few lines of code above.

If anyone wants to know why the last digit matters (or why I cannot double-quote the floats), it is because the file has a secure hash attached and this basically breaks it.
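For reference, the quoting problem described above is easy to reproduce. This is only a minimal sketch of the encoder from the message, run against the example value:

```python
import decimal
import json

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # the str is then encoded as a JSON string, with quotes
        return super().default(o)

# Parse the file's float losslessly, then try to write it back.
data = json.loads('{"val": 0.6441726684570313}', parse_float=decimal.Decimal)
out = json.dumps(data, cls=DecimalEncoder)
print(out)  # {"val": "0.6441726684570313"} -- quoted, so the hash no longer matches
```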
On 8 Aug 2019, at 12:22, Richard Musil <risa2000x@gmail.com> wrote:
I have found myself in an awkward situation with current (Python 3.7) JSON module. Basically it boils down to how it handles floats. I had been hit on this particular case:
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
but I guess it really does not matter.
It really doesn’t, both values have the same binary representation. See the Python FAQ at <https://docs.python.org/3/faq/design.html#why-are-floating-point-calculations-so-inaccurate> or the floating point section of the tutorial at <https://docs.python.org/3/tutorial/floatingpoint.html#tut-fp-issues> for more information.

Ronald
—
Twitter: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
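Ronald's point is straightforward to verify: both decimal literals compile to the exact same float object (a quick check, for illustration):

```python
a = 0.6441726684570313
b = 0.6441726684570312
print(a == b)   # True: both literals round to the same IEEE-754 double
print(a.hex())  # identical bit patterns...
print(b.hex())  # ...for both spellings
```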
On Thu, Aug 8, 2019 at 11:34 PM Ronald Oussoren via Python-ideas <python-ideas@python.org> wrote:
On 8 Aug 2019, at 12:22, Richard Musil <risa2000x@gmail.com> wrote:
I have found myself in an awkward situation with current (Python 3.7) JSON module. Basically it boils down to how it handles floats. I had been hit on this particular case:
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
but I guess it really does not matter.
It really doesn’t, both values have the same binary representation. See the Python FAQ at <https://docs.python.org/3/faq/design.html#why-are-floating-point-calculations-so-inaccurate> or the floating point section of the tutorial at <https://docs.python.org/3/tutorial/floatingpoint.html#tut-fp-issues> for more information.
That depends on your definition of "matter". The JSON specification doesn't actually stipulate IEEE 64-bit floating point; it just defines the grammar. It'd be completely valid to use a JSON number to carry data from one Python script into another, where both ends use decimal.Decimal to store it. But if the value is going to be parsed into a float at the other end, then yeah, they're equivalent. ChrisA
On 08/08/2019 11:22, Richard Musil wrote:
I have found myself in an awkward situation with current (Python 3.7) JSON module. Basically it boils down to how it handles floats. I had been hit on this particular case:
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
but I guess it really does not matter.
Well, yes it does. It's pretty much the whole of the matter. Unless you get lucky with values, floats are inherently imprecise. The number you are playing with is apparently 0.64417266845703125 as an IEEE float representation. It happens that Python rounds down ("rounds to even" is the statement in the documentation) when it displays this. Essentially you are not getting the precision you think you are getting. -- Rhodri James *-* Kynesim Ltd
On Thu, Aug 8, 2019 at 11:50 PM Rhodri James <rhodri@kynesim.co.uk> wrote:
On 08/08/2019 11:22, Richard Musil wrote:
I have found myself in an awkward situation with current (Python 3.7) JSON module. Basically it boils down to how it handles floats. I had been hit on this particular case:
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
but I guess it really does not matter.
Well, yes it does. It's pretty much the whole of the matter. Unless you get lucky with values, floats are inherently imprecise. The number you are playing with is apparently 0.64417266845703125 as an IEEE float representation. It happens that Python rounds down ("rounds to even" is the statement in the documentation) when it displays this.
Essentially you are not getting the precision you think you are getting.
Floats are actually rational numbers subject to a few constraints, including that the denominator be a power of two. For display, Python happens to choose some number which is represented identically to the original value (and, I believe, picks the one with the shortest decimal representation). Floats aren't "inherently imprecise"; they have a specific finite precision that is defined in binary, and then there are multiple equivalently valid decimal numbers that all will encode to the same bits. So yes, if you think you are getting a certain number of *decimal* digits of precision, then you're not - you're getting both more and less precision than that. You're getting a certain number of *binary* digits. In fact, the number given above can be shown to be this fraction:
"%d/%d" % 0.6441726684570313.as_integer_ratio() '84433/131072'
Which is the number ending ...03125. Effectively, the number ...0313 is being rounded down to ...03125; yes, it's being rounded to a number that requires MORE decimal digits to display it. Since the display is being shown to a limited number of decimal digits, the stored value has to be rounded, and that results in something that isn't identical to the input. But both ...0313 and ...0312 get rounded to ...03125 for storage, so they are equally valid representations of the floating-point value 84433/131072. ChrisA
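Following on from Chris's fraction, both decimal spellings can be checked to collapse to the same ratio (a quick sanity check):

```python
x = 0.6441726684570313
y = 0.6441726684570312
print(x.as_integer_ratio())                          # (84433, 131072)
print(x.as_integer_ratio() == y.as_integer_ratio())  # True: same stored value
```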
On Thu, Aug 08, 2019 at 10:22:49AM -0000, Richard Musil wrote:
I have found myself in an awkward situation with current (Python 3.7) JSON module. Basically it boils down to how it handles floats. I had been hit on this particular case:
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
but I guess it really does not matter.
I think it does matter, because there is no such float as ``0.6441726684570313``. The call to float() there is a no-op, because the literal 0.64417... is already compiled to a float before the function is called. Calling float() on a float does nothing.

So the problem here lies *before* you call float(): floats only have finite precision, and they use base 2, not 10, so there are many decimal numbers they cannot represent. And 0.6441726684570313 is one of those numbers. So even though you typed 0.64417...313, that gets rounded to the nearest base-2 number, which is 0.64417...312. You can see this by using the hex representation:

```
py> float.fromhex('0x1.49d1000000000p-1')
0.6441726684570312
py> float.fromhex('0x1.49d1000000001p-1')
0.6441726684570314
```
What matters is that I did not find a way how to fix it with the standard `json` module. I have the JSON file generated by another program (C++ code, which uses nlohmann/json library), which serializes one of the floats to the value above.
I am very curious how the C++ code is generating that value, because Python floats ought to be identical to C doubles. Perhaps the C++ code is using an extended precision float with more bits? Or a decimal?
Then when reading this JSON file in my Python code, I can get either decimal.Decimal object (when specifying `parse_float=decimal.Decimal`) or float. If I use the latter the least significant digit is lost in deserialization.
As above, that's unavoidable for Python floats.
If I use Decimal, the value is preserved, but there seems to be no way to "serialize it back". Writing a custom serializer:
Alas, this is beyond my knowledge of JSON, but if you are correct that there's no way to serialise Decimals back to JSON, that seems like a major missing piece to me. Perhaps you can help jump-start this: https://bugs.python.org/issue16535
```python
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # <- This becomes quoted in the serialized output
        return super().default(o)
```
I don't know a lot about writing JSON encoders, but a half-hearted and cursory glance at other custom encoders on Stack Overflow suggests to me that you probably want this instead:

```python
return (str(o),)  # return a length-1 tuple
```

but don't quote me or ask me to explain why that does or doesn't work.

-- Steven
On Thu, Aug 8, 2019 at 4:01 PM Steven D'Aprano wrote:
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
The call to float() there is a no-op, because the literal 0.64417... is already compiled to a float before the function is called. Calling float() on a float does nothing. So the problem here lies *before* you call float(): floats only have finite precision, and they use base 2, not 10, so there are many decimal numbers they cannot represent. And 0.6441726684570313 is one of those numbers.
Here's the same idea, expressed differently:

```
>>> 0.6441726684570313
0.6441726684570312
>>> 2.000
2.0
```

-- Jonathan
I am not sure `(str(o),)` is what I want. For a comparison, here are three examples:

```
json:orig = {"val": 0.6441726684570313}
json:pyth = {'val': 0.6441726684570312}
json:seri = {"val": 0.6441726684570312}

dson:orig = {"val": 0.6441726684570313}
dson:pyth = {'val': Decimal('0.6441726684570313')}
dson:seri = {"val": ["0.6441726684570313"]}

sjson:orig = {"val": 0.6441726684570313}
sjson:pyth = {'val': Decimal('0.6441726684570313')}
sjson:seri = {"val": 0.6441726684570313}
```

Each one has three outputs: `orig` is the input text, `pyth` is its Python representation in a `dict`, and `seri` is the serialized text output of `pyth`. The prefixes are: `json` for the standard Python module (which gets the last digit different from the input); `dson` for the standard module using `parse_float=decimal.Decimal` on `json.loads` and a custom serializer with the proposed `return (str(o),)`; and finally `sjson` for `simplejson` using `use_decimal=True` on `json.loads` and the same (which is the default) on its `json.dumps`.

When I had `return str(o)` in the custom implementation, I ended up with a string in the output:

```
dson:orig = {"val": 0.6441726684570313}
dson:pyth = {'val': Decimal('0.6441726684570313')}
dson:seri = {"val": "0.6441726684570313"}
```

and finally, with `return float(o)` I am basically back at square one:

```
dson:orig = {"val": 0.6441726684570313}
dson:pyth = {'val': Decimal('0.6441726684570313')}
dson:seri = {"val": 0.6441726684570312}
```

The possibility to specify "raw" textual output, which does not get mangled by the JSONEncoder when a custom encoder is used, seems to be missing.
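For anyone who wants to reproduce the `dson` case above, this sketch shows how the length-1 tuple ends up serialized as a JSON array:

```python
import decimal
import json

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return (str(o),)  # length-1 tuple, which the encoder serializes as an array
        return super().default(o)

orig = '{"val": 0.6441726684570313}'
pyth = json.loads(orig, parse_float=decimal.Decimal)
seri = json.dumps(pyth, cls=DecimalEncoder)
print(seri)  # {"val": ["0.6441726684570313"]}
```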
On Fri, Aug 9, 2019 at 5:06 AM Richard Musil <risa2000x@gmail.com> wrote:
I am not sure `(str(o),)` is what I want. For a comparison here are three examples:

```
json:orig = {"val": 0.6441726684570313}
json:pyth = {'val': 0.6441726684570312}
json:seri = {"val": 0.6441726684570312}

dson:orig = {"val": 0.6441726684570313}
dson:pyth = {'val': Decimal('0.6441726684570313')}
dson:seri = {"val": ["0.6441726684570313"]}

sjson:orig = {"val": 0.6441726684570313}
sjson:pyth = {'val': Decimal('0.6441726684570313')}
sjson:seri = {"val": 0.6441726684570313}
```

Each one has three outputs: `orig` is the input text, `pyth` is its Python representation in a `dict`, and `seri` is the serialized text output of `pyth`.
Now, the prefixes are `json` for standard Python module (which gets the last digit different from the output). `dson` is standard Python module using `parse_float=decimal.Decimal` on `json.loads` and custom serializer with proposed `return (str(o),)`. Finally `sjson` is `simplejson` using `use_decimal=True` on `json.loads` and the same (which is default) on its `json.dumps`.
A quick search of bpo turned up this thread: https://bugs.python.org/issue16535

Unfortunately, the trick given doesn't work, as there have been some optimizations done that break it. There are two broad suggestions that came out of that thread and others, and I think it may be worth reopening them.

1) Should JSONEncoder (the class underlying json.dumps) natively support decimal.Decimal, and if so, can it avoid importing that module unnecessarily?

2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?

I'm inclined towards the protocol, since there are protocols for various other encoders (eg deepcopy, pickle), and it avoids the problem of json importing decimal. It can also be implemented entirely as a third-party patch, although you'd need to subclass Decimal to add that method.

However, this is a much broader topic, and if you want to push for that, I would recommend starting a new thread. As Andrew pointed out, trying to get bit-for-bit identical JSON representations out of different encoders is usually a bad idea.

ChrisA
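Since the float-subclass trick from bpo-16535 no longer works, one workaround that does still work with the current stdlib is placeholder substitution in an overridden encode(). This is only a sketch (the class name is illustrative), and it does not help with streaming output via json.dump or iterencode:

```python
import decimal
import json
import uuid

class RawDecimalEncoder(json.JSONEncoder):
    """Emit Decimal as an unquoted JSON number via placeholder substitution."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._raw = {}

    def default(self, o):
        if isinstance(o, decimal.Decimal):
            key = uuid.uuid4().hex   # unique placeholder string
            self._raw[key] = str(o)
            return key
        return super().default(o)

    def encode(self, o):
        text = super().encode(o)
        # Swap each quoted placeholder for the raw number text.
        for key, raw in self._raw.items():
            text = text.replace('"%s"' % key, raw)
        return text

data = json.loads('{"val": 0.6441726684570313}', parse_float=decimal.Decimal)
out = json.dumps(data, cls=RawDecimalEncoder)
print(out)  # {"val": 0.6441726684570313}
```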
Chris Angelico wrote:
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
However, this is a much broader topic, and if you want to push for that, I would recommend starting a new thread. As Andrew pointed out, trying to get bit-for-bit identical JSON representations out of different encoders is usually a bad idea.
I am not sure I have ever asked for a bit-for-bit identical JSON representation. I have only ever mentioned `decimal.Decimal` and the lack of a proper way to encode it (while having a proper way to decode it). If you read the subject of the OP, it asks for "raw output" (in the encoder; nothing about the underlying representation), which, if I understand your two options, basically corresponds to the second one and is probably addressed elsewhere far more thoroughly. If my understanding is right, I would also be very much in favor of this approach.
On Aug 8, 2019, at 13:29, Richard Musil <risa2000x@gmail.com> wrote:
I am not sure I have ever asked for bit-for-bit identical JSON representation.
Your complaint is that the hash is different. If you don't get a bit-for-bit identical JSON representation, the hash will not be the same. So either you _are_ asking for it, or you're asking for something that won't actually solve your problem (in which case, why do it?). And I don't know how output could be "raw" in any sense other than letting you specify the exact byte representation for it, so I don't get what you're saying when you claim you want the former but don't want the latter.
I have always only mentioned `decimal.Decimal` and the lack of proper way to encode it (while having the proper way of decoding it), and if you read the subject of the OP it is asking for "raw output" (in the encoder, nothing about underlying representation)
If decimal really is the only case you or anyone else is ever going to care about, you shouldn't be asking for "raw output"; you should either be asking for simplejson's use_decimal to be brought into the stdlib, or just using simplejson in the first place.

If you really do need support for raw output, you'd need a new hook besides default, say, rawdefault. This wouldn't be hard. Other than passing it up and down the chain of APIs, the only trick should be the C and Python code down in _iterencode that do this:

```python
o = _default(o)
yield from _iterencode(o, _current_indent_level)
```

… replacing it with something like:

```python
ro = _rawdefault(o)
if ro is not NotImplemented:
    yield ro
else:
    o = _default(o)
    yield from _iterencode(o, _current_indent_level)
```

But I'm pretty sure simplejson has had issues raised for something to generalize use_decimal this way and always rejected them. However, that might be only because all such suggestions also had other problems. Often they either come along with "I want my hook called _before_ the normal encoding, rather than between normal encoding and default", which has serious performance implications, or "I want to be able to create invalid JSON so it can be eval'd as JS code", which is just a bad idea. This one doesn't have either of those problems, and maybe it doesn't have _any_ problems that have come up in the past, but someone has to go through the simplejson issues to figure that out.

So, if you really want this, come up with a use case that actually needs it, make sure you've answered whatever objections were raised in the past for simplejson (if any), and ideally write a patch to simplejson and/or CPython (but if you can't, I think this one is easy enough that you can hope someone else does), and you'll probably get a different response than you get by defensively claiming that you're not asking for what you're asking for but then repeating your need for it anyway.
The hash was an argument for having the means to preserve the precision (or representation) of the float, in its textual form, over a decode-encode cycle. The example was chosen because I had experienced it in my project. But I never said that I need the underlying binary representation to be bit-for-bit accurate over any JSON decoder-encoder. I am perfectly aware of the float binary representation's limitations and already provided anecdotal evidence of different results from two different encoders.

The point was that I could have this bit-for-bit accuracy implemented with a custom type and its encoder/decoder (if the default one did not support it), if only the standard module supported it. The custom type could be decimal.Decimal, and the support would involve the decoding (which is already implemented) and the encoding, which at the moment is not possible to implement.

So to sum up, I am asking for support in the JSON module so that I can implement bit-for-bit (or let's say byte-for-byte) accuracy with a custom type, if needed. I am not asking for the JSON module to have it built in. I do not know if a generic solution (for an arbitrary type) is really needed, and I hoped you might be able to judge that better. In my particular case it only concerns float.

I can use simplejson and it solves my problem. It just seemed strange that the standard module did not support it, in particular that it supported it only half-way. I came here to get an idea of why that might be the case (if there was any reason in the first place) and to gather other opinions.
On Fri, Aug 9, 2019 at 6:31 AM Richard Musil <risa2000x@gmail.com> wrote:
Chris Angelico wrote:
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
However, this is a much broader topic, and if you want to push for that, I would recommend starting a new thread. As Andrew pointed out, trying to get bit-for-bit identical JSON representations out of different encoders is usually a bad idea.
I am not sure I have ever asked for bit-for-bit identical JSON representation. I have always only mentioned `decimal.Decimal` and the lack of proper way to encode it (while having the proper way of decoding it), and if you read the subject of the OP it is asking for "raw output" (in the encoder, nothing about underlying representation) which if I understand your two options basically corresponds to the second one and is probably addressed elsewhere far more thoroughly.
If you're checking hashes, you need it to be bit-for-bit identical. But if what you REALLY want is for a Decimal to be represented in JSON as a native number, then I think that is a very reasonable feature request. The question is, how should it be done? And there are multiple viable options. I'd recommend, rather than requesting a way to create raw output, that you request a way to either (a) recognize Decimals in the default JSON encoder, both the Python and C implementations thereof; or (b) create a protocol such as __json__ to allow any object to choose how it represents itself. You'll probably find a reasonable level of support for either of those suggestions. ChrisA
After some thinking, it seems that the float is the only case where this "loss of precision" happens. With the current implementation of `JSONEncoder.default` in the standard module, I believe I can serialize pretty much everything else. The "composite" types (i.e. the custom types in the sense of the JSON spec) can be serialized either as a special string or as a map, with the logic implemented on both ends at the application level.

The only possible suspects to suffer from the current `JSONEncoder.default` are the "integral" types in JSON, i.e. integer, float and boolean (null is in this sense "equivalent" to boolean). The boolean should not need any custom encoder, as it can only be represented as one of a well-defined set of representations. The integer might suffer the same fate as the float if there were not native support in Python for big ints. As it is, I can do this and it works as expected:

```
json:orig = {"val": 10000000000000000000000000000000000000000000000000000000000001}
json:pyth = {'val': 10000000000000000000000000000000000000000000000000000000000001}
json:seri = {"val": 10000000000000000000000000000000000000000000000000000000000001}
```

This is an integer value far exceeding the standard binary representation on a 64-bit CPU architecture:

```
In [12]: hex(10000000000000000000000000000000000000000000000000000000000001)
Out[12]: '0x63917877cec0556b21269d695bdcbf7a87aa000000000000001'
```

So with an integer, Python, thanks to its internal handling, parses the int correctly and "silently" upgrades it to a big int, so it does not lose precision (or bit-for-bit/byte-for-byte accuracy). The only type "left in the dark" is the float. It does have an equivalent of the integer's big int (decimal.Decimal), but it is not automatically applied, which is perfectly reasonable, because it would involve a custom type, and probably not many would want or need that.
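The big-int round trip described above can be verified directly (a quick check, using an arbitrary large value):

```python
import json

big = 10**60 + 1  # far beyond what a 64-bit machine integer could hold
text = json.dumps({"val": big})
roundtripped = json.loads(text)["val"]
print(roundtripped == big)  # True: exact, thanks to Python's arbitrary-precision int
```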
On the other hand, being aware of the problem, the module offers the famous `parse_float` keyword argument, which can conveniently be set to `decimal.Decimal` if the user needs an equivalent to the big int for the float. So far this also seems well thought out, because it shows the decoder (or better said, its implementer) was well aware of the float's properties in JSON input and wanted to give the user a way to handle it in their own way. It looks better than simplejson's `use_decimal`, because that one implies only one particular type can be used, while the standard module leaves the choice to the user. On the other hand, in order to implement it efficiently, both the standard module and simplejson made this option an explicit argument which only concerns the float type, so it does not need any "generic raw" decoder infrastructure to support it. So far the way the standard module handles this makes perfect sense.

Now for the encoder part. simplejson got away with `use_decimal` again, because it allowed `Decimal` as the only option. The standard module would need a way to identify the custom codec for the float in order to serialize it "properly". I can see two ways out of it:

1) The standard module could implement something like a `dump_float` keyword argument in its `dump`, which would allow the user to specify which custom type he/she used for the float in the load. The standard encoder would then mark that type and honor the string representation of this object/type as the _raw_ output, either when it internally converts the object (possibly by doing something like str(o)), or when the custom implementation of JSONEncoder.default returns the string.

2) It could implement some specific semantics in the handling of the JSONEncoder.default output which would allow the user to signal to the underlying layer that it needs to write "raw" data to the output stream from the custom encoder, without the need for a keyword argument.
Using a `bytes` object could be that trigger:

```python
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o).encode()
        return super().default(o)
```

Any thoughts on this?
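For what it's worth, with the module as it stands today this sketch does not produce raw output: returning bytes from default() just feeds another unserializable object back into the encoder, so the bytes-as-raw trigger would require changes inside the json module itself. A quick demonstration:

```python
import decimal
import json

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o).encode()  # proposed "raw" trigger, not understood today
        return super().default(o)

failed = False
try:
    json.dumps({"val": decimal.Decimal("0.6441726684570313")}, cls=DecimalEncoder)
except TypeError as exc:
    # The encoder re-enters default() with the bytes object and gives up.
    failed = True
    print("TypeError:", exc)
```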
Richard Musil writes:
After some thinking it seems that the float is only case where this "loss of a precision" happen.
I don't think so. What's happening here is not restricted to "loss of precision". This can happen for any type that has multiple JSON representations of the same internal value, and therefore you cannot round-trip from a JSON representation to the same JSON representation for all of them.

This is true of Unicode normalized forms as well as floats. According to Unicode these are entirely interchangeable. If it's convenient for the Unicode process to change normalization form, according to Unicode you will get the same *characters* out that you fed in, but you will not necessarily get byte-for-byte equality. (This bites me all the time on the Mac, when I write to a file named in NFC and the Mac fs converts to NFD.) Python doesn't do any normalization by default, but it's an inherent feature of Unicode, and there's no (good) way for the codec to know that it's happening.

As far as I can see, any of the proposals requires the cooperating systems to coordinate on a nonstandard JSON dialect[1], and for them to be of practical use, they'll need to have similarly capable internal representations -- which means the programs have to be designed for this cooperation. Pick an internal representation for float (probably Decimal), pick an external (JSON) normalization for Unicode (probably NFC), write the programs to ensure those representations -- using conformant JSON, and hope there are no compound types with multiple JSON representations.

Footnotes:
[1] I write "nonstandard dialect" rather than "non-conformant" because you can serialize floats as '{"Decimal" : 1.0}', and the internal processor just needs to know that such a dict should be automatically converted to Decimal('1.0').
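The Unicode normalization point can be illustrated concretely: the two normalization forms are "the same text" to Unicode but serialize to different JSON output (a small illustration):

```python
import json
import unicodedata

nfc = unicodedata.normalize("NFC", "caf\u00e9")  # e-acute as one code point
nfd = unicodedata.normalize("NFD", "caf\u00e9")  # e plus combining accent
print(nfc == nfd)                                # False: different code point sequences
print(json.dumps(nfc) == json.dumps(nfd))        # False: different JSON text
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: same characters per Unicode
```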
There is no "normalized" representation for JSON. If you look at the "standard", it is pretty simple (json.org). A JSON object is defined solely by its textual representation (a string of characters). The way different parsers choose to represent it in binary form, so they can process it, is an implementation detail; the JSON format stipulates neither any particular (binary) representation nor that all representations have to be the same.

From the JSON point of view, 0.6441726684570313 is a perfectly valid float (or better said _number_, as that is what JSON uses) and 0.6441726684570312 is a perfectly valid _and different_ number, because it differs in the last digit. The fact that both numbers transform into the same value when represented in the IEEE-754 floating point format is a feature of this particular binary representation, and has nothing to do with JSON itself. The underlying JSON parser may as well choose a different representation and preserve arbitrary precision (for example decimal.Decimal). From the JSON point of view there is no ambiguity, nor doubt. Even the number 0.64417266845703130 (note the last 0) is a different JSON object from 0.6441726684570313 (without the last 0). Yes, both represent the same value and if used in calculations will give the same results, but they are different JSON objects.
On Aug 9, 2019, at 07:25, Richard Musil <risa2000x@gmail.com> wrote:
There is no "normalized" representation for JSON. If you look at the "standard" it is pretty simple (json.org). The JSON object is defined solely by its textual representation (string of characters).
json.org is not the standard; RFC 8259 is the standard.* The description at json.org is not just informal and incomplete, it’s wrong about multiple things. For example, it assumes all Unicode characters can be escaped with \uxxxx, that JSON strings are a subset of JS strings despite allowing two characters that JS doesn’t, and that its float format is the same as C’s except for octal and hex, when it isn’t. The RFC nails down all the details, fixes all of the mistakes, and, most relevant here, makes recommendations about what JSON implementations should do if they care about interoperability. None of these are "MUST" recommendations, so you can still call an implementation conforming if it ignores them, but it’s still not a good idea to ignore them.**

And it’s not just your two examples that are different representations of the same float value that implementations should treat as (approximately, to the usual limits of IEEE) equal. So are 100000000.0 and +1E8 and 1.0e+08. If you’ve received the number 1E8 and write it back, you’re going to get 100000000.0. Storing it in Decimal instead of float doesn’t magically fix that. The fact that it does fix the one example you’ve run into so far doesn’t mean that it guarantees byte-for-byte*** round-tripping, just that it happens to give you byte-for-byte round-tripping in that one example. Arguing that we must allow it because we have to allow people to guarantee byte-for-byte round-tripping is effectively arguing that we have to actively mislead people into thinking we’re making a guarantee that we can’t make.

And even if it did solve the problem for numbers, you’d still have the problem of different choices for which characters to escape, different Unicode normalizations in strings, different order of object members, different handling for repeated keys, different separator whitespace in arrays and objects, and so on.
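The 1E8 example above is easy to confirm with the stdlib (a quick check):

```python
import json

# Different spellings of the same JSON number parse to the same float...
print(json.loads("100000000.0") == json.loads("1E8") == json.loads("1.0e+08"))  # True

# ...but writing it back produces only one of the spellings.
print(json.dumps(json.loads("1E8")))  # 100000000.0
```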
All of these are just as much a problem for round-tripping a different library’s JSON as they are for generating the same thing from scratch. Other text-based formats solve the hashing problem by specifying a canonical form: you can’t guarantee that any implementation can round-trip another implementation’s output, but you can guarantee that any implementation that claims to support the canonical form will produce the same canonical form for the same inputs, so you just hash the canonical form. But JSON intentionally does not have a canonical form. The intended way to solve the hashing problem is to not use JSON. Finally, all of your proposals for solving this so far either allow people to insert arbitrary strings into JSON, or don’t work. For example, using bytes returns from default to signal “insert these characters into the stream instead of encoding this value”**** explicitly lets people insert anything they want, even as you say you don’t want to allow that. I don’t get why you aren’t just proposing that the stdlib adopt simplejson’s use_decimal flag (and trying to figure out a way to make that work) instead of trying to invent something new and more general and more complicated even though it can’t actually be used for anything but decimal. But again, use_decimal does not solve the round-tripping, canonicalization, or hashing problem at all, so if your argument is that we should include it to solve that problem, you need a new argument. —- * Python, and I believe simplejson, still conforms to RFC 7159, which was the standard until December 2017, so if you wanted to make an argument from 7159, that _might_ be valid, although I think if there were a significant difference, people would be more likely to take that as a reason to update to 8259. ** And in fact, even the informal description doesn’t say what you want. It defines numbers are “very much like a C or Java number”. 
C and Java numbers are explicitly representations of the underlying machine type (for C) or of IEEE binary64 (for Java).

*** I don’t get why you’re obsessed with the question of bit-for-bit vs. byte-for-byte. Are you worried about behavior on an ancient PDP-7 where you have 9-bit bytes but ignore the top bit for characters or something?

**** And why would bytes ever mean “exactly these Unicode characters”? That’s what str means; bytes doesn’t, and in fact can’t unless you have an out-of-band-specified encoding; that’s half the reason we have Python 3. And even if you did confusingly specify that bytes can be used for “exactly the characters these bytes decode to as UTF-8”, that still doesn’t specify what happens if the bytes object contains, say, a newline, or a backslash followed by an a, or a non-BMP Unicode character. The fact that you weren’t expecting to give it any of those things doesn’t mean you don’t have to design what happens if someone else does.
Alright, when I made the reference to json.org I made a mistake; I should have referenced the RFC. You are right that the RFC makes it clear that it is aware of possible interoperability problems and that some "common ground" should be acceptable for the underlying representation (https://tools.ietf.org/html/rfc8259#section-6). I believe my proposal does not go against that. I have already stated in my previous mail to Paul that the default behavior (the current implementation), which uses the platform-native float binary representation, is fine. Adding support so that a custom type may implement it is not the same as asking to have it implemented in the standard (default) implementation.

As it turned out during the discussion here (at least for me and a few others), asking for the general feature of "being able to insert custom (raw) output" is not really necessary; it would be sufficient to give this support only to the `float` type, because it is the only type which is both native to Python and to JSON and at the same time could usefully be handled by a custom type.

Concerning bit-for-bit vs. byte-for-byte: no obsession here. It is just that the byte is the usual granularity of (text) stream processors (either in or out), so I felt that byte-for-byte was more fitting.
Richard Musil wrote:
Even the number 0.64417266845703130 (note the last 0) is different JSON object from 0.6441726684570313 (without the last 0).
And Python's Decimal type preserves that distinction:
>>> Decimal("0.64417266845703130")
Decimal('0.64417266845703130')
>>> Decimal("0.6441726684570313")
Decimal('0.6441726684570313')
-- Greg
Stephen J. Turnbull wrote:
Richard Musil writes:
After some thinking it seems that the float is only case where this "loss of a precision" happen.
This is true of Unicode normalized forms as well as floats.
But in Python you at least have the option of not normalising your deserialised JSON strings. What's missing is the ability to deserialise JSON floats to a non-lossy type. This seems like a reasonable thing to want, even if you're not intending to round-trip anything. -- Greg
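(For reference, the lossless-decode half already exists in the stdlib via the parse_float hook; a short sketch, using the value from the original post:)

```python
import decimal
import json

text = '{"x": 0.6441726684570313}'

# Default: JSON floats become Python floats, which may not preserve the
# exact textual digits (as in the OP's case, where the last digit changes).
as_float = json.loads(text)["x"]

# With parse_float, the raw digit string is handed to Decimal instead,
# so the textual representation is preserved exactly.
as_decimal = json.loads(text, parse_float=decimal.Decimal)["x"]
print(as_decimal)  # 0.6441726684570313
```

The asymmetry the thread is about: json.dumps has no corresponding hook, so dumping as_decimal back raises TypeError unless you write a custom encoder.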
On Thu, 8 Aug 2019 at 18:53, Chris Angelico <rosuav@gmail.com> wrote:
On Fri, Aug 9, 2019 at 6:31 AM Richard Musil <risa2000x@gmail.com> wrote:
Chris Angelico wrote:
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
However, this is a much broader topic, and if you want to push for that, I would recommend starting a new thread. As Andrew pointed out, trying to get bit-for-bit identical JSON representations out of different encoders is usually a bad idea.
I am not sure I have ever asked for bit-for-bit identical JSON
representation. I have always only mentioned `decimal.Decimal` and the lack of proper way to encode it (while having the proper way of decoding it), and if you read the subject of the OP it is asking for "raw output" (in the encoder, nothing about underlying representation) which if I understand your two options basically corresponds to the second one and is probably addressed elsewhere far more thoroughly.
If you're checking hashes, you need it to be bit-for-bit identical. But if what you REALLY want is for a Decimal to be represented in JSON as a native number, then I think that is a very reasonable feature request. The question is, how should it be done? And there are multiple viable options.
I'd recommend, rather than requesting a way to create raw output, that you request a way to either (a) recognize Decimals in the default JSON encoder, both the Python and C implementations thereof; or (b) create a protocol such as __json__ to allow any object to choose how it represents itself. You'll probably find a reasonable level of support for either of those suggestions.
I spent some minutes now trying to encode a Decimal as a JSON "Number" using the Python native encoder - it really is not possible. The only customization point the Python encoders allow is a method ("default") that has to return a "known" object type - and if it returns a string, it is included with quotes in the final output - which defeats writing numbers. So there is clearly a need for more customization capabilities, and a __json__ protocol (allowing one to return an already serialized string, with quotes, if needed, included in the serialization) seems to be a good way to go. There is no need to change the Python object layout to include a "__json__" slot - it is just Python's json encoder (and third parties') that would need to check for it. I am calling here that we settle for that - (I think this would need a PEP, right?) js -><-
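(For reference, a minimal demonstration of the behaviour being described - the default hook from the OP, with the super() call fixed:)

```python
import decimal
import json

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)          # returned string gets re-encoded...
        return super().default(o)  # (super(), not the OP's bare `super.default`)

out = json.dumps({"x": decimal.Decimal("0.6441726684570313")},
                 cls=DecimalEncoder)
print(out)  # {"x": "0.6441726684570313"}  <- quoted string, not a JSON number
```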
ChrisA _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/4KXUIV... Code of Conduct: http://python.org/psf/codeofconduct/
On 09/08/2019 16:05, Joao S. O. Bueno wrote:
I spent some minutes now trying to encode a Decimal as a JSON "Number" using the Python native encoder - it really is not possible. The level of customization for Python encoders just allows a method ("default") that has to return a "known" object type - and if it returns a string, it is included with quotes in the final output - which defeats writing numbers.
I still need some persuasion that this is not the right behaviour as it stands. I get what you want -- "this string of digits is the representation I want to use, please don't put quotes around it" -- but I can't help but feel that it will only encourage more unrealistic expectations. -- Rhodri James *-* Kynesim Ltd
Rhodri James wrote:
I get what you want -- "this string of digits is the representation I want to use, please don't put quotes around it" -- but I can't help but feel that it will only encourage more unrealistic expectations.
I think "consenting adults" applies here. Yes, you could use it to produce invalid JSON, so it's your responsibility to not do that. And if you do so accidentally, you'll find out about it when other things (and quite possibly your own thing) fail to read it. If you really insist on being strict, it could require you to return a special wrapper type that takes a string of digits and checks that it conforms to the syntax of a JSON number. Come to think of it... you could use Decimal as that wrapper type! -- Greg
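(As an aside: without any change to the module, the closest workaround today is a fragile sentinel-substitution hack - serialize each Decimal as a unique placeholder string, then splice the raw digits into the finished text. A hypothetical sketch, not a recommendation:)

```python
import decimal
import json
import uuid

class RawDecimalEncoder(json.JSONEncoder):
    """Workaround sketch: each Decimal becomes a unique placeholder
    string, which is replaced by its raw digits after encoding."""
    def __init__(self, **kw):
        super().__init__(**kw)
        self._raw = {}

    def default(self, o):
        if isinstance(o, decimal.Decimal):
            key = uuid.uuid4().hex       # placeholder unlikely to collide
            self._raw[key] = str(o)
            return key
        return super().default(o)

    def encode(self, o):
        text = super().encode(o)
        for key, digits in self._raw.items():
            text = text.replace('"%s"' % key, digits)  # drop the quotes
        return text

enc = RawDecimalEncoder()
print(enc.encode({"x": decimal.Decimal("0.64417266845703130")}))
# {"x": 0.64417266845703130}
```

This is exactly the "consenting adults" situation: nothing checks that the spliced-in digits are a valid JSON number.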
On Aug 9, 2019, at 17:53, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
If you really insist on being strict, it could require you to return a special wrapper type that takes a string of digits and checks that it conforms to the syntax of a JSON number.
Come to think of it... you could use Decimal as that wrapper type!
No you can’t. Decimal accepts strings that aren’t valid as JSON numbers, like `.3`, or `nan`, just as float does and C’s strtod, etc. (And it accepts strings that are valid as JSON numbers, but not recommended for interoperability, like `1.0E999`.) If you want to know whether something can be parsed as a JSON number, the obvious thing to do is call json.loads on it. (And if you want to know whether it’s something that’s “interoperable”, check whether loads returns a float—and maybe whether that float is equal to your input.) It’s not exactly the hardest thing to parse in the world, but why duplicate the parser (and potentially create new bugs) if you already have one?
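(For what it's worth, such a check is a few lines on top of the existing parser; the names here are illustrative only:)

```python
import json

def _reject(name):
    # Refuse the legacy NaN/Infinity/-Infinity tokens.
    raise ValueError(name)

def is_json_number(text):
    """Validate a candidate token with the stdlib parser itself."""
    try:
        value = json.loads(text, parse_constant=_reject)
    except ValueError:          # includes json.JSONDecodeError
        return False
    return isinstance(value, (int, float)) and not isinstance(value, bool)

print(is_json_number("0.3"))   # True
print(is_json_number(".3"))    # False (Decimal accepts it; the RFC doesn't)
print(is_json_number("NaN"))   # False (rejected via parse_constant)
```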
Andrew Barnert wrote:
No you can’t. Decimal accepts strings that aren’t valid as JSON numbers, like `.3`,
That's not a problem as long as it doesn't serialise them that way, which it wouldn't:
>>> str(Decimal('.3'))
'0.3'
or `nan`,
The serialiser could raise an exception in that case. BTW, I just checked what it does with floats:
>>> json.dumps(float('nan'))
'NaN'
So it seems that it currently doesn't care about strict conformance here. -- Greg
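(Strict conformance is in fact available, but only opt-in, via the allow_nan flag:)

```python
import json

print(json.dumps(float('nan')))        # NaN  (legacy, non-RFC output, the default)

try:
    json.dumps(float('nan'), allow_nan=False)
except ValueError as exc:
    print(exc)                         # refuses non-finite floats entirely
```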
On Aug 9, 2019, at 19:09, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Andrew Barnert wrote:
No you can’t. Decimal accepts strings that aren’t valid as JSON numbers, like `.3`,
That's not a problem as long as it doesn't serialise them that way, which it wouldn't:
Your suggestion was that we could allow people to serialize whatever they want, and then use the Decimal constructor as a sanity check to make sure it’s actually legal. If someone’s code emits 0.3 and we check it with Decimal instead of with the JSON parser, it gives the wrong answer.
>>> str(Decimal('.3'))
'0.3'
or `nan`,
The serialiser could raise an exception in that case.
How? You want to validate by checking if it can be passed to the Decimal constructor, and then add special cases on top of that?
BTW, I just checked what it does with floats:
>>> json.dumps(float('nan'))
'NaN'
So it seems that it currently doesn't care about strict conformance here.
It does care about conformance, but it also cares about interoperability. (This is explained in the docs.) JS implementations before ES5 serialized and parsed NaN, Infinity, and -Infinity, even though the informal spec said they shouldn’t. Lots of other libraries followed suit. (Some also accept or even produce nan, inf, -inf, because that’s what most C libraries output—as do Python, Perl, etc.—but many do not.) Modern versions of JS mandate that those values are all emitted as null; some other libraries raise an error; some still use the “legacy” values. Similarly, they vary on what they accept.

To deal with all of this, Python’s json module gives you keyword options for both dump and load to select what you want. If you use Decimal to validate the output of a custom emitter, it will not only ignore those flags, it will also accept things that load wouldn’t accept even if the flag is enabled.

I honestly don’t see the point in a sanity check. If we make it possible but not too easy for people to emit whatever raw strings they want into the middle of JSON, they presumably know what they’re doing, and if they really want to produce a non-standard dialect that their code on the other end can handle, why not just let them? But if we do need such a check, using the Decimal constructor is not a good check. It’s similar to the spec, but not the same, and why validate that someone’s output passes a similar specification instead of the one they’re trying to use? Especially when we already have tested (and configurable) code that exactly parses the actual spec (and common variations on it, depending on the config flags)?
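(The load-side keyword option being referred to is parse_constant, which intercepts exactly these three tokens:)

```python
import json
import math

# Accepted by default, for interoperability with the legacy JS dialect:
values = json.loads('[Infinity, -Infinity, NaN]')
print(math.isinf(values[0]))   # True

# parse_constant sees the raw token and can substitute (or raise to reject):
print(json.loads('[NaN]', parse_constant=lambda tok: tok))  # ['NaN']
```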
On Aug 8, 2019, at 3:55 PM, Chris Angelico <rosuav@gmail.com> wrote:
There are two broad suggestions that came out of that thread and others, and I think it may be worth reopening them.
1) Should JSONEncoder (the class underlying json.dumps) natively support decimal.Decimal, and if so, can it avoid importing that module unnecessarily?
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
I'm inclined towards the protocol, since there are protocols for various other encoders (eg deepcopy, pickle), and it avoids the problem of json importing decimal. It can also be implemented entirely as a third-party patch, although you'd need to subclass Decimal to add that method.
I proposed something similar about a year ago [1]. I really like the idea of a protocol for this. Especially since the other encoders already use this approach. Should I reboot this approach? The implementation was really simple [2]. - dave [1]: https://mail.python.org/archives/list/python-ideas@python.org/thread/ZC4OOAV... [2]: https://github.com/dave-shawley/cpython/pull/2
On Mon, Aug 12, 2019 at 10:09 AM David Shawley <daveshawley@gmail.com> wrote:
On Aug 8, 2019, at 3:55 PM, Chris Angelico <rosuav@gmail.com> wrote:
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
I'm inclined towards the protocol, since there are protocols for various other encoders (eg deepcopy, pickle), and it avoids the problem of json importing decimal. It can also be implemented entirely as a third-party patch, although you'd need to subclass Decimal to add that method.
I proposed something similar about a year ago [1]. I really like the idea of a protocol for this. Especially since the other encoders already use this approach. Should I reboot this approach? The implementation was really simple [2].
[1]: https://mail.python.org/archives/list/python-ideas@python.org/thread/ZC4OOAV... [2]: https://github.com/dave-shawley/cpython/pull/2
As proposed here, this is unable to solve the original problem, because whatever's returned gets encoded (so if you return a string, it will be represented as a string). It'd need to NOT re-encode it, thus allowing an object to choose its own representation.

Minor bikeshedding: protocols like this are more usually dunders, so __json__ or __jsonformat__ would be a better spelling. I'm going to assume __json__ here, but other spellings are equally viable.

For the purposes of documentation, built-in types could be given __json__ methods, though the optimization in the C version would mean they're actually bypassed.

class int:
    def __json__(self):
        return str(self)

class float:
    def __json__(self):
        if math.isfinite(self):
            return str(self)
        return "null"

class Decimal:
    def __json__(self):
        return str(self)

It may be of value to pass the JSONEncoder instance (or a callback) to this method, which would allow the object to say "render me the way you would render this". Thoughts?

ChrisA
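(To make the "NOT re-encode it" requirement concrete, here is a toy standalone serializer, not a JSONEncoder patch, in which a __json__ result is spliced in verbatim; JSONDecimal is a hypothetical Decimal subclass:)

```python
import decimal
import json

def dumps(obj):
    """Toy recursive serializer: __json__ output is inserted raw."""
    if hasattr(type(obj), '__json__'):
        return type(obj).__json__(obj)
    if isinstance(obj, dict):
        return '{%s}' % ', '.join(
            '%s: %s' % (json.dumps(str(k)), dumps(v))
            for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return '[%s]' % ', '.join(dumps(v) for v in obj)
    return json.dumps(obj)   # other scalars: defer to the stdlib

class JSONDecimal(decimal.Decimal):
    def __json__(self):
        return str(self)     # raw digits, no quotes

print(dumps({"x": JSONDecimal("0.64417266845703130")}))
# {"x": 0.64417266845703130}
```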
On Sun, Aug 11, 2019, at 20:42, Chris Angelico wrote:
class float:
    def __json__(self):
        if math.isfinite(self):
            return str(self)
        return "null"
Er, to be clear, the current JSON decoder returns 'Infinity', '-Infinity', or 'NaN', which are invalid JSON, not null. This behavior (vs raising an exception) is governed by a flag. I'm mainly bringing this up to point out that the concerns about a raw protocol being able to generate invalid JSON seem somewhat overblown and that "people should be able to generate invalid JSON if they want to" is not without precedent.
On Aug 14, 2019, at 09:22, Random832 <random832@fastmail.com> wrote:
On Sun, Aug 11, 2019, at 20:42, Chris Angelico wrote:

class float:
    def __json__(self):
        if math.isfinite(self):
            return str(self)
        return "null"
Er, to be clear, the current JSON decoder returns 'Infinity', '-Infinity', or 'NaN', which are invalid JSON, not null. This behavior (vs raising an exception) is governed by a flag.
But that’s a special case, as already discussed on this thread. Those values are invalid JSON according to the RFC. But, for historical reasons, passing around these three values used to be near-ubiquitous, and they’re still pretty common even 6 years after the ECMA spec and RFC definitely said they’re illegal. So, a library that refuses to cooperate with that won’t interoperate properly with zillions of existing services and programs, and which side is the user going to blame?
I'm mainly bringing this up to point out that the concerns about a raw protocol being able to generate invalid JSON seem somewhat overblown and that "people should be able to generate invalid JSON if they want to" is not without precedent.
Sure, and as soon as you discover another special case that millions of existing programs expect to be handled in a way that violates the RFC, that somehow nobody has noticed in the last 16 years, file a bug to handle that with another special flag. ;)
On Thu, Aug 15, 2019 at 2:23 AM Random832 <random832@fastmail.com> wrote:
On Sun, Aug 11, 2019, at 20:42, Chris Angelico wrote:
class float:
    def __json__(self):
        if math.isfinite(self):
            return str(self)
        return "null"
Er, to be clear, the current JSON decoder returns 'Infinity', '-Infinity', or 'NaN', which are invalid JSON, not null. This behavior (vs raising an exception) is governed by a flag. I'm mainly bringing this up to point out that the concerns about a raw protocol being able to generate invalid JSON seem somewhat overblown and that "people should be able to generate invalid JSON if they want to" is not without precedent.
The examples I gave were meant to be toys, but you're right, this doesn't match current behaviour. For an actual PEP it would be better to be more accurate to the way json.dumps currently works; for this, I didn't really care too much :) Side point: Does anyone else think it was an egotistical idea to create JSON as a non-versioned specification? "I can't possibly get this wrong". And now look what's happened. ChrisA
On Aug 14, 2019, at 11:38, Chris Angelico <rosuav@gmail.com> wrote:
Side point: Does anyone else think it was an egotistical idea to create JSON as a non-versioned specification? "I can't possibly get this wrong". And now look what's happened.
Well, it was supposed to be an antidote to complicated specifications like XML. The syntax fits on a yellow sticky note (sort of), the semantics are “whatever your browser’s eval says”, and nobody’s ever going to use it for anything but passing around DHTML info. Even though Crockford and friends were using it to pass that info from Java services to a JS client pretty early on, they didn’t even have a JSON library for Java, just hand-written format strings they filled with the data.

So, the idea that there might one day be a series of three RFCs (plus ECMA and ISO specs) was probably not so much horrifying as laughably implausible.

Of course now the world has learned its lesson, so nobody will ever do that again. YAML and TOML have version numbers, even if they don’t mean what you expect and there’s no way to tell what version number a given document is. BSON has 1.0 and 1.1, each of which matches some specific version of JSON, even if that version isn’t specified anywhere, and, while there’s no way to tell which version a given document was written in, 1.1 is “mostly” backward compatible, so that’s good enough, right? The various non-XML RDF/semantic-triple notations mostly don’t have their own versions, but at least if you search hard enough you can find which version of RDF they currently notate and hope that hasn’t changed since the document was created. MessagePack has a version field, even if it’s optional and deprecated and version numbers were localtimes that couldn’t be looked up anywhere. Avro is intentionally unversioned because it will never change, according to version 1.8 of the spec. And so on. We’re in great shape for the future. :)
There's JSON5, which supports comments, trailing commas, IEEE 754 ±Infinity and NaN, [...] https://json5.org/

There's also JSON-LD, which supports xsd:datetime, everything that can be expressed as RDF, @context vocabulary, compact and expanded representations, @id, and lots of other cool features that formats like CSVW utilize. https://json-ld.org/spec/latest/ https://www.w3.org/TR/tabular-data-primer/

It seems like just yesterday we were adding JSON to the standard library. Is there a way to save pickles without allowing executable code to be stored or executed at deserialization time? PyArrow is fast, there's also a JS implementation, does SIMD, and zero-copy reads.

Native serialization and deserialization support for either or both JSON5, JSON lines, and JSON-LD would be great to have.

On Wednesday, August 14, 2019, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Aug 14, 2019, at 11:38, Chris Angelico <rosuav@gmail.com> wrote:
Side point: Does anyone else think it was an egotistical idea to create JSON as a non-versioned specification? "I can't possibly get this wrong". And now look what's happened.
Well, it was supposed to be an antidote to complicated specifications like XML. The syntax fits on a yellow sticky note (sort of), the semantics are “whatever your browser’s eval says”, and nobody’s ever going to use it for anything but passing around DHTML info. Even though Crockford and friends were using it to pass that info from Java services to a JS client pretty early on, they didn’t even have a JSON library for Java, just hand-written format strings they filled with the data.
So, the idea that there might one day be a series of three RFCs (plus ECMA and ISO specs) was probably not so much horrifying as laughably implausible.
Of course now the world has learned its lesson, so nobody will ever do that again. YAML and TOML have version numbers, even if they don’t mean what you expect and there’s no way to tell what version number a given document is. BSON has 1.0 and 1.1, each of which matches some specific version of JSON, even if that version isn’t specified anywhere, and, while there’s no way to tell which version a given document was written in, 1.1 is “mostly” backward compatible, so that’s good enough, right? The various non-XML RDF/semantic-triple notations mostly don’t have their own versions, but at least if you search hard enough you can find which version of RDF they currently notate and hope that hasn’t changed since the document was created. MessagePack has a version field, even if it’s optional and deprecated and version numbers were localtimes that couldn’t be looked up anywhere. Avro is intentionally unversioned because it will never change, according to version 1.8 of the spec. And so on. We’re in great shape for the future. :)
On Wednesday, August 14, 2019, Wes Turner <wes.turner@gmail.com> wrote:
There's JSON5; which supports comments, trailing commas, IEEE 754 ±Infinity and NaN, [...] https://json5.org/
There's also JSON-LD, which supports xsd:datetime, everything that can be expressed as RDF, @context vocabulary, compact and expanded representations, @id, and lots of other cool features that formats like CSVW utilize.
(Edit) https://www.w3.org/TR/json-ld11/
A JSON-based Serialization for Linked Data
https://www.w3.org/TR/tabular-data-primer/
It seems like just yesterday we were adding JSON to the standard library. Is there a way to save pickles without allowing executable code to be stored or executed at deserialization time? PyArrow is fast, there's also a JS implementation, does SIMD, and zero-copy reads.
Native serialization and deserialization support for either or both JSON5, JSON lines, and JSON-LD would be great to have.
On Aug 14, 2019, at 19:48, Wes Turner <wes.turner@gmail.com> wrote:
Native serialization and deserialization support for either or both JSON5, JSON lines, and JSON-LD would be great to have.
PyPI has packages for all of these things (plus the two not-quite-identical variations on JSONlines, and probably other JSON-related formats as well, not to mention alternative packages for JSON itself that are better for various special uses, like pull-parsing giant docs). I don’t think that there’s any particular reason any of them need to be included in the stdlib. Especially since at least some of them are still improving at a rate much faster than the stdlib can. It might make sense for the json module docs to mention some of the related formats. But since none of the packages looks like a category killer, I don’t think that could come with links to specific packages to handle any of them, just descriptions of what they are and why they’re not the same thing as JSON. And even that might not be necessary. Most people looking for JSON actually do need JSON, not, say, JSON-LD, and people who do need JSON-LD probably know they need it, won’t expect a module called json to provide it, and can go find a library for it themselves. Still, if Python had the “loose recommendations from core devs” links thing discussed in another thread, some of these packages might qualify. If someone really wants that, they should probably read over what’s been said about it so far and start a separate thread.
Setting aside the arguments for just having reusable linked-data JSON-LD in the first place (which is also JSON that a non-JSON-LD app can pretty easily consume), a few questions then are:

Should __json__() be versioned?

Should __json__() return JSON5 (with IEEE 754 ±Infinity and NaN)?

What prevents existing third-party solutions for encoding/serializing/marshalling/dumping and decoding/deserializing/unmarshalling/loading JSON, JSON5, and JSON-LD from standardizing on e.g. __json__(), __json5__(), and __jsonld11__()?

On Wednesday, August 14, 2019, Andrew Barnert <abarnert@yahoo.com> wrote:
On Aug 14, 2019, at 19:48, Wes Turner <wes.turner@gmail.com> wrote:
Native serialization and deserialization support for either or both
JSON5, JSON lines, and JSON-LD would be great to have.
PyPI has packages for all of these things (plus the two not-quite-identical variations on JSONlines, and probably other JSON-related formats as well, not to mention alternative packages for JSON itself that are better for various special uses, like pull-parsing giant docs).
I don’t think that there’s any particular reason any of them need to be included in the stdlib. Especially since at least some of them are still improving at a rate much faster than the stdlib can.
It might make sense for the json module docs to mention some of the related formats. But since none of the packages looks like a category killer, I don’t think that could come with links to specific packages to handle any of them, just descriptions of what they are and why they’re not the same thing as JSON.
And even that might not be necessary. Most people looking for JSON actually do need JSON, not, say, JSON-LD, and people who do need JSON-LD probably know they need it, won’t expect a module called json to provide it, and can go find a library for it themselves.
Still, if Python had the “loose recommendations from core devs” links thing discussed in another thread, some of these packages might qualify. If someone really wants that, they should probably read over what’s been said about it so far and start a separate thread.
On Aug 14, 2019, at 20:35, Wes Turner <wes.turner@gmail.com> wrote:
A few questions then are:
Should __json__() be versioned?
Why? While there are sort of multiple versions of JSON (pre-spec, ECMA 404, and the successive RFCs), even when you know which one you’re dealing with, there’s not really anything you’d want to do differently as far as returning different objects to be serialized. You might return different objects based on the allow_nan flag, and if the proposed use_decimal is added, even more likely—but those aren’t controlled by the JSON version.

And meanwhile, you seem to be expecting that one of the many JSON-based and JSON-like formats is going to become an official successor to JSON and take over its ubiquity. But that really isn’t likely (and you aren’t going to make it any more likely by arguing for it on python-ideas, and even less so by arguing for four of them at once, which all pull in different directions). And, even if it did happen, it wouldn’t be a next version of JSON. And, even if it were, there’s nothing code could do with a version number today to take advantage of whatever format becomes available in the future, given that nobody could guess what format that might be. So there’s no advantage to adding this now.
Should __json__() return JSON5
Definitely not. Just because JSON5 is a superset of JSON doesn’t mean you can produce JSON5 and give it to services, programs, etc. expecting JSON, any more than you can give them YAML 1.1 or JavaScript source or any other superset of JSON. This would make the JSON module unusable for 99.9999% of its existing uses.
What prevents existing third-party solutions for encoding/serializing/marshalling/dumping and decoding/deserializing/unmarshalling/loading JSON, JSON5, and JSON-LD from standardizing on e.g. __json__(), __json5__(), and __jsonld11__()?
The fact that __dunder__ names are reserved for Python itself and Python implementations. This is why simplejson changed its protocol to use a method named for_json. Other than that, this really isn’t a question for python-ideas; it’s a question for the authors of the existing third-party solutions. If there are three different JSON-LD packages and they all have a “here’s a substitute object to serialize” protocol but they’re not compatible with each other, file bugs against those packages asking them to be compatible with each other. Either that, or go out into the world and proselytize for one of them to get it to “category killer” status so the others either go away or adopt its de facto standard.
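(For comparison: simplejson's for_json is a substitute-object protocol, which the stdlib can already mimic in a default hook. This sketch uses only the stdlib, with a hypothetical Point type; note that, unlike a raw protocol, it still cannot emit an unquoted number for Decimal:)

```python
import json

class ForJSONEncoder(json.JSONEncoder):
    """Stdlib mimic of simplejson's for_json idea: an object exposing a
    for_json() method is replaced by its return value, then encoded normally."""
    def default(self, o):
        for_json = getattr(o, 'for_json', None)
        if callable(for_json):
            return for_json()
        return super().default(o)

class Point:
    """Hypothetical example type, not from the thread."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def for_json(self):
        return {"x": self.x, "y": self.y}

print(json.dumps(Point(1, 2), cls=ForJSONEncoder))  # {"x": 1, "y": 2}
```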
On Thursday, August 15, 2019, Andrew Barnert <abarnert@yahoo.com> wrote:
On Aug 14, 2019, at 20:35, Wes Turner <wes.turner@gmail.com> wrote:
A few questions then are:
Should __json__() be versioned?
Why? While there are sort of multiple versions of JSON (pre-spec, ECMA404, and the successive RFCs), even when you know which one you’re dealing with, there’s not really anything you’d want to do differently as far as returning different objects to be serialized. You might return different objects based on the allow_nan flag, and if the proposed use_decimal is added, even more likely—but those aren’t controlled by the JSON version.
So, there is an opportunity for JSON library developers to write a PEP that standardizes the argument specifications (maybe as optional **kwargs) for e.g. ._json() or .for_json() so that flags like allow_nan and use_decimal MAY be honored. And then, years later, we can implement that in the standard library, too? Oddly, sorry, I can't remember where the standard library JSON module originally came from? Someone donated and relicensed it, IIRC. Is it simplejson? https://github.com/python/cpython/blob/master/Modules/_json.c https://devguide.python.org/experts/ :
json bob.ippolito (inactive), ezio.melotti, rhettinger
And meanwhile, you seem to be expecting that one of the many JSON-based and JSON-like formats is going to become an official successor to JSON and take over its ubiquity. But that really isn’t likely (and you aren’t going to make it any more likely by arguing for it on python-ideas, and even less so by arguing for four of them at once, which all pull in different directions). And, even if it did happen, it wouldn’t be a next version of JSON. And, even if it were, there’s nothing code could do with a version number today to take advantage of whatever format becomes available in the future, given that nobody could guess what format that might be. So there’s no advantage to adding this now.
So, are ±Infinity supported or not?
Should __json__() return JSON5?
Definitely not. Just because JSON5 is a superset of JSON doesn’t mean you can produce JSON5 and give it to services, programs, etc. expecting JSON, any more than you can give them YAML 1.1 or JavaScript source or any other superset of JSON. This would make the JSON module unusable for 99.9999% of its existing uses.
So, instead third-party libraries and eventually stdlib json could have ._json_(spec='json5'), where spec='json5' implies allow_nan=True:

._json_(spec='json5')
._json_(spec='json5', allow_nan=True)  # redundant
._json_(spec='json5', use_decimal=True)  # redundant but necessary for backwards compatibility (*)
What prevents existing third-party solutions for encoding/serializing/marshalling/dumping and decoding/deserializing/unmarshalling/loading JSON, JSON5, and JSON-LD from standardizing on e.g. __json__(), __json5__(), and __jsonld11__()?
The fact that __dunder__ names are reserved for Python itself and Python implementations. This is why simplejson changed its protocol to use a method named for_json.
The Jupyter rich display object protocol _repr_json_() does not support kwargs at this time; practically, where would use_decimal be specified in a notebook?
Other than that, this really isn’t a question for python-ideas; it’s a question for the authors of the existing third-party solutions. If there are three different JSON-LD packages and they all have a “here’s a substitute object to serialize” protocol but they’re not compatible with each other, file bugs against those packages asking them to be compatible with each other. Either that, or go out into the world and proselytize for one of them to get it to “category killer” status so the others either go away or adopt its de facto standard.
It would be helpful to have a PEP - even a rejected but widely implemented one - specifying the method name and kwargs that standard library and third-party solutions MAY implement for JSON serialization.
This is all making me think it's time for a broader discussion / PEP: The future of the JSON module. *maybe* the way forward is to try to make a new "category killer" JSON lib (or at least a platform for standard protocols). But it also may make sense to start within Python itself -- at least for the protocols. These are a few things that seem to be on the table:

* support for extensions to the JSON standard (JSON5, etc.)
* support for serializing arbitrary precision decimal numbers
* support for allowing custom serializations (i.e. not just what can be serialized, but controlling exactly how)
* a "dunder protocol" for customization
* what role, if any, should the json module have in ensuring only valid JSON is produced?

Maybe this is all opening a big can of worms, but the conversation seems to have strayed widely enough that maybe we should embrace that and try to get a handle on it.
Originally, oddly, sorry I can't remember where the standard library JSON module came from? Someone donated and relicensed it; IIRC. Is it simplejson?
This is a good point -- I'm not sure it matters exactly, but it is worth recalling that decisions about the details (like allowing arbitrary separators) were not necessarily made as part of a careful discussion. Also -- is there a particular core dev with "ownership" of the json module? -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Aug 15, 2019, at 10:23, Christopher Barker <pythonchb@gmail.com> wrote:
This is all making me think it's time for a broader discussion / PEP:
The future of the JSON module.
I think this is overreacting. There’s really only two issues here, and neither one is that major. But nobody has shown any need for any of the other stuff.
* support for extensions to the JSON standard (JSON5, etc.)
One guy saw a bunch of shiny new protocols (half of which are actually stagnant dead protocols) and thought that if the Python stdlib added support for all of them, they’d all take over the world. That’s not what the stdlib is for. Plus, most of them aren’t even JSON extensions, they’re other formats that build on top of JSON. For example, a JSONlines document is not a superset of a JSON document, it’s a sequence of lines, each of which is a plain-old JSON document (with the restriction that any newlines must be escaped, which the stdlib can already handle). There are multiple JSONlines libraries out there that work fine today using the stdlib json module or their favorite third-party package; none of their authors have asked for any new features. So, what should the stdlib add to support JSONlines? Nothing. Maybe a brief section in the docs about related protocols, explaining why they’re not JSON but how they’re related, would be helpful. But I’m not sure even that is needed. How many people come to the json docs looking for how to parse NDJ?
* support for serializing arbitrary precision decimal numbers
Multiple people want this. The use_decimal flag from simplejson is a proven way to do it. The only real objection is “Fine, but can you find a way to do it without slowing down imports for people who don’t use it?” People have suggested answers, and either someone will implement one, or not. It might be nice to spin this off into its own thread to escape all the baggage. Going the other way and holding it hostage to a “let’s redesign everything from scratch before doing that” seems like a bad idea to me.
* support for allowing custom serializations (i.e. not just what can be serialized, but controlling exactly how)
Only one person wants this. And he keeps saying he doesn’t want it. And he only wants it for Decimal or other float-substitutes, not for all types. And it won’t actually solve the problem he wants to solve. If there were prior art on this to consider, it might be worth trying to design a solution that has all the strengths of each of the existing answers. But if no JSON package has such a feature, because nobody needs it, then the answer is simple: don’t invent the first-ever solution for the stdlib module.
* a "dunder protocol" for customization
I personally don’t see much point in this, as it’s trivial to do that yourself. (I had a program that depended on simplejson’s for_json protocol; making it work with the stdlib and ujson was just a matter of writing a function that checks for and tries for_json, and partialling that in as the default function. We’re talking 5 minutes of work, and easily reusable in your personal toolbox.) But if people really want it, the only real question is what to call it. And __json__ seems like the obvious answer. And again, it seems like this would benefit from being separated out to its own small discussion, not being bundled into a huge one.
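The shim described here is indeed only a few lines; a sketch of one way to do it (the class and function names are illustrative, not from any library):

```python
import json
from functools import partial

def _for_json_default(o):
    # Try simplejson's for_json protocol before giving up.
    for_json = getattr(o, "for_json", None)
    if callable(for_json):
        return for_json()
    raise TypeError(
        f"Object of type {type(o).__name__} is not JSON serializable")

# Partial in the hook so callers use this exactly like json.dumps.
dumps = partial(json.dumps, default=_for_json_default)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def for_json(self):
        # Return a substitute object the stock encoder understands.
        return {"x": self.x, "y": self.y}
```

The same `dumps` works unchanged with ujson or simplejson, since all three accept a `default` callable.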
* what role, if any, should the json module have in ensuring only valid JSON is produced?
I think where it is today is fine. The allow_nan flag is necessary for interoperability. The separators option may not be the best possible design, but it rarely if ever causes problems for anyone, so why break compatibility? The fact that you can monkeypatch in a punning float.__repr__ or whatever isn’t a problem for a consenting-adults library to worry about. So this is only even worth discussing if the “serialize arbitrary text” feature is added.
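For reference, the interoperability trade-off behind the allow_nan flag as the stdlib behaves today:

```python
import json

# By default the stdlib emits JavaScript-style literals for
# non-finite floats, which are not valid JSON per RFC 8259:
json.dumps(float("nan"))   # -> 'NaN'
json.dumps(float("inf"))   # -> 'Infinity'

# Producers that need strict output pass allow_nan=False,
# which raises ValueError instead of emitting invalid JSON:
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError:
    pass  # rejected, as desired for strict interchange
```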
On Thursday, August 15, 2019, Andrew Barnert <abarnert@yahoo.com> wrote:
On Aug 15, 2019, at 10:23, Christopher Barker <pythonchb@gmail.com> wrote:
This is all making me think it's time for a broader discussion / PEP:
The future of the JSON module.
I think this is overreacting. There’s really only two issues here, and neither one is that major.
But nobody has shown any need for any of the other stuff.
Data interchange with structured types is worthwhile. Lack of implementation does not indicate lack of need, so much as ignorance of the concerns.

- There was no JSON module in the stdlib, but that doesn't mean it wasn't needed
- We all have CSV, which is insufficient for reuse because, for example, there's nowhere to list metadata like unit URLs like liters/litres or deciliters, and we thus all have to waste time later determining what was serialized (hopefully with at least a URL in the references of a given ScholarlyArticle). CSVW JSON-LD is a solution for complex types, such as decimals and numbers with units and precision.

It's not that the functionality isn't needed; it's that nobody knows how to do it because it doesn't just work with a simple function call. Decimals, datetimes, complex numbers, ±Infinity, NaN, float precision: these things aren't easy with JSON, and that results in us losing fidelity in exchange for serialization convenience. Doing more to make it easy to losslessly serialize and deserialize is a net win for the public good, and it's worth the effort to host discussions - in an unfragmented forum - to develop a PEP with a protocol for progress on the lossless serialization front.
* support for extensions to the JSON standard (JSON5, etc.)
One guy saw a bunch of shiny new protocols (half of which are actually stagnant dead protocols) and thought that if the Python stdlib added support for all of them, they’d all take over the world. That’s not what the stdlib is for.
Whether JSON was worthy of inclusion in the stdlib was contentious and required justification. It was a good idea because pickles are dangerous and interacting with JS is very useful. It is unfortunate that we all just use JSON and throw away decimals and float precision and datetimes because json.dumps is so easy. An object.__json__(**kwargs) protocol would inconvenience no-one so long as:

- decimal isn't imported unless used
- all existing code continues to work
Plus, most of them aren’t even JSON extensions, they’re other formats that build on top of JSON. For example, a JSONlines document is not a superset of a JSON document, it’s a sequence of lines, each of which is a plain-old JSON document (with the restriction that any newlines must be escaped, which the stdlib can already handle). There are multiple JSONlines libraries out there that work fine today using the stdlib json module or their favorite third-party package; none of their authors have asked for any new features. So, what should the stdlib add to support JSONlines? Nothing.
Streaming JSON is not possible without JSON lines support. There are packages to do it, but that's not an argument against making it easy for people to do safe serialization without a dependency.
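To be fair, the amount of code needed on top of the stdlib is small; a sketch of a JSON lines writer and reader using only the json module (function names are illustrative):

```python
import json

def write_jsonl(records, fp):
    # json.dumps never emits a raw newline, so one document per
    # line is safe without any extra escaping.
    for rec in records:
        fp.write(json.dumps(rec) + "\n")

def read_jsonl(fp):
    # Each non-blank line is an ordinary JSON document.
    for line in fp:
        if line.strip():
            yield json.loads(line)
```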
Maybe a brief section in the docs about related protocols, explaining why they’re not JSON but how they’re related, would be helpful. But I’m not sure even that is needed. How many people come to the json docs looking for how to parse NDJ?
How many people know that:

- You can or should use decimal to avoid float precision error, but then you have to annoyingly write a JSONEncoder to save that data, and then the type is lost when it's parsed and cast to a float when it's deserialized?
- JSON-LD is the only non-ad-hoc solution to preserving precision, datetimes, and complex numbers and types with JSON
- JSON5 supports IEEE 754 ±Infinity and NaN
- Pickles do serialize arbitrary objects, but are not safe for data publishing because unmarshalling runs executable code in the pickle (this is in the docs now)

Including guidance in the docs would be great. Making it easy to do things correctly and consistently would also be great.
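To make the decimal point concrete, here is the round trip the thread started from (the literal value is taken from the original report):

```python
import decimal
import json

doc = '{"v": 0.6441726684570313}'

# parse_float=decimal.Decimal preserves every digit on input...
data = json.loads(doc, parse_float=decimal.Decimal)
assert str(data["v"]) == "0.6441726684570313"

# ...but on output a default hook can only substitute another
# serializable object, so the value comes back quoted as a string
# (or rounded, if converted to float) -- never as a bare number:
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # this gets re-encoded as a JSON string
        return super().default(o)
```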
* support for serializing arbitrary precision decimal numbers
Multiple people want this. The use_decimal flag from simplejson is a proven way to do it. The only real objection is “Fine, but can you find a way to do it without slowing down imports for people who don’t use it?” People have suggested answers, and either someone will implement one, or not.
It might be nice to spin this off into its own thread to escape all the baggage. Going the other way and holding it hostage to a “let’s redesign everything from scratch before doing that” seems like a bad idea to me.
Why break the thread? A protocol for object.__json__(**kwargs) is a partial solution for the original topic. A PEP PR would be a good place to continue discussion, but nobody will helpfully chime in there. Saving decimals as float strs which deserialize as floats does not preserve the type or precision of the decimals (which carry more information than a float: a sign, coefficient digits, and an exponent). JSON-LD is the way to go for complex types in JSON.
* support for allowing custom serializations (i.e. not just what can be serialized, but controlling exactly how)
Only one person wants this. And he keeps saying he doesn’t want it. And he only wants it for Decimal or other float-substitutes, not for all types. And it won’t actually solve the problem he wants to solve.
Would optional kwargs passed to object.__json__(**kwargs) from e.g. json.dumps kwargs allow for parameter-customizable serializations?
If there were prior art on this to consider, it might be worth trying to design a solution that has all the strengths of each of the existing answers. But if no JSON package has such a feature, because nobody needs it, then the answer is simple: don’t invent the first-ever solution for the stdlib module.
The alternative is for every package to eventually solve for these real needs in a different way, for the stdlib to then write a JSON protocol PEP, and then for every package to have to support both their original protocol and, now, the spec'd one.
* a "dunder protocol" for customization
I personally don’t see much point in this, as it’s trivial to do that yourself. (I had a program that depended on simplejson’s for_json protocol; making it work with the stdlib and ujson was just a matter of writing a function that checks for and tries for_json, and partialling that in as the default function. We’re talking 5 minutes of work, and easily reusable in your personal toolbox.)
There's good reason to not copy-paste code that needs to be tested into every application: 5 minutes of everyone's time to copy-paste, plus added time for every run of every app's test suite.
But if people really want it, the only real question is what to call it. And __json__ seems like the obvious answer.
object.__json__(**kwargs)
And again, it seems like this would benefit from being separated out to its own small discussion, not being bundled into a huge one.
https://github.com/python/peps
* what role, if any, should the json module have in ensuring only valid JSON is produced?
I think where it is today is fine.
The allow_nan flag is necessary for interoperability.
The separators option may not be the best possible design, but it rarely if ever causes problems for anyone, so why break compatibility?
The fact that you can monkeypatch in a punning float.__repr__ or whatever isn’t a problem for a consenting-adults library to worry about.
Such global float precision (from a __repr__ with a format string) is lost when the JSON is deserialized. (JSON-LD supports complex types in a standard way, which are otherwise necessarily JSON-implementation-specific and thus non-portable.) It's worth specifying a JSON serialization protocol as a PEP that third-party and stdlib JSON implementations would use.
Wes Turner writes:
Data interchange with structured types is worthwhile.
That's not what the main thread is about. It's about adding support for Decimal to the stdlib's json module. Even the OP has explicitly disclaimed pretty much everything else, although his preferred implementation is more general than that.

I'm +1 on that. I think the outline of how to do it has become pretty obvious, and that it should be restricted to automatically converting Decimals to a JSON number, perhaps under control of a use_decimal flag for both encoding and decoding.

The rest should go into a separate thread. First let's dispose of this:
Streaming JSON is not possible without JSON lines support.
It is obvious to me that this should be handled in yet another thread from "lossless JSON", because it can and should be independently implemented, if it's done at all. Given (ob,n) = raw_decode(idx=n) support in the json module, the difficulty in implementing is all about buffering, and choosing where to do that buffering (in a separate module? in json.load? in a new json.load_stream generator?) I will now argue that the __json__ protocol is nowhere near so obviously stdlib-able as Decimal and streaming JSON.
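The raw_decode support mentioned above already lets you hand-roll the non-buffering part today; a sketch over an in-memory string (the buffering question for true streams is exactly the open design problem, so it is sidestepped here):

```python
import json

def iter_json(buf):
    # Yield successive documents from a string containing several
    # concatenated JSON documents, using JSONDecoder.raw_decode's
    # (obj, end_index) return value to advance through the buffer.
    decoder = json.JSONDecoder()
    idx, end = 0, len(buf)
    while idx < end:
        # raw_decode does not skip leading whitespace at idx,
        # so step over inter-document whitespace by hand.
        while idx < end and buf[idx] in " \t\r\n":
            idx += 1
        if idx >= end:
            break
        obj, idx = decoder.raw_decode(buf, idx)
        yield obj
```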
An object.__json__(**kwargs) protocol would inconvenience no-one so long as:

- decimal isn't imported unless used
- all existing code continues to work
I also think that JSON is widely enough used, and deserves better semantic support, that a protocol (specifically, the __json__ dunder) for serializer support and some form of complementary deserializer support are quite justifiable. But the __json__ dunder is the *easy* part. The complexity here is in that complementary deserializer.

Here's why. To your desiderata I would add

- no complex type's module is imported unless used (easy)
- the deserializer support for a type should be linked to its serializer support (something like the codecs registry, but more complicated because each entity will need to invoke support separately, unlike codecs where there's one codec for a whole text)
- such object support should be automatically linked in to both the top level serializer and deserializer dispatching.

The latter two desiderata look *hard* to me. Without them, you've got the inverse of the current Decimal problem. This is going to require that somebody or somebodies spend many person-hours on design, implementation, and testing. Also

- the deserializer support may or may not want to be in json.loads()

because it may be preferable to deserialize to the primitive Python objects that correspond to the JSON types, and then allow the Python program to flexibly handle those. Eg, what to do about variable annotations? Should our deserializer automatically deal with those? What if a variable's value conflicts with its annotation? While there may be a clear answer to this question after somebody has thought about it for a bit, it's not obvious to me.

The fundamental problem with your overall argument is that the usefulness to the community at large is unclear:
It is unfortunate that we all just use JSON and throw away decimals and float precision and datetimes because json.dumps is so easy.
True for yourself, I assume. But json.dumps is *not* why *the rest of us* do that. We do it because we've *always* done it. The Python objects we are serializing themselves lack units, precision, and pet's name! Until our Python programs become unit- and precision- aware, support for "lossless JSON" is necessarily going to be idiosyncratic, and mostly avoided.
How many people know that:
- You can or should use decimal to avoid float precision error, but then you have to annoyingly write a JSONEncoder to save that data, and then the type is lost when it's parsed and cast to a float when it's deserialized?
- JSON-LD is the only non-ad-hoc solution to preserving precision, datetimes, and complex numbers and types with JSON
- JSON5 supports IEEE 754 ±Infinity and NaN
- Pickles do serialize arbitrary objects, but are not safe for data publishing because unmarshalling runs executable code in the pickle (this is in the docs now)
Very few. But again, that's the wrong set of questions, for reasons similar to the above issue about "why we use json.dumps". The right questions are:

1. Of those who don't know, how many have need to know, and will acknowledge that need? (If they don't admit it, good luck getting them to change their programs!)
2. Of those who have need to know, how many would have "enough" of their serialization problems solved by any particular packaged set of features that might be added to the stdlib?
3. Is the number of programs in 2 "large enough" to justify the additional maintenance burden and the risk that better but conflicting solutions will be created in the future?
JSON-LD is the way to go for complex types in JSON.
It's worth specifying a JSON serialization protocol as a PEP that third-party and stdlib JSON implementations would use.
All of JSON-LD is way overkill for the examples of complex types you've given. We *do not need or want* a complete reimplementation of the Semantic Web in JSON in our stdlib. So what exactly are you talking about? Here's my idea:

I suspect your "serialization protocol" above really means *deserialization* protocol. object.__json__ is all the serialization protocol we need, because it will produce a standard JSON stream that can be deserialized (perhaps with different semantics!) by any standard JSON deserializer. Also, we don't need a PEP to specify the protocol for providing a more accurate deserialization; JSON-LD already did that work, and the parts we need are pretty trivial (definitely @context, maybe @id). So I interpret your word "protocol" to mean "JSON-LD @context". Is that close?

For almost all Python applications, a JSON-LD @context specific to Python's object model and standard builtin types would be enough. Since each type is itself a Python object, JSON-LD should be able to represent user-defined classes and their instances within that @context too. For those programs that provide more semantic information about their classes, they'd need additional, idiosyncratic @context anyway, and I have no idea what a "standard extended @context" would want to include. Each large external package (NumPy, Twisted) would want to implement its own @context, I think.

We could imagine additional semantic information in this @context that would even tell you which modules you need to pip from PyPI to work with these data types, along with the developers' and auditors'[1] signatures so you can authenticate the module and apply your trust model to whether you want to import them.

Steve

Footnotes: [1] Is this new? I know that frequently software modules are signed by their maintainers, and people decide to extend trust to particular maintainers.
But in open source, anybody can audit, so a list of auditors with signatures, dates, and a comment field for the audit might also be useful for maintainers who aren't famous when the auditors are famous. Steve
True for yourself, I assume. But json.dumps is *not* why *the rest of us* do that. We do it because we've *always* done it. The Python objects we are serializing themselves lack units, precision, and pet's name! Until our Python programs become unit- and precision- aware, support for "lossless JSON" is necessarily going to be idiosyncratic, and mostly avoided.
As a more "casual" user of JSON, this part in particular resonated with me. For the majority of use cases, I can't imagine that most users have a need for the degree of precision desired by the OP of the previous topic. As far as I'm aware ``json.dumps()`` serves adequately as a JSON stream for most users. For context, a recent JSON file that I created using the json module: https://github.com/python/devguide/pull/517. It's quite simple and has a very small number of data types.
1. Of those who don't know, how many have need to know, and will acknowledge that need? (If they don't admit it, good luck getting them to change their programs!)
Prior to this discussion, I'll admit that I had no idea about decimals converting to floats when JSONs were deserialized. But as stated in the question above, I really never had a need to know. For pretty much any time I've ever used JSON, floats provided a "good enough" degree of precision. I'm not saying that just because it isn't useful for me personally that it shouldn't be added, but I wouldn't be surprised if the majority of users of the json module had no idea about this issue because it didn't affect them significantly.
For almost all Python applications, a JSON-LD @context specific to Python's object model and standard builtin types would be enough.
I'm personally not knowledgeable enough about JSON-LD; I've only seen it mentioned before a few times. But, based on what I can tell from the examples on https://json-ld.org/playground/ (the "Place" example was particularly helpful), I could definitely imagine @context being useful.

Speaking of examples, I think it would be helpful to provide a brief example of ``object.__json__()`` being used for some added clarity. Would it be a direct alias of ``json.dumps()`` with the same exact parameters and usage, or would there be some substantial differences?

Thanks,
Kyle Stanley

On Fri, Aug 16, 2019 at 4:00 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Wes Turner writes:
Data interchange with structured types is worthwhile.
That's not what the main thread is about. It's about adding support for Decimal to the stdlib's json module. Even the OP has explicitly disclaimed pretty much everything else, although his preferred implementation is more general than that.
I'm +1 on that. I think the outline of how to do it has become pretty obvious, and that it should be restricted to automatically converting Decimals to a JSON number, perhaps under control of a use_decimal flag for both encoding and decoding.
The rest should go into a separate thread. First let's dispose of this:
Streaming JSON is not possible without JSON lines support.
It is obvious to me that this should be handled in yet another thread from "lossless JSON", because it can and should be independently implemented, if it's done at all. Given (ob,n) = raw_decode(idx=n) support in the json module, the difficulty in implementing is all about buffering, and choosing where to do that buffering (in a separate module? in json.load? in a new json.load_stream generator?)
I will now argue that the __json__ protocol is nowhere near so obviously stdlib-able as Decimal and streaming JSON.
An object.__json__(**kwargs) protocol would inconvenience no-one so long as:

- decimal isn't imported unless used
- all existing code continues to work
I also think that JSON is widely enough used, and deserves better semantic support, that a protocol (specifically, the __json__ dunder) for serializer support and some form of complementary deserializer support are quite justifiable. But the __json__ dunder is the *easy* part. The complexity here is in that complementary deserializer.
Here's why. To your desiderata I would add
- no complex type's module is imported unless used (easy)
- the deserializer support for a type should be linked to its serializer support (something like the codecs registry, but more complicated because each entity will need to invoke support separately, unlike codecs where there's one codec for a whole text)
- such object support should be automatically linked in to both the top level serializer and deserializer dispatching.
The latter two desiderata look *hard* to me. Without them, you've got the inverse of the current Decimal problem. This is going to require that somebody or somebodies spend many person-hours on design, implementation, and testing. Also
- the deserializer support may or may not want to be in json.loads()
because it may be preferable to deserialize to the primitive Python objects that correspond to the JSON types, and then allow the Python program to flexibly handle those. Eg, what to do about variable annotations? Should our deserializer automatically deal with those? What if a variable's value conflicts with its annotation? While there may be a clear answer to this question after somebody has thought about it for a bit, it's not obvious to me.
The fundamental problem with your overall argument is that the usefulness to the community at large is unclear:
It is unfortunate that we all just use JSON and throw away decimals and float precision and datetimes because json.dumps is so easy.
True for yourself, I assume. But json.dumps is *not* why *the rest of us* do that. We do it because we've *always* done it. The Python objects we are serializing themselves lack units, precision, and pet's name! Until our Python programs become unit- and precision- aware, support for "lossless JSON" is necessarily going to be idiosyncratic, and mostly avoided.
How many people know that:
- You can or should use decimal to avoid float precision error, but then you have to annoyingly write a JSONEncoder to save that data, and then the type is lost when it's parsed and cast to a float when it's deserialized?
- JSON-LD is the only non-ad-hoc solution to preserving precision, datetimes, and complex numbers and types with JSON
- JSON5 supports IEEE 754 ±Infinity and NaN
- Pickles do serialize arbitrary objects, but are not safe for data publishing because unmarshalling runs executable code in the pickle (this is in the docs now)
Very few. But again, that's the wrong set of questions, for reasons similar to the above issue about "why we use json.dumps". The right questions are:
1. Of those who don't know, how many have need to know, and will acknowledge that need? (If they don't admit it, good luck getting them to change their programs!)
2. Of those who have need to know, how many would have "enough" of their serialization problems solved by any particular packaged set of features that might be added to the stdlib?
3. Is the number of programs in 2 "large enough" to justify the additional maintenance burden and the risk that better but conflicting solutions will be created in the future?
JSON-LD is the way to go for complex types in JSON.
It's worth specifying a JSON serialization protocol as a PEP that third-party and stdlib JSON implementations would use.
All of JSON-LD is way overkill for the examples of complex types you've given. We *do not need or want* a complete reimplementation of the Semantic Web in JSON in our stdlib. So what exactly are you talking about? Here's my idea:
I suspect your "serialization protocol" above really means *deserialization* protocol. object.__json__ is all the serialization protocol we need, because it will produce a standard JSON stream that can be deserialized (perhaps with different semantics!) by any standard JSON deserializer. Also, we don't need a PEP to specify the protocol for providing a more accurate deserialization, JSON-LD already did that work, and the parts we need are pretty trivial (definitely @context, maybe @id). So I interpret your word "protocol" to mean "JSON-LD @context". Is that close?
For almost all Python applications, a JSON-LD @context specific to Python's object model and standard builtin types would be enough. Since each type is itself a Python object, JSON-LD should be able to represent user-defined classes and their instances within that @context too. For those programs that provide more semantic information about their classes, they'd need additional, idiosyncratic @context anyway, and I have no idea what a "standard extended @context" would want to include. Each large external package (NumPy, Twisted) would want to implement its own @context, I think.
We could imagine additional semantic information in this @context that would even tell you which modules you need to pip from PyPI to work with these data types, along with the developers' and auditors'[1] signatures you can authenticate the module and apply your trust model to whether you want to import them.
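To make the @context idea a little more concrete, here is a rough sketch of what such a payload might look like. The vocabulary is entirely hypothetical (there is no real standard @context for Python's builtins); the point is only that typed values survive as plain JSON:

```python
import json

# Hypothetical JSON-LD-style payload: "@context" maps short type names
# to (made-up) identifiers for Python builtin types. A consumer that
# understands this @context could reconstruct Decimal and complex
# values; any plain JSON parser still sees ordinary objects.
doc = {
    "@context": {
        "Decimal": "python:decimal.Decimal",   # invented identifier
        "complex": "python:builtins.complex",  # invented identifier
    },
    "price": {"@type": "Decimal", "@value": "0.6441726684570313"},
    "z": {"@type": "complex", "@value": "1+2j"},
}

text = json.dumps(doc, indent=2)
print(text)
```

A deserializer aware of this @context would dispatch on "@type" to rebuild the original Python objects; one that isn't simply gets nested dicts.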
Steve
Footnotes: [1] Is this new? I know that frequently software modules are signed by their maintainers, and people decide to extend trust to particular maintainers. But in open source, anybody can audit, so a list of auditors with signatures, dates, and a comment field for the audit might also be useful for maintainers who aren't famous when the auditors are famous.
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/XACTLM... Code of Conduct: http://python.org/psf/codeofconduct/
On Aug 16, 2019, at 20:35, Kyle Stanley <aeros167@gmail.com> wrote:
Speaking of examples, I think it would be helpful to provide a brief example of ``object.__json__()`` being used for some added clarity. Would it be a direct alias of ``json.dumps()`` with the same exact parameters and usage, or would there be some substantial differences?
No, the idea is that it's _used_ by dumps, as a substitute for passing a default function as an argument.

Let's say you have a tree class, and you want to serialize it to JSON as an array of arrays of etc. Here's how you do it today:

    class MyTreeNode:
        # …
        def data(self) -> int: # …
        def children(self) -> List[MyTreeNode]: # …

    def jsonize_my_tree(obj):
        if isinstance(obj, MyTreeNode):
            return [obj.data(), *map(jsonize_my_tree, obj.children())]
        raise TypeError()

    myjson = json.dumps(my_thing_that_might_include_trees, default=jsonize_my_tree)

This is only mildly inconvenient. But it does become worse if you have lots of classes you want to serialize. You need to write a default function somewhere that knows about all of your classes:

    def jsonize_my_things(obj):
        if isinstance(obj, MyTreeNode):
            return [obj.data(), *map(jsonize_my_things, obj.children())]
        if isinstance(obj, MyRational):
            return ['Fraction', obj.numerator, obj.denominator]
        if isinstance(obj, MyWinUTF16String):
            return str(obj)
        # …
        raise TypeError()

Or you need to come up with a registry, or a protocol, or a singledispatch overload set, or some other way to let you write a separate function for each class and automatically combine them into one function you can pass to the default argument.

The __json__ method would be a pre-made protocol that lets you just write the special handler for each class as part of that class, and not worry about how to combine them because the json module already takes care of that. So:

    class MyTreeNode:
        # …
        def data(self) -> int: # …
        def children(self) -> List[MyTreeNode]: # …
        def __json__(self):
            return [self.data(), *(child.__json__() for child in self.children())]

    class MyRational:
        # …
        def __json__(self):
            return ['Fraction', self.numerator, self.denominator]

    myjson = json.dumps(my_thing_that_might_include_trees_and_rationals_and_who_knows_what_else)

Not a huge win with a single class, but with lots of classes you want custom serialization for, it can be very handy.
Some of the third-party JSON modules, like simplejson, provide exactly this functionality (usually under the name for_json, because __json__ has double underscores and is therefore reserved for Python and its stdlib), and there are definitely people who like it. If you read the simplejson docs, and search for code that imports it, you can probably find lots of real-life examples. As I mentioned earlier in the thread, it’s very easy to write your own default function that calls for_json and partial that into dump/dumps/JSONEncoder (or to write a subclass of JSONEncoder that does the same thing without needing a default argument). And that has some nice benefits if you want to customize things (e.g., pass some of the encoder arguments into the for_json call), or if you prefer a registry or overload set to a method protocol, or whatever. So, I’m not sure this change is necessary. But I don’t see anything wrong with it if people really want it.
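As a sketch of that approach (class and function names here are just for illustration), the simplejson-style for_json convention can be bolted onto the stdlib module with a default function and functools.partial:

```python
import json
from functools import partial

def for_json_default(obj):
    # Delegate to the object's for_json() method if it has one,
    # mirroring simplejson's for_json=True behavior.
    for_json = getattr(obj, "for_json", None)
    if callable(for_json):
        return for_json()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Partial dumps so callers don't have to pass default= every time.
dumps = partial(json.dumps, default=for_json_default)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def for_json(self):
        return {"x": self.x, "y": self.y}

print(dumps({"p": Point(1, 2)}))  # {"p": {"x": 1, "y": 2}}
```

A JSONEncoder subclass whose default() method does the same lookup would work equally well and avoids the partial.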
Ah, that makes sense, thank you for the detailed example and explanation. I had only read through the start of the previous topic, as it started to become a bit difficult to follow (largely due to the author being non-specific with what precisely they were looking for and the discussion forking off in several directions).
But it does become worse if you have lots of classes you want to serialize. You need to write a default function somewhere that knows about all of your classes
The __json__ method would be a pre-made protocol that lets you just write the special handler for each class as part of that class
I could imagine the default function becoming quite convoluted over time, depending on the complexity and number of individual classes needed. There definitely seems to be an advantage to specifying the behavior on a per-class basis to keep things more organized when necessary.
If you read the simplejson docs, and search for code that imports it, you can probably find lots of real-life examples.
I'll be sure to look over it. For the most part, stdlib's json module has suited most of my purposes, but it would definitely be useful to know some of the other implementations that allow for more extensive typing and customization.
So, I’m not sure this change is necessary. But I don’t see anything wrong with it if people really want it.
It doesn't seem to be needed, but it could definitely help with providing better organization and readability in the right situations. The simplicity is definitely appealing. Thanks, Kyle Stanley
On Sat, 17 Aug 2019 at 08:28, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
def jsonize_my_tree(obj): if isinstance(obj, MyTreeNode): return [obj.data(), *map(jsonize_my_tree, obj.children())] raise TypeError()
myjson = json.dumps(my_thing_that_might_include_trees, default=jsonize_my_tree)
This is only mildly inconvenient. But it does become worse if you have lots of classes you want to serialize. You need to write a default function somewhere that knows about all of your classes:
def jsonize_my_things(obj): if isinstance(obj, MyTreeNode): return [obj.data(), jsonize_my_tree(obj.children())] if isinstance(obj, MyRational): return {'Fraction', obj.numerator, obj.denominator} if isinstance(obj, MyWinUTF16String): return str(obj) # … raise TypeError()
Or you need to come up with a registry, or a protocol, or a singledispatch overload set, or some other way to let you write a separate function for each class and automatically combine them into one function you can pass to the default argument.
This is pretty much exactly what singledispatch is designed for:

    @singledispatch
    def jsonize(obj):
        raise TypeError(f"Cannot serialise {type(obj)} to JSON")

    @jsonize.register
    def _(obj: MyTreeNode):
        return [obj.data(), *map(jsonize, obj.children())]

    @jsonize.register
    def _(obj: MyRational):
        return ['Fraction', obj.numerator, obj.denominator]

    @jsonize.register
    def _(obj: MyWinUTF16String):
        return str(obj)

    myjson = json.dumps(my_thing_that_might_include_trees, default=jsonize)

If you want to support a __json__ method as well, you can even do:

    @singledispatch
    def jsonize(obj):
        __json__ = getattr(obj, '__json__', None)
        if __json__ is not None and callable(__json__):
            return __json__()
        raise TypeError(f"Cannot serialise {type(obj)} to JSON")

I honestly think singledispatch is under-used, and could avoid the need for many of the new protocols people like to propose. (Note: the proposal to support Decimal would *not* be handled by this, as the "default" argument to json.dumps requires you to go through an already-supported type. But that's a separate concern here.) Paul
But it does become worse if you have lots of classes you want to serialize. You need to write a default function somewhere that knows about all of your classes:
Somehow I didn't think of this until now, but when I've needed to use JSON to serialize a non-trivial hierarchy of types, I've essentially built my own system for doing so. The system builds a "json-compatible dict", and then passes that off to json.dump[s]. Somehow it never dawned on me to try to use the json module's default feature to do this. And now that I am thinking about it, I'm glad I didn't. After all, the hard work is creating the custom JSON-compatible data structures, and default really doesn't help with that at all. And this way I can use the exact same code to pass off to ANY json serializer, and indeed, any OTHER serializer that supports the same data structures.

Adding a __json__ protocol would make it a tad more useful (and more so if other json libs supported it) but I'd still need to write essentially the same code, so I think I'd still rather keep the "make it JSON compatible" separate from the JSON serializer itself.

And this is also reinforcing my opinion that the json module should support ONLY serializing JSON itself, and not any "JSON like" form. Whether it NEEDS to enforce that is up in the air, but I think a focused domain is the way to go. Which means that adding the ability to serialize Decimal is the only addition needed. - CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
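A minimal sketch of that split (class names invented for the example): a to_jsonable() step reduces the object graph to plain dicts/lists/strings/numbers, and serialization is a separate, trivial step that any JSON library can do:

```python
import json

class User:
    def __init__(self, name, friends=None):
        self.name = name
        self.friends = friends or []

def to_jsonable(user):
    # Reduce the object graph to plain JSON-compatible types first;
    # the serializer never sees a custom class.
    return {"name": user.name,
            "friends": [to_jsonable(f) for f in user.friends]}

u = User("alice", [User("bob")])
payload = to_jsonable(u)
text = json.dumps(payload)              # stdlib json...
# text = some_other_lib.dumps(payload)  # ...or any other serializer
```

Because payload contains only builtin types, swapping in a different serializer (or a hypothetical __json__-aware one) requires no changes to the conversion code.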
On Sun, Aug 11, 2019 at 5:12 PM David Shawley <daveshawley@gmail.com> wrote:
1) Should JSONEncoder (the class underlying json.dumps) natively support decimal.Decimal, and if so, can it avoid importing that module unnecessarily?
I think yes on both counts :-)
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
I'm inclined towards the protocol, since there are protocols for various other encoders (eg deepcopy, pickle), and it avoids the problem of json importing decimal. It can also be implemented entirely as a third-party patch, although you'd need to subclass Decimal to add that method.
I think if we ARE going to extend json to allow more flexibility, yes, a protocol would be a good way to do it. But it makes me nervous -- I think the goal of the json module is to produce valid json, and nothing else. Opening up a protocol would allow users to fairly easily, and maybe inadvertently, create invalid JSON. I'm not sure there are compelling use cases that make that worth it. -CHB
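For reference, here is what the stdlib module does with Decimal today: it refuses to serialize it, and the obvious default=str workaround produces a quoted JSON string rather than a number, which changes the document:

```python
import decimal
import json

d = decimal.Decimal("0.6441726684570313")

# Decimal is not natively supported by the stdlib encoder.
try:
    json.dumps(d)
except TypeError as e:
    print(e)

# The usual workaround serializes it as a JSON *string*, not a number.
print(json.dumps(d, default=str))  # "0.6441726684570313" (with quotes)
```

This is exactly the limitation described at the top of the thread: there is no hook to emit the digits as a raw, unquoted JSON number.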
On Mon, Aug 12, 2019 at 3:01 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Sun, Aug 11, 2019 at 5:12 PM David Shawley <daveshawley@gmail.com> wrote:
1) Should JSONEncoder (the class underlying json.dumps) natively support decimal.Decimal, and if so, can it avoid importing that module unnecessarily?
I think yes on both counts :-)
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
I'm inclined towards the protocol, since there are protocols for various other encoders (eg deepcopy, pickle), and it avoids the problem of json importing decimal. It can also be implemented entirely as a third-party patch, although you'd need to subclass Decimal to add that method.
I think if we ARE going to extend json to allow more flexibility, yes, a protocol would be a good way to do it.
But it makes me nervous -- I think the goal of the json module is to produce valid json, and nothing else. Opening up a protocol would allow users to fairly easily, and maybe inadvertently, create invalid JSON. I'm not sure there are compelling use cases that make that worth it.
You can already make invalid JSON, and far more easily.
json.dumps({"spam": [1,2,3]}, separators=(' : ',', ')) '{"spam", [1 : 2 : 3]}'
It's not the module's job to stop you from shooting yourself in the foot. ChrisA
side note: I'm reading the json docs more closely now, and noticed:

""" parse_float, if specified, will be called with the string of every JSON float to be decoded. By default, this is equivalent to float(num_str). This can be used to use another datatype or parser for JSON floats (e.g. decimal.Decimal). """

it's unfortunate that the term "JSON float" is used here when it should really be something like "JSON number with a fractional part" or "JSON non-integer number", since the JSON spec is not the same as a C or Python float. Not sure it's worth the doc edit, but it explains why there was some confusion in terminology at the beginning of this thread. -CHB On Sun, Aug 11, 2019 at 10:05 PM Chris Angelico <rosuav@gmail.com> wrote:
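As a quick concrete illustration of that parse_float behavior (using the float value from the message that opened this thread): parse_float receives the raw digit string of any non-integer JSON number, so Decimal preserves every digit, while the default float parse may not:

```python
import decimal
import json

raw = '{"x": 0.6441726684570313}'

# parse_float is handed the original text "0.6441726684570313",
# so the Decimal keeps every digit exactly.
as_decimal = json.loads(raw, parse_float=decimal.Decimal)
print(as_decimal["x"])  # 0.6441726684570313

# The default float parse rounds to the nearest representable double,
# which can change the last digit on re-serialization.
as_float = json.loads(raw)
print(json.dumps(as_float))
```

The asymmetry the thread is about: the Decimal value survives loading, but there is no matching way to dump it back out as an unquoted number.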
Another doc note: I see this:

""" it is common for JSON numbers to be deserialized into IEEE 754 double precision numbers and thus subject to that representation's range and precision limitations. This is especially relevant when serializing Python int values of extremely large magnitude, or when serializing instances of "exotic" numerical types such as decimal.Decimal. """

This is a bit ironic, as apparently it is actually impossible with the current implementation to serialize a Decimal with many digits into JSON anyway. Perhaps a doc patch with some clarifications will be in order once the question at hand is resolved. -CHB
On Sun, Aug 11, 2019 at 10:05 PM Chris Angelico <rosuav@gmail.com> wrote:
But it makes me nervous -- I think the goal of the json module is to produce valid json, and nothing else. Opening up a protocol would allow users to fairly easily, and maybe inadvertently, create invalid JSON. I'm not sure there are compelling use cases that make that worth it.
You can already make invalid JSON, and far more easily.
json.dumps({"spam": [1,2,3]}, separators=(' : ',', ')) '{"spam", [1 : 2 : 3]}'
Wow! I had NOT noticed that before. I have to wonder why in the world the json module makes it easy to use arbitrary separators. The use case called out in the docs is that you may want to specify different amounts of whitespace around your separators -- fair enough, but why allowing ANY separator ever made sense is beyond me. If nothing else, a note in the docs that anything other than ":" and "," (with various whitespace) will result in invalid JSON may be worthwhile. That may mean the cat's out of the bag, and we can neglect any role of the json module to try to enforce valid JSON, but still...
It's not the module's job to stop you from shooting yourself in the foot.
Sure, and I take that approach in most cases. But that doesn't mean we should hand people foot guns if there is no legitimate reason to have a gun at all. -CHB
On Tue, Aug 13, 2019 at 7:55 AM Christopher Barker <pythonchb@gmail.com> wrote:
That may mean the cat's out of the bag, and we can neglect any role of the json module to try to enforce valid JSON, but still...
The *decoder* will enforce valid JSON, but the *encoder* doesn't need to stop you from doing what you've chosen to do. ChrisA
On Mon, Aug 12, 2019 at 2:59 PM Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Aug 13, 2019 at 7:55 AM Christopher Barker <pythonchb@gmail.com> wrote:
That may mean the cat's out of the bag, and we can neglect any role of the json module to try to enforce valid JSON, but still...
The *decoder* will enforce valid JSON, but the *encoder* doesn't need to stop you from doing what you've chosen to do.
It doesn't "need" to, but it would be nice. Having an encoder/decoder that can encode things that it can't decode seems a bit asymmetric to me. Perhaps the goal of the json module is to decode valid JSON, and to encode an arbitrary set of JSON-like encodings, but I don't *think* that is a stated goal of the module.

I'm all for practicality -- if there is a use case we want to support, and supporting that use case suggests a solution that allows shoot-footing, fine. But if there is a way to support that use-case without the foot gun, then I think that's a better option.

An example here -- some years back, when I didn't know a damn thing about XML (I still don't know much) I discovered that ElementTree made it easy to write invalid XML. Whether it was a good or bad decision to not have it do some checking I don't know, but it was surprising, and it does make the module less useful to someone who is not very familiar with XML.

And the current json module does do a pretty good job of ensuring that you get valid JSON now, and the fact that the "encode custom types" machinery is designed as it is (rather than providing a hook to generate "raw JSON") makes me think the original designers had that in mind. In fact, I'm pretty sure that setting custom separators is the only way to get it to generate invalid JSON now. Do we want to make more ways? Maybe, maybe not. -CHB
On Tue, Aug 13, 2019 at 8:34 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Mon, Aug 12, 2019 at 2:59 PM Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Aug 13, 2019 at 7:55 AM Christopher Barker <pythonchb@gmail.com> wrote:
That may mean the cat's out of the bag, and we can neglect any role of the json module to try to enforce valid JSON, but still...
The *decoder* will enforce valid JSON, but the *encoder* doesn't need to stop you from doing what you've chosen to do.
It doesn't "need" to, but it would be nice. Having an encoder/decoder that can encode things that it can't decode seems a bit asymmetric to me.
Perhaps the goal of the json module is to decode valid JSON, and to encode an arbitrary set of JSON-like encodings, but I don't *think* that is a stated goal of the module.
I'm all for practicality -- if there is a use case we want to support, and supporting that use case suggests a solution that allows shoot-footing, fine. But if there is a way to support that use-case without the foot gun, then I think that's a better option.
It's more that there's no reason to go to great lengths to *stop* you from encoding invalid JSON. For it to block it, it would have to explicitly check for it, instead of simply assume that you aren't doing anything to break it. Permissiveness is the simple option here. ChrisA
On Mon, Aug 12, 2019 at 3:58 PM Chris Angelico <rosuav@gmail.com> wrote:
But if there is a way to support that use-case without the foot gun, then I think that's a better option.
It's more that there's no reason to go to great lengths to *stop* you from encoding invalid JSON. For it to block it, it would have to explicitly check for it, instead of simply assume that you aren't doing anything to break it. Permissiveness is the simple option here.
sure -- which I presume is what the author of the "separator" parameter was thinking. But to bring it down to the specific topic at hand: If the goal is to provide a way to encode arbitrary precision Decimals in JSON, then you could:

* (1) support encoding the Decimal type directly (my suggestion) or
* (2) support allowing users to control how they encode JSON numbers (Richard's suggestion, I think) or
* (3) support allowing users to control how they encode any JSON type.

Frankly, each of these is a simple option, but each is more permissive than the last. I suggest that the least permissive solution that addresses the goal should be used. But: (1) would not fully solve the OP's use-case. (2) would not solve the use case of people wanting to control the full unicode normalization (there's also escaping non-ascii characters that may require normalization as well). Because these are different goals. So what goals are trying to be accomplished needs to drive the decision. -CHB
On Aug 12, 2019, at 15:34, Christopher Barker <pythonchb@gmail.com> wrote:
In fact, I'm pretty sure that setting custom separators is the only way to get it to generate invalid JSON now,
There’s also allow_nan. But that one is clearly intentional, because so many other implementations (including the original JS reference implementation) break the spec in that way that it’s useful to have an option to do the same. (Even without that option, Python raises a ValueError instead of emitting null, but that’s not a way to get invalid JSON, it’s just a way to not get any JSON when the library should have allowed it…) Also, because Python follows the obsolete RFC 7159 instead of 8259, there may be other issues—I don’t think there are, but I wouldn’t want to guarantee that without checking.
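Concretely, this is the documented stdlib behavior: with the default allow_nan=True the encoder emits the JavaScript-style tokens, which are not valid per the JSON spec, and allow_nan=False raises instead of emitting them:

```python
import json

# Default: emits NaN/Infinity, matching the original JS reference
# implementation but violating strict JSON.
print(json.dumps([float("nan"), float("inf")]))  # [NaN, Infinity]

# Strict mode: refuse to produce invalid JSON at all.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as e:
    print(e)
```

So allow_nan is the one documented, deliberate escape hatch for producing spec-invalid output, alongside the separators loophole discussed above.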
Just a short comment, I agree that "parse_float" is an unfortunate name, which also confused me in retrospect. "parse_real" would be more fitting, especially since the source uses this term internally. I am not sure it is enough to justify the change though.

Concerning the custom separators, and why the implementation supports them, I believe it is not actually about the separators themselves, but the whitespace around them. When you generate JSON with an indentation, you probably also want some spacing after the colon or comma so it looks "pretty". While when creating JSON for "machine" processing, you want it as dense as it gets, without any unnecessary whitespace.
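That matches how the stdlib defaults behave: the default separators include a trailing space, indent switches to multi-line pretty output, and (',', ':') gives the densest form:

```python
import json

data = {"spam": [1, 2, 3]}

print(json.dumps(data))                         # {"spam": [1, 2, 3]}
print(json.dumps(data, separators=(",", ":")))  # {"spam":[1,2,3]}
print(json.dumps(data, indent=2))               # multi-line, pretty
```

All three stay inside valid JSON; it is only replacing ":" and "," themselves that lets you leave the spec.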
On Aug 13, 2019, at 01:04, Richard Musil <risa2000x@gmail.com> wrote:
Concerning the custom separators, and why the implementation supports them, I believe it is not actually about the separators themselves, but the whitespaces around them. When you generate JSON with an indentation, you probably also want some spacing after double colon or comma so it looks "pretty".
Yes. But this could have been done with, say, a whitespace tuple that includes just the post-colon and post-comma spacing, rather than letting you also replace the colon and comma themselves. Still, you could always pass “Q~¥” as the post-colon space (or even some characters that are Unicode whitespace but not JSON whitespace, to confuse everyone). Ignoring safety, it might be a bit simpler to use, but probably a tiny bit slower, and I’m sure someone would ask “what if I want a space before the comma instead of after” even though that’s a silly thing to want, and… My guess is that the separators solution was not the cleanest possible design that gave the best answer to every bikeshedding issue, but just the simplest design that avoided most of them. Remember that this was designed pretty early, when nobody knew exactly what would be demanded of a JSON library. In hindsight, I think you could scrap separators, indent, and maybe ensure_ascii, because people only ever use them to produce either (a) typical pretty-printed JSON, (b) as-compact-as-possible JSON, or (c) JSON that’s as readable as possible for a single line that meets the standards of an entry in all of JSONlines, NDJ, and LDJSON. But at the time nobody could have known about JSONlines, or that 7-bit-clean pretty-printed JSON would never be useful, etc.
On Wed, Aug 14, 2019 at 3:12 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Aug 13, 2019, at 01:04, Richard Musil <risa2000x@gmail.com> wrote:
Concerning the custom separators, and why the implementation supports them, I believe it is not actually about the separators themselves, but the whitespaces around them. When you generate JSON with an indentation, you probably also want some spacing after double colon or comma so it looks "pretty".
Ignoring safety, it might be a bit simpler to use, but probably a tiny bit slower...
Something to bear in mind is that the JSON module is often called upon to deal with gobs of data. Just last night I was showcasing a particular script that has to load a 200MB JSON file mapping Twitch emote names to their emote IDs, and it takes a solid 5-6 seconds to parse it. A small slowdown or speedup can have significant impact on real-world programs. That's one of the reasons that a simple solution of "make JSONEncoder respect decimal.Decimal" was rejected - it would require that the json module import decimal, which is extremely costly. Having a __json__ protocol would be less costly, but would still have a cost, so this needs to be factored in. ChrisA
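As an illustrative aside, the per-number overhead of a Python-level parse_float callback is easy to see with a toy benchmark (absolute numbers are machine-dependent, of course):

```python
import decimal
import json
import timeit

# A synthetic document with many floats; real-world files are far larger.
doc = json.dumps([i / 7 for i in range(10_000)])

# Default: the C scanner builds floats directly.
t_float = timeit.timeit(lambda: json.loads(doc), number=20)

# parse_float=Decimal invokes a Python callable for every number in the
# document, which is noticeably slower on large inputs.
t_decimal = timeit.timeit(
    lambda: json.loads(doc, parse_float=decimal.Decimal), number=20
)

print(f"float:   {t_float:.3f}s")
print(f"Decimal: {t_decimal:.3f}s")
```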
On Aug 13, 2019, at 11:21, Chris Angelico <rosuav@gmail.com> wrote:
On Wed, Aug 14, 2019 at 3:12 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Aug 13, 2019, at 01:04, Richard Musil <risa2000x@gmail.com> wrote:
Concerning the custom separators, and why the implementation supports them, I believe it is not actually about the separators themselves, but the whitespaces around them. When you generate JSON with an indentation, you probably also want some spacing after double colon or comma so it looks "pretty".
Ignoring safety, it might be a bit simpler to use, but probably a tiny bit slower...
Something to bear in mind is that the JSON module is often called upon to deal with gobs of data.
Sure. Replacing the separator strings with whitespace strings would mean that generating each array element in dumps now takes one extra single-character string emit for the comma (and each object element takes two). I suspect this would have negligible impact on the C implementation, but maybe not on the pure Python one. And at any rate, if that were a serious proposal, I’d certainly benchmark it rather than guessing. But that would be a silly change to make to the module at this point, so I’m not proposing it. We’ve lived with separators for all these years, and I don’t think it’s a serious wart that needs fixing.
Just last night I was showcasing a particular script that has to load a 200MB JSON file mapping Twitch emote names to their emote IDs, and it takes a solid 5-6 seconds to parse it. A small slowdown or speedup can have significant impact on real-world programs.
I’ve never had to parse a 200MB JSON doc, but I have had to parse a massive JSONlines doc with zillions of 1KB JSON docs, which is not nearly as bad for memory use, but just as bad for parse time. At that point, it’s worth looking into other JSON packages outside the stdlib, unless you’re only doing it once.
That's one of the reasons that a simple solution of "make JSONEncoder respect decimal.Decimal" was rejected - it would require that the json module import decimal, which is extremely costly.
To be fair, your program only imports json once, and so does mine, and the linear-in-size-of-doc parsing cost isn’t affected by the import time. Still, there will be someone out there who runs a script zillions of times on a bunch of separate JSON docs, and for that someone, import time will matter.

But I think the lazy-import-decimal-on-first-dump-with-use_decimal solves that, and solves it even better than __json__, even besides the fact that it’s a better API than exposing “dump raw text into any JSON, and it’s up to you to get it right”. No import time if you’re not using it, just setting a global to None. Even if you are using it, the cost of importing it from the sys.modules cache is pretty tiny. (After all, you won’t have any Decimal objects without having imported decimal, unless you do some nasty tricks—at which point monkeypatching json to fake the import isn’t any nastier.)

And the code within the exporter itself should be unaffected. You don’t need to check for Decimal until after all the other types have failed, so on successful dumps without use_decimal there’s no cost at all, and on failed dumps it’s just one extra check before raising. And even when you need use_decimal, the isinstance is faster than a getattr or special-method lookup to find a __json__ method. But of course you’d still want to actually implement and benchmark it to be sure. Maybe the code for that extra check, even if you never reach it, pushes an inner loop out of the cache or something; who knows?
On 8/13/2019 5:49 PM, Andrew Barnert via Python-ideas wrote:
But I think the lazy-import-decimal-on-first-dump-with-use_decimal solves that, and solves it even better than __json__, even besides the fact that it’s a better API than exposing “dump raw text into any JSON, and it’s up to you to get it right”.
No import time if you’re not using it, just setting a global to None. Even if you are using it, the cost of importing it from the sys.modules cache is pretty tiny. (After all, you won’t have any Decimal objects without having imported decimal, unless you do some nasty tricks—at which point monkeypatching json to fake the import isn’t any nastier.)
dataclasses does something similar: it wants to know if something is a typing.ClassVar, but it doesn't want to import typing to find out. So it has:

    typing = sys.modules.get('typing')
    if typing:
        if (_is_classvar(a_type, typing)
                or (isinstance(f.type, str)
                    and _is_type(f.type, cls, typing, typing.ClassVar,
                                 _is_classvar))):

If typing hasn't been imported, it knows that a_type can't be typing.ClassVar. So, this pattern isn't unheard of.

Eric
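A minimal standalone version of the same trick, transplanted to decimal (`is_decimal` is a hypothetical helper name, and the caveats about the module cache raised later in the thread apply):

```python
import sys


def is_decimal(obj):
    # If the decimal module was never imported anywhere in this process,
    # assume obj cannot be a Decimal -- without paying for the import.
    decimal = sys.modules.get("decimal")
    return decimal is not None and isinstance(obj, decimal.Decimal)
```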
On Aug 13, 2019, at 15:01, Eric V. Smith <eric@trueblade.com> wrote:
On 8/13/2019 5:49 PM, Andrew Barnert via Python-ideas wrote: But I think the lazy-import-decimal-on-first-dump-with-use_decimal solves that, and solves it even better than __json__, even besides the fact that it’s a better API than exposing “dump raw text into any JSON, and it’s up to you to get it right”. No import time if you’re not using it, just setting a global to None. Even if you are using it, the cost of importing it from the sys.modules cache is pretty tiny. (After all, you won’t have any Decimal objects without having imported decimal, unless you do some nasty tricks—at which point monkeypatching json to fake the import isn’t any nastier.)
dataclasses does something similar: it wants to know if something is a typing.ClassVar, but it doesn't want to import typing to find out. So it has:
    typing = sys.modules.get('typing')
    if typing:
        if (_is_classvar(a_type, typing)
                or (isinstance(f.type, str)
                    and _is_type(f.type, cls, typing, typing.ClassVar,
                                 _is_classvar))):
That’s clever, but I don’t think it’s actually needed here. We can just do the usual lazy import thing: set it to None, and then when we need it (whenever you construct a JSONEncoder object with use_decimal*), if it’s None, import it. That will automatically use the sys.modules cache anyway, it only costs a few extra nanoseconds, and we only do it once in your entire program. Surely it’s not worth optimizing away a few ns one time in exchange for making every single test more complicated, and probably slower?

What I was referring to in the parens was that it is technically possible to get objects that would pass an isinstance check against decimal.Decimal even though decimal isn’t in sys.modules, but I don’t think we have to worry about that. For example, anyone who uses importlib to get decimal imported into a local that’s not stored in the cache ought to know how to monkeypatch json.decimal = my_secret_decimal_import if they want to avoid the cost of a second import.

--
* If you’re worried about the None test on each JSONEncoder construction with use_decimal, given that every call to dump or dumps constructs an encoder, consider this: that construction takes microseconds, and adding a None test makes about a 0.1% difference. Anyone who’s dumping zillions of objects to JSON already has to explicitly construct and reuse an encoder, so that extra cost only happens once in their program.
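A sketch of that lazy pattern; use_decimal is the hypothetical flag under discussion, not a real json argument, and the encoder name is made up:

```python
import json

_decimal = None  # module-level sentinel, replaced on first use


class DecimalCapableEncoder(json.JSONEncoder):
    """Hypothetical encoder: imports decimal only when use_decimal is set."""

    def __init__(self, *args, use_decimal=False, **kwargs):
        super().__init__(*args, **kwargs)
        if use_decimal:
            global _decimal
            if _decimal is None:
                # Hits the sys.modules cache if decimal is already loaded.
                import decimal as _decimal_mod
                _decimal = _decimal_mod
        self.use_decimal = use_decimal

    def default(self, o):
        if self.use_decimal and isinstance(o, _decimal.Decimal):
            # A real implementation would emit raw digits; str() is a
            # stand-in that the base class will quote -- the very problem
            # this thread is about.
            return str(o)
        return super().default(o)
```

The lazy-import part is the point of the sketch, not the (still quoted) output.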
On Tue, Aug 13, 2019 at 06:01:27PM -0400, Eric V. Smith wrote:
dataclasses does something similar: it wants to know if something is a typing.ClassVar, but it doesn't want to import typing to find out.
Why not? Is importing typing especially expensive or a bottleneck? Wouldn't it be a once-off cost?
So it has:
    typing = sys.modules.get('typing')
    if typing:
        if (_is_classvar(a_type, typing)
                or (isinstance(f.type, str)
                    and _is_type(f.type, cls, typing, typing.ClassVar,
                                 _is_classvar))):
If typing hasn't been imported, it knows that a_type can't be typing.ClassVar.
That's very clever, but alas it's not clever enough, because it's false. Or rather, it's false in general. ClassVar is magic; it can't be used with isinstance() or issubclass(), so I won't argue about the above snippet. I don't understand it well enough to argue its correctness.

But for regular classes like Decimal, this clever trick doesn't always work. Here's an example:

    py> import decimal
    py> x = decimal.Decimal()
    py> del decimal
    py> del sys.modules['decimal']

At this point, we have a Decimal instance, but no "decimal" in the module cache. Your test for whether x is a Decimal will wrongly deny that x is a Decimal:

    # the wrong way to do it
    py> _decimal = sys.modules.get('decimal')
    py> isinstance(x, _decimal.Decimal) if _decimal else False
    False

but if we're a bit less clever, we get the truth:

    py> import decimal
    py> isinstance(x, decimal.Decimal)
    True

You can't rely on the presence of 'decimal' in the module cache as a proxy for whether or not x might be a Decimal instance.

-- Steven
On 8/13/2019 8:53 PM, Steven D'Aprano wrote:
On Tue, Aug 13, 2019 at 06:01:27PM -0400, Eric V. Smith wrote:
dataclasses does something similar: it wants to know if something is a typing.ClassVar, but it doesn't want to import typing to find out.
Why not? Is importing typing especially expensive or a bottleneck?
It's especially expensive. More so in 3.7, and/or before PEP 560 (I think that's the right one).
Wouldn't it be a once-off cost?
Yes, it would. But even that was too expensive.
So it has:
    typing = sys.modules.get('typing')
    if typing:
        if (_is_classvar(a_type, typing)
                or (isinstance(f.type, str)
                    and _is_type(f.type, cls, typing, typing.ClassVar,
                                 _is_classvar))):
If typing hasn't been imported, it knows that a_type can't be typing.ClassVar.
That's very clever, but alas it's not clever enough, because it's false.
Or rather, it's false in general. ClassVar is magic, it can't be used with isinstance() or issubclass(), so I won't argue about the above snippet. I don't understand it well enough to argue its correctness.
But for regular classes like Decimal, this clever trick doesn't always work. Here's an example:
    py> import decimal
    py> x = decimal.Decimal()
    py> del decimal
    py> del sys.modules['decimal']
At this point, we have a Decimal instance, but no "decimal" in the module cache. Your test for whether x is a Decimal will wrongly deny x is a Decimal:
    # the wrong way to do it
    py> _decimal = sys.modules.get('decimal')
    py> isinstance(x, _decimal.Decimal) if _decimal else False
    False
but if we're a bit less clever, we get the truth:
    py> import decimal
    py> isinstance(x, decimal.Decimal)
    True
You can't rely on the presence of 'decimal' in the module cache as a proxy for whether or not x might be a Decimal instance.
I realize there are ways to trick it, and probably even more so with "virtual" subclassing. And you're right, they matter more here than in dataclasses. Eric
Steven D'Aprano wrote:
On Tue, Aug 13, 2019 at 06:01:27PM -0400, Eric V. Smith wrote:

dataclasses does something similar: it wants to know if something is a typing.ClassVar, but it doesn't want to import typing to find out.

Why not? Is importing typing especially expensive or a bottleneck? Wouldn't it be a once-off cost?

So it has:

    typing = sys.modules.get('typing')
    if typing:
        if (_is_classvar(a_type, typing)
                or (isinstance(f.type, str)
                    and _is_type(f.type, cls, typing, typing.ClassVar,
                                 _is_classvar))):

If typing hasn't been imported, it knows that a_type can't be typing.ClassVar.

That's very clever, but alas it's not clever enough, because it's false. Or rather, it's false in general. ClassVar is magic; it can't be used with isinstance() or issubclass(), so I won't argue about the above snippet. I don't understand it well enough to argue its correctness. But for regular classes like Decimal, this clever trick doesn't always work. Here's an example:

    py> import decimal
    py> x = decimal.Decimal()
    py> del decimal
    py> del sys.modules['decimal']

At this point, we have a Decimal instance, but no "decimal" in the module cache. Your test for whether x is a Decimal will wrongly deny that x is a Decimal:

    # the wrong way to do it
    py> _decimal = sys.modules.get('decimal')
    py> isinstance(x, _decimal.Decimal) if _decimal else False
    False

but if we're a bit less clever, we get the truth:

    py> import decimal
    py> isinstance(x, decimal.Decimal)
    True

You can't rely on the presence of 'decimal' in the module cache as a proxy for whether or not x might be a Decimal instance.
The `isinstance` check doesn't necessarily prevent tampering:

    import decimal
    d = decimal.Decimal('1.23')

    # oops (1)
    decimal.Decimal = type('Decimal', tuple(), {})

    # oops (2)
    import sys
    sys.modules['decimal'] = type(
        'decimal', tuple(), dict(Decimal=type('Decimal', tuple(), {})))

    # oops (3)
    import sys
    del sys.modules['decimal']
    with open('decimal.py', 'w') as fh:
        fh.write('Decimal = type("Decimal", tuple(), {})')

    import json
    json.dumps(d)

But I agree that this is weird enough not to be expected.
Chris Angelico wrote:
That's one of the reasons that a simple solution of "make JSONEncoder respect decimal.Decimal" was rejected - it would require that the json module import decimal, which is extremely costly.
I don't follow that -- importing Decimal is a one-time cost, so it's not going to affect the time to parse large amounts of data. Also it could be done lazily in response to a use_decimal flag, in which case the import will only happen if other parts of the program are already using Decimal, so the import won't cost any more in startup time. -- Greg
On Wed, 14 Aug 2019 04:21:05 +1000 Chris Angelico <rosuav@gmail.com> wrote:
That's one of the reasons that a simple solution of "make JSONEncoder respect decimal.Decimal" was rejected - it would require that the json module import decimal, which is extremely costly. Having a __json__ protocol would be less costly, but would still have a cost, so this needs to be factored in.
You can take a page from pickle: __reduce__ is not looked up on built-in types such as int, str, tuple, etc., whose serialization is hard-coded.

Regards

Antoine.
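A rough sketch of that pickle-style short-circuit, with `__json__` standing in for the proposed (not yet existing) protocol and `encode` as a hypothetical dispatcher:

```python
import json

# Hard-code the built-in types, as pickle does for __reduce__, so the
# protocol lookup is only paid for everything else.
_BUILTIN = (str, int, float, bool, type(None), list, tuple, dict)


def encode(obj):
    if type(obj) in _BUILTIN:  # exact type check, no __json__ lookup
        return json.dumps(obj)
    hook = getattr(type(obj), "__json__", None)  # hypothetical protocol
    if hook is not None:
        return hook(obj)  # trusted to return raw JSON text
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")
```

Nested containers holding protocol objects wouldn't work here; this only illustrates the lookup short-circuit.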
That's one of the reasons that a simple solution of "make JSONEncoder respect decimal.Decimal" was rejected -
I know concerns were raised, but was it actually “rejected”? By whom, and when?

it would require that the json module import decimal, which is extremely costly.
Is it though? I mean, relative to the whole process of encoding/decoding JSON, which is not exactly a high-performance operation in any case. The case has been made that JSON is used with large data sets and that this performance matters, but with a large JSON blob, import cost would not be the issue. Which leaves things like small scripts that use JSON for configuration, i.e. short-running processes that need to read/write JSON.

Finally: if JSON performance is important, then the built-in lib may not be your best option anyway (https://pythonspeed.com/articles/faster-json-library/).

But anyway, for backwards compatibility, we probably wouldn't want to change the default float-based JSON number encoding/decoding, so there has to be a way to make the Decimal import optional. Also, there are a lot of issues with the import speed of semi-standard modules that cause problems for quickie scripts. If CPython really wants to be better at that, other methods will need to be found.

-CHB

--
Christopher Barker, PhD

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
On Sun, 11 Aug 2019 20:09:41 -0400 David Shawley <daveshawley@gmail.com> wrote:
On Aug 8, 2019, at 3:55 PM, Chris Angelico <rosuav@gmail.com> wrote:
There are two broad suggestions that came out of that thread and others, and I think it may be worth reopening them.
1) Should JSONEncoder (the class underlying json.dumps) natively support decimal.Decimal, and if so, can it avoid importing that module unnecessarily?
2) Should there be a protocol obj.__json__() to return a string representation of an object for direct insertion into a JSON file?
I'm inclined towards the protocol, since there are protocols for various other encoders (eg deepcopy, pickle), and it avoids the problem of json importing decimal. It can also be implemented entirely as a third-party patch, although you'd need to subclass Decimal to add that method.
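One way such a third-party patch could smuggle raw text past today's encoder is the placeholder trick: have default() return a unique token and substitute the raw text afterwards. `__json__` below is the proposed protocol, not anything the json module currently supports, and this is a sketch rather than production code:

```python
import decimal
import json
import uuid


class JSONDecimal(decimal.Decimal):
    """Decimal subclass carrying the proposed (hypothetical) __json__ hook."""

    def __json__(self):
        return str(self)  # raw JSON number text, all digits preserved


class ProtocolEncoder(json.JSONEncoder):
    """Third-party sketch: objects with __json__ end up as raw text.

    default() can only return values the encoder re-serializes, so we
    return a unique placeholder string and splice the raw text in after
    the full document has been built.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._raw = {}

    def default(self, o):
        hook = getattr(type(o), "__json__", None)
        if hook is not None:
            token = uuid.uuid4().hex
            self._raw[token] = hook(o)
            return token  # serialized as a quoted string, replaced below
        return super().default(o)

    def encode(self, o):
        text = super().encode(o)
        for token, raw in self._raw.items():
            text = text.replace('"%s"' % token, raw)
        return text
```

With this, `json.dumps({"val": JSONDecimal("0.6441726684570313")}, cls=ProtocolEncoder)` keeps the last digit intact, at the cost of a post-processing pass over the output.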
I proposed something similar about a year ago [1]. I really like the idea of a protocol for this. Especially since the other encoders already use this approach. Should I reboot this approach? The implementation was really simple [2].
I think this would be worthwhile. Here is a use case where it may remove some pain from users' lives: https://bugs.python.org/issue24313

Regards

Antoine.
On Tue, Aug 27, 2019 at 10:48 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 11 Aug 2019 20:09:41 -0400 David Shawley <daveshawley@gmail.com> wrote:
On Aug 8, 2019, at 3:55 PM, Chris Angelico <rosuav@gmail.com> wrote:
I proposed something similar about a year ago [1]. I really like the idea of a protocol for this. Especially since the other encoders already use this approach. Should I reboot this approach? The implementation was really simple [2].
I think this would be worthwhile.
Here is a use case where it may remove some pain from users'life: https://bugs.python.org/issue24313
Since the __json__ protocol is being raised again, I would just like to point out that during the previous discussion the term was used in two different contexts:

1) Allowing complete custom serialization (i.e. controlling the JSON encoding). This was an approach I mostly liked, but I thought it was generally not acceptable here (because of the total control over the encoding). My reasoning was that the only JSON type "lacking full support" was really the *JSON number*, so in order to keep my proposal's impact small I went with custom serialization just for the JSON number. That said, I would otherwise consider a __json__ protocol the more elegant solution.

2) Only allowing serialization of Python native types (but with custom values, as 'for_json' does in simplejson). This was commented on as already achievable with a custom encoder class (or a custom 'default' function). I did not really care for it, because it did not seem to bring any advantage, or a fix for my problem.

To resolve the bpo issue with numpy, one would need to implement complete custom serialization (1), or simply convert numpy number types into Python number types.

Richard
On Tue, 27 Aug 2019 11:20:58 +0200 Richard Musil <risa2000x@gmail.com> wrote:
To resolve the bpo issue with numpy, one would need to implement complete custom serialization (1) or simply convert numpy number types into Python number types.
Yes, both are possible and both could even be implemented. For Numpy it's enough to be able to return the equivalent Python integer and let the json module do the actual serialization. You can't really ask people to write a custom encoder or object function, though. That's not very user-friendly. A third-party type should be able to say how it serializes. Regards Antoine.
On Tue, 27 Aug 2019 at 10:22, Richard Musil <risa2000x@gmail.com> wrote:
On Tue, Aug 27, 2019 at 10:48 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 11 Aug 2019 20:09:41 -0400 David Shawley <daveshawley@gmail.com> wrote:
On Aug 8, 2019, at 3:55 PM, Chris Angelico <rosuav@gmail.com> wrote:
I proposed something similar about a year ago [1]. I really like the idea of a protocol for this. Especially since the other encoders already use this approach. Should I reboot this approach? The implementation was really simple [2].
I think this would be worthwhile.
Here is a use case where it may remove some pain from users'life: https://bugs.python.org/issue24313
Since __json__ protocol is being raised again, I would just like to point out that during the previous discussion the term has been used in two different contexts:
1) allowing complete custom serialization (i.e. controlling the JSON encoding) Which was an approach I mostly liked, but thought it was generally not acceptable here (because of the total control of the encoding). My reasoning was that I believed that the only JSON type "lacking the full support" was really *JSON number*, so in order to keep my proposal's impact small I went on with the custom serialization just for the JSON number. That said I would consider __json__ protocol to be otherwise more elegant solution.
2) only allowing serialization of Python native types (but with custom values, as 'for_json' does in simplejson). This was commented on as already achievable with a custom encoder class (or a custom 'default' function). I did not really care for it, because it did not seem to bring any advantage, or a fix for my problem.
To resolve the bpo issue with numpy, one would need to implement complete custom serialization (1) or simply convert numpy number types into Python number types.
The key point here, IMO, is that this is precisely the sort of use case for numbers other than Decimal that people (including me) kept asking for in the original discussion. Thanks, Antoine, for finding it and flagging it here.

I agree that this seems like a good reason for having *some* means of letting types decide how they should be serialised in JSON. There are still questions that need to be addressed (such as how we deal with types that don't return a well-formed JSON value) but at least we have a use case to drive the discussion now.

For example, the numpy case might be covered completely by the JSON module just adding support for types that provide an __index__ method. So rather than needing a new protocol, an existing one might be perfectly adequate. (Although I've not looked into this in any great detail, so it's entirely likely I've missed something critical; my point is mainly that with a proper use case, we can discuss solutions in the context of a real requirement.)

Paul
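A sketch of the __index__ idea, using the existing `default` hook and a stand-in class for a numpy integer scalar (so no third-party dependency is needed here; the helper name is made up):

```python
import json
import operator


class IntLike:
    """Stand-in for something like numpy.int64: not an int, but indexable."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


def to_jsonable(o):
    # Anything usable as a sequence index serializes as a JSON integer;
    # no new protocol required.
    try:
        return operator.index(o)
    except TypeError:
        raise TypeError(f"{type(o).__name__} is not JSON serializable")


out = json.dumps({"count": IntLike(5)}, default=to_jsonable)
```

As the follow-up notes, this covers integer-like objects only; numpy.float32, numpy.bool_, etc. would still need something else.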
On Tue, 27 Aug 2019 10:51:52 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
For example, the numpy case might be covered completely by the JSON module just adding support for types that provide an __index__ method. So rather than needing a new protocol, an existing one might be perfectly adequate. (Although I've not looked into this in any great detail, so it's entirely likely I've missed something critical - my point is mainly that with a proper use case, we can discuss solutions in the context of a real requirement).
I suspect that __index__ might work for integer-like objects. But then you have to deal with other objects such as numpy.float32, numpy.bool_, etc. Regards Antoine.
Antoine Pitrou writes:
Here is a use case where it may remove some pain from users'life: https://bugs.python.org/issue24313
The problem is that a protocol like __json__ doesn't help in deserializing, so you either can't use json.loads(), or more likely you'll need to have a post-parser to rebuild the objects.

And it actually doesn't help as much as you'd hope in serializing, either, not for a decade or so, because most objects won't have a __json__ method. So every time somebody creates an object that they want to serialize, they'll have to subclass any imported class that doesn't have a __json__ method, and in the somewhat likely event they want to deserialize, they'll need to provide an ad hoc parser for that kind of object rather than using json.loads() (or hook it into the post-parser somehow). Robust programs will have to trap the inevitable AttributeErrors. And you can be sure that people will be adding __json__ methods to some of their application classes, so they'll start expecting __json__ methods on any object they put in such classes. Seems like a magnet for bug reports and RFEs to me.

It's still true that just defining the protocol could allow users to just use json.dumps() for many common cases, which is clearly very convenient. I don't think it's worth the fragility, though.

Steve
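For the deserializing half, the stdlib's existing object_hook is essentially that post-parser; the `__kind__` tagging convention below is made up purely for illustration:

```python
import json


def revive(obj):
    # Post-parser: recognize a conventional tagged dict and rebuild the
    # original object; anything unrecognized passes through untouched.
    if obj.get("__kind__") == "complex":
        return complex(obj["real"], obj["imag"])
    return obj


doc = '{"z": {"__kind__": "complex", "real": 1.0, "imag": 2.0}}'
data = json.loads(doc, object_hook=revive)
```

The hook is applied to every parsed object bottom-up, so nested tagged dicts are rebuilt too; the fragility described above is exactly that both sides must agree on the convention.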
On Wed, 28 Aug 2019 03:03:00 +0900 "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Antoine Pitrou writes:
Here is a use case where it may remove some pain from users'life: https://bugs.python.org/issue24313
The problem is that a protocol like __json__ doesn't help in deserializing, so you either can't use json.loads(), or more likely you'll need to have a post-parser to rebuild the objects.
If you have a use case where you need to recreate the data, in a Python interpreter, in the exact original form, then sure, use pickle. But JSON is commonly used where you transport data to some other runtime, or at least to an endpoint which only cares about JSON-transportable semantics; there the difference between a "Python int", a "Numpy int8", a "Numpy uint32"... needn't apply, because the receiver needn't care.
And it actually doesn't help as much as you'd hope in serializing, either, not for a decade or so, because most objects won't have a __json__ method.
By the exact same reasoning, you can believe JSON "doesn't help very much" because most Python objects don't serialize with it. There's no point in discussing such stupid and out-of-touch arguments. Regards Antoine.
Antoine Pitrou writes:
And it actually doesn't help as much as you'd hope in serializing, either, not for a decade or so, because most objects won't have a __json__ method.
By the exact same reasoning, you can believe JSON "doesn't help very much" because most Python objects don't serialize with it. There's no point in discussing such stupid and out-of-touch arguments.
Now, now, Antoine, you are not exempt from the CoC. But your logic is incorrect; you generalize in a way that I don't accept.

Go back and read the tracker issue you cited. It's about surprisingly getting an Exception. If there's a protocol for converting objects to JSON, people will expect it to work, pretty much as much as __str__ does, at least in the domain that they're working in. And frequently it won't, and the user will be surprised by an exception, possibly in production when encountering an unusual case. To me, it's analogous to the UnicodeErrors that have been the bane of Mailman development for two decades. Our parsers are supposed to just work on any email message, but somehow the users keep finding new ways to crash them. :-/ (Not in quite a while now, but there is no proof of correctness yet either.)

Worse, the __json__ method will *seem* to work when inherited by a derived class, but doesn't know to serialize essential information about that derived class. That's an error by the programmer of the derived class, of course, but it seems a very likely one to me.

Perhaps NumPy and Pandas converting all their types to use the __json__ protocol will be enough that the majority of json.dumps users will never see another such Exception, and surely no data corruption in production. Seems unlikely to me, and the possibility of such problems should be considered when we think about the development burden we are imposing on library maintainers, and the risks to downstreams, of adopting this protocol.

I repeat: to the extent that the protocol is adopted it will be very convenient for users of JSON. But I see risks as well.

Regards,
Steve
On Thu, 29 Aug 2019 12:51:04 +0900 "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
If there's a protocol for converting objects to JSON, people will expect it to work, pretty much as much as __str__ does, at least in the domain that they're working in. And frequently it won't, and the user will be surprised by an exception, possibly in production when encountering an unusual case.
There isn't a specific argument here. This is just FUD.
To me, it's analogous to the UnicodeErrors that have been the bane of Mailman development for two decades. Our parsers are supposed to just work on any email message, but somehow the users keep finding new ways to crash them. :-/
And that is related to JSON... how?
Worse, the __json__ method will *seem* to work when inherited by a derived class, but doesn't know to serialize essential information about that derived class. That's an error by the programmer of the derived class, of course, but it seems a very likely one to me.
We should probably ban subclassing from Python so that you feel better.
Perhaps NumPy and Pandas converting all their types to use the __json__ protocol will be enough that the majority of json.dumps users will never see another such Exception, and surely no data corruption in production. Seems unlikely to me, and the possibility of such problems should be considered when we think about the development burden we are imposing on library maintainers, and the risks to downstreams, of adopting this protocol.
Which "development burden" are we imposing on library maintainers? Library maintainers are free to adopt or not the __json__ protocol. There is no formal obligation. Antoine.
Would this work for you?

    import struct
    floatval = 0.6441726684570313
    # converts floatval into a string (Python 2) or bytes (Python 3) of 8 raw bytes
    s = struct.pack("d", floatval)
    # then serialise s

To be portable between different platforms, you would also need to store whether the format is little-endian or big-endian (sys.byteorder is 'big' or 'little'; I don't know what it does on bi-endian machines).

Rob Cliffe

On 08/08/2019 20:05:50, Richard Musil wrote:
I am not sure `(str(o),)` is what I want. For a comparison, here are three examples:
```
json:orig  = {"val": 0.6441726684570313}
json:pyth  = {'val': 0.6441726684570312}
json:seri  = {"val": 0.6441726684570312}

dson:orig  = {"val": 0.6441726684570313}
dson:pyth  = {'val': Decimal('0.6441726684570313')}
dson:seri  = {"val": ["0.6441726684570313"]}

sjson:orig = {"val": 0.6441726684570313}
sjson:pyth = {'val': Decimal('0.6441726684570313')}
sjson:seri = {"val": 0.6441726684570313}
```
Each one has three outputs: `orig` is the input text, `pyth` is its Python representation in a `dict`, and `seri` is the serialized text output of `pyth`.
Now, the prefixes are: `json` for the standard Python module (which gets the last digit wrong in the output); `dson` for the standard Python module using `parse_float=decimal.Decimal` on `json.loads` and a custom serializer with the proposed `return (str(o),)`; and finally `sjson` for `simplejson` using `use_decimal=True` on `json.loads` and the same (which is the default) on its `json.dumps`.
When I had `return str(o)` in the custom implementation, I ended up with a string in the output:
```
dson:orig = {"val": 0.6441726684570313}
dson:pyth = {'val': Decimal('0.6441726684570313')}
dson:seri = {"val": "0.6441726684570313"}
```
and finally, with `return float(o)`, I am basically back at square one:
```
dson:orig = {"val": 0.6441726684570313}
dson:pyth = {'val': Decimal('0.6441726684570313')}
dson:seri = {"val": 0.6441726684570312}
```
The possibility to specify "raw" textual output, which does not get mangled by the JSONEncoder when a custom encoder is used, seems to be missing.
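For anyone wanting to reproduce the `dson` case, the whole round trip fits in a few lines (this DecimalEncoder has the same shape as the subclass shown earlier in the thread, with the `super()` call written out correctly):

```python
import decimal
import json

orig = '{"val": 0.6441726684570313}'

# Decimal preserves every digit on the way in...
data = json.loads(orig, parse_float=decimal.Decimal)


class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # the encoder quotes this -- the core complaint
        return super().default(o)


# ...but on the way out, the str() result comes back double-quoted.
out = json.dumps(data, cls=DecimalEncoder)
```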
On Aug 8, 2019, at 03:22, Richard Musil <risa2000x@gmail.com> wrote:
What matters is that I did not find a way how to fix it with the standard `json` module. I have the JSON file generated by another program (C++ code, which uses nlohmann/json library), which serializes one of the floats to the value above.
...
If anyone would want to know, why the last digit matters (or why I cannot double quote the floats), it is because the file has a secure hash attached and this basically breaks it.
If you need to exactly match a JSON file byte for byte, you really shouldn’t rely on parsing it and re-creating it in the first place, and especially not with two different libraries. Your C++ library is apparently using a different rounding mode when representing floats than Python’s default round-to-even. But different libraries also have different rules for when they switch to exponential notation, and how they represent it. And a C++ library may well represent 64-bit integers above 1<<56 imprecisely, while Python won’t. And, beyond numbers, different libraries produce different whitespace, different ordering within dicts, and different escaped representations of strings (not to mention how they handle things like “\uDEAD”, which the spec says is legal but doesn’t tell you how to interpret, because it doesn’t map to any Unicode character). There’s no way to guarantee that dumps(loads(x)) == x, even if you use Decimal instead of float. And this isn’t really a limitation of either of the libraries you’re using; it’s the way JSON is supposed to work, by design. Even if both libraries follow all of the interoperability recommendations in the RFC, they’re still not expected to produce the same bytes for the same input.

Usually you just shouldn’t be hashing JSON files. But sometimes you have to, to fit into a poorly designed ecosystem that you can’t change. In that case, if your goal is to write a program that sometimes makes a substantive change (in which case you want to re-sign the package, or tell the client there’s an update to download, etc.), but usually doesn’t, and you want it to leave the file byte-for-byte unchanged (so you don’t need to re-sign, re-download, etc.), the best thing to do is check that the dict is unchanged and, if so, not write the file at all, or write back the original un-parsed string.
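The point that `dumps(loads(x)) == x` cannot be relied on is easy to check: each of these inputs is valid JSON, yet Python re-renders the parsed value in its own preferred notation (a quick stdlib demonstration):

```python
import json

samples = [
    '{"n": 1E6}',        # re-emitted as 1000000.0
    '{"n": 20.0e0}',     # re-emitted as 20.0
    '{"s": "\\u0041"}',  # escape re-emitted as the literal character A
]
round_tripped = [json.dumps(json.loads(text)) for text in samples]
```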
I appreciate the responses, though I feel like I stirred up a completely different topic than I wanted :). I am aware of the limitations of the binary float representation, and consequently also of Python's handling of the matter, and I am perfectly fine with it. That was not the point of my post. The reason I gave the example was only its authenticity, i.e. there _is_ apparently a JSON serializer which prefers the value ending in '3'. Just for info, I ran a simple test with the C++ lib, and indeed the values:
```
json j3 = "{\"val\": 0.6441726684570313}"_json;
json j2 = "{\"val\": 0.6441726684570312}"_json;
```
both serialize to `0.6441726684570313`, so it is clearly a matter of choice. But as I, and others, said, it really does not matter.

I also did not want to discuss the design decisions behind JSON being unreliable for hash calculation. In my app, I know I am only dealing with integers, floats and strings (and arrays and maps), all within the range of a standard binary representation (i.e. nothing fancy), and while I agree that in general there are many unclear "corner cases" in JSON, I decided that hashing the (normalized) textual representation was the best I could do. If I had to devise some other representation (possibly binary) for that, I would also have to impose this particular representation on any other client verifying the hash, while the textual representation is already available and does not need any other conversion or re-representation.

Anyway, what I wanted to point out is that the standard (in the sense of being the standard Python implementation) JSON decoder can apparently handle floats of arbitrary precision with a simple `parse_float` argument, which even uses another built-in type, decimal.Decimal. On the other hand, there is no such option available on its encoder counterpart, and not even a custom-written encoder (for decimal.Decimal) can do the job, because of how it is implemented.
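The C++ observation is reproducible from Python's side too: both spellings denote the same IEEE-754 double, so which one a serializer prints back is purely its choice of decimal representation (a quick check):

```python
a = float("0.6441726684570313")
b = float("0.6441726684570312")

# Same binary64 value...
same_value = (a == b)

# ...but Python's repr picks the ...312 spelling, nlohmann/json the ...313 one.
python_spelling = repr(a)
```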
It strikes me, even on a very abstract level, as a kind of asymmetry in the API. I do not know if this problem really only impacts the float case (in which case `simplejson` has already thought it out well by simply adding an explicit option for it on its encoder), or whether it may affect some other types (maybe big integers), in which case something like what I suggested in my OP could be the solution. What do you think?
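The asymmetry is concrete: the decoder has a `parse_float` hook, while the encoder simply rejects Decimal (stdlib behaviour as of Python 3.7):

```python
import decimal
import json

# Decoding: full precision is preserved via the parse_float hook.
doc = json.loads('{"val": 0.6441726684570313}', parse_float=decimal.Decimal)

# Encoding: no counterpart hook exists, so this raises TypeError
# ("Object of type Decimal is not JSON serializable").
try:
    json.dumps(doc)
    encode_error = None
except TypeError as exc:
    encode_error = exc
```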
On Aug 8, 2019, at 10:33, Richard Musil <risa2000x@gmail.com> wrote:
I also did not want to discuss the design decisions about having JSON unreliable for the hash calculation. In my app, I know I am only dealing with integers, floats and strings (and arrays and maps), all withing the range of a standard binary representation (i.e. nothing fancy) and while I agree that in general, there are many unclear "corner cases" in JSON, I decided that hashing the (normalized) textual representation was best I could do.
You don’t seem to get that it’s not obscure “corner cases” that are the issue; it’s the basic design of JSON. It is not a design goal that different implementations (or even different runs of the same implementation) will produce the same output for equal values. The fact that you’re only dealing with integers, floats, strings, arrays, and maps doesn’t make any difference. The only other values in JSON are the three singletons true/false/null, and all of the variable-formatting issues are with the types you’re using, and with “standard values” for those types (for any definition of “standard” that you can think of, unless you think 1E6 or a string with non-ASCII characters or -0.0 or any array or map at all are outside the “standard” range). So you’re arguing that you’re only affected by 100% of the problems, and therefore you don’t have to think about them.

Trying to get two different JSON libraries to always emit exactly the same string for all possible inputs is a mug’s game. You fix one difference, and you’re just going to run into a different one later. If your goal is to not change the hash when the values are unchanged, just compare the dicts or keep track of a dirty flag or whatever, and don’t rewrite the file when the values are unchanged.
The hash is calculated over the "normalized" JSON output, where "normalized" basically means stripped of all whitespace by the "generator". This is as canonical as it gets. Then the same data are transmitted in "loose" form, i.e. with some indentation, so they are humanly readable. The other party has two options for verifying the hash:

1) Take the file as a text file, remove all the whitespace (doing some hardcoded, primitive "JSON parsing" that is probably very limited and very error-prone), and recalculate the hash from that. Since this only uses the data already present in the original text input, it cannot corrupt or change them; it just needs to know how to remove all whitespace correctly.

2) Use a JSON decoder to decode it (hopefully without losing anything in the process), then dump it into the "normalized" form and compute the hash over that. This has the risk of conversion error, but if I could avoid that risk by using a custom type which does not have such an error, it would be a much easier and more maintainable solution.

Re-encoding the data into some other format (binary or textual) for the hash would just add another level of complexity and would face exactly the same issues. Plus, the goal of the hash is to protect the information in its transmitted form (i.e. in its textual form), because that is the only one available to both the sender and receiver, and not to authenticate some other representation of the same data which may be subject to "rounding errors" depending on the situation. But as I said, discussing this was not the point of the OP.
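Option 2 corresponds roughly to what the stdlib already offers for normalization, whitespace aside (a sketch; the separators and key order here are my assumption of what "normalized" means, not necessarily the generator's actual rules):

```python
import hashlib
import json

payload = {"name": "x", "val": 1}

# Strip all inter-token whitespace and fix the key order.
canonical = json.dumps(payload, separators=(",", ":"), sort_keys=True)
digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The float problem is exactly that `json.dumps` may not reproduce the sender's digits, which is what breaks this scheme.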
On 08/08/2019 21:18, Richard Musil wrote:
2) Use a JSON decoder to decode it (hopefully without losing anything in the process), then dump it into the "normalized" form and compute the hash over that. This has the risk of conversion error, but if I could avoid that risk by using a custom type which does not have such an error, it would be a much easier and more maintainable solution.
This is the part that everyone is saying JSON itself does not guarantee will work. You cannot, by its very nature, produce such a custom type. -- Rhodri James *-* Kynesim Ltd
JSON objects are specified by their textual representation (json.org). Choosing some binary representation for them (so they can be processed efficiently) which does not preserve their value is a problem of the underlying binary representation, not of the JSON format per se.
On Fri, 9 Aug 2019 at 15:20, Richard Musil <risa2000x@gmail.com> wrote:
JSON objects are specified by their textual representation (json.org). Choosing some binary representation for them (so they can be processed efficinetly) which does not preserve their value is the problem of the underlying binary representation, not of the JSON format per se.
From ECMA-404, linked from that page:

    The goal of this specification is only to define the syntax of valid JSON texts. Its intent is not to provide any semantics or interpretation of text conforming to that syntax. It also intentionally does not define how a valid JSON text might be internalized into the data structures of a programming language. There are many possible semantics that could be applied to the JSON syntax and many ways that a JSON text can be processed or mapped by a programming language. Meaningful interchange of information using JSON requires agreement among the involved parties on the specific semantics to be applied. Defining specific semantic interpretations of JSON is potentially a topic for other specifications.

So yes, how JSON is translated into language data structures is out of the scope of the JSON spec. So you're proposing a change to the Python stdlib implementation of that translation. Fine. But you have yet to provide a justification for such a change, except the original background description (which you yourself have repeatedly claimed is irrelevant) of getting identical output from a JSON->internal->JSON round trip. So as far as I can see, we're left with "please change the stdlib json module, because I think this behaviour would be better (oh, and incidentally it would solve an issue I have that's not relevant to my argument that the module should change)".
From the JSON point of view 0.6441726684570313 is perfectly valid float (or better say _number_ as it is what JSON uses) and 0.6441726684570312 is perfectly valid _and different_ number, because it differs in the last digit.
Well, technically, JSON doesn't claim they are different, because it doesn't define comparison between its "number" objects... As you point out later in the same post, "3.0" and "3.0000" are different in the sense that you define, but how is that relevant or helpful? The JSON spec adds no semantics, so language bindings get to do what they want. So you're proposing a change to the Python language bindings. We get that. (At least I think most of us do by now). But I'm not sure you're getting the point that you need to justify your proposal, and you can't (by your own argument) do so by reference to the JSON spec, and you aren't (by your own admission) doing so based on round tripping. So what is your justification for wanting this change? "It makes more sense" doesn't seem to be getting much traction with people here. "Decimal offers better accuracy" also seems not to be very persuasive. Paul
Paul Moore wrote:
So you're proposing a change to the Python language stdlib implementation of that translation. Fine. But you have yet to provide a justification for such a change,
I think it can be justified on the grounds that it allows all of the information in the JSON text to be preserved during both deserialisation and serialisation. Seems to me this is objectively better. You can always discard information you don't need, but you can't get it back if you need it and it's not there. -- Greg
On Aug 9, 2019, at 17:16, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I think it can be justified on the grounds that it allows all of the information in the JSON text to be preserved during both deserialisation and serialisation.
Except that it doesn’t allow that. Using Decimal doesn’t preserve the difference between 1.0000E+3 and 1000.0, or between +12 and 12. Not to mention things like which characters in your strings, and even object keys, were read from backslash escapes, or which of your lists have spaces after their commas. JSON is not canonicalizable or round-trippable, by design. So I don’t think giving people the illusion of round-tripping their input is a good thing. Especially if smart people who’ve read this thread still don’t get that it’s an illusion.

People needing to actually preserve 100% of the JSON that gets thrown at them are going to think this does it (because the docs imply it, or some random guy on StackOverflow says so, or it passed the couple of unit tests they thought of), then deploy code that relies on that, and only discover later that it’s broken and can’t be fixed without changing big chunks of their design. I mean, the OP wants to use this for secure hashes; think about what kind of debugging nightmare that’s likely to lead to. (And I hope nobody actually tries to attack him while he’s debugging the secure hash failures that are just side effects of this bug…)

The fact that a feature _can_ be misused isn’t a reason to reject it. But the fact that a feature will _almost always_* be misused is a different story.

* I say “almost” because there are presumably some cases where being able to preserve 98.8% of the input that gets thrown at you is qualitatively better than being able to preserve 97.8%, or where the only JSON docs you ever receive are just individual numbers, and they’re all between -0.1 and -0.2, so the only possible error that can arise is this one. And so on. But I doubt any of those is common.
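Andrew's examples check out against the stdlib: Decimal normalizes away the spelling even while keeping the significant digits (a quick check):

```python
from decimal import Decimal

# Exponent notation and the explicit '+' sign are not preserved...
sci = str(Decimal("1.0000E+3"))   # -> '1000.0'
plus = str(Decimal("+12"))        # -> '12'

# ...although the significant digits themselves are kept.
digits = Decimal("1.0000E+3").as_tuple().digits  # (1, 0, 0, 0, 0)
```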
Andrew Barnert wrote:
Except that it doesn’t allow that. Using Decimal doesn’t preserve the difference between 1.0000E+3 and 1000.0, or between +12 and 12.
That's true. But it does preserve everything that's important for interpreting it as a numerical value without losing any precision, which I think is enough of an improvement to recommend having it as an option. -- Greg
On Aug 9, 2019, at 19:09, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Andrew Barnert wrote:
Except that it doesn’t allow that. Using Decimal doesn’t preserve the difference between 1.0000E+3 and 1000.0, or between +12 and 12.
That's true. But it does preserve everything that's important for interpreting it as a numerical value without losing any precision, which I think is enough of an improvement to recommend having it as an option.
I don’t see why 00.12 vs. 0.12 is a problem (though perhaps an acceptable one, since it only affects very weird code), while 0.12 vs. 0.012E1 is not a problem at all. If there’s any extra information in the extra 0 in the first case, surely there’s the same extra information in the second: in both cases, it’s a difference between 1 leading zero and 2. I don’t think this is very important, because 00.12 is not something people usually expect to have any extra meaning (and not something that any of the relevant specs give any meaning to), but you did bring it up.

More importantly, the OP isn’t asking for preserving mathematical precision. In fact, float already gives him 100% mathematical precision for his example. The two strings that he’s complaining about are different string representations of the same float value. It’s not the value he’s complaining about, but the fact that Python won’t give him the same string representation for that float that his C++ library does, and therefore dumps(loads(…)) isn’t perfectly round-tripping his JSON. This will be exactly the same case if the text is +12 or 1.0000E+3, and switching from float to Decimal won’t do anything to fix that.

If every library in the world used the same algorithm for float representations as Python, identical in every way except that some of them round half away from zero instead of rounding to even, then using Decimal would solve that problem. (But so would making it easy to specify a different rounding mode…) But that’s not the case. (And, even if it were, it still wouldn’t solve any of the other problems with JSON not being canonicalizable.)

There are good uses for use_decimal, like when you actually want to pass around numbers with more precision than float and know both your producer and consumer can handle them. That’s presumably why simplejson offers it. But in this thread, nobody is suggesting any good uses. They’re either suggesting we should have it so people can be fooled into believing they can get perfect round-tripping of JSON, or so we can solve a mathematical precision problem that doesn’t actually exist.

If someone had come up with a good way to port use_decimal into the stdlib without needing json to import decimal, I would have said it was a great idea, but after this thread, I’m not so sure. If everyone who wants it, or wants other people to have it, is wrong about what it would do, that just screams attractive nuisance rather than useful feature.
Andrew Barnert wrote:
I don’t see why 00.12 vs. 0.12 is a problem ... while 0.12 and 0.012E1 are not a problem at all.
I don't think either of them are problems. Or to be more precise, I think that being able to use Decimal objects with JSON would be a good thing, even though it wouldn't allow these to be distinguished. I understand that it doesn't solve the OP's hashing problem. -- Greg
Andrew Barnert wrote:
Except that it doesn’t allow that. Using Decimal doesn’t preserve the difference between 1.0000E+3 and 1000.0, or between +12 and 12. Not to mention
but it preserves the exact precision and value, i.e. not a fraction is lost. This alone makes it worth considering as a feature update, for me.
People needing to actually preserve 100% of the JSON that gets thrown at them are going to think this does it (because the docs imply it, or some random guy on StackOverflow says so, or it passed the couple of unit tests they thought of), and then deploy code they
Not sure which people you refer to; if you in fact refer to me (because I have not actually noticed requests for that particular feature on StackOverflow), then rest assured that I am perfectly aware of what you wrote here, and I am still convinced that this feature (which I am asking for) will be quite useful for me. And I also believe that others who may potentially want it (and are now dumping decimal.Decimal to string for that) would be perfectly fine with it as well. And those with very specific needs, like me, will be aware of the pitfalls and will be able to judge, in the context of their application, whether it is a benefit or not.
I mean, the OP wants to use this for secure hashes; think about what kind of debugging nightmare that’s likely to lead to. (And I hope nobody actually tries to attack him while he’s debugging the secure hash failures that are just side effects of this bug…)
Forget the secure hash. The OP wants to be able to dump the same float he read on the input to the output. If you want to discuss the design decisions of why and how I calculate the hash, and why the feature I am asking for is beneficial for my app, feel free to write me by email and we can take this discussion offline.
The fact that a feature _can_ be misused isn’t a reason to reject it. But the fact that a feature will _almost always_* be misused is a different story.
Being able to dump the float as a float without losing any precision is the feature we are discussing here. The rest is just background info. How is that going to be misused _almost always_?
On Aug 10, 2019, at 01:01, Richard Musil <risa2000x@gmail.com> wrote:
Andrew Barnert wrote:
Except that it doesn’t allow that. Using Decimal doesn’t preserve the difference between 1.0000E+3 and 1000.0, or between +12 and 12. Not to mention
but it preserves the exact precision and value, i.e. not a fraction is lost. This alone is worthy considering it as a feature update for me.
But float _already_ preserves the exact precision and value for the example you gave. The two strings are different string representations of the exact same float value, and will therefore give the exact same mathematical results with the exact same error bars. You’ve made it very clear that the exact precision and value isn’t what you’re interested in anyway, but the exact bytes of the text, so you can hash them. (You’ve then denied it, then made it very clear again, and gone around and around in circles, but that doesn’t change anything.)
The fact that a feature _can_ be misused isn’t a reason to reject it. But the fact that a feature will _almost always_* be misused is a different story.
Being able to dump the float as a float without losing any precision is the feature we are discussing here.
No, that’s the feature you already have for your example input, with float, today, and aren’t happy with. The feature we are discussing here is a way to do something that will, in that one specific example, give you back the exact same text, which is what you actually want. You either don’t have a problem (in which case you wouldn’t be here in the first place), or you have a problem that can’t be solved this way, and can’t even be solved in principle, but rather than accept that, you keep denying that you asked what you asked, and then pretending to have a different problem in hopes that different problem might be solved by the feature you asked for, so maybe the feature will be added so you can misuse it for what you really wanted.
Andrew Barnert wrote:
But float _already_ preserves the exact precision and value for the example you gave. The two strings are different string representations of the exact same float value, and will therefore give the exact same mathematical results with the exact same error bars. You’ve made it very clear that the exact precision and value isn’t what you’re interested in anyway, but the exact bytes of the text, so you can hash them.
The original text representation defines the value. The fact that the two numbers mentioned in my example both end up as the same value in the binary representation the JSON decoder uses does not mean that those two values are the same. Try to separate the JSON format (and the values it represents) from the internal representation the JSON decoder/encoder uses to process those values. JSON (the format) alone does not stipulate any particular internal representation, and using something other than IEEE-754 floating point for the number representation is perfectly acceptable (e.g. decimal.Decimal). If you do that, you will realize that what I am asking for really is preserving the value (precision) and the text representation at the same time, because they actually define one another.
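This distinction is observable directly in the stdlib types: the two texts collapse to one float, but they remain two distinct Decimal values, and the Decimal keeps the original spelling (a quick check):

```python
from decimal import Decimal

s1, s2 = "0.6441726684570313", "0.6441726684570312"

# As binary doubles, the two texts collapse to a single value...
floats_equal = float(s1) == float(s2)

# ...as decimals they stay distinct, and the text survives round-tripping.
decimals_equal = Decimal(s1) == Decimal(s2)
round_trip = str(Decimal(s1))
```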
On Sat, 10 Aug 2019 at 01:17, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Paul Moore wrote:
So you're proposing a change to the Python language stdlib implementation of that translation. Fine. But you have yet to provide a justification for such a change,
I think it can be justified on the grounds that it allows all of the information in the JSON text to be preserved during both deserialisation and serialisation.
Seems to me this is objectively better. You can always discard information you don't need, but you can't get it back if you need it and it's not there.
Agreed. *That* is a reasonable justification. Whether it's sufficient to get the change accepted to Python remains to be seen, but that can be decided once a PR is submitted. Backward compatibility probably requires that it be opt-in, and the reduced performance and import cost of Decimal probably means it needs care to implement it efficiently, without reducing performance for people who don't opt into it, but those are details that can be thrashed out in the implementation. So IMO, the next step is probably an issue on bpo, combined with a PR implementing the proposed behaviour. I don't think this is big enough to need a PEP, and I don't think any more debate is really needed here. Someone motivated and able to do the work is really what's needed next. Paul
Paul Moore wrote:
So IMO, the next step is probably an issue on bpo, combined with a PR implementing the proposed behaviour. I don't think this is big enough to need a PEP, and I don't think any more debate is really needed here. Someone motivated and able to do the work is really what's needed next.
I am willing and motivated, but will need some guidance in the process and peer review of the code. I am going on a one-week leave tomorrow (so will only be able to communicate from a phone), but after my return I can put together a patch (along the lines outlined here) and proceed with help. Would that be acceptable, and if yes, whom might I contact about it later? Or how should I proceed in general: should I post it here again, or somewhere else?
On 8/10/2019 10:59 AM, Richard Musil wrote:
Paul Moore wrote:
So IMO, the next step is probably an issue on bpo, combined with a PR implementing the proposed behaviour. I don't think this is big enough to need a PEP, and I don't think any more debate is really needed here. Someone motivated and able to do the work is really what's needed next.
I am willing and motivated, but will need some guidance in the process and peer review of the code. I am going on one week leave tomorrow (so will be only able to communicate from a phone), but after my return I can put together a patch (along the lines I outlined here) and with the help proceed.
Would it be acceptable, and if yes, whom I might contact about that later? Or how should I proceed in general, should I post it here again, or somewhere else?
I don't have a problem with posting questions here, but maybe core-mentorship might be a better place: https://mail.python.org/mailman3/lists/core-mentorship.python.org/ Eric
On 8/10/2019 11:30 AM, Eric V. Smith wrote:
On 8/10/2019 10:59 AM, Richard Musil wrote:
Paul Moore wrote:
So IMO, the next step is probably an issue on bpo, combined with a PR implementing the proposed behaviour. I don't think this is big enough to need a PEP, and I don't think any more debate is really needed here. Someone motivated and able to do the work is really what's needed next.
I am willing and motivated, but will need some guidance in the process and peer review of the code. I am going on one week leave tomorrow (so will be only able to communicate from a phone), but after my return I can put together a patch (along the lines I outlined here) and with the help proceed.
Would it be acceptable, and if yes, whom I might contact about that later? Or how should I proceed in general, should I post it here again, or somewhere else?
I don't have a problem with posting questions here, but maybe core-mentorship might be a better place: https://mail.python.org/mailman3/lists/core-mentorship.python.org/
Also: you might want to create an issue on bpo right now and state that you're working on a patch. There might be useful feedback while you're working on it. Eric
Ok, thanks for the tip. I guess I will post it on bpo once I am back at the keyboard (in one week), so I can reply to questions and comments in a timely manner. I had a quick look at JSONEncoder, and it seems that a patch to the Python code will be sufficient, so I will probably post a PR as well (not so much as a request, but to use the GitHub PR features to let people comment on the proposal easily).
I guess we can wait for your patch, but I have no idea, having read through this whole thread, exactly what you are actually proposing. I suggest the first step would be to lay that out and see what the core devs think. In the meantime, I hope it will be helpful for me to summarize what I got out of this discussion, and then make my own proposal, which may or may not be similar to Richard's:

TL;DR: I propose that Python's JSON encoder encode the Decimal type maintaining full precision.

1) The original post was inspired by a particular problem the OP is trying to solve, and a suggested solution that I suspect the OP thought was the least disruptive and maybe most general solution. However, what I think that did was throw some red herrings into the conversation. It did make me go read the RFC and see what the JSON spec really says about numbers, and think about whether the Python json module does as well as it could in transcoding numbers. And I think we've found a limitation.

2) To be clear about vocabulary: a "float" is a binary floating point number, for all intents and purposes an IEEE 754 float, or at the very least a Python float, which is compatible between many computer systems. JSON does not have a "float" specification. JSON has a "number" specification, and it's a textual representation that can (only) represent the subset of rational numbers that can be represented in base ten with a finite number of digits. It does not specify a maximum number of digits, but it does allow implementations to set a limit. The RFC addresses this issue with:

"""
This specification allows implementations to set limits on the range and precision of numbers accepted.
Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.
"""

OK, so this means: if you want to be generally interoperable, then limit yourself to numbers that can be represented by IEEE 754. But it does not prohibit greater precision, or a different binary representation when decoded.

Python's json module, like I imagine most JSON decoders, takes the very practical approach of using (IEEE 754) float as the default for JSON numbers with a fractional part. But it also allows you to decode a JSON number as a Decimal type instead. It does not, however, have a way to losslessly encode a Python Decimal as JSON.

Since the JSON spec does in fact allow lossless representation of a Python Decimal, it seems that for completeness' sake the json default encoding of Decimal should maintain the full precision. This would provide round-tripping in the sense that a Python Decimal encoded and then decoded as JSON would get back the same value (but a given JSON number would not necessarily get the exact same text back when round-tripped through Decimal). And it would provide compatibility with any hypothetical other JSON implementation that fully supports decimal numbers.

Note that this *might* solve the OP's problem in this particular case, but not in the general case: it relies on the Python user knowing how some *other* JSON encoder is encoding its floats.
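The RFC's own example illustrates the decoder-side half of this, which the stdlib already handles (a quick demonstration):

```python
import json
from decimal import Decimal

text = '{"pi": 3.141592653589793238462643383279}'

# Default float parsing rounds to the nearest binary64 value...
as_float = json.loads(text)["pi"]          # 3.141592653589793

# ...while parse_float keeps every digit the producer sent.
as_decimal = json.loads(text, parse_float=Decimal)["pi"]
```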
But it would provide a consistent encoding of Decimal that should be compatible with other *decimal* numeric types.

Final points: I fully concur with many posters that byte-for-byte consistency of JSON is NOT a reasonable goal. I also fully agree that the Python JSON encoder should not EVER generate invalid JSON, so the OP's idea of a "raw" encoder seems like a bad idea. I can't think of, and I don't think anyone else has come up with, any examples other than Decimal that require this "raw" encoding. And if anyone finds any others, then those should be addressed properly.

The fundamental problem here is not that we don't allow a raw encoding, but that the JSON spec is based on decimal numbers, and Python also supports decimal numbers, yet there was that one missing piece of how to "properly" encode the Decimal type. It is clear in the JSON spec how best to do that, so Python should support it.

Richard: if your proposal is different, I'd love to hear what it is, and why you think Python needs something else.

-CHB

--
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
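Until something along these lines lands in the stdlib, the only pure-stdlib route I know of is a post-processing hack. A sketch (the placeholder scheme is my own invention, not an established API; it only covers `dumps()`, not streaming `dump()`, and it assumes the random tokens never collide with real string data):

```python
import decimal
import json
import uuid

class DecimalEncoder(json.JSONEncoder):
    """Emit Decimals as unquoted JSON numbers via a placeholder pass."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._raw = {}

    def default(self, o):
        if isinstance(o, decimal.Decimal):
            token = uuid.uuid4().hex   # assumed unique and absent from the data
            self._raw[token] = str(o)
            return token
        return super().default(o)

    def encode(self, o):
        text = super().encode(o)
        # Splice the raw digits back in place of the quoted placeholders.
        for token, digits in self._raw.items():
            text = text.replace('"%s"' % token, digits)
        return text

doc = json.loads('{"val": 0.6441726684570313}', parse_float=decimal.Decimal)
encoded = json.dumps(doc, cls=DecimalEncoder)  # {"val": 0.6441726684570313}
```

This preserves the digits exactly, at the cost of a second pass over the output text.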
Christopher Barker wrote:
I don't think anyone else has come up with, any examples other than Decimal that require this "raw" encoding.
It's possible that someone could have their own custom numeric type that they want to encode as a JSON number. But if the base code is enhanced to understand Decimal, that could be achieved using the existing mechanism for returning an alternative object to be encoded -- just convert your type to an equivalent Decimal and return that. -- Greg
On Sat, Aug 10, 2019 at 11:08 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I don't think anyone else has come up with, any examples other than Decimal that require this "raw" encoding.
It's possible that someone could have their own custom numeric type that they want to encode as a JSON number. But if the base code is enhanced to understand Decimal, that could be achieved using the existing mechanism for returning an alternative object to be encoded -- just convert your type to an equivalent Decimal and return that.
Exactly -- what I realized is that the JSON number is the exact same "data model" as the Python Decimal type. So being able to encode Decimal (and int) allows you to encode ANY valid JSON number. Which is why I think we don't need to allow the user to be able to customize encoding of numbers (or anything else). But the current float encoding does not fully support JSON numbers, so we do have a limitation that could/should be addressed. -CHB
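[Editorial illustration: Greg's convert-to-Decimal idea can be sketched with the existing `default` hook. `Money` and its methods are invented for this example; the Decimal branch is commented out because today's stdlib encoder would reject a returned Decimal, so float is used as a stand-in.]

```python
import json
from decimal import Decimal

class Money:
    """Hypothetical fixed-point type (illustration only)."""
    def __init__(self, cents):
        self.cents = cents
    def as_decimal(self):
        return Decimal(self.cents) / 100

class MoneyEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, Money):
            # If json learned to encode Decimal losslessly, returning the
            # equivalent Decimal here would be all a custom type needs:
            #     return o.as_decimal()
            # With today's encoder, float is the only numeric fallback:
            return float(o.as_decimal())
        return super().default(o)

print(json.dumps({"total": Money(199)}, cls=MoneyEncoder))
```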
-- Greg
Christopher, let me go through your summary and add some remarks, hopefully for the benefit of all (who made it so far) in this conversation: Christopher Barker wrote:
TL;DR : I propose that python's JSON encoder encode the Decimal type as maintaining full precision.
My proposal will be to extend JSON encoder to allow custom type to encode into "JSON fractional number" (explanation will follow).
1) The original post was inspired by a particular problem the OP is trying to solve, and a suggested solution that I suspect the OP thought was the least disruptive and maybe most general solution. However, what I think that did was throw some red herrings into the conversation. It also made me go read the RFC and see what the JSON spec really says about numbers, and think about whether the Python json module does as well as it could in transcoding numbers. And I think we've found a limitation.
After I decided to write a proposal on bpo (and the patch) I made a mental note that I would need to address some things in my proposal differently, to avoid the misunderstanding which was apparent in this thread. My (biggest) mistake was that I used the word "float" loosely for different things based on the context. First, I used it for "JSON fractional number", i.e. the *number* (as defined by the JSON spec) with a decimal point. Sometimes I wrote it as "JSON float", sometimes as "float" only, and I guess it was _the_ (unintentional) red herring. Second, I used it for the IEEE-754 floating point number, usually referring to it as "floating point binary representation". What did not occur to me (yes!) was that for most (all) here, "float" would probably prominently mean "Python native type float", as I really did not make a distinction between IEEE-754 and the Python type and considered the latter just an implementation of the former. So being loose was not helpful at all, and I realized I would need to address it in my "official" proposal.
2) To be clear about vocabulary: a "float" is a binary floating point number, for all intents and purposes an IEEE754 float -- or at the very least, a Python float. Which is compatible between many computer systems. JSON does not have a "float" specification.
Fully agree on that (as per my remark above).
OK -- so this means: if you want to be generally interoperable, then limit yourself to numbers that can be represented by IEEE-754. But it does not prohibit greater precision, or different binary representation when decoded.
That was the premise with which I came in, and managed to communicate so poorly.
Python's json module, like I imagine most JSON decoders, takes the very practical approach of using (IEEE-754) float as a default for JSON numbers with a fractional part. But it also allows you to decode a JSON number as a Decimal type instead. But it does not have a way to losslessly encode a python Decimal as JSON.
This triggered this thread.
Since the JSON spec does in fact allow lossless representation of a Python Decimal, it seems that for completeness' sake, the json default encoding of Decimal should maintain the full precision. This would provide round tripping in the sense that a Python Decimal encoded and then decoded as JSON would get back the same value. (but a given JSON number would not necessarily get the exact same text back when round-tripped through Decimal) And it would provide compatibility with any hypothetical other JSON implementation that fully supports Decimal numbers.
Here I would like to clarify two things: 1) At the moment the Python `json` module allows using a custom type for parsing "JSON fractional number". This custom type is passed to the decoder (`json.load(s)`) in an optional keyword argument `parse_float`. This type could be Python's own decimal.Decimal, but it can be something else. 2) decimal.Decimal has been used throughout the discussion as an example of such a custom type, and because `simplejson` already allows using it for both (decoding and encoding). `simplejson` however does not allow a custom type to be used (neither for parsing nor encoding); decimal.Decimal is "hardcoded" in the API by the `use_decimal` Boolean keyword argument.
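[Editorial illustration: the decode-side hook described in point 1 already works today; Decimal even preserves a trailing zero that float would drop.]

```python
import json
from decimal import Decimal

raw = '{"price": 1.10}'

as_float = json.loads(raw)
assert str(as_float["price"]) == "1.1"           # trailing zero gone

as_decimal = json.loads(raw, parse_float=Decimal)
assert as_decimal["price"] == Decimal("1.10")    # exact text preserved
assert str(as_decimal["price"]) == "1.10"
```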
Note that this might solve the OP's problem in this particular case, but not in the general case -- it relies on the Python user to know how some other JSON encoder is encoding its floats. But it would provide a consistent encoding of Decimal that should be compatible with other decimal numeric types.
This is not a question only of interoperability (between two different codecs/platforms). Imagine an application which reads a JSON file, which contains those "problematic" values, does some operation on completely unrelated parts (e.g. changes some metadata, etc.) and then dumps the file back. Imagine those "problematic" values are some financial data. Such an application, even without actually needing (or even caring about) any JSON numbers in the file, is not able to dump them back into the changed file without changing them.
Final points: I fully concur with many posters that byte for byte consistency of JSON is NOT a reasonable goal.
If the extension to support a custom type for decoding and encoding "JSON fractional number" is accepted, then byte-to-byte accuracy for this particular type could also be implemented. Imagine this "custom type":

```
class JsonNumCopy:
    def __init__(self, src):
        self.src = src

    def __str__(self):
        return self.src

    def __repr__(self):
        return self.src
```

It can already be used with the decoder (`json.loads(msg, parse_float=JsonNumCopy)`) and works as expected.
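[Editorial illustration: restating the type above with its decode and (failed) encode round trip shows that decoding already works and encoding is exactly the missing piece.]

```python
import json

class JsonNumCopy:
    def __init__(self, src):
        self.src = src
    def __str__(self):
        return self.src

doc = json.loads('{"v": 0.6441726684570313}', parse_float=JsonNumCopy)
assert str(doc["v"]) == "0.6441726684570313"   # exact source text kept

try:
    json.dumps(doc)   # ...but there is no way to write it back out today
except TypeError as exc:
    print("encode fails:", exc)
```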
I also fully agree that the Python JSON encoder should not EVER generate invalid JSON, so the OP's idea of a "raw" encoder seems like a bad idea.
I came to the same conclusion during the discussion here. It seems there is no need for similar treatment of other Python native types, as they are already handled accordingly (int -> big int).
The fundamental problem here is not that we don't allow a raw encoding, but that the JSON spec is based on decimal numbers, and Python also supports Decimal numbers, but there was that one missing piece of how to "properly" encode the Decimal type -- it is clear in the JSON spec how best to do that, so Python should support it.
You summed it up nicely. I could not do it better. The only thing I am not sure we are aligned on is the actual implementation (or what we have in mind). As I wrote, I am aiming for a custom type (or types, as suggested by Joao) being allowed to serialize into "JSON fractional number". This may seem too broad (or risky) at first, but I believe there are two good reasons for that:

1) The parser already allows a custom type. It would be good to be able to do something like this:

```
json_dict = json.loads(json_in, parse_float=MyFloat)
json_out = json.dumps(json_dict, dump_as_float=MyFloat)
```

2) This allows offloading the burden of deciding which type it should use from JSONEncoder to the client code. A very primitive custom type (example above) can be used instead of decimal.Decimal, but more importantly, as raised by others here, the proposed implementation should avoid importing decimal. This solution will also solve that. The client code will import (or define) whatever type it needs. (I believe that was the reason why the implementer of the decoder in the `json` module went for this solution instead of a simple `use_decimal` flag as `simplejson` did.)
Richard: if your proposal is different, I'd love to hear what it is, and why you think Python needs something else. -CHB
Thanks again for the summary. I realized (still in the learning process) that even though I planned something like this for my bpo post, it was better to do it here right now. Unfortunately I will not be very responsive in the next week, so I will only be able to come up with a proposal on bpo after that. Since you seem to have something else in mind for the implementation, I guess you can either go ahead with yours (and then I will join in later with my comments), or you can wait for my proposal and then join in with yours. Richard
On Sat, Aug 10, 2019 at 11:20 PM Richard Musil <risa2000x@gmail.com> wrote:
Christopher, let me go through your summary and add some remarks, hopefully for the benefit of all (who made it so far) in this conversation:
And more comments from me now :-)
Christopher Barker wrote:
TL;DR : I propose that python's JSON encoder encode the Decimal type as maintaining full precision.
My proposal will be to extend JSON encoder to allow custom type to encode into "JSON fractional number" (explanation will follow).
Here is where I think we have slightly different ideas: see the recent exchange with Greg Ewing, but my thought is that we have the built-in Python Decimal type that matches the "JSON fractional number" -- if it can losslessly encode into JSON, then any other custom numeric type can be encoded into JSON by using the Decimal encoding. The advantage of this is that it would then ensure that users couldn't accidentally create invalid JSON.
My (biggest) mistake was that I used word "float" loosely for different things based on the context. First, I used it for "JSON fractional number", i.e. the *number* (as defined by JSON spec) with decimal point. Sometimes I wrote it as "JSON float", sometimes as "float" only and I guess it was _the_ (unintentional) red herring.
yeah, the terminology can be confusing.
Second, I used it for IEEE-754 Floating Point number, usually referring to it as "floating point binary representation".
What did not occur to me (yes!) that for most (all) here, "float" would probably prominently mean "Python native type float", as I really did not make a distinction between IEEE-754 and the Python type and considered the latter just an implementation of the former.
Which is pretty much the case, so I don't think that's a real problem -- at least until/unless Python decides to support a different float implementation in the future. Here I would like to clarify two things:
1) At the moment Python `json` module allows using custom type for parsing "JSON fractional number". This custom type is passed to the decoder (`json.load(s)`) in an optional keyword argument `parse_float`. This type could be Python's own decimal.Decimal, but can be something else.
OK -- *maybe* there could be a more "standard" way to make a Decimal out of a JSON number, but this seems fine to me. Having a JSON number become a Python float by default does seem like the best option (and is what it does now, and backwards compatibility and all that).
2) decimal.Decimal has been used throughout the discussion as an example of such custom type and because `simplejson` already allows using it for both (decoding and encoding).
`simplejson` however does not allow custom type to be used (neither for
I'm not sure simplejson is particularly relevant here, but yes, as an example, sure. parsing nor encoding); decimal.Decimal is "hardcoded" in the API by the `use_decimal` Boolean keyword argument. I think that is the right approach -- while there may be a use case for people to use a custom numeric type with JSON numbers, the JSON spec does not allow all possible numeric values to be represented -- it can only represent numbers that can be written with a finite number of base-ten digits (i.e. no rationals with infinite decimal expansions -- there is no way to precisely specify the value 1/3, for instance -- and no irrational numbers (sqrt(2), pi, etc.)). As it happens, the Python Decimal type can also represent exactly these same numbers (and a few more that JSON excludes: Inf, -Inf, NaN). So using int, float, and Decimal as the only way to encode/decode numbers provides access to the full functionality of JSON numbers. If someone does need to encode another number type, they will need to convert to int, float, or Decimal, and I'm pretty sure that will provide a way to access any legal JSON value. So why not allow users to write any custom encoder they want for JSON numbers? Because we want the json lib to ideally only ever produce valid JSON. I don't think it's horrible to allow users to do something wrong (consenting adults and all that), but it's better not to make it easy to do, and only to allow it if it provides some functionality that cannot be provided with a more restrictive system. An example -- say a user is using the Fraction type in Python. They want to store that value in JSON. How can they do that now? Two ways: 1) Convert to float and accept the loss of precision 2) Store it in some custom string (or object) representation What should they be allowed to do? Well, in pure JSON the options are: 1) store as a JSON number, accepting loss of precision, but choosing what that loss will be: 15 digits, 100 digits? all valid JSON.
2) Store it in some custom string (object) representation. If the Python json lib provides a Decimal encoder, then users will have exactly these same two options. However, if it allows a custom encoder, then the user *could* use a non-legal JSON option -- which would NOT be a good idea -- so why allow it? However, I see one reason folks may want to control the encoding of Decimal numbers: as pointed out in this thread, the way one can encode a particular value in JSON is not unique: 1.01 and 0.101e+1 and 101.0e-2 all represent the same value. The encoding of a Decimal in JSON will presumably normalize this, so that a value would always be stored the same way. But other JSON libs could encode the same value in a different way while still being totally valid. The Python decoder would produce the same Decimal value for any legal JSON that represented the same value, but there would be no way to ensure that the Python Decimal encoder produced exactly the same JSON text for all Decimal values (much like it currently does not for float values). One way to address this would be to allow the users to write their own custom encoder that could then match what some other encoder does, but I think that isn't a use case that we should aim to support, for all the reasons previously mentioned in this thread. This is not a question only of the interoperability (between two different
codec/platforms). Imagine an application which reads a JSON file, which contains those "problematic" values, does some operation on completely unrelated parts (e.g. changes some metadata, etc.) and then dumps the file back. Imagine those "problematic" values are some financial data.
Such application, even without actually needing (or even caring about) any JSON numbers in the file is not able to dump them back into the changed file without changing them.
This is the question -- is this something to support? In my example, such an application would return JSON that represented exactly the same values, but may not be exactly the same JSON text. Again, I think preserving exactly the same JSON text (rather than the value) should not be a goal of the Python json lib (and indeed, the only way to ensure that would be to not actually decode the data at all, or at least keep the original version around for round tripping). I think that a) this is not a valid goal for the lib, and b) your proposal wouldn't solve it anyway -- it would work for a larger set of cases, but not in general. If the extension to support custom type for decoding and encoding "JSON
fractional number" will be accepted, then also byte-to-byte accuracy for this particular type could be implemented. Imagine this "custom type":
```
class JsonNumCopy:
    def __init__(self, src):
        self.src = src

    def __str__(self):
        return self.src

    def __repr__(self):
        return self.src
```

It can already be used with a decoder (`json.loads(msg, parse_float=JsonNumCopy)`) and works as expected.
Well, OK. If you store the original JSON text, then you can reproduce it exactly. But you then have to use that special type -- you aren't getting a Decimal, or whatever numeric type you might want -- you are getting a special, "reproduce the JSON exactly" type. If I actually had that use case, I think I'd forget trying to use the built-in json module, and write code that was designed to manipulate and work with the JSON text itself.
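[Editorial illustration: Christopher's earlier point about non-unique number representations (1.01, 0.101e+1, 101.0e-2) can be checked with the decode hook -- the three texts are the same JSON value, yet Decimal keeps enough of their form that re-serialization would not be byte-identical.]

```python
import json
from decimal import Decimal

forms = ["1.01", "0.101e+1", "101.0e-2"]
vals = [json.loads(s, parse_float=Decimal) for s in forms]

# Numerically all equal...
assert vals[0] == vals[1] == vals[2]

# ...but their canonical string forms differ (Decimal preserves scale):
print([str(v) for v in vals])
```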
I also fully agree that the Python JSON encoder should not EVER generate
invalid JSON, so the OP's idea of a "raw" encoder seems like a bad idea.
I came to the same conclusion during the discussion here. It seems there is no need for similar treatment for other Python native types as they are already handled accordingly (int -> big int).
Well, as others have pointed out, Unicode (even UTF-8) doesn't guarantee reproducibility of the encoded bytes. So if you want to be able to guarantee exact round-tripping of JSON, you really have to do it everywhere.
The fundamental problem here is not that we don't allow a raw encoding, but that the JSON spec is based on decimal numbers, and Python also support Decimal numbers, but there was that one missing piece of how to "properly" encode the Decimal type -- it is clear in the JSON spec how best to do that, so Python should support it.
You summed it up nicely. I could not do it better. The only thing I am not sure we are aligned on is the actual implementation (or what we have in mind).
As I wrote, I am aiming for custom type (or types as suggested by Joao) being allowed to serialize into "JSON fractional number".
Not as far as I can tell -- as I wrote, there isn't any real reason to serialize ANY custom type to a "JSON fractional number". What you want is to be able to round-trip the exact JSON text, and you've identified the number type as the only JSON type that doesn't do this now. But IIUC, strings don't either, necessarily. Then there are the other issues with whitespace and all that. That can be normalized out, but if the goal is to reproduce the original JSON, I think a library designed with that in mind would make more sense. the proposed implementation should avoid importing decimal.
I didn't quite follow the logic here, but it's not hard to only import Decimal if it was asked for.
Unfortunately I will not be very responsive in the next week, so will only be able to come up with a proposal on bpo after that. Since you seem to have something else in mind for the implementation I guess you either go ahead with yours (and then I will join in later with my comments), or you wait for my proposal and then join with yours.
Frankly, I don't have a use case, and I have a lot of other things to do -- so I won't be doing much beyond this kibitzing. But I don't think the ultimate goal of being able to preserve the exact text in the original JSON is appropriate for the json package. The core devs may have another opinion, so good luck. -CHB
Christopher, I understood that the risk of producing invalid JSON if a custom type is allowed to serialize into the output stream seems to be a major problem for many. This has already been mentioned in this discussion. However, I thought it was related to the original idea of "raw output" (for a generic custom user type - which could serialize into an arbitrary JSON type). Since we agreed that the only type which needs such a treatment is the JSON fractional number, it should not be that hard to check if the custom type output is valid.

I had a look at how the current 'json' module implementation handles JSON numbers on decode. There is a regex defined in scanner.py (NUMBER_RE) which is used to recognize a JSON number in the input stream. If the regex matches, it matches into three groups conveniently parsed as 'integer', 'frac' and 'exp'. If either 'frac' or 'exp' is not empty, the decoder decodes the JSON input as a float (if default decoding is used) or as the <parse_float> type if this one is set. decoder.py defines (in a source comment) a JSON number with either an 'exp' or 'frac' part as a "JSON real number" and one without either as a "JSON integer number". So I guess I should have written in my previous post "JSON real number" instead of "JSON fractional number", to be perfectly aligned with the current implementation's nomenclature.

I have not verified if the NUMBER_RE regex defined in scanner.py matches exactly the JSON number syntax or whether there are some deviations, but I would say it would be a good start for checking the custom type output - by the rule: if the decoder lets this in, then the encoder should let it out. The check will involve only the custom type specified by 'dump_as_float', so it should not impact the default use, and for those who would want to use a custom type, it would be an acceptable price to pay for the flexibility they get.
If the check based on the same regex is accepted, it will ensure that what gets in will also be able to get out (in exactly the same form) on one side, and on the other side, if there were some decimal.Decimal outputs which the current 'json' would reject in decode, the custom type output sanity check (even if the custom type was decimal.Decimal) will not let those values into the output either. This type of consistency seems worth the performance impact of the additional check to me. Apart from that, I am not aware of any other adverse impact of the additional sanity check.
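[Editorial illustration: a sketch of such an encoder-side check, reusing the decoder's number grammar. The pattern below is copied from CPython's json/scanner.py; `validate_json_number` is a hypothetical helper, not an existing API.]

```python
import re

# The pattern json/scanner.py uses to recognize numbers on decode.
NUMBER_RE = re.compile(r'(-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?')

def validate_json_number(text):
    """Accept only output the decoder itself would accept."""
    m = NUMBER_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"{text!r} is not a valid JSON number")
    integer, frac, exp = m.groups()
    if frac is None and exp is None:
        raise ValueError(f"{text!r} is a JSON integer, not a real number")
    return text

validate_json_number("0.6441726684570313")   # passes unchanged
# validate_json_number("1.") or ("NaN")      # would raise ValueError
```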
On Mon, Aug 12, 2019 at 9:53 AM Richard Musil <risa2000x@gmail.com> wrote:
Christopher, I understood that the risk of producing invalid JSON if custom type is allowed to serialize into output stream seems to be a major problem for many. This has been already mentioned in this discussion. However, I thought it was related to the original idea of "raw output" (for generic custom user type - which could serialize into arbitrary JSON type)
well, that would make it all the easier to make arbitrarily invalid JSON, but no, I don't think that was the only issue. And as Chris A. points out, it's plenty easy to make totally invalid JSON anyway: In [29]: json.dumps({"spam": [1,2,3]}, separators=(' } ','] ')) Out[29]: '{"spam"] [1 } 2 } 3]}' So *maybe* that's a non-issue?
Since we agreed that the only type which needs such a treatment is JSON fractional number, it should not be that hard to check if the custom type output is valid.
Well, no -- we did not agree to that. If your goal is to support full precision of Decimal, then the way to do that is to support serialization of the Decimal type. And that would also allow serialization of any other (numeric) type into a valid JSON number. But it seems your goal is not to be able to serialize Decimal numbers with full precision, but rather to be able to perfectly preserve the original JSON representation (not just its value), or to be a bit more generic, to be able to control exactly the JSON representation of a value that may have more than one valid representation. If that is the goal, then strings would need a hook, too, as Unicode allows different normalized forms for some "characters" (see the previous discussion for this; I, frankly, don't quite "get" it). Maybe it would be helpful to have full control over numbers, but still not strings -- but be clear that that's what's on the table if it is. My summary: I support full-precision serialization of the Decimal type -- it would extend the json module to more fully support the JSON spec. I don't think the json module should aim to allow users to fully control exactly how a given item is serialized, though if it did, it would be nice if it did a validity check as well. Others, I'm sure, have different opinions, and mine hold no particular weight. But I suggest the way forward is to separate out the goal: a) This is what I want to be able to do. from b) This is how I suggest that be achieved. Because you need to know (a) in order to know how to do (b), and because you should make sure the core devs support (a) in the first place. I have not verified if the NUMBER_RE regex defined in scanner.py matches
exactly JSON number syntax or there are some deviations, but I would say it would be good start for checking the custom type output - by the rule, if the decoder lets this in, then the encoder should let it out.
and if it doesn't, that may be considered a bug in the NUMBER_RE that could be fixed.
The check will involve only the custom type specified by 'dump_as_float', so it should not impact the default use and for those who would want to use custom type, it would be acceptable price to pay for the flexibility they get.
This type of consistency seems to be worthy the performance impact of the
additional check for me.
Agreed. -CHB
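[Editorial illustration: the string-normalization point raised above is easy to demonstrate -- NFC and NFD spellings of the same character are both legal JSON, decode to unequal Python strings, and serialize to different bytes.]

```python
import json
import unicodedata

nfc = "\u00e9"                            # 'é' as one code point
nfd = unicodedata.normalize("NFD", nfc)   # 'e' + combining acute accent

# Python does not normalize, so the two strings are distinct...
assert nfc != nfd
# ...and both are valid JSON string contents for the "same" character,
# yet they serialize to different bytes:
assert json.dumps(nfc, ensure_ascii=False) != json.dumps(nfd, ensure_ascii=False)
```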
On Aug 12, 2019, at 15:18, Christopher Barker <pythonchb@gmail.com> wrote:
If that is the goal, then strings would need a hook, too, as Unicode allows different normalized forms for some "characters" (see previous discussion for this, I, frankly, don't quite "get" it.
Although normalization can be a problem, there’s a much simpler—and more common—issue: what to escape, and how to escape it. For example, many pre-ES5 JS implementations escaped forward slashes. This is legal, but unnecessary, and Python’s module doesn’t do it. And you can’t make Python’s module do it. So, you load a JSON document with the string “abc\/def”, you get the Python string “abc/def”, you dump it and get a JSON document with “abc/def”. JSON says the two documents are identical, but they obviously aren’t the same bytes. Similar issues include case for hex letters in \u escapes, whether to use \u0008 instead of \b (and similar for all the other special escape sequences, but I think \b is the most commonly different one), whether to escape all non-ASCII (and what that means—Python doesn’t count \x7f as ASCII), whether to escape all non-BMP, whether to treat Unicode separators (or just the two that JS source doesn’t allow) as control characters, etc. So, even if you only care about preserving the output of one specific library, and you know exactly what rule it uses for escaping, you still can’t do it. And really, the same is true for array and object whitespace. At least most libraries are consistent in how they use whitespace, and most rules can be covered by the simple separators hook, so if you know which library generated the document, you can usually write code to reproduce the whitespace. But if you have to work with the output of two different libraries? Or a library you don’t know? Or JSON edited by hand or by sed scripts that might not even be consistent?
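[Editorial illustration: the escaped-slash and \b examples above are directly checkable with the stdlib.]

```python
import json

# An escaped slash is legal JSON; both texts decode to the same string...
assert json.loads('"abc\\/def"') == json.loads('"abc/def"') == "abc/def"
# ...but Python's encoder only ever emits the unescaped form:
assert json.dumps("abc/def") == '"abc/def"'

# Likewise \u0008 and \b decode identically but re-encode one way:
assert json.loads('"\\u0008"') == json.loads('"\\b"') == "\b"
assert json.dumps("\b") == '"\\b"'
```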
On Mon, Aug 12, 2019 at 8:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:
Although normalization can be a problem, there’s a much simpler—and more common—issue: what to escape, and how to escape it.
Yup -- I realized this after writing that post -- thanks for fleshing it out. This tells me that trying to use the json module to create the exact same JSON is a pretty bad idea. On the other hand, if it did allow user code to control exactly how an object is represented, then a user could create a specific "version", and round-trip anything. I just don't think that's really the job of the json module -- I'd rather have it work harder to enforce valid JSON.
And really, the same is true for array and object whitespace. At least most libraries are consistent in how they use whitespace, and most rules can be covered by the simple separators hook, so if you know which library generated the document, you can usually write code to reproduce the whitespace. But if you have to work with the output of two different libraries? Or a library you don’t know? Or JSON edited by hand or by sed scripts that might not even be consistent?
I think the OP suggested that they remove all whitespace to normalize it. But anyway, other good points as to why trying to hash JSON to check for changes is touchy at best, and maybe impossible. -CHB
Christopher Barker wrote:
Since we agreed that the only type which needs such a treatment is JSON fractional number, it should not be that hard to check if the custom type output is valid.
well, no -- we did not agree to that.
Then I misread your statement: "I can't think of, and I don't think anyone else has come up with, any examples other than Decimal that require this "raw" encoding." The way I misread it is that only "JSON real number" requires ..., since I consider Decimal just an implementation means to achieve the goal, not the goal itself. Now I see that Decimal support is your goal.
If your goal is to support full precision of Decimal, then the way to do that is to support serialization of the Decimal type. And that would also allow serialization of any other (numeric) type into a valid JSON number. But it seems your goal is not to be able to serialize Decimal numbers with full precision, but rather, to be able to perfectly preserve the original JSON
My goal is, as already mentioned: To extend JSON encoder to support custom type to serialize into "JSON real number" (changed "fractional" to "real" to be aligned to Python source internal comment). This includes support for decimal.Decimal, if it is the custom type the user chooses. I do not understand why you write that supporting Decimal is not my goal.
My summary: I support full precision serialization of the Decimal type -- it would extend the json module to more fully support the JSON spec.
My proposal is compatible with this, it is just that it doesn't restrict the support to Decimal only.
I don't think the json module should aim to allow users to fully control exactly how a given item is serialized, though if it did, it would be nice if it did a validity check as well.
Allowing the custom serialization into JSON real number only can be checked relatively easily (as I tried to outline in my previous post).
But I suggest the way forward is to separate out the goal: a) This is what I want to be able to do. from b) This is how I suggest that be achieved. Because you need to know (a) in order to know how to do (b), and because you should make sure the core devs support (a) in the first place.
I believe I have defined my goal. If you think it is lacking in some way, I will try to improve it; let me know how. The reason why I mentioned the current implementation (and outlined a possible way to extend it) was to show that it should not be difficult to ensure JSON validity, which, strictly speaking, will be needed regardless of the custom type implementation (even if the custom type were restricted to Decimal).
On Tue, 13 Aug 2019 at 10:29, Richard Musil <risa2000x@gmail.com> wrote:
My goal is, as already mentioned:
To extend JSON encoder to support custom type to serialize into "JSON real number" (changed "fractional" to "real" to be aligned to Python source internal comment).
This includes support for decimal.Decimal, if it is the custom type the user chooses. I do not understand why you write that supporting Decimal is not my goal.
Support of *just* decimal provides custom type support:
class MyNumber:
    def serialise_to_string(self):
        ...

class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, MyNumber):
            return Decimal(obj.serialise_to_string())
        return super().default(obj)

json.dumps([MyNumber(), MyNumber()], cls=MyEncoder)
By limiting support to Decimal, we do the simplest thing that works, and we ensure that any custom classes get the benefit of having their string serialisation checked (if it can't be converted to Decimal, it's not going to be valid JSON) without the person implementing the custom class having to do anything. In fact, the custom class probably doesn't even have to implement custom JSON support at all, as it's quite likely that its normal str() representation is suitable to be passed to the Decimal constructor.

So defining a special protocol to allow JSON serialisation of numeric classes *other* than Decimal seems like a clear case of YAGNI. As the MyEncoder example above shows, support for Decimal allows the end user to define their own protocol if needed, so even "difficult" cases (should any exist, I can't think why they would) should be covered.

+1 on supporting Decimal, -1 on supporting anything more complex.

Paul
But my proposal is not more complex. Implementing support for Decimal requires exactly the same steps as implementing support for a custom type:

1) Recognize the type in the input data,
2) Get the JSON representation, possibly by using obj.__str__(),
3) Check that the output conforms to the JSON real number spec.

The only difference is that in one case you need a kw to specify the custom type, while in the other you need a Boolean kw to explicitly pull in decimal.Decimal (to avoid an unconditional import of decimal). Then, the decoder already allows a custom type in parse_float, so having the same option on the output seems better to me. I admit, I consider my proposal to be as complex as yours, and a better fit for what is already implemented in the decoder.
On Tue, 13 Aug 2019 at 11:40, Richard Musil <risa2000x@gmail.com> wrote:
But my proposal is not more complex. Implementing support for Decimal requires exactly the same steps as implementing the support for custom type.
1) One has to recognize the type in the input data,
How do you recognise the type? For Decimal, you just check if it's an instance of the stdlib type. How does *your* proposal define which custom types are valid?
2) Get the JSON representation, possibly by using obj.__str__().
If you're using str() then this is easy enough, agreed. But what happens when someone raises an issue saying they want a type that has different str() and JSON formats? Are such types not allowed? Or do we need a "JSON representation" protocol, which defaults to str()? For Decimal, again it's easy, we just use str() and stop.
3) Check the output that it conforms JSON real number spec.
For Decimal, there's no need as we know what the str() representation of Decimal looks like, so we don't need this step at all. For arbitrary types, we do.
The only difference is that in one case you need a kw to specify the custom type, in the other, you need a Boolean kw to explicitly pull in decimal.Decimal (to avoid unconditional import of decimal).
See above for all of the *other* differences. And it's arguable that with Decimal, which is a known stdlib type, we could just support it without needing a flag at all. We could defer the import by doing some heuristic checks when we see an unknown type (for example, if obj.__class__.__name__ != 'Decimal' we don't need to do an import and a full typecheck). But even if we do need a boolean flag to opt in, that's not such a big deal. Whereas with the custom type option, someone is bound to say they want a list of types to support in the same call, etc. More maintenance complexity.
Then, the decoder already allows custom type, in parse_float, so having the same option on the output seems to me better.
The symmetry argument is valid, but IMO pretty weak without an actual example of a case that would benefit from the more general proposal.
I admit, I consider my proposal to be as complex as yours, and better fitting, to what is already implemented in decoder.
Hopefully I've explained why your proposal is more complex. If the benefits justify it, I wouldn't object to extra complexity, but you've still not presented a single example where your proposal allows simpler/better end user code than the alternative of just supporting Decimal. Nor have you demonstrated that your proposal is simpler to teach or explain. Without a motivating use case, I very much prefer the simpler approach.

If you still think your proposal is as simple as just supporting Decimal, I don't know what else to say. We may just have to agree to differ, and I'll leave it to others to judge. If you write the code, and some other core dev chooses to accept it having seen my objections but not been convinced by them, then I'm not going to stop them. But I won't merge code implementing your proposal myself in that case.

Paul
On Tue, 13 Aug 2019, 14:46 Paul Moore, <p.f.moore@gmail.com> wrote:
1) One has to recognize the type in the input data,
How do you recognise the type? For Decimal, you just check if it's an instance of the stdlib type. How does *your* proposal define which custom types are valid?
I expect it will be exactly the same for the custom type: calling isinstance(o, dump_as_float), where "dump_as_float" will be the user-specified type (class). Basically the same way the encoder handles the other (Python) types during the encode.
2) Get the JSON representation, possibly by using obj.__str__().
If you're using str() then this is easy enough, agreed. But what happens when someone raises an issue saying they want a type that has different str() and JSON formats? Are such types not allowed? Or do we need a "JSON representation" protocol, which defaults to str()?
I guess asking the custom type to use __str__() as a means to provide the JSON output is not unreasonable. It is an implicit contract on the same level as asking the same custom type to accept a JSON real number in its constructor (and not, for example, in a function "from_json") when used in the decoder (specified in "parse_float").
For Decimal, again it's easy, we just use str() and stop.
That is basically what we would do in my proposal too.
3) Check the output that it conforms JSON real number spec.
For Decimal, there's no need as we know what the str() representation of Decimal looks like, so we don't need this step at all. For arbitrary types, we do.
There were some comments in this thread that Decimal allows non-JSON numbers (e.g. ". 3"). I am not sure what happens on serialization or if it is really a problem, but I believe Decimal does not officially claim full JSON compliance, and it is not its goal either. And even if it does comply now, assuming that it will behave this way in the future is still risky (unless the maintainer of Decimal is aware of your assumption and commits to honoring it over Decimal's lifetime).

Then, even if Decimal always guaranteed JSON compliance, how would you handle it if someone decided to subclass their custom JSON real number type from Decimal? Will you allow it, and then allow the custom output as well, or try to explicitly prevent it? And what if the client code just imports Decimal and mocks its constructor and __str__() function? We may question the usefulness of such an approach (though I believe the former is pretty legit), but unless you make sure they are really forbidden, they can easily allow "non-allowed" JSON output.

What I am saying is that basing the assumption of the type's output JSON validity on type identification is more tricky than it may appear at first. And it is far more complex to do "right" (if it really concerns you) than running a regex on the type's output and not caring about the type at all. It is easier and clearer both for the implementer, as the code is self-explaining, and for the user, since he just has to care that his type passes the check, and not about the "right origin". So even in your solution I would still vote for using a regex sanity check instead of basing the validity of the JSON output on the identity of Decimal.
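The subclassing loophole mentioned above fits in a couple of lines (SneakyDecimal is of course a contrived, hypothetical example):

```python
from decimal import Decimal

# A contrived subclass showing why type identity alone does not
# guarantee valid output: __str__ can be overridden at will.
class SneakyDecimal(Decimal):
    def __str__(self):
        return "not a number at all"

d = SneakyDecimal("1.5")
print(isinstance(d, Decimal))  # True, so an isinstance() gate lets it through
print(str(d))                  # yet str() would corrupt the JSON document
```

A regex check on the final text catches this case regardless of where the object came from, which is the argument being made.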
The only difference is that in one case you need a kw to specify the custom type, in the other, you need a Boolean kw to explicitly pull in decimal.Decimal (to avoid unconditional import of decimal).
See above for all of the *other* differences. And it's arguable that with Decimal, which is a known stdlib type, we could just support it without needing a flag at all. We could defer the import by doing some heuristic checks when we see an unknown type (for example, if obj.__class__.__name__ != 'Decimal' we don't need to do an import and a full typecheck). But even if we do need a boolean flag to opt in, that's not such a big deal. Whereas with the custom type option, someone is bound to say they want a list of types to support in the same call, etc. More maintenance complexity.
The purpose of the flag is to make explicit what you propose to do implicitly. You can come up with some heuristic, but eventually you would need to import the real Decimal in order to do the full typecheck. I guess it is a matter of taste, but I prefer having explicit control over whether the 'json' module actually uses Decimal or not, but to each his own.
With the custom type solution, the support for a list of custom types comes free (as a courtesy of the function 'isinstance'), as suggested by Joao earlier, because 'isinstance' will be the only place where the "dump_as_float" value will be used.
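The deferred-import heuristic being discussed might look roughly like this (a sketch only; `maybe_decimal` is a hypothetical helper belonging to neither proposal as written):

```python
def maybe_decimal(obj):
    # Cheap name check first: avoids importing decimal at all in the
    # common case where no Decimal ever appears in the data. Note that
    # a subclass with a different class name slips past this fast path,
    # which is part of the trade-off under discussion.
    if type(obj).__name__ != "Decimal":
        return False
    import decimal  # first candidate seen: pay the import cost once
    return isinstance(obj, decimal.Decimal)

print(maybe_decimal(1.25))  # False, without importing decimal
```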
Then, the decoder already allows custom type, in parse_float, so having the same option on the output seems to me better.
The symmetry argument is valid, but IMO pretty weak without an actual example of a case that would benefit from the more general proposal.
Imagine this (my proposal):

json_dict = json.loads(json_in, parse_float=decimal.Decimal)
json_out = json.dumps(json_dict, dump_as_float=decimal.Decimal)

is functionally equivalent to yours:

json_dict = json.loads(json_in, parse_float=decimal.Decimal)
json_out = json.dumps(json_dict, use_decimal=True)

Now assume a custom type (my solution):

json_dict = json.loads(json_in, parse_float=MyFloat)
json_out = json.dumps(json_dict, dump_as_float=MyFloat)

and yours:

json_dict = json.loads(json_in, parse_float=MyFloat)
json_out = json.dumps(json_dict, use_decimal=True, cls=MyFloatEncoder)

where MyFloatEncoder is implemented along the lines of your proposal, plus the client code must also import and use Decimal, even if MyFloat technically does not need it. Apart from the JSON compliance check controversy, where we do not agree, I see no advantage in your proposal, while it makes the intuitive things more convoluted and tricky (both for the client code and the implementation).

We may just have to agree to differ, and I'll leave it to others to judge. If you write the code,

OK, let's agree to that.

Richard
I had originally planned to post the proposal on bpo, but things turned out unexpectedly (for me), so I am returning here. I wrote the patch (for the Python part). If anyone is interested, it is here: https://github.com/python/cpython/compare/master...risa2000:json_patch

The patch follows the original idea of serializing the custom type to JSON, and I believe it is "as simple as it gets", except for the JSON number validity check, which turned out to be problematic. I ran some timeit benchmarks on my code and compared it to simplejson. The test I ran was:

(simplejson)
py -m timeit -s "import simplejson as sjson; from decimal import Decimal; d=[Decimal('1.000000000000000001')]*10000" "sjson.dumps(d)"

(my code)
py -m timeit -s "import json; from decimal import Decimal; d=[Decimal('1.000000000000000001')]*10000" "json.dumps(d, dump_as_number=Decimal)"

Since my code runs in pure Python only, I disabled the C lib in simplejson too. Here are the results:

simplejson - with C code: 50 loops, best of 5: 5.89 msec per loop
simplejson - pure Python: 20 loops, best of 5: 10.5 msec per loop
json_patch (regex check): 10 loops, best of 5: 21.3 msec per loop
json_patch (float check): 20 loops, best of 5: 15.1 msec per loop
json_patch (no check):    50 loops, best of 5: 9.75 msec per loop

The different "checks" mark different _check_json_num implementations (included in the code). The "float check" is used just as an example of something accessible (and possibly faster), but I guess there could be cases which float() accepts but which are not valid JSON numbers. The JSON validity check turned out to be the cause of the performance hit. simplejson does not do any validity check on the Decimal output, so it is on par in performance with "no check" (I guess it is a tad slower because it implements and handles more features in the encoder loop).
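For reference, the two check variants could be approximated as follows (these are reconstructions for illustration; the exact `_check_json_num` implementations live in the linked patch):

```python
import re

# "regex check": match the output against the JSON number grammar.
_NUM_RE = re.compile(r'-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?\Z')

def check_regex(text):
    return _NUM_RE.match(text) is not None

def check_float(text):
    # "float check": accept whatever float() parses. Faster, but float()
    # also accepts spellings that are not valid JSON, e.g. "inf" or "1_0".
    try:
        float(text)
        return True
    except ValueError:
        return False

print(check_regex("inf"), check_float("inf"))  # False True
```

This illustrates the concern above: the float check lets through some non-JSON spellings that the regex check rejects.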
I previously argued with Paul that making an assumption about the object's output validity based on its type is not safe (which I still hold), but making it safe in this particular case presents a performance hit I cannot accept. Or to word it differently: if I should choose between stdlib json and simplejson, while knowing that the stdlib runs 50-100% slower (but safe), I would choose simplejson. From the previous discussion here I also understood that letting the custom type serialize without the validity check is unacceptable for some. Since I am basically indifferent in this matter, I would not argue about it either. Which leaves me with only one possible outcome (which seems to be acceptable) - porting the Decimal handling from simplejson to stdlib.

Apart from the fact that simplejson already has it, so if I need it, I could use simplejson, the other part is that whoever pulled simplejson code into the stdlib either made a deliberate effort to remove this particular functionality (if it was present at the time) or never considered it worthy to add (once it was added to simplejson). The second point is that when looking at the code in the stdlib and in simplejson, it is clear that simplejson has more features (and seems also to be more actively maintained) than the stdlib code, so importing one particular feature into the stdlib just to make it "less inferior" without any additional benefit seems like a waste of time.

Why simplejson remained separate from the main CPython is also a question (I guess there was/is a reason), because it seems like including the code completely and maintaining it inside CPython could be a better use of the resources.

Richard
On Aug 23, 2019, at 00:33, Richard Musil <risa2000x@gmail.com> wrote:
Why simplejson remained separated from the main CPython is also a question (I guess there was/is a reason), because it seems like including the code completely and maintain it inside CPython could be better use of the resources.
The README explains that: “simplejson is the externally maintained development version of the json library included with Python 2.6 and Python 3.0”. This has multiple advantages:

* It can evolve much more quickly than the 18-month stdlib cycle. Notice that it’s gone from 1.0 to 3.16.1 during the time Python has only had 5 releases.
* It can add features that aren’t appropriate for the stdlib.
* It can add features that might be appropriate for the stdlib or might not, to gain experience with real-world use before deciding.
* It works the same way on all of Python 2.5+ and 3.3+ (including PyPy, etc.), rather than new features only being available in the newest Python; if there were no simplejson there would be a backports.json or json35 or whatever module on PyPI instead.

Bob Ippolito has said in the past that simplejson has grown too many options, which can’t be removed for backward compatibility reasons, so it’s a good thing the stdlib doesn’t have all of them. The fact that the stdlib can add _one_ of them, after years of use in the wild, while ignoring all of the others, is exactly why it’s a good thing simplejson exists as a separate project.

But you can’t just port the feature as-is, because I’m pretty sure unconditionally importing decimal is the main reason simplejson is slower to import than json. You will need to add some way around that (whether a simple lazy import on first use, or one of the more clever ideas people have brought up on this thread).

Also, IIRC, it doesn’t do any post-check; it assumes calling str on any Decimal value (after an isfinite check if not allow_nan) produces valid JSON. I think there are unit tests meant to establish that fact, but you’d need to copy those into the stdlib test suite and make the case that their coverage is sufficient, or make some other argument that it’s guaranteed to be safe, or ignore str and write a _decimal_str similar to the existing _float_str, or find a way to validate it that isn’t too slow.
Andrew, thanks for the background. On Fri, Aug 23, 2019 at 8:25 AM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
Also, IIRC, it doesn’t do any post-check; it assumes calling str on any Decimal value (after an isfinite check if not allow_nan) produces valid JSON. I think there are unit tests meant to establish that fact, but you’d need to copy those into the stdlib test suite and make the case that their coverage is sufficient,
That seems like the way to go -- if it's in the stdlib, then any change to Decimal that breaks it would fail the test. So no harm in relying on Decimal's __str__. But, of course, comprehensive test coverage is hard/impossible.
or make some other argument that it’s guaranteed to be safe, or ignore str and write a _decimal_str similar to the existing _float_str,
I'm not sure there is any advantage to that -- it would still require the same comprehensive tests -- unless, of course, Decimal's __str__ does work for all cases. Or if there is a reason to use a different legal representation than Decimal uses.
or find a way to validate it that isn’t too slow.
Almost by definition validation is going to be slower -- probably on order of twice as slow. Validation is a good idea if you are not controlling the input, but theoretically a waste of time if you are. - CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Aug 23, 2019, at 09:45, Christopher Barker <pythonchb@gmail.com> wrote:
Andrew, thanks for the background.
On Fri, Aug 23, 2019 at 8:25 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Also, IIRC, it doesn’t do any post-check; it assumes calling str on any Decimal value (after an isfinite check if not allow_nan) produces valid JSON. I think there are unit tests meant to establish that fact, but you’d need to copy those into the stdlib test suite and make the case that their coverage is sufficient,
That seems like the way to go -- if it's in the stdlib, then any change to Decimal that breaks it would fail the test. So no harm in relying on Decimal's __str__.
But, of course, comprehensive test coverage is hard/impossible.
From a quick glance, it looks like the test cases in simplejson are pretty minimal. (Although they do test something I wouldn’t have thought of—what happens if you reload decimal but not _decimal, which apparently is a serious issue for wsgi or other uses of subinterpreters?) The feature has been in simplejson since 2010, and on by default since 2011, and from a quick glance, the last relevant reported bug (the one that made them add the reload test) was 2012. But good enough for an easily-fixed external project where the upgrade costs to users are minimal and most of them are there because they want “development” features beyond the stdlib may not be good enough for the stdlib… And I’m not even sure what all of the relevant test cases are. Surveying a range of other JSON implementations (in multiple languages) to scavenge their tests might be the best place to start?
or make some other argument that it’s guaranteed to be safe, or ignore str and write a _decimal_str similar to the existing _float_str,
I'm not sure there is any advantage to that -- it would still require the same comprehensive tests -- unless, of course, Decimal's __str__ does work for all cases.
Well, it wouldn’t be easier to _test_, but it might be easier to _argue_. There’s a Haskell implementation that comes with a formal proof that the number encode function can’t produce anything that the number decode function can’t consume. That doesn’t seem likely to be reasonable for Python, but it’s not quite impossible…
or find a way to validate it that isn’t too slow.
Almost by definition validation is going to be slower -- probably on order of twice as slow. Validation is a good idea if you are not controlling the input, but theoretically a waste of time if you are.
Good point. But then twice as slow as a feature that doesn’t exist at all is still a step forward. If someone wants to implement use_decimal with validation for 3.9, and meanwhile keep working on a test suite that will convince everyone sufficiently to allow removing the validation and doubling the speed for 3.10, that might be better than waiting for 3.10 to add the feature. As long as it’s not too slow to actually use in practice, it might still be worth having. Also, keep in mind that, at least in simplejson, using Decimal instead of float already means an 80%-300% slowdown, so people who can’t sacrifice performance for precision already have to come up with other alternatives anyway.
I gave it some thought over the weekend and came to the conclusion that I am not going to go further with it (where "it" means my or anyone else's idea). The reason is that I totally lost any motivation. I feel, however, that a more elaborate answer might be due, and I will try to give one.

The other day (actually before I posted my last reply), I went to the core-mentorship list to get some ideas about how to continue. There was a thread about how people got their first contribution, and while most were positive, there was one post which kind of stood out because it described an unsuccessful attempt which finally led to parting ways. I realized there is no shame in that.

I came here with some rough idea about the JSON module features, but had no clue what the "real" use cases are, what people's expectations are, etc. This thread actually helped me gain more understanding and insight. I thought I had a nice feature in mind, and was wondering what it would take to get it into Python. On the other hand, I did not have any other particular ambitions, like becoming a Python contributor.

During the discussion I realized that there were 3 aspects (of the potential acceptable solution), proposed by 3 different persons, about which they were quite imperative:

1) It must use Decimal (Paul)
2) It must check the validity of the serialized output (Christopher)
3) It must avoid an unconditional import of Decimal (Andrew)

Originally, I thought that I could fulfill 2) and 3) without jeopardizing 1) (my opinion on 1) I already expressed), so I implemented the Python part and ran some performance tests, only to find out that my solution cannot compete in performance with the Decimal solution because of the additional validity check, and I could not promote it anymore. I am not particularly convinced that the validity check is really needed, but I understand why others are requesting it. So the only way to continue seemed to be implementing 1+2+3, and I realized I really did not want to do it.
One reason was that I did not particularly "like" it - which is not meant to be read as if I thought it was wrong to do it this way; I just did not really feel invested in those ideas anymore. The other was that I was no longer able to argue about it, because I had basically no idea if users really need the full validity check, or if the cost of a one-time import of Decimal really outweighs the performance hit of the heuristic for a lazy import, and had to rely on what someone claimed on some mailing list (no disrespect meant). I also realized that implementing this would not give me any advantage over using simplejson, neither in performance nor in features, so it also lost the practical aspect of needing it.

So I guess I am going to leave my patch on github for a while, in case anyone decides to go ahead with 1+2+3. It is not exactly rocket science, but it could save some typing, or be useful if you want to run some quick benchmark. If you supply it with dump_as_number=Decimal, it will behave exactly as the version with hardcoded Decimal (sans lazy import). One thing to note: if you choose to use Decimal for serializing JSON numbers, you will need to handle the case where allow_nan is False, and check that Decimal does not serialize non-finite values (it does in simplejson, as there is no check). It should not have a big impact though, as allow_nan is True by default.

Richard
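The allow_nan caveat can be handled with a short guard; a minimal sketch, assuming the helper name decimal_to_json (hypothetical, not from the patch):

```python
from decimal import Decimal

def decimal_to_json(d, allow_nan=True):
    # Decimal's str() spells non-finite values as 'Infinity', '-Infinity'
    # and 'NaN' -- the same extensions the stdlib float path emits -- so
    # they must be rejected explicitly when allow_nan is off.
    if not allow_nan and not d.is_finite():
        raise ValueError(f"Out of range Decimal is not valid JSON: {d}")
    return str(d)

print(decimal_to_json(Decimal("Infinity")))  # prints Infinity (allowed by default)
```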
On Mon, 26 Aug 2019 at 09:47, Richard Musil <risa2000x@gmail.com> wrote:
I gave it some thought over the weekend and came to the conclusion that I am not going to go further with it (where "it" means my or anyone else's idea). The reason is that I totally lost any motivation. I however feel some more elaborate answer might be due and I will try to give one.
The other day (actually before I posted my last reply), I went to core-mentorship list to get some ideas about how to continue. There was this thread about how people got their first contribution and while most were positive there was one post which kind of stood out because it described an unsuccessful attempt which finally led to parting ways. I realized there is no shame in that.
I came here with some rough idea about the JSON module features, but had no clue what are the "real" use cases, what are peoples' expectations, etc. This thread actually helped me to get more of the understanding and the insight. I thought I had a nice feature in mind, and was wondering what it would take to get it into Python. On the other hand, I did not have any other particular ambitions, like becoming a Python contributor.
Thanks for the feedback, it's both interesting and valuable. Maybe you would be willing to add a pointer to this comment to that discussion on core-mentorship? I'm sure it would be useful information for the people looking to try to remove barriers to entry. And the point that you made, that you weren't coming here with an ambition to become a contributor in any more general sense, is also very relevant, as it's quite possible that people coming here with nothing more than an idea that they'd like to propose may well be put off by the feeling that they have to implement their idea or no-one is willing to listen. (There *is* in reality a problem in that many ideas are fine but pointless unless someone implements them, but that doesn't mean we should block people from just discussing things in the abstract).
During the discussion I realized that there were 3 aspects (of the potential acceptable solution), proposed by 3 different persons, about which they were quite imperative:

1) It must use Decimal (Paul)
2) It must check validity of serialized output (Christopher)
3) It must avoid unconditional import of Decimal (Andrew)
A summary like this is immensely helpful in clarifying both where the discussion has got to, and what the sticking points are. I don't think we do enough on this list to encourage or offer such summaries, or to help new contributors to think in terms of checkpointing the discussion like this. We get so stuck in the technical discussion that we forget that people may *also* need help in the softer skills, like managing the discussion. So we keep throwing ideas into the mix until the contributor is overwhelmed and doesn't know how to proceed.
Originally, I thought that I could fulfill 2) and 3), without jeopardizing 1) (my opinion on 1) I already expressed), so I implemented the Python part and run some performance tests only to find out that my solution cannot compete in performance with Decimal solution because of the additional validity check and I could not promote it anymore. I am not particularly convinced that the validity check is really needed, but I understand why others are requesting it.
So the only way to continue seemed to be implementing 1+2+3 and I realized I really did not want to do it. One reason was I did not particularly "like" it, while it is not meant to be read as that I thought it was wrong to do it this way, I just did not really feel invested in those ideas anymore, the other was, that I was no longer able to argue about it, because I had basically no idea, if the users really need full validity check, or if the cost of one time import of Decimal really overweights the performance hit of the heuristic for a lazy import, and had to rely on what someone claimed on some mailing list (no disrespect meant).
I agree, there's a real risk here that proposals get overwhelmed by additional requirements suggested by other people. And when those other people are long-time contributors, or even more so core devs, it's extremely hard for a newcomer to say that they don't think that such a requirement is necessary. But as you have demonstrated here, those requirements are sometimes mutually incompatible, and at some point, someone has to make a judgement call on the trade-offs. Again, subjecting a newcomer to the need to do that right up front isn't exactly fair or helpful.
I also realized that implementing this would not give me any advantage over using simplejson, neither in the performance nor in the features, so it lost also the practical aspect of needing it.
Fundamentally, this is where a disconnect between people's expectations occurs. Ideas discussed on this list are intended for implementation in future versions of Python - so it's pretty much always the case that anything agreed here will be of no immediate use to the individual proposing it. Historically, Python releases have been every 18 months, so even if we assume that the proposer is an extremely early adopter of new releases, and has no need to support older versions of Python, we're talking about 12-18 months before they can use a new feature. Compare that to *right now* for a PyPI package or a workaround in code.

But people *think* of ideas because they hit a problem of their own. And they come here out of a sense that sharing a possible solution would help the community. It's not very encouraging if they get treated as if they are simply saying "solve my problem for me". We probably need to get better at helping such people to polish their ideas, *without* focusing too heavily on their original problem (or worse still, on criticising their original solution of that problem). The original problem is the initial use case for a new feature, of course, but focusing on "how does your original problem generalise, so we can see what common features a solution should have" rather than on "why do you think your problem is important enough to need solving in the core language/stdlib" (I exaggerate somewhat, but sadly I suspect not a lot :-() would be a much more welcoming approach.
So I guess I am going to leave my patch on github for a while, if anyone decides to go ahead with 1+2+3. It is not exactly a rocket science but could save some typing, or if you want to run some quick benchmark. If you supply it with dump_as_number=Decimal, it would behave exactly as the version with hardcoded Decimal (sans lazy import). One thing to note, if you choose to use Decimal for validating JSON number, you will need to handle the case where allow_nan is False, and check that Decimal does not serialized them (it does in simplejson as there is no check). Should not have a big impact though as allow_nan is True by default.
Thanks. Even if your PR doesn't ultimately get accepted, the discussion was useful, and highlighted the fact that we can't currently write full-precision Decimal values using native JSON (we can round-trip them using custom encoders and object_hook, but that's a non-standard layer on top of base JSON). Sometimes these things take a few rounds of discussion before getting accepted (again, the long-term view is important here). Thanks for both the proposal and subsequent thread, and for the helpful and thought-provoking summary post. Please don't be *too* put off from coming back with any future ideas you may have! Paul PS Your discussion of the 3 constraints people were asking for and how you viewed them and tried to address them, made me think of some other possible approaches that might be productive. But as you've said you don't want to take the proposal further, and I think that's an entirely reasonable position for you to take, I won't push you by re-opening the debate right now. But I'll keep the thoughts in mind for if someone else wants to take the proposal further.
Paul, thanks for an insightful reflection from "the other side", I really appreciate it. It feels like we share some understanding after all. Richard
Paul Moore writes:
[S]ubjecting a newcomer to the need to [deal with extra requirements from other participants] right up front isn't exactly fair or helpful.
But avoiding that is what core-mentorship is for. Perhaps we can advertise that list better, and maybe more people can mentor there. But a proposal that comes to Python-Ideas is going to be presumed the object of public discussion for the benefit of Python, not mentorship for the benefit of newcomers (and in the long run for Python). People will have their say, and much of that will be pretty abrupt from this point of view. I don't think it's a good idea to impede discussion.
But people *think* of ideas because they hit a problem of their own. And they come here out of a sense that sharing a possible solution would help the community. It's not very encouraging if they get treated as if they are simply saying "solve my problem for me".
I don't think I've seen that as a matter of general attitude (though I have seen it said explicitly to obnoxiously pushy would-be contributors or people clearly unwilling or unable to do the work themselves, but unwilling to take "maybe" or "later" or "no" for an answer). Definitely, new contributors get asked why their problem is so important it needs to be in the stdlib or language, but they also get help with that ("what are *your* use cases? are there others that you know of? is the three-line implementation easy to get wrong for some reason?" And closest to "solve my problem for me" is "is this the only specification, or would other use cases prefer a different one?") My feeling is that very few contributors originally come to give to Python what it needs. They're here to give what they need to Python. There's nothing wrong with that. Very often it's a shared need. But as you point out, there are often other requirements, and the bar is higher than many projects. I don't think it's possible to pretend that's not true, outside of mentoring environments.
We probably need to get better at helping such people to polish their ideas, *without* focusing too heavily on their original problem (or worse still, on criticising their original solution of that problem). The original problem is the initial use case for a new feature, of course, but focusing on "how does your original problem generalise, so we can see what common features a solution should have"
In a word, we need to be prepared to mentor every newcomer. Would be nice, but I don't think it's practical. In fact, I think this thread is one where the discussion went about as you say you would want it to. That is, we fairly quickly focused on the issue of precisely representing all valid JSON (specifically high-precision numbers), and it got generalized to __json__ to handle not only Decimal but user-defined types. Then it was pointed out that the generated language will be valid ECMA-404 JSON with no way to specify how to convert back to the original Python types, so Wes suggested JSON-LD. The general response to that was "nope, nope! out of scope!" (for now, anyway), so the proposal reverted to "implement simplejson's use_decimal API", and now it's pretty much tabled as the OP has decided simplejson is sufficient for his purposes. The main problem was at the very start, where we hung up on the red herring of round-tripping JSON, which I agree went on too long, and was in large part extended by Python-Ideas regulars, not the OP. I guess the main takeaway for me is avoiding the red herring. The principle is something pretty straightforward like "find something Python can't do, is doing wrong, or really should do with better performance, in the original issue, and don't discuss how weird the use case is." Even if weird, it's probably simplified and taken out of context. Focus on how it indicates Python can be improved. If there seems to be a TOOWTDI already, describe it and ask if it satisfies the need and, if not, why not.
rather than on "why do you think your problem is important enough to need solving in the core language/stdlib" (I exaggerate somewhat, but sadly I suspect not a lot :-() would be a much more welcoming approach.
The question "why is it important enough?", like most of the points you raise, often exposes a genuine point of difference between the patch the Python project needs, and the proposals of many would-be new contributors. I don't see a way to avoid that discussion, because it's the fundamental reason why we insist on finding use cases, and it underlies the rationale for generalization and finding commonalities. Maybe we can find more pleasant ways to ask that question, but I don't think it's fair to the contributor or to the Python-Ideas regulars to do much work without asking it in some variation or other. Steve
On Tue, Aug 27, 2019 at 8:04 PM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Paul Moore writes:
[S]ubjecting a newcomer to the need to [deal with extra requirements from other participants] right up front isn't exactly fair or helpful.
But avoiding that is what core-mentorship is for. Perhaps we can advertise that list better, and maybe more people can mentor there.
I would like to add my fresh experience with the whole process and why it may seem different from here, compared to how you see it from there. I tried to go by the official doc: https://devguide.python.org/communication/#communication (direct quote): Python-ideas is a mailing list open to the public to discuss ideas on changing Python. If a new idea does not start here (or python-list, discussed below), it will get redirected here. Sometimes people post new ideas to python-list to gather community opinion before heading to python-ideas. The list is also sometimes known as comp.lang.python, the name of the newsgroup it mirrors (it is also known by the abbreviation c.l.py). Which I read as "you may post in python-list, but it will eventually get redirected here". Concerning core-mentorship, from what I read about it (and in it), it did not look like a place to discuss new ideas to me, but rather the place where new contributors look for help when (starting) contributing.
But a proposal that comes to Python-Ideas is going to be presumed the object of public discussion for the benefit of Python
I guess more emphasis (or clearer formulation) could be used in the doc, to explicitly direct people to one place (which I presume should be python-list from what you wrote) and to explain the "process of acceptance" into python-ideas. In retrospect maybe posting in python-list first would give me some ideas and better understanding on what is used (and expected) by others, but I had to post here first to realize it. Richard
On Aug 26, 2019, at 01:47, Richard Musil <risa2000x@gmail.com> wrote:
I gave it some thought over the weekend and came to the conclusion that I am not going to go further with it (where "it" means my or anyone else's idea). The reason is that I totally lost any motivation. I however feel a more elaborate answer might be due, and I will try to give one.
I think this is a pretty good summary. But it’s missing something important. At least two ideas came out of the discussion that could actually get done: adding support for serializing Decimal values to JSON, and adding a different customization hook than the default function (whether that’s a __json__ method like simplejson’s for_json, or something different like a singledispatch function). It’s fine that you don’t want to commit the time to develop either of those ideas further. Neither of them are the idea you started with, and, even if they were, raising an idea doesn’t commit you to volunteering for all the hard work. It may be that nobody wants to do so at this time. But at the very least, it means that next time something related comes up, there’s a good history for someone to look back to. So, I hope you don’t see the whole thing as a waste of your time and everyone else’s.
During the discussion I realized that there were 3 aspects (of the potential acceptable solution), proposed by 3 different persons, about which they were quite imperative:
1) It must use Decimal (Paul)
2) It must check validity of serialized output (Christopher)
3) It must avoid unconditional import of Decimal (Andrew)
I think Christopher would be happy with Decimal serializing without runtime checks _if_ there were sufficient unit tests to prove that Decimal can never serialize anything that’s invalid JSON (except for nan/infinity, and then only if the flag is set). If someone else takes up the idea, I hope they push for that. (I suggested that validity checking for 3.x and then sufficient unit tests to remove those checks in 3.x+1 might be a way forward, but I’m not sure if anyone would want to commit to double the work just to get something in faster.)
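Such tests could be as simple as asserting that str() of any finite Decimal parses back as JSON (the sample values here are mine, purely illustrative):

```python
import decimal
import json

finite_samples = [
    "0", "-1.5", "1E+10", "0.6441726684570313",
    "1.0000000000000000000000000000000000000000000000000000000000001",
]
for s in finite_samples:
    # str() of a finite Decimal should always be a valid JSON number
    json.loads(str(decimal.Decimal(s)))

# the nan/infinity exception mentioned above: these stringify to text
# that strict JSON does not allow as numbers
assert str(decimal.Decimal("NaN")) == "NaN"
assert str(decimal.Decimal("Infinity")) == "Infinity"
```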
I also realized that implementing this would not give me any advantage over using simplejson, neither in the performance nor in the features, so it lost also the practical aspect of needing it.
This is important. Even if you come up with something better than simplejson, it's highly unlikely to actually help you unless you can extract it into a backport that you can publish and maintain on PyPI, or unless you only need the new feature for code whose release date is so far in the future that it can require Python 3.9. And at that point, it's probably easier to get it into simplejson (or, if that's not possible, to fork simplejson), get some experience with the feature in the wild, and only then propose it for stdlib inclusion. This is something most people don't realize when they come to python-ideas with an idea. And, sadly, the way people end up discovering it tends to put them off from doing it at all, rather than encouraging them to do it in the way that has the best chance of success.
Just trying to get a bit of clarification here. We really do need to be clear on the goals in order to choose the best implementation.
well, no -- we did not agree to that.
Then I misread your statement:
"I can't think of, and I don't think anyone else has come up with, any examples other than Decimal that require this "raw" encoding."
I think after I posted that, it became clear that there was interest in being able to preserve the exact encoding in the original JSON -- if that's the case, then strings need to be covered as well.
If your goal is to support full precision of Decimal, then the way to do that is to support serialization of the Decimal type. And that would also allow serialization of any other (numeric) type into a valid JSON number. But it seems your goal is not to be able to serialize Decimal numbers with full precision, but rather, to be able to perfectly preserve the original JSON.
My goal is, as already mentioned:
To extend JSON encoder to support custom type to serialize into "JSON real number" (changed "fractional" to "real" to be aligned to Python source internal comment).
This includes support for decimal.Decimal, if it is the custom type the user chooses. I do not understand why you write that supporting Decimal is not my goal.
I didn't mean that supporting Decimal wasn't a goal, but that it was not the only goal. You have made it clear (from your examples) that you want not just "to support custom type to serialize into "JSON real number"", which could be supported by supporting Decimal, but that you want to be able "to support custom type to serialize into "JSON real number"" while controlling *exactly* how that is done.

You may not have encountered any issues yet where getting full decimal precision wouldn't work, and think you only need to support full precision of decimal numbers (and I say decimal, because that is built in to JSON), but as has been pointed out:

"1.234" and ".1234E+1" and "1234E-3"

are all equally valid JSON encodings for the same value. So if you want to be able to round-trip JSON, or exactly match another encoder, then you need to be able to control exactly how your custom type (or any type) is serialized into a "JSON real number". And if you want to support exactly matching another encoder for all JSON, then you need to support customized encoding of strings, too.

Personally, I don't think that this control is a valid goal of the built-in json module -- but maybe others think it is. The other issue on the table is whether it's a goal of the json module to make it hard to create invalid JSON -- I think that would be nice, but others don't seem to think so.

A NOTE on that: Maybe it would be good to have an optional "validate" flag for the encoder, that would validate the generated JSON. False by default, so it wouldn't affect performance if not used. And users could turn it on only in their test code if desired. I would advocate for something like this if we do give users more control over the encoding. (Though I suppose simply trying to decode the generated JSON in your tests may be good enough.)

-CHB
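As an aside, Decimal itself makes this concrete: it preserves significant digits but normalizes the spelling, so even a Decimal-based encoder could not reproduce all three encodings above (a quick check):

```python
from decimal import Decimal

forms = ["1.234", ".1234E+1", "1234E-3"]
normalized = {f: str(Decimal(f)) for f in forms}
print(normalized)
# all three stringify to '1.234': the value survives, the spelling does not
assert len(set(normalized.values())) == 1
```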
My summary: I support full precision serialization of the Decimal type -- it would extend the json module to more fully support the JSON spec.
My proposal is compatible with this, it is just it doesn't restrict the support to Decimal only.
I don't think the json module should aim to allow users to fully control exactly how a given item is serialized, though if it did, it would be nice if it did a validity check as well.
Allowing the custom serialization into a JSON real number only can be checked relatively easily (as I tried to outline in my previous post).
But I suggest the way forward is to separate out the goals: a) This is what I want to be able to do, from b) This is how I suggest that be achieved. Because you need to know (a) in order to know how to do (b), and because you should make sure the core devs support (a) in the first place.
I believe I have defined my goal. If you think it is lacking in some way, I would try to improve it, let me know how.
The reason why I mentioned the current implementation (and outlined a possible way to extend it) was to show that it should not be difficult to ensure the JSON validity, which, strictly speaking, will be needed regardless of the custom type implementation (even if the custom type was restricted to Decimal).
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Aug 14, 2019, at 09:57, Christopher Barker <pythonchb@gmail.com> wrote:
The other issue on the table is whether it's a goal of the json module to make it hard to create invalid JSON -- I think that would be nice, but others don't seem to think so.
A NOTE on that:
Maybe it would be good to have an optional "validate" flag for the encoder, that would validate the generated JSON. False by default, so it wouldn't affect performance if not used. And users could turn it on only in their test code if desired. I would advocate for something like this if we do give users more control over the encoding.
(though I suppose simply trying to decode the generated JSON in your tests may be good enough)
This is trivial to implement in your own toolbox: just write a wrapper function that does a loads on the result before returning it. And I think users would actually be more often interested in "valid for the way this protocol defines JSON" than "valid according to the RFC", and that's just as easy to do in your own toolbox:

    def dumps(obj):
        result = json.dumps(obj, my_usual_flags…)
        json.loads(result, allow_nan=True, …)
        if '\n' in result or '\r' in result:
            raise ValueError('JSON with newlines')
        if '\u200b' in result or '\u200c' in result:
            raise ValueError('JSON with Unicode separators that JS hates')
        return result

… but it would be a pain to do with a mess of options to dump/dumps/JSONEncoder. So I'm not sure this really needs to be added to the module even if some form of "raw" encode support is added. (To be clear, I agree with you that we probably don't want to add that support in the first place. But, as you say, at least some people disagree, so…)
Richard Musil wrote:
I have found myself in an awkward situation with current (Python 3.7) JSON module. Basically it boils down to how it handles floats. I had been hit on this particular case:
In [31]: float(0.6441726684570313)
Out[31]: 0.6441726684570312
but I guess it really does not matter.
What matters is that I did not find a way to fix it with the standard `json` module. I have the JSON file generated by another program (C++ code, which uses the nlohmann/json library), which serializes one of the floats to the value above. Then when reading this JSON file in my Python code, I can get either a decimal.Decimal object (when specifying `parse_float=decimal.Decimal`) or a float. If I use the latter, the least significant digit is lost in deserialization.
If I use Decimal, the value is preserved, but there seems to be no way to "serialize it back". Writing a custom serializer:
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # <- This becomes quoted in the serialized output
        return super().default(o)
seems to only allow returning a "string" value, but then it serializes it as a string! I.e. with the double quotes. What seems to be missing is the ability to return a "raw textual representation" of the serialized object which will not get mangled further by the `json` module.
You have overridden the wrong method:

$ cat demo.py
import io
import json
import decimal

class DecimalEncoder(json.JSONEncoder):
    def encode(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)
        return super().encode(o)

input = "0.6441726684570313"
obj = json.loads(input, parse_float=decimal.Decimal)
output = json.dumps(obj, cls=DecimalEncoder)
assert input == output

$ python3 demo.py
$
On Aug 8, 2019, at 13:43, __peter__@web.de wrote:
You have overridden the wrong method:
Unfortunately, he hasn’t. Overriding encode isn’t particularly useful.
$ cat demo.py
import io
import json
import decimal

class DecimalEncoder(json.JSONEncoder):
    def encode(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)
        return super().encode(o)

input = "0.6441726684570313"
obj = json.loads(input, parse_float=decimal.Decimal)
output = json.dumps(obj, cls=DecimalEncoder)
assert input == output
Try it with: input = ["0.6441726684570313"] You’ll get a TypeError because Decimal isn’t serializable. The encode method doesn’t get called recursively; instead, a function gets created that calls itself recursively, and encode calls that. So, any hooks you add in encode only affect the top-level value, not values inside arrays or objects.
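A quick way to see this (a sketch against the stdlib json module; the exact TypeError wording varies across versions):

```python
import decimal
import json

class DecimalEncoder(json.JSONEncoder):
    def encode(self, o):
        # only consulted for the top-level value, never recursively
        if isinstance(o, decimal.Decimal):
            return str(o)
        return super().encode(o)

# top-level Decimal: the override works
top = json.loads("0.6441726684570313", parse_float=decimal.Decimal)
print(json.dumps(top, cls=DecimalEncoder))  # 0.6441726684570313

# Decimal inside an array: the override is never reached
nested = json.loads("[0.6441726684570313]", parse_float=decimal.Decimal)
try:
    json.dumps(nested, cls=DecimalEncoder)
except TypeError as e:
    print(e)  # Decimal is not JSON serializable
```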
OK, going back to your original question: On Thu, 8 Aug 2019 at 11:24, Richard Musil <risa2000x@gmail.com> wrote:
If I use Decimal, the value is preserved, but there seems to be no way to "serialize it back". Writing a custom serializer:
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)  # <- This becomes quoted in the serialized output
        return super().default(o)
seems to only allow returning a "string" value, but then it serializes it as a string! I.e. with the double quotes. What seems to be missing is the ability to return a "raw textual representation" of the serialized object which will not get mangled further by the `json` module.
The thing is, if you were allowed to insert arbitrary text (your "raw textual representation") into a JSON stream, the result wouldn't be JSON any more. If you want to dump user-defined types in a JSON stream, you need to do so by defining a structure (defined in terms of the fundamental types that JSON supports) that you can serialise to the JSON stream. That's why the encoder default() method returns an object, not a string. So, for example, you could serialise and deserialise decimals as follows:
>>> import json
>>> import decimal
>>> class DecimalEncoder(json.JSONEncoder):
...     def default(self, o):
...         if isinstance(o, decimal.Decimal):
...             # Return a JSON "object" with 2 attributes: type="Decimal", value=the string representation of the value
...             return {"type": "Decimal", "value": str(o)}
...         return super().default(o)
...
>>> def as_decimal(dct):
...     if dct.get("type") == "Decimal" and "value" in dct:
...         return decimal.Decimal(dct["value"])
...     return dct
...
>>> x = json.dumps([12, 34.56, decimal.Decimal("789.876")], cls=DecimalEncoder)
>>> x
'[12, 34.56, {"type": "Decimal", "value": "789.876"}]'
>>> json.loads(x, object_hook=as_decimal)
[12, 34.56, Decimal('789.876')]
What you *can't* do is change how the fundamental types are represented, or have a specialised deserialiser for anything that's not a JSON "object" type. The point of the customisation facility is to transport non-JSON types via a JSON stream, retaining type information. Not to modify how the JSON stream is represented as a string. So no, I don't think there's any fundamental asymmetry here, nor do I think there's anything particular missing[1]. I just think you've misunderstood the purpose behind the customisation facilities in the json module. Paul [1] Assuming you accept that controlling the exact serialisation of fundamental JSON types is a non-goal, which you claim you do accept, even though your use case seems based on the idea that you want to be able to serialise floating point numbers in a consistent format across different JSON libraries...
I have not asked for means to serialize invalid JSON objects. Yes, with the "raw" output you can create invalid JSON, but it does not mean you have to. Let's take a look at it from a different POV and focus on the original problem. Imagine this situation, I have a JSON string with this value:
```
msg2 = '{"val": 1.0000000000000000000000000000000000000000000000000000000000001}'
```
This is a perfectly valid JSON representation of the float. I can parse it with the standard module with default float handling and get this:
```
json:orig = {"val": 1.0000000000000000000000000000000000000000000000000000000000001}
json:pyth = {'val': 1.0}
json:seri = {"val": 1.0}
```
i.e. Python chooses to represent it as 1.0 (which I guess is the closest to the original value) and then serializes it as such into the output. But I can use the `parse_float=decimal.Decimal` option of the standard module and get this (with the custom encoder encoding it into a string):
```
dson:orig = {"val": 1.0000000000000000000000000000000000000000000000000000000000001}
dson:pyth = {'val': Decimal('1.0000000000000000000000000000000000000000000000000000000000001')}
dson:seri = {"val": "1.0000000000000000000000000000000000000000000000000000000000001"}
```
There is nothing wrong with the float in the original JSON and there is nothing wrong with representing that float with the decimal.Decimal type either. What is missing is just the corresponding encoder support. I want to be able to serialize the Decimal object into its JSON float form, not into a string or some custom map.
Which is exactly what happens with `simplejson` and `use_decimal=True` set:
```
sjson:orig = {"val": 1.0000000000000000000000000000000000000000000000000000000000001}
sjson:pyth = {'val': Decimal('1.0000000000000000000000000000000000000000000000000000000000001')}
sjson:seri = {"val": 1.0000000000000000000000000000000000000000000000000000000000001}
```
And this is what is missing in the current implementation, and there seem to be different ways to address it, as Angelo pointed out. At the moment the standard module basically does not allow creating a perfectly valid JSON float which does not have a valid binary representation, while at the same time it allows decoding such a float (without precision loss, with decimal.Decimal). This is the asymmetry I am talking about. Why should the output be limited by the underlying implementation, while the input is not?
On Thu, 8 Aug 2019 at 22:31, Richard Musil <risa2000x@gmail.com> wrote:
I have not asked for means to serialize invalid JSON objects. Yes, with the "raw" output, you can create invalid JSON, but it does not mean you have to.
True. But my point was simply that the json module appears to be designed in a way that protects against the possibility of ending up with invalid JSON, and that seems like a reasonable design principle. I'm not arguing that there should not ever be such a feature, just that in the absence of a need for it (see below), designing for safety seems like a good choice.
Let's take a look at it from a different POV and focus on the original problem. Imagine this situation, I have a JSON string with this value:
```
msg2 = '{"val": 1.0000000000000000000000000000000000000000000000000000000000001}'
```
This is a perfectly valid JSON representation of the float. I can parse it with the standard module with default float handling and get this:
```
json:orig = {"val": 1.0000000000000000000000000000000000000000000000000000000000001}
json:pyth = {'val': 1.0}
json:seri = {"val": 1.0}
```
i.e. Python chooses to represent it as 1.0 (which I guess is the closest to the original value) and then serializes it as such into the output. But I can use the `parse_float=decimal.Decimal` option of the standard module and get this (with the custom encoder encoding it into a string):
```
dson:orig = {"val": 1.0000000000000000000000000000000000000000000000000000000000001}
dson:pyth = {'val': Decimal('1.0000000000000000000000000000000000000000000000000000000000001')}
dson:seri = {"val": "1.0000000000000000000000000000000000000000000000000000000000001"}
```
There is nothing wrong with the float in the original JSON and there is nothing wrong with representing that float with the decimal.Decimal type either.
OK, and so far there's nothing that I would describe as a "problem".
What is missing is just corresponding encoder support. I want to be able to serialize the Decimal object into its JSON float form, not into a string or some custom map.
OK, so you're saying that this is "the original problem". Fine, but my response would be that without a reasonable use case for this, it's just how the json module works, and not a "problem". It's only a problem if someone wants to do something specific, and can't. The only use case you have given for needing this capability is to produce identical output when round tripping from JSON to objects and back to JSON. But whenever people have pushed back on the validity of this requirement, you've said that's not the point. So OK, if it's not the point, where's your use case that makes the lack of the encoder support you're requesting a problem? Basically you can't have it both ways - either persuade people that your use case (identical output) is valid, or present a different use case. Paul
Please, let me explain once more what I want. I need a JSON codec (serializer-deserializer) which can preserve its input on its output. After some thought it seems to really concern only the floats. (The attempt to generalize this should be regarded as secondary - just a suggestion subject to comments.)

What you, Andrew and possibly Chris have probably understood is that I was asking for a JSON codec which would ensure the same bit-to-bit (or byte-to-byte) output as any other JSON codec out there. This never was my point, as, in my first post in this thread, I already gave an example of two different serializers which give different results, which directly invalidates such a request. And, as Chris pointed out in one of his replies, it is simply a matter of choice and neither is "right" or "wrong".

Second, the fact that the JSON codec (I am looking for) should be able to preserve its inputs in its outputs byte-to-byte does not necessarily mean that the (default) standard implementation must do that. As, again using the same example I gave above, I am fine with the default implementation, which chooses the platform's native binary float for the JSON float representation and is as such subject to the consequences already discussed here. The only thing my "request" requires from the standard module is to allow serializing decimal.Decimal or, in a broader scope, some custom type, in a way that preserves its input in its output byte-to-byte. And this and only this is exactly the kind of byte-to-byte precision I need from the implementation. Or better said, I need adequate support from the standard implementation to make it possible.
After thinking it through, it seems to me that the current implementation with the JSONEncoder.default override satisfies this requirement for any custom type except decimal.Decimal, because of how it handles the custom value provided by the custom serializer and because decimal.Decimal is in fact a representation of a native JSON type (float). There is no way to serialize a float which was deserialized into decimal.Decimal as a float again and have exactly the same value (literally) in the output as was in the input. Serializing it into a custom map or a string might be acceptable for some other application (though one would wonder why to do that to something which is by nature a native JSON type with a perfectly valid representation of its own), but it is not acceptable in my case for the reasons stated above.
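The asymmetry can be shown with the stdlib alone; the `default=str` fallback below is just the obvious workaround, not anything proposed in the thread:

```python
import decimal
import json

src = '{"val": 0.6441726684570313}'
obj = json.loads(src, parse_float=decimal.Decimal)  # full precision preserved

try:
    json.dumps(obj)  # there is no way back to a JSON number...
except TypeError as e:
    print(e)

# ...and the closest stdlib escape hatch quotes the value, changing the document:
print(json.dumps(obj, default=str))
# {"val": "0.6441726684570313"}
```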
@Richard I suggest you open a new thread to discuss support for dumping `Decimal` as a number in JSON. I agree that allowing arbitrary strings to be inserted in the JSON is not a good idea, but apart from the `Decimal` issue I can't think of any other case that can't be solved via `JSONEncoder.default`. However, there is an asymmetry with respect to parsing / dumping numbers. The JSON specs don't limit numbers to any precision and the `json` module can parse numbers of arbitrary precision. Python further supports arbitrary precision through `Decimal`, but any JSON dump is limited to the underlying binary `float` representation. As the OP indicates, one can parse `0.6441726684570313` to `Decimal` (preserving the precision) but there's no way to dump it back. This seems like a limitation of the `json` implementation. For dumping `Decimal` without importing the `decimal` module, since this type seems to be the only exception, can't we just dump any non-default type as `str(o)` and check whether it can be parsed back to float:
```
# if not any of the default types
dump_val = str(o)
try:
    float(dump_val)
except ValueError:
    raise TypeError('... is not JSON serializable') from None
```
@Dominik, I believe your proposal (if I understand it correctly) will break things for people who are already serializing decimal.Decimal into a string (for lack of better support).
On second thought, though, what you propose should work, if:

1) `json.dumps` is extended to accept a `dump_float` keyword argument, which accepts a custom class (for example `decimal.Decimal`, but it could be the same one which was specified with `parse_float`).
2) The serializer, when seeing this particular type, does the string conversion of the object, checks for a valid float as you suggest (as it is supposed to be dumping a float after all), and then dumps the string into the output _but without the double quotes_.
So - I think it is clear by now that in your case you really are not losing any precision with these numbers. However, as noted, there is no way to customize the Python JSON encoder to encode an arbitrary decimal number in JSON, even though the standard does allow it, and Python supports them via decimal.Decimal. Short of a long-term solution, like a `__json__` protocol, or at least special support in the Python json module for objects of type `numbers.Number`, the only way to go is, as you are asking, being able to insert "raw strings" into the JSON.

Given that this feature can be needed now, I fashioned a JsonEncoder class that is able to do that - by annotating decimal.Decimal instances on encoding, and making raw string replacements before returning the final encoded value. This recipe is ready to be used at https://gist.github.com/jsbueno/5f5d200fd77dd1233c3063ad6ecb2eee (Note that I don't consider this approach fit for the stdlib due to having to rely on regular expressions, and having to create a copy of the whole encoded JSON body - if there is demand, I might package it, though). Please enjoy.

js -><-
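The annotate-and-replace idea described above can be sketched in a few lines (this is a minimal illustration, not the actual gist; the encoder name and the sentinel tag are made up, and the tag is assumed never to occur in real string data):

```python
import decimal
import json
import re

class RawDecimalEncoder(json.JSONEncoder):
    """Sketch: emit Decimals as tagged strings, then strip the tags
    (and the surrounding quotes) from the final encoded document."""
    TAG = '__raw_decimal__'  # assumption: never appears in real data

    def default(self, o):
        if isinstance(o, decimal.Decimal):
            # Annotate: serialize the Decimal as a tagged JSON string.
            return self.TAG + str(o)
        return super().default(o)

    def encode(self, o):
        encoded = super().encode(o)
        # Replace the tagged, quoted string with a bare JSON number.
        return re.sub(r'"%s([^"]+)"' % re.escape(self.TAG), r'\1', encoded)

print(json.dumps({'x': decimal.Decimal('0.6441726684570313')},
                 cls=RawDecimalEncoder))
# {"x": 0.6441726684570313}
```

As the author notes, this relies on regular expressions and copies the whole encoded body, so it is a workaround rather than stdlib material.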
Joao S. O. Bueno wrote:
However, as noted, there is no way to customize the Python JSON encoder to encode an arbitrary decimal number in JSON, even though the standard does allow it, and Python supports them via decimal.Decimal. Short of a long-term solution, like a `__json__` protocol, or at least special support in the Python json module for objects of type `numbers.Number`, the only way to go is, as you are asking, being able to insert "raw strings" into the JSON.
Would the approach I outlined in my answer to Dominik be acceptable?

1) Add a keyword argument to `json.dump(s)` called `dump_float`, which will act as a counterpart to the `parse_float` keyword argument in `json.load(s)`. The argument will accept a custom type (class) for the user's "float" representation (for example `decimal.Decimal`).

2) If specified by the client code, JSONEncoder, when identifying an object of that type in the input data, will encode it using the special rule suggested by Dominik:

```
# if o is a custom float type
if isinstance(o, <dump_float type>):
    dump_val = str(o)
    try:
        float(dump_val)
    except ValueError:
        raise TypeError('... is not JSON serializable float number') from None
    <write dump_val to the stream>
```

This would have the following implications/consequences:

1) str(o) may return an invalid float, but the check will not let it into the stream.
2) The contract between the custom float class implementation and the standard `json` module will be pretty clear - it must implement the serialization in its `__str__` function and must return a valid float.
3) The standard implementation does not need to `import decimal`. If the client code needs this feature, it will `import decimal` itself.
4) The definition of which class/type objects should be handled by this rule will be pretty clear: it will be the only one specified in the `dump_float` argument (if specified at all).
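The proposed rule can be tried out as a standalone helper (the function name is hypothetical; it is not part of the `json` module, just a sketch of the handling the proposal describes):

```python
import decimal

def encode_custom_float(o, dump_float):
    """Hypothetical sketch of the proposed dump_float handling."""
    if isinstance(o, dump_float):
        dump_val = str(o)
        try:
            # Validate the contract: str(o) must be a valid float literal.
            float(dump_val)
        except ValueError:
            raise TypeError(
                f'{dump_val!r} is not a JSON serializable float number'
            ) from None
        # In the real encoder this would be written to the stream
        # without surrounding quotes.
        return dump_val
    raise TypeError(f'{o!r} is not JSON serializable')

print(encode_custom_float(decimal.Decimal('0.6441726684570313'),
                          decimal.Decimal))
# 0.6441726684570313
```

Note that for `decimal.Decimal` itself the `float()` check can only fail in exotic cases; its real purpose is to police arbitrary user-supplied classes.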
On Fri, 9 Aug 2019 at 14:49, Richard Musil <risa2000x@gmail.com> wrote:
Joao S. O. Bueno wrote:
However, as noted, there is no way to customize the Python JSON encoder to encode an arbitrary decimal number in JSON, even though the standard does allow it, and Python supports them via decimal.Decimal. Short of a long-term solution, like a `__json__` protocol, or at least special support in the Python json module for objects of type `numbers.Number`, the only way to go is, as you are asking, being able to insert "raw strings" into the JSON.
Would the approach I outlined in my answer to Dominik be acceptable?:
1) Add a keyword argument to `json.dump(s)` called `dump_float`, which will act as a counterpart to the `parse_float` keyword argument in `json.load(s)`. The argument will accept a custom type (class) for the user's "float" representation (for example `decimal.Decimal`).
Yes, just that it should be called `dump_as_float` and take either a class or a tuple-of-classes (or maybe just another argument that, when set to True, would work for any object for which `isinstance(obj, numbers.Number)` is True).
2) If specified by the client code, JSONEncoder, when identifying an object of that type in the input data, will encode it using the special rule suggested by Dominik:

```
# if o is a custom float type
if isinstance(o, <dump_float type>):
    dump_val = str(o)
    try:
        float(dump_val)
    except ValueError:
        raise TypeError('... is not JSON serializable float number') from None
    <write dump_val to the stream>
```

This would have the following implications/consequences:

1) str(o) may return an invalid float, but the check will not let it into the stream.
2) The contract between the custom float class implementation and the standard `json` module will be pretty clear - it must implement the serialization in its `__str__` function and must return a valid float.
3) The standard implementation does not need to `import decimal`. If the client code needs this feature, it will `import decimal` itself.
4) The definition of which class/type objects should be handled by this rule will be pretty clear: it will be the only one specified in the `dump_float` argument (if specified at all).
Yes, but I don't know if the reverse float check is over-policing it - it is not the role of the language or its libraries to ensure, at any cost, that the encoded output is valid JSON. So maybe emit a warning there, but raising TypeError will only make someone intending to encode numbers as hexadecimal in her custom JSON pop up here crying tomorrow.
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/7HCLSK... Code of Conduct: http://python.org/psf/codeofconduct/
Joao S. O. Bueno wrote:
yes, just that it should be called dump_as_float and take either a class or a tuple-of-classes
I saw a kind of symmetry with `parse_float`, which only accepts one class, in having only one class on the output side as well. Besides, there are probably not many different ways to write a custom JSON float type in one application. But I cannot see any argument against having a tuple.
(or maybe just another argument that when set to "True" would work for any object for which isinstance(obj, numbers.Number) is True)
I cannot verify it right now, but if integers (or big ints) are derived from `numbers.Number`, then it would not work as a distinction for floats. Big ints are already handled correctly by the standard module.
is not the role of the language or its libraries to ensure, at any cost, that the encoded output is valid JSON. So maybe emit a warning there, but
From the other responses I got the impression that ensuring the validity of the output was an important part of the standard implementation. But regardless of that, here I believe the check with `float(dump_val)` is actually a check to validate the contract with the custom serializer, which seems reasonable; whether it should be an error or a warning I have no idea. I hope Andrew or Paul can comment on that.
raising TypeError will only make someone intending to encode numbers as hexadecimal in her custom JSON pop up here crying tomorrow.
I am not sure a hexadecimal representation is officially recognized as a number in JSON (and as a float number in particular), so in that case she is probably already encoding it as a string.
On Fri, 9 Aug 2019 at 19:53, Richard Musil <risa2000x@gmail.com> wrote:
Joao S. O. Bueno wrote:
yes, just that it should be called dump_as_float and take either a class or a tuple-of-classes
I saw a kind of symmetry with `parse_float`, which only accepts one class, in having only one class on the output side as well. Besides, there are probably not many different ways to write a custom JSON float type in one application. But I cannot see any argument against having a tuple.
(or maybe just another argument that when set to "True" would work for any object for which isinstance(obj, numbers.Number) is True)
I cannot verify it right now, but if integers (or big ints) are derived from `numbers.Number`, then it would not work as a distinction for floats. Big ints are already handled correctly by the standard module.
Maybe we are all just over-discussing a lot of things that could be solved in a straightforward way by allowing decimal.Decimal to be JSON-encoded as a number by default, with no drawbacks whatsoever. The fact that arbitrarily long integers are encoded with no complaints up to now seems to indicate so. [clip]
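The big-integer behaviour referred to above is easy to confirm: the stdlib encoder already emits arbitrarily long integers exactly, even though they exceed any fixed-width binary representation.

```python
import json

# Arbitrarily long integers round-trip through the stdlib json module.
big = 10**40 + 7
encoded = json.dumps(big)
assert json.loads(encoded) == big
print(encoded)  # the full 41-digit decimal literal, no precision loss
```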
Joao S. O. Bueno wrote:
Short of a long-term solution, like a `__json__` protocol, or at least special support in the Python json module for objects of type `numbers.Number`, the only way to go is, as you are asking, being able to insert "raw strings" into the JSON.
No, that's not the only way. It would be sufficient just to add a "use decimal" option to the stdlib json module. Since Python's Decimal preserves all the information in the JSON representation of a float (including trailing zeroes!), anything else you might want can be achieved by pre/postprocessing the decoded data structure.

It's just occurred to me that this isn't quite true, since Decimal doesn't preserve *leading* zeroes. But an application that cared about that would be pretty weird. So I think a "use decimal" option would cover the vast majority of use cases.

-- Greg
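The point about trailing zeroes can be checked directly: `Decimal` keeps them from the JSON text, while a plain float parse collapses them on re-serialization.

```python
import json
from decimal import Decimal

# Decimal preserves the trailing zeroes of the JSON literal...
d = json.loads('0.6400', parse_float=Decimal)
print(str(d))                            # 0.6400

# ...while a plain float parse collapses them when dumped back.
print(json.dumps(json.loads('0.6400')))  # 0.64
```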
On Aug 9, 2019, at 17:44, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Since Python's Decimal preserves all the information in the JSON representation of a float (including trailing zeroes!), anything else you might want can be achieved by pre/postprocessing the decoded data structure.
It's just occurred to me that this isn't quite true, since Decimal doesn't preserve *leading* zeroes.
That one isn’t a problem, because JSON doesn’t allow leading zeroes. If you want to use 00.012 vs. 0.012 to say something meaningful about your precision, you’d have to encode that meaning in JSON as something like 0.0012E1 vs. 0.012E0. Of course Decimal will throw away _that_ distinction, but only for the same reason it throws away all of the distinctions between 0.012E0 vs. 0.012 vs 1.2E-2 vs. 1.2e-2 and so on, so you’re only hitting the everyday problem that Decimal can’t solve for anyone; there’s no additional problem that Decimal can’t solve here that’s only faced by your weird app.
But an application that cared about that would be pretty weird. So I think a "use decimal" option would cover the vast majority of use cases.
Greg Ewing wrote:
No, that's not the only way. It would be sufficient just to add a "use decimal" option to the stdlib json module. Since Python's Decimal preserves all the information in the JSON representation of a float (including trailing zeroes!), anything else you might want can be achieved by pre/postprocessing the decoded data structure.
Actually it occurred to me that the best option (for my case) would be to store the original textual representation in the custom type as well (I am not sure decimal.Decimal does that, and cannot verify it right now), and then use `dump_as_float` (discussed earlier) to register the custom float type with the serializer, whose serialization function (e.g. `__str__`) would return the original form from the input. This could easily be done by subclassing `decimal.Decimal`, then specifying this subclass in `json.load`'s `parse_float` argument and passing the same type to `dump_as_float` of `json.dump`. This would resolve all the issues mentioned by Andrew, e.g.:

```
In [7]: decimal.Decimal('1.0000E+3')
Out[7]: Decimal('1000.0')
```
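The parsing half of this idea already works today and can be sketched as follows (the class name is made up; the proposed `dump_as_float` hook does not exist, so only the round-trip through `parse_float` is shown):

```python
import decimal
import json

class RawDecimal(decimal.Decimal):
    """Hypothetical Decimal subclass that remembers the literal JSON text,
    so a future dump_as_float hook could emit it back verbatim."""
    def __new__(cls, value):
        self = super().__new__(cls, value)
        self._raw = value  # keep the exact source text
        return self

    def __str__(self):
        return self._raw

d = json.loads('{"x": 1.0000E+3}', parse_float=RawDecimal)
print(str(d['x']))     # 1.0000E+3  - literal form preserved
print(d['x'] == 1000)  # True       - numeric value intact
```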
participants (21)
- __peter__@web.de
- Andrew Barnert
- Antoine Pitrou
- Chris Angelico
- Christopher Barker
- David Shawley
- Dominik Vilsmeier
- Eric V. Smith
- Greg Ewing
- Joao S. O. Bueno
- Jonathan Fine
- Kyle Stanley
- Paul Moore
- Random832
- Rhodri James
- Richard Musil
- Rob Cliffe
- Ronald Oussoren
- Stephen J. Turnbull
- Steven D'Aprano
- Wes Turner