
On Aug 9, 2019, at 07:25, Richard Musil <risa2000x@gmail.com> wrote:
There is no "normalized" representation for JSON. If you look at the "standard" it is pretty simple (json.org). The JSON object is defined solely by its textual representation (string of characters).
json.org is not the standard; RFC 8259 is the standard.* The description at json.org is not just informal and incomplete, it's wrong about multiple things. For example, it assumes that all Unicode characters can be escaped with \uxxxx, that JSON strings are a subset of JS strings despite allowing two characters that JS doesn't, and that its float format is the same as C's except for octal and hex, when it isn't. The RFC nails down all the details, fixes all of the mistakes, and, most relevant here, makes recommendations about what JSON implementations should do if they care about interoperability. None of these are "MUST" recommendations, so you can still call an implementation conforming if it ignores them, but it's still not a good idea to ignore them.**

And it's not just your two examples that are different representations of the same float value that implementations should treat as (approximately, to the usual limits of IEEE) equal. So are 100000000.0 and +1E8 and 1.0e+08. If you've received the number 1E8 and write it back, you're going to get 100000000.0. Storing it in Decimal instead of float doesn't magically fix that (there's a quick sketch below, just before the footnotes). The fact that it does fix the one example you've run into so far doesn't mean that it guarantees byte-for-byte*** round-tripping, just that it happens to give you byte-for-byte round-tripping in that one example. Arguing that we must allow it because we have to allow people to guarantee byte-for-byte round-tripping is effectively arguing that we have to actively mislead people into thinking we're making a guarantee that we can't make.

And even if it did solve the problem for numbers, you'd still have the problem of different choices for which characters to escape, different Unicode normalizations in strings, different order of object members, different handling for repeated keys, different separator whitespace in arrays and objects, and so on. All of these are just as much a problem for round-tripping a different library's JSON as they are for generating the same thing from scratch.

Other text-based formats solve the hashing problem by specifying a canonical form: you can't guarantee that any implementation can round-trip another implementation's output, but you can guarantee that any implementation that claims to support the canonical form will produce the same canonical form for the same inputs, so you just hash the canonical form. But JSON intentionally does not have a canonical form. The intended way to solve the hashing problem is to not use JSON.

Finally, all of your proposals for solving this so far either allow people to insert arbitrary strings into JSON, or don't work. For example, having default return bytes to signal "insert these characters into the stream instead of encoding this value"**** explicitly lets people insert anything they want, even as you say you don't want to allow that.

I don't get why you aren't just proposing that the stdlib adopt simplejson's use_decimal flag (and trying to figure out a way to make that work) instead of trying to invent something new and more general and more complicated even though it can't actually be used for anything but Decimal. But again, use_decimal does not solve the round-tripping, canonicalization, or hashing problem at all, so if your argument is that we should include it to solve that problem, you need a new argument.
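Here's that sketch of the 1E8 point at a Python prompt, using only stdlib json and plain Decimal (nothing simplejson-specific is assumed here):

    >>> import json
    >>> json.dumps(json.loads('1E8'))   # parsed as a float, written back normalized
    '100000000.0'
    >>> from decimal import Decimal
    >>> str(Decimal('+1E8'))            # even Decimal normalizes the textual form
    '1E+8'

Either way, what comes back out is a different byte sequence from what went in, even though it's the same value.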
---

* Python, and I believe simplejson, still conform to RFC 7159, which was the standard until December 2017, so if you wanted to make an argument from 7159, that _might_ be valid, although I think if there were a significant difference, people would be more likely to take that as a reason to update to 8259.

** And in fact, even the informal description doesn't say what you want. It describes numbers as "very much like a C or Java number". C and Java numbers are explicitly representations of the underlying machine type (for C) or of IEEE binary64 (for Java).

*** I don't get why you're obsessed with the question of bit-for-bit vs. byte-for-byte. Are you worried about behavior on an ancient PDP-7 where you have 9-bit bytes but ignore the top bit for characters or something?

**** And why would bytes ever mean "exactly these Unicode characters"? That's what str means; bytes doesn't, and in fact can't unless you have an out-of-band-specified encoding; that's half the reason we have Python 3. And even if you did confusingly specify that bytes can be used for "exactly the characters these bytes decode to as UTF-8", that still doesn't specify what happens if the bytes object has, say, a newline, or a backslash followed by an a, or a non-BMP Unicode character. The fact that you weren't expecting to give it any of those things doesn't mean you don't have to design what happens if someone else does.
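For anyone following along with ****: in the stdlib today, default is expected to hand back another value the encoder already knows how to serialize, not raw text to splice into the output stream, which is exactly why the proposal reaches for bytes in the first place. A minimal sketch of the current behavior (the Decimal-to-float conversion here is just an illustration, not a recommendation):

    import json
    from decimal import Decimal

    def default(obj):
        # json.dumps expects default() to return a value it already knows how
        # to encode; there is no way to return "literal output text" today.
        if isinstance(obj, Decimal):
            return float(obj)  # the original textual form is lost here
        raise TypeError(f"Cannot serialize {obj!r}")

    print(json.dumps({"x": Decimal("1E8")}, default=default))
    # {"x": 100000000.0}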