On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou email@example.com wrote:
As for reading/writing bytes over the wire, JSON is often used in the same context as HTML: you are supposed to know the charset and decode/encode the payload using that charset. However, the RFC specifies a default encoding of utf-8. (*)
That is one short and sweet RFC. :-)
The RFC also specifies a discrimination algorithm for non-supersets of ASCII (“Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.”), but it is not implemented in the json module:
Given that the RFC specifies that the encoding used should be one of the encodings defined by Unicode, wouldn't it be a better idea to remove the "unicode" support instead? To me, it would make sense to use the detection algorithms for Unicode to sniff the encoding of the JSON stream and then use the detected encoding to decode the strings embedded in the JSON stream.
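To make the idea concrete, here is a rough sketch of the null-pattern check quoted above. This is not what the json module does; the names sniff_json_encoding and loads_bytes are hypothetical, and falling back to utf-8 for inputs shorter than four octets is my own assumption, since the quoted text does not cover that case.

```python
import json

# Null patterns in the first four octets, per the RFC text quoted above.
# True marks a null octet, False a non-null one.
_NULL_PATTERNS = {
    (True, True, True, False): "utf-32-be",   # 00 00 00 xx
    (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",   # xx 00 00 00
    (False, True, False, True): "utf-16-le",  # xx 00 xx 00
}


def sniff_json_encoding(data):
    """Guess the encoding of a JSON octet stream (hypothetical helper)."""
    head = bytes(data[:4])
    if len(head) < 4:
        # Assumption: too short to apply the pattern, use the default.
        return "utf-8"
    nulls = tuple(octet == 0 for octet in head)
    # Any other pattern (xx xx xx xx in the RFC) is treated as UTF-8.
    return _NULL_PATTERNS.get(nulls, "utf-8")


def loads_bytes(data):
    """Decode with the sniffed encoding, then parse (hypothetical helper)."""
    return json.loads(bytes(data).decode(sniff_json_encoding(data)))
```

With something like this, loads_bytes('{"a": 1}'.encode("utf-16-le")) would parse the same way as the utf-8 encoded payload. Note that it does not handle a leading byte order mark, which the quoted algorithm does not mention either.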
Cheers, -- Alexandre