Putting Unicode characters in JSON
Richard Damon
Richard at Damon-Family.org
Fri Mar 23 10:12:03 EDT 2018
On 3/23/18 6:35 AM, Chris Angelico wrote:
> On Fri, Mar 23, 2018 at 9:29 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Fri, 23 Mar 2018 18:35:20 +1100, Chris Angelico wrote:
>>
>>> That doesn't seem to be a strictly-correct Latin-1 decoder, then. There
>>> are a number of unassigned byte values in ISO-8859-1.
>> That's incorrect, but I don't blame you for getting it wrong. Who thought
>> that it was a good idea to distinguish between "ISO 8859-1" and
>> "ISO-8859-1" as two related but distinct encodings?
>>
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>
>> The old ISO 8859-1 standard, the one with undefined values, is mostly of
>> historical interest. For the last twenty years or so, anyone talking
>> about either Latin-1 or ISO-8859-1 (with or without dashes) almost
>> certainly means the 1992 IANA superset version which defines all 256 characters:
>>
>> "In 1992, the IANA registered the character map ISO_8859-1:1987,
>> more commonly known by its preferred MIME name of ISO-8859-1
>> (note the extra hyphen over ISO 8859-1), a superset of ISO
>> 8859-1, for use on the Internet. This map assigns the C0 and C1
>> control characters to the unassigned code values thus provides
>> for 256 characters via every possible 8-bit value."
>>
>>
>> Either that, or they actually mean Windows-1252, but let's not go there.
>>
> Wait, whaaa.......
>
> Though in my own defense, MySQL itself seems to have a bit of a
> problem with encoding names. Its "utf8" is actually "UTF-8 with a
> maximum of three bytes per character", in contrast to "utf8mb4" which
> is, well, UTF-8.
>
> In any case, abusing "Latin-1" to store binary data is still wrong.
> That's what BLOB is for.
>
> ChrisA
One comment on this whole argument: the original poster asked how to get
data from a database that WAS using Latin-1 encoding into JSON (which
wants UTF-8 encoding), and in particular whether anything needed to be
done beyond using .decode('Latin-1'), such as also calling
.encode('UTF-8'). The answer should be a simple Yes or No.
Instead, someone took the opportunity to advocate that a wholesale
change to the database was the only reasonable course of action.
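For reference, the conversion the original poster asked about can be
sketched as below. This is only a minimal sketch, assuming Python 3 and a
database driver that hands back raw bytes; the sample value and the "name"
key are made up for illustration and are not from the original question.

```python
import json

# Hypothetical value, as it might come back from a Latin-1 column.
raw = b"caf\xe9"                      # "café" encoded as Latin-1

# Step 1: bytes -> str. After this, the text is Unicode and no longer
# tied to any particular byte encoding.
text = raw.decode("latin-1")

# Step 2: json.dumps() takes str and returns str. With the default
# ensure_ascii=True, non-ASCII characters are written as \uXXXX escapes,
# so the result is pure ASCII (and therefore also valid UTF-8).
payload = json.dumps({"name": text})

# Step 3 (only if bytes are needed, e.g. for a socket or a binary file):
# keep the characters unescaped and encode the whole document as UTF-8.
wire = json.dumps({"name": text}, ensure_ascii=False).encode("utf-8")

print(payload)
print(wire)
```

Under those assumptions, .decode('Latin-1') is what turns the stored bytes
into text, and an explicit .encode('UTF-8') only matters at the point where
the finished JSON has to leave the program as bytes.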
First comment: when someone proposes a change, the burden of proof that
the change is warranted generally falls on them. This is especially true
when they are telling someone else to make such a change.
Absolute statements are very hard to prove (but the difficulty of proof
doesn't relieve the need to provide it), and in fact are fairly easy to
disprove (one counterexample disproves an absolute statement).
Counterexamples to the absolute statement have been provided.
When dealing with a code base, backwards compatibility IS important, and
casually changing something that fundamental isn't the first thing that
someone should be thinking about. We weren't given any details about the
overall system this was part of, and there easily could be other code
using the database that such a change would break. One easy Python
example is the change from Python 2 to Python 3: how many years has that
gone on, and how many more will people continue to deal with it? That
change was over a similar issue: deciding that, at least for today,
Unicode is the best solution for storing arbitrary text, and forcing
that decision down to the fundamental level.
--
Richard Damon