UTF-8 and latin1
Jon Ribbens
jon+usenet at unequivocal.eu
Wed Aug 17 20:20:53 EDT 2022
On 2022-08-17, Barry <barry at barrys-emacs.org> wrote:
>> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list <python-list at python.org> wrote:
>> On 2022-08-17, Tobiah <toby at tobiah.org> wrote:
>>> I get data from various sources; client emails, spreadsheets, and
>>> data from web applications. I find that I can do some_string.decode('latin1')
>>> to get unicode that I can use with xlsxwriter,
>>> or put <meta charset="latin1"> in the header of a web page to display
>>> European characters correctly. But normally UTF-8 is recommended as
>>> the encoding to use today. latin1 works correctly more often when I
>>> am using data from the wild. It's frustrating that I have to play
>>> a guessing game to figure out how to use incoming text. I'm just wondering
>>> if there are any thoughts. What if we just globally decided to use utf-8?
>>> Could that ever happen?
>>
>> That has already been decided, as much as it ever can be. UTF-8 is
>> essentially always the correct encoding to use on output, and almost
>> always the correct encoding to assume on input absent any explicit
>> indication of another encoding. (e.g. the HTML "standard" says that
>> all HTML files must be UTF-8.)
>>
>> If you are finding that your specific sources are often encoded with
>> latin-1 instead then you could always try something like:
>>
>>     try:
>>         text = data.decode('utf-8')
>>     except UnicodeDecodeError:
>>         text = data.decode('latin-1')
>>
>> (I think latin-1 text will almost always fail to be decoded as utf-8,
>> so this would work fairly reliably assuming those are the only two
>> encodings you see.)
>
> Only if a reserved byte is used in the string.
> It will often work in either.
Because it's actually ASCII and hence there's no difference between
interpreting it as utf-8 or iso-8859-1? In which case, who cares?
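
A minimal sketch of the point above, assuming Python 3 and input held as bytes. The function name decode_guess and the sample byte strings are illustrative, not from the thread. It shows that pure-ASCII bytes decode identically either way, while a latin-1 byte such as 0xe9 makes the strict UTF-8 decode fail and triggers the fallback.

```python
def decode_guess(data: bytes) -> str:
    """Try strict UTF-8 first; fall back to latin-1, which accepts any byte."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('latin-1')


# Pure ASCII: both encodings agree, so the guess cannot be wrong.
ascii_bytes = b'plain ASCII text'
assert ascii_bytes.decode('utf-8') == ascii_bytes.decode('latin-1')

# "cafe" with e-acute encoded as latin-1: the lone byte 0xe9 is not valid UTF-8.
latin1_bytes = b'caf\xe9'
try:
    latin1_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed

# The same word encoded as UTF-8 and as latin-1 decodes to the same text.
assert decode_guess(b'caf\xc3\xa9') == decode_guess(latin1_bytes)
```

The fallback is only wrong when latin-1 bytes happen to form a valid UTF-8 sequence, which is rare in real text.
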
> For web pages it cannot be assumed that markup saying it’s utf-8 is
> correct. Many pages are in fact cp1252. Usually you find out because
> of a smart quote that is 0xa0 in cp1252 and illegal in utf-8.
Hence what I said above. But if a source explicitly states an encoding
and it's false then these days I see little need for sympathy.
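
For anyone who has to cope with such pages anyway, the same try/except idea extends to cp1252. This is only a sketch under stated assumptions: decode_web_bytes is a hypothetical helper, and the order of the fallbacks is a judgment call rather than anything the thread prescribes. cp1252 places curly quotes at bytes 0x91 to 0x94, which are not valid as standalone bytes in UTF-8, and Python's cp1252 codec leaves a few bytes (such as 0x81) undefined, hence the final latin-1 fallback.

```python
def decode_web_bytes(data: bytes) -> str:
    """Decode as UTF-8, then cp1252, then latin-1 as a last resort."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        pass
    try:
        # cp1252 maps 0x93/0x94 to curly double quotes.
        return data.decode('cp1252')
    except UnicodeDecodeError:
        # A handful of bytes are undefined in cp1252; latin-1 never fails.
        return data.decode('latin-1')


# Curly quotes as cp1252 bytes: UTF-8 decoding fails, cp1252 succeeds.
assert decode_web_bytes(b'\x93hi\x94') == '\u201chi\u201d'

# The same text correctly encoded as UTF-8 is decoded on the first attempt.
assert decode_web_bytes(b'\xe2\x80\x9chi\xe2\x80\x9d') == '\u201chi\u201d'
```
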