Processing text data with different encodings
Peter Otten
__peter__ at web.de
Tue Jun 28 07:31:08 EDT 2016
Michael Welle wrote:
> With your help, I fixed logging. Somehow I had in mind that the
> logging module would do the right thing if I don't specify the encoding.
The default encoding depends on the environment (and platform):
$ touch tmp.txt
$ python3 -c 'print(open("tmp.txt").encoding)'
UTF-8
$ LANG=C python3 -c 'print(open("tmp.txt").encoding)'
ANSI_X3.4-1968
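So if the log should always be UTF-8 it is best to say so explicitly
rather than rely on the locale. A minimal sketch (the file name is just
an illustration):

import logging

# Pin the encoding instead of relying on the locale default.
handler = logging.FileHandler("tmp.log", encoding="utf-8")
logging.basicConfig(handlers=[handler], level=logging.INFO)
logging.info("ü survives regardless of LANG")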
> Well, setting the encoding explicitly to utf-8 changes the behaviour.
>
> If I use decode('windows-1252') on a bit of text I still have trouble
> understanding what's happening. For instance, there is a u umlaut in the
> 1252-encoded portion of the input text. That character is 0xfc in hex.
> After applying .decode('windows-1252') and logging it, the log contains
> a mangled character with hex codes 0xc3 0x20. If I do the same with
> .decode('utf-8'), the result is a working u umlaut with 0xfc in the log.
>
> On the other hand, if I try the following in the interactive
> interpreter:
>
> Here I have a few bytes that can be interpreted as a 1252-encoded string
> and I ask the interpreter to show me the string, right?
>
> >>> e = b'\xe4'
> >>> e.decode('1252')
> 'ä'
>
> Now, I can't do this, because 0xe4 isn't valid utf-8:
> >>> e.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0:
> unexpected end of data
>
> But why is it different in my actual script? I guess the assumption that
> what I am reading from sys.stdin.buffer is the same as what is in the
> file that I pipe into the script, is wrong?
The situation is simple: a string consists of code points, but a file can
only contain bytes. When reading a string from a file the bytes read need
decoding, and before writing a string to a file it must be encoded. Which
byte sequence denotes a specific code point depends on the encoding.
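For example, the u-umlaut from your log is a single byte in cp1252 but
two bytes in UTF-8:

>>> "ü".encode("cp1252")
b'\xfc'
>>> "ü".encode("utf-8")
b'\xc3\xbc'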
This is always the case; if you look at a UTF-8-encoded file with an
editor that expects cp1252, for instance, you will see
>>> in_the_file = "ä".encode("utf-8")
>>> in_the_file
b'\xc3\xa4'
>>> what_the_editor_shows = in_the_file.decode("cp1252")
>>> print(what_the_editor_shows)
ä
On the other hand, if you look at a cp1252-encoded file and decode the data
as UTF-8, you will likely get an error, because the byte
>>> "ä".encode("cp1252")
b'\xe4'
alone is not valid UTF-8. As part of a sequence the data may still be
ambiguous. If you were to write an a-umlaut followed by two euro signs using
cp1252
>>> in_the_file = "ä€€".encode("cp1252")
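the file would contain

>>> in_the_file
b'\xe4\x80\x80'

and these three bytes happen to form one complete UTF-8 sequence, so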
an editor expecting UTF-8 would show
>>> in_the_file.decode("utf-8")
'䀀'
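If your input really mixes both encodings there is no foolproof fix, but
trying UTF-8 first and falling back to cp1252 line by line often works in
practice, because cp1252 text is rarely valid UTF-8 by accident. A rough
sketch (the helper name is my invention; it reads from sys.stdin.buffer
as in your script):

import sys

def decode_line(raw):
    # UTF-8 first: its strict multi-byte structure makes false
    # positives rare.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 maps almost every byte, so keep it as the fallback.
        return raw.decode("cp1252")

for raw in sys.stdin.buffer:
    print(decode_line(raw), end="")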