[Python-ideas] Lossless bulletproof conversion to unicode (backslashing)
anatoly techtonik
techtonik at gmail.com
Tue May 26 20:30:54 CEST 2015
https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode
There is no lossless way to convert arbitrary binary
data to unicode. The argument that you know the encoding
the data is coming in is a fallacy. The argument that the
data is always valid is a fallacy as well. So:
1. external data encoding is unknown or varies
2. external data has binary chunks that are invalid for
conversion to unicode
In the real world you have to deal with broken and invalid
output, and crashing with UnicodeDecodeError is not an option.
The unicode() constructor offers two options to
deal with invalid data:
1. ignore - skip the invalid bytes, which corrupts the data
2. replace - substitute a replacement character, which also
corrupts the data
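
To make the data loss concrete, here is a small sketch. The sample
bytes are made up for illustration; the code only uses the standard
decode() error handlers and should behave the same on 2.7 and 3.x.

```python
# -*- coding: utf-8 -*-
# Illustrative input: valid UTF-8 ("caf\xc3\xa9") followed by two bytes
# that can never appear in UTF-8 (0xff, 0xfe).
raw = b"caf\xc3\xa9 \xff\xfe end"

# "ignore" silently drops the bytes that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# "replace" swaps each undecodable byte for U+FFFD.
replaced = raw.decode("utf-8", "replace")

# Neither result can be turned back into the original bytes.
ignored_roundtrip = ignored.encode("utf-8")
replaced_roundtrip = replaced.encode("utf-8")
```

In both cases the round trip back to bytes differs from raw, so the
original bytes cannot be recovered.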
The solution is to have a filter preprocess the binary
string and escape every byte that cannot be converted to
unicode, so that the following lossless transformation
becomes possible:
binary -> escaped utf-8 string -> unicode -> binary
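
One possible way to get this transformation is sketched below. It is
only an illustration under some assumptions: the input is treated as
"mostly UTF-8", the escape syntax (\xNN, with literal backslashes
doubled) is my own choice, and the names "byteescape",
escape_to_unicode and unescape_to_bytes are hypothetical, not part of
any library. Only codecs.register_error and re are real stdlib APIs.

```python
# -*- coding: utf-8 -*-
import codecs
import re


def _byteescape(exc):
    # Error handler (hypothetical name): turn each undecodable byte
    # into the text \xNN instead of dropping or replacing it.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return (u"".join(u"\\x%02x" % b for b in bad), exc.end)


codecs.register_error("byteescape", _byteescape)

# Matches a backslash followed by either a backslash or xNN.
_ESCAPE = re.compile(r"\\(\\|x[0-9a-f]{2})")


def escape_to_unicode(raw):
    # Double literal backslashes first so the escapes stay unambiguous.
    # This is safe at the byte level: 0x5c never occurs inside a
    # multi-byte UTF-8 sequence.
    return raw.replace(b"\\", b"\\\\").decode("utf-8", "byteescape")


def unescape_to_bytes(text):
    # Reverse step: plain text is encoded as UTF-8, escapes become
    # the single byte they stand for.
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += text[pos:m.start()].encode("utf-8")
        token = m.group(1)
        if token == u"\\":
            out.append(0x5C)
        else:
            out.append(int(token[1:], 16))
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


# Made-up sample: valid UTF-8, a literal backslash, and invalid bytes.
raw = b"caf\xc3\xa9 \\path \xff\xfe\x00 end"
text = escape_to_unicode(raw)
restored = unescape_to_bytes(text)
```

Here restored should equal raw byte for byte, and text is an ordinary
unicode string that can be printed or logged. On Python 3 the built-in
"surrogateescape" error handler addresses a similar need, but it is
not available in 2.x.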
How can this be accomplished with Python 2.x?
This is critical for porting SCons to Python 3.x, and I
expect for other such tools too.
--
anatoly t.