Lossless bulletproof conversion to unicode (backslashing)
https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode There is no lossless way to encode the information to unicode. The argument that you know the encoding the data is coming from is a fallacy. The argument that data is always correct is a fallacy as well. So: 1. external data encoding is unknown or varies 2. external data has binary chunks that are invalid for conversion to unicode In real world you have to deal with broken and invalid output and UnicodeDecode crashes is not an option. The unicode() constructor proposes two options to deal with invalid output: 1. ignore - meaning skip and corrupt the data 2. replace - just corrupt the data The solution is to have filter preprocess the binary string to escape all non-unicode symbols so that the following lossless transformation becomes possible: binary -> escaped utf-8 string -> unicode -> binary How to accomplish that with Python 2.x? This stuff is critical to port SCons to Python 3.x and I expect for other such tools too. -- anatoly t.
On Tue, May 26, 2015 at 9:47 PM, Ethan Furman
On 05/26/2015 11:30 AM, anatoly techtonik wrote:
[...]
How to accomplish that with Python 2.x?
This should be on Python List, not on Ideas.
The way to do this, probably, the idea to make it into unicode() function belongs here. So, if you're replying to this thread and read the letter, are against or for the idea? -- anatoly t.
On 26 May 2015 at 19:30, anatoly techtonik
In real world you have to deal with broken and invalid output and UnicodeDecode crashes is not an option. The unicode() constructor proposes two options to deal with invalid output:
1. ignore - meaning skip and corrupt the data 2. replace - just corrupt the data
There are other error handlers, specifically surrogateescape is designed for this use. Only in Python 3.x admittedly, but this list is about future versions of Python, so that's what matters here.
The solution is to have filter preprocess the binary string to escape all non-unicode symbols so that the following lossless transformation becomes possible:
binary -> escaped utf-8 string -> unicode -> binary
How to accomplish that with Python 2.x?
That question is for python-list. Language changes will only be made to 3.x - python-ideas isn't appropriate for questions about how to achieve something in 2.x. Paul
On Wed, May 27, 2015 at 6:28 PM, Paul Moore
On 26 May 2015 at 19:30, anatoly techtonik
wrote: In real world you have to deal with broken and invalid output and UnicodeDecode crashes is not an option. The unicode() constructor proposes two options to deal with invalid output:
1. ignore - meaning skip and corrupt the data 2. replace - just corrupt the data
There are other error handlers, specifically surrogateescape is designed for this use. Only in Python 3.x admittedly, but this list is about future versions of Python, so that's what matters here.
Forwarded message to python-list and now I have a thread schizophrenia. I read it like python-list is also about Python 3 and got really mad about that. I was a click away from sending me into the ban list again. =) Ok. Closing thread in python-idea. This needs to be reopened when the thread is about Python 4 (which should be all about improving user experience and assessment of the results). -- anatoly t.
participants (3)
-
anatoly techtonik
-
Ethan Furman
-
Paul Moore