Lossless bulletproof conversion to unicode (backslashing) - Python-ideas

26 May 2015

      https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode

There is no lossless way to decode arbitrary external
data to unicode. The argument that you know the encoding
of incoming data is a fallacy. The argument that the
data is always valid is a fallacy as well. So:

1. external data encoding is unknown or varies
2. external data has binary chunks that are invalid for
conversion to unicode

In the real world you have to deal with broken and invalid
output, and crashing with UnicodeDecodeError is not an option.
The unicode() constructor offers two options to
deal with invalid output:

1. ignore - drop the invalid bytes, which corrupts the data
2. replace - substitute U+FFFD for them, which also corrupts the data
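
To illustrate why neither option is lossless, here is a small sketch. The
sample bytes are made up for the example; the point is only that the original
bytes cannot be recovered after decoding with either handler.

```python
# Sample input (an assumption for illustration): valid UTF-8 for "cafe" with
# an e-acute, followed by two bytes that are never valid in UTF-8.
raw = b"caf\xc3\xa9 \xff\xfe end"

ignored = raw.decode("utf-8", "ignore")    # invalid bytes silently dropped
replaced = raw.decode("utf-8", "replace")  # invalid bytes become U+FFFD

# Encoding back does not reproduce the input in either case.
ignore_roundtrip = ignored.encode("utf-8")
replace_roundtrip = replaced.encode("utf-8")

print(ignore_roundtrip == raw)   # False
print(replace_roundtrip == raw)  # False
```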

The solution is to have a filter preprocess the binary
string and escape every byte that cannot be decoded, so that the
following lossless transformation becomes possible:

   binary -> escaped utf-8 string -> unicode -> binary
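
The round trip above could be sketched roughly as follows. This is only one
possible approach, not an established recipe: the handler name
"escape_invalid" and the helpers to_text() and to_bytes() are hypothetical
names made up for this example, and the input is assumed to be mostly UTF-8.
It was written with Python 3 in mind; it avoids Python 3-only features, but
whether it behaves identically on 2.x is an assumption.

```python
import codecs
import re


def _escape_invalid(exc):
    # Hypothetical decode error handler: emit \xNN for each undecodable byte.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return u"".join(u"\\x%02x" % b for b in bad), exc.end


codecs.register_error("escape_invalid", _escape_invalid)

# Matches a doubled backslash, or a \xNN escape produced by the handler.
_TOKEN = re.compile(r"\\\\|\\x([0-9a-f]{2})")


def to_text(raw):
    # Double literal backslashes first, so that real text such as "\xff"
    # cannot be mistaken for an escape on the way back.
    return raw.replace(b"\\", b"\\\\").decode("utf-8", "escape_invalid")


def to_bytes(text):
    # Inverse of to_text(): undo the escapes, encode everything else as UTF-8.
    out = bytearray()
    pos = 0
    for m in _TOKEN.finditer(text):
        out += text[pos:m.start()].encode("utf-8")
        if m.group(1) is None:
            out += b"\\"                      # doubled backslash
        else:
            out.append(int(m.group(1), 16))   # escaped raw byte
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


raw = b"ok \xff\xfe C:\\dir \\xff caf\xc3\xa9"
text = to_text(raw)
print(to_bytes(text) == raw)  # True
```

The design choice here is that every backslash in the escaped text belongs
either to a doubled pair or to a \xNN escape, which is what makes the reverse
step unambiguous.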

How to accomplish that with Python 2.x?

This is critical for porting SCons to Python 3.x, and I
expect for other such tools too.

-- 
anatoly t.
