"convert" string to bytes without changing data (encoding)

Peter Daum gator at cs.tu-berlin.de
Thu Mar 29 16:57:19 CEST 2012

On 2012-03-28 23:37, Terry Reedy wrote:
> 2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
> chars. When done, encode back to 'latin-1' and the non-ascii chars will
> be as they originally were.

... actually, at the beginning of my quest, I ran into a decoding
exception when trying to read the data as "latin1" (which was more or less
what I had expected anyway, because I thought byte values between 128 and
160 were not defined there).

Obviously, I must have misinterpreted something there;
I just ran a little test:

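  b = bytes(range(256))          # every possible byte value
  # decode, encode and decode again; nothing should get lost
  s = b.decode('latin1'); b = s.encode('latin1'); s = b.decode('latin1')
  for c in s:
      print(hex(ord(c)), end=' ')
      if (ord(c) + 1) % 16 == 0:  # line break after every 16 values
          print()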

... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist in the first place (the problems I ran
into back then were pretty real, though ;-)
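
The same check can be written with assertions instead of reading the hex
dump by eye. This is only a sketch: the helper name decodes_cleanly is
made up for illustration, and the cp1252 comparison is my own guess at
the kind of codec that really does have undefined byte values. Nothing
above says that cp1252 was involved in the original exception.

```python
# Every byte value 0..255 survives a latin-1 decode/encode round trip.
data = bytes(range(256))
text = data.decode('latin-1')
assert text.encode('latin-1') == data

# latin-1 maps each byte to the code point with the same number.
assert [ord(c) for c in text] == list(range(256))


def decodes_cleanly(raw, encoding):
    """Hypothetical helper: True if raw decodes without an exception."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# latin-1 accepts all 256 byte values ...
latin1_ok = decodes_cleanly(data, 'latin-1')
# ... while cp1252 (an assumption, used only for contrast) leaves a few
# values such as 0x81 unassigned and refuses to decode them.
cp1252_ok = decodes_cleanly(b'\x81', 'cp1252')
print(latin1_ok, cp1252_ok)
```

So as far as Python's "latin1" codec is concerned, there are no gaps
between 128 and 160.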

> 3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
> reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
> (using the surrogate-pair second-half code units). This is probably the
> safest in that invalid operations on the non-chars should raise an
> exception. Re-encoding with the same setting will reproduce the original
> hi-bit chars. The main danger is passing the illegal strings out of your
> local sandbox.

Unfortunately, this is a very well-kept secret unless you already know
that something with that name exists. The options currently mentioned in
the documentation are not really helpful, because the non-decodable bytes
get lost. After some experimenting, I got it to work, too (the option is
named "surrogateescape" without the "_", and in Python 3.1 it exists, but
not as a keyword argument: "s=b.decode('utf-8','surrogateescape')" ...)
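
For the archives, a small sketch of the "surrogateescape" round trip. The
sample bytes are made up for illustration, and the handler is passed
positionally, as in the call quoted above.

```python
# Made-up sample: ASCII text followed by two bytes that are not valid UTF-8.
raw = b'abc\xe9\xff'

# Undecodable bytes become lone surrogates U+DC80..U+DCFF instead of
# raising an exception or being dropped.
text = raw.decode('utf-8', 'surrogateescape')
assert text == 'abc\udce9\udcff'

# Encoding with the same error handler restores the original bytes.
restored = text.encode('utf-8', 'surrogateescape')
assert restored == raw

# Without the handler, the lone surrogates cannot be encoded, which is
# the safety net mentioned in the quoted advice.
try:
    text.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
print(strict_encode_failed)
```

The ASCII part of the string can be processed as usual; only the escaped
bytes have to be kept away from anything that encodes strictly.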

Thank you very much for your constructive advice!

