"convert" string to bytes without changing data (encoding)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Mar 28 20:26:29 CEST 2012


On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I am
> often dealing with data, that is basically text, but it can contain
> 8-bit bytes. 

All bytes are 8-bit, at least on modern hardware. I think you have to go 
back to the 1960s and 70s to find machines with 6-bit or 9-bit bytes.

> In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

Well, you can't do that, because *by definition* converting a string to 
bytes changes each CHARACTER into ONE OR MORE BYTES. So the question you 
have to ask is, *how* do you want to change them?

You can use an error handler to convert any untranslatable characters 
into question marks, or to ignore them altogether:

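# "data" rather than "bytes", to avoid shadowing the built-in type
data = text.encode('ascii', 'replace')  # non-ASCII chars become b'?'
data = text.encode('ascii', 'ignore')   # non-ASCII chars are dropped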
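
To make the difference between the two handlers concrete, here is a small sketch. The sample string is my own invention, not taken from the original question; the outputs in the comments are what the 'replace' and 'ignore' handlers produce for it.

```python
# Hypothetical sample text containing two non-ASCII characters.
text = "na\u00efve caf\u00e9"   # 'naïve café'

replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")

print(replaced)   # b'na?ve caf?'
print(ignored)    # b'nave caf'

# With the default handler ('strict'), the same call raises an exception.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(strict_failed)   # True
```

Either way, information is lost: you cannot get the original string back from the resulting bytes.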

When going the other way, from bytes to strings, it can sometimes be 
useful to use the Latin-1 encoding, which essentially cannot fail:

text = data.decode('latin1')

although the non-ASCII chars that you get may not be sensible or 
meaningful in any way. But if there are only a few of them, and you don't 
care too much, this may be a simple approach.
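
As a rough sketch of that approach (the sample bytes and the replacement step are made up for illustration): decode as Latin-1, work on the ASCII portions as text, then encode as Latin-1 again. Because Latin-1 maps each byte value 0-255 to the code point with the same number, the untouched bytes come back unchanged.

```python
# Hypothetical input: ASCII text mixed with bytes in an unknown encoding.
raw = b"caf\xc3\xa9 \xff\xfe end"

text = raw.decode("latin1")         # never fails for any byte sequence
text = text.replace("end", "END")   # operate on the ASCII portion only
back = text.encode("latin1")        # non-ASCII bytes are restored as-is

print(back)   # b'caf\xc3\xa9 \xff\xfe END'

# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin1").encode("latin1") == all_bytes
print(round_trip_ok)   # True
```

The catch is that the encode step only works as long as you have not added any character above U+00FF to the string in the meantime.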

But in a nutshell, it is impossible to map the more than one million 
possible Unicode code points to just 256 possible byte values without 
either throwing some characters away, or using an encoding that spends 
more than one byte on some characters.
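
For illustration, here is what an encoding such as UTF-8 does with a few characters of my own choosing, and what happens when a single-byte encoding meets a character it has no byte for. Treat it as a sketch, not as a recommendation of any particular encoding.

```python
# Sample characters chosen for illustration: ASCII, Latin-1 range,
# the euro sign, and one character outside the Basic Multilingual Plane.
samples = ["a", "\u00e9", "\u20ac", "\U0001F40D"]

# UTF-8 uses between one and four bytes per character.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)   # [1, 2, 3, 4]

# A one-byte-per-character encoding cannot represent all of them.
try:
    "\u20ac".encode("latin1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(latin1_failed)   # True
```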



> As it seems, this would be far easier with python 2.x. 

It only seems that way until you try.


-- 
Steven


