"convert" string to bytes without changing data (encoding)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Mar 28 14:26:29 EDT 2012
On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is that I am
> often dealing with data that is basically text, but it can contain
> 8-bit bytes.
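All bytes are 8-bit, at least on modern hardware. I think you have to go
back several decades to find machines with other byte sizes. Presumably
you mean bytes with the high bit set, that is, values outside the ASCII
range 0-127.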
> In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.
Well, you can't do that, because *by definition* encoding a string means
changing each CHARACTER into ONE OR MORE BYTES. So the question you have
to ask is, *how* do you want to change them?
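To make the point concrete, here is a minimal sketch showing that the same
character becomes different bytes depending on the encoding you pick. The
character and the three encodings are just illustrative choices, not
anything taken from the original question.

```python
# One character, three encodings, three different byte sequences.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

encoded = {name: ch.encode(name) for name in ("latin-1", "utf-8", "utf-16-le")}

for name, raw in encoded.items():
    # Show the encoding, the resulting bytes, and how many bytes were needed.
    print(name, raw, len(raw))
```

There is no "neutral" choice here: some encoding always has to be named,
even if only implicitly.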
You can use an error handler to convert any untranslatable characters
into question marks, or to ignore them altogether:
    # 'data' rather than 'bytes', to avoid shadowing the built-in type
    data = text.encode('ascii', 'replace')
    data = text.encode('ascii', 'ignore')
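As a rough illustration of what the two error handlers do, compared with
the default 'strict' handler, consider the sketch below. The sample text is
made up for the example.

```python
sample = "na\u00efve caf\u00e9"  # two non-ASCII characters

replaced = sample.encode("ascii", "replace")  # each bad character becomes b'?'
ignored = sample.encode("ascii", "ignore")    # each bad character is dropped

# With the default handler ('strict'), the same call raises instead.
try:
    sample.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, ignored, strict_failed)
```

Note that both handlers lose information: you cannot get the original
string back from either result.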
When going the other way, from bytes to strings, it can sometimes be
useful to use the Latin-1 encoding, which cannot fail because every one
of the 256 possible byte values maps to exactly one character:
    text = data.decode('latin1')
although the non-ASCII chars that you get may not be sensible or
meaningful in any way. But if there are only a few of them, and you don't
care too much, this may be a simple approach.
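Because Latin-1 maps each byte to the code point with the same number,
decoding and then re-encoding with Latin-1 gives back the original bytes.
The sketch below relies on that to edit only the ASCII portion of some
made-up data; the sample bytes and the replacement are assumptions for
illustration.

```python
# Every possible byte value survives a Latin-1 round trip.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# Made-up input: ASCII text mixed with bytes in an unknown encoding.
mixed = b"caf\xe9 \xff\x80 ok"

decoded = mixed.decode("latin-1")       # never raises
edited = decoded.replace("ok", "OK")    # touch only the ASCII part
result = edited.encode("latin-1")       # non-ASCII bytes come back unchanged

print(result)
```

This only holds as long as the processing leaves the non-ASCII characters
alone. Operations such as case conversion can produce characters outside
Latin-1, and then the final encode will fail.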
But in a nutshell, it is impossible to map each of the more than one
million possible Unicode code points to a single byte, which has only 256
possible values, without either throwing some characters away or using an
encoding that spends more than one byte on some characters.
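The numbers can be checked directly. This sketch assumes Python 3.3 or
later, where sys.maxunicode is always 0x10FFFF; the euro sign is just an
arbitrary example of a character above U+00FF.

```python
import sys

code_points = sys.maxunicode + 1  # number of possible code points
byte_values = 2 ** 8              # number of distinct values in one byte

euro = "\u20ac"  # EURO SIGN, outside the one-byte Latin-1 range

# A one-byte-per-character encoding has no room for it...
try:
    euro.encode("latin-1")
    fits_in_one_byte = True
except UnicodeEncodeError:
    fits_in_one_byte = False

# ...while a multi-byte encoding spends three bytes on it.
euro_utf8 = euro.encode("utf-8")

print(code_points, byte_values, fits_in_one_byte, euro_utf8)
```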
> As it seems, this would be far easier with python 2.x.
It only seems that way until you try.
--
Steven