[newbie] String to binary conversion
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Aug 6 22:01:05 EDT 2012
On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best obtain
> that integer and from that integer backwards again obtain the original
> string? Thanks in advance.
First you have to know the encoding, as that will define the integers you
get. There are many 8-bit encodings, but none of them can encode
arbitrary 4-character strings. Unicode defines more than a hundred
thousand characters, and an 8-bit encoding can code for at most 256 of
them, so there are many strings that any given 8-bit encoding cannot
handle. For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
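To make the difference concrete, here is a small sketch that encodes the same two-character string three ways and counts the bytes. The sample string 'aπ' and the variable names are my own choices for illustration, and 'utf-16-le' is used so that no byte-order mark is added to the count.

```python
# Illustrative only: how many bytes each encoding needs for the same text.
text = 'aπ'

utf8 = text.encode('utf-8')        # 'a' takes 1 byte, 'π' takes 2
utf16 = text.encode('utf-16-le')   # 2 bytes per character here, no BOM
greek = text.encode('iso-8859-7')  # 1 byte per character

print(len(utf8), len(utf16), len(greek))  # prints: 3 4 2
```

Only the one-byte encoding gives you exactly one byte per character, which is what the "8 bits per character" assumption in the question requires.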
Sticking to one-byte encodings: since most of them are compatible with
ASCII, examples with "abcd" aren't very interesting:
py> 'abcd'.encode('latin1')
b'abcd'
Even though the bytes object b'abcd' is printed as if it were a string,
in Python 3 it is actually treated as an array of one-byte ints:
py> b'abcd'[0]
97
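To get the single 32-bit integer you asked about, one possible approach in Python 3.2 or later is int.from_bytes, with int.to_bytes for the way back. This is only a sketch: the helper names text_to_int and int_to_text are made up for this example, and I have assumed big-endian byte order (first character in the most significant byte) and Latin-1 as the default encoding.

```python
def text_to_int(text, encoding='latin1'):
    """Encode text to bytes, then read those bytes as one big-endian integer."""
    data = text.encode(encoding)
    return int.from_bytes(data, byteorder='big')


def int_to_text(number, length, encoding='latin1'):
    """Inverse of text_to_int. The byte length must be supplied, because
    leading zero bytes cannot be recovered from the integer alone."""
    data = number.to_bytes(length, byteorder='big')
    return data.decode(encoding)


n = text_to_int('abcd')
print(hex(n))             # prints: 0x61626364
print(int_to_text(n, 4))  # prints: abcd
```

Note that you have to remember the length (4 here) yourself. Without it, a string starting with a NUL character would come back one character short.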
Here's a more interesting example, using Python 3: it uses one
character (the Greek letter π) which cannot be encoded in Latin1, and two
(π and ©) which cannot be encoded in ASCII:
py> "aπ©d".encode('iso-8859-7')
b'a\xf0\xa9d'
Most encodings will round-trip successfully:
py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('iso-8859-7') == text
True
(although the ability to round-trip is a property of the particular
encoding you choose, not something the encode/decode machinery guarantees
in general).
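If you want to test whether a given string survives a given encoding, something along these lines would do. The function name round_trips is hypothetical, and treating a UnicodeEncodeError as "does not round-trip" is my own simplification.

```python
def round_trips(text, encoding):
    """Return True if text encodes and decodes back to itself under
    the named encoding, False if it cannot be encoded at all."""
    try:
        data = text.encode(encoding)
    except UnicodeEncodeError:
        # At least one character has no representation in this encoding.
        return False
    return data.decode(encoding) == text


print(round_trips('aπ©Z!', 'iso-8859-7'))  # prints: True
print(round_trips('aπ©Z!', 'latin1'))      # prints: False (no π in Latin-1)
```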
Naturally if you encode with one encoding, and then decode with another,
you are likely to get different strings:
py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('latin1')
'að©Z!'
py> data.decode('iso-8859-14')
'aŵ©Z!'
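As an aside, the wrong decoding above is not necessarily destructive. Latin-1 assigns a character to every one of the 256 byte values, so a string wrongly decoded as Latin-1 can be encoded back to the original bytes and decoded again properly. A sketch, with variable names of my own choosing:

```python
text = 'aπ©Z!'
data = text.encode('iso-8859-7')

# Decode with the wrong encoding: no error, but the wrong characters.
wrong = data.decode('latin1')

# Latin-1 maps each byte to exactly one character, so encoding the wrong
# string as Latin-1 gives back the original bytes, which can then be
# decoded with the encoding that was actually used.
recovered = wrong.encode('latin1').decode('iso-8859-7')

print(wrong)              # prints: að©Z!
print(recovered == text)  # prints: True
```

This relies on Latin-1 covering all 256 byte values, so don't assume the same trick works with every encoding.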
Both the encode and decode methods take an optional argument, errors,
which specifies the error handling scheme. The default is errors='strict',
which raises an exception (UnicodeEncodeError or UnicodeDecodeError) when
a character or byte cannot be converted. Others include 'ignore' and
'replace'.
py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
b'aZ!'
py> 'aŵðπ©Z!'.encode('ascii', 'replace')
b'a????Z!'
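For completeness, here is roughly what the default 'strict' behaviour looks like if you catch the exception, together with the same handlers applied when decoding. The variable names are mine. The start attribute of the exception gives the index of the first offending character.

```python
text = 'aŵðπ©Z!'

try:
    text.encode('ascii')  # errors='strict' is the default
except UnicodeEncodeError as err:
    bad_index = err.start        # index of the first unencodable character
    bad_char = text[bad_index]
    print('cannot encode', repr(bad_char), 'at index', bad_index)

# The same handlers work for decode(). With 'replace', each undecodable
# byte becomes U+FFFD (the replacement character) rather than '?'.
raw = b'a\xf0\xa9d'
decoded_replace = raw.decode('ascii', 'replace')
decoded_ignore = raw.decode('ascii', 'ignore')
print(ascii(decoded_replace))  # prints: 'a\ufffd\ufffdd'
print(decoded_ignore)          # prints: ad
```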
--
Steven