[Tutor] string to binary and back... Python 3

Thu Jul 19 12:52:25 CEST 2012

I'll preface my response by saying that I know/understand fairly little about
it, but since I've recently been smacked by this same issue when converting
stuff to Python3, I'll see if I can explain it in a way that makes sense.

On Wed, 18 Jul 2012, Jordan wrote:

> OK so I have been trying for a couple days now and I am throwing in the
> towel, Python 3 wins this one.
> I want to convert a string to binary and back again like in this
> question: Stack Overflow: Convert Binary to ASCII and vice versa
> (Python)
> <http://stackoverflow.com/questions/7396849/convert-binary-to-ascii-and-vice-versa-python>
> But in Python 3 I consistently get some sort of error relating to the
> fact that nothing but bytes and bytearrays support the buffer interface
> or I get an overflow error because something is too large to be
> converted to bytes.
> Please help me and then explain what I am not getting that is new in
> Python 3. I would like to point out I realize that binary, hex, and
> encodings are all a very complex subject and so I do not expect to
> master it but I do hope that I can gain a deeper insight. Thank you all.

The way I've read it - stop thinking about text and bytes as if they are the
same thing. In Python3 a str is Unicode text and a bytes object is raw bytes,
and you convert between them explicitly with encode() and decode(). The
biggest reason that all this has changed is that Python has grown up and
entered the world where Unicode actually matters. To us poor schmucks in the
English-speaking countries of the world it's all very confusing because it's
nothing we have had to deal with. 26 letters is perfectly fine for us - and if
we want uppercase we'll just throw in another 26. Add a few dozen punctuation
marks and 256 is a perfectly fine number of characters.

To make a slightly relevant side trip, when you were a kid did you ever send
"secret" messages to a friend with a code like this?

A = 1
B = 2
.
.
.
Z = 26

Well, that's basically what is going on when it comes to bytes/text/whatever.
When you input some text, Python3 stores it as a str of Unicode code points -
every character gets a number. An encoding such as UTF-8 or LATIN-1 is the
rule for turning those numbers into bytes and back again. The nice thing for
us 26-letter folks is that the ASCII alphabet we're so used to maps directly
onto Unicode - so 'A' is 65 in ASCII and it is the same single byte, 65, in
utf-8.
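
Here is a small sketch of that "same number" idea, in case it helps. The
variable names are just mine, and the accented letter is only there to show
where ASCII and the other encodings stop agreeing:

```python
# Every character in a str has a number: its Unicode code point.
code_point = ord('A')            # 65
back_again = chr(65)             # 'A'

# For plain ASCII letters, ASCII and UTF-8 produce the very same byte.
ascii_bytes = 'A'.encode('ascii')
utf8_bytes = 'A'.encode('utf-8')
same_bytes = (ascii_bytes == utf8_bytes)

# Outside ASCII the encodings disagree about the bytes.
e_acute = '\u00e9'                       # the letter e with an acute accent
utf8_form = e_acute.encode('utf-8')      # two bytes
latin1_form = e_acute.encode('latin-1')  # one byte

print(code_point, back_again, same_bytes)
print(utf8_form, latin1_form)
```

So the "secret code" table is the same for our 26 letters no matter which of
these encodings you pick; it only starts to matter once you leave ASCII.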

Now, here's the part that I had to (and still need to) wrap my mind around - if
the data is "just bytes" then it doesn't really matter what it is supposed to
represent. It could be text in the LATIN-1 character set. Or UTF-8, -16, or
some other weird encoding. And all the operations that are supposed to modify
these sequences of bytes (e.g. removing spaces, splitting on a certain
"character", etc.) still work. Because if I have these bytes:

9 45 12 9 13 19 18 9 12 99 102

and I tell you to split on the 9's, it doesn't matter if that's some weird
ASCII character, or some equally weird UTF character, or something else
entirely. And I don't have to worry about things getting munged up when I try
to stick Unicode and ASCII values together - because Python3 refuses to mix
str and bytes (you get a TypeError), so one of them has to be encoded or
decoded explicitly first.
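
The split-on-9 idea can be tried directly with a bytes object. This is only
a sketch using the numbers from above; `data`, `parts` and `as_numbers` are
names I made up for the illustration:

```python
# The eleven numbers from the example, stored as raw bytes.
data = bytes([9, 45, 12, 9, 13, 19, 18, 9, 12, 99, 102])

# Split on the byte value 9. No encoding is involved at any point.
parts = data.split(bytes([9]))

# Turn each piece back into a list of numbers so it is easy to read.
as_numbers = [list(piece) for piece in parts]
print(as_numbers)   # [[], [45, 12], [13, 19, 18], [12, 99, 102]]

# Mixing text and bytes is refused outright rather than silently munged.
try:
    'abc' + b'def'
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
print(mixing_allowed)   # False
```

The empty first piece is there because the data starts with a 9, the same
way ',a,b'.split(',') starts with an empty string.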

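So the question is, of course, if it's all bytes, then why does it look like
text when I print it out? Well, if it is a str, it already is Unicode text,
and Python only encodes it to bytes on the way out to your terminal or file.
If it is a bytes object, printing shows it as b'...', with the bytes in the
ASCII range displayed as characters; to get real text you decode it, as UTF-8
or ASCII or whatever it was encoded with.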
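
To see what actually happens at print time, here is a minimal sketch. The
sample word and the variable names are my own choice, nothing special:

```python
text = 'h\u00e9llo'          # "hello" with an accented e
raw = text.encode('utf-8')   # str -> bytes, explicitly

# Printing bytes does not decode them; you get the b'...' form, with
# ASCII-range bytes shown as characters and the rest as \x escapes.
shown = repr(raw)
print(shown)

# Getting text back is an explicit decode with the right encoding.
decoded = raw.decode('utf-8')
print(decoded == text)

# Decoding with the wrong encoding can fail outright.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```

So bytes only "look like text" when they happen to fall in the ASCII range;
anything else stays as escapes until you decode it yourself.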

But Python3 has changed a lot(?) of the functions that deal with raw data
(sockets, files opened in binary mode, binascii and friends) so that they take
and return bytes instead of str. Except for the ones that operate on text ;)
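
For the original question, here is one way the round trip could look. Treat
it as a sketch rather than the one true answer: `text_to_bits` and
`bits_to_text` are names I made up, and I'm assuming UTF-8 is the encoding
you want. The binascii bit at the end shows the kind of TypeError you hit
when a bytes-only function is handed a str:

```python
import binascii


def text_to_bits(text, encoding='utf-8'):
    """Encode text to bytes, then write each byte as 8 binary digits."""
    raw = text.encode(encoding)
    return ''.join(format(byte, '08b') for byte in raw)


def bits_to_text(bits, encoding='utf-8'):
    """Read the digits back 8 at a time into bytes, then decode to text."""
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return raw.decode(encoding)


bits = text_to_bits('hi')
print(bits)                 # 0110100001101001
print(bits_to_text(bits))   # hi

# binascii works on bytes, so a str has to be encoded first.
try:
    binascii.hexlify('hi')
    str_accepted = True
except TypeError:
    str_accepted = False
hex_form = binascii.hexlify('hi'.encode('utf-8'))
print(str_accepted, hex_form)   # False b'6869'
```

Going through bytes one at a time sidesteps the "too large to convert"
problem, because no single number ever gets bigger than 255.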

Well, I hope that's of some use and isn't too much of a lie - like I said, I'm
still trying to wrap my head around things and I've found that explaining (or
trying to explain) to someone else is often the best way to work out the idea
in your own head. If I've gone too far astray I'm sure the other helpful folks
here will correct me :)

HTH,
Wayne