Totally confused by the str/bytes/unicode differences introduced in Python 3.x

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Jan 16 21:09:13 EST 2009


On Fri, 16 Jan 2009 17:32:17 -0800, Giampaolo Rodola' wrote:

> On 17 Jan, 02:24, MRAB <goo... at mrabarnett.plus.com> wrote:
> 
>> If you're truly working with strings of _characters_ then 'str' is what
>> you need, but if you're working with strings of _bytes_ then 'bytes' is
>> what you need.
> 
> I work with strings of characters, but to convert bytes into a string I
> need to specify an encoding, and that's what confuses me. Before, there
> was no need to deal with that.

In Python 2.x, str means "string of bytes". This has been renamed "bytes" 
in Python 3.

In Python 2.x, unicode means "string of characters". This has been 
renamed "str" in Python 3.
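
To make the renaming concrete, here is a minimal Python 3 sketch. The 
names raw and text are made up for illustration, and the sample data 
assumes UTF-8:

```python
# Python 3: bytes and str are separate types.
raw = b"caf\xc3\xa9"            # a string of bytes (5 of them, UTF-8 encoded)
text = raw.decode("utf-8")      # a string of characters (4 of them)

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))

# A bytes object never compares equal to a str, even for "the same" data.
print(raw == text)
```

The two lengths differ because the last character takes two bytes in 
UTF-8, which is exactly why the two types cannot be treated as the same 
thing.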

If you do this in Python 2.x:

    my_string = str(bytes_from_socket)

then you don't need to convert anything, because you are going from a 
string of bytes to a string of bytes.

If you do this in Python 3:

    my_string = str(bytes_from_socket)

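then you *do* have to convert, because you are going from a string of 
bytes to a string of characters (unicode). Called with no encoding, 
str() on a bytes object just returns its repr, such as "b'abc'", not the 
decoded text. The Python 2.x equivalent code would be: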

    my_string = unicode(bytes_from_socket)

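and when you convert to unicode, you can get decoding errors: with no 
encoding given, Python 2.x assumes ASCII and raises UnicodeDecodeError on 
any byte above 127. A better way to do this would be some variation on: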

    my_str = bytes_from_socket.decode('utf-8')
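
As a rough Python 3 sketch of one such variation, assuming the other end 
of the socket really sends UTF-8: the helper name decode_socket_bytes and 
the fallback to errors="replace" are my own choices for illustration, not 
the only reasonable policy.

```python
def decode_socket_bytes(data, encoding="utf-8"):
    """Decode bytes to str; on bad input, substitute U+FFFD for bad bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Hypothetical fallback: keep going rather than fail outright.
        return data.decode(encoding, errors="replace")


# str() with no encoding gives the repr of the bytes object, not its text.
as_repr = str(b"abc")
# str() with an explicit encoding decodes, like bytes.decode().
as_text = str(b"abc", "utf-8")

print(as_repr)
print(as_text)
```

Whether to replace bad bytes, ignore them, or let the exception propagate 
depends on the application; the point is that the encoding is named 
explicitly rather than guessed.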

You should read this:

http://www.joelonsoftware.com/articles/Unicode.html



-- 
Steven


