Need debugging knowhow for my creeping Unicodephobia

Steve Holden steve at holdenweb.com
Thu Feb 11 17:36:05 EST 2010


mk wrote:
> MRAB wrote:
> 
>> When working with Unicode in Python 2, you should use the 'unicode' type
>> for text (Unicode strings) and limit the 'str' type to binary data
>> (bytestrings, i.e. bytes) only.
> 
> Well OK, always use u'something', that's simple -- but isn't str what I
> get from files and sockets and the like?
> 
Yes, which is why you need to know what encoding was used to create it.
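
As a rough illustration (a sketch only, with made-up sample bytes that I
am assuming to be UTF-8), the same bytes give different text depending on
which encoding you assume. It is written so that it also runs on Python 3,
where the byte type is called bytes rather than str:

```python
# Bytes as they might arrive from a file or socket.
# Assumption for this sketch: the sender encoded them as UTF-8.
raw = b"caf\xc3\xa9"

# Decoding with the encoding that was actually used recovers the text.
text = raw.decode("utf-8")

# Decoding with a wrong (but permissive) encoding raises no error;
# it quietly produces the wrong characters instead.
wrong = raw.decode("latin-1")

print(repr(text))
print(repr(wrong))
```

Nothing in the bytes themselves says which answer is right, so the
encoding has to come from somewhere else: a protocol header, a file
format specification, or documentation.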

>> In Python 3 they've been renamed to 'str' for Unicode _strings_ and
>> 'bytes' for binary data (bytes!).
> 
> Neat, except that the process of porting most projects and external
> libraries to P3 seems to be, how should I put it, standing still? Or am
> I wrong? But that's the impression I get?
> 
No, it's probably not going as quickly as you would like, but it's
certainly not standing still. Some of these libraries are substantial
works, and there were changes to the C API that take quite a bit of work
to adapt existing code to.

> Take web frameworks for example. Does any of them have serious plans and
> work in place to port to P3?
> 
There have already been demonstrations of partially-working Python 3
Django. I can't speak to the rest.

>> Strictly speaking, only Unicode can be encoded.
> 
> How so? Can't bytestrings containing characters of, say, koi8r encoding
> be encoded?
> 
It's just terminology. If a bytestring contains koi8r characters then
(as you unconsciously recognized by your use of the word "encoding") it
already *has* been encoded.
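
A small sketch of the terminology, using a short Russian word of my own
choosing as sample data: encoding goes from Unicode text to bytes, and
moving bytes from one encoding to another means decoding first and then
encoding again.

```python
# Unicode text: the Russian word "mir", written with escapes so the
# source file itself stays plain ASCII.
text = u"\u043c\u0438\u0440"

# Encoding: Unicode text -> bytes in a particular encoding.
koi8_bytes = text.encode("koi8_r")

# The bytes are already encoded. To get them into a different encoding,
# decode back to Unicode first, then encode again.
utf8_bytes = koi8_bytes.decode("koi8_r").encode("utf-8")

print(repr(koi8_bytes))
print(repr(utf8_bytes))
```

The two byte strings differ in both content and length, although they
represent the same three characters.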

>> What Python 2 is doing here is trying to be helpful: if it's already a
>> bytestring then decode it first to Unicode and then re-encode it to a
>> bytestring.
> 
> It's really cumbersome sometimes, even if two libraries are written by
> one author: for instance, Mako and SQLAlchemy are written by the same
> guy. They are both top-of-the-line in my humble opinion, but when you
> connect them you get things like this:
> 
> 1. you query a SQLAlchemy object that happens to have string fields in
> a relational DB.
> 
> 2. Corresponding Python attributes of those objects then have type str,
> not unicode.
> 
Yes, a relational database will often return ASCII, but nowadays people
are increasingly using encoded Unicode. In that case you need to be
aware of the encoding that has been used to render the Unicode values
into the byte strings (which in Python 2 are of type str) so that you
can decode them into Unicode.

> 3. then I pass those objects to Mako for HTML rendering.
> 
> Typically, it works: but if and only if a character in there does not
> happen to be out of ASCII range. If it does, you get UnicodeDecodeError
> on an unsuspecting user.
> 
Well, first you need to be clear about what you are passing to Mako.

> Sure, I wrote myself a helper that iterates over keyword dictionary to
> make sure to convert all str to unicode and only then passes the
> dictionary to render_unicode. It's an overhead, though. It would be
> nicer to have it all unicode from db and then just pass it for rendering
> and having it working. (unless there's something in filters that I
> missed, but there's encoding of templates, tags, but I didn't find
> anything on automatic conversion of objects passed to method rendering
> template)
> 
Some database modules will distinguish between fields of type varchar
and nvarchar, returning Unicode objects for the latter. You will need to
ensure that the module knows which encoding is used in the database.
This is usually automatic.
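
If the database module does hand back byte strings, a helper of the kind
you describe might look roughly like the sketch below. The names to_text
and decode_context are hypothetical, the row is invented sample data, and
UTF-8 is only an assumed default; use whatever encoding your database
really stores.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to Unicode text; return other values unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_context(context, encoding="utf-8"):
    """Return a copy of a keyword dictionary with byte-string values decoded."""
    return dict((key, to_text(value, encoding))
                for key, value in context.items())


# Invented sample row: one byte-string field and one integer field.
row = {"name": b"Zo\xc3\xab", "age": 31}
clean = decode_context(row)
print(clean)
```

The cleaned dictionary is what you would then pass as keyword arguments
to render_unicode. The helper only looks at top-level values, so nested
lists or objects with string attributes would need extra handling.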

> But maybe I'm whining.
> 
Nope, just struggling with a topic that is far from straightforward the
first time you encounter it.
> 
>> Unfortunately, the default encoding is ASCII, and the bytestring isn't
>> valid ASCII. Python 2 is being 'helpful' in a bad way!
> 
> And the default encoding is coded in such a way that it cannot be changed in
> sitecustomize (without code modification, that is).
> 
Yes, the default encoding is not always convenient.
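
To see the failure MRAB describes without relying on the implicit step,
you can perform the ASCII decode by hand. This is only a sketch with
sample bytes of my own; Python 2 performs the equivalent decode for you
when a non-ASCII str is used where unicode is expected.

```python
# UTF-8 bytes for the word "naive" spelled with a diaeresis on the i.
# The two bytes \xc3\xaf are outside the ASCII range.
raw = b"na\xc3\xafve"

try:
    # The decode that Python 2 attempts implicitly with its default encoding.
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_offset = exc.start  # index of the first byte that is not ASCII

# Naming the real encoding explicitly avoids the error.
text = raw.decode("utf-8")
print(failed, repr(text))
```

Decoding explicitly at the point where the bytes enter your program means
the default encoding never gets a chance to be involved.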

regards
 Steve
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC                 http://www.holdenweb.com/
UPCOMING EVENTS:        http://holdenweb.eventbrite.com/



