Need debugging knowhow for my creeping Unicodephobia
mrkafk at gmail.com
Thu Feb 11 22:43:17 CET 2010
> When working with Unicode in Python 2, you should use the 'unicode' type
> for text (Unicode strings) and limit the 'str' type to binary data
> (bytestrings, ie bytes) only.
Well OK, always use u'something', that's simple -- but isn't str what I
get from files and sockets and the like?
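For illustration, here is a minimal sketch of decoding at the boundary. It assumes the data is UTF-8; the sample text and the temporary file are made up. io.open exists in Python 2.6+ as well as in Python 3.

```python
import io
import os
import tempfile

# Hypothetical sample data: some non-ASCII text encoded as UTF-8 bytes.
raw = u"za\u017c\u00f3\u0142\u0107".encode("utf-8")

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)

    # Binary mode hands back a bytestring (str in Python 2, bytes in Python 3).
    with io.open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with an explicit encoding hands back Unicode text.
    with io.open(path, "r", encoding="utf-8") as f:
        as_text = f.read()
finally:
    os.remove(path)

# Decoding the raw bytes by hand gives the same text.
assert as_bytes.decode("utf-8") == as_text
```

Data received from a socket is a bytestring too, so the same explicit .decode() applies there, provided the encoding is known.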
> In Python 3 they've been renamed to 'str' for Unicode _strings_ and
> 'bytes' for binary data (bytes!).
Neat, except that the process of porting most projects and external
libraries to Python 3 seems to be, how should I put it, standing still?
Or am I wrong? That's the impression I get, anyway.
Take web frameworks for example. Do any of them have serious plans and
work in place to port to Python 3?
> Strictly speaking, only Unicode can be encoded.
How so? Can't bytestrings containing characters of, say, koi8r encoding
> What Python 2 is doing here is trying to be helpful: if it's already a
> bytestring then decode it first to Unicode and then re-encode it to a
It's really cumbersome sometimes, even if two libraries are written by
one author: for instance, Mako and SQLAlchemy are written by the same
guy. They are both top-of-the-line in my humble opinion, but when you
connect them you get things like this:
1. you query an SQLAlchemy object that happens to have string fields in
2. Corresponding Python attributes of those objects then have type str,
3. then I pass those objects to Mako for HTML rendering.
Typically it works, but only as long as no character in there falls
outside the ASCII range. If one does, you get a UnicodeDecodeError
thrown at an unsuspecting user.
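A small sketch of that failure mode, using explicit .decode("ascii") calls to stand in for the implicit conversion Python 2 does with its ASCII default. The sample name and the UTF-8 assumption are made up for illustration.

```python
# Hypothetical value as it might come back from the database: UTF-8 bytes.
name = u"Ko\u015bciuszko".encode("utf-8")


def decodes_as_ascii(data):
    """Return True if the bytestring survives an ASCII decode."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_ascii(b"Smith"))   # True: pure ASCII goes through unnoticed
print(decodes_as_ascii(name))       # False: one non-ASCII character is enough

# Decoding with the encoding the bytes were actually written in works.
print(name.decode("utf-8") == u"Ko\u015bciuszko")   # True
```

This is why the problem tends to stay hidden in testing and only shows up once real data contains a non-ASCII character.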
Sure, I wrote myself a helper that iterates over the keyword dictionary,
converts every str to unicode, and only then passes the dictionary to
render_unicode. It's overhead, though. It would be nicer to get
everything as unicode from the db and then just pass it for rendering
and have it work. (Unless there's something in filters that I
missed: there's encoding of templates and tags, but I didn't find
anything on automatic conversion of objects passed to method rendering
But maybe I'm whining.
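A rough sketch of what such a helper could look like. This is not the actual code; the names to_text and decode_kwargs are hypothetical, and it assumes the bytestrings are UTF-8.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytestrings to Unicode text; leave everything else alone."""
    # 'bytes' is an alias for 'str' in Python 2.6+, and a separate type in Python 3.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_kwargs(kwargs, encoding="utf-8"):
    """Return a copy of the keyword dictionary with bytestring values decoded."""
    return dict((key, to_text(value, encoding)) for key, value in kwargs.items())


# Hypothetical template context mixing bytes, text and a number.
context = decode_kwargs({"title": b"caf\xc3\xa9", "count": 3, "note": u"ok"})
# 'context' can now be passed on as keyword arguments, e.g. **context.
```

Note that this only touches top-level values. Attributes of objects stored in the dictionary are left as they are, so it does not cover str fields on mapped objects.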
> Unfortunately, the default encoding is ASCII, and the bytestring isn't
> valid ASCII. Python 2 is being 'helpful' in a bad way!
And the default encoding is coded in such a way that it cannot be
changed in sitecustomize (without code modification, that is).