Diez B. Roggisch
deets at nospam.web.de
Fri Apr 18 14:27:31 CEST 2008
Robin Becker wrote:
> I'm in the process of attempting a straightforward port of a relatively
> simple package which does most of its work by writing out files with a
> more or less complicated set of possible encodings. So far I have used
> all the 2to3 tools and a lot of effort, but still don't have a working
> version. This must be the worst way to convert people to unicode. When
> tcl went through this they chose the eminently sensible route of not
> choosing a separate unicode type (they used utf8 byte strings instead).
> Not only has python chosen to burden itself with two string types, but
> with 3 they've swapped roles. This is certainly the first time I've had
> to decide on an encoding before writing simple text to a file.
Which is EXACTLY THE RIGHT THING TO DO! See below.
> Of course we may end up with a better language, but it will be a
> worse (more complex) tool for many simple tasks. Using a complex writing
> with many glyphs costs effort no matter how you do it, but I just use
> ascii :( and it's still an effort.
> I find the differences in C/OS less hard to understand than why I need
> bytes(x,'encoding') everywhere I just used to use str(x).
If you google my name + unicode, you see that I'm often answering
questions regarding unicode. I wouldn't say I'm a recognized expert on
the subject, but I certainly do know enough to deal with it whenever I
have to.
And from my experience with the problems in general, and specifically in
python, as well as trying to help others I can say that:
- 95% of the time, the problem is in front of the keyboard.
- programmers stubbornly refuse to *learn* what unicode is, and what
an encoding is, and what role utf-8 plays. Instead, they resort to a
voodoo-approach of throwing in various encode/decode-calls + a good deal
of cat's feces in hope of wriggling themselves out of the problem.
- it is NOT sensible to use utf8 as unicode-"type" - that is as bad as
it can get because you don't see the errors, but instead mangle your
data and end up with a byte-string-mess. If that is your road to heaven,
by all means choose it - and don't use unicode at all. And be prepared
for damnation :)
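
To illustrate the last point, here is a minimal sketch in Python 3 syntax. The sample word and the variable names are my own assumptions, chosen only for illustration. It shows how byte-level operations on UTF-8 data give wrong answers without raising anything at the point where the mistake is made:

```python
# Illustrative sketch only (Python 3); the sample data is hypothetical.
text = "Größe"                    # a real text string: 5 characters
raw = text.encode("utf-8")        # the same word as UTF-8 bytes: 7 bytes

char_count = len(text)            # counts characters
byte_count = len(raw)             # counts bytes, not characters

# Slicing the bytes can cut a multi-byte character in half. The slice
# itself succeeds silently; the damage only shows up on a later decode.
clipped = raw[:3]
try:
    clipped.decode("utf-8")
    clipped_is_valid = True
except UnicodeDecodeError:
    clipped_is_valid = False

# Decoding with the wrong encoding does not fail at all, it just
# produces garbage that travels on through the program.
mojibake = raw.decode("latin-1")
```

With a separate text type, the length and the slice operate on characters, and a wrong or missing encoding surfaces at the boundary instead of deep inside the program.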
If your programs worked until now, but don't anymore because of Py3K
introducing mandatory unicode-objects for string-literals, it pretty much
follows that they only *seemed* to work, and would very, very probably
fail in the face of actual i18nized data.
The *only* sensible thing to do is follow these simple rules - and these
apply with python 2.x, and will be enforced by 3k, which is a good thing:
- when you read data from somewhere, make sure you know which encoding
it has, and *immediately* convert it to unicode
- when you write data, make sure you know which encoding you want it
to have (when in doubt, choose utf-8 to prevent loss of data) and apply it.
- XML-parsers take byte-strings & spit out unicode. Period.
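
The three rules above can be sketched roughly as follows, in Python 3 syntax. The helper names `read_text` and `write_text` and the sample data are hypothetical, made up for this illustration; only `bytes.decode`, `str.encode` and `xml.etree.ElementTree.fromstring` are actual standard-library calls:

```python
# Illustrative sketch only (Python 3); helper names and data are hypothetical.
import xml.etree.ElementTree as ET


def read_text(data: bytes, encoding: str) -> str:
    """Rule 1: decode incoming bytes at the boundary, with a known encoding."""
    return data.decode(encoding)


def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Rule 2: encode explicitly on the way out; utf-8 can represent any text."""
    return text.encode(encoding)


incoming = b"Gr\xf6\xdfe"                # "Größe" as Latin-1 bytes
text = read_text(incoming, "latin-1")    # from here on, work with text only
outgoing = write_text(text)              # the same word as UTF-8 bytes

# Rule 3: the parser is fed bytes, reads the declared encoding itself,
# and hands back text.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><word>Gr\xf6\xdfe</word>'
parsed = ET.fromstring(doc).text
```

Inside the program everything is text; bytes only exist at the edges, where the encoding is named explicitly.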
I neither want to imply that you are an idiot nor that unicode doesn't
have its complexities. And I'd love to say that Python wouldn't add to
these by having two string-types.
But the *real* problem is that it used to have only bytestrings, and
finally Py3K will solve that issue.