how can I convert invalid ASCII string to Unicode?
tim.one at home.com
Wed May 9 07:37:48 CEST 2001
[skip at pobox.com]
> I have been blissfully ignoring Unicode. Alas, my bliss has been
> so rudely interrupted...
I'm afraid the docs aren't a lot of help here, either. There's a very nice
Grand Architecture that's been reduced to a handful of ambiguously defined
builtin functions without any examples -- painful.
> Suppose I have this string:
> s = "ö" # "o" with an umlaut
> and I'd like to convert it to UTF-8.
You're way off base already, Skip <0.5 wink>: Python can't read your mind.
You have to repeat that over and over until it sinks in. While what you've
shown there is an instance of Python's string *type*, it's just a blob of
arbitrary 8-bit binary data until you tell Python how it was *intended* to be
interpreted. Python refuses to assume *anything* about 8-bit strings except
that they make sense as 7-bit ASCII. In particular, the idea that the raw
binary blob "\xf6" is "'o' with an umlaut" is pure fiction, albeit one your
terminal may be quite insistent on sharing with you <wink>.
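A minimal sketch of that point, in Python 3 spelling, where an 8-bit string is written as `bytes` and `unicode(blob, enc)` becomes `blob.decode(enc)`. That translation is an assumption layered on the 2.x session in this post, and the two extra codecs are just standard-library examples picked for illustration. The same single byte comes out as a different character under each:

```python
# One byte, three interpretations. Nothing in the byte itself
# says which interpretation was intended.
blob = b"\xf6"

as_latin1 = blob.decode("latin-1")  # U+00F6, "o" with an umlaut
as_cp437 = blob.decode("cp437")     # old IBM PC code page
as_koi8r = blob.decode("koi8-r")    # a Cyrillic encoding

# Print code points rather than glyphs, so the terminal's own
# opinion about encodings stays out of the picture.
for name, text in [("latin-1", as_latin1),
                   ("cp437", as_cp437),
                   ("koi8-r", as_koi8r)]:
    print(name, hex(ord(text)))
```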
If you believe your binary blobs were meant to be interpreted as Latin-1, then
tell the unicode() function explicitly:
>>> unicode("\xf6", "latin-1")
u'\xf6'
That much is a trivial conversion, since the first 256 code points in Unicode
coincide with Latin-1. For other encodings it's not so trivial. But:
> (I know I can preface string literals with 'u', but that's not
> an option here. Pretend s was assigned from a file read.)
Bingo. The example above answered this one.
> Simply executing
> u = unicode(s)
> fails because ord(s) is > 127.
Right: Python has no idea what you think you mean when you go beyond 7-bit
ASCII. This seems to be a real stumbling block for people, alas, perhaps
because they're so steeped in the illusion that their local encoding is the
only one ...
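A small sketch of the same refusal, again in Python 3 spelling (an assumption; the post itself uses the 2.x unicode() builtin). Python 3's `decode()` has a different default, so the ASCII assumption is passed explicitly here to mimic what `unicode(s)` did:

```python
blob = b"\xf6"

try:
    blob.decode("ascii")  # the 7-bit assumption, made explicit
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # exc.reason carries the codec's own explanation of the failure
    outcome = "refused: %s" % exc.reason

print(outcome)

# Pure 7-bit data goes through without complaint.
plain = b"abc".decode("ascii")
```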
> I eventually figured out that the following would work:
> u = "".join([unichr(ord(c)) for c in s])
> but this seems a bit obscure.
Not to mention nonsense <wink>.
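One reading of that remark: mapping each byte to the code point with the same number is exactly what a Latin-1 decode does, so the join trick bakes in the Latin-1 assumption without ever saying so. A quick check, in Python 3 spelling (an assumption; `chr(b)` on a byte value stands in for `unichr(ord(c))`):

```python
data = bytes(range(256))  # every possible 8-bit value

# The join trick: byte value N becomes code point N.
joined = "".join(chr(b) for b in data)

# The explicit spelling of the same assumption.
decoded = data.decode("latin-1")

print(joined == decoded)
```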
> Is there a cleaner way to convert plain strings containing
> characters > 127 to UTF-8?
So far you haven't found *any* way! All the above accomplished was to
convert a Latin-1 encoded binary blob into a Unicode string. To get from
that to UTF-8 requires another explicit encoding step:
>>> unicode("\xf6", "latin-1").encode("utf-8")
'\xc3\xb6'
Note that the result is not a Unicode string, it's another flavor of binary
blob (8-bit string). Which we'll use to illustrate the other direction too:
going from a UTF-8-encoded binary blob back to a Latin-1-encoded binary blob:
>>> print unicode("\xc3\xb6", "utf-8").encode("latin-1")
Which, on my terminal at the moment, displays as an o with an umlaut.
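The whole round trip can be sketched in one place. This uses Python 3 spelling (an assumption; there the 8-bit blobs are `bytes` and the Unicode string is plain `str`), and the variable names are only illustrative:

```python
latin1_blob = b"\xf6"

# Step 1: say how the blob was meant to be interpreted.
text = latin1_blob.decode("latin-1")

# Step 2: encode the Unicode text as UTF-8; the result is bytes again.
utf8_blob = text.encode("utf-8")

# Other direction: UTF-8 bytes -> text -> Latin-1 bytes.
back = utf8_blob.decode("utf-8").encode("latin-1")

print(repr(utf8_blob), repr(back))
```

Every arrow names its encoding explicitly; nothing is left for Python to guess.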
For hints about other encodings you can use and how to use them, look for
"codecs" in the Library manual, and just look at the names of the files in
your Lib/encodings/ directory (they correspond in an obvious way with the
names of available codecs).
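A rough sketch of doing that lookup from inside Python rather than by browsing the directory. It assumes a current Python 3, where `codecs.lookup()` returns an object with a `name` attribute and `pkgutil` can list the modules in the `encodings` package; neither detail comes from the original post:

```python
import codecs
import encodings
import pkgutil

# Ask the codec registry what an alias resolves to.
info = codecs.lookup("latin-1")
print(info.name)

# List the module names in the encodings package, which is the
# same information as the file names in Lib/encodings/.
codec_modules = sorted(
    name for _, name, _ in pkgutil.iter_modules(encodings.__path__)
)
print(len(codec_modules), codec_modules[:5])
```

The list also contains a few helper modules (such as `aliases`) that are not codecs themselves.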