Trouble with unicode

Brian Quinlan BrianQ at ActiveState.com
Tue May 15 19:28:04 CEST 2001


> UnicodeError: UTF-8 decoding error: invalid data
> mm, how can I check to see what type of encoding it is?

There is no simple way of doing so. I seem to recall that you are dealing
with e-mail, so there might be an RFC that describes how to determine the
encoding. Also, if you send me the file (by private mail!) I can probably
tell you the encoding.
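
For MIME mail, the relevant RFCs are RFC 2045 and RFC 2046: the sender
declares the encoding in the charset parameter of the Content-Type header.
Here is a minimal sketch of reading it, assuming a current Python 3 and its
standard email package (which postdates this thread). The sample message and
the helper name decode_body are made up for illustration, and a sender can
still declare the wrong charset.

```python
import email

def decode_body(raw_bytes):
    """Return (charset, text) for a single-part message (hypothetical helper)."""
    msg = email.message_from_bytes(raw_bytes)
    # With no charset parameter, MIME text defaults to US-ASCII.
    charset = msg.get_content_charset() or "ascii"
    # decode=True undoes the Content-Transfer-Encoding and returns bytes.
    payload = msg.get_payload(decode=True)
    return charset, payload.decode(charset)

# Made-up sample message; the body contains one Latin-1 byte (0xE9).
raw = (
    b"From: orders@example.com\r\n"
    b"Content-Type: text/plain; charset=\"iso-8859-1\"\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Caf\xe9 order\r\n"
)

charset, text = decode_body(raw)
```

If the header is missing or wrong, you are back to guessing, which is why
looking at the actual bytes is often the quickest way to tell.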

> thanx. There's probably a good reason for all the smoke and mirrors but I
> don't see why I can't do simple encode, decodes and it's going to take a
> while working out how this lookup works - I would call it cryptic even if
> though I realise it's being very OO.

It's not really smoke and mirrors. The steps are as follows:

1. look up the decoder for the source encoding
2. look up the encoder for the destination encoding
3. use the decoder to decode the input into unicode
4. use the encoder to encode the unicode into a buffer

lookup() returns a tuple of functions (encoder, decoder, stream_reader,
stream_writer).

See the details at:

http://www.python.org/doc/current/lib/module-codecs.html
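
The four steps above can be sketched as follows. This assumes a current
Python 3, where codecs.lookup() returns a CodecInfo object that still unpacks
as the same four-item tuple and also exposes the functions as attributes. The
function name transcode and the sample bytes are only for illustration.

```python
import codecs

def transcode(data, source_encoding, dest_encoding, errors="strict"):
    """Convert bytes from one encoding to another (illustrative helper)."""
    # 1. look up the decoder for the source encoding
    decoder = codecs.lookup(source_encoding).decode
    # 2. look up the encoder for the destination encoding
    encoder = codecs.lookup(dest_encoding).encode
    # 3. decode the input into unicode; the codec functions return
    #    a pair (result, number of input items consumed)
    text, _consumed = decoder(data, errors)
    # 4. encode the unicode into a buffer
    output, _consumed = encoder(text, errors)
    return output

latin1_bytes = b"Caf\xe9"
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
```

In everyday code the shortcut data.decode("latin-1").encode("utf-8") does the
same thing; the lookup() route is mainly useful when the encoding names are
only known at run time or when you need the stream reader and writer.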

> It works. But I don't want a load of question marks! I want the special
> characters. I particularly want to be able to replace them all at once.

The problem is that there is no way to represent all possible Unicode
characters in ASCII, so the encoder is just replacing the ones it cannot
represent with question marks. If your input is Latin-1 (which I believe is
the case) and your output is ASCII, your algorithm will work just dandy for
the characters that ASCII does have.
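
The question marks come from the "replace" error handler. As a sketch of the
alternatives, assuming a current Python 3 (the "xmlcharrefreplace" and
"backslashreplace" handlers did not exist when this was written), here is
what each handler produces for a made-up sample string:

```python
# Sample text: e-acute (U+00E9) is in Latin-1, the euro sign (U+20AC) is not.
text = "Caf\u00e9 \u20ac5"

# "replace" substitutes a question mark for every unencodable character.
as_question_marks = text.encode("ascii", "replace")

# "xmlcharrefreplace" keeps the information as numeric character references.
as_char_refs = text.encode("ascii", "xmlcharrefreplace")

# "backslashreplace" keeps it as Python-style escape sequences.
as_escapes = text.encode("ascii", "backslashreplace")

# Encoding to Latin-1 instead of ASCII keeps e-acute as a real byte;
# only the euro sign has to be replaced.
as_latin1 = text.encode("latin-1", "replace")
```

If the goal is to keep the special characters themselves, the destination
encoding has to be one that contains them (Latin-1 or UTF-8, for example);
no error handler can make ASCII hold them.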

> Maybe I should provide the context for my work. I've written
> a script which reads orders which come via e-mail, and writes
> the significant data to file attributes generating a pseudo
> database in the file system.

I would imagine that BeOS has a Unicode-aware file system? There might be an
API that allows you to set Unicode attributes directly.




