Unicode is driving me nuts!

Fri Mar 12 19:34:42 EST 2004

    Anthony> str = unicode(raw_str, myencoding)

    Anthony> This works just fine with a small sample Chinese document.

    Anthony> But when I attempted to run the script on the entire corpus, I
    Anthony> get the typical "incomplete multibyte sequence error" or
    Anthony> "UnicodeEncodeError: 'ascii' codec can't encode characters in
    Anthony> position 0-23: ordinal not in range(128)"

Can you craft a small example that triggers the error even though you
believe the input is correctly encoded?
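
For reference, here is a minimal sketch of how each of the two errors quoted
above can arise. It is not taken from Anthony's script. The "gb2312" codec is
an assumption standing in for myencoding, which the post does not name, and
error_name is a hypothetical helper used only for illustration. The decode
call plays the role of unicode(raw_str, myencoding).

```python
# Sketch only: "gb2312" is an assumed stand-in for myencoding.
text = u"\u4e2d\u6587"              # two Chinese characters
raw = text.encode("gb2312")         # byte string, two bytes per character

# Decoding complete input works, as with the small sample document.
assert raw.decode("gb2312") == text


def error_name(func):
    """Hypothetical helper: name of the Unicode error func raises, or None."""
    try:
        func()
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# Input cut in the middle of a character cannot be decoded
# ("incomplete multibyte sequence").
truncated = error_name(lambda: raw[:3].decode("gb2312"))

# Encoding the decoded text with the ASCII codec fails for any
# non-ASCII character ("ordinal not in range(128)").
ascii_failure = error_name(lambda: text.encode("ascii"))

print(truncated, ascii_failure)
```

If this sketch matches what is happening, the two messages point at different
places: the first at the input bytes (truncated, or not really in the assumed
encoding), the second at some later step that converts the decoded text back
to a byte string with the default ASCII codec, such as printing it or writing
it to a file.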

    Anthony> I am at my wit's end, so frustrated at handling
    Anthony> non-ascii texts.

Unicode creates lots of problems for the uninitiated.  I pulled my hair out
for a long time.  It took me a couple of tries to get my system to work
(more or less) with Unicode, and it still has the occasional problem.

Skip