Unicode and MoinMoin

Mon Feb 27 00:49:54 EST 2006

Greg:

> The only issue I'm having relates to Unicode. MoinMoin and python are
> pretty unforgiving about files that contain Unicode characters that
> aren't included in the coding properly. I've spent hours reading about
> Unicode, and playing with different encoding/decoding commands, but at
> this point, I just want a hacky solution that will ignore the
> improperly coded characters or replace them with placeholders.

    Call the codec with the errors argument set to "ignore" or "replace".

 >>> unicode('AUTHOR: blahblah\n\nTITLE: Reading Course Readings... G. A. \x96 For references see blahblah.\n\n\n-----\n\n', 'utf8')
Traceback (most recent call last):
   File "<interactive input>", line 1, in ?
   File "c:\python24\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 58: unexpected code byte
 >>> unicode('AUTHOR: blahblah\n\nTITLE: Reading Course Readings... G. A. \x96 For references see blahblah.\n\n\n-----\n\n', 'utf8', 'replace')
u'AUTHOR: blahblah\n\nTITLE: Reading Course Readings... G. A. \ufffd For references see blahblah.\n\n\n-----\n\n'
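
    For comparison, here is the same stray byte under both error handlers. This is only a sketch, and it assumes a newer Python in which byte strings are written with a b prefix and decoded with .decode(); on Python 2.4 the unicode(data, 'utf8', errors) call shown above does the same job. The sample string is a shortened version of the one in the session.

```python
# Sample data: ASCII text with one stray byte (0x96) that is not valid UTF-8.
data = b'G. A. \x96 For references see blahblah.'

# "ignore" silently drops the undecodable byte.
ignored = data.decode('utf-8', 'ignore')

# "replace" substitutes U+FFFD, the Unicode replacement character.
replaced = data.decode('utf-8', 'replace')

print(repr(ignored))
print(repr(replaced))
```

    "ignore" loses the fact that anything was there at all, while "replace" leaves a visible marker you can search for later.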

    BTW, it's probably in Windows-1252, where byte 0x96 is an en dash. 
Depending on your context, it may pay to handle the exception instead of 
using "replace" and try interpreting the data as Windows-1252.
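
    A rough sketch of that fallback, again assuming a newer Python with b'' literals and .decode(). The helper name decode_text is made up for illustration and is not part of MoinMoin or the standard library.

```python
def decode_text(data):
    """Decode bytes as UTF-8, falling back to Windows-1252 on failure."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so guess Windows-1252 (an assumption about the
        # data). 'replace' covers the few byte values, such as 0x81, that
        # Windows-1252 leaves undefined.
        return data.decode('cp1252', 'replace')


text = decode_text(b'G. A. \x96 For references see blahblah.')
print(repr(text))
```

    The guess is only as good as your knowledge of where the files came from; text in some other 8-bit encoding will decode without an error but come out wrong.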

    Neil