pyexpat and unicode

Sylvain Thenault syt at gemini.logilab.fr
Tue Dec 18 04:23:36 EST 2001


On Mon, 17 Dec 2001 23:49:57 GMT, Alex Martelli <aleax at aleax.it> wrote:
>import sys
>import xml.parsers.expat
>parser = xml.parsers.expat.ParserCreate(encoding='utf8')
>
>data_uni = u"<?xml version='1.0' encoding='UTF-8' ?><hello>\202</hello>"
>data     = "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
>
>denc = data_uni.encode('utf8')
>
>for thedata in data_uni, data, denc:
>    parser = xml.parsers.expat.ParserCreate(encoding='utf8')
>    print 'parsing', repr(thedata)
>    try: parser.Parse(thedata, 1)
>    except:
>        print 'oops', sys.exc_info()[0]
>    print 'done'
>
>[alex at arthur alex]$ python a.py
>parsing u"<?xml version='1.0' encoding='UTF-8' ?><hello>\x82</hello>"
>oops exceptions.UnicodeError
>done
>parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
>done
>parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>\xc2\x82</hello>"
>oops xml.parsers.expat.ExpatError
>done
>[alex at arthur alex]$
>
>The first one corresponds to what you're seeing (passing unicode data tries 
>to encode it with your default encoding, and the default's default is 
>ascii), the second one is a string that fits within the 'ascii' subset of 
>utf-8... and I don't know what to make of the third one, which I thought 
>would work.
>

Replacing parser = xml.parsers.expat.ParserCreate(encoding='utf8')
with parser = xml.parsers.expat.ParserCreate(),
I obtain the following results:

parsing u"<?xml version='1.0' encoding='UTF-8' ?><hello>\x82</hello>"
oops exceptions.UnicodeError
done
parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>there</hello>"
1
done
parsing "<?xml version='1.0' encoding='UTF-8' ?><hello>\xc2\x82</hello>"
1
done  

Another try: replacing parser = xml.parsers.expat.ParserCreate()
with parser = xml.parsers.expat.ParserCreate(encoding='UTF-8')
gives the same results.
My conclusion is that pyexpat does not recognize the string 'utf8' as a
valid encoding name, while the unicode methods do.
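
For anyone who wants to repeat the comparison, here is a minimal sketch
that runs the same UTF-8 encoded document through a parser created with
no encoding, with 'UTF-8', and with 'utf8'. It is written for a current
Python 3 interpreter, which is an assumption on my part: results there
may differ from the output shown above. The helper name try_parse and
the sample character (U+00E9) are made up for illustration. Only the
outcome is recorded for 'utf8'; nothing is assumed about whether it
succeeds.

```python
import xml.parsers.expat

# UTF-8 encoded bytes; U+00E9 becomes the two bytes 0xC3 0xA9.
DOC = "<?xml version='1.0' encoding='UTF-8' ?><hello>\u00e9</hello>".encode("utf-8")


def try_parse(encoding_name):
    """Parse DOC with a parser created for encoding_name (None = no override).

    Returns (character data, None) on success,
    or (None, exception class name) on failure.
    """
    chunks = []
    try:
        parser = xml.parsers.expat.ParserCreate(encoding=encoding_name)
        # Character data may arrive in several pieces, so collect them all.
        parser.CharacterDataHandler = chunks.append
        parser.Parse(DOC, True)
    except Exception as exc:
        return None, type(exc).__name__
    return "".join(chunks), None


# Compare the three spellings discussed in this thread.
results = {name: try_parse(name) for name in (None, "UTF-8", "utf8")}
for name, (text, error) in results.items():
    outcome = repr(text) if error is None else "failed: " + error
    print(repr(name), "->", outcome)
```

If the 'utf8' line reports a failure while the other two succeed, that
is consistent with the conclusion above; if all three succeed, the
interpreter in use handles the alias differently.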

-- 
Sylvain Thenault

  LOGILAB           http://www.logilab.org



