Does python's minidom support Chinese?

Uche Ogbuji uche at ogbuji.net
Sun Mar 14 23:09:45 EST 2004


Anthony Liu <antonyliu2002 at yahoo.com> wrote in message news:<mailman.250.1078955721.19534.python-list at python.org>...
> The following 4 lines of code parses an XML document
> very well if the XML document contains only English
> words.
> 
> But when I insert one Chinese character into the XML
> document, then Python starts to complain when it hits
> the Chinese character, saying that it is an invalid
> token and thus it is not well-formed.
> 
> This is the complaint of Python:
> 
> ExpatError: not well-formed (invalid token): line 3,
> column 7
> 
> line 3 and column 7 exactly pinpoints the 1st Chinese
> character in the XML document.

This is an XML problem on your end, not a minidom problem.  That error
probably means that you are either omitting the XML declaration (and
thus defaulting to UTF-8 or UTF-16) or declaring a bogus encoding.
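
To make the point concrete, here is a minimal sketch, assuming a current Python 3 interpreter with the standard CJK codecs available (the original question predates it, and the sample document below is made up for illustration). It parses the same one-character document twice: once as UTF-8 bytes with a matching declaration, and once as GBK bytes with no encoding declared, which expat then tries to read as UTF-8.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: a root element holding one Chinese character (U+4E2D).
body = '<doc>\u4e2d</doc>'

# Case 1: the bytes are UTF-8 and the declaration says so. This parses.
good = ('<?xml version="1.0" encoding="UTF-8"?>\n' + body).encode("utf-8")
dom = minidom.parseString(good)
parsed_char = dom.documentElement.firstChild.data

# Case 2: the bytes are GBK but no encoding is declared, so the parser
# falls back to UTF-8 and rejects the first byte sequence it cannot decode.
bad = ('<?xml version="1.0"?>\n' + body).encode("gbk")
try:
    minidom.parseString(bad)
    error = None
except ExpatError as exc:
    error = exc  # expected: a "not well-formed" style complaint
```

The second case fails at the Chinese character, which matches the position reported in the question.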


> The problem remains even if I try encoding="UTF-16" or
> encoding="GB2312" or encoding="GBK" in the xml
> document.

Well, you can't just go shopping about for one.  You have to find out
which encoding the document is actually saved in and declare it accordingly.

Back to minidom: even after you fix your XML problems you may still
have trouble with minidom because the expat reader has to understand
the encoding you're using.  I think that it may use the Python codecs
model to find the encoding you declared, so you may just need to
install a Python Chinese codecs package, and you'll be all set.  I'm
not entirely sure this is the case, though.
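
If the parser turns out not to understand the declared encoding, one workaround is to let Python do the decoding and hand minidom text instead of bytes. The following is only a sketch under assumptions: it targets Python 3, `parse_with_codec` is a hypothetical helper name, and the codec passed in must be the encoding the bytes were really saved in. It does not depend on whether expat can look up the codec on its own.

```python
from xml.dom import minidom

def parse_with_codec(raw, codec):
    """Decode raw bytes with a Python codec, then parse the resulting text.

    Hypothetical helper: `codec` must name the encoding the bytes are
    actually in (for example "gbk"), not merely what the header claims.
    """
    text = raw.decode(codec)
    # When given str, the parser works from the already-decoded text.
    return minidom.parseString(text)

# Made-up sample document saved as GBK, with a matching declaration.
raw = ('<?xml version="1.0" encoding="GBK"?>\n'
       '<doc>\u4e2d\u6587</doc>').encode("gbk")

dom = parse_with_codec(raw, "gbk")
value = dom.documentElement.firstChild.data
```

If the wrong codec is passed, `raw.decode` will either raise `UnicodeDecodeError` or silently produce the wrong characters, so this only helps once the real encoding is known.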


--Uche
http://uche.ogbuji.net


