Processing XML files in CJK encodings

Andrew Clover and-google at doxdesk.com
Sun Oct 24 21:26:48 CEST 2004


Gen <gshibaya at gmail.com> wrote:

> I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.

I assume you've already got CJKCodecs (or Python 2.4 where it's
built-in).

The main problem is that the expat parser (on which much Python XML
kit relies) doesn't understand the double-byte (DBCS) encodings. There
are two ways around this. Either use an initial recoding step to UTF-8:

  from xml.dom import minidom
  # 'bytes' is the raw GB2312-encoded document as a byte string
  xml= unicode(bytes, 'gb2312').encode('utf-8')
  doc= minidom.parseString(xml)

(If your input documents have an <?xml ... encoding="gb2312" ?>
declaration this will also have to be changed to encoding="utf-8" or
simply removed.)
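
A minimal sketch of the recoding step with the declaration fix folded in,
written for Python 3 rather than the Python 2 of the snippet above. The
helper name parse_recoded and the regular expression are illustrative
assumptions, not part of minidom, and the pattern only handles a
declaration at the very start of the document:

```python
import re
from xml.dom import minidom

# Matches the encoding pseudo-attribute of an XML declaration that sits
# at the very start of the text (hypothetical helper pattern).
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)

def parse_recoded(raw, source_encoding):
    """Decode raw bytes, relabel the declaration as UTF-8, parse with minidom."""
    text = raw.decode(source_encoding)
    # Rewrite encoding="gb2312" (or similar) to encoding="utf-8",
    # keeping whichever quote character the document used.
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return minidom.parseString(text.encode('utf-8'))
```

Called as parse_recoded(data, 'gb2312'), this hands expat only UTF-8
bytes, so the CJK codec is needed for the decode step alone.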

OR, use a pure-Python XML parser, so it'll have access to CJKCodecs.
That means xmlproc+4DOM (validating) or pxdom (non-validating). This
is, in comparison to the recoding method, rather slow.

[Aside: have just released pxdom 1.2:

  http://www.doxdesk.com/software/py/pxdom.html

I've processed a bunch of Shift-JIS material with this before without
problem.]

> Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
> doesn't work with Ja in UTF-8.

Ohh. That's a bad one. Actually I'm surprised if it works with GB.

Here's a quick fix; I can't guarantee it's correct as I haven't really
played with xmlproc much but it fixes the error for me when parsing
strings. Oh, checking this out at the SourceForge tracker it looks
like the original reporter came up with the same idea, so it might be
okay. :-)

Near the end of method parse_xml_decl (in PyXML 0.8.3 this is at line
723) in _xmlplus.parsers.xmlproc.xmlutils:

            try:
                self.data = self.charset_converter(self.data)
                self.datasize= len(self.data)       ### ADD THIS LINE
            except UnicodeError, e:
                self._handle_decoding_error(self.data, e)
            self.input_encoding = enc1

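A rough illustration of why a stale datasize would hurt, assuming it was
computed from the undecoded bytes (that is my guess about xmlproc, not
something I have traced). The function name lengths is made up for the
example, and it is written for Python 3:

```python
def lengths(text, encoding):
    """Return (byte length, character length) of text in the given encoding."""
    raw = text.encode(encoding)
    # After decoding, the buffer holds characters, so a size measured
    # in bytes beforehand no longer matches it.
    return len(raw), len(raw.decode(encoding))

utf8_sizes = lengths('<doc>日本語</doc>', 'utf-8')   # (20, 14)
gb_sizes = lengths('<doc>中文</doc>', 'gb2312')      # (15, 13)
```

Any multi-byte input shrinks when converted, so a size taken before the
conversion points past the end of the data.
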
-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/


