elementtree and gbk encoding
Steven Bethard
steven.bethard at gmail.com
Tue Mar 14 17:10:55 EST 2006
Diez B. Roggisch wrote:
> Steven Bethard schrieb:
>> I'm having trouble using elementtree with an XML file that has some
>> gbk-encoded text. (I can't read Chinese, so I'm taking their word for
>> it that it's gbk-encoded.) I always have trouble with encodings, so
>> I'm sure I'm just screwing something simple up. Can anyone help me?
>>
>> Here's the interactive session. Sorry it's a little verbose, but I
>> figured it would be better to include too much than not enough. I
>> basically expected et.ElementTree(file=...) to fail since no encoding
>> was specified, but I don't know what I'm doing wrong when I use
>> codecs.open(...)
>
> The first and most important lesson to learn here is that well-formed
> XML must contain an xml-header that specifies the encoding used. This has
> two consequences for you:
>
> 1) all xml-parsers expect byte-strings, as they have to first read the
> header to know what encoding awaits them. So no use reading the xml-file
> with a codec - even if it is the right one. It will get converted back
> to a string when fed to the parser, with the default codec being used -
> resulting in the well-known unicode error.
>
> 2) your xml is _not_ well-formed, as it doesn't contain an xml-header!
> You need to ask these guys to deliver the xml with a header. Of course for
> now it is ok to just prepend the text with something like <?xml
> version="1.0" encoding="gbk"?>. But I'd still request them to deliver it
> with that header - otherwise it is _not_ XML, but just something that
> happens to look similar, isn't guaranteed to be well-formed, and thus
> can't be safely fed to a parser.

Thanks, that's very helpful. I'll definitely harass the people
producing these files to make sure they put encoding declarations in them.

Here's what I get with the prepending hack:

>>> et.fromstring('<?xml version="1.0" encoding="gbk"?>\n' +
...               open(filename).read())
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 960, in XML
    parser.feed(text)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
ExpatError: unknown encoding: line 1, column 30
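
One possible way around this error, as a minimal sketch rather than a confirmed fix: decode the bytes with Python's own gbk codec and hand the parser UTF-8, which expat reads natively. This assumes Python 3 and the standard-library xml.etree.ElementTree rather than the standalone elementtree package used in the session above. The helper name parse_gbk_xml and the sample document are hypothetical, and real input would still have to be well-formed (quoted attribute values, for instance).

```python
import xml.etree.ElementTree as ET


def parse_gbk_xml(raw_bytes):
    """Parse XML bytes assumed to be GBK-encoded and lacking an encoding declaration.

    Hypothetical helper: the bytes are transcoded to UTF-8 first so the
    parser never has to understand GBK itself.
    """
    text = raw_bytes.decode("gbk")      # bytes -> str, using Python's gbk codec
    utf8_bytes = text.encode("utf-8")   # str -> UTF-8 bytes (XML's default encoding)
    return ET.fromstring(utf8_bytes)


# Made-up sample standing in for the real file contents.
sample = "<DOC><DOCID>ART242</DOCID><NR>\u4f0f\u660e\u971e</NR></DOC>".encode("gbk")
root = parse_gbk_xml(sample)
print(root.findtext("DOCID"))
print(ascii(root.findtext("NR")))
```

With a real file, the bytes would come from open(filename, "rb").read(), so that no implicit decoding happens before the explicit one.
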
Are the XML encoding names different from the Python ones? The "gbk"
encoding seems to work okay from Python:
>>> open(filename).read().decode('gbk')
u'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
(LCP-TMP (IP (NP-PN-SBJ (NR \u4f0f\u660e\u971e)) \n\t\t (VP (VV
\u83b7\u5f97) \n\t\t\t (NP-OBJ (NN \u5973\u5b50) \n\t\t\t\t (NN
\u8df3\u53f0) \n\t\t\t\t (NN \u8df3\u6c34) \n\t\t\t\t (NN
\u51a0\u519b)))) \n\t\t (LC \u540e)) \n (PU \uff0c) \n
(NP-SBJ (NP-PN (NR \u82cf\u8054\u961f)) \n (NP (NN
\u6559\u7ec3))) \n (VP (ADVP (AD \u70ed\u60c5)) \n
(PP-DIR (P \u5411) \n\t\t (NP (PN \u5979))) \n (VP
(VV \u795d\u8d3a))) \n (PU \u3002)) ) \n</S>\n<S ID=2567>\n(
(FRAG (NR \u65b0\u534e\u793e) \n (NN \u8bb0\u8005) \n
(NR \u7a0b\u81f3\u5584) \n (VV \u6444) ))
\n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'
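
On the naming question, a quick check (sketched for Python 3, using only the standard codecs module) suggests that "gbk" is a name Python itself resolves, along with aliases such as "cp936". As far as I know, expat only handles a few encodings natively (UTF-8, UTF-16, ISO-8859-1, US-ASCII), so a name being valid in Python does not by itself mean the parser will accept it. The list of names tried here is only an illustration.

```python
import codecs

# Names to try; "cp936" is an alias that Python maps to its gbk codec.
candidates = ("gbk", "GBK", "cp936")

# Resolve each name to the canonical codec name Python uses internally.
resolved = {name: codecs.lookup(name).name for name in candidates}
for name, canonical in resolved.items():
    print(name, "->", canonical)

# Round-trip a few characters from the session above through the codec.
original = "\u4f0f\u660e\u971e"
round_tripped = original.encode("gbk").decode("gbk")
print(round_tripped == original)
```
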
STeve