elementtree and gbk encoding
Steven Bethard
steven.bethard at gmail.com
Wed Mar 15 16:05:37 EST 2006
Fredrik Lundh wrote:
> Steven Bethard wrote:
>
>> Hmm... I downloaded the newest cElementTree (and I already had the
>> newest ElementTree), and here's what I get:
>
>> >>> tree = myparser(filename, 'gbk')
>> Traceback (most recent call last):
>> File "<interactive input>", line 1, in ?
>> File "<interactive input>", line 8, in myparser
>> SyntaxError: not well-formed (invalid token): line 8, column 6
>>
>> FWIW, the file used above doesn't have an <?xml encoding?> header:
>>
>> >>> open(filename).read()
>> '<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
>> <DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>
>
> <S ID=2566> isn't a valid XML tag (the attribute value must be quoted)
>
> if I recode the file into UTF-8 and fix the two S tags, the result displays
> just fine in IE and Firefox (I get a few boxes/question marks, but I assume
> that's a font problem).
Thanks (to both Fredrik and Just). You stare at XML too long and you
start to miss the obvious things too. =)
Everything works great now:
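>>> import re
>>> import cElementTree as et  # assumed import; the original session does not show it
>>> text = open(filename).read()
>>> # quote the bare attribute values, e.g. <S ID=2566> becomes <S ID="2566">
>>> text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
>>> # recode from GBK to UTF-8, which the parser assumes when there is no XML declaration
>>> text = text.decode('gbk').encode('utf-8')
>>> et.fromstring(text)
<Element 'DOC' at 00A2AF38>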
=)
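For anyone reading this later on Python 3, here is a rough sketch of the same two-step fix (quote the attribute values, then hand the parser properly decoded text). It is not the session above: it assumes the standard library's xml.etree.ElementTree in place of cElementTree, the helper name parse_gbk_xml is made up for illustration, and the sample document is invented to resemble the file shown earlier.

```python
import re
import xml.etree.ElementTree as ET


def parse_gbk_xml(raw_bytes):
    """Parse GBK-encoded, almost-XML bytes (hypothetical helper)."""
    # Decode to str first so the parser never has to guess the encoding.
    text = raw_bytes.decode('gbk')
    # Quote bare attribute values: <S ID=2566> becomes <S ID="2566">.
    text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
    return ET.fromstring(text)


# Invented sample modelled on the file in this thread; the escapes are two
# Chinese characters, included so the GBK round trip does real work.
sample = ('<DOC>\n<DOCID>ART242</DOCID>\n<BODY>\n'
          '<S ID=2566>\u4e2d\u6587</S>\n</BODY>\n</DOC>').encode('gbk')

root = parse_gbk_xml(sample)
print(root.tag, root.find('BODY/S').get('ID'))
```

Passing a decoded str means no re-encoding to UTF-8 is needed, which is the main difference from the Python 2 session.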
Steve