elementtree and gbk encoding

Wed Mar 15 16:05:37 EST 2006

Fredrik Lundh wrote:
> Steven Bethard wrote:
> 
>> Hmm...  I downloaded the newest cElementTree (and I already had the
>> newest ElementTree), and here's what I get:
> 
>>  >>> tree = myparser(filename, 'gbk')
>> Traceback (most recent call last):
>>    File "<interactive input>", line 1, in ?
>>    File "<interactive input>", line 8, in myparser
>> SyntaxError: not well-formed (invalid token): line 8, column 6
>>
>> FWIW, the file used above doesn't have an <?xml encoding?> header:
>>
>>  >>> open(filename).read()
>> '<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
>> <DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>
> 
> <S ID=2566> isn't a valid XML tag (the attribute value must be quoted)
> 
> if I recode the file into UTF-8 and fix the two S tags, the result displays
> just fine in IE and Firefox (I get a few boxes/question marks, but I assume
> that's a font problem).

Thanks (to both Fredrik and Just).  Stare at XML too long and you
start to miss the obvious things. =)

Everything works great now:

 >>> import re
 >>> text = open(filename).read()
 >>> text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
 >>> text = text.decode('gbk').encode('utf-8')
 >>> et.fromstring(text)
<Element 'DOC' at 00A2AF38>
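
For reference, the same steps can be sketched as a self-contained script
for Python 3.  This is only a sketch under a few assumptions: the standard
library's xml.etree.ElementTree stands in for the `et` module above, the
helper name parse_gbk_markup and the sample bytes are made up for
illustration, and the bytes are decoded before the regex runs because
Python 3 separates str from bytes.

```python
import re
import xml.etree.ElementTree as ET


def parse_gbk_markup(raw_bytes):
    """Parse GBK-encoded, almost-XML markup whose S tags have unquoted IDs.

    Hypothetical helper, not part of ElementTree.
    """
    # Decode the GBK bytes to str first, so the regex works on text.
    text = raw_bytes.decode('gbk')
    # Quote the bare attribute value: <S ID=2566> becomes <S ID="2566">.
    text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
    # With no XML declaration the parser assumes UTF-8, so re-encode.
    return ET.fromstring(text.encode('utf-8'))


# Made-up sample in the shape of the file shown earlier (not the real data).
sample = (
    '<DOC>\n<DOCID>ART242</DOCID>\n<BODY>\n'
    '<S ID=2566>中文</S>\n</BODY>\n</DOC>'
).encode('gbk')

doc = parse_gbk_markup(sample)
print(doc.tag, doc.find('BODY/S').get('ID'))

# Without the fix, the unquoted attribute is rejected by the parser.
try:
    ET.fromstring('<S ID=2566>x</S>')
except ET.ParseError as err:
    print('unquoted attribute rejected:', err)
```

The regex only touches S tags, as in the session above; other tags with
unquoted attributes would need their own pattern.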

=)

Steve