Some questions about decode/encode

John Machin sjmachin at
Sun Jan 27 22:50:15 CET 2008

On Jan 28, 7:47 am, "Mark Tolonen" <mark.e.tolo... at> wrote:
> >"John Machin" <sjmac... at> wrote in message
> >news:eeb3a05f-c122-4b8c-95d8-d13741263374 at
> >On Jan 27, 9:17 pm, glacier <rong.x... at> wrote:
> >> On Jan 24, 3:29 pm, "Gabriel Genellina" <gagsl-... at>
> >> wrote:
> >*IF* the file is well-formed GBK, then the codec will not mess up when
> >decoding it to Unicode. The usual cause of mess is a combination of a
> >human and a text editor :-)
> SAX uses the expat parser.  From the pyexpat module docs:
> Expat doesn't support as many encodings as Python does, and its repertoire
> of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
> (Latin1), and ASCII. If encoding is given it will override the implicit or
> explicit encoding of the document.
> --Mark

Thank you for pointing out where that list of encodings had been
cunningly concealed. However the relevance of dropping it in as an
apparent response to my answer to the OP's question about decoding
possibly butchered GBK strings is .... what?

In any case, expat seems to support other 8-bit encodings as well, e.g.
iso-8859-2 and koi8-r ...

import xml.sax, xml.sax.saxutils
import cStringIO

unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
print 'unistr=%r' % unistr
gbkstr = unistr.encode('gbk')
print 'gbkstr=%r' % gbkstr
unistr2 = gbkstr.decode('gbk')
assert unistr2 == unistr

print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')

xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""

asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
latin1doc = xml_template % ('iso-8859-1',
                            'nil illegitimati carborundum' + '\xff')
latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
gbkdoc = xml_template % ('gbk', gbkstr)

for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    print repr(result[result.find('<data>'):])

gbkstr='\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
latin1 FF -> utf8 = '\xc3\xbf'
latin2 FF -> utf8 = '\xcb\x99'
koi8r FF -> utf8 = '\xd0\xaa'
'<data>The quick brown fox etc</data>'
'<data>nil illegitimati carborundum\xc3\xbf</data>'
'<data>duo secundus\xcb\x99</data>'
Traceback (most recent call last):
  File "C:\junk\", line 27, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\", line 49, in parseString
  File "C:\Python25\lib\xml\sax\", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\", line 123, in parse
  File "C:\Python25\lib\xml\sax\", line 211, in feed
  File "C:\Python25\lib\xml\sax\", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown
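
For anyone repeating this on Python 3, here is a rough sketch of the same experiment plus one possible workaround. It assumes the failure above comes from the document declared as GBK, which the quoted pyexpat docs do not list. The helper names parse_to_utf8 and transcode_to_utf8 are made up for illustration, and the declaration rewrite assumes the prolog is spelled exactly as in the template. Which exception is raised for the GBK document seems to vary between Python versions, so the sketch catches several.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

TEMPLATE = '<?xml version="1.0" encoding="%s"?><data>%s</data>'


def parse_to_utf8(doc_bytes):
    """Push a document through SAX and return the regenerated XML as UTF-8 bytes."""
    out = io.BytesIO()
    xml.sax.parseString(doc_bytes, XMLGenerator(out, encoding='utf-8'))
    return out.getvalue()


def transcode_to_utf8(doc_bytes, source_encoding):
    """Hypothetical helper: decode with Python's codec, redeclare as UTF-8.

    Assumes the declaration reads exactly encoding="<source_encoding>".
    """
    text = doc_bytes.decode(source_encoding)
    text = text.replace('encoding="%s"' % source_encoding,
                        'encoding="utf-8"', 1)
    return text.encode('utf-8')


def data_part(result):
    """Return everything from the <data> start tag onwards."""
    return result[result.find(b'<data>'):]


# A single-byte encoding outside the documented list: U+042A is 0xFF in koi8-r.
koi8r_doc = (TEMPLATE % ('koi8-r', 'Moskva\u042a')).encode('koi8-r')
koi8r_data = data_part(parse_to_utf8(koi8r_doc))

# A multi-byte encoding: handing the GBK bytes to the parser directly fails.
gbk_doc = (TEMPLATE % ('gbk', '\u4e00W\u4e01X')).encode('gbk')
try:
    parse_to_utf8(gbk_doc)
    gbk_parsed_directly = True
except (xml.sax.SAXException, ValueError, LookupError):
    gbk_parsed_directly = False

# Workaround: let Python do the GBK decoding, then give the parser UTF-8.
gbk_data = data_part(parse_to_utf8(transcode_to_utf8(gbk_doc, 'gbk')))

print(repr(koi8r_data))
print(gbk_parsed_directly)
print(repr(gbk_data))
```

The point of the workaround is that Python's own codec handles the multi-byte decoding, so the parser only ever sees an encoding it is documented to support.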

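On the earlier point that well-formed GBK decodes cleanly: a minimal Python 3 sketch of how strict decoding can flag one kind of butchered input, namely a string cut off in the middle of a two-byte character. The function name is_well_formed is an assumption for illustration. A successful decode only shows that the bytes are valid GBK, not that the text is what the author intended, since some damage still yields valid byte pairs.

```python
def is_well_formed(data, encoding='gbk'):
    """Return True if data decodes without error under strict error handling."""
    try:
        data.decode(encoding)  # errors='strict' is the default
    except UnicodeDecodeError:
        return False
    return True


# Same test string as in the script above: U+4E00..U+4E03 interleaved with W..Z.
text = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))
good = text.encode('gbk')

# Round trip: well-formed GBK comes back as the original text.
assert good.decode('gbk') == text

# Drop the last two bytes so the string ends on a lone lead byte (0xC6).
truncated = good[:-2]

print(is_well_formed(good))
print(is_well_formed(truncated))
```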
