[Tutor] encoding question

eryksun eryksun at gmail.com
Mon Jan 6 03:21:13 CET 2014


On Sun, Jan 5, 2014 at 5:26 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
>>
>>     <?xml version="1.0" encoding="ISO-8859-1" ?>
>
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that
> was wishful thinking.

You may be thinking of JSON. Per RFC 4627, JSON text SHALL be encoded
in Unicode (UTF-8 by default):

https://tools.ietf.org/html/rfc4627#section-3

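For XML, RFC 3023 recommends UTF-8 but does not require it. Also, when
a charset parameter is present in the MIME Content-Type header, it takes
precedence over the encoding declared in the XML document itself.
Section 8 has examples: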

https://tools.ietf.org/html/rfc3023#section-8
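
To make the header side concrete, here is a minimal sketch of pulling
the charset parameter out of a Content-Type value. It assumes Python 3
and the standard email.message.Message class; the helper name
charset_from_content_type is made up for illustration and is not part
of any library:

```python
from email.message import Message

def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the value and returns the
    # fallback when the header carries no charset parameter.
    return msg.get_content_charset(default)

print(charset_from_content_type('text/xml; charset="ISO-8859-1"'))
print(charset_from_content_type('application/xml'))
```

When no charset parameter is present, the helper only hands back the
default; what to fall back on in that case depends on the media type,
as the RFC's examples show.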

So I was technically wrong to rely on the encoding in the XML
declaration (it happens to match the HTTP header's charset in this
case). Instead you can create a parser that uses the encoding from the
header:

    import xml.etree.ElementTree as ET

    # response is the object returned by urlopen (Python 2)
    encoding = response.headers.getparam('charset')
    parser = ET.XMLParser(encoding=encoding)
    tree = ET.parse(response, parser)
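
A self-contained sketch of the same idea, assuming Python 3 and using
io.BytesIO as a stand-in for the HTTP response. The function name
parse_with_charset and the sample document are invented for the
example. The declaration deliberately disagrees with the actual bytes
to show the parser's encoding argument overriding it:

```python
import io
import xml.etree.ElementTree as ET

def parse_with_charset(stream, charset):
    """Parse an XML byte stream, trusting charset over the XML declaration."""
    # The encoding passed here overrides whatever the document declares.
    parser = ET.XMLParser(encoding=charset)
    return ET.parse(stream, parser)

# Latin-1 bytes carrying a (wrong) UTF-8 declaration.
data = ('<?xml version="1.0" encoding="UTF-8"?>'
        '<name>caf\u00e9</name>').encode('iso-8859-1')

tree = parse_with_charset(io.BytesIO(data), 'iso-8859-1')
print(tree.getroot().text)
```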

Expat, the parser behind Python's pyexpat module, natively handles only
US-ASCII, ISO-8859-1 (Latin-1), UTF-8 and UTF-16. So it's probably
better to transcode to UTF-8 as Alex is doing, but then use a custom
parser to override the encoding named in the XML declaration, which no
longer matches the transcoded bytes:

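    import xml.etree.ElementTree as ET

    encoding = response.headers.getparam('charset')
    # transcode the body to UTF-8
    info = response.read().decode(encoding).encode('utf-8')

    # force UTF-8, ignoring the encoding in the XML declaration
    parser = ET.XMLParser(encoding='utf-8')
    # note: fromstring returns the root Element, not an ElementTree
    root = ET.fromstring(info, parser)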
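
Here is a runnable sketch of the transcoding approach, assuming
Python 3 and a body that has already been read into a bytes object.
The function name parse_transcoded and the Shift_JIS sample are
assumptions chosen for illustration, not something from the thread:

```python
import xml.etree.ElementTree as ET

def parse_transcoded(raw, charset):
    """Transcode raw XML bytes from charset to UTF-8, then parse them."""
    utf8 = raw.decode(charset).encode('utf-8')
    # The XML declaration still names the old encoding, so tell the
    # parser explicitly that the bytes are now UTF-8.
    parser = ET.XMLParser(encoding='utf-8')
    return ET.fromstring(utf8, parser)

# Sample body in an encoding that Expat does not support natively.
raw = ('<?xml version="1.0" encoding="Shift_JIS"?>'
       '<t>\u65e5\u672c\u8a9e</t>').encode('shift_jis')

root = parse_transcoded(raw, 'shift_jis')
print(root.tag, len(root.text))
```

Decoding with Python's own codecs means any encoding Python knows can
be handled, and the parser only ever sees UTF-8.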

