[Tutor] encoding question
eryksun
eryksun at gmail.com
Mon Jan 6 03:21:13 CET 2014
On Sun, Jan 5, 2014 at 5:26 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
>>
>> <?xml version="1.0" encoding="ISO-8859-1" ?>
>
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that
> was wishful thinking.
JSON text SHALL be encoded in Unicode:
https://tools.ietf.org/html/rfc4627#section-3
For XML, UTF-8 is recommended by RFC 3023, but not required. Also, the
MIME charset takes precedence. Section 8 has examples:
https://tools.ietf.org/html/rfc3023#section-8
So I was technically wrong to rely on the XML encoding (they happen to
be the same in this case). Instead you can create a parser with the
encoding from the header:
    encoding = response.headers.getparam('charset')
    parser = ET.XMLParser(encoding=encoding)
    tree = ET.parse(response, parser)
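For anyone on Python 3, here is a rough sketch of the same idea. `getparam` belongs to the Python 2 header class; in Python 3 the headers object is an `email.message.Message`, which has `get_content_charset()`. The `Message` built by hand below is only a stand-in for `response.headers`, and the sample bytes are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.message import Message

# Stand-in for response.headers (assumption: a real HTTP response
# exposes the same get_content_charset() method).
headers = Message()
headers['Content-Type'] = 'text/xml; charset=ISO-8859-1'

# Returns the charset lowercased, or None if the header has no charset.
encoding = headers.get_content_charset()

# With encoding=None the parser falls back to the XML declaration.
parser = ET.XMLParser(encoding=encoding)

# Made-up payload: 0xE9 is e-acute in Latin-1; there is no XML
# declaration, so only the header charset tells the parser how to decode.
root = ET.fromstring(b'<r>caf\xe9</r>', parser)
```

If the header has no charset parameter, `encoding` is None and the parser uses whatever the document declares.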
The expat parser (pyexpat) used by Python is limited to ASCII, Latin-1
and Unicode transport encodings. So it's probably better to transcode
to UTF-8 as Alex is doing, but then use a custom parser to override
the XML encoding:
    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding).encode('utf-8')
    parser = ET.XMLParser(encoding='utf-8')
    # fromstring returns the root Element, not an ElementTree
    root = ET.fromstring(info, parser)
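To see why the parser override matters, here is a small self-contained sketch (Python 3, no network). The function name `parse_transcoded` and the sample document are hypothetical. The declaration says ISO-8859-1, but the bytes are assumed to have arrived with a header charset of ISO-8859-15, where 0xA4 is the euro sign.

```python
import xml.etree.ElementTree as ET

def parse_transcoded(raw, charset):
    """Decode raw bytes with the transport charset, re-encode as UTF-8,
    and parse with a parser that overrides the XML declaration."""
    utf8 = raw.decode(charset).encode('utf-8')
    # encoding='utf-8' overrides the encoding named in the declaration,
    # which no longer matches the bytes after transcoding.
    parser = ET.XMLParser(encoding='utf-8')
    return ET.fromstring(utf8, parser)

# Hypothetical payload: declared as ISO-8859-1, really ISO-8859-15.
raw = b'<?xml version="1.0" encoding="ISO-8859-1" ?><price>\xa4</price>'
root = parse_transcoded(raw, 'iso-8859-15')
```

Without the override, the parser would trust the declaration and decode the UTF-8 bytes as Latin-1, giving mojibake instead of the euro sign.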