[Tutor] Encoding and XML troubles
Kent Johnson
kent37 at tds.net
Sun Nov 5 16:42:19 CET 2006
William O'Higgins Witteman wrote:
> I've been struggling with encodings in my XML input to Python programs.
>
> Here's the situation - my program has no declared encoding, so it
> defaults to ASCII. It's written in Unicode, but apparently that isn't
> confusing to the parser. Fine by me. I import some XML, probably
> encoded in the Windows character set (I don't remember what that's
> called now). I can read it for the most part - but it throws exceptions
> when it hits accented characters (some data is being input by French
> speakers). I am using ElementTree for my XML parsing
>
> What I'm trying to do is figure out what I need to do to get my program
> to not barf when it hits an accented character. I've tried adding an
> encoding line as suggested here:
>
> http://www.python.org/dev/peps/pep-0263/
>
> What these do is make the program fail to parse the XML at all. Has
> anyone encountered this? Suggestions? Thanks.
As Luke says, the encoding of your program has nothing to do with the
encoding of the XML or the types of data your program will accept. PEP
263 only declares the encoding of the source file itself, which in
practice affects the non-ASCII string literals in your program.
It sounds like your XML is not well-formed. XML files can have an
encoding declaration *in the XML*. If it is not present, the file is
assumed to be in UTF-8 (or UTF-16, if it starts with a byte order
mark). If your XML is in Cp1252 but lacks a correct encoding
declaration, it is not well-formed XML because the bytes Cp1252 uses
for accented characters are not valid UTF-8 sequences.
Try including the line
<?xml version="1.0" encoding="windows-1252"?>
or
<?xml version="1.0" encoding="Cp1252"?>
as the first line of the XML. (windows-1252 is the official
IANA-registered name for Cp1252; I'm not sure which name will actually
work correctly.)
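Here is a small sketch you can use to check this yourself. It assumes a
current Python 3 with the standard library's xml.etree.ElementTree
(not the 2006 setup), and the sample element and the name "André" are
made up for illustration. It parses the same Cp1252 bytes without a
declaration and then with each of the two spellings above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one accented character, encoded as Cp1252.
# In Cp1252, e-acute is the single byte 0xE9, which is not valid UTF-8 here.
body = "<name>Andr\u00e9</name>".encode("cp1252")

# No declaration: the parser assumes UTF-8 and rejects the 0xE9 byte.
try:
    ET.fromstring(body)
    parsed_without_decl = True
except ET.ParseError:
    parsed_without_decl = False

# With a declaration: try both spellings of the encoding name.
results = {}
for enc_name in ("windows-1252", "Cp1252"):
    declaration = ('<?xml version="1.0" encoding="%s"?>\n' % enc_name).encode("ascii")
    try:
        results[enc_name] = ET.fromstring(declaration + body).text
    except (ET.ParseError, LookupError, ValueError) as err:
        # Record the failure instead of stopping, so both names get tried.
        results[enc_name] = err

print("parsed without declaration:", parsed_without_decl)
for enc_name, outcome in results.items():
    print(enc_name, "->", repr(outcome))
```

If a name is accepted, the element text comes back as a normal Unicode
string with the accented character intact.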
Kent