XML/HTML Encoding problem
Dale Strickland-Clark
dale at riverhall.nospam.co.uk
Mon May 22 11:00:48 EDT 2006
A colleague has asked me this and I don't know the answer. Can anyone here
help with this? Thanks in advance.
Here is his email:
I am trying to parse an HTML document using the xml.dom.minidom parser and
then output a valid HTML document, all using the ISO-8859-1 charset.
For example:
My input:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
€
</body>
</html>
Desired output:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
€
</body>
</html>
Note that it doesn't matter if the '<?xml version="1.0"
encoding="ISO-8859-1"?>' header gets stripped. What does matter is that the
input document has the 'ISO-8859-1' charset and is an ANSI-encoded file.
The problem I get is that when I run, for example:
from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()
The output is:
<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>
So the parser replaces the entity reference with the literal € character
(Euro sign). I need it to remain an entity reference so that the resulting
HTML renders properly in a browser. Is there a way to make the parser not
convert entity references? Or is there a convenient post-processing function
that will convert the character back into a reference?
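One possible post-processing route, offered as a sketch rather than a confirmed answer: Python's built-in "xmlcharrefreplace" error handler for str.encode() writes any character that the target charset cannot represent as a numeric character reference. The helper name serialize_latin1 is hypothetical, and the sketch assumes Python 3, where toxml() returns a str when no encoding is given.

```python
from xml.dom.minidom import parseString


def serialize_latin1(xml_text):
    """Parse xml_text and return the root element as ISO-8859-1 bytes.

    Hypothetical helper, not part of minidom. Characters outside
    ISO-8859-1 (such as the Euro sign, U+20AC) are emitted as numeric
    character references instead of raising UnicodeEncodeError.
    """
    doc = parseString(xml_text)
    # Serialize only the root element, so no XML declaration is emitted
    # (the post says the header may be stripped).
    text = doc.documentElement.toxml()
    # "xmlcharrefreplace" turns unencodable characters into &#NNNN; form.
    return text.encode("iso-8859-1", "xmlcharrefreplace")


if __name__ == "__main__":
    print(serialize_latin1("<body>&#8364;</body>"))
```

Characters that ISO-8859-1 can represent, such as é, are written as single bytes; only the ones it cannot represent become references. Note that minidom still collapses empty elements such as <title></title> to <title/>, which this sketch does not address.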
--
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk