[XML-SIG] Character entities (XHTML)

Andrew Cooke andrewc@webtronfinance.com
Wed, 08 May 2002 10:00:29 -0400


Hi,

Thanks for the reply.  Two points in response:

1 - I screwed up with the input file, it was incorrect.  My apologies.  I
have included a revised demonstration of the problem below (the input file
has been through Tidy, but the entities still exist at that point).

2 - As far as I can tell, the problem isn't that the entities are being
replaced by the appropriate character, but that they are being silently
dropped.  The &lt; entity passes through correctly, but the &oacute; entity
is not replaced by an ó (nor is the &iexcl; replaced by a ¡).

Thanks,
Andrew

PS I am using PyXML installed from PyXML-0.7.win32-py2.2.exe

F:\home\Andrew\multi\src\xhtml>python
Python 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> from xml.dom.ext.reader.Sax2 import FromXmlFile
>>> from xml.dom.ext import PrettyPrint
>>> sys.stdout.writelines(open("index.xhtml").readlines())
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Cygwin (vers 1st April 2002), see www.w3.org" />
<link type="text/css" rel="stylesheet" href="basic.css" />
<title>Index</title>
</head>
<body>
<h1>&iexcl;&lt;Hola!</h1>

<a href="init">initialisaci&oacute;n</a>
</body>
</html>

>>> PrettyPrint(FromXmlFile("index.xhtml"))
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <meta content='HTML Tidy for Cygwin (vers 1st April 2002), see www.w3.org' name='generator'/>
    <link href='basic.css' rel='stylesheet' type='text/css'/>
    <title>Index</title>
  </head>
  <body>
    <h1>&lt;Hola!</h1>
    <a href='init'>initialisacin</a>
  </body>
</html>
>>>
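
One possible workaround for the dropped entities, sketched here with the modern standard library rather than the PyXML readers used above: rewrite each named entity as a numeric character reference before parsing, so the parser never needs the DTD to resolve it. The helper name named_to_numeric is hypothetical, and the sketch assumes the HTML 4 entity table in html.entities covers the entities in the document.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already knows; leave them untouched.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def named_to_numeric(markup):
    """Rewrite &name; as &#N; for names found in the HTML entity table."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # predefined or unknown: keep as written
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, markup)


# A cut-down version of the body of index.xhtml (no DOCTYPE, no namespace).
source = ('<body><h1>&iexcl;&lt;Hola!</h1>'
          '<a href="init">initialisaci&oacute;n</a></body>')

root = ET.fromstring(named_to_numeric(source))
h1_text = root.find("h1").text    # inverted exclamation mark is kept
link_text = root.find("a").text   # o-acute is kept
```

The entity names are still lost after parsing; this only ensures the characters themselves survive.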

----- Original Message -----
From: "Thomas B. Passin" <tpassin@comcast.net>
To: <xml-sig@python.org>
Sent: Tuesday, May 07, 2002 7:49 PM
Subject: Re: [XML-SIG] Character entities (XHTML)


I'm sure others have told you the same thing.  When an xml parser parses
your file, it replaces any character references and entities with their
corresponding characters.  There is no memory of how they came to be.
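
A minimal sketch of that behaviour, using xml.etree.ElementTree from the current standard library (an assumption; it is not the PyXML code under discussion). The entity is declared in an internal DTD subset so the parser can resolve it without fetching anything:

```python
import xml.etree.ElementTree as ET

# The internal subset declares oacute, so no external DTD is needed.
doc = """<!DOCTYPE p [
  <!ENTITY oacute "&#243;">
]>
<p>initialisaci&oacute;n</p>"""

root = ET.fromstring(doc)
text = root.text                                   # the expanded character data
serialized = ET.tostring(root, encoding="unicode")  # written back out as text
```

After parsing, text holds the accented character itself, and the serialized form contains no trace of the &oacute; reference.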

There are really only two ways to get the entities back into the output.
Either you use or write a serializer to replace certain characters with your
entities (and which ones to replace will depend on the encoding), or you do
some preprocessing to replace the entities with some encoded version, then
convert them back with postprocessing.
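
The first option might look roughly like the sketch below. The function name escape_with_entities is hypothetical, and it assumes the HTML 4 entity names in html.entities are the ones wanted in the output. A character is replaced only when the target encoding cannot represent it, which is where the dependence on the encoding comes in:

```python
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def escape_with_entities(text, encoding="ascii"):
    """Escape text for output, naming characters the encoding cannot hold."""
    out = []
    for ch in escape(text):  # escape() handles &, < and > first
        try:
            ch.encode(encoding)
            out.append(ch)  # representable: write the character as-is
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            # Fall back to a numeric reference when no entity name exists.
            out.append("&%s;" % name if name else "&#%d;" % ord(ch))
    return "".join(out)


ascii_out = escape_with_entities("\u00a1<Hola! initialisaci\u00f3n")
latin1_out = escape_with_entities("initialisaci\u00f3n", encoding="latin-1")
```

With an ASCII target both accented characters become named entities; with Latin-1 they are written directly, since that encoding can represent them. The output is only parseable again if the reader has the entity declarations available.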

But here you seem to be running HTML Tidy, which may not even be handling
the source file as xml depending on how you have configured it. Otherwise,
how did the Tidy "meta" element get into it when you don't show it in the
source file?   In fact, without a DTD your file cannot be processed by an
ordinary xml parser because the values of the entities cannot be known. So
chances are you ran it in xhtml mode, not xml mode, before you fed the
result to the Python modules.  This list probably isn't going to be able to
help you with the idiosyncrasies of Tidy.
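
The point about a missing DTD can be checked with a short sketch. It uses the expat-based xml.etree.ElementTree parser from the current standard library, which is an assumption and may behave differently from the PyXML readers; try_parse is a hypothetical helper:

```python
import xml.etree.ElementTree as ET


def try_parse(markup):
    """Return the root element's text, or a description of the parse error."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError as exc:
        return "ParseError: %s" % exc


# &lt; is predefined by XML itself, so it needs no declaration.
predefined = try_parse("<p>&lt;Hola!</p>")

# &oacute; is declared only in the XHTML DTD; with no DOCTYPE at all
# the parser has no value for it and rejects the document.
undeclared = try_parse("<p>initialisaci&oacute;n</p>")
```

Here the first document parses and the second is refused outright rather than having the entity dropped.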

Maybe you didn't do that, but you need to explain what you really did
because the way you show it, the Python program couldn't have completed and
there would be no "Meta" element.

Cheers,

Tom P