Text on multiple
lines and with extra white space in the
raw HTML doesn't change when dom.get_documentElement().normalize() is called.
"""
fr = FileReader()
dom = fr.readStream(StringIO(html),'HTML')
dom.get_documentElement().normalize()
w = HtmlWriter()
w.write(dom)
From bslesins@best.com Wed Apr 28 02:26:04 1999
From: bslesins@best.com (Brian Slesinsky)
Date: Tue, 27 Apr 1999 18:26:04 -0700 (PDT)
Subject: [XML-SIG] checking syntax with xmllib
Message-ID:
Hi, I tried using xmllib to check if an XML document is well-formed and
found some bugs.
If I use xmllib from Python 1.5.2, it complains about invalid characters.
However, I'm fairly sure I'm using correct UTF8 encoding (the document
contains European characters and was converted to Unicode from
ISO-8859-1). It looks like the 'illegal' regular expression in xmllib is
incorrect.
I also tried xml.parsers.xmllib from Python/XML 0.5.1, but it doesn't seem
to be doing any syntax checking at all - I tried a file with one close tag
and it didn't complain.
Here's the script I'm using to do the tests:
#!/nuvo/bin/python
import sys
from xml.parsers.xmllib import XMLParser
def check_xml(file):
    x = XMLParser()
    f = open(file)
    while 1:
        line = f.readline()
        if line=="": break
        x.feed(line)
check_xml(sys.argv[1])
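For comparison, here is a minimal sketch of the same kind of check written against the expat bindings that ship with current Python (xml.parsers.expat) rather than xmllib. The function name check_well_formed and its return shape are made up for illustration, and it assumes the whole document fits in memory.

```python
import xml.parsers.expat


def check_well_formed(data):
    """Return (True, None) if data parses, else (False, description).

    data may be a str or UTF-8 encoded bytes.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk, so expat also
        # reports problems that only show up at end of input, such as
        # elements that were never closed.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        reason = xml.parsers.expat.ErrorString(err.code)
        return False, "line %d, column %d: %s" % (err.lineno, err.offset, reason)
    return True, None
```

Called on the contents of a file, for example check_well_formed(open(path, "rb").read()), it returns a pair that a script can print or turn into an exit status.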
- Brian Slesinsky
From akuchlin@cnri.reston.va.us Wed Apr 28 03:41:53 1999
From: akuchlin@cnri.reston.va.us (A.M. Kuchling)
Date: Tue, 27 Apr 1999 22:41:53 -0400
Subject: [XML-SIG] DOM normalize() broken? entity refs lost?
In-Reply-To: <85256760.007644BA.00@li01.lm.ssc.siemens.com>
References: <85256760.007644BA.00@li01.lm.ssc.siemens.com>
Message-ID: <199904280241.WAA00900@207-172-184-212.s212.tnt23.brd.va.dialup.rcn.com>
Jeff.Johnson@icn.siemens.com writes:
> XmlWriter does not define .doOtherNode()
> so nothing gets written.
Eek! You're right. Try this patch:
Index: writer.py
===================================================================
RCS file: /home/cvsroot/xml/dom/writer.py,v
retrieving revision 1.8
diff -C2 -r1.8 writer.py
*** writer.py 1999/04/08 00:14:29 1.8
--- writer.py 1999/04/28 02:29:42
***************
*** 119,123 ****
self.stream.write(node.toxml())
!
class XmlLineariser(XmlWriter):
--- 119,125 ----
self.stream.write(node.toxml())
! def doOtherNode(self, node):
! self.stream.write( node.toxml() )
!
class XmlLineariser(XmlWriter):
> Text on multiple
> lines and with extra white space in the
> raw HTML doesn't change when dom.get_documentElement().normalize()
Careful; that isn't what normalize() does. Add another Text
node as a child of the TITLE element, to produce two Text nodes next
to each other. dom.dump() will then output:
>
...
After calling normalize:
>
...
See how the two text nodes have been merged? It doesn't do anything
about whitespace.
To strip out whitespace, look at strip_whitespace or
collapse_whitespace in xml.dom.utils; after collapse_whitespace(dom,
WS_INTERNAL), runs of whitespace are collapsed down to a single space.
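The same distinction can be sketched with xml.dom.minidom from current Python's standard library, which is not the PyDOM package discussed here. The " ".join(text.split()) step is only a stand-in for whitespace collapsing, not the xml.dom.utils function, and the sample text is invented.

```python
from xml.dom import minidom

# A TITLE-like element whose text has a line break and a run of spaces.
doc = minidom.parseString("<title>Text on  multiple\n lines</title>")
title = doc.documentElement

# Add a second Text node so two Text nodes sit next to each other.
title.appendChild(doc.createTextNode(" and   more text"))
count_before = len(title.childNodes)

# normalize() merges adjacent Text nodes; it leaves whitespace alone.
title.normalize()
count_after = len(title.childNodes)
merged = title.firstChild.data

# Collapsing whitespace is a separate step (a stand-in, see above).
collapsed = " ".join(merged.split())
```

After this runs, count_before and count_after show the merge, merged still contains the original line break and repeated spaces, and only collapsed has single spaces throughout.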
--
A.M. Kuchling http://starship.python.net/crew/amk/
Guards! Guards! Stop this madman! He's turning everyone into monkeys!
-- A sudden intrusion, in ZOT! #1
From paul@prescod.net Wed Apr 28 17:19:26 1999
From: paul@prescod.net (Paul Prescod)
Date: Wed, 28 Apr 1999 11:19:26 -0500
Subject: [XML-SIG] Another SAX Suggestion
References:
Message-ID: <3727350E.6B51E1ED@prescod.net>
I would like to suggest the default error handlers do something useful:
    def error(self, exception):
        "Handle a recoverable error."
        sys.stderr.write( "Error: "+ exception )
    def fatalError(self, exception):
        "Handle a non-recoverable error."
        sys.stderr.write( "Fatal Error: "+ exception )
    def warning(self, exception):
        "Handle a warning."
        sys.stderr.write( "Warning: "+ exception )
Of course if that's not what a particular implementation wants, they can
override it, but I think that the current lack of behavior is
non-intuitive. Maybe I'm corrupted by working with SGML tools but I expect
the defaults to be as above.
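As a rough sketch of what such a handler might look like against xml.sax in current Python (not the 1999 saxlib), the class below writes one line per problem to a stream. ReportingErrorHandler and report_problems are hypothetical names. The exception is formatted with %s because it is an object rather than a string, and the stream is a parameter so the example can be run without touching the real stderr.

```python
import io
import sys
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class ReportingErrorHandler(ErrorHandler):
    """Report each problem on a stream instead of staying silent."""

    def __init__(self, stream=None):
        # Default to stderr, as in the suggestion above.
        self.stream = stream if stream is not None else sys.stderr

    def error(self, exception):
        self.stream.write("Error: %s\n" % exception)

    def fatalError(self, exception):
        self.stream.write("Fatal Error: %s\n" % exception)

    def warning(self, exception):
        self.stream.write("Warning: %s\n" % exception)


def report_problems(data):
    """Parse data (bytes) and return everything the handler wrote."""
    out = io.StringIO()
    xml.sax.parseString(data, ContentHandler(), ReportingErrorHandler(out))
    return out.getvalue()
```

Because fatalError here only prints, the parse call returns normally after a fatal error; a caller that wants to stop has to raise from the handler instead.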
--
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
http://itrc.uwaterloo.ca/~papresco
"Microsoft spokesman Ian Hatton admits that the Linux system would have
performed better had it been tuned."
"Future press releases on the issue will clearly state that the research
was sponsored by Microsoft."
http://www.itweb.co.za/sections/enterprise/1999/9904221410.asp
From Jeff.Johnson@icn.siemens.com Wed Apr 28 18:21:04 1999
From: Jeff.Johnson@icn.siemens.com (Jeff.Johnson@icn.siemens.com)
Date: Wed, 28 Apr 1999 13:21:04 -0400
Subject: [XML-SIG] DOM normalize() broken? entity refs lost?
Message-ID: <85256761.005F477A.00@li01.lm.ssc.siemens.com>
Thanks for the entity reference fix Andrew. It now saves "®" but it still
loses things like "’". I think this is Unicode generated from the RTF to
HTML filter I'm using, and while I can change the RTF to HTML character
translation table to convert RTF "quoteright" to "'" instead of "’", I'm
curious where the entity ref is going. I put some debug statements in
HtmlBuilder.handle_entityref() but it never gets called. I know there is
controversy over Unicode support but I don't know enough about it to know what
to expect in this case.
A new script is included:
import sys, os
from StringIO import StringIO
from xml.dom import utils
from xml.dom.writer import HtmlWriter, XmlWriter
html = """
Don’t
"""
# This works with Andrew's patch but the unicode single quote still vanishes
# without a trace.
#Registered ®
fr = utils.FileReader()
dom = fr.readStream(StringIO(html),'HTML')
w = XmlWriter()
w.write(dom)
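One possible explanation, offered as a guess rather than a diagnosis: SGML-style HTML parsers usually report numeric character references (the &#NNNN; form) through a different callback from named entity references (the &name; form), so a debug statement in handle_entityref() would never see the former. The sketch below uses html.parser from current Python, not the HtmlBuilder discussed here, to show the split. ReferenceLogger and log_references are names invented for the example, and the sample markup assumes the quote was written as a numeric reference.

```python
from html.parser import HTMLParser


class ReferenceLogger(HTMLParser):
    """Record which callback each reference in the input arrives through."""

    def __init__(self):
        # convert_charrefs=False keeps references as separate events
        # instead of folding them into the surrounding text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_entityref(self, name):
        # Named references such as &reg; arrive here.
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Numeric references such as &#8217; arrive here.
        self.events.append(("charref", name))


def log_references(html):
    parser = ReferenceLogger()
    parser.feed(html)
    parser.close()
    return parser.events
```

If the parser in question follows the same split, the place to put the debug statement would be its character-reference callback.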
From akuchlin@cnri.reston.va.us Wed Apr 28 18:39:42 1999
From: akuchlin@cnri.reston.va.us (Andrew M. Kuchling)
Date: Wed, 28 Apr 1999 13:39:42 -0400 (EDT)
Subject: [XML-SIG] Another SAX Suggestion
In-Reply-To: <3727350E.6B51E1ED@prescod.net>
References:
<3727350E.6B51E1ED@prescod.net>
Message-ID: <14119.17665.211348.533470@amarok.cnri.reston.va.us>
Paul Prescod writes:
>I would like to suggest the default error handlers do something useful:
Agreed; the general Python philosophy is to make noise when
something unexpected happens, rather than making some assumption and
charging onward. Printing an error message seems to be the right
level of noise for parsing errors; they could raise an exception and
terminate further processing (and actually I wouldn't mind that
either), but printing a message seems sufficient.
--
A.M. Kuchling http://starship.python.net/crew/amk/
Principally I played pedants, idiots, old fathers, and drunkards.
As you see, I had a narrow escape from becoming a professor.
-- Robertson Davies, "Shakespeare over the Port"
From Lutz.Ehrlich@EMBL-Heidelberg.de Fri Apr 30 10:56:51 1999
From: Lutz.Ehrlich@EMBL-Heidelberg.de (Lutz.Ehrlich@EMBL-Heidelberg.de)
Date: Fri, 30 Apr 1999 11:56:51 +0200 (MDT)
Subject: [XML-SIG] XQL: Somebody working on it?
Message-ID: <14121.31687.843895.101080@cuckoo.EMBL-Heidelberg.DE>
G'day all,
as I didn't find anything in the recent CVS source for the xml
package, I wondered whether somebody is currently working on
implementing XQL (http://metalab.unc.edu/xql/) ? Before I start doing
anything myself, I would like to hear your opinion about such a
thing. Would implementation be a big thing? Have you guys discussed
implementing any of the query language proposals already?
Any comments are most welcome,
Lutz
______________________________________________________________________
Lutz Ehrlich web : http://www.embl-heidelberg.de/~ehrlich
email: lutz.ehrlich@embl-heidelberg.de
European Molecular Biology Laboratory phone: +49-6221-387-140
Meyerhofstr. 1 fax : +49-6221-387-517
D-69012 Heidelberg, Germany
From Jeff.Johnson@icn.siemens.com Fri Apr 30 15:13:16 1999
From: Jeff.Johnson@icn.siemens.com (Jeff.Johnson@icn.siemens.com)
Date: Fri, 30 Apr 1999 10:13:16 -0400
Subject: [XML-SIG] unicode entitie refs
Message-ID: <85256763.004E13CB.00@li01.lm.ssc.siemens.com>
Sorry to be a pest but I never got a response on the following email and was
hoping someone had an answer as to why unicode entity refs disappear in PyDom.
After I write this I'll start looking at the SAX code, maybe I have to install
error handlers? Any suggestions?
Thanks,
Jeff
---------------------- Forwarded by Jeff Johnson/Service/ICN on 04/30/99 10:07
AM ---------------------------
Jeff Johnson
04/28/99 01:21 PM
To: akuchlin@cnri.reston.va.us
cc: xml-sig@python.org
Subject: Re: [XML-SIG] DOM normalize() broken? entity refs lost? (Document
link not converted)
Thanks for the entity reference fix Andrew. It now saves "®" but it still
loses things like "’". I think this is Unicode generated from the RTF to
HTML filter I'm using, and while I can change the RTF to HTML character
translation table to convert RTF "quoteright" to "'" instead of "’", I'm
curious where the entity ref is going. I put some debug statements in
HtmlBuilder.handle_entityref() but it never gets called. I know there is
controversy over Unicode support but I don't know enough about it to know what
to expect in this case.
A new script is included:
import sys, os
from StringIO import StringIO
from xml.dom import utils
from xml.dom.writer import HtmlWriter, XmlWriter
html = """
Don’t
"""
# This works with Andrew's patch but the unicode single quote still vanishes
# without a trace.
#Registered ®
fr = utils.FileReader()
dom = fr.readStream(StringIO(html),'HTML')
w = XmlWriter()
w.write(dom)
From paul@prescod.net Fri Apr 30 15:09:49 1999
From: paul@prescod.net (Paul Prescod)
Date: Fri, 30 Apr 1999 09:09:49 -0500
Subject: [XML-SIG] XQL: Somebody working on it?
References: <14121.31687.843895.101080@cuckoo.EMBL-Heidelberg.DE>
Message-ID: <3729B9AD.C2D86911@prescod.net>
Lutz.Ehrlich@EMBL-Heidelberg.de wrote:
>
> G'day all,
>
> as I didn't find anything in the recent CVS source for the xml
> package, I wondered whether somebody is currently working on
> implementing XQL (http://metalab.unc.edu/xql/) ? Before I start doing
> anything myself, I would like to hear your opinion about such a
> thing. Would implementation be a big thing? Have you guys discussed
> implementing any of the query language proposals already?
XSL implicitly depends on a query language. It isn't defined separately
from XSL but it is defined in the XSL specification. That query language
actually has W3C standardization status and is needed for the Python XSL
implementation that is under development.
XQL is sort of like that language -- but not quite, and not standardized.
I think that before XQL becomes any kind of standard it would have to be
aligned with XSL's query language. Therefore you can choose yourself
whether you want to implement it in the meantime or not. It all depends on
whether you want to work on something that will likely be obsolete in a
year or not....in the XML world a year is a lifetime so maybe that's a
good tradeoff.
--
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
http://itrc.uwaterloo.ca/~papresco
"Microsoft spokesman Ian Hatton admits that the Linux system would have
performed better had it been tuned."
"Future press releases on the issue will clearly state that the research
was sponsored by Microsoft."
http://www.itweb.co.za/sections/enterprise/1999/9904221410.asp
From wunder@infoseek.com Fri Apr 30 16:51:19 1999
From: wunder@infoseek.com (Walter Underwood)
Date: Fri, 30 Apr 1999 08:51:19 -0700
Subject: [XML-SIG] Another SAX Suggestion
In-Reply-To: <3727350E.6B51E1ED@prescod.net>
References:
Message-ID: <3.0.5.32.19990430085119.00ad0c50@corp>
At 11:19 AM 4/28/99 -0500, Paul Prescod wrote:
>I would like to suggest the default error handlers do something useful:
>
> def error(self, exception):
>     "Handle a recoverable error."
>     sys.stderr.write( "Error: "+ exception )
Since we write servers, we consider output to stderr from a library
to be a defect. Anybody else remember "RANGE ERROR" from the
C math library?
I had to rip out some stderr writes from pyexpat, too.
I wouldn't mind having a stderr error handler provided as part
of the module, with sample code that uses that error handler.
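A server-friendly alternative, sketched against xml.sax in current Python under the assumption that the caller does its own logging: collect the problems in a list and let the application decide where they go. CollectingErrorHandler and parse_quietly are hypothetical names, not part of any SAX distribution.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Remember every problem; write nothing to stdout or stderr."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", str(exception)))

    def error(self, exception):
        self.problems.append(("error", str(exception)))

    def fatalError(self, exception):
        self.problems.append(("fatal", str(exception)))


def parse_quietly(data):
    """Parse data (bytes) and return a list of (severity, message) pairs."""
    handler = CollectingErrorHandler()
    xml.sax.parseString(data, ContentHandler(), handler)
    return handler.problems
```

A printing handler can then be an opt-in layer on top: the application walks the returned list and writes to stderr, a log file, or nowhere.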
Also along this line, does the SAX adaptor for expat catch all
exceptions raised in a handler? The Expat core doesn't know how
to propagate exceptions, so they need to be caught and reported
locally. This is an interesting behavior difference between SAX
over different parser implementations (a pure-Python parser would
propagate the exceptions).
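For what it is worth, the expat binding in current CPython (xml.parsers.expat) does carry an exception raised inside a handler back out of the Parse() call. The sketch below only demonstrates that present-day behavior and says nothing about the 1999 pyexpat or the SAX adaptor. HandlerFailure, start_element and parse_and_report are illustrative names.

```python
import xml.parsers.expat


class HandlerFailure(Exception):
    """Raised by the handler to simulate an application error."""


def start_element(name, attrs):
    # Fail on a particular element to see where the exception surfaces.
    if name == "bad":
        raise HandlerFailure(name)


def parse_and_report(data):
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    try:
        parser.Parse(data, True)
    except HandlerFailure as exc:
        # The exception raised in the callback arrives here, at the caller.
        return "handler raised: %s" % exc
    return "ok"
```

A SAX driver built on a binding that behaves this way does not need to catch handler exceptions itself; one built on a binding that swallows them would.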
Sorry for the ignorance of SAX details -- our XML support shipped
last September and I haven't gone back and re-coded to the portable
interface.
wunder
--
Walter R. Underwood
wunder@infoseek.com
wunder@best.com (home)
http://software.infoseek.com/cce/ (my product)
http://www.best.com/~wunder/
1-408-543-6946