[XML-SIG] How to parse an XML in SAX

Tue Dec 4 00:39:29 CET 2007

> Hi I want to parse an XML using sax but my big issue are the
> WhiteSpaces when they get reported. I want to know how to efficiently
> ignore them. I know there are some DocumentHandlers and one specific
> for ignore Whitespace but I still come up with a bunch of invisible
> nodes like \t or \n.
> 
> Anyone have a tutorial on how to handle SAX for this kind of parsing?

In general, the notion of "significant whitespace" is pretty weak in
XML (independent of SAX, so I don't think Stefan's bashing of SAX
was of any help). Here is what I know about it:
- white space should be preserved if the attribute xml:space is
  given on an element (or inherited from an enclosing element) with
  the value "preserve". Otherwise, it's up to the application what
  precisely to do with white space.
- white space in "element content" is usually considered ignorable,
  and the XML spec requires a validating parser to report it as
  such. However, whether an element has element content depends on
  the DTD, so only a validating parser can know. If you turn on
  validation in SAX, white space in element content will be
  reported through the "ignorableWhitespace" event.
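
To make the two points above concrete, here is a minimal sketch using
Python's xml.sax. The class name XmlSpaceTracker and the helper
effective_space() are made up for illustration. It only records, for
each element, whether xml:space="preserve" is in effect, and collects
anything a parser reports as ignorable. As far as I know, the
expat-based reader bundled with Python does not validate, so with it
the ignorableWhitespace method should simply never be called.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class XmlSpaceTracker(ContentHandler):
    """Record the effective xml:space setting of every element."""

    def __init__(self):
        super().__init__()
        self._stack = [False]   # effective "preserve" flag per open element
        self.seen = []          # (element name, preserve?) in document order
        self.ignorable = []     # white space a validating parser reported

    def startElement(self, name, attrs):
        # xml:space is inherited: start from the parent's setting and
        # override it only if the attribute is given on this element.
        value = attrs.get("xml:space")
        if value == "preserve":
            preserve = True
        elif value == "default":
            preserve = False
        else:
            preserve = self._stack[-1]
        self._stack.append(preserve)
        self.seen.append((name, preserve))

    def endElement(self, name):
        self._stack.pop()

    def ignorableWhitespace(self, whitespace):
        # Only reached when the parser knows the content model (DTD).
        self.ignorable.append(whitespace)


def effective_space(xml_text):
    """Hypothetical helper: parse a string, return the recorded pairs."""
    handler = XmlSpaceTracker()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.seen
```

An application would consult the flag on top of the stack before
deciding to throw any white space away.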

So, it's your own choice, and you should make that choice based on
your knowledge of the actual XML application. Typical options are
a) preserve all whitespace
b) perform validation, then strip all whitespace in element content
c) drop white space that completely spans from one tag to another,
   assuming the element has element content. In SAX, collect the
   data passed to characters() since the last startElement or
   endElement, and then choose to drop it at the next startElement
   or endElement if it is all white space.
d) in many cases, you have either element content or simple text
   content, so in SAX, you can drop the white space of any element
   in which you see nested elements.
e) strip whitespace, in the sense of Python's string.strip. I.e.
   at endElement, perform .strip() on the collected data.
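
As a rough sketch of option c), again assuming Python's xml.sax: the
handler below buffers character data and decides at each tag whether
to keep it. The names WhitespaceDroppingHandler and parse_events() are
invented for this example, and the "events" list merely stands in for
whatever the application really does. Passing strip=True gives a
variation of option e), stripping each text run at the next tag rather
than only at endElement.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class WhitespaceDroppingHandler(ContentHandler):
    """Drop text runs that are white space only between two tags."""

    def __init__(self, strip=False):
        super().__init__()
        self.strip = strip    # True: also strip the runs that are kept
        self._buffer = []     # chunks seen since the last tag
        self.events = []      # what the application would act on

    def _flush(self):
        text = "".join(self._buffer)
        self._buffer = []
        if self.strip:
            text = text.strip()
        if text.strip():      # keep only runs with non-space content
            self.events.append(("text", text))

    def startElement(self, name, attrs):
        self._flush()
        self.events.append(("start", name))

    def endElement(self, name):
        self._flush()
        self.events.append(("end", name))

    def characters(self, content):
        # A parser may deliver one text run in several chunks, so the
        # decision is deferred until the next tag.
        self._buffer.append(content)


def parse_events(xml_text, strip=False):
    """Hypothetical helper: parse a string, return the kept events."""
    handler = WhitespaceDroppingHandler(strip=strip)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

Deferring the decision to the next tag matters because a newline in
the middle of real text may arrive as its own chunk; testing each
chunk separately would wrongly drop it.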

HTH,
Martin