> I solved the problem and am responding to myself for the benefit of future googlers.
> SAX parsers may split a single run of text into several CHARACTERS nodes, so the pieces have to be joined back together. Since pulldom is built on a SAX parser, it may do the same. My method to find the next CHARACTERS node and join it with the ones that follow is below. It assumes that
> self.event, self.node = next(iter)
> was executed previously.
>      def getCharacterNode(self, iter):
>          # Skip ahead to the next CHARACTERS event.
>          while self.event != 'CHARACTERS':
>              self.event, self.node = next(iter)
>          # Collect this node and any CHARACTERS nodes that directly follow it.
>          chars = [self.node.nodeValue]
>          self.event, self.node = next(iter)
>          while self.event == 'CHARACTERS':
>              chars.append(self.node.nodeValue)
>              self.event, self.node = next(iter)
>          return ''.join(chars)
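
For what it's worth, the same joining idea can be sketched as a standalone Python 3 generator that wraps a pulldom event stream. This is only a rough sketch, not the quoted method and not Amara's filter: the name `joined_text_events` is made up here, and it yields the joined text as a plain string instead of a DOM node.

```python
from xml.dom import pulldom


def joined_text_events(events):
    """Yield (event, payload) pairs from a pulldom event stream.

    Runs of adjacent CHARACTERS events are merged into a single pair whose
    payload is the joined string (an assumption of this sketch); every other
    event is passed through unchanged with its original node.
    """
    chars = []
    for event, node in events:
        if event == pulldom.CHARACTERS:
            # Buffer the text; more pieces of the same run may follow.
            chars.append(node.data)
            continue
        if chars:
            # A non-text event ends the run, so flush the buffered text first.
            yield pulldom.CHARACTERS, ''.join(chars)
            chars = []
        yield event, node
    if chars:
        # Flush any text left over at the end of the stream.
        yield pulldom.CHARACTERS, ''.join(chars)


# Usage: collect the text of a small document, however the parser chunked it.
doc = pulldom.parseString('<a>hello &amp; world</a>')
texts = [payload for event, payload in joined_text_events(doc)
         if event == pulldom.CHARACTERS]
```

Unlike the method above, this does not need an event to have been fetched beforehand, and it stops cleanly when the stream ends instead of raising StopIteration in the middle of a text run.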

Or see:


and the updated version that is part of Amara:

(class normalize_text_filter, which you should be able to copy to your
code if you don't want to install Amara).

