XML / Unicode / SAX question
stefan.behnel-n05pAM at web.de
Wed Jul 4 08:18:03 CEST 2007
> I am using SAX to parse XML that has numeric html entities I need to
> characters to print correctly, but not without being surrounded by
> def characters(self, ch):
>     if self.isNews:
>         ch = unescape(ch)
>         print ch
The print statement appends a line break after every chunk it prints. Use
instead. Note that you have to merge character sequences yourself in SAX.
There is no guarantee about how many chunks the text content of a single tag
is split into before it is passed to the characters() SAX method.
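A minimal sketch of merging the chunks yourself, using only the standard
library. It is written in current Python 3 syntax rather than 2.3, and the
element name "news", the NewsHandler class and the extract_news() helper are
made up for illustration; they are not from the original code. The handler
only stores what characters() delivers and joins the pieces once the element
is closed:

```python
import xml.sax


class NewsHandler(xml.sax.ContentHandler):
    """Collects the full text of each <news> element (hypothetical tag name)."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_news = False
        self.chunks = []   # pieces delivered by characters() for the current element
        self.items = []    # joined text of every completed element

    def startElement(self, name, attrs):
        if name == "news":
            self.in_news = True
            self.chunks = []

    def characters(self, content):
        # May be called several times for one element, so only store the piece.
        if self.in_news:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "news":
            # The element is complete: merge the pieces into one string.
            self.items.append("".join(self.chunks))
            self.in_news = False


def extract_news(xml_bytes):
    """Parse the document and return the text of each <news> element."""
    handler = NewsHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.items
```

Printing each joined string once, after endElement() has fired, should avoid
the stray line breaks, because the output no longer depends on where the
parser happened to split the text.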
> For a line like 'Mark à Capbreton'
> my results print as:
> Is this another SAX quirk? I've already had to hack my way around SAX
> not being able to split results on a colon. No matter if I try strip,
> etc the results are always the same: newlines surrounding the html
> entities. I'm using version 2.3.5 and need to stick to the standard
> libraries. Thanks.
Too bad. If an external library were acceptable (Python 2.3 is ok), I would
have suggested lxml, maybe lxml.html (which will be in lxml 2.0), or the Atom
implementation on top of lxml.etree.
Hope it helps,