[XML-SIG] Content is split into two
J. Cliff Dyer
jcd at unc.edu
Wed Mar 26 14:39:21 CET 2008
On Wed, 2008-03-26 at 15:12 +0800, Timothy Wu wrote:
> Hi, I posted the following on the Python mailing list but no one
> responded. So I'm posting here again.
> I have created a very, very simple parser for an XML file.
> class FindGoXML2(ContentHandler):
>     def characters(self, content):
>         print content
> I have made it simple because I want to debug. This prints out any
> content enclosed by tags (right?).
> The XML is publicly available here:
> I show a few lines embedded in this XML:
> Notice the third line before the last. I expect my content printout to
> print out "evidence:IEA".
> However this is what I get.
> catalytic activity ==> this is the printout from the line before
> e
> vidence: IEA
> I don't understand why a few blank lines were printed after "catalytic
> activity". But that
> doesn't matter. What matters is where the string "evidence: IEA" is
> split into two printouts.
> First it prints only "e", then "vidence: IEA". I parsed 825 such XMLs
> without a problem,
> this occurs on my 826th XML.
> Any explanations??
The parser reads its input in chunks of unspecified size, so there is
no guarantee that a whole text block is delivered in a single call to
characters(). Your 826th file happens to have a chunk boundary in the
middle of that string. The split shows up as two separate lines because
the print statement adds a newline after each piece it prints. If you
want to see the text itself, without phantom newlines, try replacing
print with sys.stdout.write().
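
If you need each text block as one string rather than just cleaner
output, the usual approach is to collect the pieces in characters() and
join them when a tag boundary is reached. The following is only a rough
sketch, written for Python 3 (so print is a function here, unlike in the
quoted code). The class name TextCollector, the element names in the
sample, and the point where the input is cut in two are all made up for
illustration and are not taken from the original file.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Joins the pieces passed to characters() into one string per text block."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # pieces of text seen since the last tag
        self.texts = []   # completed, whitespace-stripped text blocks

    def startElement(self, name, attrs):
        self._flush()

    def endElement(self, name):
        self._flush()

    def characters(self, content):
        # May be called more than once for a single run of text.
        self.chunks.append(content)

    def _flush(self):
        # Join whatever has accumulated and keep it if it is not just whitespace.
        text = "".join(self.chunks).strip()
        if text:
            self.texts.append(text)
        self.chunks = []


# Hypothetical sample, fed in two pieces that are cut in the middle of the text.
handler = TextCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.feed(b"<term><name>catalytic activity</name><note>e")
parser.feed(b"vidence: IEA</note></term>")
parser.close()

for text in handler.texts:
    print(text)
```

Because the pieces are joined only at tag boundaries, it no longer
matters where the parser happens to split the input. Stripping in
_flush() also drops the whitespace-only text between tags, which is most
likely where the blank lines in the output came from.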