Hi, I posted the following on the Python mailing list, but no one responded, so I'm posting it here as well.<br><br>------------<br><br>Hi,<br><br>I have created a very, very simple parser for an XML file.<br><br>class FindGoXML2(ContentHandler):<br>
def characters(self, content):<br> print content<br><br>I have made it simple because I want to debug. This prints out any content enclosed by tags (right?).<br>
<br>The XML is publicly available here:<br><a href="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=9622&retmode=xml" target="_blank">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=9622&retmode=xml</a><br>
<br>Here are a few lines from this XML:<br><br> <Gene-commentary_source><br> <Other-source><br> <Other-source_src><br> <Dbtag><br>
<Dbtag_db>GO</Dbtag_db><br> <Dbtag_tag><br> <Object-id><br> <Object-id_id>3824</Object-id_id><br><div id="1gvs" class="ArwC7c ckChnd">
</Object-id><br> </Dbtag_tag><br> </Dbtag><br> </Other-source_src><br> <Other-source_anchor>catalytic activity</Other-source_anchor><br>
<Other-source_post-text>evidence: IEA</Other-source_post-text><br> </Other-source><br> </Gene-commentary_source><br><br>Notice the third line from the end. I expect my handler to print "evidence: IEA" in one piece.<br>
However, this is what I get:<br><br>-------------------------<br>catalytic activity ==> this is the printout for the line before<br><br><br><br>e<br>vidence: IEA<br>-------------------------<br><br>I don't understand why a few blank lines were printed after "catalytic activity", but that <br>
doesn't matter. What matters is that the string "evidence: IEA" is split into two printouts:<br>first it prints only "e", then "vidence: IEA". I parsed 825 such XML files without a problem; <br>
this happened on my 826th.<br><br>Any explanations?</div>
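<br><br>One possible explanation, offered as a sketch rather than a confirmed diagnosis of this particular file: a SAX parser is allowed to deliver the text of a single element through several calls to characters(), for example when the text happens to straddle an internal read buffer. The whitespace between tags is reported through characters() as well, which would account for the blank lines. Assuming that is what happens here, the handler can collect the chunks and join them when the element ends. The class name BufferingHandler and the texts attribute below are made up for illustration, and the code is written for Python 3 (the original snippet uses the Python 2 print statement).<br><br>

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BufferingHandler(ContentHandler):
    """Hypothetical handler: joins text chunks per element instead of printing each one."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []  # pieces of text seen since the last tag
        self.texts = []   # (element name, complete text) pairs

    def startElement(self, name, attrs):
        # A new element starts: discard whitespace seen between tags.
        self.chunks = []

    def characters(self, content):
        # May be called more than once for the same run of text.
        self.chunks.append(content)

    def endElement(self, name):
        # Only now is the element's text known to be complete.
        text = "".join(self.chunks).strip()
        if text:
            self.texts.append((name, text))
        self.chunks = []


# A small excerpt in the shape of the XML quoted above.
sample = (
    b"<Other-source>"
    b"<Other-source_anchor>catalytic activity</Other-source_anchor>"
    b"<Other-source_post-text>evidence: IEA</Other-source_post-text>"
    b"</Other-source>"
)

handler = BufferingHandler()
xml.sax.parseString(sample, handler)
for name, text in handler.texts:
    print(name, "=>", text)
```

Resetting the buffer in startElement is enough here because the text of interest sits in leaf elements; a document with text mixed between child elements would need a buffer per open element.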