[XML-SIG] Content is split into two

J. Cliff Dyer jcd at unc.edu
Wed Mar 26 14:39:21 CET 2008


On Wed, 2008-03-26 at 15:12 +0800, Timothy Wu wrote:
> Hi, I posted the following to the Python mailing list, but no one
> responded, so I'm posting it here again.
> 
> ------------
> 
> Hi,
> 
> I have created a very, very simple parser for an XML document.
> 
> class FindGoXML2(ContentHandler):
>     def characters(self, content):
>         print content
> 
> I have made it simple because I want to debug. This prints out any
> content enclosed by tags (right?).
> 
> The XML is publicly available here:
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=9622&retmode=xml
> 
> Here are a few lines from this XML:
> 
>               <Gene-commentary_source>
>                 <Other-source>
>                   <Other-source_src>
>                     <Dbtag>
>                       <Dbtag_db>GO</Dbtag_db>
>                       <Dbtag_tag>
>                         <Object-id>
>                           <Object-id_id>3824</Object-id_id>
>                         </Object-id>
>                       </Dbtag_tag>
>                     </Dbtag>
>                   </Other-source_src>
>                   <Other-source_anchor>catalytic activity</Other-source_anchor>
>                   <Other-source_post-text>evidence: IEA</Other-source_post-text>
>                 </Other-source>
>               </Gene-commentary_source>
> 
> Notice the third line from the end. I expect my content printout to
> print out "evidence: IEA".
> However, this is what I get:
> 
> -------------------------
> catalytic activity  ==> this is the printout for the line before
> 
> 
> 
> e
> vidence: IEA
> -------------------------
> 
> I don't understand why a few blank lines were printed after "catalytic
> activity", but that doesn't matter. What matters is that the string
> "evidence: IEA" is split into two printouts: first it prints only "e",
> then "vidence: IEA". I parsed 825 such XML files without a problem;
> this occurs on my 826th.
> 
> Any explanations??

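The parser reads its input in chunks of unspecified size, so there is
no guarantee that a text block will be passed to characters() in a
single call.  Here the text was delivered in two calls ("e", then
"vidence: IEA"), and the split is visible because the print statement
adds a newline after each call.  The blank lines have the same cause:
the whitespace between tags is character data too, and each piece gets
its own newline.  If you want to see the text itself, without phantom
newlines, try replacing print with sys.stdout.write().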
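
If you need each element's text as one complete string rather than just
cleaner output, the usual approach is to collect the pieces in
characters() and join them when the element ends.  Here is a minimal
sketch of that idea, written for Python 3 (so print is a function, unlike
in your Python 2 code).  The class name TextCollector, the texts
attribute and the shortened sample document are made up for illustration,
and the input is fed in two pieces by hand only to imitate a chunk
boundary in the middle of "evidence":

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that joins character chunks per element."""

    def __init__(self):
        super().__init__()
        self._chunks = []  # pieces of text seen since the last tag
        self.texts = []    # (element name, complete text) pairs

    def startElement(self, name, attrs):
        # A new element begins: drop whitespace collected between tags.
        self._chunks = []

    def characters(self, content):
        # May be called more than once for a single run of text.
        self._chunks.append(content)

    def endElement(self, name):
        text = "".join(self._chunks).strip()
        if text:
            self.texts.append((name, text))
        self._chunks = []


# Shortened stand-in for the real document (an assumption, not the NCBI data).
sample = (
    b"<Other-source>"
    b"<Other-source_anchor>catalytic activity</Other-source_anchor>"
    b"<Other-source_post-text>evidence: IEA</Other-source_post-text>"
    b"</Other-source>"
)

handler = TextCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Feed the document in two pieces, cutting right after the "e" of "evidence".
cut = sample.index(b"vidence")
parser.feed(sample[:cut])
parser.feed(sample[cut:])
parser.close()

for name, text in handler.texts:
    print(name, text)
```

However the parser happens to split the text, the joined string comes
out whole.  Note that this sketch assumes text only appears in leaf
elements; a document with mixed content (text and child elements side by
side) would need a stack of buffers instead of a single list.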

Cheers,
Cliff



