[XML-SIG] Parsing XML data from a stream where several XML elements follow?

Uche Ogbuji uche.ogbuji@fourthought.com
Wed, 18 Dec 2002 06:47:49 -0700


> On Tue, Nov 26, 2002 at 04:25:38PM +0100,
>  Stephane Bortzmeyer <bortzmeyer@nic.fr> wrote 
>  a message of 20 lines which said:
> 
> > I'm writing a simple XML Internet program which must be able to read
> > and parse successive XML elements coming on the same TCP stream (I did
> > not write the protocol so changing this is not an option).
> > 
> > If I write simple code like:
> > 
> >         read_channel = self.socket.makefile('r')
> >         reader = Sax2.Reader()
> >         reply = reader.fromStream(read_channel)
> > 
> > The fromStream method is stalled even after a complete XML element was
> > read because it waits for the channel to close. 
> > 
> > Is there a way to tell fromStream (which seems poorly documented) to
> > yield a result after the first complete element (or after a syntax
> > error)? Or is there a better way to read successive XML elements?
> 
> Well, apparently no one found a simple solution.

There is no simple solution.  The problem is the separation between the stream 
reading code and the actual parser.  The former knows nothing about the element 
structure, so it blocks whenever fewer octets than its buffer size are 
available and the channel is still open.

> I plan to SAX the
> stream first to recognize the beginning and ending of the top-level
> elements and then to hand them on to a DOM builder :-(

My guess is that you're just lucky that SAX works in some cases.  I expect 
that it would have the same problem in certain situations.

The real solution is to hack the code that does the buffered reads so that it 
returns as soon as it has exhausted the octets currently on the channel, and 
perhaps to change the parser so that it determines for itself when it is done 
rather than having the calling code inform it.  This is not a trivial change 
:-(
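
To make the idea concrete, here is a rough sketch.  It does not use 
Sax2.Reader at all; it swaps in the incremental feed() interface of the 
standard-library xml.sax parser, and the names DepthTracker and 
read_one_element are made up for illustration.  The recv argument is assumed 
to behave like socket.recv, which returns whatever octets are available 
instead of waiting for a full buffer.  It also assumes the peer sends one 
top-level element and then waits; octets that follow the element in the same 
read would make the parser raise an error.

```python
import xml.sax


class DepthTracker(xml.sax.ContentHandler):
    """Notes when the top-level element has been closed."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.done = False

    def startElement(self, name, attrs):
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        if self.depth == 0:
            self.done = True


def read_one_element(recv, bufsize=4096):
    """Read from recv until one complete top-level element has been parsed.

    recv is any callable shaped like socket.recv (hypothetical parameter).
    Returns the raw bytes that were consumed.
    """
    handler = DepthTracker()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    chunks = []
    while not handler.done:
        data = recv(bufsize)  # returns as soon as some octets are available
        if not data:
            raise EOFError("channel closed before the element was complete")
        chunks.append(data)
        parser.feed(data)     # the parser itself decides when it is done
    parser.close()
    return b"".join(chunks)
```

With a connected socket this would be called as read_one_element(sock.recv).  
The design choice is that the loop stops on the parser's own end-of-element 
event, not on the channel closing.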

An easier but more slippery and error-prone solution is to write some code to 
recognize the end of a well-formed parse stream yourself and use it to read 
data from the socket separately.
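
A minimal sketch of such do-it-yourself recognition follows, in plain Python 
with no XML library.  The function names are hypothetical.  It counts element 
depth by scanning tags, skips comments, CDATA sections and processing 
instructions, and honors quoted attribute values.  It assumes well-formed 
input, does not cope with a DOCTYPE that has an internal subset, and rescans 
the buffer from the start on every call, which is exactly the kind of 
slipperiness mentioned above.

```python
def _find_tag_close(buf, start):
    """Return the index of the '>' that closes a tag, ignoring quoted '>'."""
    quote = None
    for j in range(start, len(buf)):
        c = buf[j:j + 1]
        if quote:
            if c == quote:
                quote = None
        elif c in (b'"', b"'"):
            quote = c
        elif c == b">":
            return j
    return -1


def find_element_end(buf):
    """Return the offset just past the first complete top-level element in
    buf (bytes), or -1 if buf does not yet hold a complete one."""
    # Constructs whose content must not be mistaken for tags.
    opaque = ((b"<!--", b"-->"), (b"<![CDATA[", b"]]>"), (b"<?", b"?>"))
    depth = 0
    i = 0
    while True:
        lt = buf.find(b"<", i)
        if lt == -1:
            return -1
        for opener, closer in opaque:
            if buf.startswith(opener, lt):
                end = buf.find(closer, lt + len(opener))
                if end == -1:
                    return -1          # construct not finished yet
                i = end + len(closer)
                break
        else:
            gt = _find_tag_close(buf, lt + 1)
            if gt == -1:
                return -1              # tag not finished yet
            i = gt + 1
            if buf.startswith(b"<!", lt):
                continue               # e.g. a simple DOCTYPE, not an element
            if buf.startswith(b"</", lt):
                depth -= 1             # end tag
            elif buf[gt - 1:gt] != b"/":
                depth += 1             # start tag (not an empty-element tag)
            if depth == 0:
                return i


def split_elements(buf):
    """Split buf into complete top-level elements plus the unparsed rest."""
    docs = []
    while True:
        end = find_element_end(buf)
        if end == -1:
            return docs, buf
        docs.append(buf[:end].strip())
        buf = buf[end:]
```

The calling code would append each socket read to a buffer, call 
split_elements on it, keep the returned remainder for the next read, and hand 
each complete element to a DOM builder such as xml.dom.minidom.parseString.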


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
A Python & XML Companion - http://www.xml.com/pub/a/2002/12/11/py-xml.html
XML class warfare - http://www.adtmag.com/article.asp?id=6965
MusicBrainz metadata - http://www-106.ibm.com/developerworks/xml/library/x-think14.html