[Expat-discuss] Large data sets (Expat v2.0.0; compiled cygwin)

Ben Keitch bkeitch at googlemail.com
Wed May 16 00:24:00 CEST 2007


Brilliant, Lee, that would explain things. I did wonder if something like
that was going on.

I will do as you suggest and add chunks of text together.

Strangely, I had more success using Perl's wrapper around Expat.
However, that takes 2 hours to run instead of 30 seconds!

Thanks again for your help,

Ben


On 15/05/07, Lee Passey <lee at novomail.net> wrote:
>
> Ben Keitch wrote:
> > Can someone help me with this code? It is trying to convert an XML file of
> > book data to tab-delimited. Should be simple, but it seems to mangle
> > about 200 of the 10000 records I give it. Supplying each record by itself,
> > it works fine. I don't understand why, but not being a C programmer, I
> > dare say I am mangling pointers, or there is a multithread issue I don't
> > understand.
> >
> > here is a typical error:
> > given lines 3380-3383 in a 917682 long XML file (it is well-formed
> > according to xmlwf):
> >
> > <record>
> > <ISBN10>0816044384</ISBN10>
> > <ISBN13>9780816044382</ISBN13>
> > <EAN>9780816044382</EAN>
> > ...
> > </record>
> >
> > the data given to the data handler (and printed to stderr) is:
> >
> > Data: 9780816   Data: 044382
> > Error : isbn10: 0816044384      isbn: 382       isbn13: 044382
> > Data:
> > Data: 9780816044382
> >
> > So in this case, ISBN10 was correct, but ISBN13 only got the last 6 digits
> > on the first call, but managed to get all the data on the third call (the
> > second call gives a blank line! why?)
> >
> > If you give just this XML record to the program, it works fine.
> >
> > Any help greatly appreciated
>
> Be aware of two things: 1. in XML, whitespace /is/ significant, and 2.
> in Expat the character data handler may be called multiple times,
> sequentially, with partial data.
>
> In your case, you haven't indicated the definition of BUFSIZ. Let's
> assume that BUFSIZ is 256. If you read a file in 256-byte chunks, in all
> likelihood at some point you're going to split a run of character data
> across two chunks. In this case, Expat will call the character data
> handler (or text handler) with the partial data, return to the main
> method to get more data, then call the handler with the remainder of
> the character data.
>
> Try this:
>
> Set a StartElement handler. When the handler is called, save the name of
> the element and reset the output buffer to empty. Now, every
> time the CharacterData handler is called, _and_ we are inside an element
> which can contain character data, /add/ the data to the output buffer. When
> the EndElement handler is called, check to make sure that it matches the
> start element (just in case the XML is badly formed) /then/ store the
> element name (or some translation thereof) and the output buffer you
> have accumulated.
>
> Of course, you may need to add code to deal with potential nested
> elements, but that is left as an exercise for the reader.
>
> --
> Nothing of significance below this line.
>
> _______________________________________________
> Expat-discuss mailing list
> Expat-discuss at libexpat.org
> http://mail.libexpat.org/mailman/listinfo/expat-discuss
>
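
Lee's accumulate-then-flush suggestion above can be sketched in C roughly
as follows. This is a minimal sketch, not the code from the original post
(which is not shown here). The ParseState struct, the handler names
on_start/on_text/on_end, the helper append_bytes, and the FIELD_MAX/ROW_MAX
limits are all hypothetical, and it assumes every element other than
<record> is a leaf field. The handlers use plain char where Expat uses
XML_Char, which assumes the default UTF-8 build. The block deliberately
does not include expat.h, so the buffering logic can be read on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 256   /* assumed upper bound on one field's text */
#define ROW_MAX   4096  /* assumed upper bound on one output row   */

/* Hypothetical per-parse state, handed to the parser as user data. */
typedef struct {
    char   text[FIELD_MAX]; /* text accumulated for the current field  */
    size_t text_len;
    int    in_field;        /* nonzero between a field's start and end */
    char   row[ROW_MAX];    /* tab-delimited row for the current record */
    size_t row_len;
    int    fields;          /* fields already appended to row */
} ParseState;

/* Append n bytes to a NUL-terminated buffer, truncating if it is full. */
static void append_bytes(char *dst, size_t *dst_len, size_t cap,
                         const char *src, size_t n)
{
    size_t room = cap - 1 - *dst_len;
    if (n > room)
        n = room;
    memcpy(dst + *dst_len, src, n);
    *dst_len += n;
    dst[*dst_len] = '\0';
}

/* Start tag: begin a new row for <record>, else reset the field buffer. */
static void on_start(void *ud, const char *name, const char **atts)
{
    ParseState *st = ud;
    (void)atts;
    if (strcmp(name, "record") == 0) {
        st->row_len = 0;
        st->row[0] = '\0';
        st->fields = 0;
        st->in_field = 0;
    } else {
        st->text_len = 0;
        st->text[0] = '\0';
        st->in_field = 1;
    }
}

/* Character data: may arrive in several pieces, so append, never assign. */
static void on_text(void *ud, const char *s, int len)
{
    ParseState *st = ud;
    assert(len >= 0);
    if (st->in_field)
        append_bytes(st->text, &st->text_len, sizeof st->text, s, (size_t)len);
}

/* End tag: only now is the field's text known to be complete. */
static void on_end(void *ud, const char *name)
{
    ParseState *st = ud;
    if (strcmp(name, "record") == 0) {
        puts(st->row);                      /* one finished row per record */
    } else if (st->in_field) {
        if (st->fields++ > 0)
            append_bytes(st->row, &st->row_len, sizeof st->row, "\t", 1);
        append_bytes(st->row, &st->row_len, sizeof st->row,
                     st->text, st->text_len);
    }
    st->in_field = 0;   /* whitespace between elements is ignored */
}
```

With Expat, these would be registered through XML_SetUserData(parser, &st),
XML_SetElementHandler(parser, on_start, on_end) and
XML_SetCharacterDataHandler(parser, on_text). The end-tag check Lee mentions
is left out because Expat itself reports mismatched tags as a
well-formedness error. Over-long fields are silently truncated here. Real
code would want to grow the buffer or report the overflow.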

