[Expat-discuss] Large data sets (Expat v2.0.0; compiled cygwin)

Lee Passey lee at novomail.net
Tue May 15 19:31:55 CEST 2007


Ben Keitch wrote:
> Can someone help me with this code? It is trying to convert an XML file of
> book data to tab-delimited text. Should be simple, but it seems to mangle
> about 200 of the 10000 records I give it. Supplying each record by itself,
> it works fine. I don't understand why, but not being a C programmer, I
> dare say I am mangling pointers, or there is a multithread issue I don't
> understand.
> 
> here is a typical error:
> given lines 3380-3383 in a 917682 long XML file (it is well-formed 
> according to xmlwf):
> 
> <record>
> <ISBN10>0816044384</ISBN10>
> <ISBN13>9780816044382</ISBN13>
> <EAN>9780816044382</EAN>
> ...
> </record>
> 
> the data given to the data handler (and printed to stderr) is:
> 
> Data: 9780816   Data: 044382
> Error : isbn10: 0816044384      isbn: 382       isbn13: 044382
> Data:
> Data: 9780816044382
> 
> So in this case, ISBN10 was correct, but ISBN13 only got the last 6 digits
> on the first call, but managed to get all the data on the third call (the
> second call gives a blank line! why?)
> 
> If you give just this XML record to the program, it works fine.
> 
> Any help greatly appreciated

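Be aware of two things: 1. in XML, whitespace /is/ significant, so the 
line breaks between your elements are reported as character data too 
(most likely the source of your blank "Data:" line), and 2. in Expat the 
character data handler may be called several times in a row for a single 
run of text, each time with only part of it.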

You haven't said what BUFSIZ is in your case. Let's assume that it is 
256. If you read the file in 256-byte chunks, sooner or later a chunk 
boundary will almost certainly fall in the middle of an element's 
character data. When that happens, Expat calls the character data 
handler (or text handler) with the partial data, returns to your read 
loop to get more data, then calls the handler again with the remainder 
of the character data.

Try this:

Set a StartElement handler. When the handler is called, save the name of 
the element and reset the output buffer to empty. Now, every time the 
CharacterData handler is called, _and_ we are inside an element which 
can contain character data, /append/ the data to the output buffer. When 
the EndElement handler is called, check to make sure that it matches the 
saved start element (just in case the XML is badly formed) /then/ store 
the element name (or some translation thereof) and the output buffer you 
have accumulated.

Of course, you may need to add code to deal with potential nested 
elements, but that is left as an exercise for the reader.

-- 
Nothing of significance below this line.


