[Expat-bugs] [ expat-Bugs-1004302 ] CharacterData split in the middle when isFinal=0

SourceForge.net noreply at sourceforge.net
Fri Aug 6 15:40:35 CEST 2004


Bugs item #1004302, was opened at 2004-08-05 21:24
Message generated for change (Comment added) made by kwaclaw
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1004302&group_id=10127

Category: None
Group: None
>Status: Closed
>Resolution: Rejected
Priority: 5
Submitted By: Martin Quinson (mquinson)
Assigned to: Karl Waclawek (kwaclaw)
Summary: CharacterData split in the middle when isFinal=0

Initial Comment:
Hello,

I'm trying to use expat on a very big document (about
50 MB), so the first bad news is that XML_Parse takes
the size of the buffer as a regular int instead of a
long. But that's OK, I can feed the parser in several
chunks, thanks to the isFinal argument.

The problem I would like to report here is that when
the input is not finished yet (isFinal=0), expat calls
the CharacterDataHandler at the end of the buffer, even
if there is more data for it after the feed boundary. A
log of my test program can be found in the attachment.
All character data reads 12345 for ease of debugging.
The "1..k" strings give the first and last characters
fed to the parser at that time.

As you can see, the CharacterDataHandler is called
twice for a->clauses[368].literals[16], the first time
with '1' and the second time with '2345'. As shown
by the lines indicating the boundaries, that's exactly
where the feed boundaries are.
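
The situation described above can be reproduced in a few lines. This is only a minimal sketch, assuming Python's standard `xml.parsers.expat` binding rather than the C test program from the attachment; the element name `literal` and the helper `feed_in_chunks` are made up for illustration. How many callbacks fire, and where the text is cut, is up to the parser, so the only thing the sketch relies on is that the pieces join back to the full text.

```python
import xml.parsers.expat


def feed_in_chunks(chunks):
    """Feed `chunks` to one parser; return the text of each character-data callback."""
    calls = []
    parser = xml.parsers.expat.ParserCreate()
    # Record every piece of character data exactly as the parser reports it.
    parser.CharacterDataHandler = calls.append
    for i, chunk in enumerate(chunks):
        is_final = (i == len(chunks) - 1)  # only the last chunk ends the document
        parser.Parse(chunk, is_final)
    return calls


# The feed boundary falls in the middle of the text "12345".
calls = feed_in_chunks(["<literal>1", "2345</literal>"])
print(calls)            # the text may arrive in more than one piece
print("".join(calls))   # joining the pieces restores the original text
```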

I really need this working ASAP for my job. I tried
to fix it myself, but failed badly. I know it's an
open source project and what that implies, so I don't
want to rush you. Simply, if you have a fix for
it, would you mind sending it to me so that I can
continue my project, please?

I use version 1.95.8 of expat, plus the Debian
package patches (I updated the package to make sure
the fix was not already in the newest version without
having to pollute my /usr/local).

Of course, I cannot produce a "short but complete XML
document that exhibits a bug that you are reporting".
The bug occurs when the file is larger than 32 KB (maxint).

Thanks a lot for your time,
Mt.

----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2004-08-06 09:40

Message:
Logged In: YES 
user_id=290026

I guess the reason is speed.
Most SAX parsers I know of behave exactly the same.

Closing this issue.

----------------------------------------------------------------------

Comment By: Martin Quinson (mquinson)
Date: 2004-08-06 03:25

Message:
Logged In: YES 
user_id=85746

What, do you mean you put bugs in expat on purpose? Just
kidding, of course.

OK, if that's the way to go, that's the way I'll go, even if
that's a real pain in the ass. Why couldn't expat just
buffer this data until it has it all?

I should have RTFM before submitting this bug, sorry. Feel
free to close this "bug".

Thanks for your time, Mt.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2004-08-05 23:12

Message:
Logged In: YES 
user_id=290026

It is a documented behaviour of Expat that character
data, even if contiguous, may be reported in chunks, that
is, through multiple call-backs. For instance, Expat will
normally call the character data handler when it sees a line
break, even if there are more characters to come.

I am afraid that the observed behaviour works as designed.
The standard approach for dealing with it is to use a buffer
to accumulate the characters until the next boundary
(start or end of element tag) is encountered.
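
The accumulation approach described above might look roughly like this. It is a sketch under stated assumptions, not code from Expat itself: it uses Python's standard `xml.parsers.expat` binding, and the class `TextAccumulator` and its `feed` method are hypothetical names chosen for illustration. Pieces of character data are collected in a list and only joined and handed on when a start or end tag is seen.

```python
import xml.parsers.expat


class TextAccumulator:
    """Collects character data and emits it whole at element boundaries."""

    def __init__(self):
        self.pieces = []   # chunks received since the last tag
        self.texts = []    # complete text runs, one per flush
        self.parser = xml.parsers.expat.ParserCreate()
        self.parser.StartElementHandler = self._start
        self.parser.EndElementHandler = self._end
        # Character data is only stored here, never processed piecemeal.
        self.parser.CharacterDataHandler = self.pieces.append

    def _flush(self):
        # Join whatever arrived since the previous tag into one string.
        if self.pieces:
            self.texts.append("".join(self.pieces))
            self.pieces.clear()

    def _start(self, name, attrs):
        self._flush()

    def _end(self, name):
        self._flush()

    def feed(self, chunk, is_final=False):
        self.parser.Parse(chunk, is_final)


acc = TextAccumulator()
acc.feed("<a><b>1")                      # feed boundary inside the text
acc.feed("2345</b></a>", is_final=True)
print(acc.texts)
```

In the Python binding, setting the parser's `buffer_text` attribute to true asks it to coalesce text in a similar way; whether that is sufficient depends on the application, and the C API leaves the buffering to the caller as described above.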

----------------------------------------------------------------------


