[Expat-bugs] [ expat-Bugs-1004302 ] CharacterData split in the middle

Fri Aug 13 03:18:44 CEST 2004

Bugs item #1004302, was opened at 2004-08-05 21:24
Message generated for change (Comment added) made by fdrake
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1004302&group_id=10127

Category: None
Group: None
Status: Closed
Resolution: Rejected
Priority: 5
Submitted By: Martin Quinson (mquinson)
Assigned to: Karl Waclawek (kwaclaw)
>Summary: CharacterData split in the middle

Initial Comment:
Hello,

I'm trying to use expat on a very big document (about
50 MB), so the first bad news is that XML_Parse takes
the size of the buffer as a regular int instead of a
long. But that's OK, I can feed the parser in several
pieces, thanks to the isFinal argument.

The problem I would like to report here is that when
the document is not finished yet, expat calls the
CharacterDataHandler at the end of the buffer, even if
there is more data for it after the feed boundary. A
log of my test program can be found in the attachment.
All character data reads 12345 for ease of debugging.
The "1..k" strings give the first and last characters
fed to the parser at that time.

As you can see, the CharacterDataHandler is called
twice for a->clauses[368].literals[16] the first time
with '1' and the second one with '2345'.  As reported
by the lines indicating the boundaries, that's exactly
where the feed boundaries are.
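
The situation described above can be sketched with Python's
standard xml.parsers.expat module, which wraps Expat. This is
only an illustration, not the reporter's test program: the
helper name parse_in_pieces and the tiny document are made up
for the example. How many pieces the text arrives in depends
on the Expat build, so the sketch relies only on the pieces
joining back into the original text.

```python
import xml.parsers.expat


def parse_in_pieces(pieces):
    """Feed `pieces` to one parser; record every CharacterDataHandler call."""
    calls = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = calls.append
    for i, piece in enumerate(pieces):
        # The second argument plays the role of isFinal in XML_Parse().
        parser.Parse(piece, i == len(pieces) - 1)
    return calls


# The feed boundary falls in the middle of the text "12345".
calls = parse_in_pieces(["<a>1", "2345</a>"])
print(calls)           # the text may be reported in more than one piece
print("".join(calls))  # the pieces always join back into the full text
```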

I really need this working ASAP for my job. I tried
to fix it myself, but failed miserably. I know it's an
open source project and what that implies, so I don't
want to rush you. Simply, if you have a fix for it,
would you mind sending it to me so that I can continue
my project, please?

I use version 1.95.8 of expat, plus the Debian package
patches (I updated the package to make sure the fix was
not already in the newest version without having to
pollute my /usr/local).

Of course, I cannot produce a "short but complete XML
document that exhibits a bug that you are reporting".
The bug occurs when the file is larger than 32 KB (maxint).

Thanks a lot for your time,
Mt.

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2004-08-12 21:18

Message:
Logged In: YES 
user_id=3066

While there's some convenience for the application developer
to be had by providing the buffering directly in Expat, this
would force all applications to pay for the overhead, and
not all applications need to buffer character data.

I'll note that I added application-controlled buffering in
the Python wrapper for Expat; it's not particularly
difficult to write a reasonable buffering implementation. 
One of the issues with doing so, however, is that some space
is needed in the application's data structure that's tacked
onto the Parser object using XML_SetUserData(); not
requiring this would require adding more overhead to the
parser implementation in some form.

Perhaps what's required is better documentation and sample
code, so that everyone has a reasonable starting point to
work from.  This might be made especially easy to re-use for
C99 users.  If that's reasonable, we can re-open this as a
documentation request (assigned to me).
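
As a rough illustration of the buffering in the Python wrapper,
the sketch below uses the buffer_text attribute of parser
objects in xml.parsers.expat, presumably the feature referred
to above. The helper name text_calls and the sample document
are assumptions made for the example. As far as I know, the
wrapper may still deliver pending text at the end of each
Parse() call, so this is shown with a single call and text
containing a line break.

```python
import xml.parsers.expat


def text_calls(document, buffer_text):
    """Parse `document` in one call; return each chunk of character data."""
    calls = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask the wrapper to merge contiguous text before calling the handler.
    parser.buffer_text = buffer_text
    parser.CharacterDataHandler = calls.append
    parser.Parse(document, True)
    return calls


doc = "<a>line1\nline2</a>"
print(text_calls(doc, False))  # usually several chunks around the line break
print(text_calls(doc, True))   # one merged chunk
```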

I've tried to clarify the issue summary.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-08-12 21:16

Message:
Logged In: NO 

Karl, how about an End-CharacterDataHandler()? Would that 
be feasible? It would make the XML parsing much safer to 
know the exact end before passing on the character data.

mattes

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2004-08-12 19:32

Message:
Logged In: YES 
user_id=290026

Changing the behaviour would mean major surgery in Expat 
and is unlikely to happen.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-08-12 19:19

Message:
Logged In: NO 

I am a newbie to expat, so I don't have much experience.
I encountered the same 'character data split' problem
as Martin.

We deal with small portions of up to 4k. Our XML data is 
served via TCP/IP, where there is simply no guarantee about 
when and how the data is received; it could arrive in one or 
several packets. This leads to the exact same problem: the 
character data handler is called more than once.

Karl Waclawek describes doing the buffering on the
application side as a workaround. This is pretty inconvenient 
though, and not very clean to implement. 

Life would be much easier if Start- and End-
CharacterDataHandler callback functions were available 
(like they are for elements and CDATA sections).

I agree with Martin that this should be handled by expat.
E.g. expat only calls back for attributes once they are
complete.

mattes at mykmk.com

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2004-08-06 09:40

Message:
Logged In: YES 
user_id=290026

I guess the reason is speed.
Most SAX parsers I know of behave exactly the same.

Closing this issue.

----------------------------------------------------------------------

Comment By: Martin Quinson (mquinson)
Date: 2004-08-06 03:25

Message:
Logged In: YES 
user_id=85746

What, do you mean you put bugs in expat on purpose? Just
kidding, of course.

OK, if that's the way to go, that's the way I'll go, even if
that's a real pain in the ass. Why couldn't expat just
buffer this data until it has it all?

I should have RTFM before submitting this bug, sorry. Feel
free to close this "bug".

Thanks for your time, Mt.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2004-08-05 23:12

Message:
Logged In: YES 
user_id=290026

It is a documented behaviour of Expat that character
data, even if contiguous, may be reported in chunks, that
is, through multiple call-backs. For instance, Expat will
normally call the character data handler when it sees a line 
break, even if there are more characters to come.

I am afraid that the observed behaviour works as designed.
The standard approach for dealing with it is to use a buffer
to accumulate the characters until the next boundary
(start or end of element tag) is encountered.
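
The approach described above can be sketched as follows, again
with Python's xml.parsers.expat module. The class name
TextCollector and its attributes are hypothetical; the object
plays the role that the structure registered with
XML_SetUserData() would play in a C application. Text is
accumulated in a list and only handed on when a start or end
tag is seen.

```python
import xml.parsers.expat


class TextCollector:
    """Accumulate character data until the next start or end tag."""

    def __init__(self):
        self._pending = []  # chunks received since the last tag
        self.texts = []     # complete runs of character data
        self.parser = xml.parsers.expat.ParserCreate()
        self.parser.StartElementHandler = self._start
        self.parser.EndElementHandler = self._end
        self.parser.CharacterDataHandler = self._pending.append

    def _flush(self):
        # Hand on the accumulated text, if any, as one string.
        if self._pending:
            self.texts.append("".join(self._pending))
            self._pending.clear()

    def _start(self, name, attrs):
        self._flush()

    def _end(self, name):
        self._flush()

    def feed(self, data, final=False):
        self.parser.Parse(data, final)


collector = TextCollector()
collector.feed("<a><b>1")                        # boundary inside the text
collector.feed("2345</b><b>x\ny</b></a>", True)  # second run has a line break
print(collector.texts)
```

The cost is one list per parser and a join per text run, which
is the overhead an application opts into only when it needs
whole strings.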

----------------------------------------------------------------------