[Expat-bugs] [ expat-Bugs-1284386 ] Byte count in large XML files fails

SourceForge.net noreply at sourceforge.net
Mon Nov 28 03:01:08 CET 2005


Bugs item #1284386, was opened at 2005-09-08 01:01
Message generated for change (Comment added) made by pointsman
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1284386&group_id=10127

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Karl Waclawek (kwaclaw)
Summary: Byte count in large XML files fails

Initial Comment:

XML_GetCurrentByteIndex(XML_Parser parser) returns a
long, which on most 32-bit systems is 32 bits wide.
That means that for XML input larger than 2 GByte,
XML_GetCurrentByteIndex() does not return the right
number.

Sure, such big XML files will be parsed in chunks, so
it is possible to keep track of the number of overflows
yourself, but come on.

It's surely a limbo dance of its own to introduce long
long in a source as portable as expat, but that would
be the fix.

If you switch to long long (where available) for this,
please also consider XML_GetCurrentLineNumber() and
XML_GetCurrentColumnNumber(). They return an int, which
on most 32-bit systems tops out at 2 Gig. That said, I
haven't stumbled over these two limits in real life, as I
in fact did with XML_GetCurrentByteIndex().
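
A minimal sketch of the "keep track yourself" workaround described
above, assuming the application counts the bytes it has fed to the
parser in a 64-bit variable. The helper widen_byte_index() is
hypothetical and not part of Expat. It also assumes the true index is
no larger than the number of bytes fed and lies less than 4 GiB
behind it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of Expat: rebuild a 64-bit byte offset
 * from an index that wrapped at 32 bits. bytes_fed is the application's
 * own 64-bit count of bytes handed to the parser so far. */
static int64_t widen_byte_index(int32_t wrapped, int64_t bytes_fed)
{
    const uint64_t span = UINT64_C(1) << 32;   /* one 32-bit wrap */
    uint64_t low, fed, candidate;

    assert(bytes_fed >= 0);
    low = (uint32_t)wrapped;                   /* low 32 bits of the index */
    fed = (uint64_t)bytes_fed;

    /* Take the high bits from bytes_fed, the low bits from the index. */
    candidate = (fed & ~(span - 1)) | low;

    /* The index cannot lie ahead of what was fed, so step back one wrap. */
    if (candidate > fed)
        candidate -= span;
    return (int64_t)candidate;
}
```

This only papers over the problem on the application side; a 64-bit
return type in the library makes the bookkeeping unnecessary.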

----------------------------------------------------------------------

>Comment By: Rolf Ade (pointsman)
Date: 2005-11-28 02:01

Message:
Logged In: YES 
user_id=13222

XML_GetCurrentByteIndex() could return -1:
Of course! You're right. And it makes sense. A freshly
created or reset parser on which XML_Parse() has not yet
been called returns -1 from XML_GetCurrentByteIndex() to
signal exactly that: it is not at the start of the document;
rather, parsing hasn't started yet. Nice detail. I should
have looked at the implementation before replying. Note:
that detail isn't mentioned in the documentation.

I'm fine with a signed long long. 2^63 should be big enough,
for the next few weeks ;-).

Re the defines: Basically yes. It's just that I'm pretty
sure we need one more round: some configure check for long
long and, depending on the result, defining XML_?Int64 as
long long or just long.

I'll look something up (but since I'm chasing a deadline, it
may take a bit of time).
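
One way the extra round could look, as a sketch only. HAVE_LONG_LONG
is an assumed macro that a configure check would define; it is not
taken from Expat's build system. The C99 macro LLONG_MAX is used as a
second hint, and plain long is the fallback.

```c
#include <limits.h>

/* HAVE_LONG_LONG is an assumption: a configure script would define it
 * when the compiler accepts long long. */
#if defined(XML_USE_MSC_EXTENSIONS)
typedef __int64 XML_Int64;      /* older VC++ has no long long */
#elif defined(HAVE_LONG_LONG) || defined(LLONG_MAX)
typedef long long XML_Int64;
#else
typedef long XML_Int64;         /* fallback: may be only 32 bits wide */
#endif

/* Compile-time check: whichever type was chosen is never narrower than
 * the long used today. A negative array size stops the build. */
typedef char xml_int64_not_narrower_than_long
    [(sizeof(XML_Int64) >= sizeof(long)) ? 1 : -1];
```

With the fallback in effect the 2 GByte limit remains, so callers on
such platforms would still need their own overflow handling.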

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-28 01:00

Message:
Logged In: YES 
user_id=290026

On a 32-bit CPU, 64-bit integer operations are considerably 
slower than 32-bit operations. On the other hand, 
XmlUpdatePosition isn't called that often - mostly when 
you actually request the line/column number.
So, I agree - no configuration necessary.

For the other point:
If you look at the XML_GetCurrentByteIndex() code, it can 
return -1, and it is calculated using a subtraction. So in 
practice and theory, it must be a signed integer.

XML_GetCurrentByteCount is derived from a subtraction as 
well, but we know it will be positive because eventEndPtr 
should always be larger than eventPtr. So we could risk 
using an unsigned integer.
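
The signedness argument can be sketched as follows. The struct and
function names are hypothetical stand-ins for the parser fields
mentioned in this thread; this is not a copy of the Expat source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser state discussed above. */
struct sketch_parser {
    long long parse_end_byte_index; /* bytes consumed up to parse_end_ptr */
    const char *parse_end_ptr;      /* end of the data parsed so far */
    const char *event_ptr;          /* start of current event, or NULL */
    const char *event_end_ptr;      /* end of current event, or NULL */
};

/* Signed result: -1 means "no event yet", and the value comes from a
 * subtraction, so an unsigned type would not fit. */
static long long sketch_byte_index(const struct sketch_parser *p)
{
    if (p->event_ptr == NULL)
        return -1;
    return p->parse_end_byte_index - (p->parse_end_ptr - p->event_ptr);
}

/* Unsigned result: the end pointer is never before the start pointer,
 * so the difference cannot be negative. */
static unsigned long long sketch_byte_count(const struct sketch_parser *p)
{
    if (p->event_ptr == NULL || p->event_end_ptr == NULL)
        return 0;
    assert(p->event_end_ptr >= p->event_ptr);
    return (unsigned long long)(p->event_end_ptr - p->event_ptr);
}
```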

Just playing around, I added this to expat_external.h:
#ifdef XML_USE_MSC_EXTENSIONS
typedef __int64 XML_Int64;
typedef unsigned __int64 XML_UInt64;
#else
typedef long long XML_Int64;
typedef unsigned long long XML_UInt64;
#endif

What do you think?

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-28 00:33

Message:
Logged In: YES 
user_id=13222

Configurable: No
There is nearly no overhead: just a few variables that are
(at most) 8 bytes long instead of 4. Speed-wise also: not
measurable.

long long acceptable everywhere: Probably not
Some very old or limited embedded systems may not have a long
long (or equivalent). Therefore we probably need defines.

Byte index could be negative: I don't think so.
How could that happen? Byte index starts at 0 and grows. Or
am I missing something?


----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 23:21

Message:
Logged In: YES 
user_id=290026

Just some notes, so that I don't forget

- should it be configurable?
  some may not want the overhead of a 64bit integer, 
  especially for line number and column number.
- is long long acceptable everywhere else
  (other than VC++)?
- the byte index could be negative, but not
  line/column number and byte count, right?
  

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 20:31

Message:
Logged In: YES 
user_id=290026


You are right, Rolf, it should be 64 bits even on a 32bit
platform. I guess I should make a note in the docs that 
Expat supports > 2GB files, as long as each chunk passed
to the XML_Parse routines is smaller than 2GB.
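
The chunk limit follows from XML_Parse() taking its length argument
as an int. A minimal sketch of the chunking arithmetic, with
hypothetical helper names and no actual Expat calls, keeps the running
offset in 64 bits while every chunk length stays within an int:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: length of the next chunk, never larger than the
 * caller's buffer, so it always fits the int that XML_Parse() expects. */
static int next_chunk_len(int64_t remaining, int buffer_size)
{
    assert(buffer_size > 0);
    if (remaining <= 0)
        return 0;
    return remaining < (int64_t)buffer_size ? (int)remaining : buffer_size;
}

/* Walk a document of total_bytes in buffer-sized pieces and report how
 * many chunks that takes. The offset is 64-bit, so it survives 2 GB. */
static int64_t count_chunks(int64_t total_bytes, int buffer_size)
{
    int64_t offset = 0;
    int64_t chunks = 0;
    int len;

    while ((len = next_chunk_len(total_bytes - offset, buffer_size)) > 0) {
        /* A real caller would read len bytes here and pass them on to
         * XML_Parse(), setting isFinal on the last chunk. */
        offset += len;
        chunks++;
    }
    return chunks;
}
```

For example, a 5 GiB input read through a 64 KiB buffer takes 81920
chunks.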

There are also issues around compiling Expat on a 64-bit
platform, but at least for VC++, someone has provided
a patch (bug # 1105135) which looks like it should work
on other platforms as well (just a bunch of type casts).

One issue I have already seen is that VC++ 6.0 does not
know about long long.

Thanks for having a look at the cross-platform issue.

I am trying to get Expat 2.0 released despite Fred
not being active on Expat anymore.

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-27 20:22

Message:
Logged In: YES 
user_id=13222

Karl,

Most reasonable 32-bit platforms support files larger
than 2 GB these days. It was in fact a 32-bit platform
on which I stumbled over the problem. So much for your
easy question.

Much harder is how to solve this in a portable way. I'm
afraid that may need platform-dependent #defines (with
fallback to long).

I'll go digging into what other portable software does in
this case and will come back with a more concrete proposal.

rolf




----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 19:22

Message:
Logged In: YES 
user_id=290026

Rolf,

should the type be 64 bit integer on all platforms,
or 32bit on 32bit platforms and 64bit on 64bit platforms?
I think we are talking about m_parseEndByteIndex,
POSITION.lineNumber and POSITION.columnNumber.

Options could be size_t, ptrdiff_t.
MS VC++ 6.0 does not know about long long, but it knows
about __int64. Is there an ANSI definition for 64 bit ints?

What do you suggest that works on all platforms?

Karl





----------------------------------------------------------------------
