[Expat-bugs] [ expat-Bugs-602729 ] Seg fault (with UTF-16 input)

noreply@sourceforge.net noreply@sourceforge.net
Fri, 30 Aug 2002 20:11:14 -0700


Bugs item #602729, was opened at 2002-08-30 21:27
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=602729&group_id=10127

Category: None
Group: Test Required
>Status: Closed
Resolution: Accepted
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Karl Waclawek (kwaclaw)
Summary: Seg fault (with UTF-16 input)

Initial Comment:

I stumbled over an expat bug that manifests itself as a
seg fault. The maintainers I mailed about it were able to
reproduce it with the XML data example I provided. Thanks
to Karl there is already a fix checked in, and thanks to
Fred there's already a regression test. The only reason
for this bug report is to be the starting point for the
documentation of the case.

rolf


----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-08-30 23:11

Message:
Logged In: YES 
user_id=3066

I checked in a regression test for this as tests/runtests.c
1.35.  Karl committed his fix as lib/xmlparse.c 1.82 and
lib/xmltok_impl.c 1.6.  Closing the bug report as fixed.

Rolf, if you think this doesn't fix the bug, feel free to
follow up, and we can re-open if appropriate.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-08-30 22:17

Message:
Logged In: YES 
user_id=290026

Here is what I think happens to cause the seg fault:

The tokenizer ..._prologTok returns -XML_TOK_PROLOG_S 
when it detects a CR that is at the end of the buffer. This
is supposed to indicate - IMO - that there might be an 
incomplete linebreak, i.e. a LF might be following the CR
in the next buffer.

doProlog handles this by ignoring the token and leaving the
buffer pointer where it is, without advancing it, as in this
code snippet:
...
    if (tok <= 0) {
      if (nextPtr != 0 && tok != XML_TOK_INVALID) {
        *nextPtr = s;
        return XML_ERROR_NONE;
      }
...

However, epilogProcessor tries to report it, because it might
well be the very last token. Like this:
...
    case -XML_TOK_PROLOG_S:
      if (defaultHandler) {
        eventEndPtr = end;
        reportDefault(parser, encoding, s, end);
      }
      /* fall through */
    case XML_TOK_NONE:
      if (nextPtr)
        *nextPtr = end;
      return XML_ERROR_NONE;
...

Now, if the "end" pointer (which will be passed back to
XML_ParseBuffer through nextPtr) points into the middle
of a character, then bufferPtr (in XML_ParseBuffer) points
there too.

This will cause problems for XML_UpdatePosition, because
its while loop is based on iterating over complete characters,
therefore the end condition (ptr == end) will never be true,
and it will continue accessing memory past the end pointer
until a seg fault is triggered.
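
To illustrate why the exit test is never met, here is a minimal
sketch that models the loop with byte offsets instead of pointers.
It is not the XML_UpdatePosition code; the names reaches_end and
SKETCH_MINBPC and the limit parameter are made up for this example,
and the limit only exists so that the sketch itself stays in bounds.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MINBPC 2  /* assumed: two bytes per UTF-16 code unit */

/* Hypothetical model of a position loop that steps whole characters.
   Returns 1 if stepping from offset 0 lands exactly on `end`,
   0 if it steps over `end` and reaches `limit` instead. */
static int reaches_end(size_t end, size_t limit)
{
    size_t pos = 0;

    assert(end <= limit);
    while (pos != end) {          /* same shape as the (ptr == end) test */
        if (pos >= limit)
            return 0;             /* went past `end` without equalling it */
        pos += SKETCH_MINBPC;     /* advance one complete character */
    }
    return 1;
}
```

With an even end offset such as 8 the loop stops normally; with an
odd one such as 7 it steps from 6 to 8 and keeps going, which in the
real parser means reading memory past the end pointer.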

I propose this patch (check the attached diff file):
...
    case -XML_TOK_PROLOG_S:
      if (defaultHandler) {
        eventEndPtr = next;
        reportDefault(parser, encoding, s, next);
      }
      if (nextPtr)
        *nextPtr = next;
      return XML_ERROR_NONE;
    case XML_TOK_NONE:
      if (nextPtr)
        *nextPtr = s;
      return XML_ERROR_NONE;
...

In order to properly report the end of the CR token, I also had 
to patch xmltok_impl.c to update the nextTokPtr parameter
in this case (-XML_TOK_PROLOG_S).
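
The idea behind both changes can be sketched as follows. This is a
simplified stand-in, not the checked-in code: it assumes little-endian
UTF-16, and sk_tok, sk_resume_point and the SK_TOK_* values are
hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; these are not expat's values. */
enum { SK_TOK_NONE = 0, SK_TOK_OTHER = 1, SK_TOK_S = 2 };

/* Toy tokenizer: classifies one token and always reports where it
   ends via *next, also for the "CR at end of buffer" case. */
static int sk_tok(const char *ptr, const char *end, const char **next)
{
    assert(ptr <= end);
    if (end - ptr < 2)
        return SK_TOK_NONE;            /* no complete character left */
    *next = ptr + 2;                   /* end of the first character */
    if (ptr[0] != 0x0D || ptr[1] != 0x00)
        return SK_TOK_OTHER;           /* not a CR */
    if (end - *next < 2)
        return -SK_TOK_S;              /* CR is last: a LF may follow later */
    if ((*next)[0] == 0x0A && (*next)[1] == 0x00)
        *next += 2;                    /* swallow the LF of a CR LF pair */
    return SK_TOK_S;
}

/* Caller side in the spirit of the patch: resume at the token end
   (`next`), or at `s` if nothing was consumed, but never at `end`. */
static const char *sk_resume_point(const char *s, const char *end)
{
    const char *next = s;
    int tok = sk_tok(s, end, &next);

    if (tok == SK_TOK_NONE)
        return s;
    return next;
}
```

For a three-byte buffer holding a CR followed by one stray byte, the
resume point is offset 2, a character boundary, whereas "end" would be
offset 3, in the middle of a character.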


----------------------------------------------------------------------
