[Expat-bugs] [ expat-Bugs-653180 ] problem with column and line numbers

noreply at sourceforge.net noreply at sourceforge.net
Mon Dec 16 09:33:26 EST 2002


Bugs item #653180, was opened at 2002-12-13 05:01
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=653180&group_id=10127

Category: None
Group: Test Required
Status: Open
Resolution: Fixed
Priority: 5
Submitted By: MM Zeeman (mmzeeman)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: problem with column and line numbers 

Initial Comment:

XML_GetCurrentColumnNumber returns 2 times the  actual
column 
number. The same holds for XML_GetCurrentLineNumber. The 
XML_GetCurrentByteIndex function returns the correct
value. 

The expat version i'm using is: expat_1.95.5



----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2002-12-16 12:33

Message:
Logged In: YES 
user_id=290026

Thanks a lot for the test cases. We are really grateful 
when somebody helps out! After all, this is a volunteer
effort.

About the HTML docs - they are usually a little behind.
If in doubt, check expat.h - or the source <g>.

And yes, Expat is good for wrappers. That is how I got 
into it (for Delphi). It supports basically everything 
needed to create a SAX2 wrapper with all SAX extensions.

Entity reporting, however, has the same limitations
as SAX, but we plan to find a way around it for a future 
release - so that Expat can provide everything needed
for building a DOM (it *almost* does).

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-12-16 11:18

Message:
Logged In: NO 

The HTML documentation tells me something different. It
indeed starts with an indication stating that these
functions are useful when XML_Parse returns 0, but also that
they are quote "the position reporting functions are useful
outside of errors. " It only states that it can not be used
inside a DTD.

So the situation was not really clear. What was clear is
that calling the line or column number function outside an
handler before final resulted in a wrong answer, and as a
bonus goofes up the further line and column number reports.

Btw I still think, Expat is a really good XML library. It
has a nice API, which makes it easy to create wrappers (I
was working on an expat wrapper for OCaml)

Here are the tests you requested:

/* 
 * Line and column numbers inside handlers test
 */

static void
start_element_event_handler2(void *userData, const XML_Char
*name,
			     const XML_Char **attr)
{
    CharData *storage = (CharData *) userData;
    char buffer[100];
    
    sprintf(buffer, "<%s> at col:%d line:%d\n", name, 
	    XML_GetCurrentColumnNumber(parser),
	    XML_GetCurrentLineNumber(parser));
    CharData_AppendString(storage, buffer);
}

static void
end_element_event_handler2(void *userData, const XML_Char *name)
{
    CharData *storage = (CharData *) userData;
    char buffer[100];

    sprintf(buffer, "</%s> at col:%d line:%d\n", name, 
	    XML_GetCurrentColumnNumber(parser),
	    XML_GetCurrentLineNumber(parser));
    CharData_AppendString(storage, buffer);
}

START_TEST(test_line_and_column_numbers_inside_handlers)
{
    char *text = "<a>\n"        /* the unix idea of an
end-of-line */
	         "  <b>\r\n"    /* the windows of an end-of-line*/
	         "    <c/>\r"   /* also counts as an end-of-line */
                 "  </b>\n"
                 "  <d>\n"
                 "    <f/>\n"
                 "  </d>\n"
                 "</a>";
    char *expected = "<a> at col:0 line:1\n"
	             "<b> at col:2 line:2\n"
	             "<c> at col:4 line:3\n"
	             "</c> at col:8 line:3\n"
	             "</b> at col:2 line:4\n"
	             "<d> at col:2 line:5\n"
	             "<f> at col:4 line:6\n"
	             "</f> at col:8 line:6\n"
	             "</d> at col:2 line:7\n"
	             "</a> at col:0 line:8\n";
    CharData storage;

    CharData_Init(&storage);
    XML_SetUserData(parser, &storage);
    XML_SetStartElementHandler(parser,
start_element_event_handler2);
    XML_SetEndElementHandler(parser,
end_element_event_handler2);
    if (XML_Parse(parser, text, strlen(text), 1) ==
XML_STATUS_ERROR)
        xml_failure(parser);

    CharData_CheckString(&storage, expected); 
}
END_TEST

START_TEST(test_line_number_after_error)
{
  char *text = "<a>\n"
               "  <b>\n"
               "  </a>"; /* missing </b> */
  int lineno;
  if (XML_Parse(parser, text, strlen(text), 0) !=
XML_STATUS_ERROR)
    fail("Expected a parse error");

  lineno = XML_GetCurrentLineNumber(parser);
  if (lineno != 3) {
    char buffer[100];
    sprintf(buffer, "expected 3 lines, saw %d", lineno);
    fail(buffer);
  }
}
END_TEST
    
START_TEST(test_column_number_after_error)
{
  char *text = "<a>\n"
               "  <b>\n"
               "  </a>"; /* missing </b> */
  int colno;
  if (XML_Parse(parser, text, strlen(text), 0) !=
XML_STATUS_ERROR)
    fail("Expected a parse error");

  colno = XML_GetCurrentColumnNumber(parser);
  if (colno != 4) { 
    char buffer[100];
    sprintf(buffer, "expected 4 columns, saw %d", colno);
    fail(buffer);
  }
}
END_TEST



----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-12-15 22:49

Message:
Logged In: YES 
user_id=290026

Just a comment:
According to the documentation in expat.h, the line/column
number functions should only be called when XML_Parse
or XML_ParseBuffer return 0 (error) or from callback handlers.

So, strictly speaking, this was not a bug.

Another question - mostly for Fred:
Why is it that line numbers are 1-based, and column
numbers are 0-based? If we keep this behaviour then
we should at least document it!

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-12-15 15:05

Message:
Logged In: YES 
user_id=290026

Great - I will apply a patch to CVS.
Could you please subject it to a few tests for other
situations, like:
- calling the line/column functions after an error
- calling them in a handler
And report back here. If OK, this patch will go into
the next release, which is coming soon.

To Fred Drake:
Fred, seems there is already a regression test case 
in the posts below. Would you please add it.


----------------------------------------------------------------------

Comment By: MM Zeeman (mmzeeman)
Date: 2002-12-15 14:07

Message:
Logged In: YES 
user_id=350634

Thanks, this solved the problem. 

This was the first time I looked at the expat code, so I was
a bit surprised to see that XML parsing is not as simple as
I thought it would be. On the other hand, maybe the code
needs to be refactored here and there too. 

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-12-15 11:34

Message:
Logged In: YES 
user_id=290026

OK, I think this will be better.
Same places as before, but instead of the previous 
statements insert these after XMLUpdatePosition:

in XML_Parse(): positionPtr = end;
in XML_ParseBuffer(): positionPtr = bufferPtr;

Why is it not part of the scanning routines?
I can only guess, since I didn't write it.
I would say that
a) it would make the code even more complicated
b) it would likely not be faster

----------------------------------------------------------------------

Comment By: MM Zeeman (mmzeeman)
Date: 2002-12-15 11:20

Message:
Logged In: YES 
user_id=350634

No, this does not solve the problem.

tests/runtests
Running suite(s): basic
93%: Checks: 31, Failures: 2, Errors: 0
tests/runtests.c:342:F:basic tests: expected 4 lines, saw 7
tests/runtests.c:357:F:basic tests: expected 11 columns, saw 22
make: *** [check] Error 1

Btw, this whole line/column counting seems to be one big
side-effect ;) Why is
it not implemented as part of the normal scanning routines.
It seems like the buffer passed to XML_Parse will be checked
(again) just to adjust the line and column numbers. During
the "normal" scanning phase nothing is done to adjust the
line and column numbers. 

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-12-15 01:54

Message:
Logged In: YES 
user_id=290026

Try this - not tested:

in XML_Parse(), right after XMLUpdatePosition,
insert this line:

  eventPtr = eventEndPtr = end;

and in XML_ParseBuffer, also right after XMLUpdatePosition,
insert (as part of the conditional statement):

  eventPtr = eventEndPtr = bufferPtr;

That should hopefully prevent double counting.
Haven't really checked possible side-effects.


----------------------------------------------------------------------

Comment By: MM Zeeman (mmzeeman)
Date: 2002-12-15 00:48

Message:
Logged In: YES 
user_id=350634

Hmm,

It looks more like the line and column numbers are counted
double outside the handlers. When I added this piece of code
in front of the update routine the counting was correct.
Looking at the source if found that outside the handers the
update is called twice. It is first called in parse, and
later when you call either XML_GetCurrentLineNumber, or
XML_GetCurrentColumnNumber, because for some reason the
eventPtr is set. (Which to me seems like this should only be
the case in handlers, but I'm not sure)

 static void FASTCALL
PREFIX(updatePosition)(const ENCODING *enc,
                       const char *ptr,
                       const char *end,
                       POSITION *pos)
{     
  /* THIS IS NOT CORRECT
   * Yes, this is lame, but it helps to indicate a double 
   * counting the newlines and column number problem 
   */   
  static const char *s_ptr = NULL;
  static const char *s_end = NULL;
  if( (s_ptr == ptr) && (s_end == end))
    /* we already counted this one */
    return;
  s_ptr = ptr; s_end = end;

/* The rest of the function */

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-12-14 11:41

Message:
Logged In: YES 
user_id=290026

I have looked a little closer.
It seems the numbers are reported OK only if the
parser is at the end or beginning of a token, but not
in-between. Btw, the line/column number relates
to the start of the token, as far as I can tell from the source.

In your example, isFinal = 0, so the parser is in-between 
tokens and expects at least one more call to XML_Parse.
I am not sure it is OK to call the line/column number
functions in between calls to XML_Parse, since at this point
the parser is in the middle of processing a token, while
in the other situations (inside a handler, and when having an 
error) the parser knows exactly what the current token is
or is not.

One thing is sure: The documentation is not clear and 
sufficient. Also, from what I can tell, line numbers are
1-based and column numbers are 0-based. I don't think
that is a good idea. It should probably be consistent
with SAX, where both numbers are 1-based.


----------------------------------------------------------------------

Comment By: MM Zeeman (mmzeeman)
Date: 2002-12-14 11:22

Message:
Logged In: YES 
user_id=350634

When I set a start element handler the line number reporting
is fine, as long as no trailing newlines are inserted after
the last tag. When instead of a start element handler a end
element handler is set the line-number is always wrong.
Inside the handlers the line-numbers are always ok. 

I did not check the behavoir for xml_parse errors, or the
column number (yet).



----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-12-13 21:18

Message:
Logged In: YES 
user_id=290026

Does this also happen in other situations?
For instance:
- inside a handler
- when XML_Parse returns with an error?

----------------------------------------------------------------------

Comment By: MM Zeeman (mmzeeman)
Date: 2002-12-13 06:01

Message:
Logged In: YES 
user_id=350634

I forgot to add the tests:

START_TEST(test_line_number_maas)
{  
  char *text = "<tag>\n"
               "\n"
               "\n</tag>";
  int lineno;
  if (XML_Parse(parser, text, strlen(text), 0) ==
XML_STATUS_ERROR)
      xml_failure(parser);
  lineno = XML_GetCurrentLineNumber(parser);
  if (lineno != 4) {
      char buffer[100];
      sprintf(buffer, "expected 4 lines, saw %d", lineno);
      fail(buffer);
  }
}
END_TEST

START_TEST(test_column_number_maas)
{
  char *text = "<tag></tag>";
  int colno;
  if (XML_Parse(parser, text, strlen(text), 0) ==
XML_STATUS_ERROR)
            xml_failure(parser);
  colno = XML_GetCurrentColumnNumber(parser);
  if (colno != 11) {
      char buffer[100];
      sprintf(buffer, "expected 11 columns, saw %d", colno);
      fail(buffer);
  }
}
END_TEST

Added to the test suite this resulted in:

Running suite(s): basic
93%: Checks: 31, Failures: 2, Errors: 0
tests/runtests.c:342:F:basic tests: expected 4 lines, saw 7
tests/runtests.c:357:F:basic tests: expected 11 columns, saw 22

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=653180&group_id=10127



More information about the Expat-bugs mailing list