[ expat-Bugs-566240 ] UTF-8 char handling still broken(1.95.3)

Mon Jun 10 20:03:02 2002

Bugs item #566240, was opened at 2002-06-08 17:31
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=566240&group_id=10127

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8 char handling still broken(1.95.3)

Initial Comment:

The changes made to fix bug 477667 seem to
have also broken some other things.

I've run all not-well-formedness tests
(which should all raise an error) and all
valid tests (which should raise no error)
of the OASIS XML test suite, Version 2.

I found that in these tests

xmltest/not-wf/sa/166.xml
xmltest/not-wf/sa/167.xml
xmltest/not-wf/sa/171.xml
xmltest/not-wf/sa/172.xml
xmltest/not-wf/sa/173.xml
xmltest/not-wf/sa/174.xml
xmltest/not-wf/sa/175.xml
xmltest/not-wf/sa/177.xml
ibm/not-wf/P02/ibm02n32.xml
ibm/not-wf/P02/ibm02n33.xml

an invalid UTF-8 char isn't reported as an error.

In this test:

ibm/valid/ibm02v01.xml 

expat reports an error for a valid UTF-8 char.

rolf

----------------------------------------------------------------------

>Comment By: Rolf Ade (pointsman)
Date: 2002-06-11 03:02

Message:
Logged In: YES 
user_id=13222

(You're of course right. I should have
distinguished between legal UTF-8
chars and legal XML PCDATA chars. I
confess I still have to look up the
related parts of the notorious
specs again. At the moment, I am only
mechanically banging it against the
OASIS suite and reporting the strange
(i.e. new) things. Sorry for omitting
a deeper analysis.)

Better now, according to the OASIS test
suite.

Only 

ibm/valid/ibm02v01.xml 

still seems to be wrong. Expat reports
"invalid token", while the test suite
says that this is valid XML.

rolf

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 14:03

Message:
Logged In: YES 
user_id=290026

Fix checked in. 
Please test CVS rev. 1.17 of xmltok.c.

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 02:51

Message:
Logged In: YES 
user_id=290026

Looking at the spec, it seems that there are in fact
additional restrictions:

Character Range
[2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
                      | [#x10000-#x10FFFF]
       /* any Unicode character, excluding the surrogate blocks,
          FFFE, and FFFF. */

This means we have to re-visit the UTF-8 fix.

Karl
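
The Char production quoted above can be sketched as a plain range check
on an already decoded code point. This is only an illustration: the name
is_xml_char is hypothetical and is not taken from xmltok.c, and the sketch
assumes the caller has already decoded the UTF-8 bytes into a code point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of Expat: returns true if the code point
 * matches production [2] Char of the XML 1.0 spec as quoted above. */
bool is_xml_char(uint32_t cp)
{
    if (cp == 0x9 || cp == 0xA || cp == 0xD)
        return true;                      /* tab, line feed, carriage return */
    if (cp >= 0x20 && cp <= 0xD7FF)
        return true;                      /* everything below the surrogates */
    if (cp >= 0xE000 && cp <= 0xFFFD)
        return true;                      /* excludes FFFE and FFFF */
    return cp >= 0x10000 && cp <= 0x10FFFF;
}
```

With this check, FFFE and FFFF are rejected even though their UTF-8
encodings are well-formed, which matches what the not-wf test cases expect.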

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 01:15

Message:
Logged In: YES 
user_id=290026

I looked at these test cases, and checked them against 
Table 3.1B in  Unicode 3.2 - have a look at 
<http://www.unicode.org/unicode/reports/tr28/>

First, let's deal with James Clark's test cases:
The docs state that they test if the invalid character FFFF
(or FFFE for test case not-wf-sa-167) is present. This
would map to the sequence EF BF BF (or EF BF BE).

Now, the sequences in question are indeed present, but
they are actually valid UTF-8!
So, where does it say that they are not valid in XML?
XMLSpy accepts these test cases as well-formed, btw.

The same then applies to the IBM test cases:
ibm02n32.xml tests for FFFE and ibm02n33.xml
tests for FFFF. Same question as above - valid UTF-8,
but invalid XML?

About the last test case, file ibm/valid/P02/ibm02v01.xml:
It contains the sequence F0 90 80 5F, which is an illegal
UTF-8 sequence according to Table 3.1B in Unicode 3.2.

So, as far as I can tell Expat is correct in how
it checks the UTF-8 sequences, but I am not sure
if XML imposes further restrictions on them.

Anybody care to comment?

Karl
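
The byte-level reasoning above can be sketched as a small decoder. This is
a hedged illustration only, not the code in xmltok.c: the name utf8_decode
is hypothetical, and the byte ranges follow the well-formed UTF-8 table of
the Unicode standard as I understand it (narrowed second byte after E0, ED,
F0 and F4), not a verbatim copy of Table 3.1B.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Expat. Decodes one UTF-8 sequence from
 * buf[0..len). Returns the number of bytes consumed (1 to 4) and stores the
 * code point in *cp, or returns 0 if the sequence is ill-formed or cut off. */
size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    unsigned char b0, lo = 0x80, hi = 0xBF;  /* allowed range, second byte */
    size_t need, i;
    uint32_t value;

    if (len == 0)
        return 0;
    b0 = buf[0];
    if (b0 <= 0x7F) {
        *cp = b0;                            /* single-byte (ASCII) case */
        return 1;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1; value = b0 & 0x1Fu;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2; value = b0 & 0x0Fu;
        if (b0 == 0xE0) lo = 0xA0;           /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;           /* reject surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3; value = b0 & 0x07u;
        if (b0 == 0xF0) lo = 0x90;           /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;           /* stay at or below 10FFFF */
    } else {
        return 0;                            /* 80..C1 and F5..FF never lead */
    }

    if (len < need + 1)
        return 0;                            /* truncated sequence */
    for (i = 1; i <= need; i++) {
        if (buf[i] < lo || buf[i] > hi)
            return 0;                        /* bad trailing byte */
        value = (value << 6) | (buf[i] & 0x3Fu);
        lo = 0x80; hi = 0xBF;                /* only byte 2 is narrowed */
    }
    *cp = value;
    return need + 1;
}
```

Under these assumptions, EF BF BF decodes to FFFF and EF BF BE to FFFE, so
both are well-formed at the UTF-8 level, while F0 90 80 5F is rejected
because 5F is not a valid trailing byte. Whether a decoded character is
also allowed in XML is a separate question.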

----------------------------------------------------------------------
