[ expat-Bugs-566240 ] UTF-8 char handling still broken(1.95.3)
noreply@sourceforge.net
Sun Jun 9 07:04:04 2002
Bugs item #566240, was opened at 2002-06-08 13:31
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=566240&group_id=10127
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8 char handling still broken(1.95.3)
Initial Comment:
The changes made to fix bug 477667 seem to
have also broken some things.
I've run all the not-well-formed tests
(which should all raise an error) and all the valid
tests (which should not raise any error)
of the OASIS XML test suite, Version 2.
I found that in these tests
xmltest/not-wf/sa/166.xml
xmltest/not-wf/sa/167.xml
xmltest/not-wf/sa/171.xml
xmltest/not-wf/sa/172.xml
xmltest/not-wf/sa/173.xml
xmltest/not-wf/sa/174.xml
xmltest/not-wf/sa/175.xml
xmltest/not-wf/sa/177.xml
ibm/not-wf/P02/ibm02n32.xml
ibm/not-wf/P02/ibm02n33.xml
an invalid UTF-8 char isn't reported as an error.
In this test:
ibm/valid/ibm02v01.xml
expat reports an error for a valid UTF-8 char.
rolf
----------------------------------------------------------------------
>Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 10:03
Message:
Fix checked in.
Please test CVS rev. 1.17 of xmltok.c.
Karl
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-08 22:51
Message:
Looking at the spec, it seems that there are in fact
additional restrictions:
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
This means we have to re-visit the UTF-8 fix.
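To make the production above concrete, here is a minimal sketch of a code-point check in C. The function name is_xml_char is hypothetical and is not taken from Expat; it only restates the ranges of production [2] and says nothing about how xmltok.c implements the check.

```c
/* Hypothetical helper, not part of Expat: returns 1 if the code point
   matches production [2] Char as quoted above, otherwise 0. */
static int is_xml_char(unsigned long cp)
{
    if (cp == 0x9 || cp == 0xA || cp == 0xD)
        return 1;                       /* tab, line feed, carriage return */
    if (cp >= 0x20 && cp <= 0xD7FF)
        return 1;                       /* up to just below the surrogates */
    if (cp >= 0xE000 && cp <= 0xFFFD)
        return 1;                       /* excludes FFFE and FFFF */
    if (cp >= 0x10000 && cp <= 0x10FFFF)
        return 1;                       /* supplementary planes */
    return 0;
}
```

Under this check, FFFE and FFFF are rejected even though their UTF-8 encodings are structurally valid byte sequences.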
Karl
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-08 21:15
Message:
I looked at these test cases, and checked them against
Table 3.1B in Unicode 3.2 - have a look at
<http://www.unicode.org/unicode/reports/tr28/>
First, let's deal with James Clark's test cases:
The docs state that they test if the invalid character FFFF
(or FFFE for test case not-wf-sa-167) is present. This
would map to the sequence EF BF BF (or EF BF BE).
Now, the sequences in question are indeed present, but
they are actually valid UTF-8!
So, where does it say that they are not valid in XML?
XMLSpy accepts these test cases as well-formed, btw.
The same then applies to the IBM test cases:
ibm02n32.xml tests for FFFE and ibm02n33.xml
tests for FFFF. Same question as above - valid UTF-8,
but invalid XML?
About the last test case, file ibm/valid/P02/ibm02v01.xml:
It contains the sequence F0 90 80 5F, which is an illegal
UTF-8 sequence according to Table 3.1B in Unicode 3.2
(the lead byte F0 requires three trailing bytes, and 5F is not one).
So, as far as I can tell Expat is correct in how
it checks the UTF-8 sequences, but I am not sure
if XML imposes further restrictions on them.
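As a rough illustration of the purely structural UTF-8 check discussed here, the sketch below validates one sequence against the well-formed byte ranges as I understand them from the Unicode standard (assumed to match Table 3.1B). The name utf8_seq_len is hypothetical and this is not Expat's code.

```c
#include <stddef.h>

/* Hypothetical helper, not Expat code: returns the length (1 to 4) of the
   well-formed UTF-8 sequence starting at s, or 0 if the bytes are
   ill-formed or truncated. n is the number of bytes available. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for 2nd byte */
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* reject values above 10FFFF */
    } else {
        return 0;                         /* 80..C1 and F5..FF never lead */
    }

    if (n < len)
        return 0;                         /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;                     /* not a trailing byte */
    return len;
}
```

With this sketch, EF BF BF and EF BF BE are accepted as three-byte sequences, while F0 90 80 5F is rejected because 5F falls outside the trailing-byte range 80..BF.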
Anybody care to comment?
Karl
----------------------------------------------------------------------