[ expat-Bugs-477667 ] illegal utf-8 seqs do not throw error
noreply@sourceforge.net
Fri May 17 08:53:02 2002
Bugs item #477667, was opened at 2001-11-02 17:58
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=477667&group_id=10127
Category: None
Group: None
Status: Open
Resolution: Works For Me
>Priority: 6
Submitted By: Patrick McCormick (patrickmc)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: illegal utf-8 seqs do not throw error
Initial Comment:
I have a problem where users like to use iso-8859-1 without
declaring it in the prolog, like this:
<?xml version='1.0'?>
<rule>abécdef</rule>
expat properly defaults to utf-8 in this case. As I understand
utf-8, the é character (0xE9 in iso-8859-1) has a bit pattern that
looks like the start of a three-byte sequence. A 3-byte sequence
is supposed to look like this:
bytes | bits | representation
3 | 16 | 1110vvvv 10vvvvvv 10vvvvvv
The two bytes that follow the é (c and d) don't match the 10vvvvvv
mask, so écd is an illegal utf-8 sequence. But expat doesn't throw a
well-formedness error.
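The shape test described above can be sketched in C. This is only an illustration of the bit masks, not code from expat; the helper names is_continuation and has_three_byte_shape are made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if b has the continuation form 10vvvvvv. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: nonzero if p[0..2] match the pattern
 * 1110vvvv 10vvvvvv 10vvvvvv. This checks the bit shape only; it does
 * not reject overlong forms or surrogates. */
static int has_three_byte_shape(const unsigned char *p)
{
    return (p[0] & 0xF0) == 0xE0
        && is_continuation(p[1])
        && is_continuation(p[2]);
}
```

With the bytes 0xE9 'c' 'd' from the example, the lead byte passes the 1110vvvv test, but 'c' (0x63) and 'd' (0x64) both fail the 10vvvvvv test.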
Expat uses this macro in xmltok.c to figure out what's
illegal:
#define UTF8_INVALID3(p) \
((*p) == 0xED \
? (((p)[1] & 0x20) != 0) \
: ((*p) == 0xEF \
? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
: 0))
but this doesn't seem strict enough.
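To make the gap concrete, the macro can be evaluated on the three bytes from the example. The sketch below copies the macro as quoted in this report (it may differ from other expat versions); the wrapper function macro_flags_invalid is a hypothetical name used only here.

```c
#include <stddef.h>

/* Macro as quoted in this report. It only special-cases the lead bytes
 * 0xED and 0xEF and never looks at the continuation-byte shape. */
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

/* Hypothetical wrapper so the macro can be called like a function. */
static int macro_flags_invalid(const unsigned char *p)
{
    return UTF8_INVALID3(p) ? 1 : 0;
}
```

For 0xE9 'c' 'd' the lead byte is neither 0xED nor 0xEF, so this macro on its own yields 0 (not invalid). Whether the parser catches the sequence elsewhere depends on checks outside the macro.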
I wrote a patch that makes expat check UTF-8 sequences
against Table 3.1B of the Unicode 3.1 standard:
http://www.unicode.org/unicode/reports/tr27/
as originally clarified in this Corrigendum:
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
and it's attached.
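For readers without the attachment, a table-driven check of this kind might look roughly like the sketch below. It is not the attached patch and not expat code. It assumes the byte ranges of the well-formed UTF-8 table as published in later Unicode versions, which also exclude surrogates (ED A0..BF); the function names are hypothetical.

```c
#include <stddef.h>

static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical helper: length (1..4) of the well-formed sequence that
 * starts at s, or 0 if the bytes are ill-formed or truncated. */
static size_t utf8_sequence_length(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for byte 2 */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;
    if (in_range(s[0], 0xC2, 0xDF)) {
        len = 2;
    } else if (in_range(s[0], 0xE0, 0xEF)) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* no overlong forms */
        else if (s[0] == 0xED) hi = 0x9F; /* no surrogates */
    } else if (in_range(s[0], 0xF0, 0xF4)) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* no overlong forms */
        else if (s[0] == 0xF4) hi = 0x8F; /* nothing above U+10FFFF */
    } else {
        return 0; /* 80..C1 and F5..FF never start a sequence */
    }
    if (n < len) return 0;
    if (!in_range(s[1], lo, hi)) return 0;
    for (i = 2; i < len; i++)
        if (!in_range(s[i], 0x80, 0xBF)) return 0;
    return len;
}

/* Hypothetical helper: nonzero if all n bytes form well-formed UTF-8. */
static int is_well_formed_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_sequence_length(s + i, n - i);
        if (len == 0) return 0;
        i += len;
    }
    return 1;
}
```

Applied to the PCDATA bytes of the example (a, b, 0xE9, c, d, e, f), the check fails at 0xE9 because 'c' is not in 80..BF. The same text with é encoded as C3 A9 passes.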
----------------------------------------------------------------------
>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 11:52
Message:
Logged In: YES
user_id=3066
This is strange. Using the CVS version of Expat, the test
case (in tests/runtests.c:test_illegal_utf8) sees the error
properly reported. xmlwf doesn't report it, however. Are
you using the library directly or going through xmlwf?
I'll see what I can figure out.
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 10:44
Message:
Logged In: YES
user_id=290026
There is official conversion code at unicode.org.
Download the files ConvertUTF.c and ConvertUTF.h from
ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/
and then look at the function
static Boolean isLegalUTF8(UTF8 *source, int length)
Karl
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 10:24
Message:
Logged In: YES
user_id=290026
I can confirm that the current CVS does indeed not
report an error against:
<?xml version='1.0'?>
<rule>abécdef</rule>
Karl
----------------------------------------------------------------------
Comment By: Rolf Ade (pointsman)
Date: 2002-05-08 17:40
Message:
Logged In: YES
user_id=13222
I'm not happy with closing this bug report without
action. Contrary to Fred's test result, I find that
the described bug is still there (as it was at the time the
bug was reported). I've tested this with the current CVS
HEAD.
The bug is indeed easily demonstrable with the example from
the bug report. I use:
<?xml version='1.0'?>
<rule>abécdef</rule>
The third character of the PCDATA is a small e with acute,
that's 0xe9 in the iso-8859-1 char table (and the Unicode
char 00e9), in case there is an encoding problem through the
web interface.
xmlwf passes this test file without any error report, which
is, to the best of my knowledge, wrong.
rxp and libxml (i.e. xmllint) confirm that the test file is
not proper UTF-8.
IMHO, this is a real _crucial_ bug.
Please, __Please__, re-check this.
rolf
----------------------------------------------------------------------
Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-19 15:19
Message:
Logged In: YES
user_id=3066
Added a test (tests/runtests.c revision 1.9) that shows this
bug does not exist in the CVS version.
You did not state which version of Expat you're using.
----------------------------------------------------------------------