[ expat-Bugs-477667 ] illegal utf-8 seqs do not throw error
noreply@sourceforge.net
Fri Nov 2 15:04:03 2001
Bugs item #477667, was opened at 2001-11-02 14:58
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=477667&group_id=10127
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Patrick McCormick (patrickmc)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: illegal utf-8 seqs do not throw error
Initial Comment:
I have a problem where users like to use iso-8859-1 without declaring it in
the prolog, like this:
<?xml version='1.0'?>
<rule>abécdef</rule>
expat properly defaults to utf-8 in this case. As I understand utf-8, the
é character (0xE9 in iso-8859-1) has a bit pattern that looks like the start
of a three-byte sequence. A 3-byte sequence is supposed to look like this:
bytes | bits | representation
3 | 16 | 1110vvvv 10vvvvvv 10vvvvvv
The two bytes that follow it (c and d) don't match the 10vvvvvv mask, so écd
is an illegal utf-8 sequence. But expat doesn't throw a well-formedness error.
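
To make the bit-pattern argument concrete, here is a minimal sketch in C. The helper names are hypothetical and are not part of expat. It checks only the shape 1110vvvv 10vvvvvv 10vvvvvv and nothing else (no overlong or surrogate checks).

```c
/* Hypothetical helper, not part of expat: returns 1 if b is a UTF-8
   continuation byte, i.e. matches the bit pattern 10vvvvvv. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: returns 1 if p[0..2] has the shape
   1110vvvv 10vvvvvv 10vvvvvv. This tests the shape only; it does not
   reject overlong forms or surrogates. */
static int has_three_byte_shape(const unsigned char *p)
{
    return (p[0] & 0xF0) == 0xE0
        && is_continuation(p[1])
        && is_continuation(p[2]);
}
```

Called on the bytes 0xE9 'c' 'd' (é in iso-8859-1 followed by "cd"), has_three_byte_shape returns 0, because 'c' (0x63) and 'd' (0x64) both start with the bits 01. Called on 0xE2 0x82 0xAC (U+20AC in utf-8), it returns 1.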
Expat uses this macro in xmltok.c to figure out what's illegal:
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
   ? (((p)[1] & 0x20) != 0) \
   : ((*p) == 0xEF \
      ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
      : 0))
but this doesn't seem strict enough.
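
A small sketch of why the macro, taken in isolation, lets the example through. The wrapper function is hypothetical and exists only so the macro can be called from a test. Whether other code in xmltok.c inspects the trailing bytes is not shown in this report, so this illustrates the macro alone.

```c
/* The macro as quoted above, with the wrapped line rejoined. */
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
   ? (((p)[1] & 0x20) != 0) \
   : ((*p) == 0xEF \
      ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
      : 0))

/* Hypothetical wrapper, not part of expat: returns 1 if the macro
   flags the three bytes at p as invalid, 0 otherwise. */
static int macro_flags_invalid(const unsigned char *p)
{
    return UTF8_INVALID3(p) ? 1 : 0;
}
```

For the bytes 0xE9 'c' 'd' the lead byte is neither 0xED nor 0xEF, so the macro evaluates to 0 without looking at the second and third bytes. It only flags surrogates (lead byte 0xED with a second byte of 0xA0 or above) and the encodings of U+FFFE and U+FFFF.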
I wrote a patch that makes expat check UTF-8 sequences against Table 3.1B
of the Unicode 3.1 standard:
http://www.unicode.org/unicode/reports/tr27/
as originally clarified in this Corrigendum:
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
and it's attached.
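
The attached patch is not reproduced in this message. As a rough illustration of what a table-driven check can look like, here is a separate sketch in C. The function names are hypothetical, and the byte ranges are an assumption: they follow the well-formed UTF-8 ranges as given in later versions of the Unicode standard, which also exclude surrogates (lead byte 0xED with a second byte above 0x9F). They may differ in detail from Table 3.1B and from the patch.

```c
#include <stddef.h>

/* Hypothetical helper, not from the patch: 1 if lo <= b <= hi. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical function: returns the length (1 to 4) of the well-formed
   UTF-8 sequence starting at p, or 0 if the bytes are ill-formed or
   fewer than the required number of bytes remain (n is the byte count). */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    if (p[0] <= 0x7F) return 1;
    if (in_range(p[0], 0xC2, 0xDF)) {
        len = 2;
    } else if (in_range(p[0], 0xE0, 0xEF)) {
        len = 3;
        if (p[0] == 0xE0) lo = 0xA0;       /* rejects overlong 3-byte forms */
        else if (p[0] == 0xED) hi = 0x9F;  /* rejects surrogates */
    } else if (in_range(p[0], 0xF0, 0xF4)) {
        len = 4;
        if (p[0] == 0xF0) lo = 0x90;       /* rejects overlong 4-byte forms */
        else if (p[0] == 0xF4) hi = 0x8F;  /* nothing above U+10FFFF */
    } else {
        return 0;  /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
    }

    if (n < len) return 0;
    if (!in_range(p[1], lo, hi)) return 0;
    for (i = 2; i < len; i++)
        if (!in_range(p[i], 0x80, 0xBF)) return 0;
    return len;
}
```

With this sketch, the bytes 0xE9 'c' 'd' yield 0 (ill-formed), while 0xC3 0xA9 (é in utf-8) yields 2. It checks utf-8 well-formedness only. The U+FFFE and U+FFFF test in the quoted macro would remain a separate check.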
----------------------------------------------------------------------