[ expat-Bugs-477667 ] illegal utf-8 seqs do not throw error

Fri Nov 2 15:04:03 2001

Bugs item #477667, was opened at 2001-11-02 14:58
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=477667&group_id=10127

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Patrick McCormick (patrickmc)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: illegal utf-8 seqs do not throw error

Initial Comment:
Some of my users write ISO-8859-1 text without declaring the encoding in
the prolog, like this:

<?xml version='1.0'?>
<rule>abécdef</rule>

expat properly defaults to UTF-8 in this case.  As I understand UTF-8,
the é character (0xE9 in ISO-8859-1) has a bit pattern that looks like
the start of a three-byte sequence.  A three-byte sequence is supposed
to look like this:

bytes | bits | representation
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv

The two bytes that follow it (c and d) don't match the 10vvvvvv mask,
so écd is an illegal UTF-8 sequence.  But expat doesn't report a
well-formedness error.
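
To make the bit patterns concrete, here is a minimal sketch in C of the
shape check described above.  The function names are hypothetical and
are not taken from expat; the sketch only tests the 1110vvvv and
10vvvvvv masks and does no range checking.

```c
/* Hypothetical helper (not part of expat): returns 1 if b matches the
   10vvvvvv pattern required of a UTF-8 continuation byte. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: returns 1 if b matches the 1110vvvv pattern
   of a three-byte lead byte. */
int is_lead3(unsigned char b)
{
    return (b & 0xF0) == 0xE0;
}

/* Returns 1 if p[0..2] has the shape 1110vvvv 10vvvvvv 10vvvvvv.
   The caller must guarantee that three bytes are readable. */
int has_3byte_shape(const unsigned char *p)
{
    return is_lead3(p[0]) && is_continuation(p[1]) && is_continuation(p[2]);
}
```

For the bytes E9 63 64 (é, c, d), is_lead3 accepts 0xE9, but 0x63 and
0x64 both begin with the bits 01 rather than 10, so has_3byte_shape
returns 0.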

Expat uses this macro in xmltok.c to decide which three-byte sequences
are illegal:

#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

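But this doesn't seem strict enough: the macro only rejects surrogates
(0xED followed by a byte with bit 0x20 set) and the noncharacters
U+FFFE and U+FFFF.  It never checks that the trailing bytes match
10vvvvvv.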
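
As a quick illustration, the sketch below wraps the macro quoted above
(with the mail-wrapped line rejoined) in a function so it can be
called on sample bytes.  The wrapper name utf8_invalid3_quoted is
hypothetical, and the sketch assumes the macro is applied to unsigned
bytes.

```c
/* Macro as quoted in this report from xmltok.c. */
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

/* Hypothetical wrapper: returns nonzero if the quoted macro flags the
   three bytes at p as invalid.  Assumes p points at unsigned bytes. */
int utf8_invalid3_quoted(const unsigned char *p)
{
    return UTF8_INVALID3(p);
}
```

Called on the bytes E9 63 64, the wrapper returns 0, because the lead
byte is neither 0xED nor 0xEF and the macro falls through to its final
0 branch.  It does flag ED A0 80 (a surrogate) and EF BF BF (U+FFFF).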

I wrote a patch that makes expat check UTF-8 sequences against
Table 3.1B of the Unicode 3.1 standard:
http://www.unicode.org/unicode/reports/tr27/
as originally clarified in this Corrigendum:
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html

and it's attached.
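
For readers without the attachment, here is a rough sketch of what a
table-driven check for three-byte sequences could look like.  This is
not the attached patch.  It assumes the table's three-byte rows are
E0 A0..BF 80..BF and E1..EF 80..BF 80..BF (please verify against the
report linked above), and the function name utf8_3byte_in_range is
hypothetical.

```c
/* Hypothetical helper, not the attached patch: returns 1 if p[0..2]
   falls inside the assumed three-byte ranges, 0 otherwise.  The caller
   must guarantee that three bytes are readable. */
int utf8_3byte_in_range(const unsigned char *p)
{
    unsigned char lo = 0x80;      /* default lower bound for byte 2 */

    if (p[0] < 0xE0 || p[0] > 0xEF)
        return 0;                 /* not a three-byte lead byte */
    if (p[0] == 0xE0)
        lo = 0xA0;                /* E0 80..9F would be an overlong form */
    if (p[1] < lo || p[1] > 0xBF)
        return 0;                 /* second byte out of range */
    if (p[2] < 0x80 || p[2] > 0xBF)
        return 0;                 /* third byte is not 10vvvvvv */
    return 1;
}
```

With this check, E9 63 64 is rejected at the second byte.  A range
check of this kind would sit alongside the existing surrogate and
U+FFFE/U+FFFF tests rather than replace them.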

----------------------------------------------------------------------
