[ expat-Bugs-477667 ] illegal utf-8 seqs do not throw error

noreply@sourceforge.net noreply@sourceforge.net
Fri May 17 08:53:02 2002


Bugs item #477667, was opened at 2001-11-02 17:58
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=477667&group_id=10127

Category: None
Group: None
Status: Open
Resolution: Works For Me
>Priority: 6
Submitted By: Patrick McCormick (patrickmc)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: illegal utf-8 seqs do not throw error

Initial Comment:
I have a problem where users like to use iso-8859-1 without
declaring it in the prolog, like this:

<?xml version='1.0'?>
<rule>abécdef</rule>

expat properly defaults to utf-8 in this case.  As I understand
utf-8, the é character (0xE9 in iso-8859-1) has a bit pattern that
looks like the start of a three-byte sequence.  A 3-byte sequence is
supposed to look like this:

bytes | bits | representation
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv

the two bytes that follow it (c and d) don't match the 10vvvvvv
mask, so écd is an illegal utf-8 sequence.  But expat doesn't throw a
well-formedness error.
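
To make the bit-pattern argument concrete, here is a minimal sketch in C.
The helper names (is_lead3, is_continuation, has_3byte_shape) are made up
for illustration and are not part of expat.  It only checks the shape
1110vvvv 10vvvvvv 10vvvvvv and says nothing about overlong forms or
surrogates.

```c
/* A lead byte of the form 1110vvvv announces a 3-byte sequence. */
static int is_lead3(unsigned char b)
{
    return (b & 0xF0) == 0xE0;
}

/* A continuation byte must match the bit pattern 10vvvvvv. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: returns 1 if p[0..2] has the 3-byte shape,
   0 otherwise.  The caller must guarantee that 3 bytes are readable. */
static int has_3byte_shape(const unsigned char *p)
{
    return is_lead3(p[0]) && is_continuation(p[1]) && is_continuation(p[2]);
}
```

With the bytes 0xE9 'c' 'd' (é in iso-8859-1 followed by "cd"), is_lead3
accepts the first byte but has_3byte_shape returns 0, because 'c' (0x63)
and 'd' (0x64) both start with the bits 01.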

Expat uses this macro in xmltok.c to figure out what's illegal:

#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

but this doesn't seem strict enough.
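
A small sketch of why the macro lets the example through.  The macro is
copied as quoted above; the wrapper function utf8_invalid3 is a
hypothetical addition so it can be called like a function.  This only
exercises the macro in isolation and does not show how expat's tokenizer
actually invokes it.

```c
/* Macro as quoted above from xmltok.c. */
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

/* Hypothetical wrapper: nonzero means "invalid 3-byte sequence". */
static int utf8_invalid3(const unsigned char *p)
{
    return UTF8_INVALID3(p);
}
```

For any lead byte other than 0xED and 0xEF the macro evaluates to 0
without looking at the following bytes, so 0xE9 'c' 'd' is not flagged.
It only catches the surrogate range (0xED with bit 0x20 set in the second
byte) and the encodings of U+FFFE and U+FFFF.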

I wrote a patch that makes expat check UTF-8 sequences against
Table 3.1B of the Unicode 3.1 standard:
http://www.unicode.org/unicode/reports/tr27/
as originally clarified in this Corrigendum:
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html

and it's attached.
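
For readers without the attachment, a table-driven check could look
roughly like the sketch below.  This is not the attached patch.  The
function names are hypothetical, and the byte ranges are the well-formed
UTF-8 ranges as I understand them from the current Unicode Standard,
which may differ in detail from Table 3.1B of Unicode 3.1 (notably in
how the surrogate range under lead byte 0xED is treated).

```c
#include <stddef.h>

static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical helper: returns the length (1..4) of the well-formed
   UTF-8 sequence starting at p, or 0 if the sequence is ill-formed or
   truncated.  n is the number of readable bytes at p. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for byte 2 */
    size_t len, i;

    if (n == 0)
        return 0;
    if (p[0] <= 0x7F)
        return 1;
    if (in_range(p[0], 0xC2, 0xDF)) {
        len = 2;                          /* 0xC0, 0xC1 are overlong */
    } else if (in_range(p[0], 0xE0, 0xEF)) {
        len = 3;
        if (p[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (p[0] == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (in_range(p[0], 0xF0, 0xF4)) {
        len = 4;
        if (p[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (p[0] == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
    } else {
        return 0;                         /* stray continuation or bad lead */
    }
    if (n < len || !in_range(p[1], lo, hi))
        return 0;
    for (i = 2; i < len; i++)
        if (!in_range(p[i], 0x80, 0xBF))
            return 0;
    return len;
}
```

With this check the bytes 0xE9 'c' 'd' from the example yield 0, since
'c' is outside 0x80..0xBF.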


----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 11:52

Message:
Logged In: YES 
user_id=3066

This is strange.  Using the CVS version of Expat, the test
case (in tests/runtests.c:test_illegal_utf8) sees the error
properly reported.  xmlwf doesn't report it, however.  Are
you using the library directly or going through xmlwf?

I'll see what I can figure out.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 10:44

Message:
Logged In: YES 
user_id=290026

There is official conversion code at unicode.org.
Download the files ConvertUTF.c and ConvertUTF.h from
  ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/

and then look at the function 
  static Boolean isLegalUTF8(UTF8 *source, int length)

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 10:24

Message:
Logged In: YES 
user_id=290026

I can confirm that the current CVS does indeed not
report an error against:

<?xml version='1.0'?> 
<rule>abécdef</rule> 

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2002-05-08 17:40

Message:
Logged In: YES 
user_id=13222

I'm not happy with closing this bug report without
action. Contrary to Fred's test result, I find that the
described bug is still there (as it was at the time the
bug was reported). I've tested this with the current CVS
HEAD.

The bug is indeed easily demonstrated with the example from
the bug report. I use:

<?xml version='1.0'?>
<rule>abécdef</rule>

The third character of the PCDATA is a small e with acute,
that's 0xe9 in the iso-8859-1 char table (and the Unicode
character U+00E9), in case there is an encoding problem
through the web interface.

xmlwf passes this test file without any error report, which
is, to the best of my knowledge, wrong.

rxp and libxml (i.e. xmllint) confirm that the test file is
not proper UTF-8.

IMHO, this is a real _crucial_ bug.

Please, __Please__, re-check this.

rolf


----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-19 15:19

Message:
Logged In: YES 
user_id=3066

Added a test (tests/runtests.c revision 1.9) that shows this
bug does not exist in the CVS version.

You did not state which version of Expat you're using.

----------------------------------------------------------------------
