[ expat-Bugs-477667 ] illegal utf-8 seqs do not throw error

noreply@sourceforge.net noreply@sourceforge.net
Fri May 17 12:25:07 2002


Bugs item #477667, was opened at 2001-11-02 17:58
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=477667&group_id=10127

Category: None
Group: None
Status: Open
Resolution: Works For Me
Priority: 6
Submitted By: Patrick McCormick (patrickmc)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: illegal utf-8 seqs do not throw error

Initial Comment:
I have a problem where users like to use iso-8859-1 without
declaring it in the prolog, like this:

<?xml version='1.0'?>
<rule>abécdef</rule>

expat properly defaults to utf-8 in this case.  As I understand
utf-8, the é character (0xE9 in iso-8859-1) has a bit pattern that
looks like the start of a three-byte sequence.  A 3-byte sequence is
supposed to look like this:

bytes | bits | representation
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv

the two bytes that follow (c and d) don't match the 10vvvvvv mask,
so écd is an illegal utf-8 sequence.  But expat doesn't throw a
well-formedness error.
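
For illustration, the bit-pattern check described above can be sketched in C. This is a minimal sketch, not Expat code: the helper names is_utf8_continuation and has_3byte_shape are made up for this example, and it assumes é arrives as the single iso-8859-1 byte 0xE9.

```c
/* Hypothetical helper: a continuation byte has the form 10vvvvvv. */
static int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: do the three bytes at p have the shape
   1110vvvv 10vvvvvv 10vvvvvv?  Only the bit patterns are tested;
   overlong forms and other restrictions are not considered here. */
static int has_3byte_shape(const unsigned char *p)
{
    return (p[0] & 0xF0) == 0xE0
        && is_utf8_continuation(p[1])
        && is_utf8_continuation(p[2]);
}
```

Called on the bytes 0xE9 'c' 'd', has_3byte_shape returns 0, because 'c' (0x63, binary 01100011) does not start with the bits 10.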

Expat uses this macro in xmltok.c to figure out what's illegal:

#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

but this doesn't seem strict enough.

I wrote a patch that makes expat check UTF-8 sequences against
Table 3.1B of the Unicode 3.1 standard:
http://www.unicode.org/unicode/reports/tr27/
as originally clarified in this Corrigendum:
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html

and it's attached.
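
To make the idea of a table-driven check concrete, here is a rough C sketch of a validator built on byte ranges of the kind such a table lists. It is not the attached patch and not Expat code; the function names utf8_seq_len and utf8_is_well_formed are invented for this example. The ED 80..9F restriction, which excludes surrogates, follows later versions of the Unicode standard and is an assumption here rather than something taken from Table 3.1B.

```c
#include <stddef.h>

/* Returns the length (1..4) of the well-formed UTF-8 sequence at s,
   or 0 if the bytes at s (n of them available) are ill-formed or
   truncated.  Illustrative only. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;       /* rejects overlong forms */
        else if (s[0] == 0xED) hi = 0x9F;  /* rejects surrogates (assumption) */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;       /* rejects overlong forms */
        else if (s[0] == 0xF4) hi = 0x8F;  /* nothing above U+10FFFF */
    } else {
        return 0;  /* 80..C1 and F5..FF never start a sequence */
    }

    if (n < len) return 0;
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Returns 1 if the whole buffer is well-formed UTF-8, else 0. */
static int utf8_is_well_formed(const unsigned char *s, size_t n)
{
    while (n > 0) {
        size_t len = utf8_seq_len(s, n);
        if (len == 0) return 0;
        s += len;
        n -= len;
    }
    return 1;
}
```

With this sketch, the element content from the example above (a b 0xE9 c d e f) is rejected at the 0xE9 byte, while the same text with é encoded as 0xC3 0xA9 is accepted.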


----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 15:24

Message:
Logged In: YES 
user_id=3066

Aargh!  This is why I hate token pasting!  "grep" doesn't
like it either.

It gets glued in in four "struct normal_encoding" structures
statically defined, starting with "utf8_encoding_ns".

Ok, I'll keep digging.

----------------------------------------------------------------------

Comment By: Patrick McCormick (patrickmc)
Date: 2002-05-17 15:18

Message:
Logged In: YES 
user_id=363812

not referenced?  sure it is!  you have to tap into the 
crazy zen of expat's vtables-without-C++.

look at the struct utf8_encoding.  at the bottom, it uses 
the macro NORMAL_VTABLE(utf8_), which creates a struct 
entry "utf8_invalid3".

the macro IS_INVALID_CHAR turns into a function call to the 
appropriate utf8_invalidN struct member.  at some point the 
struct members are hooked up to the functions, but I'm not 
sure where.
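
The general technique being described (a table of function pointers filled in and dispatched through token pasting) can be sketched as follows. Every name here (demo_encoding, DEMO_VTABLE, DEMO_IS_INVALID_CHAR and the demo_ functions) is a simplified, hypothetical stand-in and not Expat's actual definition; the sketch only shows why a plain text search for the full function name finds no caller.

```c
/* Simplified, hypothetical stand-ins; not Expat's definitions. */
struct demo_encoding {
    int (*isInvalid2)(const unsigned char *p);
    int (*isInvalid3)(const unsigned char *p);
};

static int demo_isInvalid2(const unsigned char *p)
{
    return (p[1] & 0xC0) != 0x80;
}

static int demo_isInvalid3(const unsigned char *p)
{
    return (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80;
}

/* Token pasting builds the initializer list from a prefix, so the full
   name "demo_isInvalid3" never appears where the table is filled in. */
#define DEMO_VTABLE(E) { E ## isInvalid2, E ## isInvalid3 }

static const struct demo_encoding demo_utf8 = DEMO_VTABLE(demo_);

/* Pasting again picks the struct member from the sequence length n,
   so the call site does not spell out the member name either. */
#define DEMO_IS_INVALID_CHAR(enc, p, n) ((enc)->isInvalid ## n(p))
```

A call such as DEMO_IS_INVALID_CHAR(&demo_utf8, p, 3) expands to (&demo_utf8)->isInvalid3(p), which reaches demo_isInvalid3 through the table even though that identifier is only ever written out at its definition.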


----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 13:58

Message:
Logged In: YES 
user_id=3066

It's not just that the UTF8_INVALID3() macro is wrong, but
that it isn't used at all!  The macro is referenced from
utf8_isInvalid3(), but that function is not referenced.  ;-(

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 12:31

Message:
Logged In: YES 
user_id=3066

Ok, I've found a bug in the test case (re-using the parser
without resetting it); I've fixed that in my copy and can
now reproduce the error.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-17 12:20

Message:
Logged In: YES 
user_id=290026

I am using the library directly - with my own code.

Karl

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 11:52

Message:
Logged In: YES 
user_id=3066

This is strange.  Using the CVS version of Expat, the test
case (in tests/runtests.c:test_illegal_utf8) sees the error
properly reported.  xmlwf doesn't report it, however.  Are
you using the library directly or going through xmlwf?

I'll see what I can figure out.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 10:44

Message:
Logged In: YES 
user_id=290026

There is official conversion code at unicode.org.
Download the files ConvertUTF.c and ConvertUTF.h from
  ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/

and then look at the function 
  static Boolean isLegalUTF8(UTF8 *source, int length)

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 10:24

Message:
Logged In: YES 
user_id=290026

I can confirm that the current CVS does indeed not
report an error against:

<?xml version='1.0'?> 
<rule>abécdef</rule> 

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2002-05-08 17:40

Message:
Logged In: YES 
user_id=13222

I'm not happy with closing this bug report without
action. Contrary to Fred's test result, I find that the
described bug is still there (as it was at the time the
bug was reported). I've tested this with the current CVS
HEAD.

The bug is indeed easily demonstrable with the example from
the bug report. I use:

<?xml version='1.0'?>
<rule>abécdef</rule>

The third character of the PCDATA is a small e with acute,
that's 0xe9 in the iso-8859-1 char table (and the unicode
char 00e9), in case there is an encoding problem through the
web interface.

xmlwf passes this test file without any error report, which
is, to the best of my knowledge, wrong.

rxp and libxml (i.e. xmllint) confirm that the test file is
not proper UTF-8.

IMHO, this is a really _crucial_ bug.

Please, __Please__, re-check this.

rolf


----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-19 15:19

Message:
Logged In: YES 
user_id=3066

Added a test (tests/runtests.c revision 1.9) that shows this
bug does not exist in the CVS version.

You did not state which version of Expat you're using.

----------------------------------------------------------------------
