[Expat-bugs] Problem with decoding UTF-8 triplet and expat 1.95.4

Tim Crook tim.crook@adobe.com
Mon, 26 Aug 2002 17:51:25 -0400


On Windows, when expat reads the UTF-8 sequence "EF BA BF", utf8_isInvalid3
returns TRUE when it should return FALSE. This sequence is well-formed UTF-8
and decodes to U+FEBF ("FEBF" in UCS-2), but because utf8_isInvalid3 returns
TRUE, an error results and the character isn't decoded.
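
For reference, here is a minimal sketch of the arithmetic behind that claim.
It is not expat's code: decode_utf8_3 and is_continuation are hypothetical
helpers written only for illustration, and the validity rules are the generic
UTF-8 ones plus the XML 1.0 exclusion of U+FFFE and U+FFFF.

```c
#include <stdint.h>

/* Returns 1 if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Decode one 3-byte UTF-8 sequence starting at p.
 * Returns the code point, or -1 if the sequence should be rejected.
 * Hypothetical helper for illustration; NOT expat's utf8_isInvalid3.
 */
long decode_utf8_3(const uint8_t *p)
{
    long cp;

    if ((p[0] & 0xF0) != 0xE0)          /* lead byte must be 1110xxxx */
        return -1;
    if (!is_continuation(p[1]) || !is_continuation(p[2]))
        return -1;

    /* 4 payload bits from the lead byte, 6 from each continuation byte */
    cp = ((long)(p[0] & 0x0F) << 12)
       | ((long)(p[1] & 0x3F) << 6)
       |  (long)(p[2] & 0x3F);

    if (cp < 0x800)                     /* overlong encoding */
        return -1;
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* UTF-16 surrogate range */
        return -1;
    if (cp == 0xFFFE || cp == 0xFFFF)   /* not legal XML 1.0 characters */
        return -1;
    return cp;
}
```

For EF BA BF the payload bits are 0xF, 0x3A and 0x3F, so the result is
0xF000 | 0x0E80 | 0x003F = 0xFEBF. None of the rejection cases applies to
that value.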

Here is a simple XML file which illustrates the problem:

<?xml version="1.0" encoding="UTF-8" ?> 
<test> 
 <ARABIC_LETTER_DAD_INITIAL_FORM>xxx</ARABIC_LETTER_DAD_INITIAL_FORM> 
</test>

To see the problem, replace xxx with the three raw bytes EF BA BF (the UTF-8
encoding of U+FEBF).
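
Since many editors make it awkward to type those bytes directly, one way to
produce the test file is to build it in code. The sketch below is only an
illustration; build_test_xml is a hypothetical helper name, not part of expat.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the test document into buf, with the raw bytes EF BA BF as the
 * element content. Returns the number of bytes written (not counting the
 * terminating NUL), or -1 if buf is too small.
 * Hypothetical helper for illustration only.
 */
int build_test_xml(char *buf, size_t size)
{
    /* Adjacent string literals keep the hex escapes from running into
       the characters that follow them. */
    static const char doc[] =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
        "<test>\n"
        " <ARABIC_LETTER_DAD_INITIAL_FORM>"
        "\xEF\xBA\xBF"
        "</ARABIC_LETTER_DAD_INITIAL_FORM>\n"
        "</test>\n";
    size_t len = sizeof doc - 1;

    if (len >= size)
        return -1;
    memcpy(buf, doc, len + 1);   /* include the terminating NUL */
    return (int)len;
}
```

The buffer can then be written out with fwrite to a file opened in binary
mode ("wb") and passed to the parser.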
_________________________________________
Tim Crook
Computer Scientist

Adobe Systems Canada Inc.
785 Carling Avenue
Ottawa, Ontario
Canada  K1S 5H4

Phone: +1 613.751.4800 Ext 5734
Fax: +1 613.594.8886
E-mail: tim.crook@adobe.com