[ expat-Bugs-514281 ] french accents errors?

Mon Apr 15 20:07:02 2002

Bugs item #514281, was opened at 2002-02-07 09:45
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=514281&group_id=10127

Category: None
Group: Not a Bug
Status: Open
Resolution: None
Priority: 5
Submitted By: Vincent Fortier (fortierv)
Assigned to: Nobody/Anonymous (nobody)
Summary: french accents errors?

Initial Comment:
I'm getting errors with French accents in ISO-8859-1 and UTF-8 files,
but only at a specific place.

Example:
<CLIENT>
   <COMPAGNIE>Ville Charlesbourg</COMPAGNIE>
   <CONTACT>
      <NOM>Martin Labbé</NOM>
      <COMM Type='Phone'>(xxx)xxx</COMM>
   </CONTACT>
</CLIENT>

It always stops at <NOM>Martin Labbé</NOM>. If I change the "é" to an
"e" it works, and if I add 2 chars after the "é", as in
<NOM>Martin LabbéXX</NOM>, it also works. But it always stops when the
text ends with "é" or "éX".

I've tried a few patches: one for expat.h and xmlparse.c with Unicode
changes (#476931, #464837), and another for xmltok.c (#477667), but
with no success. I always get the error message "mismatched tag at
line xx".

I have a lot of "éèàçêÈ" everywhere, but it only seems to stop when one
of them is placed right at the end of an element's text, just before
the closing tag.

Can anybody help me out?

- vin

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-15 23:06

Message:
Logged In: YES 
user_id=3066

Confirmed, mostly.  With the current CVS, I can't seem to
get things to work by adding "XX" before the tag.  I'm also
seeing a different error: "not well-formed (invalid token)".

I've attached a patch that adds tests to the test suite.

----------------------------------------------------------------------

Comment By: John Dawson (jdawson)
Date: 2002-03-26 16:54

Message:
Logged In: YES 
user_id=30450

I have encountered a similar problem where I work, doing 
German.  My code was in Perl, and used XML::Parser and 
XML::Parser::Expat, but I believe that Perl is not relevant 
to this issue.  I was using libexpat 1.95.1.

I got things to work by doing the following:

1)  Give the input XML document an encoding declaration of
iso-8859-1.  This keeps libexpat from interpreting the accented
Latin-1 characters (like the French accents) as part of a
multi-byte UTF-8 sequence.  That's what the problem was; it
interpreted the '<' character following the accent as the second
byte of the same character.

2)  Change your code that uses libexpat to convert the data 
from UTF-8 back into ISO-8859-1.  Why do this?  It seems that
libexpat normalizes its output to UTF-8, regardless of what the
original character encoding was.  The Perl regex I used to do
this transformation:

$s =~ s/([\xC0-\xDF])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

If you can't read Perl, you can look at the latin1TOutf8.c 
program, attached below, and do the inverse of what the C 
code there does.  It's quite trivial.
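
For readers who don't read Perl, here is a rough Python sketch of the
same transformation as the regex above. It is not the attached C
program; the names utf8_to_latin1 and collapse are made up for this
illustration. Like the regex, it masks rather than rejects lead bytes
above 0xC3, so it is only meaningful for text that fits in ISO-8859-1.

```python
import re

# A two-byte UTF-8 sequence: lead byte 0xC0-0xDF followed by one
# continuation byte 0x80-0xBF (the same ranges as the Perl regex).
_TWO_BYTE = re.compile(rb"([\xC0-\xDF])([\x80-\xBF])")


def utf8_to_latin1(data: bytes) -> bytes:
    """Collapse two-byte UTF-8 sequences into single ISO-8859-1 bytes."""

    def collapse(match):
        lead = match.group(1)[0]
        cont = match.group(2)[0]
        # The low 2 bits of the lead byte become the top 2 bits of the
        # result; the low 6 bits of the continuation byte fill the rest.
        return bytes([((lead << 6) & 0xC0) | (cont & 0x3F)])

    return _TWO_BYTE.sub(collapse, data)


print(utf8_to_latin1("Martin Labbé".encode("utf-8")))
```

In Python itself, data.decode("utf-8").encode("iso-8859-1") does the
same job and raises an error on characters outside ISO-8859-1 instead
of silently masking them.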

So, this worked for me.

A final point I'd like to make is that part of the issue 
here is documentation and expectation management.  It's 
probably a good thing that libexpat converted the input 
into UTF-8.  The only reason why this didn't work out very 
well for me was that my code is part of a much larger 
system, and the other components I interact with deal 
exclusively with ISO-8859-1.  Thus, the path of least 
resistance was to do the conversion.
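
The two steps can be tried out with the expat bindings in Python's
standard library (xml.parsers.expat). This is only a sketch under
assumptions: it runs against whatever expat version Python bundles,
not 1.95.1, and the helper collect_text is a hypothetical name. The
Python bindings hand back decoded strings rather than the UTF-8 bytes
the C API delivers by default, so step 2 becomes a plain encode call.

```python
import xml.parsers.expat


def collect_text(document: bytes) -> str:
    """Parse `document` and return all of its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)  # True: this is the final chunk
    return "".join(chunks)


body = "<NOM>Martin Labbé</NOM>".encode("iso-8859-1")

# With no declaration the parser assumes UTF-8, so the lone 0xE9 byte
# right before '<' is not a valid sequence and parsing fails.
try:
    collect_text(body)
    undeclared_ok = True
except xml.parsers.expat.ExpatError:
    undeclared_ok = False

# Step 1: declare the real encoding of the input.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body
text = collect_text(declared)

# Step 2: convert what the parser returned back to ISO-8859-1 for the
# components that only understand that encoding.
latin1_bytes = text.encode("iso-8859-1")
```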

----------------------------------------------------------------------

Comment By: Vincent Fortier (fortierv)
Date: 2002-02-07 13:37

Message:
Logged In: YES 
user_id=451869

I've found a way to solve my problem. The problem was actually with the
ISO-8859-1 format causing errors with "é" at really specific places. I
found a converter in C which transforms a text file to UTF-8, and then
my problems were over (see attachment).

Here is the web site where I got it:
http://developer.iplanet.com/tech/directory/utf8ltn1.html
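
For reference, the kind of conversion described above is small enough
to sketch in a few lines of Python. This is not the attached converter
or the code from the linked page, only an illustration of the same
idea; the function name latin1_to_utf8 is made up here.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            # ASCII is identical in both encodings.
            out.append(byte)
        else:
            # 0x80-0xFF becomes a two-byte sequence: the top 2 bits go
            # into the lead byte, the low 6 bits into the continuation.
            out.append(0xC0 | (byte >> 6))
            out.append(0x80 | (byte & 0x3F))
    return bytes(out)


print(latin1_to_utf8(b"Martin Labb\xe9"))
```

Python's built-in data.decode("iso-8859-1").encode("utf-8") gives the
same result; the loop is spelled out only to show the bit layout.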

----------------------------------------------------------------------

Comment By: Vincent Fortier (fortierv)
Date: 2002-02-07 10:40

Message:
Logged In: YES 
user_id=451869

Note: BTW, xmlwf works fine on the file, so it should be well-formed
in ISO-8859-1 format, and the value of "é" is E9 in hex and 351 in
octal (checked with an octal dump).

thnx.
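
The byte values quoted above are easy to double-check. A small Python
sketch (the variable names are arbitrary) shows "é" in both encodings:

```python
e_acute = "é"

latin1 = e_acute.encode("iso-8859-1")  # one byte
utf8 = e_acute.encode("utf-8")         # two bytes

# Hex and octal value of the single ISO-8859-1 byte, then the UTF-8 bytes.
print(latin1.hex(), oct(latin1[0]), utf8.hex())
```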

----------------------------------------------------------------------
