[ expat-Bugs-514281 ] french accents errors?
noreply@sourceforge.net
Mon Apr 15 20:07:02 2002
Bugs item #514281, was opened at 2002-02-07 09:45
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=514281&group_id=10127
Category: None
Group: Not a Bug
Status: Open
Resolution: None
Priority: 5
Submitted By: Vincent Fortier (fortierv)
Assigned to: Nobody/Anonymous (nobody)
Summary: french accents errors?
Initial Comment:
I'm having errors with French accents in ISO-8859-1 and UTF-8 format,
but only at a specific place.
Example:
<CLIENT>
<COMPAGNIE>Ville Charlesbourg</COMPAGNIE>
<CONTACT>
<NOM>Martin Labbé</NOM>
<COMM Type='Phone'>(xxx)xxx</COMM>
</CONTACT>
</CLIENT>
It always stops at <NOM>Martin Labbé</NOM>. If I change the "é" to an
"e", it works. If I simply add 2 chars after the "é", as in
<NOM>Martin LabbéXX</NOM>, it also works. But it always stops when the
text ends with "é" or "éX".
I've tried a few patches: one for expat.h and xmlparse.c with Unicode
changes (#476931, #464837), and another one for xmltok.c (#477667),
but with no success. I always get the error message "mismatched tag
at line xx".
I have a lot of "éèàçêÈ" everywhere, but it seems to stop only when
one of them is placed right at the end of a tag.
Can anybody help me out?
- vin
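
A minimal Python sketch of one possible explanation for the "é" / "éX" /
"éXX" pattern described above. It assumes, as the comments further down
suggest, that the ISO-8859-1 byte for "é" (0xE9) was read as the lead byte
of a UTF-8 sequence. The helper names are hypothetical and this is not
expat's actual tokenizer; it only counts how many bytes a lead byte
announces.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte (0 if not a lead byte)."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: plain ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2          # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4          # 11110xxx
    return 0              # continuation byte or invalid lead byte


def swallowed_bytes(data: bytes, index: int) -> bytes:
    """Bytes consumed as one 'character' if data[index] is read as a UTF-8 lead byte."""
    length = utf8_sequence_length(data[index])
    return data[index:index + length]


# 0xE9 is "é" in ISO-8859-1; in binary it is 1110 1001, a 3-byte lead pattern.
for tail in (b"</NOM>", b"X</NOM>", b"XX</NOM>"):
    data = b"Labb\xe9" + tail
    print(tail, "->", swallowed_bytes(data, 4))
```

Under that assumption, "é" and "éX" both pull the "<" of the closing tag
into the character, while "éXX" leaves the closing tag intact.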
----------------------------------------------------------------------
>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-15 23:06
Message:
Logged In: YES
user_id=3066
Confirmed, mostly. With the current CVS, I can't seem to
get things to work by adding "XX" before the tag. I'm also
seeing a different error: "not well-formed (invalid token)".
I've attached a patch that adds tests to the test suite.
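
For anyone who wants to try this without building the C test suite, here
is a rough sketch using Python's standard xml.parsers.expat module, which
wraps libexpat. The helper name parse_nom is hypothetical, and the exact
error text depends on the expat version bundled with the interpreter, so
treat the outcome as something to check rather than a guaranteed result.

```python
import xml.parsers.expat


def parse_nom(document: bytes):
    """Parse the document and return (text, error); exactly one of them is None."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append  # collects character data as str
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as exc:
        return None, xml.parsers.expat.ErrorString(exc.code)
    return "".join(chunks), None


BODY = b"<NOM>Martin Labb\xe9</NOM>"  # 0xE9 is "é" in ISO-8859-1
DECLARATION = b'<?xml version="1.0" encoding="iso-8859-1"?>'

print(parse_nom(BODY))                # no declaration: input is assumed to be UTF-8
print(parse_nom(DECLARATION + BODY))  # encoding declared as ISO-8859-1
```

The first call is expected to fail because the lone 0xE9 byte is not
valid UTF-8; the second should return the element text.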
----------------------------------------------------------------------
Comment By: John Dawson (jdawson)
Date: 2002-03-26 16:54
Message:
Logged In: YES
user_id=30450
I have encountered a similar problem where I work, with German
text. My code was in Perl, and used XML::Parser and
XML::Parser::Expat, but I believe that Perl is not relevant
to this issue. I was using libexpat 1.95.1.
I got things to work by doing the following:
1) Give the input XML document an encoding type of
iso-8859-1. This will keep libexpat from interpreting the
non-ASCII Latin-1 characters (like the French accents) as part
of a multi-byte UTF-8 sequence. That's what the problem
was; it interpreted the '<' character following the accent
as the second byte of the same character.
2) Change your code that uses libexpat to convert the data
from UTF-8 back into ISO-8859-1. Why do this? It seems to
be the case that libexpat is normalizing the input to
UTF-8, regardless of what the original character encoding was.
The Perl regex I used to do this transformation:
$s =~ s/([\xC0-\xDF])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
If you can't read Perl, you can look at the latin1TOutf8.c
program, attached below, and do the inverse of what the C
code there does. It's quite trivial.
So, this worked for me.
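
For readers who prefer neither Perl nor C, the same UTF-8 to ISO-8859-1
step can be sketched in Python. The function name utf8_to_latin1 is
hypothetical. Unlike the Perl regex above, this version assumes only lead
bytes 0xC2 and 0xC3 should be collapsed, since those are the only
two-byte sequences whose code points fit in ISO-8859-1; anything else is
left untouched.

```python
import re

# Two-byte UTF-8 sequences for U+0080..U+00FF always start with 0xC2 or 0xC3.
_TWO_BYTE = re.compile(rb"([\xC2\xC3])([\x80-\xBF])")


def utf8_to_latin1(data: bytes) -> bytes:
    """Collapse two-byte UTF-8 sequences into single ISO-8859-1 bytes."""
    def collapse(match):
        lead = match.group(1)[0]
        cont = match.group(2)[0]
        # Low 2 bits of the lead byte become the top 2 bits of the result;
        # low 6 bits of the continuation byte fill in the rest.
        return bytes([((lead & 0x03) << 6) | (cont & 0x3F)])
    return _TWO_BYTE.sub(collapse, data)


utf8_bytes = "Martin Labb\u00e9".encode("utf-8")  # ends with 0xC3 0xA9
print(utf8_to_latin1(utf8_bytes))
```

When the data is known to contain only Latin-1 characters, the built-in
codecs do the same job: data.decode("utf-8").encode("latin-1"), which
raises an error instead of passing other characters through.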
A final point I'd like to make is that part of the issue
here is documentation and expectation management. It's
probably a good thing that libexpat converted the input
into UTF-8. The only reason why this didn't work out very
well for me was that my code is part of a much larger
system, and the other components I interact with deal
exclusively with ISO-8859-1. Thus, the path of least
resistance was to do the conversion.
----------------------------------------------------------------------
Comment By: Vincent Fortier (fortierv)
Date: 2002-02-07 13:37
Message:
Logged In: YES
user_id=451869
I've found a way to solve my problem. The problem was actually with the
ISO-8859-1 format causing errors with "é" at really specific places. I
found a converter in C which transforms a text file to UTF-8, and then my
problems were over (see attachment).
Here is the web site where I got it:
http://developer.iplanet.com/tech/directory/utf8ltn1.html
----------------------------------------------------------------------
Comment By: Vincent Fortier (fortierv)
Date: 2002-02-07 10:40
Message:
Logged In: YES
user_id=451869
Note: BTW, xmlwf works fine on the file, so it's supposed to be
well-formed in ISO-8859-1 format, and the value of "é" is E9 in hex and
351 in octal (checked with an octal dump).
thnx.
----------------------------------------------------------------------