[ expat-Bugs-481609 ] Wrong umlauts after parsing

noreply@sourceforge.net
Mon Apr 22 04:21:01 2002


Bugs item #481609, was opened at 2001-11-14 00:33
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=481609&group_id=10127

Category: XML::Parser (Perl module)
Group: Not a Bug
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: Thomas Frings (frings)
Assigned to: Clark Cooper (coopercc)
Summary: Wrong umlauts after parsing

Initial Comment:
Parsing an XML file that contains German umlauts like
ä ö ü or their encodings like ä ä or ü
results in 'C$' (instead of 'ä'), 'C<' (instead of 'ü')
or 'C6' (instead of 'ö').

What's going wrong? 

System: Solaris 2.8
        expat 1.95.2
        XML-Parser 2.30

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-04-22 04:20

Message:
Logged In: NO 

When you write umlauts in attributes, it goes completely
wrong:
<image id="2" alt="Schön" />
results in a value alt="Schn" or (in newer versions of
Expat) in a well-formedness error.
When you use alt="Sch&uuml;n" you get alt="Schn", too.
The only workaround is alt="Sch&amp;uuml;n", and
that isn't nice at all.
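
One possible explanation, not confirmed anywhere in this report, is that
the file was saved as ISO-8859-1 without an encoding declaration, so
Expat reads it as UTF-8. In addition, named entities such as &uuml; are
not among XML's predefined entities and need a DTD that declares them.
The sketch below uses Python's xml.parsers.expat binding (not the Perl
XML::Parser module), and parse_alt is a hypothetical helper introduced
only for illustration:

```python
import xml.parsers.expat


def parse_alt(xml_bytes):
    """Return the first alt attribute found, or None if the document
    is rejected by Expat or contains no alt attribute."""
    found = []

    def start_element(name, attrs):
        if "alt" in attrs:
            found.append(attrs["alt"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError:
        return None  # not well-formed
    return found[0] if found else None


# ISO-8859-1 bytes with no encoding declaration: Expat assumes UTF-8.
latin1_raw = '<image id="2" alt="Schön" />'.encode("iso-8859-1")
# The same bytes, preceded by a declaration naming the real encoding.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_raw
# A numeric character reference is pure ASCII and needs no DTD.
char_ref = b'<image id="2" alt="Sch&#246;n" />'
# A named entity with no DTD to declare it.
named = b'<image id="2" alt="Sch&ouml;n" />'

for label, data in [("undeclared", latin1_raw), ("declared", declared),
                    ("char ref", char_ref), ("named entity", named)]:
    print(label, "->", parse_alt(data))
```

Under these assumptions, the undeclared and named-entity inputs are
rejected, while the declared and character-reference inputs yield the
intended value.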


----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-15 20:39

Message:
Logged In: YES 
user_id=3066

The output shown is not UTF-8, but UTF-8 with the high bit
stripped.  I expect this was an artifact of the display font
or the terminal.  Expat should produce UTF-8 in all cases;
that's part of the intended interface.
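
The strings in the initial report are consistent with that reading. A
minimal Python sketch (strip_high_bit is a made-up helper name) that
encodes each umlaut as UTF-8 and then clears the top bit of every byte:

```python
def strip_high_bit(text):
    """Encode text as UTF-8, then clear bit 7 of every byte."""
    masked = bytes(b & 0x7F for b in text.encode("utf-8"))
    return masked.decode("ascii")


for ch in "äüö":
    print(ch, "->", strip_high_bit(ch))
# ä -> C$
# ü -> C<
# ö -> C6
```

For example, 'ä' is the UTF-8 byte pair 0xC3 0xA4; masking with 0x7F
gives 0x43 0x24, which is 'C$' in ASCII.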

----------------------------------------------------------------------

Comment By: Simon Gordon (si_gordon)
Date: 2001-11-14 16:03

Message:
Logged In: YES 
user_id=227124

I believe this is UTF-8. Expat always outputs in UTF-8 
rather than either (a) what you want or (b) what the XML 
encoding is set to.

I have long held the belief that this is a bug, even though
the release notes for 1.95 documented this fact. I had to
patch my version to output ISO-8859-1 for exactly the same
reason - I needed umlauted characters in ISO, not UTF-8.
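
As an alternative to patching Expat, the conversion can be done in the
application after parsing. A minimal sketch, assuming Python's
xml.parsers.expat binding rather than the Perl module this report is
about; parse_text is a hypothetical helper and the sample document is
invented for illustration:

```python
import xml.parsers.expat


def parse_text(xml_bytes):
    """Collect all character data from the document as one string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver text in several pieces, so gather and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)


# Input declared and encoded as ISO-8859-1.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><t>Schön</t>'.encode("iso-8859-1")

text = parse_text(doc)              # Unicode text from the parser
latin1 = text.encode("iso-8859-1")  # re-encode for ISO-8859-1 consumers
print(latin1)
```

Characters that have no ISO-8859-1 representation make the final
encode call raise an error, so real code would need a policy for them.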

----------------------------------------------------------------------
