[Expat-discuss] Fw: Extra character inserted in CharacterData Handler?

Subramanian, Binu binu.subramanian@barconet.com
Wed Jul 24 20:42:02 2002


Hi,

No, UTF-8 is not restricted to 8 bits. Please refer to this link:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
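
To make the point about UTF-8 concrete, here is a minimal Python sketch that prints how many bytes UTF-8 uses for a few code points, including the Euro sign (8364 decimal, U+20AC). The helper name `utf8_bytes` and the sample table are made up for illustration; only the built-in `str.encode` is relied on.

```python
# How many bytes does UTF-8 need for a given code point?
# `utf8_bytes` and `samples` are illustrative names, not part of any library.

def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")

samples = {
    "latin capital A": 0x41,
    "middle dot": 0xB7,
    "euro sign": 0x20AC,        # 8364 decimal
    "trade mark sign": 0x2122,  # 8482 decimal
}

for name, cp in samples.items():
    encoded = utf8_bytes(cp)
    print(f"U+{cp:04X} {name}: {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Only code points below 128 fit in a single byte; the Euro sign takes three bytes (e2 82 ac), so 8364 is well within what UTF-8 can represent.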

I have some data that contains characters such as the Euro sign, the
trademark sign, etc. I want to serialise it in XML format, so when I write
the XML file, I replace these characters with their numerical entities.
This XML file is displayed correctly in IE 6.0. But when I parse the XML
file, expat prefixes the Â character, i.e. I get the Â character followed
by the Euro character.

My XML file has the encoding specified as UTF-8.

I will try changing the encoding of the parser and check.
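
As a rough way to run that check, the sketch below feeds a small UTF-8 document containing numeric character references to Expat through Python's standard `xml.parsers.expat` binding and lists the code points the character data handler receives. The document content and the helper name `parse_text` are assumptions made for illustration; note also that the Python binding hands the handler decoded strings, whereas the C API delivers UTF-8 (or UTF-16) bytes.

```python
import xml.parsers.expat

def parse_text(document: bytes, encoding=None) -> str:
    """Parse `document` and return all character data as one string.

    `encoding`, if given, overrides the encoding named in the XML
    declaration (this mirrors passing an encoding to the parser).
    """
    chunks = []
    parser = xml.parsers.expat.ParserCreate(encoding)
    # Expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)

# Hypothetical sample document: Euro and trademark as numeric references.
doc = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<price>&#8364;100 &#8482;</price>')

text = parse_text(doc)
print([f"U+{ord(c):04X}" for c in text])

# Same document, with the encoding also passed to the parser explicitly.
assert parse_text(doc, "UTF-8") == text
```

In this setup the handler receives exactly one character, U+20AC, for `&#8364;`. If an extra character shows up afterwards, it may be worth checking how the handler's output is being decoded or displayed.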

Binu
-----Original Message-----
From: Josh Martin [mailto:Josh.Martin@abq.sc.philips.com]
Sent: 25 July 2002 03:52
To: expat-discuss@lists.sourceforge.net; binu.subramanian@barconet.com
Subject: Re: [Expat-discuss] Fw: Extra character inserted in
CharacterData Handler?


Hi,

Call me crazy, but isn't UTF-8 an 8-bit wide character encoding format?
And if so, isn't the number 8364 a bit out of its league?  If this is true,
then I would think that the Â character is either some sort of multi-byte
character indicator, or is just expat fudging on the numbers it doesn't
understand.  Try not specifying the encoding format for the XML document
and the XML parser... see what happens.  Let us know how it goes.  I think
that is how I solved this problem when I encountered it about a year ago.

BTW, I thought you were trying to keep the parser from converting character 
entities to their character representation?

 - Josh Martin

> Hello,
> 
> I am facing exactly the same problem. In my case the characters are the
> Euro, trademark, etc.
> When I write the XML file, I replace the Euro character with its numerical
> entity &#8364;.
> I have specified the encoding for my XML file as UTF-8.
> 
> Now when the expat parser parses the file, it adds the Â character, so it
> is Â followed by the Euro character.
> What should I do to get rid of the extra character?
> Am I missing something here?
> Binu
> 
> 
> > 
>  > The "·" character in your file - 0xB7 - is invalid UTF-8.
>  > Maybe it is valid ISO-8859-1?
>  > In that case you must add an XML declaration.
>  > 
>  > Actually, 1.95.3 should reject it (and it does so on my system).
>  
>  Rolf Ade just pointed out to me that I didn't read your code.
>  You passed the ISO-8859-1 encoding to the parser, so there
>  was no error on your side.
>  
>  However, what you reported looks exactly like what a word processor
>  would show you when it expects ISO-8859-1, but gets UTF-8 (tested with
>  Wordpad).
>  Now, this would be a correct result, since Expat only passes UTF-8
>  or UTF-16 to its handlers, no matter what the input.
>  
>  Karl
>  
> 
> 
> 
> _______________________________________________
> Expat-discuss mailing list
> Expat-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/expat-discuss
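
Karl's explanation in the quoted text can be reproduced with a short Python sketch: an ISO-8859-1 document containing the byte 0xB7 is parsed, the resulting text is written out as UTF-8 (which is what the C handlers receive), and those bytes are then misread as ISO-8859-1. The sample document and the helper name `parse_text` are invented for illustration, and the final decode step only imitates a viewer that assumes ISO-8859-1.

```python
import xml.parsers.expat

def parse_text(document: bytes) -> str:
    """Parse `document` with Expat and return its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)

# Hypothetical input: ISO-8859-1 document with a middle dot (byte 0xB7).
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>x\xb7y</a>'

text = parse_text(doc)                  # 'x·y', middle dot is U+00B7
as_utf8 = text.encode("utf-8")          # bytes a C handler would be given
misread = as_utf8.decode("iso-8859-1")  # what an ISO-8859-1 viewer shows

print(as_utf8)   # b'x\xc2\xb7y'
print(misread)   # xÂ·y
```

The middle dot becomes the two bytes c2 b7 in UTF-8, and a viewer expecting ISO-8859-1 renders the lead byte c2 as Â. That is consistent with the extra character described in this thread, although the sketch does not prove that this is what happened in the original report.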