[Expat-discuss] How to make expat support east asia languages

Karl Waclawek karl at waclawek.net
Tue Mar 15 00:47:55 CET 2005



陈颉 wrote:
>>j.chen at ustc.edu wrote:
>>
>>>Hi,
>>>   Does anyone know how to make Expat support East Asian languages, such as Chinese, Korean and so on?
>>>   Currently, Expat outputs unrecognizable characters when parsing an XML file that contains East Asian characters. Does anyone know how to fix this problem?
>>
>>Expat always uses the UTF-8 or UTF-16 encoding of Unicode
>>when passing data to the application, regardless of which encoding
>>the source document is in.
>>
>>Karl
>>
>>
> 
> 
> Well, when I use Expat to parse a very simple XML document that includes some Chinese characters, like "<test>哈哈</test>", a "not well-formed document" error occurs. However, when I change the XML to "<test>test</test>", everything is OK.
> 
> So, could you please tell me why?
> 
> What's more, many Chinese XML technical discussion forums state plainly that Expat does not support Chinese. So maybe Expat has a problem parsing multi-byte encodings. Do you know how to fix it?
> 

We have to differentiate between input and output.
For output, most XML parsers support Unicode only.

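For input, Expat supports UTF-8, UTF-16, ISO-8859-1 and US-ASCII.
Without an encoding declaration it assumes UTF-8 (or UTF-16), so bytes
in any other encoding are reported as not well-formed.
If you implement an unknown encoding handler (registered with
XML_SetUnknownEncodingHandler()), then Expat can support more encodings.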
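
To make the handler idea more concrete, here is a rough Python model of the
information such a handler has to hand back to Expat. In the C API the handler
fills in an XML_Encoding structure whose map array has 256 entries, one per
possible first byte: a value >= 0 is the Unicode code point encoded by that
single byte, -1 marks a byte that cannot start a character, and -n marks the
first byte of an n-byte sequence that Expat passes to a convert callback.
The names build_gb2312_map and convert_gb2312, and the lead-byte range used,
are assumptions of this sketch; they are not part of Expat or of any patch
mentioned in this thread.

```python
# Python model of the data an Expat unknown-encoding handler supplies.
# This only mirrors the layout of the C struct's map[256]; it does not
# register anything with a real parser.

def build_gb2312_map():
    """Return a 256-entry table describing each possible first byte."""
    table = []
    for byte in range(256):
        if byte < 0x80:
            table.append(byte)   # ASCII byte encodes itself
        elif 0xA1 <= byte <= 0xF7:
            table.append(-2)     # assumed lead byte of a 2-byte sequence
        else:
            table.append(-1)     # not valid as a first byte
    return table


def convert_gb2312(seq: bytes) -> int:
    """Play the role of the C convert callback: bytes -> code point."""
    return ord(seq.decode("gb2312"))


table = build_gb2312_map()
encoded = "哈".encode("gb2312")
print(table[encoded[0]])                  # entry for the lead byte
print(hex(convert_gb2312(encoded)))       # code point of the character
```

A real handler written in C would fill the same table into XML_Encoding.map
and point XML_Encoding.convert at a function doing the lookup.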

If you check out patch # 888879, you will find that someone
supplied such an implementation for the GB2312 encoding.

However, the standard encoding is Unicode, which supports basically
all characters of all languages, including Asian languages.
If you have a choice, I recommend abandoning the older
multi-byte encodings for Asian characters and switching to Unicode.

Karl

