[Expat-discuss] How is SJIS encoding handled in expat?

Agarwal, Saumya Saumya.Agarwal at netapp.com
Fri Apr 20 10:36:33 CEST 2007


Thanks, Karl. The problem was that XML_ParserCreate(const XML_Char *encoding) was being called with UTF-8, which overrode the encoding declaration, as you suspected.

>Not by default. You must register an "unknownEncodingHandler" that can handle SHIFT-JIS.
>Out of the box, Expat only supports ASCII, ISO-8859-1, UTF-8 and UTF-16 for input.
>For an example, look at patch #888879 on the Expat web site. 

Where can I find an encoding handler that can handle SHIFT-JIS? Will Expat be able to support both UTF-8 and SHIFT-JIS at the same time if I register such a handler?

Thanks,
Saumya

-----Original Message-----
From: Karl Waclawek [mailto:karl at waclawek.net] 
Sent: Tuesday, April 17, 2007 9:45 PM
To: expat-discuss at libexpat.org
Subject: Re: [Expat-discuss] How is SJIS encoding handled in expat?

Agarwal, Saumya wrote:
> Hi,
>  
> I have a scenario in which the encoding of the data on the server is in SJIS format. The client requests this data from the server through an API, the server sends the output in XML parsed by the expat parser.
>  
> Here is the input and output  -
>  
> <?xml version='1.0' encoding='SHIFT-JIS' ?>
> <!DOCTYPE netapp SYSTEM 'file:/etc/netapp_filer.dtd'>
> <netapp xmlns="http://www.netapp.com/filer/admin" version="1.0">
> <file-inode-info><inode-number>1193746</inode-number><volume-name>vol0</volume-name></file-inode-info>
> </netapp>
>  
> OUTPUT:
> <?xml version='1.0' encoding='UTF-8' ?>
> <!DOCTYPE netapp SYSTEM '/na_admin/netapp_filer.dtd'>
> <netapp version='1.1' xmlns='http://www.netapp.com/filer/admin'>
> <results status="passed">
> <volume-name>vol0</volume-name>
> <volume-fsid>1996999850</volume-fsid>
> <volume-uuid>42a93940-4ed9-11db-ba89-00a098032816</volume-uuid>
> <inode-number>1193746</inode-number>
> <number-of-parents>1</number-of-parents>
> <inode-paths><inode-parent-info><inode-path>/vol/vol0/home/新規ワードパッド ドキュメント.doc</inode-path></inode-parent-info></inode-paths>
> </results>
> </netapp>
>
>  
> As seen above, the client declares the document encoding to be SHIFT-JIS. The server returns the proper data (apparently SJIS, as the Japanese characters are represented correctly in the output), but the encoding declared in the output document is UTF-8.
> Now, the strange part is that even if the client declares the document encoding to be UTF-8 in the input, the server behavior is just the same!
>  
> Here are my questions -
> 1. Does expat support SJIS encoding? 
>   

Not by default. You must register an "unknownEncodingHandler" that can handle SHIFT-JIS.
Out of the box, Expat only supports ASCII, ISO-8859-1, UTF-8 and UTF-16 for input.
For an example, look at patch #888879 on the Expat web site.
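
As a rough illustration of what such a handler has to provide: Expat's XML_Encoding structure carries a 256-entry map that describes each possible first byte. A value >= 0 is the Unicode code point of a single-byte character, -1 marks a byte that cannot start a character, and -n (n >= 2) marks the first byte of an n-byte sequence. The sketch below fills a plain int array in that layout for Shift-JIS. The function name fill_sjis_map is hypothetical, the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) assume the common Windows-style variant, and 0x5C and 0x7E are treated as plain ASCII. A complete handler would also need a convert callback backed by a JIS X 0208 table, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fills a 256-entry first-byte map, in the layout
   Expat expects in XML_Encoding.map, for Shift-JIS input. */
void fill_sjis_map(int map[256])
{
    int b;

    assert(map != NULL);
    for (b = 0; b < 256; b++) {
        if (b <= 0x7F) {
            /* Assumption: treat the low half as plain ASCII. */
            map[b] = b;
        } else if (b >= 0xA1 && b <= 0xDF) {
            /* Single-byte half-width katakana: U+FF61..U+FF9F. */
            map[b] = 0xFF61 + (b - 0xA1);
        } else if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)) {
            /* Lead byte of a two-byte character. */
            map[b] = -2;
        } else {
            /* 0x80, 0xA0, 0xFD..0xFF: not a valid first byte. */
            map[b] = -1;
        }
    }
}
```

Inside an unknown-encoding handler (registered with XML_SetUnknownEncodingHandler), one would call this on info->map, set info->convert and info->data, and return XML_STATUS_OK when the requested encoding name is recognized.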

> 2. If yes, then how does it know the data is SJIS encoded and when does it call the appropriate handler? 
>   

Normally, Expat would reject the input document. Do you know if there is an "unknownEncodingHandler"?
Or, more likely, XML_ParserCreate(const XML_Char *encoding) is called with a recognized encoding (instead of NULL). This would override the encoding declaration and make Expat treat the document as if it were encoded that way.

> 3. Is the output returned by expat, the SJIS encoded data, or does it convert the data to UTF-8 and return it?
>   

Expat always returns either UTF-8 or UTF-16, depending on how it was built.
My guess is that the server forces one of the built-in encodings when calling XML_ParserCreate(const XML_Char *encoding). This can work as long as the input contains no byte sequence that is invalid in that encoding.
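
To make that condition concrete, here is a minimal sketch of a well-formedness check for UTF-8. It is not Expat's own code, and the function name is_valid_utf8 is made up for this example. It shows why forcing UTF-8 onto Shift-JIS data is fragile: a structurally valid Shift-JIS double-byte pair such as 0x90 0x56 starts with a byte that UTF-8 only allows as a continuation byte, so a UTF-8 decoder has to reject it.

```c
#include <stddef.h>

/* Returns 1 if the n bytes at s are well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp;
        size_t len, k;

        if (b < 0x80) {                 /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07;
        } else {
            return 0;                   /* stray continuation or bad lead byte */
        }

        if (n - i < len)
            return 0;                   /* truncated sequence */

        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates and values above U+10FFFF. */
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000))
            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        if (cp > 0x10FFFF)
            return 0;

        i += len;
    }
    return 1;
}
```

Pure ASCII content such as "vol0" passes this check under either interpretation, which is why a forced encoding can appear to work until non-ASCII file names show up.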

> 4. Is there a way for Expat to declare to the client that the data is actually SJIS and not UTF-8? We have another parser on the client side (libxml2) which fails with a parsing error when the XML output from Expat is given to it, as the data is Japanese while the encoding declaration is UTF-8.
>   

No, Expat always returns UTF-8 or UTF-16. I think there is an error on the server side.
Since you say the characters returned by Expat are actually SJIS, I assume that the server forces Expat to treat it as one of the built-in encodings (most likely UTF-8).

Karl