[Expat-discuss] How is SJIS encoding handled in expat?

Karl Waclawek karl at waclawek.net
Tue Apr 17 18:14:30 CEST 2007


Agarwal, Saumya wrote:
> Hi,
>  
> I have a scenario in which the data on the server is encoded in SJIS. The client requests this data from the server through an API, and the server sends the output as XML, parsed by the expat parser.
>  
> Here is the input and output  -
>  
> <?xml version='1.0' encoding='SHIFT-JIS' ?>
> <!DOCTYPE netapp SYSTEM 'file:/etc/netapp_filer.dtd'>
> <netapp xmlns="http://www.netapp.com/filer/admin" version="1.0"><file-inode-info><inode-number>1193746</inode-number><volume-name>vol0</volume-name></file-inode-info></netapp>
>  
> OUTPUT:
> <?xml version='1.0' encoding='UTF-8' ?>
> <!DOCTYPE netapp SYSTEM '/na_admin/netapp_filer.dtd'>
> <netapp version='1.1' xmlns='http://www.netapp.com/filer/admin'>
> <results status="passed"><volume-name>vol0</volume-name><volume-fsid>1996999850</volume-fsid><volume-uuid>42a93940-4ed9-11db-ba89-00a098032816</volume-uuid><inode-number>1193746</inode-number><number-of-parents>1</number-of-parents><inode-paths><inode-parent-info><inode-path>/vol/vol0/home/新規ワードパッド ドキュメント.doc</inode-path></inode-parent-info></inode-paths></results></netapp>
>
>  
> As seen above, the client declares the document encoding to be SHIFT-JIS. The server returns the proper data (it seems to be SJIS, as the Japanese characters are represented correctly in the output), but the encoding declared in the output document is UTF-8.
> Now, the strange part is that even if the client declares the document encoding to be UTF-8 in the input, the server behavior is just the same!
>  
> Here are my questions -
> 1. Does expat support SJIS encoding? 
>   

Not by default. You must register an "unknownEncodingHandler" (via
XML_SetUnknownEncodingHandler) that can handle SHIFT-JIS.
Out of the box, Expat only supports US-ASCII, ISO-8859-1, UTF-8 and UTF-16
for input.
For an example, look at patch #888879 on the Expat web site.

> 2. If yes, then how does it know the data is SJIS encoded and when does it call the appropriate handler? 
>   

Normally, Expat would reject the input document. Do you know if an
"unknownEncodingHandler" is registered?
Or, more likely, XML_ParserCreate(const XML_Char *encoding) is called with
a recognized encoding name (instead of NULL). This overrides the encoding
declaration and makes Expat treat the document as if it were thus encoded.

> 3. Is the output returned by expat, the SJIS encoded data, or does it convert the data to UTF-8 and return it?
>   

Expat always returns either UTF-8 or UTF-16, depending on how it was built.
My guess is, the server forces one of the built-in encodings when calling
XML_ParserCreate(const XML_Char *encoding). This can work as long as there
is no sequence of bytes that represents an invalid code point in that
encoding.
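
To check whether a given input would survive being forced to UTF-8, one can test the raw bytes for well-formedness before handing them to the parser. This is a plain C sketch independent of Expat; is_valid_utf8 is a hypothetical helper, not an Expat function.

```c
#include <stddef.h>

/* Returns 1 if the n bytes at s form well-formed UTF-8, else 0.
   Rejects stray continuation bytes, truncated sequences, overlong
   forms, surrogates, and values above U+10FFFF. */
static int is_valid_utf8(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t need, k;
        unsigned long cp;
        if (b < 0x80) {              /* single-byte ASCII */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            need = 1; cp = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2; cp = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3; cp = b & 0x07;
        } else {
            return 0;                /* 0x80..0xC1 and 0xF5..0xFF cannot start a sequence */
        }
        if (n - i <= need)
            return 0;                /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;            /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (need == 2 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)))
            return 0;                /* overlong form or surrogate */
        if (need == 3 && (cp < 0x10000 || cp > 0x10FFFF))
            return 0;                /* overlong form or out of range */
        i += need + 1;
    }
    return 1;
}
```

Pure ASCII content such as "vol0" passes, while the SJIS bytes 0x82 0xA0 do not, since 0x82 cannot start a UTF-8 sequence. Passing the check only means the parser would not reject the bytes; it does not mean they would be interpreted as the intended characters.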

> 4. Is there a way through which expat can declare to the client that the data is actually SJIS and not UTF-8? We have another parser on the client side (libxml2) which fails with a parsing error when the XML output from expat is given to it, as the data is Japanese while the encoding declaration is UTF-8.
>   

No, Expat always returns UTF-8 or UTF-16. I think there is an error on
the server side.
Since you say the characters returned by Expat are actually SJIS, I
assume that the server forces Expat to treat the input as one of the
built-in encodings (most likely UTF-8).

Karl

