[Expat-discuss] Windows-1252 and Latin-1

Fred L. Drake, Jr. fdrake at acm.org
Tue Jan 14 09:48:15 EST 2003


Swapnel writes:
 > Dear All,
 > 	The problem is: if the encoding specified in the XML document is
 > "windows-1252" and the file is saved using Windows 2000 Notepad with the
 > encoding set to Unicode, then Expat does not parse the document. But if
 > the same file is saved from Notepad with the encoding set to "ANSI" or
 > "UTF-8", then parsing proceeds smoothly.
 > 
 > Eg:
 > 	XML document: <?xml version="1.0" encoding="windows-1252"?>
 > The editor used to save this document is Windows 2000 Notepad with the
 > encoding option set to Unicode. Expat is not able to parse this document.

Hmm.  According to this page:

http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

"Windows-1252" is a synonym for Latin-1, or ISO-8859-1 (I didn't
compare the table codepoint-by-codepoint, just trusting the text of
the page).  If you use ISO-8859-1 in the XML declaration, Expat should
be perfectly happy.  The issue is that "windows-1252" is not one of the
encoding names built into Expat.

We should consider adding "windows-1252" to the list of supported
encodings for Expat since it is (supposedly) identical with Latin-1.
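
The codepoint-by-codepoint comparison that was skipped above can be
sketched in a few lines.  This is only an illustration, assuming Python's
built-in "cp1252" and "latin-1" codecs are faithful stand-ins for the
Windows-1252 and ISO-8859-1 tables; the function name differing_bytes is
made up for this example.

```python
# Compare how two single-byte codecs decode every possible byte value.
# Assumption: Python's "cp1252" and "latin-1" codecs stand in for the
# Windows-1252 and ISO-8859-1 tables discussed in this message.

def differing_bytes():
    """Return the byte values that the two codecs decode differently."""
    differing = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        # cp1252 leaves a few byte values undefined; decode those to
        # U+FFFD instead of raising, so they are counted as differences.
        cp1252 = raw.decode("cp1252", errors="replace")
        if latin1 != cp1252:
            differing.append(value)
    return differing


if __name__ == "__main__":
    diff = differing_bytes()
    print("bytes that differ:", " ".join(f"{b:#04x}" for b in diff))
```

With those codec tables the two encodings agree everywhere except bytes
0x80 through 0x9f, where ISO-8859-1 has control codes and Windows-1252
has printable characters (or nothing).  A document that uses bytes in
that range would therefore be read differently under an ISO-8859-1
label, which is worth checking before treating the names as aliases.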

 > Is there any way Expat can convert an XML document saved with the
 > Unicode encoding to ANSI or UTF-8 before parsing it?

In general, you can re-encode the input yourself into an encoding Expat
supports and tell Expat to assume that encoding by passing its name to
XML_ParserCreate(); an encoding given there overrides the one named in
the XML declaration.  Another option is to use the facilities Expat
provides to hook in additional decoders; see the reference.html file
that comes with Expat for API information.
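
A minimal sketch of the first approach, written against Python's
xml.parsers.expat binding rather than the C API (ParserCreate(encoding)
is the binding's counterpart of XML_ParserCreate()).  It assumes the
Notepad "Unicode" file is UTF-16 with a byte-order mark; the function
name parse_notepad_unicode and the sample document are made up for
illustration.

```python
import xml.parsers.expat


def parse_notepad_unicode(raw):
    """Parse UTF-16 bytes and return (element names, character data)."""
    # Step 1: re-encode the input ourselves, UTF-16 -> UTF-8.
    # decode("utf-16") reads the byte-order mark and drops it.
    utf8_data = raw.decode("utf-16").encode("utf-8")

    # Step 2: tell Expat which encoding to assume.  An encoding given
    # here overrides the one named in the XML declaration.
    parser = xml.parsers.expat.ParserCreate(encoding="utf-8")

    names, chunks = [], []
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.CharacterDataHandler = chunks.append
    parser.Parse(utf8_data, True)  # True marks the final chunk of input
    return names, "".join(chunks)


if __name__ == "__main__":
    sample = ('<?xml version="1.0" encoding="windows-1252"?>'
              '<doc><item>caf\u00e9</item></doc>')
    print(parse_notepad_unicode(sample.encode("utf-16")))
```

The declaration in the sample still says windows-1252, but because the
parser was created with an explicit encoding, the declared name is not
consulted.  The same two steps should carry over to C: convert the
buffer, then pass "UTF-8" to XML_ParserCreate().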


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at Zope Corporation


