[Expat-discuss] Windows-1252 and Latin-1
Fred L. Drake, Jr.
fdrake at acm.org
Tue Jan 14 09:48:15 EST 2003
Swapnel writes:
> Dear All,
> The problem is: if the encoding scheme specified in the XML document is
> "windows-1252" and the file is saved using Notepad on Win-2000 with the
> encoding set to Unicode, then Expat does not parse the document. But if the
> same file is saved from Notepad with the encoding set to "ANSI" or "UTF-8",
> parsing goes smoothly.
>
> Eg:
> XML document : <?xml version="1.0"? encoding= "windows-1252">.
> The editor used to save this document is Win-2000 Notepad with the encoding
> option set to Unicode. Expat is not able to parse this document.
Hmm. According to this page:
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
"Windows-1252" is a synonym for Latin-1, or ISO-8859-1 (I didn't
compare the table codepoint-by-codepoint, just trusting the text of
the page). If you use ISO-8859-1 in the XML declaration, Expat should
be perfectly happy. The issue is that Windows-1252 is a non-standard
name for the encoding.
We should consider adding "windows-1252" to the list of supported
encodings for Expat since it is (supposedly) identical with Latin-1.
> Is there any way for Expat to convert an XML document saved with the
> Unicode encoding scheme to ANSI or UTF-8 before parsing it?
In general, you can re-encode the input yourself and tell Expat which
encoding to assume by passing its name to XML_ParserCreate(); an
encoding given there overrides the one in the XML declaration. Another
option is to use the facilities Expat provides to hook in additional
decoders; see the reference.html file that comes with Expat for API
information.
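
As a rough sketch of the "re-encode the input yourself" route: the
helper below converts a UTF-16LE buffer to UTF-8. It assumes that
Notepad's "Unicode" option writes UTF-16LE with a leading byte-order
mark. The name utf16le_to_utf8() is made up for illustration and is
not part of Expat.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Expat API: convert UTF-16LE bytes to UTF-8.
 * A leading byte-order mark (FF FE) is skipped if present.
 * Returns the number of bytes written to out, or (size_t)-1 if the input
 * is malformed or the output buffer is too small. */
size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    if (in_len % 2 != 0)
        return (size_t)-1;              /* UTF-16 comes in 2-byte units */
    if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        i = 2;                          /* skip the little-endian BOM */

    while (i < in_len) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: it must be followed by a low surrogate. */
            uint32_t lo;
            if (i >= in_len)
                return (size_t)-1;
            lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;          /* lone low surrogate */
        }

        if (cp < 0x80) {
            if (o + 1 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            if (o + 2 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (o + 3 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            if (o + 4 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

The converted buffer could then be handed to XML_Parse() on a parser
created with XML_ParserCreate("UTF-8"). An output buffer of three bytes
for every two input bytes is always large enough.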
-Fred
--
Fred L. Drake, Jr. <fdrake at acm.org>
PythonLabs at Zope Corporation