[XML-SIG] XML and Unicode

Mark Nottingham mnot@mnot.net
Tue, 22 May 2001 19:33:18 -0700


--jI8keyz6grp/JLjh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


OK, so I'm not getting something then. The attached test script (and
data file) is the problem pared down: if u'string' is an
encoding-neutral representation, and .encode('utf-8') generates a
UTF-8 encoded string from it, then the UTF-8.html output file should
display correctly; however, it doesn't, while the Latin-1 output does
(because the input is Latin-1).

It seems like the XML parser isn't converting the ISO-8859-1 to
Unicode; does this make sense?
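
A minimal sketch of the round trip expected here, assuming Python 3's
standard-library xml.sax rather than PyXML 0.6.5. The ContentCollector
class, the parse_content helper, and the sample document are all
hypothetical and are not taken from the attached script:

```python
import xml.sax


class ContentCollector(xml.sax.ContentHandler):
    """Accumulate the character data found inside a <content> element."""

    def __init__(self):
        super().__init__()
        self._buf = []
        self.content = None

    def startElement(self, name, attrs):
        if name == "content":
            self._buf = []

    def characters(self, data):
        # The parser hands over already-decoded text (str), not raw bytes.
        self._buf.append(data)

    def endElement(self, name):
        if name == "content":
            self.content = "".join(self._buf).strip()


def parse_content(xml_bytes):
    """Parse an XML document given as bytes and return the <content> text."""
    handler = ContentCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.content


# Hypothetical sample: e-acute stored as the single Latin-1 byte 0xE9.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<content>caf\u00e9</content>"
).encode("iso-8859-1")

text = parse_content(sample)         # decoded, encoding-neutral text
utf8 = text.encode("utf-8")          # e-acute becomes the two bytes C3 A9
latin1 = text.encode("iso-8859-1")   # e-acute stays the single byte E9
```

If the parser honours the encoding declaration, the same decoded text
can be written out in either encoding; only the bytes differ.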

Thanks,


On Wed, May 23, 2001 at 12:38:34AM +0200, M.-A. Lemburg wrote:
> Mark Nottingham wrote:
> > 
> > How does one detect the charset used in an XML document from a SAX2
> > parser (PyXML 0.6.5)?
> > 
> > Also, if I have an XML document encoded ISO-8859-1 (and properly
> > identified), should I have a reasonable expectation that the output
> > of a SAX processor, post- .encode('utf-8'), should be correct if
> > viewed in a Web browser with UTF-8 selected as a character encoding?
> 
> This should work...
> 
> > In other words, is the post-parse unicode string a neutral
> > representation of the 8859-x string, which can then be encoded as
> > utf-8?
> 
> Unicode is encoding neutral in the sense that it provides
> space for the characters of most scripts. If the parser returns
> Unicode, then you can encode it as UTF-8 and have the original
> contents of the attribute/element represented as a UTF-8 string.
> 
> > Or, is it in the charset of the original XML document (my
> > testing seems to indicate the latter - what was an 8859 character in
> > the original text does not successfully come out the other side)?
> > 
> > (Sorry if this is obtuse - just getting into i18n, and Python docs
> > are thin on the ground)
> 
> -- 
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH
> ______________________________________________________________________
> Company & Consulting:                           http://www.egenix.com/
> Python Software:                        http://www.lemburg.com/python/

-- 
Mark Nottingham
http://www.mnot.net/

--jI8keyz6grp/JLjh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="testuni.py"

#!/usr/bin/env python2.0

from xml import sax
import string

def run(i, e):
	dh = Parser()
	p = sax.sax2exts.make_parser()
	p.setContentHandler(dh)
	p.setFeature(sax.handler.feature_namespaces, 1)
	p.parse(i + '.xml')
	content = dh.content.encode(e)
	file = open(e + ".html", 'w')
	file.write(template % (e, content))
	file.close()

class Parser(sax.handler.ContentHandler):
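	def __init__(self):
		# Initialise the base handler before adding our own state.
		sax.handler.ContentHandler.__init__(self)
		self._tmp_buf = ''
		self.content = None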
	
	def startElementNS(self, name, qname, attrs):
		pass
	
	def endElementNS(self, name, qname):
		if name[1] == 'content':
			self.content = string.strip(self._tmp_buf)
		
	def characters(self, content):
		self._tmp_buf = self._tmp_buf + content
		

template = """\
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=%s">
</head>
<body>
<p>%s</p>
</body>
</html>
"""

if __name__ == '__main__':
	run('ISO-8859-1', 'UTF-8')
	run('ISO-8859-1', 'Latin-1')

--jI8keyz6grp/JLjh
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: attachment; filename="ISO-8859-1.xml"
Content-Transfer-Encoding: 8bit

<?xml version="1.0" encoding="ISO-8859-1" ?>
<content>Net 21  The Survivors</content>

--jI8keyz6grp/JLjh--