[XML-SIG] XML and Unicode

Mark Nottingham mnot@mnot.net
Tue, 22 May 2001 19:33:18 -0700

OK, so I'm not getting something then. The attached test script (and
data file) is the problem pared down. If u'string' is an
encoding-neutral representation, and .encode('utf-8') generates a
UTF-8 encoded string from it, then the UTF-8.html output file should
display correctly; however, it doesn't, while the Latin-1.html output
does (because the input is Latin-1).

It seems like the XML parser isn't converting the ISO-8859-1 to
Unicode; does this make sense?
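
To make the expectation concrete, here is a minimal sketch of the round
trip being described. It is written for current Python 3 (where str is
the Unicode type) rather than the Python 2.0 used in this thread, and
the sample text 'café' is a hypothetical stand-in, not the contents of
the data file.

```python
# Hypothetical sample: "café", whose last character is U+00E9.
latin1_bytes = b"caf\xe9"                  # as stored in an ISO-8859-1 file

# Decoding yields the encoding-neutral Unicode string.
text = latin1_bytes.decode("iso-8859-1")

# Encoding the same string two ways gives different bytes.
utf8_bytes = text.encode("utf-8")          # b'caf\xc3\xa9'
roundtrip = text.encode("iso-8859-1")      # b'caf\xe9'

print(repr(text), utf8_bytes, roundtrip)

# If the decode step is skipped, the raw Latin-1 bytes are not valid UTF-8.
mislabeled_ok = True
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    mislabeled_ok = False
print("Latin-1 bytes readable as UTF-8:", mislabeled_ok)
```

So if the parser really hands back a Unicode string, the UTF-8 output
should contain the two-byte sequence for the character, not the single
Latin-1 byte.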


On Wed, May 23, 2001 at 12:38:34AM +0200, M.-A. Lemburg wrote:
> Mark Nottingham wrote:
> > 
> > How does one detect the charset used in an XML document from a SAX2
> > parser (PyXML 0.6.5)?
> > 
> > Also, if I have an XML document encoded ISO-8859-1 (and properly
> > identified), should I have a reasonable expectation that the output
> > of a SAX processor, post- .encode('utf-8'), should be correct if
> > viewed in a Web browser with UTF-8 selected as a character encoding?
>
> This should work...
>
> > In other words, is the post-parse unicode string a neutral
> > representation of the 8859-x string, which can then be encoded as
> > utf-8?
>
> Unicode is encoding neutral in the sense that it provides
> space for the characters of most scripts. If the parser returns
> Unicode, then you can encode it as UTF-8 and have the original
> contents of the attribute/element represented as a UTF-8 string.
>
> > Or, is it in the charset of the original XML document (my
> > testing seems to indicate the latter - what was an 8859 character in
> > the original text does not successfully come out the other side)?
> > 
> > (Sorry if this is obtuse - just getting into i18n, and Python docs
> > are thin on the ground)
> -- 
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH
> ______________________________________________________________________
> Company & Consulting:                           http://www.egenix.com/
> Python Software:                        http://www.lemburg.com/python/

Mark Nottingham

Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="testuni.py"

#!/usr/bin/env python2.0

from xml import sax
import string

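def run(i, e):
	# i: base name of the input XML file; e: name of the output encoding
	dh = Parser()
	p = sax.sax2exts.make_parser()
	p.setFeature(sax.handler.feature_namespaces, 1)
	# register the handler, otherwise it never receives any events
	p.setContentHandler(dh)
	p.parse(i + '.xml')
	content = dh.content.encode(e)
	file = open(e + ".html", 'w')
	file.write(template % (e, content))
	file.close()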

class Parser(sax.handler.ContentHandler):
	def __init__(self):
		self._tmp_buf = ''
		self.content = None
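	def startElementNS(self, name, qname, attrs):
		# assumed body (missing from this copy): start each element with an empty buffer
		self._tmp_buf = ''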
	def endElementNS(self, name, qname):
		if name[1] == 'content':
			self.content = string.strip(self._tmp_buf)
	def characters(self, content):
		self._tmp_buf = self._tmp_buf + content

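# Minimal template: charset declaration plus the parsed content
# (no other markup survives in this copy of the attachment).
template = """\
<meta http-equiv="Content-Type" content="text/html; charset=%s">
%s
"""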

if __name__ == '__main__':
	run('ISO-8859-1', 'UTF-8')
	run('ISO-8859-1', 'Latin-1')

Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: attachment; filename="ISO-8859-1.xml"
Content-Transfer-Encoding: 8bit

<?xml version="1.0" encoding="ISO-8859-1" ?>
<content>Net 21  The Survivors</content>
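
For reference, a sketch of the behaviour M.-A. Lemburg describes, using
the xml.sax module from the current Python 3 standard library rather
than PyXML 0.6.5. The handler name ContentCollector, the helper
parse_content, and the sample text 'café' are assumptions made for
illustration; the non-ASCII character from the data file above is not
reproduced here.

```python
import xml.sax


class ContentCollector(xml.sax.ContentHandler):
    """Collects the text of the last <content> element seen."""

    def __init__(self):
        super().__init__()
        self._buf = []
        self.content = None

    def startElement(self, name, attrs):
        # Start each element with an empty buffer.
        self._buf = []

    def characters(self, data):
        # The parser delivers already-decoded Unicode text here.
        self._buf.append(data)

    def endElement(self, name):
        if name == "content":
            self.content = "".join(self._buf).strip()


def parse_content(xml_bytes):
    """Parse an XML document given as bytes and return the <content> text."""
    handler = ContentCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.content


# Hypothetical document, stored as ISO-8859-1 bytes and declared as such.
doc = (
    '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    "<content>caf\u00e9</content>"
).encode("iso-8859-1")

text = parse_content(doc)            # Unicode string, independent of the file encoding
utf8 = text.encode("utf-8")          # bytes to write to a page declared as UTF-8
latin1 = text.encode("iso-8859-1")   # bytes to write to a page declared as ISO-8859-1

print(repr(text), utf8, latin1)
```

When the parser honours the encoding declaration, the handler sees the
same Unicode text whatever the file encoding was, and only the final
.encode() call decides which bytes end up in the output file.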