Matt Price matt.price at utoronto.ca
Mon Aug 9 19:45:43 CEST 2004

(cross-posted to python-list)

I'm a python (& xml, & unicode!) newbie working on an interface to a
bibliographic reference server (refdb); I'm running into some encoding
problems & am finding the plethora of tools a little confusing.  Here
is the basic situation:

I connect to the server and receive an xml document whose content is a
bibliographic dataset.  The document can be encoded in two ways:
ISO-8859-1 or unicode.  My program simply takes the document and
passes it to an xsl stylesheet using libxslt & libxml2.  Here's the
relevant code:  

# this is how I get the results & generate either a string or a
# unicode string
    def getref (self, query = ':ID:>0',  cmd = 'getref ', 
                reftype = default_reftype): 
        cmd += ' ' + query 
        self.send(cmd + self.CS_TERM) 
        results = self.tread() 
        if self.encoding == 'UNICODE': 
            print ' decoding unicode string: ' 
            results = results.decode('utf-8', 'replace') 
        return results 

# this is where I generate the html:
    def risx_to_html (self, risxSet, xsl = xsl_ss,  
                    css=css_url, use_css = 1): 
        styledoc = libxml2.parseFile(xsl) 
        style = libxslt.parseStylesheetDoc(styledoc) 
        doc = libxml2.parseDoc(risxSet) 
        result = style.applyStylesheet(doc, None) 
        # style.saveResultToFilename("results.html", result, 0) 
        htmlString = style.saveResultToString(result) 
        return htmlString 

The server's default encoding is iso-8859-1, and since I mostly use
english-language references, this usually works fine; but occasionally
the server will pass me an entity like '&mu;' (for Greek letter mu).
This generates an error like this:  

Entity: line 57: parser error : Entity 'mu' not defined
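
For illustration, one possible workaround for the 'mu' error is to rewrite
HTML-style named entities as numeric character references before parsing.
This is only a sketch under assumptions: numeric_entities is a hypothetical
helper, the sample record is made up, the standard library's ElementTree
stands in for libxml2 so the example runs on its own, and it is written for
a current Python 3 (on Python 2.3 the same table lives in the
htmlentitydefs module).

```python
import re
import xml.etree.ElementTree as ET  # stand-in parser for this sketch
from html.entities import name2codepoint

# The five entities XML itself predefines; these must be left alone.
XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}

_ENTITY = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')

def numeric_entities(xml_text):
    """Replace HTML named entities such as &mu; with numeric references.

    Hypothetical helper; it does not special-case CDATA sections.
    """
    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or predefined: leave untouched
        return '&#%d;' % name2codepoint[name]
    return _ENTITY.sub(swap, xml_text)

# Made-up sample record containing an entity with no DTD to define it.
sample = '<ref><title>&mu;-opioid receptors &amp; binding</title></ref>'
fixed = numeric_entities(sample)
title = ET.fromstring(fixed).find('title').text
```

Parsing the untouched sample raises an undefined-entity error much like the
one shown; after the substitution the title comes back with a real Greek mu.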

This is not so bad, because the parsing continues nonetheless.  With
unicode it's worse.  In this case there are several errors depending
on how I set the system up:  

with iso-8859-1 set as default encoding in sitecustomize.py:

  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
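
The traceback above can be reproduced in miniature without libxml2.  This is
a sketch under the assumption that the unicode string is being converted
implicitly with the default encoding on its way into the C library; the
sample text is made up, and the code is written for Python 3.

```python
# A unicode string holding a character outside Latin-1 (Greek mu, U+03BC).
text = u'\u03bc-opioid'

# Latin-1 only covers code points 0-255, so this encode cannot succeed.
try:
    text.encode('latin-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# UTF-8 can represent every code point; mu becomes the two bytes CE BC.
as_utf8 = text.encode('utf-8')
```

If that assumption holds, changing the default encoding only changes which
error appears; encoding explicitly is the step that is missing.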

with utf-8 set as default encoding: 
  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
TypeError: xmlParseDoc() argument 1 must be string without null bytes or None, not unicode
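
The TypeError suggests xmlParseDoc wants a byte string rather than a unicode
object, so one thing to try is encoding the decoded document back to UTF-8
explicitly just before parsing.  A minimal sketch, with assumptions stated:
to_utf8_bytes is a hypothetical helper, the sample document is made up, the
standard library's ElementTree stands in for libxml2 so the example runs on
its own, and it is written for Python 3.

```python
import re
import xml.etree.ElementTree as ET  # stand-in for libxml2 in this sketch

# Matches an XML declaration at the very start of the document.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>')

def to_utf8_bytes(document):
    """Return the document as UTF-8 bytes with a matching XML declaration.

    Hypothetical helper, not part of libxml2 or refdb.
    """
    if isinstance(document, bytes):
        return document  # already a byte string; pass through unchanged
    # Drop any existing declaration so it cannot contradict the new bytes.
    body = _XML_DECL.sub('', document, count=1)
    return ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode('utf-8')

# Made-up document, already decoded to a unicode string as in getref.
decoded = (u'<?xml version="1.0" encoding="ISO-8859-1"?>'
           u'<ref><title>\u03bc-opioid</title></ref>')
data = to_utf8_bytes(decoded)
root = ET.fromstring(data)
```

With libxml2, data is what would be handed to libxml2.parseDoc.  The
declaration is rewritten because a leftover encoding="ISO-8859-1" would
contradict the UTF-8 bytes and could garble non-ASCII characters.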

So I guess I have two questions:

(1) Am I using the right python tools for this job?  My excellent
python books unfortunately don't cover either unicode or xml in much
depth, so I'm a little uncertain as to whether I'm doing the right
thing.

(2) Is there a way to make libxml2 parse unicode documents?  Do I need
to pass it more information alerting it that it's getting unicode?  

Anyway, thanks very much for your help.  Much appreciated,  


Matt Price	    matt.price at utoronto.ca
History Department, University of Toronto
(416) 978-2094

