
I am new to lxml and a little wobbly on Unicode in Python 2.7. Here is my problem:

1. I take an SGML text with character entities and convert them to UTF-8 characters.
2. I extract note elements from the text with the converted characters.

This works if I use two separate Python scripts. The first goes like this:

    # This Python file uses the following encoding: utf-8
    import entityrules
    filein = open('/users/martin/dropbox/eebo-SGMLfiles/200504/A10675.sgm', 'r')
    fileout = open('/users/martin/dropbox/lxml/A10675.xml', 'w')
    text = filein.read()
    text = entityrules.replaceCharacterEntities(text)
    print >> fileout, text

It takes the SGML file as input, applies "replaceCharacterEntities", and writes the output to a new file. The second script goes like this:

    #encoding = 'utf-8'
    from lxml import etree
    filein ='/users/martin/dropbox/lxml/a10675.xml'
    fileout = open('/users/martin/dropbox/lxml/notesA10675.xml', 'w')
    doc = etree.parse(filein)
    for note in doc.getiterator('note'):
        print >>fileout, etree.tostring(note, with_tail = False, encoding = 'utf-8')

Its input is the file generated by the first script. It converts the input file into an element tree and then iterates through the tree, extracting a particular element.
Now it should be possible to combine those two scripts in a single script as follows:

    # This Python file uses the following encoding: utf-8
    import entityrules
    from lxml import etree

    #converting the SGML character entities to UTF-8 characters
    filein = open('/users/martin/dropbox/eebo-SGMLfiles/200504/A10675.sgm', 'r')
    fileout = open('/users/martin/dropbox/lxml/notesA10675.xml', 'w')
    text = filein.read()
    text = entityrules.replaceCharacterEntities(text)
    print "I finished the first part of this script"

    #creating an element tree and extracting instances of a particular element
    doc = etree.parse(text)
    for note in doc.getiterator('note'):
        print >>fileout, etree.tostring(note, with_tail = False, encoding = 'utf-8')

This script produces the following output:

    Martin-Muellers-Mac-Pro:lxml martin$ python getNotes.py
    I finished the first part of this script
    Traceback (most recent call last):
      File "getNotes.py", line 14, in <module>
        doc = etree.parse(text)
      File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
      File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82303)
      File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82596)
      File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81635)
      File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78544)
      File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
      File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
      File "parser.pxi", line 584, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74650)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1126: ordinal not in range(128)

The script does the first part of the job correctly, as is apparent from the first line of the output.
And the "text" variable that is passed to the etree function should in principle be identical to the file content that the second script reads from disk. But it isn't, and somewhere along the road Python seems to have forgotten that we're dealing with Unicode texts. What am I doing wrong? Or is there a bug? I am using the latest Enthought version of Python, which comes bundled with lxml.

Martin Mueller
Professor of English and Classics
Northwestern University