Here is a Unicode question in foo/bär terms. You can change the text of an element from 'bär' to 'bar'. Why can't you change it from 'bar' to 'bär'? Or what does it take to do it?

Here is a minimal file demonstrating the problem:

    # This Python file uses the following encoding: utf-8
    from lxml import etree
    import re

    text = """<root>
    <w>foo</w>
    <w>bär</w>
    </root>"""

    tree = etree.XML(text)
    for element in tree.iter('w'):
        if element.text.encode('utf-8') =='bär':
            print etree.tostring(element, encoding = 'utf-8')

In this first iteration of the script, the text 'bär' passes properly through the 'fromstring' and 'tostring' routines.

In the next iteration of the script, I want to change 'bär' to 'bar'. That works:

    for element in tree.iter('w'):
        if element.text.encode('utf-8') =='bär':
            element.text = 'bar'
            print etree.tostring(element, encoding = 'utf-8')

That script returns the expected '<w>bar</w>'.

But if I start with '<w>bar</w>' and want to change it to '<w>bär</w>', the script doesn't work:

    for element in tree.iter('w'):
        if element.text.encode('utf-8') =='bar':
            element.text = 'bär'
            print etree.tostring(element, encoding = 'utf-8')

The script triggers the following error message:

    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I've tried various "recognize Unicode" routines that work in other contexts, but they don't work here. What am I missing?

Martin Mueller
Professor of English and Classics
Northwestern University
On Tue, 2012-12-18 at 16:27 +0000, Martin Mueller wrote:
    for element in tree.iter('w'):
        if element.text.encode('utf-8') =='bar':
            element.text = 'bär'
            print etree.tostring(element, encoding = 'utf-8')
The third line there must be:

    element.text = u'bär'

to indicate that it is a Unicode string.

Jamie
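As a minimal sketch of the fix described above: in a Python 2 source file saved as UTF-8, a plain 'bär' literal is a byte string containing non-ASCII bytes, which is what lxml rejects with the ValueError, whereas u'bär' is Unicode text. The example below assigns Unicode text to element.text and encodes only when serializing. The names `texts` and `serialized` are illustrative, and the fallback import of the standard-library ElementTree is an assumption added so the sketch runs where lxml is not installed; it is not part of the thread. On Python 3 the u prefix is accepted but redundant.

```python
# -*- coding: utf-8 -*-
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the standard-library ElementTree behaves the same for this demo.
    import xml.etree.ElementTree as etree

tree = etree.XML(u"<root><w>foo</w><w>bar</w></root>")

for element in tree.iter('w'):
    if element.text == u'bar':
        # Assign Unicode text, not a UTF-8 encoded byte string.
        element.text = u'bär'

# Collect the element texts (still Unicode inside the program).
texts = [element.text for element in tree.iter('w')]

# Encoding to UTF-8 bytes happens once, at serialization time.
serialized = etree.tostring(tree, encoding='utf-8')

print(texts)
print(serialized.decode('utf-8'))
```

Replacing u'bär' with a UTF-8 byte string on Python 2 would be expected to reproduce the error from the original post.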
Hi,

Martin Mueller, 18.12.2012 17:27:

    for element in tree.iter('w'):
        if element.text.encode('utf-8') =='bär':
In addition to what was said already, it's worth mentioning that this is the wrong way to do this comparison. Instead, use Unicode text in your program and encode/decode data only on the way in and out. In the case of XML, the parser and serialiser do this for you, so all you have to do in the code above is to compare element.text to u'bär'.

Stefan
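The advice above (Unicode inside the program, bytes only at the boundaries) might look like the following sketch. It parses UTF-8 bytes, compares element.text directly against a Unicode literal with no .encode() call, and lets the serialiser produce bytes at the end. The names `raw`, `matches`, and `output` are illustrative, and the fallback to the standard-library ElementTree is an assumption for environments without lxml.

```python
# -*- coding: utf-8 -*-
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the standard-library ElementTree behaves the same for this demo.
    import xml.etree.ElementTree as etree

# Bytes come in: the parser decodes them (UTF-8 is the XML default).
raw = u"<root><w>foo</w><w>bär</w></root>".encode('utf-8')
tree = etree.XML(raw)

# Inside the program, compare Unicode text to Unicode text.
matches = [el for el in tree.iter('w') if el.text == u'bär']
for el in matches:
    el.text = u'bar'

# Bytes go out: the serialiser encodes once, on the way out.
output = etree.tostring(tree, encoding='utf-8')

print(len(matches))
print(output.decode('utf-8'))
```

Keeping the comparison in Unicode avoids depending on which encoding the source file or the input document happens to use.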
Participants (3): Jamie Norrish, Martin Mueller, Stefan Behnel