On Wed, 2014-01-29 at 08:37 +0100, Stefan Behnel wrote:
Sérgio Basto, 29.01.2014 08:21:
On Wed, 2014-01-29 at 08:08 +0100, Stefan Behnel wrote:
Sérgio Basto, 28.01.2014 18:43:
When stringxpath is a plain string, we need to make sure it is treated as UTF-8. Python's encode and decode work in the opposite direction from what I expected, so the call to use is decode. I got stringxpath working with "é" like this:
stringxpath = '//div[@id="México"]'
hparser = lxml.html.HTMLParser(encoding=pcoding, remove_comments=True)
html_document = lxml.html.fromstring(content, parser=hparser)
html_document.xpath(stringxpath.decode('utf-8'))
1) this has nothing to do with the topic of this thread.
2) this is completely the wrong way to do this.
Actually, lxml should reject the XPath expression as invalid byte string input
And it does, I just checked.
In [1]: stringxpath = '//div[@id="México"]'
In [3]: stringxpath.decode('utf-8')
Out[3]: u'//div[@id="M\xe9xico"]'
This is not byte string input, or maybe I don't understand.
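To make the direction of the two calls concrete, here is a small sketch. It assumes the literal was typed in a UTF-8 file or terminal, so the "é" is stored as the two bytes \xc3\xa9; the names xpath_bytes, xpath_text and round_trip are illustrative only. It should run unchanged on Python 2.7 and on Python 3.3 or later.

```python
# Assumption: the expression was entered as UTF-8, so "é" is the two
# bytes 0xC3 0xA9 inside the byte string.
xpath_bytes = b'//div[@id="M\xc3\xa9xico"]'

# decode(): byte string -> Unicode text
xpath_text = xpath_bytes.decode('utf-8')

# encode(): Unicode text -> byte string (the opposite direction)
round_trip = xpath_text.encode('utf-8')

print(len(xpath_bytes))  # 20 (bytes)
print(len(xpath_text))   # 19 (characters)
```

The decoded value is the Unicode string shown in Out[3] above, which is why it is not byte string input.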
Sorry, my fault. I misread the "decode()" as "encode()", because I didn't see why you would *decode* an obvious Unicode string.
The right way to do this is to say
stringxpath = u'//div[@id="México"]'
Hmm, thanks. BTW, with Python 2.7, do you know how I can convert '//div[@id="México"]' to u'//div[@id="México"]'? Thanks for your reply.
I.e. with a "u" prefix to make it a Unicode string in Py2.x.
In any case, passing Unicode strings (at least for anything that's not plain ASCII text in Py2.x) is totally the right thing to do. Sorry for the confusion.
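When the expression does not come from a literal (for example, it is read from a file or the command line), the "u" prefix is not available. A minimal sketch of one way to handle that case follows. The helper name ensure_text is hypothetical and not part of lxml, and the UTF-8 default is an assumption that the caller has to confirm for their own input.

```python
def ensure_text(value, encoding='utf-8'):
    """Return Unicode text for either a byte string or a Unicode string.

    Hypothetical helper, not an lxml API. The default encoding is an
    assumption; pass the real encoding of the input if it differs.
    """
    if isinstance(value, bytes):   # 'str' on Python 2, 'bytes' on Python 3
        return value.decode(encoding)
    return value                   # already Unicode, leave untouched


# A byte string (UTF-8 encoded "é") becomes Unicode text ...
from_bytes = ensure_text(b'//div[@id="M\xc3\xa9xico"]')
# ... and a Unicode string passes through unchanged.
from_text = ensure_text(u'//div[@id="M\xe9xico"]')
```

With the names used earlier in this thread, the call would then look like html_document.xpath(ensure_text(stringxpath)).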
Stefan
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml
-- Sérgio M. B.