[lxml-dev] lxml.html.submit_form and unicode values
Hi,

Slowly but surely learning more about lxml.html by using it to do some scraping. I encountered a unicode problem trying to submit the following form.

    <form name="Lien1" method="POST" action="http://recherche2.assemblee-nationale.fr/resultats_tribun.jsp" id="Lien1">
      <input type="hidden" name="id_auteur" value="Aboud Élie">
      <input type="hidden" name="nom_auteur" value="Élie Aboud">
      <input type="hidden" name="legislature" value="13">
      <input type="hidden" name="typedoc" value="Questions">
    </form>

Which can be found under the Questions link of http://www.assemblee-nationale.fr/13/tribun/fiches_id/267457.asp#P3

=====
UnicodeEncodeError                        Traceback (most recent call last)

/Users/eugene/Documents/Dev/parlorama/code/<ipython console> in <module>()

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml/html/__init__.pyc in submit_form(form, extra_values, open_http)
    819     if open_http is None:
    820         open_http = open_http_urllib
--> 821     return open_http(form.method, form.action, values)
    822
    823 def open_http_urllib(method, url, values):

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml/html/__init__.pyc in open_http_urllib(method, url, values)
    836         data = None
    837     else:
--> 838         data = urlencode(values)
    839     return urlopen(url, data)
    840

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.pyc in urlencode(query, doseq)
   1267         for k, v in query:
   1268             k = quote_plus(str(k))
-> 1269             v = quote_plus(str(v))
   1270             l.append(k + '=' + v)
   1271     else:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 6: ordinal not in range(128)
=====

I tried to address the problem by encoding the values in the form fields as suggested here: http://mail.python.org/pipermail/tutor/2007-May/054340.html

but in a python shell doing

    >>> form.fields['id_auteur']
    u'Aboud \xc9lie'
    >>> form.fields['id_auteur'] = form.fields['id_auteur'].encode('utf-8')
    [...]
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

Would welcome advice or guidance ... if I want to make urlopen "happy" I am "displeasing" ElementTree :(

Thanks for your help,

-- EuGeNe -- I lend my books on COlivri http://www.colivri.org/user/eugene, do you?
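
For context, here is a minimal Python 3 sketch of why the traceback ends in a UnicodeEncodeError. Python 2's urlencode calls str() on each value, which implicitly encodes unicode text as ASCII. The explicit .encode("ascii") call below is only a stand-in for that implicit step (an assumption made for illustration), and the encoding argument to urlencode exists in Python 3 only, not in the Python 2.6 shown in the traceback.

```python
from urllib.parse import urlencode

value = "Aboud \u00c9lie"  # the id_auteur field from the form

# Python 2's urlencode() did str(value), i.e. an implicit ASCII encode.
# The explicit call below mimics that step and fails the same way.
try:
    value.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Choosing the charset explicitly avoids the error. The bytes that end up
# in the query string depend on the charset chosen.
latin1_query = urlencode([("id_auteur", value)], encoding="iso-8859-1")
utf8_query = urlencode([("id_auteur", value)], encoding="utf-8")

print(ascii_failed)
print(latin1_query)
print(utf8_query)
```

The two query strings differ (one byte for É in ISO-8859-1, two in UTF-8), which is why the charset the server expects matters.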
Hi,

a tutorial on UnicodeEncodeError should be written for codespeak.net/lxml, because it is the most problematic and confusing problem in lxml (even compared with namespaces).

Also, the values should not be encoded as UTF-8, as in your example, but as ISO-8859-1, which is the charset the page declares (<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> at assemblee-nationale.fr/...).

Vojta
Thanks for pointing out my charset mistake (not at the heart of my problem, though).

If you have any expertise with lxml.html + form + unicode, do you think writing an alternate opener and passing it to submit_form using the open_http keyword would be the best way to go about it?

-- EuGeNe
It seems to do the job:

    from lxml.html import open_http_urllib

    def open_http(method, url, values):
        # Encode each value to the page's charset before urlencode() sees it.
        values = [(k, v.encode('ISO-8859-1')) for k, v in values]
        return open_http_urllib(method, url, values)

It may not be the most elegant solution, but it could be useful to someone else.

-- EuGeNe
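
A Python 3 variant of the same idea, as a sketch under stated assumptions: it uses only the standard library, takes the charset as a parameter instead of hard-coding it, and keeps the (method, url, values) signature that submit_form passes to its open_http argument. The names build_request and make_open_http are hypothetical helpers, not part of lxml.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_request(method, url, values, encoding="ISO-8859-1"):
    """Return (url, data) ready for urlopen(); hypothetical helper."""
    # urlencode() encodes text values with the given charset before
    # percent-quoting them.
    query = urlencode(list(values), encoding=encoding)
    if method.upper() == "GET":
        separator = "&" if "?" in url else "?"
        return url + separator + query, None
    # For POST, the quoted query string is pure ASCII at this point.
    return url, query.encode("ascii")


def make_open_http(encoding):
    """Build an opener with the (method, url, values) signature."""
    def open_http(method, url, values):
        target, data = build_request(method, url, values, encoding)
        return urlopen(target, data)
    return open_http
```

With lxml installed, this would be used as submit_form(form, open_http=make_open_http("ISO-8859-1")). Splitting out build_request keeps the encoding step checkable without touching the network.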
Eugene Van den Bulke, 15.09.2010 08:08:
> I encountered a unicode problem trying to submit the following form.
>
> <form name="Lien1" method="POST" action="http://recherche2.assemblee-nationale.fr/resultats_tribun.jsp" id="Lien1">
>   <input type="hidden" name="id_auteur" value="Aboud Élie">
>   <input type="hidden" name="nom_auteur" value="Élie Aboud">
>   <input type="hidden" name="legislature" value="13">
>   <input type="hidden" name="typedoc" value="Questions">
> </form>

Hmm, yes, looks like the form handling code doesn't properly encode the values. That's a bug.

Does anyone know what the correct encoding is for submitting the form? Is it the original encoding of the page?

And: what should happen if the values cannot be encoded? Maybe an explicit encoding option would take care of this case.

Stefan
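
To make the "cannot be encoded" question concrete, here is a small Python 3 sketch of two possible policies, selected through the errors argument of the standard library's urlencode. It illustrates the trade-off only and is not what lxml does. The helper encode_query, the field name ville and its value are made up for the example. As far as I know, browsers take the second route and substitute numeric character references for characters the target charset lacks.

```python
from urllib.parse import urlencode


def encode_query(values, encoding, errors="strict"):
    """Percent-encode form values with an explicit charset and error policy."""
    return urlencode(list(values), encoding=encoding, errors=errors)


# Hypothetical field whose value contains characters outside ISO-8859-1.
pairs = [("ville", "\u0141\u00f3d\u017a")]

# Policy 1: fail loudly and let the caller decide.
try:
    encode_query(pairs, "iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Policy 2: replace unencodable characters with &#NNN; references,
# which are then percent-quoted like any other text.
replaced = encode_query(pairs, "iso-8859-1", errors="xmlcharrefreplace")

print(strict_failed)
print(replaced)
```

An explicit encoding option as suggested above could expose both the charset and the error policy, leaving "strict" as the default so that data is never silently altered.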
participants (3)
- Eugene Van den Bulke
- Stefan Behnel
- Vojtěch Rylko