urlparse.urlparse bug - misparses long URL

John Nagle nagle at animats.com
Fri Dec 14 02:26:43 EST 2007


Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
====
http://www.midamericabank61.com.mx?email_from=gpatti@Tezzaron.com&xUDysvTbzZZOaymjQ2oYIx2AvMdJ1WQfjP02wIBBQBb1EVZAqmmGunxrcyGx1AcfegWUUYtaZfRW434O5Qn6InSMUZXgF5e3KzJbCntBGOj7pv31zab&action=login-run&passkey=e84239c9da59dbeb61d4d45db2cc5840&info_hash=%c9q%be%fe%c6j%ca%fd0%18%fe%23J%bd%89%d3%06L%fdV&info_hash=%18%9d%fb%15v%c0A%1f%c8%dds%0f%17%99%ceQ%83%a0%3e%27&info_hash=%df%f0%1c%5e%d75%b2%7d%e6D%0d%3e%d8%fbZ%5c%de%2ae%93&https://www.midamericabank.com/my_acccounts/default.aspxL0PWSjXev6xlkMTqVKFbLUgrh8CBquCchip4PuQDWYLYpzDGOFkLZyY
====
What we get back in the "accesshost" field (i.e. the domain name) is

====
'www.midamericabank61.com.mx?email_from=gpatti@Tezzaron.com&xUDysvTbzZZOaymjQ2oYIx2AvMdJ1WQfjP02wIBBQBb1EVZAqmmGunxrcyGx1AcfegWUUYtaZfRW434O5Qn6InSMUZXgF5e3KzJbCntBGOj7pv31zab&action=login-run&passkey=e84239c9da59dbeb61d4d45db2cc5840&info_hash=%c9q%be%fe%c6j%ca%fd0%18%fe%23J%bd%89%d3%06L%fdV&info_hash=%18%9d%fb%15v%c0A%1f%c8%dds%0f%17%99%ceQ%83%a0%3e%27&info_hash=%df%f0%1c%5e%d75%b2%7d%e6D%0d%3e%d8%fbZ%5c%de%2ae%93&https:'
====

which is wrong.  Something far out in that URL is breaking urlparse, and it's 
not able to extract the domain name properly.

It's not a Unicode issue; I forced the data to "str" and it still mis-parses.

I'm trying to construct a shorter string that fails.  More to follow.

(Yes, another error associated with the wonderful world of parsing hostile sites 
in Python.  This is from a phishing attack, and that URL is in PhishTank.)

					John Nagle
					SiteTruth


