urllib interpretation of URL with ".."

John Nagle nagle at animats.com
Mon Jun 25 12:42:31 EDT 2007


Duncan Booth wrote:
> "Martin v. Löwis" <martin at v.loewis.de> wrote:
> 
> 
>>>Is "urllib" wrong?

> Section 5.2 is also relevant here. In particular:
> 
> 
>>      g) If the resulting buffer string still begins with one or more
>>         complete path segments of "..", then the reference is
>>         considered to be in error.  Implementations may handle this
>>         error by retaining these components in the resolved path (i.e.,
>>         treating them as part of the final URI), by removing them from
>>         the resolved path (i.e., discarding relative levels above the
>>         root), or by avoiding traversal of the reference.
> 
> 
> The common practice seems to be for client-side implementations to handle 
> this using option 2 (removing them) and servers to use option 3 (avoiding 
> traversal of the reference). urllib uses option 1 which is also correct but 
> not as useful as it might be.

    That's helpful.  Thanks.

    In Python, of course, the "urlparse" module, which is the main tool
for disassembling URLs ("urlparse.urlparse") and resolving relative ones
("urlparse.urljoin"), has no idea whether it's being used by a client or
a server, so it, reasonably enough, takes option 1.
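
    For a client that wants option 2 instead, a minimal sketch follows.
It assumes Python 3, where the "urlparse" module became "urllib.parse".
The helper name "join_dropping_parent_refs" and the example.com URLs are
made up for illustration and are not part of any library. The helper
resolves the reference as usual, then discards any ".." segments still
sitting at the front of the resulting path.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_dropping_parent_refs(base, ref):
    """Resolve ref against base, then drop leading ".." path segments.

    Hypothetical helper illustrating "option 2" from the quoted RFC text:
    relative levels above the root are discarded rather than retained.
    """
    parts = urlsplit(urljoin(base, ref))
    path = parts.path
    # Strip each leading "/.." segment, whether or not the underlying
    # urljoin has already removed it.
    while path == "/.." or path.startswith("/../"):
        path = path[3:] or "/"
    return urlunsplit(parts._replace(path=path))


if __name__ == "__main__":
    # The reference climbs one level above the root of the base URL.
    print(join_dropping_parent_refs("http://example.com/a/", "../../b"))
```

    A server would more likely take option 3 and reject such a
reference outright rather than quietly rewriting it.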

    (Yet another hassle in processing real-world HTML.)

					John Nagle


