urllib interpretation of URL with ".."
John Nagle
nagle at animats.com
Mon Jun 25 12:42:31 EDT 2007
Duncan Booth wrote:
> "Martin v. Löwis" <martin at v.loewis.de> wrote:
>
>
>>>Is "urllib" wrong?
> Section 5.2 is also relevant here. In particular:
>
>
>> g) If the resulting buffer string still begins with one or more
>> complete path segments of "..", then the reference is
>> considered to be in error. Implementations may handle this
>> error by retaining these components in the resolved path (i.e.,
>> treating them as part of the final URI), by removing them from
>> the resolved path (i.e., discarding relative levels above the
>> root), or by avoiding traversal of the reference.
>
>
> The common practice seems to be for client-side implementations to handle
> this using option 2 (removing them) and servers to use option 3 (avoiding
> traversal of the reference). urllib uses option 1 which is also correct but
> not as useful as it might be.
That's helpful. Thanks.

In Python, of course, the "urlparse" module (where "urlparse.urlparse"
disassembles a URL and "urlparse.urljoin" resolves relative references)
has no idea whether it's being used by a client or a server, so it,
reasonably enough, takes option 1.
(Yet another hassle in processing real-world HTML.)
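
For client-side code that wants option 2 instead, here is a minimal sketch that drops any ".." segments left at the start of an already-resolved path. It assumes Python 3, where the old "urlparse" module lives at "urllib.parse". The helper name "strip_leading_dotdot" and the example.com URLs are made up for illustration and are not part of any library. Newer Python releases may already discard such segments inside urljoin, so check what your version does before adding this.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_leading_dotdot(url):
    """Hypothetical helper: RFC option 2, discard '..' levels above the root."""
    parts = urlsplit(url)
    if not parts.path.startswith("/"):
        # Relative or empty path: this sketch leaves it alone.
        return url
    segments = parts.path[1:].split("/")
    # Remove only the ".." segments that sit directly under the root;
    # any ".." deeper in the path is left for the resolver to handle.
    while segments and segments[0] == "..":
        segments.pop(0)
    return urlunsplit(parts._replace(path="/" + "/".join(segments)))


print(strip_leading_dotdot("http://example.com/../../b/c.html"))
# -> http://example.com/b/c.html
```

A server would more likely take option 3 and reject such a URL outright rather than rewrite it.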
John Nagle