[Python-Dev] Fwd: RFC - GoogleSOC proposal -cleanupurllib

Mike Brown mike at skew.org
Sat Mar 24 22:48:04 CET 2007

Senthil Kumaran wrote:
> I have written a proposal to cleanup urllib as part of Google SoC. I am
> attaching the file 'soc1' with this email. Requesting you to go through the
> proposal and provide any feedback which I can incorporate in my submission.

From your proposal:

> 2) In all modules, Follow the new RFC 2396 in favour of RFC 1738 and RFC 1808.
> [...]
> In all modules, follow the new RFC 2396 in favor of RFC 1738, RFC 1808. The
> standards for URI described in RFC 2396 is different from older RFCs and
> urllib, urllib2 modules implement the URL specifications based on the older
> URL specification. This will need changes in urlparse and other parse 
> modules to handle URLS as specified in the RFC2396.

The "new" RFC 2396 was superseded by STD 66 (RFC 3986) two years ago. Your
failure to notice this development doesn't bode well :) j/k, although it does
undermine confidence somewhat.

I think the bugfixes sound great, but major enhancements and API refactorings
need to be undertaken more cautiously.

In any case, I have a few suggestions:

- Read http://en.wikipedia.org/wiki/Uniform_Resource_Identifier.
  (I wrote the majority of it, and got peer review from the URI WG a while
  ago.)

- Read http://en.wikipedia.org/wiki/Percent_encoding.
  (I wrote most of this too).

- Familiarize yourself with STD 66. (i.e., don't trust anything I wrote ;))
  Especially note its differences from RFC 2396 (summarized in an appendix).
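
One concrete way into STD 66 is its Appendix B, which gives a regular
expression for splitting a URI reference into five components. The sketch
below wraps that expression in Python 3. The names URI_RE, UriParts and
split_uri are made up for illustration; this is not how urlparse works,
and it does no validation at all.

```python
import re
from collections import namedtuple

# Regular expression taken from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Hypothetical container for the five generic components.
UriParts = namedtuple("UriParts", "scheme authority path query fragment")


def split_uri(uri_ref):
    """Split a URI reference into components; absent ones are None."""
    m = URI_RE.match(uri_ref)
    # Groups 2, 4, 5, 7 and 9 hold scheme, authority, path, query and
    # fragment, as described alongside the expression in Appendix B.
    return UriParts(m.group(2), m.group(4), m.group(5),
                    m.group(7), m.group(9))


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
```

Note that an absent component (None) is kept distinct from an empty one:
split_uri("http://a/b?").query is the empty string, not None.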

- Seek peer review for any changes that you attribute to changing standards.

In my experience implementing a general-purpose URI processing library
(http://cvs.4suite.org/viewcvs/4Suite/Ft/Lib/Uri.py?view=markup ),
there were times when I thought the standard was saying a bit more than it
really was, especially when it came to percent-encoding, which has several
somewhat-conflicting conventions and standards governing it. I tried to
cover these in the Wikipedia article.
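
As one example of such conflicting conventions (chosen here for
illustration, the message above does not list them): generic URI syntax
writes a space as %20, while the application/x-www-form-urlencoded
convention used for HTML form data writes it as "+". In Python 3 the two
show up as quote/unquote versus quote_plus/unquote_plus in urllib.parse.
A minimal sketch:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b+c"

# Generic percent-encoding: space becomes %20, a literal "+" becomes %2B.
generic = quote(text, safe="")

# Form-style encoding: space becomes "+", a literal "+" still becomes %2B.
form = quote_plus(text)

# Decoding the same string under each convention gives different results.
as_generic = unquote("a+b")      # "+" is left alone
as_form = unquote_plus("a+b")    # "+" is read as a space
```

Which decoder is right depends on where the string came from, which is
the kind of thing no single standard settles.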

- Anticipate real-world use cases. If you go down the road of doing what
the standards recommend (be aware of "should" vs "must" and whether
it's directed at URI producers or consumers), you might lose sight of the
fact that there's a reason, for example, people use encodings other than
the recommended UTF-8 as the basis for percent-encoding. Similarly,
expectations surrounding the behavior of 'file' URIs and path-portions
thereof are sometimes less than optimal in the real world. If you're
designing an API, be flexible, and seek review for any incompatibilities
you intend to introduce.
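
To illustrate the encoding point: the same character percent-encodes
differently depending on the character encoding used as the basis, and a
consumer that assumes UTF-8 will misread the other form. This is a sketch
using Python 3's urllib.parse, where quote and unquote accept an encoding
argument; the sample string is arbitrary.

```python
from urllib.parse import quote, unquote

name = "caf\u00e9"  # "café"

# UTF-8 basis (the recommended one): two octets for the last character.
utf8_form = quote(name)

# ISO-8859-1 basis: one octet for the same character.
latin1_form = quote(name, encoding="latin-1")

# Decoding the Latin-1 form as UTF-8 (the default) does not raise; the
# invalid octet is replaced with U+FFFD.
misread = unquote(latin1_form)

# Decoding with the encoding the producer actually used round-trips.
recovered = unquote(latin1_form, encoding="latin-1")
```

An API that hard-codes UTF-8 cannot express the last line, which is one
argument for keeping the encoding a parameter.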

- Be aware of the fact that people might have different expectations when
they use different string types (unicode, str) in URI processing, and
different levels of awareness of the levels of abstraction at which URI
processing operates. It can be difficult to uniformly handle unicode and str.
And then there's IRIs (RFC 3987)...
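
A rough sketch of both points, written for Python 3 (where the old
unicode/str pair became str/bytes). urlsplit and urljoin are real
functions in urllib.parse; iri_to_uri is a hypothetical helper that only
does the basic RFC 3987 mapping of non-ASCII characters to percent-encoded
UTF-8 octets. It ignores the host part, which may need separate (IDNA)
treatment, so treat it as an illustration rather than a conforming
converter.

```python
from urllib.parse import urljoin, urlsplit


def iri_to_uri(iri):
    """Percent-encode each non-ASCII character as its UTF-8 octets."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII is passed through untouched
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


uri = iri_to_uri("http://example.org/caf\u00e9?q=\u00fc")

# The result type follows the argument type...
text_path = urlsplit("http://example.org/a").path
bytes_path = urlsplit(b"http://example.org/a").path

# ...and mixing the two types is rejected rather than guessed at.
try:
    urljoin("http://example.org/", b"a")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Refusing to mix types is one possible policy; silently decoding bytes
with an assumed encoding is another, and callers may expect either.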

For additional background, you might also check the python-dev discussion
of urllib in Sep 2004, urlparse in Nov 2005, and the competing uriparse.py
proposals (Apr, Jun 2006).

