[Python-Dev] bytes / unicode

Wed Jun 23 18:11:05 CEST 2010

Tres Seaver <tseaver at palladion.com> wrote:

> Stephen J. Turnbull wrote:
> 
> > We do need str-based implementations of modules like urllib.
> 
> Why would that be?  URLs aren't text, and never will be.  The fact that
> to the eye they may seem to be text-ish doesn't make them text.  This

URLs are exactly text (strings, representable as Unicode strings in
Py3K), and were designed as such from the start.  The fact that some of
the things tunneled or carried in URLs are string representations of
non-string data shouldn't obscure that point.  They're not "text-ish",
they're text.  They're not opaque, either; they break down in
well-specified ways, mainly into strings.

The trouble comes in when we try to go beyond the spec, or handle things
that don't conform to the spec.  Sure, a path component of a URI might
actually be a %-escaped sequence of arbitrary bytes, even bytes that
don't represent a string in any known encoding, but that's only *after*
reversing the %-escapes, which should happen in a scheme-specific piece
of code, not in generic URL parsing or manipulation.

Bill