[Python-Dev] urllib.quote and unquote - Unicode issues

André Malo nd at perlig.de
Sun Jul 13 20:54:52 CEST 2008


* Matt Giuca wrote:

> > This POV is way too browser-centric...
>
> This is but one example. Note that I found web forms to be the least
> clear-cut example of choosing an encoding. Most of the time applications
> seem to be using UTF-8, and all the standards I have read are moving
> towards specifying UTF-8 (from being unspecified). I've never seen a
> standard specify or even recommend Latin-1.

Ahem. The HTTP standard does ;-) (RFC 2616 makes ISO-8859-1 the default
charset for text media types.)

> Where web forms are concerned, basically setting the form accept-charset
> or the page charset is the *maximum amount* of control you have over the
> encoding. As you say, it can be encoded by another page or the user can
> override their settings. Then what can you do as the server? Nothing ...

Guessing works pretty well in most of the cases.
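
One way such guessing could look, sketched in Python 3 for illustration
only. The helper name guess_unquote is made up, and this is an assumption
about a reasonable heuristic, not a description of what any particular
server does: try UTF-8 first, since text in a legacy encoding is rarely
valid UTF-8 by accident, then fall back to Latin-1, which cannot fail.

```python
from urllib.parse import unquote_to_bytes


def guess_unquote(quoted):
    """Percent-decode `quoted`, guessing the character encoding.

    Hypothetical helper: strict UTF-8 first, Latin-1 as the fallback.
    """
    raw = unquote_to_bytes(quoted)  # the octets the URI actually carries
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every octet, so this never raises.
        return raw.decode("latin-1")


from_utf8 = guess_unquote("%C3%A9")   # e-acute sent as UTF-8
from_latin1 = guess_unquote("%E9")    # e-acute sent as Latin-1
```

Both calls yield the same one-character string, although the guess can
still be wrong for Latin-1 input that happens to form valid UTF-8.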

> Exactly. This is exactly my point - Latin-1 is arbitrary from a standards
> point of view. It's just one of the many legacy encodings we'd like to
> forget. The UTFs are the only options which support all languages, and
> UTF-8 is the only ASCII-compatible (and therefore URI-compatible)
> encoding. So we should aim to support that as the default.

Latin-1 is not exactly arbitrary. Besides being a charset, it maps
one-to-one to octet values, so it is commonly used to encode octets and
is therefore a better fallback than any other encoding.
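
A small Python 3 check of the one-to-one property, as a sketch (the
variable names are only illustrative): all 256 octet values survive a
Latin-1 round trip unchanged, while UTF-8 rejects some octet sequences
outright.

```python
# Every octet 0..255 decodes under Latin-1 to the code point with the
# same number, and encodes back to the same octet.
all_octets = bytes(range(256))
as_text = all_octets.decode("latin-1")
code_points = [ord(ch) for ch in as_text]
round_trip = as_text.encode("latin-1")

# UTF-8 is not total over octets: a lone 0xE9 is not a valid sequence.
try:
    b"\xe9".decode("utf-8")
    lone_e9_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_e9_is_valid_utf8 = False
```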

> I agree. However if there *was* a proper standard we wouldn't have to
> argue! "Most proper" and "should do" is the most confident we can be when
> dealing with this standard, as there is no correct encoding.

Well, the standard says there are octets to be encoded. I find that proper
enough.

> Does anyone have a suggestion which will be more compatible with the rest
> of the world than allowing the user to select an encoding, and defaulting
> to "utf-8"?

Default to latin-1 for decoding and utf-8 for encoding. This might be 
confusing though, so maybe you've asked the wrong question ;)
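
To make the asymmetry concrete, here is a sketch using the encoding
parameter of quote and unquote as found in Python 3's urllib.parse. The
two wrapper names are hypothetical, chosen only for this example.

```python
from urllib.parse import quote, unquote


def decode_component(quoted):
    # Hypothetical wrapper: decoding defaults to Latin-1, so it never fails.
    return unquote(quoted, encoding="latin-1")


def encode_component(text):
    # Hypothetical wrapper: encoding defaults to UTF-8.
    return quote(text, encoding="utf-8")


decoded = decode_component("%E9")       # one character, U+00E9
re_encoded = encode_component(decoded)  # two escapes, not the original one
mojibake = decode_component(re_encoded) # two characters, U+00C3 U+00A9
```

The round trip is not the identity: "%E9" comes back as "%C3%A9", and
decoding that again with Latin-1 gives two characters instead of one,
which is the confusing part.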

nd
-- 
Real programmers confuse Christmas and Halloween because
DEC 25 = OCT 31.  -- Unknown

                                      (found in ssl_engine_mutex.c)

