[Python-Dev] urllib.quote and unquote - Unicode issues
André Malo
nd at perlig.de
Sun Jul 13 20:54:52 CEST 2008
* Matt Giuca wrote:
> > This POV is way too browser-centric...
>
> This is but one example. Note that I found web forms to be the least
> clear-cut example of choosing an encoding. Most of the time applications
> seem to be using UTF-8, and all the standards I have read are moving
> towards specifying UTF-8 (from being unspecified). I've never seen a
> standard specify or even recommend Latin-1.
Ahem. The HTTP standard does ;-)
> Where web forms are concerned, basically setting the form accept-charset
> or the page charset is the *maximum amount* of control you have over the
> encoding. As you say, it can be encoded by another page or the user can
> override their settings. Then what can you do as the server? Nothing ...
Guessing works pretty well in most cases.
> Exactly. This is exactly my point - Latin-1 is arbitrary from a standards
> point of view. It's just one of the many legacy encodings we'd like to
> forget. The UTFs are the only options which support all languages, and
> UTF-8 is the only ASCII-compatible (and therefore URI-compatible)
> encoding. So we should aim to support that as the default.
Latin-1 is not exactly arbitrary. Besides being a charset, it maps
one-to-one to octet values; hence it's commonly used to encode octets and
is therefore a better fallback than any other encoding.
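
A minimal sketch of the one-to-one property, assuming Python 3 bytes/str
semantics (the variable names are illustrative only and not part of any
proposal in this thread):

```python
# Latin-1 maps each octet 0..255 to the code point with the same number,
# so decoding arbitrary bytes never fails and always round-trips.
all_octets = bytes(range(256))
text = all_octets.decode("latin-1")

same_numbers = [ord(ch) for ch in text] == list(range(256))
round_trips = text.encode("latin-1") == all_octets

# UTF-8, by contrast, rejects many octet sequences outright.
try:
    b"\xe9".decode("utf-8")
    utf8_accepts_lone_e9 = True
except UnicodeDecodeError:
    utf8_accepts_lone_e9 = False
```

This is why Latin-1 can act as a lossless carrier for unknown octets, while
a strict UTF-8 decode may raise on the same input.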
> I agree. However if there *was* a proper standard we wouldn't have to
> argue! "Most proper" and "should do" is the most confident we can be when
> dealing with this standard, as there is no correct encoding.
Well, the standard says there are octets to be encoded. I find that proper
enough.
> Does anyone have a suggestion which will be more compatible with the rest
> of the world than allowing the user to select an encoding, and defaulting
> to "utf-8"?
Default to latin-1 for decoding and utf-8 for encoding. This might be
confusing though, so maybe you've asked the wrong question ;)
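
A rough sketch of what that asymmetric default would look like in practice.
It assumes the `encoding` parameters of `urllib.parse.quote` and
`urllib.parse.unquote` as they exist in current Python 3, which postdates
this thread; the sample string is made up:

```python
from urllib.parse import quote, unquote

original = "caf\u00e9"

# Encode with UTF-8: the non-ASCII character becomes two percent-escaped octets.
encoded = quote(original, encoding="utf-8")

# Decode with Latin-1: never fails, but yields one character per octet.
decoded_latin1 = unquote(encoded, encoding="latin-1")

# Because Latin-1 is lossless, the octets can still be reinterpreted later.
recovered = decoded_latin1.encode("latin-1").decode("utf-8")
```

The Latin-1 result is not the original text, which is the potentially
confusing part, but no information is lost: re-encoding it as Latin-1 gives
back the exact octets that were in the URL.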
nd
--
Real programmers confuse Christmas and Halloween because
DEC 25 = OCT 31. -- Unknown
(found in ssl_engine_mutex.c)