[Baypiggies] urllib.urlencode and encoding
Shannon -jj Behrens
jjinux at gmail.com
Thu Apr 19 20:22:57 CEST 2007
On 4/19/07, Tung Wai Yip <tungwaiyip at yahoo.com> wrote:
> > On Apr 18, 2007, at 5:08 PM, Keith Dart ♂ wrote:
> >> > When a new URI scheme defines a component that represents textual
> >> > data consisting of characters from the Universal Character Set
> >> > [UCS], the data should first be encoded as octets according to the
> >> > UTF-8 character encoding [STD63]; then only those octets that do
> >> > not correspond to characters in the unreserved set should be
> >> > percent-encoded. For example, the character A would be represented
> >> > as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
> >> > represented as "%C3%80", and the character KATAKANA LETTER A would
> >> > be represented as "%E3%82%A2".
> Thanks Keith for the heads up. One issue I regularly have is tracking down
> the lineage of RFCs. When I find RFC X, I am often not aware of an RFC Y
> that supersedes it. It doesn't help that historically many documents
> point to RFC X, while RFC X itself has no link to RFC Y. Try following
> the link at the bottom of the urlparse module documentation: it does not
> lead to RFC 3986.
> On Wed, 18 Apr 2007 21:15:34 -0700, David Reid <dreid at dreid.org> wrote:
> > The key piece of information here is "When a new URI scheme": the RFC
> > (AFAICT) makes no mention of what to do about old schemes, such as
> > HTTP. In fact the HTML4 spec makes its own claims about %-encoding
> > as a result of form submission:
> > http://www.w3.org/TR/html4/interact/forms.html
> > accept-charset = charset list [CI]
> > This attribute specifies the list of character encodings for
> > input data that is accepted by the server processing this form. The
> > value is a space- and/or comma-delimited list of charset values. The
> > client must interpret this list as an exclusive-or list, i.e., the
> > server is able to accept any single character encoding per entity
> > received.
> > The default value for this attribute is the reserved string
> > "UNKNOWN". User agents may interpret this value as the character
> > encoding that was used to transmit the document containing this FORM
> > element.
> > So I think it's still incorrect for urllib to make any such
> > assumptions as to the data being UTF-8. (Though I hope it won't be in
> > the future.)
> > -David
> > http://dreid.org
> I think RFC 3986 says a character should be encoded in UTF-8 only if it is
> from the UCS. But it is also legitimate to use other character sets, for
> example as in the HTML4 spec David has pointed out. Say you are writing a
> screen scraper for a Japanese website: you should use the character
> encoding the website expects, which is not necessarily UTF-8.
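To make that point concrete, here is a rough sketch of how the same
character percent-encodes differently depending on the character encoding
chosen. It uses Python 3's urllib.parse.quote and unquote; picking
Shift_JIS as the encoding a Japanese site might expect is only an
assumption for illustration.

```python
from urllib.parse import quote, unquote

text = "\u30a2"  # KATAKANA LETTER A, the RFC 3986 example character

# Percent-encode the octets produced by two different character encodings.
utf8_form = quote(text.encode("utf-8"))
sjis_form = quote(text.encode("shift_jis"))  # assumed site encoding

print(utf8_form)
print(sjis_form)

# The receiver only recovers the text if it decodes with the same encoding.
round_trip = unquote(sjis_form, encoding="shift_jis")
print(round_trip == text)
```

The two encoded forms differ, so whoever builds the query string has to
know which encoding the receiving site expects.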
OK, thanks for all your comments, guys. David, thanks for the RFC
quotes. If I understand things correctly, because the rest of my page
is all working correctly using UTF-8, I can .encode('UTF-8')
parameters before passing them to urlencode. However, it doesn't make
sense to put that .encode inside urlencode, since urlencode can't know
which encoding the receiving site expects.
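A minimal sketch of what I mean, written against Python 3, where the
function lives at urllib.parse.urlencode rather than urllib.urlencode.
The parameter names and values are made up; the values are the three
characters from the RFC example quoted above.

```python
from urllib.parse import urlencode

# Hypothetical form parameters, for illustration only.
params = {"a": "A", "grave": "\u00c0", "kana": "\u30a2"}

# Encode each text value to UTF-8 octets first; urlencode then
# percent-encodes whatever octets it is handed.
encoded = {key: value.encode("utf-8") for key, value in params.items()}
query = urlencode(encoded)

print(query)  # a=A&grave=%C3%80&kana=%E3%82%A2
```

The caller picks the encoding, and urlencode stays encoding-agnostic.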
> Welcome to the Tower of Babel!
I was reading <http://www.mozilla.org/docs/web-developer/faq.html#accept>
the other day, and I was pondering the fact that we can't even agree
on versions of HTML. Mozilla *still* recommends HTML 4.01 over XHTML.
Since HTML is a language used to transport content, I recognized that
this too was a case of the Tower of Babel. Upon realizing this, in my
head, I heard a little voice say, "Gotcha!"
"'Software Engineering' is something of an oxymoron. It's very
difficult to have real engineering before you have physics, and there
isn't anything even close to a physics for software." -- L. Peter