[Web-SIG] parsing of urlencoded data and Unicode

Tue Jul 29 23:02:10 CEST 2008

On Tue, Jul 29, 2008 at 4:04 PM, James Y Knight <foom at fuhm.net> wrote:
>> So it is MIME, right?
>
> No: RFC 2388 says it is MIME, but in real life it is not. RFC 2388 is wrong.

I think this is a problem of semantics: what do you mean by "wrong"?

The RFC is not wrong in the sense of containing a technical inaccuracy
or needing an erratum.  By the way, no errata have been issued so far:
http://www.rfc-editor.org/errata_search.php?rfc=2388

It may be "wrong" only in the sense that the authors of software
ignore it.  I'd tend to use a different, less misleading term though.

I think it more appropriate to call the software which purports to adhere
to HTTP 1.1 (and hence its dependent specs like RFC 2388) "wrong".

>> Now you can successfully argue that many user agents do not
>> follow the RFC carefully enough.  But that's not a problem with
>> the RFC itself.
>
> Common practice is by now long established, and cannot simply be changed 10
> years after the fact to conform to what the standard says it should've been.

I'm not so sure.  Granted, this is a problem for the browser guys and
not for us Python people.

Regarding timelines: the multipart/form-data RFC 2388 was written in 1998.
HTTP 1.1 came after that.  Both of these specs are around 10 years old,
while most browsers today are in fact the newcomers, not the other way
around.  The RFC isn't trying to rewrite facts; it came first.

I'm sure there are lots of other places where browsers today do not
adhere to the RFC specs; so do we say the specs are wrong or that the
browsers have bugs?  (I'm not talking about W3C stuff; that's clearly
not as straightforward as RFCs.)

> Therefore, it *is* now a problem with the standard: the standard is wrong.
> If you follow it, you're going to create totally broken software.

I don't think we're there.  Although many real-world browsers may
not conform strictly to the RFC, I fail to see why that means that
the server can't conform in this case.

I just don't see "totally broken" as an inevitable outcome.

> For instance, treating form posts as being 7bit unless they have a
> Content-Transfer-Encoding. The RFC says you should do that.

Um, no.  HTTP 1.1 specifically grants an exemption to that 7-bit
restriction in MIME.

The tricky part is that with web software you're dealing with a whole
bunch of standards, and even ignoring W3C stuff, there's even
a whole bunch of RFCs.  Sometimes one RFC will override part
of another; and that's what the HTTP RFC does to the MIME RFCs.

Yes, it's confusing and prone to interpretation errors.

> But it's an
> absolutely nonsensical thing to do. Your code would not work with any
> existing web browser if you did. Or, if you're writing a web browser: don't
> even think of using Content-Transfer-Encoding to encode your response.

Again, the RFCs already account for that.  In web software, the
primary RFC is the HTTP 1.1 spec, not the MIME spec.  This can
be confusing because HTTP borrows, say, 90% of MIME but
overrides other parts of it.

So I guess in a pedantic way, yes, this is not strictly "MIME".  If it
were, you'd be dealing with email, not web.  But inasmuch as it's
the parts of MIME that the HTTP spec says to use, it is still MIME.

And the parts we're dealing with (the multipart/form-data type,
and what to do with the presence or absence of content-type
headers on the subparts), well, that is pretty explicitly stated.
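
To make the subpart point concrete, here is a rough sketch of reading
each part's declared charset (or noticing its absence) without guessing.
It assumes present-day Python 3 and leans on the standard library's
email parser purely as a convenient MIME multipart reader; the function
name parse_form_parts is made up for illustration and is not an existing
API.  Note that the parts carry raw 8-bit data with no
Content-Transfer-Encoding, as HTTP allows.

```python
import email


def parse_form_parts(content_type, body):
    """Return a list of (name, declared_charset, raw_bytes) tuples.

    content_type is the request's Content-Type header value (str),
    body is the raw request body (bytes).  Hypothetical helper.
    """
    # Prepend the Content-Type header so the parser sees a whole entity.
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    msg = email.message_from_bytes(header + body)
    if not msg.is_multipart():
        raise ValueError("body is not a multipart entity")

    parts = []
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition")
        # None when the subpart declares no charset; no sniffing is done.
        charset = part.get_content_charset()
        # decode=True undoes any transfer encoding and yields bytes.
        raw = part.get_payload(decode=True)
        parts.append((name, charset, raw))
    return parts
```

The values stay as bytes on purpose: deciding what to do when the
charset is None is left to a later step rather than buried in the parser.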

>> Or it should be considered encoded "byte" string?
>
> I'd recommend that it should be, certainly at the lower levels. A higher
> level API can look at the hints available to figure out how to decode the
> non-file fields: e.g.: if the magic _charset_ parameter is present, use
> that, otherwise use what the developer tells you they put in accept-charset
> / what encoding they sent the page in.

I don't think any library should be applying those heuristics.  Hasn't
everybody been annoyed by IE's content type sniffing heuristics?
This would be the same idea, but on the server side.

Heuristics, though, may be a perfectly suitable thing for some
applications to do.  But you also have to remember that not all HTTP
transactions involve browsers, or even HTML, and that in those cases a
standard library should give deviations from the RFC explicit
consequences.

I think that perhaps allowing the application to provide an override
(a default content type) as input might be enough in this case, although
even that could be argued.  It might be sufficient that the library follow
the RFC strictly; and, well, if the posted data doesn't follow the spec, we
raise an error along with the original byte string and let the application
deal with it.
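
As a rough sketch of that idea (not a proposal for an actual API): decode
strictly using the declared charset, fall back to an application-supplied
default, and on failure raise an error that carries the original bytes.
The names FieldDecodeError and decode_field are hypothetical, and the
"us-ascii" default is my reading of the MIME default for text/plain.

```python
class FieldDecodeError(ValueError):
    """Raised when a field cannot be decoded; keeps the original bytes."""

    def __init__(self, raw, charset):
        super().__init__("cannot decode field as %r" % (charset,))
        self.raw = raw          # original byte string, untouched
        self.charset = charset  # charset that was attempted


def decode_field(raw, declared_charset=None, default_charset="us-ascii"):
    """Decode one form field strictly, with no sniffing.

    declared_charset comes from the subpart's Content-Type, if any.
    default_charset is the application's override for parts that
    declare nothing (assumed default: us-ascii).
    """
    charset = declared_charset or default_charset
    try:
        # bytes.decode is strict by default: bad input raises, never guesses.
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        raise FieldDecodeError(raw, charset) from None
```

An application that knows its forms are served as UTF-8 would pass
default_charset="utf-8"; one that wants heuristics can catch the error
and apply them to exc.raw itself.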

An override is, I think, a reasonable compromise that allows one to
deal with real-world non-conforming browsers without throwing
out the RFC or adding complex, fragile heuristics to the library.
You certainly don't want to break when/if you get a user agent
that DOES follow the RFC.
-- 
Deron Meranda