[Web-SIG] parsing of urlencoded data and Unicode
James Y Knight
foom at fuhm.net
Tue Jul 29 22:04:04 CEST 2008
On Jul 29, 2008, at 3:18 PM, Deron Meranda wrote:
> In what way is RFC 2388 wrong or not MIME?
> Per RFC 2388 sect. 3:
> "The media-type multipart/form-data follows the rules of all
> MIME data streams as outlined in [RFC 2046]."
> So it is MIME, right?
No: RFC 2388 says it is MIME, but in real life it is not. RFC 2388 is
> Now you can successfully argue that many user agents do not
> follow the RFC carefully enough. But that's not a problem with
> the RFC itself.
Common practice is by now long established, and cannot simply be
changed 10 years after the fact to conform to what the standard says
it should've been. Therefore, it *is* now a problem with the standard:
the standard is wrong. If you follow it, you're going to create
totally broken software.
For instance, take treating form posts as 7bit unless they have a
Content-Transfer-Encoding header. The RFC says you should do that, but
it's an absolutely nonsensical thing to do: your code would not work
with any existing web browser if you did. Or, if you're writing a web
browser: don't even think of using Content-Transfer-Encoding to encode
your submission. Few servers or frameworks would understand it if you
tried.
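To make the 7bit point concrete, here is a minimal Python sketch. The
part below is a hand-written illustration of what a browser-style
multipart/form-data part looks like (raw 8-bit bytes, no
Content-Transfer-Encoding header); the field name "city", its value, and
the helper names are hypothetical, not taken from any real capture.

```python
# Hand-written example of one multipart/form-data part, in the shape
# browsers commonly send: raw UTF-8 bytes in the body and no
# Content-Transfer-Encoding header. (Illustrative data only.)
part = (
    b'Content-Disposition: form-data; name="city"\r\n'
    b"\r\n"
    + "Z\u00fcrich".encode("utf-8")
)

# Split the part into its header block and its body.
headers, _, body = part.partition(b"\r\n\r\n")


def is_7bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (what MIME's default implies)."""
    return all(byte < 0x80 for byte in data)


def decode_assuming_7bit(data: bytes) -> str:
    """Hypothetical strict reader: treats a body without a
    Content-Transfer-Encoding header as 7bit, i.e. plain ASCII."""
    return data.decode("ascii")


has_cte = b"content-transfer-encoding" in headers.lower()
body_is_7bit = is_7bit(body)
```

A reader that insists on the MIME default fails on this body
(decode_assuming_7bit raises UnicodeDecodeError), which is why the
lower layer is better off passing the bytes through untouched.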
> But, at this point, can one consider the content of form post to be
> encoded "text" string?
> Or it should be considered encoded "byte" string?
I'd recommend that it should be considered a byte string, certainly at
the lower levels. A higher-level API can look at the hints available to
figure out how to decode the non-file fields: e.g., if the magic
_charset_ parameter is present, use that; otherwise use what the
developer tells you they put in accept-charset, or the encoding they
sent the page in.
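The layering described above could be sketched roughly as follows in
Python. This is only an illustration under assumptions: the function
name decode_form_fields, the declared_encoding argument (standing in
for "what the developer tells you"), and the dict-of-bytes input shape
are all hypothetical, not part of any existing framework.

```python
import codecs


def decode_form_fields(raw_fields, declared_encoding="utf-8"):
    """Decode non-file form fields that the lower layer kept as bytes.

    raw_fields: dict mapping bytes field names to bytes values
        (hypothetical shape for the low-level parser's output).
    declared_encoding: the encoding the developer says the page or
        accept-charset used (hypothetical parameter).
    """
    encoding = declared_encoding

    # Prefer the _charset_ hint when the submission carries one.
    hint = raw_fields.get(b"_charset_")
    if hint:
        try:
            candidate = hint.decode("ascii")
            codecs.lookup(candidate)  # raises LookupError if unknown
            encoding = candidate
        except (LookupError, ValueError):
            # Unusable hint: keep the developer-declared encoding.
            pass

    # Decode names and values; undecodable bytes become U+FFFD rather
    # than raising, which is a design choice of this sketch.
    return {
        name.decode(encoding, "replace"): value.decode(encoding, "replace")
        for name, value in raw_fields.items()
    }
```

For example, a submission carrying _charset_=iso-8859-1 would have its
other fields decoded as Latin-1, while one without the hint would fall
back to declared_encoding.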