[Web-SIG] parsing of urlencoded data and Unicode
Deron Meranda
deron.meranda at gmail.com
Tue Jul 29 21:18:43 CEST 2008
On Tue, Jul 29, 2008 at 2:41 PM, Manlio Perillo
<manlio_perillo at libero.it> wrote:
> James Y Knight wrote:
>> You seem to be under the mistaken impression that form post content is
>> MIME. It is not. It looks kinda like it should be, and maybe it's even
>> specified to be [rfc2388], but actually treating it as MIME is a rather
>> critical error. RFC2388 is just wrong, don't believe a thing it says.
In what way is RFC 2388 wrong or not MIME?
Per RFC 2388 sect. 3:
"The media-type multipart/form-data follows the rules of all multipart
MIME data streams as outlined in [RFC 2046]."
So it is MIME, right?
You may be referring to the much older "experimental" RFC 1867,
upon which 2388 is based. It merely said it was a "MIME compatible
representation". But even then the intent was clearly to be MIME.
Now you can successfully argue that many user agents do not
follow the RFC carefully enough. But that's not a problem with
the RFC itself.
> But, at this point, can one consider the content of form post to be encoded
> "text" string?
>
> Or it should be considered encoded "byte" string?
Both/either.
I'd say follow the RFC, but perhaps allow the caller to override
the default charset. So yes, you should assume an encoded text
string if the subpart has a text/* Content-Type, or if it has no
Content-Type at all (in which case it must be assumed to be
text/plain in US-ASCII). That is the intent of the MIME text/*
media type after all: the content should be interpreted as a
character string and not a byte string.
In other cases, I would say returning a byte string is the
correct thing to do.
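
To make the text-versus-bytes rule concrete, here is a rough sketch in
present-day Python 3. The function name decode_part and its arguments
are made up for illustration; only the standard library's
email.message.Message is used to read the Content-Type header, and
RFC 2231 encoded charset parameters are not handled.

```python
from email.message import Message


def decode_part(payload, content_type=None):
    """Return str for text/* parts (or parts with no Content-Type), else bytes.

    Hypothetical helper: `payload` is the raw bytes of one form-data
    subpart, `content_type` is its Content-Type header value or None.
    """
    msg = Message()
    if content_type is not None:
        msg["Content-Type"] = content_type

    # With no Content-Type header, Message reports text/plain,
    # which matches the MIME default for a subpart.
    if msg.get_content_maintype() != "text":
        return payload  # non-text parts stay as byte strings

    # No charset parameter means US-ASCII for text/* parts.
    charset = msg.get_param("charset", "us-ascii")
    return payload.decode(charset)  # strict decoding by default


print(type(decode_part(b"hello")).__name__)
print(type(decode_part(b"\x89PNG", "application/octet-stream")).__name__)
```

A part with no header is treated the same as an explicit text/plain
part, so both come back as character strings; anything outside text/*
is returned untouched.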
Also, I'd say that if you're dealing with text (text/*) and no
charset is given (and the caller hasn't supplied an override
default charset), then you must assume US-ASCII. And
you should allow any UnicodeDecodeError to bubble
up to the caller. In other words, if a user agent sent text
in ISO-8859-x and didn't say it was doing so, then an
error should be raised when non-ASCII data is seen.
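
The charset fallback and the error behaviour could look something like
the following Python 3 sketch. The names decode_text, charset and
override_default are assumptions of mine, not part of any existing
API; the only real call is bytes.decode with strict error handling.

```python
def decode_text(raw, charset=None, override_default=None):
    """Decode a text/* payload, letting decode errors reach the caller.

    Hypothetical helper. Precedence: the charset declared by the user
    agent, then the caller's override default, then US-ASCII.
    """
    encoding = charset or override_default or "us-ascii"
    # errors="strict" raises UnicodeDecodeError instead of guessing.
    return raw.decode(encoding, errors="strict")


# ISO-8859-1 bytes sent without any charset declaration.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

try:
    decode_text(latin1_bytes)
except UnicodeDecodeError as exc:
    print("rejected undeclared non-ASCII data:", exc.reason)

# A caller that knows better can supply its own default.
print(decode_text(latin1_bytes, override_default="iso-8859-1"))
```

Nothing is caught inside the helper, so the caller decides whether to
reject the request or retry with a different default.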
--
Deron Meranda