[Web-SIG] parsing of urlencoded data and Unicode

Deron Meranda deron.meranda at gmail.com
Tue Jul 29 22:12:10 CEST 2008


On Tue, Jul 29, 2008 at 3:50 PM, Manlio Perillo
<manlio_perillo at libero.it> wrote:
> Deron Meranda ha scritto:
>>
>> [...]
>>>
>>> But, at this point, can one consider the content of form post to be
>>> encoded
>>> "text" string?
>>>
>>> Or it should be considered encoded "byte" string?
>>
>> Both/either.
>>
>> I'd say follow the RFC, but perhaps allow a caller to provide
>> an override default.  So yes, you should assume an encoded
>> string if the subpart has a text/* Content-Type, or if it has no
>> content type at all (which must then be assumed to be text/plain
>> US-ASCII).  That is the intent of the MIME text/* media type
>> after all; that it should be interpreted as a character string
>> and not a byte string.
>>
>> In other cases, I would say returning a byte string is the
>> correct thing to do.
>>
>
> I'm not sure to understand.
> If you want non text data in the POST request body, you can use the file
> control.

I don't think we're disagreeing.

In HTML, an input element with type=file will result in non-text; i.e.,
it should result in a byte stream (ignoring the possibility of uploading
text files, which are permitted but not required to have a text/*
content type).  But on the other hand an input with type=text or
type=password should definitely result in a character string,
not a byte string.  Same with a textarea element.

It's less clear what input type=checkbox or type=radio should give,
but I think it's safe to assume a character string.

Either way, the parser of the multipart/form-data has no idea
what the original HTML looked like; it only has the posted MIME
structure and headers to go by.

In my suggestion, you would return a byte string only if the subpart
has a Content-Type header and that type is not text/*.  Everything
else should result in a character string.
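
A minimal sketch of that rule in Python 3, for illustration only.  The
function name decode_subpart and its default_charset parameter are
hypothetical; default_charset stands in for the caller-provided
override.  The standard library's email.message.Message is used only
to parse the Content-Type header value.

```python
from email.message import Message


def decode_subpart(content_type, payload, default_charset="us-ascii"):
    """Return str for text/* (or untyped) subparts, bytes otherwise.

    content_type is the raw Content-Type header value of the subpart,
    or None if the subpart has no such header.  payload is the raw
    body of the subpart as bytes.
    """
    if content_type is None:
        # No Content-Type header: treat as text/plain, using the
        # default charset (US-ASCII unless the caller overrides it).
        return payload.decode(default_charset)

    header = Message()
    header["Content-Type"] = content_type
    if header.get_content_maintype() != "text":
        # A declared non-text type: hand back the bytes untouched.
        return payload

    # text/*: prefer the declared charset, else the default.
    charset = header.get_content_charset() or default_charset
    return payload.decode(charset)
```

Note that the decode step raises UnicodeDecodeError when the bytes do
not match the charset; what to do in that case is left to the caller.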

But you can't pick just one return type; sometimes you have
bytes and other times you have characters.


> I can't really see use cases of normal input fields having byte strings.

In HTML, no.  Only input with type=file should ever result in
a content type other than text.

However, don't forget that not all POSTs with multipart/form-data
have to be the result of an HTML page.  So a generic consumer
of multipart/form-data can't make such assumptions, which is why it
should just follow the RFC, with possible caller-specified overrides
to compensate for the real world not matching the RFC.
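
As a rough sketch of such a generic consumer, the following Python 3
code parses a whole multipart/form-data body with the standard email
package and looks only at each subpart's MIME headers.  The name
parse_form_data and its default_charset parameter are hypothetical,
and repeated field names simply overwrite each other here.

```python
from email.parser import BytesParser


def parse_form_data(content_type, body, default_charset="us-ascii"):
    """Map field names to str (text/* or untyped parts) or bytes."""
    # Prepend the request's Content-Type so the MIME parser knows
    # the boundary string.
    prefix = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    msg = BytesParser().parsebytes(prefix + body)
    if not msg.is_multipart():
        raise ValueError("not a multipart body")

    fields = {}
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition")
        data = part.get_payload(decode=True)  # raw bytes of the subpart
        # A missing Content-Type is reported as text/plain.
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or default_charset
            fields[name] = data.decode(charset)
        else:
            fields[name] = data
    return fields
```

A client that is not an HTML page could then post a typed binary part
next to an untyped text part, and the two would come back as bytes
and str respectively.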
-- 
Deron Meranda

