[Web-SIG] parsing of urlencoded data and Unicode

Deron Meranda deron.meranda at gmail.com
Tue Jul 29 20:58:26 CEST 2008


On Tue, Jul 29, 2008 at 12:39 PM, Manlio Perillo
<manlio_perillo at libero.it> wrote:
> Bill Janssen wrote:
>> Actually, it's defined for all fields, isn't it?  From RFC 2388:
>>
>> ``As with all multipart MIME types, each part has an optional
>> "Content-Type", which defaults to text/plain.''
>>
>> So the type is "text/plain" unless it says something else.  And,
>> according to RFC 2046, the default charset for "text/plain" is
>> "US-ASCII".
>
> Ok with theory.
> But in practice:
>
> <form action="" method="post" accept-charset="utf-8"
>      enctype="multipart/form-data">
> [...]
>
> In theory I should assume ascii encoded data for the body field; and since
> this data can not be decoded, I should assume it as byte string.
>
> However the body field is encoded in utf-8, and if I add a hidden _charset_
> field, FF and IE add this field in the request, with the charset used in
> the encoding.

From what I've seen, most user agents fail to send a Content-Type,
much less a charset parameter.  Many will also ignore the accept-charset
<form> attribute.

However, most browsers will send the text fields of a POST
request in the same character set in which the page containing the
<form> element was originally served.  So if you
output HTML pages in UTF-8, the text portions of POST requests will
be sent back in UTF-8.

It's not following any standard, but it's the way things seem to work.
I would think it most useful if the decoding framework would strictly
follow the RFC and assume "text/plain; charset=US-ASCII"; but
also allow the caller some means of indicating a different default.
Obviously, if a user agent does provide a complete Content-Type,
it should be used.
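
As a rough sketch of what I mean (not any existing framework's API; the
names part_charset and decode_field are made up for illustration), the
decoder could take the part's Content-Type when there is one and
otherwise fall back to a caller-supplied default, which itself defaults
to US-ASCII as the RFC says:

```python
from email.message import Message

# Default from RFC 2388 / RFC 2046: text/plain; charset=US-ASCII
RFC_DEFAULT_CHARSET = "us-ascii"


def part_charset(content_type, default=RFC_DEFAULT_CHARSET):
    """Pick the charset for one multipart/form-data part.

    content_type is the raw Content-Type header value of the part,
    or None if the user agent did not send one.
    """
    if not content_type:
        return default
    # Let the stdlib parse the header parameters for us.
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when there is no charset parameter.
    return msg.get_content_charset() or default


def decode_field(raw, content_type=None, default=RFC_DEFAULT_CHARSET):
    """Decode the bytes of a text field using the rules above."""
    return raw.decode(part_charset(content_type, default))
```

An application that serves its pages in UTF-8 would then call
decode_field(raw, content_type, default="utf-8"), while a caller that
passes no default gets the strict behaviour and a UnicodeDecodeError
on non-ASCII input.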
-- 
Deron Meranda

