[Python-3000] BOM handling

Antoine Pitrou solipsis at pitrou.net
Thu Sep 14 08:19:00 CEST 2006


On Wednesday, 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
> In any case, I believe that the above behavior is correct for the
> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
> 'utf-16-le' decoding spellings; two of which don't strip.

Your opinion is probably valid from a theoretical point of view. You are
more knowledgeable than I am.

My point was different: most programmers are not at your level (or
Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
is supposed to be an abstracted textual type that makes it easy to write
Unicode-friendly applications (isn't it?).
Therefore it should hide the messy issue of superfluous BOMs, unwanted
BOMs, etc. Telling the programmer to use a specific UTF-8 variant
specialized in BOM-stripping will make eyes roll... "why doesn't the
standard UTF-8 do it for me?"
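
To make the behaviour under discussion concrete, here is a minimal sketch
of how the codecs treat a leading BOM in CPython. The variable names are
only illustrative, and the sample strings are made up for the example; the
codec names ('utf-8', 'utf-8-sig', 'utf-16', 'utf-16-le') are the standard
ones.

```python
import codecs

# A UTF-8 byte string that starts with a BOM, as some editors write it.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The generic 'utf-8' codec keeps the BOM as a U+FEFF character.
plain = raw.decode("utf-8")

# The 'utf-8-sig' variant strips a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

# For comparison: the same text encoded as little-endian UTF-16 with a BOM.
with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Generic 'utf-16' uses the BOM to pick the byte order and consumes it.
generic = with_bom.decode("utf-16")

# The explicit-endianness spelling leaves the BOM in the decoded text.
explicit = with_bom.decode("utf-16-le")

print(repr(plain), repr(stripped), repr(generic), repr(explicit))
```

So a programmer who reads a BOM-prefixed file with plain 'utf-8' ends up
with a stray U+FEFF at the start of the string, which is exactly the
surprise described above.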


