[I18n-sig] Encoding auto-detection
Paul Prescod
paulp@ActiveState.com
Fri, 01 Jun 2001 16:07:59 -0700
"Martin v. Loewis" wrote:
>
>...
>
> I see. For a general purpose encoding guesser to be useful, it would
> work totally different from the XML autodetection.
Agreed. They should be treated as two different problems.
>...
> In general, I think encoding auto-detection is a stupid idea, you
> really have to have a higher-level protocol that tells you what the
> encoding is.
These protocols are very unreliable. I often see data served from a
website as application/octet-stream no matter what its real data type
is.
> ... Trying Unicode-encodings-autodetection might be more
> successful, but I still think it is quite pointless: I predict that
> UTF-16 or UTF-32 will be quite rare, and that most Unicode text will
> be exchanged as UTF-8.
On Windows, if you save a file as "Unicode", you get UTF-16 (normally
little-endian with a leading byte order mark). I think that UTF-16 is
Microsoft's "standard" Unicode encoding. UTF-8 could be considered
Unix's "standard" encoding.
I don't think you should treat it as either-or. Autodetection is not as
good as really knowing for sure, of course. That doesn't mean that it is
*stupid*. It means it is the best fallback available when dealing with
stupid systems like the Unix file system or misconfigured web servers.
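To make the "fallback" idea concrete, here is a rough sketch of what I
have in mind, not a finished design. The function name guess_encoding,
the declared parameter, and the choice of latin-1 as the last resort are
all my own assumptions for illustration. It trusts a higher-level
declaration when one exists, then looks for a byte order mark, then
checks whether the bytes happen to be valid UTF-8.

```python
import codecs

# Longer BOMs come first: the UTF-32-LE BOM begins with the UTF-16-LE BOM,
# so checking UTF-16 first would misidentify UTF-32-LE data.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, declared=None, fallback="latin-1"):
    """Return a codec name for the bytes in `data`.

    `declared` is whatever a higher-level protocol told us (hypothetical
    parameter); it always wins. Otherwise this is only a guess.
    """
    if declared:
        return declared
    # A BOM is the strongest in-band hint. The "utf-16" and "utf-32"
    # codecs read the BOM themselves to pick the byte order.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: non-UTF-8 text rarely decodes cleanly as UTF-8, so a clean
    # strict decode is decent evidence.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed last resort; latin-1 never fails but may be wrong.
        return fallback
```

You would use it as text = data.decode(guess_encoding(data)). Note that
pure ASCII comes back as "utf-8", which is harmless, and that anything
without a BOM that is not valid UTF-8 is simply assumed to be latin-1,
which is exactly the kind of guess that a real declaration should
override.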
--
Take a recipe. Leave a recipe.
Python Cookbook! http://www.ActiveState.com/pythoncookbook