[I18n-sig] Encoding auto-detection

M.-A. Lemburg mal@lemburg.com
Sat, 02 Jun 2001 13:24:05 +0200


Tom Emerson wrote:
> 
> Martin v. Loewis writes:
> > In general, I think encoding auto-detection is a stupid idea, you
> > really have to have a higher-level protocol that tells you what the
> > encoding is.
> 
> This is a utopian idea that completely falls apart in the real world.

That's why I need such a function... first for XML and then for
other files having some standard magic prepended to them.

The reason for this is simple: even if a protocol defines which
encoding to use, this is not necessarily respected in input data.
The usual thing to do is to first try decoding the data into Unicode
using the given encoding, then to analyse the data and try a guessed
encoding, and only then to reject the data as invalid input.
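
A minimal Python sketch of that three-step procedure. The names
guess_encoding() and decode_with_fallback() are hypothetical, and the
guesser is deliberately crude (byte-order marks only, UTF-32 ignored,
Latin-1 assumed as a last resort); a real detector would analyse the
data much more carefully.

```python
import codecs


def guess_encoding(data):
    """Hypothetical, crude guesser: look at byte-order marks only."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Assumption: fall back to Latin-1, which maps every byte to a
    # character and therefore never fails to decode.
    return "latin-1"


def decode_with_fallback(data, declared, guesser=guess_encoding):
    """Decode bytes: declared encoding first, then a guess, else reject."""
    # Step 1: trust the protocol and try the declared encoding.
    try:
        return data.decode(declared)
    except (UnicodeDecodeError, LookupError):
        pass
    # Step 2: analyse the data and try the guessed encoding.
    guessed = guesser(data)
    try:
        return data.decode(guessed)
    except (UnicodeDecodeError, LookupError) as exc:
        # Step 3: only now reject the input.
        raise ValueError(
            "cannot decode data as %r or %r" % (declared, guessed)
        ) from exc
```

For example, decode_with_fallback(b"caf\xe9", "utf-8") fails in step 1
and is then decoded as Latin-1 in step 2. With this particular guesser
step 3 is only reached if the guesser is swapped for a stricter one.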

Without the second step there would be far too many instances of
data being rejected due to wrong encoding information. A common
case for XML: files use Latin-1 in the body but omit the XML
declaration, so the parser defaults to UTF-8 and fails to read
the data.
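
The XML case can be reproduced with the standard library. This is
only a sketch: parse_xml_lenient() is a hypothetical helper, and
retrying with ISO-8859-1 is an assumption about the data, not
something the parser can know.

```python
import xml.etree.ElementTree as ET


def parse_xml_lenient(data, fallback="iso-8859-1"):
    """Parse XML bytes; retry with an assumed encoding if that fails."""
    try:
        # Without a declaration the parser assumes UTF-8.
        return ET.fromstring(data)
    except ET.ParseError:
        if data.lstrip().startswith(b"<?xml"):
            # The document has its own declaration; do not second-guess it.
            raise
        # Assumption: prepend a declaration naming the fallback encoding.
        header = ('<?xml version="1.0" encoding="%s"?>' % fallback).encode("ascii")
        return ET.fromstring(header + data)
```

Here parse_xml_lenient(b"<doc>caf\xe9</doc>") fails on the first
attempt, because the lone byte 0xE9 is not valid UTF-8, and succeeds
on the retry.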

You have a similar situation for data which originated in parts
of the world where more than one encoding is in common use,
e.g. Russia or Asia. Input data generated by humans should
always be treated with care ;-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/