[Python-3000] content-based detection

Sun Sep 10 21:57:56 CEST 2006

On Sunday, 10 September 2006 at 11:30 -0700, Paul Prescod wrote:
> I don't mind your name of autotextfile but I think that your
> by_content argument defeats the goal of having a very simple API for
> quick and dirty stuff. If content detection is a good idea (usually
> right) then we should do it.

Using the system or locale default is trustworthy and reproducible.
Content-based detection is less predictable, especially if the algorithm
isn't fully refined in the first Py3k releases.

> I can't see an argument for ever turning off the BOM detection. 

Perhaps, but having a subset of it still running behind your back after
you have disabled it is misleading.

Also, I think having BOM detection as the only test in content-based
detection would be uninteresting. The common use case for encoding
detection is to choose between a Unicode variant (mostly UTF-8,
*with or without BOM*) and the non-Unicode encoding that is popular for
a given language (e.g. ISO-8859-15).
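
To make that use case concrete, here is a minimal sketch of such a guess.
It is only an illustration, not a proposed API: the name guess_encoding and
its fallback parameter are hypothetical, and it assumes the whole file
content is available as bytes (a truncated sample could cut a multi-byte
sequence and be misjudged).

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "iso-8859-15") -> str:
    """Guess between UTF-8 (with or without BOM) and a legacy 8-bit encoding.

    Hypothetical helper for illustration; `fallback` stands for whatever
    non-Unicode encoding is popular for the user's language.
    """
    # A UTF-8 BOM is an unambiguous marker; "utf-8-sig" strips it on decode.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Legacy 8-bit text with accented characters is rarely valid UTF-8,
    # so a successful strict decode is strong evidence for UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Note that pure ASCII data is reported as UTF-8 here, which is harmless
since ASCII decodes identically under both candidates.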

I doubt many people have to discriminate between UTF-16LE, UCS-4 and
UTF-8. Are there real cases like that for text files?

Regards

Antoine.