[Python-3000] Pre-PEP: Easy Text File Decoding

Sun Sep 10 14:47:15 CEST 2006

Le dimanche 10 septembre 2006 à 21:58 +1000, Nick Coghlan a écrit :
> Antoine Pitrou wrote:
> > So, here is an alternative proposal :
> > Make it so that textfile() doesn't recognize system-wide defaults (as in
> > your proposal), but also provide autotextfile() which would recognize
> > those defaults (with a by_content=False optional argument to enable
> > content-based guessing).
> > 
> > textfile() being clearly marked for use by large well thought-out
> > applications, and autotextfile() for small scripts and the like.
> > Different names make it clear that they are for different uses, and
> > allow to spot them easily when looking at source code (either by a human
> > reader or a quality measurement tool).
> 
> How does your "autotextfile('myfile.txt')" differ from Paul's 
> "textfile('myfile.txt', encoding='guess')"?

Paul's "encoding='guess'" specifies a complicated and dangerous guessing
algorithm.

However, autotextfile('myfile.txt') would mean :
- use Paul's "site" if such a thing is defined
- otherwise, use Paul's "locale"
(no content-based guessing)

On the other hand "autotextfile('myfile.txt', by_content=True)" would
enable content-based guessing, thus be equivalent to Paul's
"encoding='guess'".

To sum up the API:
 - textfile("filename.txt", mode, encoding=None): fails without  an
explicit "encoding" argument if no "site" algorithm has been explicitly
configured.
 - autotextfile("filename.txt", mode, by_content=False): selects either
the "site"-configured encoding or the locale fallback, unless
"by_content" is True in which case it tries to detect based on actual
content.

In short, my proposal is just a naming proposal to achieve the following
goals :
- the textfile() function is "clean", and satisfies to the ideal that it
is Wrong to not specify an encoding when retrieving text from on-disk
bytes
- the autotextfile() function makes it easy to write simple scripts with
an easy to remember function with an explicit name (instead of a magic
value in an optional string argument)
- the autotextfile() function makes it easy to spot those abusive uses
of the quick-and-dirty way in apps which strive for interoperability and
portability

(in French we say "ne pas mélanger les torchons et les serviettes" :
don't mix towels and rags :-))

All this can be in a module, no need to pollute the top-level
namespace :
from text import textfile
from text import autotextfile

> The 'additional symbolic values' should be implemented as true
> encodings (i.e., it should be possible to look up 'site', 'guess' and
> 'locale' in the codecs registry, and replace them there as well).

Treating different things as "true encodings" does not help
understandability IMHO. "guess", "site" and "locale" are not encodings
in themselves, they are decision algorithms. In particular, "guess" has
to look at big chunks of existing text contents before deciding (which
may or may not have side-effects such as unexpected buffering).

Really, while "iso-8859-1" or "utf-8" is always the same encoding,
"guess" will not always result in the same encoding being used: it
depends on actual data fed to it. "guess" will not even allow the same
set of characters to be used: if "guess" results in "iso-8859-1", then I
can't use all the (Unicode) characters that I can use when "guess"
results in "utf-8".
This variability/unpredictability is a fundamental difference in
behaviour compared to a "true encoding", for which you can always be
sure what set of (textual) data can be represented.

Regards

Antoine.