[Python-3000] encoding='guess' ?

Nick Coghlan ncoghlan at gmail.com
Sun Sep 10 15:44:16 CEST 2006


Antoine Pitrou wrote:
> Hi,
> 
> Let me add that 'guess' should probably be forbidden as an encoding
> parameter (instead, a separate function argument should be used as in my
> proposal).
> 
> Here is a schematic example to show why :
> 
> def append_text(filename, encoding):
>     src = textfile(filename, "r", encoding)
>     my_text = src.read()
>     src.close()
>     dst = textfile("textlist.txt", "r+", encoding)
>     dst.seek_end(0)
>     dst.write(my_text + "\n")
>     dst.close()
> 
> With Paul's current proposal three cases can arise :
>  - "encoding" is a real encoding name like iso-8859-1 or utf-8. There
> should be no problems, since we assume this encoding has been configured
> once and for all in the application.
>  - "encoding" is either "site" or "locale". This should result in the
> same value run after run, since we assume the site or locale encoding
> value has been configured once and for all.
>  - "encoding" is "guess". In this case anything can happen. A possible
> occurence is that for the first file, it will result in utf-8 being
> detected (or Shift-JIS, or whatever), and for the second file it will be
> iso-8859-1. This will lead to a crash in the likely case that some
> characters in the source file can't be represented using the character
> encoding auto-detected for the destination file.
> 
> Yet the append_text() function does look correct, doesn't it?
> 
> We shouldn't hide a contextual encoding-detection algorithm under an
> encoding name. It leads to semantic uncertainty.

Interesting. This goes back more towards the model of "no default encoding, 
but provide the right tools to make it easy for a program to choose one in the 
absence of any metadata".

So perhaps there should just be an explicit function "guessencoding()" that 
accepts a filename and returns a codec name. So if you want to guess, you 
would do something like:

f = open(fname, 'r', string.guessencoding(fname))

The PEP's other suggestions would then be spelled something like:

f = open(fname, 'r', string.getlocaleencoding())
f = open(fname, 'r', string.getsiteencoding())

Cheers,
Nick.



-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org


More information about the Python-3000 mailing list