[Python-ideas] Py3 unicode impositions

Steven D'Aprano steve at pearwood.info
Sun Feb 12 06:26:24 CET 2012


Nick Coghlan wrote:
> On Sun, Feb 12, 2012 at 1:19 PM, Carl M. Johnson
> <cmjohnson.mailinglist at gmail.com> wrote:
>> On Feb 11, 2012, at 5:10 PM, Eric Snow wrote:
>>
>>> So something like this:
>>>
>>>    import functools, builtins
>>>    open = builtins.open = functools.partial(open, encoding="ascii",
>>>                                             errors="surrogateescape")
>>
>> We could pack it in and call it something like "python2open". :-)
> 
> An open_ascii() builtin isn't as crazy as it may initially sound -
> it's not at all uncommon to have a file that's almost certainly in
> some ASCII compatible encoding like utf-8, latin-1 or one of the other
> extended ASCII encodings, but you don't know which one specifically.

To me, "open_ascii" suggests either:

- it opens ASCII files, and raises an error if they are not ASCII; or

- it opens non-ASCII files, and magically translates their content to ASCII 
using some variant of "The Unicode Hammer" recipe:

http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/

We should not be discouraging developers from learning even the most trivial 
basics of Unicode. I'm not suggesting that we try to force people to become 
Unicode experts (they wouldn't, even if we tried) but making this a built-in 
is dumbing things down too much. I don't believe that it is an imposition for 
people to explicitly use 
open(filename, encoding='ascii', errors='surrogateescape') if that's what they 
want.

If they want open_ascii, let them define this at the top of their modules:

open_ascii = (lambda name:
     open(name, encoding='ascii', errors='surrogateescape'))

A one liner, if you don't mind long lines.
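
As a rough illustration of what that recipe buys you, here is a minimal sketch 
that reads a file containing non-ASCII bytes and writes it back unchanged. The 
def form of open_ascii, the extra mode parameter, the file names and the 
sample bytes are all assumptions made for the example, not part of the recipe 
itself.

```python
import os
import tempfile

def open_ascii(name, mode="r"):
    # Hypothetical helper: same idea as the lambda, plus a mode argument.
    return open(name, mode, encoding="ascii", errors="surrogateescape")

# Sample bytes in no single known encoding: one latin-1 byte, one utf-8 pair.
raw = b"caf\xe9 \xc3\xa9"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.txt")
    dst = os.path.join(tmp, "copy.txt")

    with open(src, "wb") as f:
        f.write(raw)

    # Each undecodable byte becomes a lone surrogate instead of raising.
    with open_ascii(src) as f:
        text = f.read()

    # Writing with the same error handler turns the surrogates back into bytes.
    with open_ascii(dst, "w") as f:
        f.write(text)

    with open(dst, "rb") as f:
        round_tripped = f.read()

print(ascii(text))
print(round_tripped == raw)
```

The ASCII part of the text can be processed as ordinary str, and the bytes 
you didn't understand pass through untouched.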

I'm not entirely happy with the surrogateescape solution, but I can see it's 
possibly the least worst *simple* solution for the case where you don't know 
the source encoding. (Encoding guessing heuristics are awesome but hardly 
simple.) So put the recipe in the FAQs, in the docs, and the docstring for 
open[1], and let people copy and paste the recipe. That's a pretty gentle 
introduction to Unicode.
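
For anyone copying the recipe, a short sketch of how the error handler behaves 
on bytes in memory may help. The sample data is made up, and the final step 
only shows one known limitation of escaped text; it is not necessarily the 
reservation meant above.

```python
data = b"na\xefve"  # made-up sample: one non-ASCII byte, encoding unknown

# Strict ASCII decoding refuses the byte outright.
try:
    data.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape smuggles the byte through as a lone surrogate code point...
text = data.decode("ascii", errors="surrogateescape")

# ...and the same handler restores the original byte when encoding.
restored = text.encode("ascii", errors="surrogateescape")

# The escaped text is not valid for a strict encoder, so handing it to code
# that encodes to UTF-8 without the handler raises an error.
try:
    text.encode("utf-8")
    utf8_failed = False
except UnicodeEncodeError:
    utf8_failed = True

print(strict_failed, ascii(text), restored == data, utf8_failed)
```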




[1] Which is awfully big and complex in Python 3.1, but that's another story.


-- 
Steven


