[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode
report at bugs.python.org
Sun Feb 12 13:16:36 CET 2012
Paul Moore <p.f.moore at gmail.com> added the comment:
A better example in terms of "intended to be text" might be ChangeLog files. These are clearly text files, but of sufficiently standard format that they can be manipulated programmatically.
Consider a program to get a list of all authors who changed a particular file. Scan the file for date lines, then scan the block of text below each one for the filename you care about. Extract the author from the date line, put it into a set, then sort and print.
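A rough sketch of that scan, assuming GNU-style ChangeLog entries (a "YYYY-MM-DD  Name  <email>" date line followed by indented "* filename" lines). The function name, the regular expression and the sample data are illustrative assumptions, not taken from any real project, and the filename test is a naive substring match:

```python
import re

# Assumed date-line layout: "2012-02-12  Some Name  <address>"
DATE_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}\s+(?P<author>.+?)\s+<[^>]*>\s*$")


def authors_who_changed(lines, filename):
    """Return a sorted list of authors whose entries mention filename."""
    authors = set()
    current_author = None
    for line in lines:
        match = DATE_LINE.match(line)
        if match:
            # A new entry starts here; remember who wrote it.
            current_author = match.group("author")
        elif current_author is not None and filename in line:
            # Naive substring test against the body of the entry.
            authors.add(current_author)
    return sorted(authors)


# Hypothetical sample data for illustration only.
sample = [
    "2012-02-12  Alice Example  <alice@example.org>",
    "",
    "\t* foo.c (main): Fix crash.",
    "",
    "2012-02-11  Bob Example  <bob@example.org>",
    "",
    "\t* bar.c: New file.",
    "",
    "2012-02-10  Alice Example  <alice@example.org>",
    "\t* foo.c: Initial version.",
]

print(authors_who_changed(sample, "foo.c"))
```

Nothing in this logic depends on how the names are encoded, as long as the lines arrive as str.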
All of this can be done assuming the file is ASCII-compatible, but requires non-trivial text processing that would be a pain to do on bytes. But author names are quite likely to be non-ASCII, especially if it's an international project. And the changelog file is manually edited by people on different machines, so the possibility of inconsistent encodings is definitely there. (I have seen this happen - it's not theoretical!)
For my code, all I care about is that the names round-trip, so that I'm not damaging people's names any more than has already happened.
encoding="ascii", errors="surrogateescape" sounds like precisely the right answer here.
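A minimal sketch of the round trip, assuming a file whose lines were saved in different encodings. The names and byte strings are made up for illustration; the same encoding and errors arguments can be passed to open():

```python
# Two hypothetical date lines, written on machines with different encodings.
latin1_line = "2012-02-12  José Example  <jose@example.org>\n".encode("latin-1")
utf8_line = "2012-02-11  Zoë Example  <zoe@example.org>\n".encode("utf-8")
raw = latin1_line + utf8_line

# Every non-ASCII byte becomes a lone surrogate (U+DC80..U+DCFF)
# instead of raising UnicodeDecodeError.
text = raw.decode("ascii", errors="surrogateescape")

# Ordinary str processing works on the ASCII parts.
lines = text.splitlines(keepends=True)

# Encoding with the same error handler restores the original bytes exactly.
restored = "".join(lines).encode("ascii", errors="surrogateescape")
print(restored == raw)
```

One caveat: the escaped characters are lone surrogates, so encoding such a string with a strict error handler (for example, printing it to a UTF-8 terminal) raises UnicodeEncodeError. Output needs the same errors="surrogateescape" setting.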
(If it's hard to find a good answer in Python 3, it's very easy to decide to use Python 2, which "just works", or even other tools like awk, which also take Python 2's naive approach, and to dismiss Python 3's Unicode model as "too hard".)
My mental model here is text editors, which let you open any file, do their best to display as much as they can and allow you to manipulate it without damaging the bits you don't change. I don't see any reason why people shouldn't be able to write Python 3 code that way if they need to.
Python tracker <report at bugs.python.org>