On Wed, Feb 15, 2012 at 2:12 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Nick Coghlan writes:
> using "ascii+surrogateescape" for your own I/O and setting
> "backslashreplace" on sys.stdout should cover you (and any
> exceptions you get will be warning you about cases where your
> original assumptions about not caring about Unicode validity have
> been proven wrong).
Are you saying you know more than the user about her application?
No, I'm merely saying that at least 3 options (latin-1,
ascii+surrogateescape, chardet2) should be presented clearly to
beginners and the trade-offs explained. For example:

Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: None
Approach: Specify encoding="latin-1"
Bytes/bytearray: data.decode("latin-1")
Text files: open(fname, encoding="latin-1")
Stdin replacement: sys.stdin = io.TextIOWrapper(sys.stdin.buffer, "latin-1")
Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, "latin-1", line_buffering=True)
Stdout replacement (terminal): Leave it alone

By decoding with latin-1, an application won't get *any* Unicode
decoding errors, as that encoding maps byte values directly to the
first 256 Unicode code points. However, any output data generated by
that application *will* be corrupted if the assumption of ASCII
compatibility is violated, or if implicit transcoding to any encoding
other than "latin-1" occurs (e.g. when writing to sys.stdout or a log
file, communicating over a network socket or serialising the string
with the json module). This is the closest Python 3 comes to emulating
the permissive behaviour of Python 2's 8-bit strings (implicit
interoperation with byte sequences is still disallowed).

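As a rough illustration of the latin-1 approach (a sketch only; the
swap_fields helper and the sample data are made up for this example),
the following shows that ASCII-level processing round-trips the
untouched bytes exactly, and that transcoding the latin-1 view to
another encoding corrupts the data without raising anything:

```python
def swap_fields(raw: bytes) -> bytes:
    """Swap two comma-separated fields without knowing the real encoding.

    swap_fields is a hypothetical helper used only for illustration.
    """
    text = raw.decode("latin-1")        # never raises: each byte maps to one code point
    first, second = text.split(",", 1)  # split on an ASCII delimiter only
    return f"{second},{first}".encode("latin-1")  # untouched bytes come back unchanged


# Every possible byte value survives a latin-1 round trip.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# The input is really UTF-8, but the helper never needs to know that.
utf8_input = "naïve,café".encode("utf-8")
swapped = swap_fields(utf8_input)
assert swapped.decode("utf-8") == "café,naïve"

# Transcoding the latin-1 view to a different encoding silently corrupts the data.
corrupted = utf8_input.decode("latin-1").encode("utf-8")
assert corrupted != utf8_input
```

Only operations that leave non-ASCII characters alone are safe here;
something like str.upper() would treat the placeholder latin-1
characters as real letters and could alter the underlying bytes.
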
Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: Minimal
Approach: Use encoding="ascii" and errors="surrogateescape" (or, alternatively, errors="backslashreplace" for sys.stdout)
Bytes/bytearray: data.decode("ascii", errors="surrogateescape")
Text files: open(fname, encoding="ascii", errors="surrogateescape")
Stdin replacement: sys.stdin = io.TextIOWrapper(sys.stdin.buffer, "ascii", "surrogateescape")
Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, "ascii", "surrogateescape", line_buffering=True)
Stdout replacement (terminal): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, sys.stdout.encoding, "backslashreplace", line_buffering=True)

Using "ascii+surrogateescape" instead of "latin-1" is a small initial
step into the Unicode-aware world. It still lets an application
process any ASCII-compatible encoding *without* having to know the
exact encoding of the source data, but will complain if there is an
implicit attempt to transcode the data to another encoding, or if the
application inserts non-ASCII data into the strings before writing
them out. Whether non-ASCII-compatible encodings trigger errors or get
corrupted will depend on the specifics of the encoding and how the
program manipulates the data.

The "backslashreplace" error handler (enabled by default for
sys.stderr, optionally enabled as shown above for sys.stdout) can be
useful to help ensure that printing out strings will not trigger
UnicodeEncodeErrors (note: ascii(x) always escapes non-ASCII
characters, and repr(x) escapes the lone surrogates produced by
"surrogateescape", although in Python 3 it leaves printable non-ASCII
characters intact. Thus, UnicodeEncodeErrors will occur only when
encoding the string itself using the "strict" error handler, or when
another library performs equivalent validation on the string).

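A minimal sketch of how "ascii+surrogateescape" behaves in practice
(the sample data and the encode_fails helper are made up for this
example, and io.BytesIO objects stand in for sys.stdin.buffer and
sys.stdout.buffer so that nothing global is replaced):

```python
import io

# The source bytes are really UTF-8, but the program only assumes ASCII compatibility.
raw = "café\n".encode("utf-8")

# Non-ASCII bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("ascii", errors="surrogateescape")
assert text == "caf\udcc3\udca9\n"

# Encoding with the same codec and error handler restores the original bytes.
assert text.encode("ascii", errors="surrogateescape") == raw


def encode_fails(value: str, encoding: str, errors: str = "strict") -> bool:
    """Hypothetical helper: report whether encoding raises UnicodeEncodeError."""
    try:
        value.encode(encoding, errors)
    except UnicodeEncodeError:
        return True
    return False


# Transcoding to another encoding is rejected rather than silently corrupting data.
transcode_rejected = encode_fails(text, "utf-8")
# Genuine non-ASCII text mixed into the string is rejected on output as well.
mixed_rejected = encode_fails(text + "ñ", "ascii", "surrogateescape")

# "backslashreplace" never raises; it writes the escapes out as plain ASCII.
escaped = text.encode("ascii", errors="backslashreplace")

# The same settings applied to streams. newline="" is an addition for this
# sketch: it disables newline translation so the byte comparison holds on
# every platform.
source = io.BytesIO(raw)
sink = io.BytesIO()
reader = io.TextIOWrapper(source, "ascii", "surrogateescape", newline="")
writer = io.TextIOWrapper(sink, "ascii", "surrogateescape",
                          newline="", line_buffering=True)
for line in reader:
    writer.write(line)
writer.flush()
copied = sink.getvalue()
assert copied == raw
```

The two rejected cases are the "complaints" described above: unlike
latin-1, the escaped bytes cannot leak into a different encoding
unnoticed.
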
Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: High
Approach: Use binary APIs and the "chardet2" module from PyPI to detect the character encoding
Bytes/bytearray: data.decode(detected_encoding)
Text files: open(fname, encoding=detected_encoding)

The *right* way to process text in an unknown encoding is to do your
best to derive the encoding from the data stream. The "chardet2"
module on PyPI allows this. Refer to that module's documentation
(WHERE?) for details.

With this approach, transcoding to the default sys.stdin and
sys.stdout encodings should generally work (although the default
restrictive character set on Windows and in some locales may cause
problems).

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia