I really like a task-oriented approach like this. +1000 for this sort of thing in the docs.

On 15 February 2012 08:03, Nick Coghlan <ncoghlan@gmail.com> wrote:
> Task: Process data in any ASCII compatible encoding
This is actually closest to how I think about what I'm doing, so thanks for spelling it out.
> Unicode Awareness Care Factor: High
I'm not entirely sure how to interpret this - does it mean "high level of interest in getting it right", or "high amount of investment in understanding Unicode needed"? Or something else?
> Approach: Use binary APIs and the "chardet2" module from PyPI to detect the character encoding
>   Bytes/bytearray: data.decode(detected_encoding)
>   Text files: open(fname, encoding=detected_encoding)
If this is going into the Unicode FAQ or somewhere similar, it probably needs a more complete snippet of sample code. Without having looked for and read the chardet2 documentation: do I need to read the file once in binary mode (possibly only partially) to scan it for an encoding, and then start again "for real"? That's arguably a downside to this approach.
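For the FAQ, I imagine the snippet would look something like the following. I haven't checked chardet2's actual API, so I've stubbed in a trivial "try UTF-8, fall back to latin-1" detector to keep the sketch self-contained - the real code would call into chardet2 instead:

```python
def sniff_encoding(fname, sample_size=64 * 1024):
    """Guess the file's encoding from an initial binary sample.

    Stand-in for chardet2: try UTF-8 first; fall back to latin-1,
    which never fails to decode and so is the guess of last resort.
    (A real detector would cope with a multi-byte sequence split
    at the sample boundary; this sketch does not.)
    """
    with open(fname, "rb") as f:  # first pass: binary mode
        sample = f.read(sample_size)
    try:
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def read_text(fname):
    # Second pass: reopen the file "for real" as text,
    # using whatever encoding the first pass detected.
    encoding = sniff_encoding(fname)
    with open(fname, encoding=encoding) as f:
        return f.read()
```

So yes, as far as I can see it is a two-pass affair - scan (part of) the file in binary mode, then reopen it as text - which is the downside I mentioned.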
> The *right* way to process text in an unknown encoding is to do your best to derive the encoding from the data stream. The "chardet2" module on PyPI allows this. Refer to that module's documentation (WHERE?) for details.
There is arguably another, simpler approach, which is to pick a default encoding (probably what Python gives you by default) and add a command line argument to your program (or equivalent if your program isn't a command line app) to manually specify an alternative. That's probably more complicated than the naive user wanted to deal with when they started reading this summary, but may well not sound so bad by the time they get to this point :-)
With this approach, transcoding to the default sys.stdin and sys.stdout encodings should generally work (although the restrictive default character set on Windows, and in some locales, may cause problems).
A couple of other tasks spring to mind:

Task: Process data in a file whose encoding I don't know
Unicode Understanding Needed: Medium-Low
Unicode Correctness: High
Approach: Use external tools to identify the encoding, then simply specify it when opening the file. On Unix, "file -i FILENAME" will attempt to detect the encoding; on Windows, XXX. If, and only if, this approach doesn't identify the encoding clearly, then the other options allow you to do the best you can. (Needs a better description of what tools to use, and maybe a sample Python script using chardet2 as a fallback.)

This is actually the "right way", and should be highlighted as such. By describing it this way, it's also rather clear that it's *not hard*, once you get over the idea that you don't know how to get the encoding, because it's not specified in the file. Having read through and extended Nick's analysis to this point, I'm thinking that it actually fits my use cases fine (and correct Unicode handling no longer feels like such a hard problem to me :-))

Task: Process data in a file believed to have inconsistent encodings
Unicode Understanding Needed: High
Unicode Correctness: Low
Approach: ??? Panic :-)

This is the killer, but should be extremely rare. We don't need to explain what to do here, but maybe offer a simple strategy:

1. Are you sure the file has mixed encodings? Have you checked twice?
2. If it's ASCII-compatible, can you work on the basis that you just pass the mixed-encoding bytes through unchanged? If so, use one of the other recipes Nick explained.
3. Do you care about mojibake or corruption? Can you afford not to?
4. Are you a Unicode expert, or do you know one? :-)

I think something like this would be a huge benefit for the Unicode FAQ. I haven't got the time or expertise to write it, but I wish I did. If I get some spare time, I might well have a go anyway, but I can't promise.

Paul
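PS: For point 2 of that strategy, Python 3 does actually have a concrete tool: the "surrogateescape" error handler (from PEP 383) lets you read a file as ASCII, carry any non-ASCII bytes through your text processing untouched, and write them back out byte-for-byte. The function here is just my own illustration of the pattern, not anything from the draft:

```python
def upper_ascii_only(src, dst):
    """Uppercase the ASCII text in a file of unknown or mixed encoding,
    passing all non-ASCII bytes through unchanged.

    surrogateescape maps each undecodable byte to a lone surrogate
    code point on read, and maps it back to the original byte on write,
    so the non-ASCII content round-trips exactly.
    """
    with open(src, encoding="ascii", errors="surrogateescape") as f:
        text = f.read()
    # str.upper() only affects real (ASCII) letters here; the escaped
    # bytes have no case mapping and survive untouched.
    with open(dst, "w", encoding="ascii", errors="surrogateescape") as f:
        f.write(text.upper())
```

So "pass the bytes through unchanged" isn't just hand-waving - it's one error-handler argument away from the normal text-mode recipes.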