On Thu, Feb 16, 2012 at 02:37:12PM +1300, Greg Ewing wrote:
On 16/02/12 02:39, Oleg Broytman wrote:
On Wed, Feb 15, 2012 at 11:15:36AM +1100, Ben Finney wrote:
If people want to remain wilfully ignorant of text encoding in the third millennium
This returns us to the very beginning of the thread. The original complaint was: Python3 requires users to learn too much about Unicode, more than they really need.
I don't think it's helpful to label everyone who wants to use the techniques being discussed here as lazy or ignorant. As we've seen, there are cases where you truly *can't* know the true encoding, and at the same time it *doesn't matter*, because all you want to do is treat the unknown bytes as opaque data. To tell someone in that position that they're being lazy is both wrong and insulting.
In fairness, this thread was originally started with the scenario "I'm reading files which are only mostly ASCII, but I don't want to learn about Unicode" rather than "I know about Unicode, but it doesn't help me in this situation because the encoding truly is unknown". So wilful ignorance does apply, at least in the use-case the thread started with. (If it helps, think of them as too busy to learn, not too lazy.)

If you already know about Unicode, then you probably don't need to be given a simple recipe to follow, because you probably already have a solution that works for you.

Which brings us back to the original use-case: "I have a file which is only mostly ASCII, and I don't care to learn about Unicode at this time to deal with it. I need a recipe I can follow that will do the right thing so I can continue to ignore the issue for a little longer."

I don't think that we should either insist that these people be forced to learn Unicode, or expect to be able to solve every possible problem they might find. A couple of recipes in the FAQs, and discussion of why you might prefer one to the other, should be able to cover most simple cases:

    open(filename, encoding='ascii', errors='surrogateescape')

    open(filename, encoding='latin1')

Both recipes hint at the wider world of encodings and error handlers, hence act as a non-threatening introduction to Unicode.

--
Steven
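For anyone who wants to see how the two recipes differ in practice, here is a minimal sketch rather than a definitive FAQ entry. The sample bytes and all variable names are made up for illustration; only open(), the 'surrogateescape' error handler and the 'latin1' codec come from the recipes themselves.

```python
import os
import tempfile

# Hypothetical sample data: mostly ASCII, plus one byte (0xE9) whose
# real encoding we do not know.
raw = b"caf\xe9 menu\n"

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(raw)
    filename = f.name

try:
    # Recipe 1: each non-ASCII byte becomes a lone surrogate (U+DC80..U+DCFF).
    with open(filename, encoding='ascii', errors='surrogateescape') as f:
        text_escaped = f.read()

    # Recipe 2: each byte maps to the code point with the same value.
    with open(filename, encoding='latin1') as f:
        text_latin1 = f.read()
finally:
    os.remove(filename)

# Both recipes give back the original bytes when encoded the same way.
roundtrip_escaped = text_escaped.encode('ascii', errors='surrogateescape')
roundtrip_latin1 = text_latin1.encode('latin1')

# The difference shows up when the text is encoded some other way:
# the escaped form refuses a strict UTF-8 encode, while the latin1 form
# is silently encoded as if the byte really had been Latin-1.
try:
    text_escaped.encode('utf-8')
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False

print(ascii(text_escaped))
print(ascii(text_latin1))
print(strict_utf8_ok)
```

Roughly speaking, the first recipe keeps the unknown bytes visibly marked as "not really text", so mistakes surface as errors, while the second never fails but may quietly produce wrong characters if the file was not Latin-1 after all.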