On 16/02/12 02:39, Oleg Broytman wrote:
> On Wed, Feb 15, 2012 at 11:15:36AM +1100, Ben Finney wrote:
> > If people want to remain wilfully ignorant of text encoding in the third millennium
>
> This returns us to the very beginning of the thread. The original complaint was: Python 3 requires users to learn too much about Unicode, more than they really need.

I don't think it's helpful to label everyone who wants to use the techniques being discussed here as lazy or ignorant. As we've seen, there are cases where you truly *can't* know the true encoding, and at the same time it *doesn't matter*, because all you want to do is treat the unknown bytes as opaque data. To tell someone in that position that they're being lazy is both wrong and insulting.

It seems to me that what surrogateescape is effectively doing is creating a new data type that consists of a mixture of ASCII characters and raw bytes, and enables you to tell which is which. Maybe there should be a real data type like this, or a flag on the unicode type. The data would be stored in the same way as a latin1-decoded string, but anything with the high bit set would be regarded as a byte instead of a character. This might make it easier to interoperate with external libraries that expect well-formed unicode.

--
Greg
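
For anyone following along, here is a rough sketch of the "mixture of ASCII characters and raw bytes" behaviour as it exists today with the surrogateescape error handler. The sample bytes and the helper name is_escaped_byte are made up for illustration; only the codec calls are real Python 3 API.

```python
# Bytes in an unknown encoding (hypothetical sample; 0xE9 is not valid ASCII).
raw = b"caf\xe9.txt"

# Decode as ASCII. Each undecodable byte 0x80-0xFF is mapped to a lone
# surrogate code point in the range U+DC80-U+DCFF instead of raising.
text = raw.decode("ascii", errors="surrogateescape")


def is_escaped_byte(ch):
    """Return True if ch stands for a raw byte smuggled in by surrogateescape."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


# Tell which positions are real characters and which are opaque bytes.
kinds = ["byte" if is_escaped_byte(ch) else "char" for ch in text]

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("ascii", errors="surrogateescape")

# A strict encoder rejects the lone surrogates, which is the interoperability
# problem with code that expects well-formed Unicode.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The string round-trips losslessly, but the lone surrogates make it unsuitable for handing to anything that insists on well-formed Unicode, which is where a dedicated type or flag might help.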