On 16 February 2012 04:08, Steven D'Aprano <steve@pearwood.info> wrote:
On 16/02/12 02:39, Oleg Broytman wrote:
I don't think it's helpful to label everyone who wants to use the techniques being discussed here as lazy or ignorant. As we've seen, there are cases where you truly *can't* know the true encoding, and at the same time it *doesn't matter*, because all you want to do is treat the unknown bytes as opaque data. To tell someone in that position that they're being lazy is both wrong and insulting.
In fairness, this thread was originally started with the scenario "I'm reading files which are only mostly ASCII, but I don't want to learn about Unicode" rather than "I know about Unicode, but it doesn't help me in this situation because the encoding truly is unknown". So wilful ignorance does apply, at least in the use-case the thread started with. (If it helps, think of them as too busy to learn, not too lazy.)
As the person who started the thread with this use case, I'd dispute that description of what I said. To restate it: "I'm reading files which are mostly ASCII, but not all. I know that I should identify the encoding, and what to do if I did know the encoding, but I'm not sure how to find out reliably what the encoding is. Also, the problem doesn't really warrant investing the time needed to research means of doing so - given that I don't need to process the non-ASCII, I just want to avoid decoding errors and not corrupt the data."

I'm not lazy; I've done a cost/benefit analysis and determined that my limited knowledge should be enough. Experience with other tools, which aren't as strict as Python 3 on Unicode matters, confirms that a "good enough" job does satisfy my needs.

And I'm not willfully ignorant. I actually have a good feel for Unicode and the issues involved, and I certainly know what's right. I've just found that everything I've read assumes that "knowing the encoding" isn't hard - and my experience differs, so I don't know where to go for answers. Add to this the fact that I *know* I've seen supposed text files with mixed-encoding content, and no-one has *ever* explained how to handle that (it's basically a damaged file, so all the "right way to deal with Unicode" discussions ignore it), even though tools like grep and awk do a perfectly acceptable job to the level I care about.

I'm very pleased with the way this thread has gone, because it has answered all of the questions I've had about "nearly-ASCII" text files. But there's no way I'd have expected to spend this much time, and involve this many other people with more knowledge than me, just to handle my original changelog-parsing problem that I could do in awk or Python 2 in about 5 minutes. Now I could also do it in Python 3 - but before this thread, I couldn't.

Hopefully the knowledge from this thread can be captured so that other people can avoid my dilemma.

OK, so maybe I do feel somewhat insulted...
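[Editor's note: the "avoid decoding errors and don't corrupt the data" requirement Paul describes is what Python 3's `surrogateescape` error handler (PEP 383) was designed for, and it is the usual answer such threads converge on. A minimal sketch - the filename and sample bytes are invented for illustration:]

```python
import os
import tempfile

# A mostly-ASCII changelog containing one stray Latin-1 byte (\xe9).
data = b"Version 1.2\n- fixed caf\xe9 bug\n"

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "changelog.txt")
with open(path, "wb") as f:
    f.write(data)

# Decoding never raises: the undecodable \xe9 byte is smuggled through
# as the lone surrogate U+DCE9 instead of causing UnicodeDecodeError.
with open(path, "r", encoding="ascii", errors="surrogateescape") as f:
    text = f.read()

# Ordinary ASCII-level processing works as normal.
versions = [line for line in text.splitlines() if line.startswith("Version")]

# Writing back with the same handler restores the original bytes exactly,
# so the non-ASCII content passes through uncorrupted.
out = os.path.join(tmpdir, "copy.txt")
with open(out, "w", encoding="ascii", errors="surrogateescape") as f:
    f.write(text)

with open(out, "rb") as f:
    assert f.read() == data  # exact round trip, no data loss
```

The same `errors="surrogateescape"` argument works with any codec, so the pattern also covers "probably UTF-8 but occasionally not" files.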
Cheers, Paul.