[Python-ideas] Python 3000 TIOBE -3%

Thu Feb 16 13:59:26 CET 2012

On 16 February 2012 04:08, Steven D'Aprano <steve at pearwood.info> wrote:
> On 16/02/12 02:39, Oleg Broytman wrote:
>> I don't think it's helpful to label everyone who wants to use the
>> techniques being discussed here as lazy or ignorant. As we've seen,
>> there are cases where you truly *can't* know the true encoding,
>> and at the same time it *doesn't matter*, because all you want to
>> do is treat the unknown bytes as opaque data. To tell someone in
>> that position that they're being lazy is both wrong and insulting.
>
> In fairness, this thread was originally started with the scenario "I'm
> reading files which are only mostly ASCII, but I don't want to learn
> about Unicode" rather than "I know about Unicode, but it doesn't help me
> in this situation because the encoding truly is unknown". So wilful
> ignorance does apply, at least in the use-case the thread started with.
> (If it helps, think of them as too busy to learn, not too lazy.)

As the person who started the thread with this use case, I'd dispute
that description of what I said.

To restate it: "I'm reading files which are mostly ASCII but not all. I
know that I should identify the encoding, and what to do if I did know
the encoding, but I'm not sure how to find out reliably what the
encoding is. Also, the problem doesn't really warrant investing the
time needed to research means of doing so - given that I don't need to
process the non-ASCII, I just want to avoid decoding errors and not
corrupt the data".

I'm not lazy, I've just done a cost/benefit analysis and determined
that my limited knowledge should be enough. Experience with other
tools which aren't as strict as Python 3 on Unicode matters confirms
that a "good enough" job does satisfy my needs. And I'm not willfully
ignorant, I actually have a good feel for Unicode and the issues
involved, and I certainly know what's right. I've just found that
everything I've read assumes that "knowing the encoding" isn't hard -
and my experience differs, so I don't know where to go for answers.

Add to this the fact that I *know* I've seen supposed text files with
mixed encoding content, and no-one has *ever* explained how to handle
that (it's basically a damaged file, and so all the "right way to deal
with Unicode" discussions ignore it) even though tools like grep and
awk do a perfectly acceptable job to the level I care about.
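
For illustration only, a minimal sketch of the grep-like behaviour
described above, assuming the lines of interest can be identified by
an ASCII marker. The function name grep_bytes and its parameters are
hypothetical and not something proposed in this thread. The file is
read in binary mode, so nothing is ever decoded and mixed-encoding
content passes through untouched.

```python
def grep_bytes(path, needle):
    """Return the raw lines of `path` that contain the bytes `needle`.

    The file is opened in binary mode, so no decoding takes place and
    lines in different (or unknown) encodings cannot raise an error.
    """
    with open(path, "rb") as f:
        # Iterating a binary file yields bytes objects split on b"\n".
        return [line for line in f if needle in line]
```

The returned lines are bytes objects, so they can be written to
another file opened in "wb" mode without any change to the non-ASCII
content.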

I'm very pleased with the way this thread has gone, because it has
answered all of the questions I've had about "nearly-ASCII" text
files. But there's no way I'd have expected to spend this much time,
and involve this many other people with more knowledge than me, just
to handle my original changelog-parsing problem that I could do in awk
or Python 2 in about 5 minutes. Now I could also do it in Python 3,
but back then I couldn't. Hopefully the knowledge from this thread can be
captured so that other people can avoid my dilemma.
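
As a starting point, here is a minimal sketch of one way to meet the
stated requirement (no decoding errors, no corrupted data) in
Python 3. It assumes the parts that need processing are plain ASCII,
and it uses the standard "surrogateescape" error handler. The function
name copy_matching_lines and its parameters are hypothetical, chosen
only for this example.

```python
def copy_matching_lines(src_path, dst_path, prefix):
    """Copy lines starting with `prefix`; return how many were copied.

    errors="surrogateescape" turns each undecodable byte into a lone
    surrogate code point on read and back into the same byte on write,
    so non-ASCII content is carried through unchanged.
    """
    kept = 0
    # newline="" disables newline translation in both directions.
    with open(src_path, encoding="ascii",
              errors="surrogateescape", newline="") as src, \
         open(dst_path, "w", encoding="ascii",
              errors="surrogateescape", newline="") as dst:
        for line in src:
            if line.startswith(prefix):
                dst.write(line)
                kept += 1
    return kept
```

The strings produced this way contain lone surrogates wherever the
input had non-ASCII bytes, so they should only be written back with
the same error handler, not printed or encoded strictly.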

OK, so maybe I do feel somewhat insulted...

Cheers,
Paul.