[Python-ideas] Python 3000 TIOBE -3%

Sun Feb 12 00:24:04 CET 2012

On 11 February 2012 19:46, Masklinn <masklinn at masklinn.net> wrote:
>>>> Besides, it's perfectly possible to process bytes in Python 3. You just
>>>> have to open the file in binary mode and do the processing at the byte
>>>> string level.
>>>
>>> I think that's the route which should be taken
>>
>> Oh, absolutely not. When it's text, it's best to process it as Unicode.
>
> Except it's not processed as text, it's processed as "stuff with ascii
> characters in it". Might just as well be cp-1252, or UTF-8, or Shift JIS
> (which is kinda-sorta-extended-ascii but not exactly), and while using
> an ISO-8859 will yield unicode data that's about the only thing you can
> say about it and the actual result will probably be mojibake either way.

No, not at all. It *is* text. I *know* it's text. I know that it is
encoded in an ASCII superset (because I can read it in a text editor
and *see* that it is). What I *don't* know is what those funny bits of
mojibake I see in the text editor are. But I don't really care. Yes, I
could do some analysis based on the surrounding text and confirm
whether it's latin-1, utf-8, or something similar. But it honestly
doesn't matter to me, as all I care about is parsing the file to find
the change authors, and printing their names (to re-use the
"manipulating a ChangeLog file" example). And even if it did matter,
the next file might be in a different ASCII-superset encoding, but I
*still* won't care because the parsing code will be exactly the same.

Saying "it's bytes" is even more of a lie than "it's latin-1". The
honest truth is "it's an ASCII superset", and that's all I need to
know to do the job manually, so I'd like to write code to do the same
job without needing to lie about what I know. I'm now 100% convinced
that encoding="ascii", errors="surrogateescape" is the way to say this
in code.
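
A minimal sketch of that approach, for illustration only. The header
layout ("YYYY-MM-DD  Author Name  <email>"), the regex, the sample data
and the names change_authors, original_bytes and demo are all
assumptions made up for this example, not taken from any real tool. The
sample deliberately mixes a Latin-1 entry and a UTF-8 entry in one file
to show that the parsing code does not need to know which is which.

```python
import os
import re
import tempfile

# Assumed header shape: "YYYY-MM-DD  Author Name  <email>" (illustrative only).
HEADER = re.compile(r"^\d{4}-\d{2}-\d{2}\s+(.+?)\s+<[^>]*>\s*$")


def change_authors(path):
    """Collect author names from ChangeLog-style header lines."""
    authors = []
    # Non-ASCII bytes decode to lone surrogates instead of raising an error.
    with open(path, encoding="ascii", errors="surrogateescape") as f:
        for line in f:
            match = HEADER.match(line)
            if match:
                authors.append(match.group(1))
    return authors


def original_bytes(text):
    """Turn escaped surrogates back into the exact bytes read from disk."""
    return text.encode("ascii", errors="surrogateescape")


def demo():
    """Parse a file whose two entries use different ASCII-superset encodings."""
    name = "Jos\u00e9 Garc\u00eda"
    data = b"".join([
        b"2012-02-11  ", name.encode("latin-1"), b"  <jose@example.org>\n",
        b"\t* foo.c: Fix a bug.\n",
        b"\n",
        b"2012-02-10  ", name.encode("utf-8"), b"  <jose@example.org>\n",
        b"\t* bar.c: Fix another bug.\n",
    ])
    # Write the raw bytes to a temporary file, then parse it by path.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        return [original_bytes(author) for author in change_authors(tmp.name)]
    finally:
        os.remove(tmp.name)
```

Calling demo() returns each author name as the same bytes that were in
the file, whichever encoding they were written in. One caveat: the
decoded names contain lone surrogates, so passing them straight to
print() will typically raise UnicodeEncodeError. Encoding them back with
the same error handler and writing the bytes to sys.stdout.buffer avoids
that.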

Paul.