[Python-ideas] Python 3000 TIOBE -3%

Sat Feb 11 20:46:52 CET 2012

On 2012-02-11, at 20:35 , Stefan Behnel wrote:
> 
>> Yes, but now instead of just ignoring that stuff you have to actively and
>> knowingly lie to Python to get it to shut up.
> 
> The advantage is that it becomes explicit what you are doing. In Python 2,
> without any encoding, you are implicitly assuming that the encoding is
> Latin-1, because that's how you are processing it. You're just not spelling
> it out anywhere, thus leaving it to the innocent reader to guess what's
> happening. In Python 3, and in better Python 2 code (using codecs.open(),
> for example), you'd make it clear right in the open() call that Latin-1 is
> the way you are going to process the data.

I'm not sure going from "ignoring it" to "explicitly lying about it" is a
great step forward. latin-1 is not "the way you are going to process the data"
in this case, it's just the easiest way to get Python to shut up and open the
damn thing.
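
To illustrate why latin-1 works as a "shut up" switch, here is a minimal
sketch (the names all_bytes and as_text are just for illustration): latin-1
maps each of the 256 byte values to a code point, so decoding can never
fail and always round-trips, whatever the data really is.

```python
# Every possible byte value, as a stand-in for "data in some unknown encoding".
all_bytes = bytes(range(256))

# latin-1 assigns a code point to each byte, so this cannot raise.
as_text = all_bytes.decode("latin-1")

# Encoding back gives the original bytes: nothing is lost, nothing is learned.
round_tripped = as_text.encode("latin-1")

# A codec that actually validates its input rejects the same data.
try:
    all_bytes.decode("utf-8")
    utf8_accepted = True
except UnicodeDecodeError:
    utf8_accepted = False

print(round_tripped == all_bytes, utf8_accepted)
```

So a successful decode says nothing about whether the file was latin-1 in
the first place.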

>>> Besides, it's perfectly possible to process bytes in Python 3. You just
>>> have to open the file in binary mode and do the processing at the byte
>>> string level.
>> 
>> I think that's the route which should be taken
> 
> Oh, absolutely not. When it's text, it's best to process it as Unicode.

Except it's not processed as text, it's processed as "stuff with ASCII
characters in it". It might just as well be cp1252, or UTF-8, or Shift JIS
(which is kinda-sorta-extended-ASCII but not exactly). Decoding it with an
ISO-8859 codec will yield unicode data, but that's about the only thing you
can say about it, and the actual result will probably be mojibake either way.
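
A small sketch of the mojibake in question, assuming a hypothetical sample
value that happens to be UTF-8 on disk (the reader of the file would not
know that):

```python
# Hypothetical sample: the writer used UTF-8, the reader assumes latin-1.
original = "name=caf\u00e9"            # "name=café"
raw = original.encode("utf-8")         # b'name=caf\xc3\xa9'

decoded = raw.decode("latin-1")        # succeeds, but is not the original text

# The ASCII part survives the wrong guess...
ascii_part_ok = decoded.startswith("name=caf")
# ...while the non-ASCII part turns into two unrelated characters.
same_text = decoded == original

print(repr(decoded), ascii_part_ok, same_text)
```

The result is a perfectly valid unicode string; it just isn't the text
that was written.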

Processing it as bytes makes it explicit that this is not known and
decoded text (which is what unicode strings imply) but data in some
semi-arbitrary ASCII-compatible encoding, and that this is the extent of
the developer's knowledge of and interest in it.
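
For what it's worth, a rough sketch of the bytes route in Python 3. The
"key=value" input and the parse_pairs helper are made up for the example;
the only assumption about the data is that the delimiters are ASCII.

```python
import io

# Hypothetical input: ASCII structure, values in unknown (here even mixed)
# encodings. In real code this would be open(path, "rb").
data = b"name=caf\xc3\xa9\n# a comment\ncity=K\xf6ln\n"


def parse_pairs(stream):
    """Split key=value lines at the byte level, never decoding the values."""
    pairs = {}
    for line in stream:
        line = line.rstrip(b"\r\n")
        if not line or line.startswith(b"#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(b"=")
        if sep:  # ignore lines without an "="
            pairs[key] = value
    return pairs


pairs = parse_pairs(io.BytesIO(data))
print(pairs)
```

The values stay bytes from start to finish, so nothing in the code claims
to know what encoding they are in.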