[Python-ideas] Python 3000 TIOBE -3%

Sat Feb 11 18:25:49 CET 2012

On 2/11/2012 7:53 AM, Stefan Behnel wrote:
> Masklinn, 11.02.2012 13:41:
>> On 2012-02-11, at 13:33 , Stefan Behnel wrote:
>>> Paul Moore, 11.02.2012 11:47:
>>>> On 11 February 2012 00:07, Terry Reedy wrote:
>>>>>>> Nor is there in 3.x.
>>>>>
>>>>> I view that claim as FUD, at least for many users, and at least until the
>>>>> persons making the claim demonstrate it. In particular, I claim that people
>>>>> who use Python2 knowing nothing of unicode do not need to know much more to
>>>>> do the same things in Python3.
>>>>
>>>> Concrete example, then.
>>>>
>>>> I have a text file, in an unknown encoding (yes, it does happen to
>>>> me!) but opening in an editor shows it's mainly-ASCII. I want to find
>>>> all the lines starting with a '*'. The simple
>>>>
>>>> with open('myfile.txt') as f:
>>>>     for line in f:
>>>>         if line.startswith('*'):
>>>>             print(line)
>>>>
>>>> fails with encoding errors. What do I do? Short answer, grumble and go
>>>> and use grep (or in more complex cases, awk) :-(
>>>
>>> Or just use the ISO-8859-1 encoding.
>>
>> It's true that this requires handling encodings up front, where Python 2
>> allowed you to play fast and loose, though.
>
> Well, except for the cases where that didn't work. Remember that implicit
> encoding behaves in a platform-dependent way in Python 2, so even if your
> code runs on your machine, that doesn't mean it will work for anyone else.
>
>
>> And using latin-1 in that context looks and feels weird/icky: the file is not
>> encoded using latin-1; the encoding just happens to work to manipulate bytes as
>> ASCII text + non-ASCII stuff.
>
> Correct. That's precisely the use case described above.
>
> Besides, it's perfectly possible to process bytes in Python 3. You just
> have to open the file in binary mode and do the processing at the byte
> string level. But if you don't care (and if most of the data is really
> ASCII-ish), using the ISO-8859-1 encoding in and out will work just fine
> for problems like the above.

If one has ascii text + unspecified 'other stuff', one can either 
process as 'polluted text' or as 'bytes with some ascii character 
codes'. Since (as I just found out) one can iterate binary mode files by 
line just as with text mode, I am not sure what the tradeoffs are. I 
would guess it is mostly whether one wants to process a sequence of 
characters or a sequence of character codes (ints).
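
A minimal sketch of the two options side by side, in case it helps the 
comparison. The sample bytes and the function names are made up for 
illustration; the only assumptions are that the file is mostly ASCII 
and that the non-ASCII bytes are in no known encoding.

```python
import os
import tempfile

# Hypothetical sample data: mostly ASCII, plus bytes that are not valid UTF-8.
SAMPLE = b"* first item\nplain line \xe9\xff\n* second \x80 item\n"


def starred_lines_binary(path):
    """Return the lines starting with '*' as bytes, with no decoding at all."""
    with open(path, "rb") as f:
        # Binary-mode files iterate by line too, splitting on b"\n".
        return [line for line in f if line.startswith(b"*")]


def starred_lines_latin1(path):
    """Return the lines starting with '*' as str, decoded as ISO-8859-1.

    ISO-8859-1 maps every byte 0-255 to the code point with the same number,
    so decoding cannot fail. newline="" turns off newline translation, so the
    line endings come through unchanged.
    """
    with open(path, encoding="iso-8859-1", newline="") as f:
        return [line for line in f if line.startswith("*")]


def demo():
    """Write SAMPLE to a temporary file and run both readers over it."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(SAMPLE)
        return starred_lines_binary(path), starred_lines_latin1(path)
    finally:
        os.remove(path)


as_bytes, as_text = demo()
```

Both calls pick out the same two lines, and encoding the str results 
back with ISO-8859-1 reproduces the original bytes exactly. The visible 
difference is in what indexing gives you: as_bytes[0][0] is the int 42, 
while as_text[0][0] is the one-character string '*'.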

-- 
Terry Jan Reedy