[Python-ideas] Python 3000 TIOBE -3%

Sat Feb 11 15:29:34 CET 2012

On 11 February 2012 12:41, Masklinn <masklinn at masklinn.net> wrote:
>>> with open('myfile.txt') as f:
>>>    for line in f:
>>>        if line.startswith('*'):
>>>            print(line)
>>>
>>> fails with encoding errors. What do I do? Short answer, grumble and go
>>> and use grep (or in more complex cases, awk) :-(
>>
>> Or just use the ISO-8859-1 encoding.
>
> It's true that this requires you to handle encodings upfront, where Python 2
> allowed you to play fast-and-loose, though.
>
> And using latin-1 in that context looks and feels weird/icky, the file is not
> encoded using latin-1, the encoding just happens to work to manipulate bytes as
> ascii text + non-ascii stuff.

To be honest, I'm fine with the answer "use latin1" for this case.
Practicality beats purity and all that. But as you say, it feels wrong
somehow. I suspect that errors='surrogateescape' is the better "I don't
really care" option. And I still maintain it would be useful for
combating FUD if there was a commonly-accepted idiom for this.
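
For what it's worth, here is a rough sketch of what such an idiom might look
like, assuming the file is mostly ASCII with some bytes in an unknown encoding.
The helper name starred_lines and the demo file contents are made up for
illustration; only open() with errors='surrogateescape' is the actual point.

```python
import os
import tempfile


def starred_lines(path, encoding="ascii"):
    """Return the lines starting with '*', tolerating undecodable bytes.

    Bytes that do not decode are smuggled through as lone surrogates
    (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    """
    with open(path, encoding=encoding, errors="surrogateescape") as f:
        return [line for line in f if line.startswith("*")]


# Demo data (hypothetical): one line carries a non-ASCII byte, 0xE9.
raw = b"* caf\xe9\nplain\n* ok\n"
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

matches = starred_lines(demo_path)
os.remove(demo_path)

# Encoding with the same error handler gives back the original bytes.
round_tripped = "".join(matches).encode("ascii", errors="surrogateescape")
```

One caveat: print() on such a line can itself fail with UnicodeEncodeError
if stdout uses a strict error handler, so writing the matches back out as
bytes (as in the last line) is probably the safer way to finish the job.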

Interestingly, on my Windows PC, if I open a file without specifying an
encoding in Python 3, I seem to get code page 1252:

Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open("unicode.txt")
>>> f.encoding
'cp1252'
>>>

So actually, on this PC, it's hard to provoke these sorts of decoding
errors (CP1252 accepts nearly all bytes, it's basically latin1, although
Python's codec leaves five byte values in the 0x80-0x9F range undefined).
Whether this is a good thing or a bad thing, I'm not sure :-)
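
A quick way to check how close cp1252 really is to latin1 is to try decoding
every possible byte value with both codecs. This is just a sketch: the
variable names are mine, and what locale.getpreferredencoding(False) reports
as the default will differ from machine to machine.

```python
import locale

# latin-1 maps each of the 256 byte values to a code point, so it never fails.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")

# cp1252 is checked one byte at a time, collecting the values it rejects.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

# The locale's preferred encoding, which open() has traditionally used
# when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)

print(default_encoding, [hex(v) for v in undefined_in_cp1252])
```

So cp1252 will still raise on a handful of byte values, whereas latin1
never does.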

Paul