[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Andrew Barnert abarnert at yahoo.com
Mon Jun 10 00:57:08 CEST 2013


From: Oleg Broytman <phd at phdru.name>

Sent: Sunday, June 9, 2013 3:20 PM


> On Sun, Jun 09, 2013 at 02:52:32PM -0700, Andrew Barnert 
> <abarnert at yahoo.com> wrote:
>>  There's definitely a case to be made for implementing some kind of
>>  Notepad-like heuristics in Python. It would be great to be able to do
>>  this at the interactive interpreter:
>> 
>>  line = text.partition('\n')[0]
>>  for encoding in codecs.guess(text)[:10]:
>>      print(encoding, line.decode(encoding))
>> 
>>  In fact, if you wrote that and pushed it to PyPI I'd start using it
>>  today, and maybe even lobbying for its inclusion in the stdlib.
> 
>    Chardet (and variants)?
> https://pypi.python.org/pypi?%3Aaction=search&term=chardet&submit=search


Since this is now the third such reply I've gotten, I'll reply to the list.

chardet2 is great. But chardet2 doesn't do Notepad-like heuristics. It doesn't consider your system and OEM charsets, it doesn't understand Microsoft's nonstandard UTF-8 BOM rules, it doesn't detect EBCDIC, etc. (Nor does it contain the helpful message "This version of OS/2 or MS-DOS does not support EBCDIC. Please contact IBM for support.", but I don't think WRITE.EXE has that message either nowadays…)
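For illustration only, here is a minimal sketch of the kind of helper described above. It is not what Notepad actually does, and it is not part of chardet2 or the stdlib: the name guess_encodings, the "cp437" default standing in for the OEM charset, and the ordering of candidates are all assumptions. It checks for a BOM first, then tries UTF-8, the system's preferred encoding, and the OEM code page, keeping only the encodings that decode the data without error.

```python
import codecs
import locale

# BOM table, longest signatures first so UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encodings(data, oem="cp437"):
    """Return encodings that can decode `data` (bytes), likeliest first.

    Hypothetical helper: `oem` is an assumed stand-in for the OEM code
    page, which a real implementation would query from the system.
    """
    candidates = []
    for bom, name in _BOMS:
        if data.startswith(bom):
            candidates.append(name)
            break
    candidates += ["utf-8", locale.getpreferredencoding(False), oem]

    result = []
    for name in candidates:
        try:
            # Normalize aliases ("UTF-8" vs "utf-8") before deduplicating.
            canonical = codecs.lookup(name).name
            data.decode(canonical)
        except (UnicodeError, LookupError):
            continue
        if canonical not in result:
            result.append(canonical)
    return result


if __name__ == "__main__":
    data = "na\u00efve caf\u00e9\nsecond line".encode("utf-8")
    for encoding in guess_encodings(data)[:10]:
        # Decode first, then split, so multi-byte encodings are not cut
        # in the middle of a code unit.
        print(encoding, data.decode(encoding).partition("\n")[0])
```

Unlike chardet-style statistical detection, this only reports which encodings are possible, not which is most probable, which is why printing the first line under each candidate is still useful.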

I sometimes use both chardet2 (from the command line) and Notepad or WordPad on the same file when trying to puzzle things out, and I'd like to have access to both from the interactive interpreter. I don't think it would be necessary, or likely even reasonable, to add the MS heuristics to chardet2. And if they were added, I'm not sure what I'd want the API to look like (one function with options?). That would actually be a problem if we got both into the stdlib, so I probably shouldn't have suggested that.

Anyway, sorry for not making that clear in the first place.

