On 10Aug2016 1146, Random832 wrote:
> On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
>> To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
>> - change sys.getfilesystemencoding() to return 'utf-8'
>> - automatically decode byte paths assuming they are utf-8
>> - remove the deprecation warning on byte paths
> Why? What's the use case?
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
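A minimal sketch of what that enables, assuming the proposal lands: a library can manipulate paths purely as bytes on every platform, and `os.fsencode()`/`os.fsdecode()` round-trip losslessly once `sys.getfilesystemencoding()` is 'utf-8' on Windows too. The `backup_name` helper is hypothetical, just to illustrate the pattern.

```python
import os

def backup_name(path: bytes) -> bytes:
    # Pure bytes manipulation; works unchanged on POSIX and, under the
    # proposal, on Windows, with no decode/encode step in between.
    root, ext = os.path.splitext(path)
    return root + b'.bak' + ext

print(backup_name(b'/tmp/data.txt'))  # b'/tmp/data.bak.txt'
```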
>> - make the default open() encoding check for a BOM or else use utf-8
>> - [ALTERNATIVE] make the default open() encoding check for a BOM or else
> For reading, I assume. When opened for writing, it should probably be utf-8-sig [if it's not mbcs] to match what Notepad does. What about files opened for appending or updating? In theory it could ingest the whole file to see if it's valid UTF-8, but that has a time cost.
Writing out the BOM automatically basically makes your files incompatible with other platforms, which rarely expect a BOM. By omitting it but writing and reading UTF-8 we ensure that Python can handle its own files on any platform, while potentially upsetting some older applications on Windows or platforms that don't assume UTF-8 as a default.
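The asymmetry being argued for can be demonstrated with the stdlib codecs: 'utf-8-sig' writes a BOM (Notepad's behaviour) while plain 'utf-8' does not, but *reading* with 'utf-8-sig' strips a BOM if present and is a no-op otherwise, so it accepts both forms.

```python
import codecs

with_bom = codecs.BOM_UTF8 + 'hello'.encode('utf-8')
without_bom = 'hello'.encode('utf-8')

# 'utf-8-sig' tolerates both BOM'd and BOM-less input:
assert with_bom.decode('utf-8-sig') == 'hello'
assert without_bom.decode('utf-8-sig') == 'hello'

# Plain 'utf-8' leaks the BOM into the decoded text:
assert with_bom.decode('utf-8') == '\ufeffhello'
```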
> Notepad, if there's no BOM, checks the first 256 bytes of the file for whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK], and can get it wrong for certain very short files [i.e. the infamous "this app can break"]
Yeah, this is a pretty horrible idea :) I don't want to go there by default, but people can install chardet if they want the functionality.
> What to do on opening a pipe or device? [Is os.fstat able to detect these cases?]
We should be able to detect them, but why treat them any differently from a file? Right now they're just as broken as they will be after the change if you aren't specifying 'b' or an encoding - probably more broken, since at least you'll get fewer encoding errors once the default encoding is UTF-8.
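To answer the bracketed question: yes, os.fstat() can classify a file descriptor. A sketch of how a stream-setup path could distinguish the cases (the `stream_kind` helper is illustrative, not an actual CPython function):

```python
import os
import stat

def stream_kind(fd: int) -> str:
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return 'pipe'
    if stat.S_ISCHR(mode):
        return 'device'   # console, serial port, NUL/null device, ...
    if stat.S_ISREG(mode):
        return 'file'
    return 'other'
```

For example, `stream_kind(sys.stdin.fileno())` reports 'device' for an interactive console and 'pipe' when stdin is redirected from another process.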
> Maybe the BOM detection phase should be deferred until the first read. What should encoding be at that point if this is done? Is there a "utf-any" encoding that can handle all five BOMs? If not, should there be? How are "utf-16" and "utf-32" files opened for appending or updating handled today?
Yes, I think it would be. I suspect we'd have to leave the encoding unknown until the first read, and perhaps force it to utf-8-sig if someone asks before we start. I don't *think* this is any less predictable than the current behaviour, given it only applies when you've left out any encoding specification, but maybe it is.
It probably also entails opening the file descriptor in bytes mode, which might break programs that pass the fd directly to CRT functions. Personally I wish they wouldn't, but it's too late to stop them now.
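There's no "utf-any" codec in the stdlib today, but the sniffing step it would perform is small. A hypothetical sketch, mapping each of the five Unicode BOMs to a codec that consumes the BOM itself during decoding; the 4-byte UTF-32 BOMs must be tested before the 2-byte UTF-16 ones, because BOM_UTF32_LE begins with the same bytes as BOM_UTF16_LE.

```python
import codecs

# Order matters: UTF-32 BOMs are supersets of the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8,     'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_encoding(prefix: bytes) -> str:
    """Pick a codec from the file's first bytes; default to utf-8."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    return 'utf-8'
```

The returned names ('utf-16', 'utf-32', 'utf-8-sig') all strip the BOM when decoding, so `data.decode(sniff_encoding(data))` yields clean text for any of the five BOM'd forms.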
>> - force the console encoding to UTF-8 on initialize and revert on
> Why not implement a true unicode console? What if sys.stdin/stdout are pipes (or non-console devices such as a serial port)?
Mostly because it's much more work. As I mentioned in my other post, an alternative would be to bring win_unicode_console into the stdlib and enable it by default (which, considering the package was largely developed on bugs.p.o, is probably okay, but we'd probably need to rewrite it in C, which is basically implementing a true Unicode console).
You're right that changing the console encoding after launching Python is probably going to mess with pipes. We can detect whether the streams are interactive or not and adjust accordingly, but that's going to get messy if you're only piping in/out and stdin/out end up with different encodings. I'll put some more thought into this part.
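The interactive-vs-pipe check itself is straightforward: `isatty()` is true for a real console and false for pipes and redirected files. A hypothetical sketch of per-stream encoding selection (the function name and the 'mbcs' fallback are illustrative assumptions, not the proposal's actual behaviour):

```python
def pick_encoding(stream, console_encoding='utf-8', pipe_encoding='mbcs'):
    # isatty() distinguishes a real console from a pipe or file;
    # only force the console encoding when actually on a console.
    if stream.isatty():
        return console_encoding
    return pipe_encoding
```

As noted above, applying this per-stream means a process piping only stdout could end up with stdin and stdout using different encodings, which is where it gets messy.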