On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
Why? What's the use case? [byte paths]
Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
Okay, how is that use case impacted by it being mbcs instead of utf-8? What about only doing the deprecation warning if non-ascii bytes are present in the value?
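To make the second question concrete, here is a minimal sketch of a warning that fires only when a bytes path contains non-ASCII bytes. The helper name `check_bytes_path` and the warning text are hypothetical, chosen for illustration; nothing here is an existing `os` API.

```python
import warnings


def check_bytes_path(path):
    """Hypothetical helper: warn only if a bytes path has non-ASCII bytes.

    Returns True if a warning was issued, False otherwise.
    """
    # Pure-ASCII bytes decode identically under mbcs and utf-8, so only
    # paths containing bytes above 0x7F are affected by the encoding choice.
    if isinstance(path, bytes) and any(b > 0x7F for b in path):
        warnings.warn(
            "bytes path contains non-ASCII bytes; "
            "its interpretation depends on the filesystem encoding",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False
```

Under this sketch, a path such as `b"C:\\temp\\file.txt"` would pass silently, while the UTF-8 encoding of a name with an accented character would trigger the warning.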
For reading, I assume. When opened for writing, it should probably be utf-8-sig [if it's not mbcs] to match what Notepad does. What about files opened for appending or updating? In theory it could ingest the whole file to see if it's valid UTF-8, but that has a time cost.
Writing out the BOM automatically basically makes your files incompatible with other platforms, which rarely expect a BOM.
Yes, but you're not running on other platforms, you're running on the platform you're running on. If files need to be moved between platforms, converting files with a BOM to files without one ought to be the responsibility of the same tool that converts CRLF line endings to LF.
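As a rough sketch of what such a conversion tool might do, assuming it works on the raw bytes of a UTF-8 file (the function name `to_posix_text_bytes` is hypothetical):

```python
import codecs


def to_posix_text_bytes(data):
    """Hypothetical converter: drop a leading UTF-8 BOM and turn CRLF into LF."""
    # codecs.BOM_UTF8 is the three-byte sequence b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Only CRLF pairs are rewritten; lone CR or LF bytes are left alone.
    return data.replace(b"\r\n", b"\n")
```

Input without a BOM or CRLF pairs passes through unchanged, so running it twice gives the same result as running it once.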
By omitting it but writing and reading UTF-8 we ensure that Python can handle its own files on any platform, while potentially upsetting some older applications on Windows or platforms that don't assume UTF-8 as a default.
Okay, you haven't addressed updating and appending. I realized after posting that updating should be in binary, but that leaves appending. Should we detect BOMs and/or attempt to detect the encoding by other means in those cases?
Notepad, if there's no BOM, checks the first 256 bytes of the file for whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK], and can get it wrong for certain very short files [i.e. the infamous "this app can break"]
Yeah, this is a pretty horrible idea :)
Eh, maybe for the utf-16 case, because it can give some hilariously bad results, but using it to differentiate between utf-8 and mbcs might not be so bad. But what to do if all we see is ascii?
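A minimal sketch of that kind of utf-8 versus mbcs guess, assuming we only look at a leading sample of the file. The name `guess_encoding` and its return values are hypothetical; returning `None` for all-ASCII input simply marks the open question above rather than answering it.

```python
import codecs


def guess_encoding(sample):
    """Hypothetical sniffer: guess between utf-8 and mbcs from leading bytes.

    Returns "utf-8-sig", "utf-8", "mbcs", or None when the sample is
    pure ASCII and therefore gives no evidence either way.
    """
    if sample.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if all(b < 0x80 for b in sample):
        return None  # ambiguous: ASCII is valid in both encodings
    # Use an incremental decoder with final=False so that a multi-byte
    # sequence cut off at the end of the sample is not treated as an error.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return "mbcs"
    return "utf-8"
```

This only reports a guess as a string; the `mbcs` codec itself is available only on Windows, so nothing here tries to decode with it.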
What to do on opening a pipe or device? [Is os.fstat able to detect these cases?]
We should be able to detect them, but why treat them any differently from a file?
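For reference, a small sketch of how a file descriptor could be classified with `os.fstat` and the `stat` module. The function name `describe_fd` is hypothetical, and how reliably the mode bits are populated for pipes and devices on Windows is an assumption that would need checking there.

```python
import os
import stat


def describe_fd(fd):
    """Hypothetical helper: classify a file descriptor by its st_mode."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"
```

For example, the read end of `os.pipe()` on a POSIX system is reported as a pipe, and a descriptor from an ordinary temporary file as a regular file.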
Eh, I was mainly concerned about what happens if the first few bytes aren't a BOM. What about blocking while waiting for them? But if this is delayed until the first read then it's fine.
It probably also entails opening the file descriptor in bytes mode, which might break programs that pass the fd directly to CRT functions. Personally I wish they wouldn't, but it's too late to stop them now.
The only thing O_TEXT does that O_BINARY doesn't is convert CRLF line endings (and maybe end on ^Z), and I don't think we even expose the constants for the CRT's unicode modes.