Chris Barker writes:
Sure -- but it's entirely unnecessary, yes? If you don't change your code, you'll get py2(bytes) strings as paths in py2, and py3 (Unicode) strings as paths on py3. So different, yes. But wouldn't it all work?
The difference is that if you happen to have a file name on Unix that is *not* encoded in the default locale, bytes Just Works, while Something Bad happens with unicode (mixing Python 3 and Python 2 terminology for clarity). Also, in Python the C/POSIX default locale implied a codec of 'ascii', which is quite risky nowadays, so using unicode meant always being conscious of encodings.
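To make the failure mode concrete, here is a minimal sketch. It assumes a file name that was written as Latin-1 bytes and is then read under a UTF-8 or ASCII locale codec; `raw_name` and `try_decode` are made-up names for illustration only.

```python
# Hypothetical file name: "cafe.txt" with an e-acute, stored as Latin-1 bytes.
raw_name = b"caf\xe9.txt"

# Treated as bytes, the name passes through untouched.
is_txt = raw_name.endswith(b".txt")


def try_decode(name, codec):
    """Return the decoded name, or None if the bytes are invalid for codec."""
    try:
        return name.decode(codec)
    except UnicodeDecodeError:
        return None


# Strict decoding with the wrong codec is the "Something Bad" case.
utf8_result = try_decode(raw_name, "utf-8")
ascii_result = try_decode(raw_name, "ascii")

# Decoding only succeeds if you already know the right codec.
latin1_result = try_decode(raw_name, "latin-1")
```

The bytes version never has to ask which codec applies; the text version fails unless the codec is chosen correctly.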
So folks are making an active choice to change their code to get some perceived (real?) performance benefit???
No, they're making a passive choice to not fix whut ain't broke nohow, but in Python 3 is spelled differently. It's the same order of change as "print stuff" (Python 2) to "print(stuff)" (Python 3), except that it's not as automatic. (Ie, whereas print is *always* a function call in Python 3, in a Python 2 -> 3 port you're often, but not always, better off with str than bytes, especially before PEP 461 "% formatting for bytes".)
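For reference, a small sketch of what PEP 461 made possible, assuming Python 3.5 or later. The header and path values are invented for the example.

```python
# Bytes %-formatting (PEP 461, Python 3.5+): no decode/encode round trip.
length = 42
header = b"Content-Length: %d\r\n" % length

# %s accepts bytes-like objects, so a bytes path can be interpolated directly.
path = b"/tmp/data"  # hypothetical path
request = b"GET %s HTTP/1.1\r\n" % path

# A str argument is rejected rather than silently encoded.
try:
    b"%s" % "text"
    str_accepted = True
except TypeError:
    str_accepted = False
```

Before 3.5, code like this had to be rewritten with concatenation or a detour through str, which is part of why the port was not automatic.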
However, as I understand it, py3 string paths did NOT "just work" in place of py2 paths before surrogate pairs were introduced (when was that?)
I'm not sure what you're referring to. Python 2 unicode and Python 3 str have been capable of representing (for values of "representing" that require appropriate choice of I/O codecs) the entire repertoire of Unicode since version 1.6 [sic!]. I suppose you mean PEP 383 (implemented in Python 3.1), which added a pseudo-encoding for unencodable bytes, ie, the surrogateescape error handler. This was never a major consideration in practice, however, as you could always get basically the same effect with the 'latin-1' codec. That is, the surrogateescape handler is primarily of benefit to those who are already convinced that fully conformant Unicode is the way to go. It doesn't make a difference to those who prefer bytes.
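A short sketch of the two round trips mentioned above, using an invented byte string that is not valid UTF-8. It illustrates the codec behavior only and does not touch the file system.

```python
raw = b"caf\xe9.txt"  # hypothetical name; byte 0xE9 is not valid UTF-8 here

# PEP 383: each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
escaped_ok = text == "caf\udce9.txt"

# Encoding with the same handler restores the original bytes exactly.
roundtrip_surrogate = text.encode("utf-8", errors="surrogateescape")

# latin-1 maps every byte to U+0000..U+00FF, so it round-trips any bytes too.
all_bytes = bytes(range(256))
roundtrip_latin1 = all_bytes.decode("latin-1").encode("latin-1")
```

The difference is in what the intermediate str means: with surrogateescape the valid parts are decoded correctly and only the bad bytes are smuggled through, while with 'latin-1' the whole string is just bytes wearing a str costume.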
What I'm getting at is whether there is anything other than inertia that keeps folks using bytes paths in py3 code? Maybe it wouldn't be THAT hard to get folks to make the switch: it's EASIER to port your code to py3 this way!
It's not. First, encoding awareness is real work. If you try to DTRT, you open yourself up to UnicodeErrors anywhere in your code where there's a Python/rest-of-world boundary. If you just use bytes, you may be producing garbage, but your program doesn't stop running, and you can always argue it's either your upstream's or your downstream's fault. I *personally* have always found the work to be worthwhile, as my work always involves "real" text processing, and frequently not in pure ASCII.

Second, there are a lot of low-level use cases where (1) efficiency matters and (2) all the processing actually done involves switching on byte values in the range 32-126. It makes sense to do that work on bytes, wouldn't you say?<wink/> And to make the switch cases easier to read, it's common practice to form (or contort) those bytes into human words. These cases include a lot of the familiar acronyms: SMTP, HTTP, DNS, VCS, VM (as in "bytecode interpreter"), ... and the projects are familiar: Twisted, Mercurial, ....

Bottom line: I'm with you! I think that "filenames are text" *should* be the default mode for Python programmers. But there are important use cases where it's sometimes more work to make that work than to make bytes work (on POSIX), and typically those cases also inherit largish, battle-tested code bases that assume a "bytes in, bytes through, bytes out" model. We can't deprecate "filenames as bytes" on POSIX yet, and if we want to encourage participation in projects that use that model by Windows-based programmers, we can't deprecate completely on Windows, either.
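As an illustration of the "switching on byte values" pattern in the second point, here is a rough sketch of a protocol handler that stays in bytes throughout. It is not taken from Twisted, Mercurial, or any real SMTP implementation; `parse_command`, `dispatch`, and the reply strings are invented for the example.

```python
def parse_command(line):
    """Split a line such as b'HELO example.org\\r\\n' into (verb, argument)."""
    verb, _, arg = line.rstrip(b"\r\n").partition(b" ")
    # bytes.upper() only touches ASCII letters; other bytes are left alone.
    return verb.upper(), arg


def dispatch(line):
    """Choose a reply by comparing ASCII verbs; the argument is never decoded."""
    verb, arg = parse_command(line)
    if verb == b"HELO":
        return b"250 hello " + arg
    if verb == b"QUIT":
        return b"221 bye"
    return b"500 unrecognized"


reply = dispatch(b"helo example.org\r\n")

# Bytes in, bytes through, bytes out: undecodable payload is passed along as is.
odd_reply = dispatch(b"HELO \xff\xfe\r\n")
```

The keywords read like words, but nothing here depends on an encoding, so there is no point at which a UnicodeError can be raised.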