On Tue, Sep 30, 2008 at 3:21 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
My concern still is that it brings the bytes type into the status of another character string type, which is really bad, and will require further modifications to Python for the lifetime of 3.x.
I'd like to understand why this is "really bad". I thought it was by design that the str and bytes types behave pretty similarly. You can use both as dict keys.
If they have to behave pretty similarly, they have to be supported in all APIs that deal with text.
I don't see how you get from "pretty similarly" to "all APIs". :-)
For example, people will demand that printing bytes should just copy them onto the stream (rather than invoking repr()), and that writing them onto a text stream should work the same way. GUI libraries should support them, as should the XML libraries, and so on.
Where will you stop, and tell people that bytes are just not supposed to do this or that?
Printing a bytes object already works, and displays its repr(), which is guaranteed to be pure ASCII (unlike the repr() of a unicode str object in Py3k). All the others you mention will cause breakage as they should -- these errors exist to force the programmer to think about encodings or conversions. I don't see that as a big burden because the only way there could be bytes here in the first place is when the user explicitly requested bytes. A program that only ever passes text strings to the os module is only ever going to get text strings back.
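To make the behaviour described above concrete, here is a minimal sketch, assuming a current Python 3 interpreter. The sample byte string and the temporary file name are made up for illustration.

```python
import io
import os
import tempfile

data = b"caf\xc3\xa9"  # UTF-8 bytes for "café" (illustrative value)

# print() on a bytes object shows its repr(), which is pure ASCII.
shown = repr(data)
print(data)

# Mixing bytes and str fails loudly instead of guessing an encoding.
try:
    "name: " + data
except TypeError as exc:
    mix_error = type(exc).__name__

# Writing bytes to a text stream is rejected as well.
text_stream = io.StringIO()
try:
    text_stream.write(data)
except TypeError as exc:
    write_error = type(exc).__name__

# The programmer has to choose an encoding explicitly.
decoded = data.decode("utf-8")

# The os module answers in the type it was asked in:
# str in gives str out, bytes in gives bytes out.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    str_names = os.listdir(tmp)
    bytes_names = os.listdir(os.fsencode(tmp))
```

The errors are the point: bytes only show up where they were requested, and any attempt to treat them as text forces an explicit decode.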
This is because applications will then regularly use byte strings for file names on Unix, and regular strings on Windows, and then expect the program to work the same without further modifications.
It seems that bytes arguments actually *do* work on Windows -- somehow they get decoded. (Unless Terry's report was from 2.x.)
To a limited degree - see my other message. Don't try to listdir a directory with characters outside CP_ACP (it will give you invalid file names).
Understood.
Actually something like that may not be a bad idea. Ian Bicking's webob supports similar double APIs for getting the request parameters out of a request object; I believe request.GET['x'] is a text object and request.GET_str['x'] is the corresponding uninterpreted bytes sequence. I would prefer to have os.environb over os.environ[b"PATH"] though.
And would you keep them synchronized?
Yes, the bytes versions would be the canonical version and the str version would wrap around that -- though updating the str version would also update the bytes version. Some keys would be missing from the str version (or perhaps they would raise exceptions or default to some other error handler, like ignore or replace).
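A rough sketch of how such a synchronized pair could look, with the bytes mapping as the canonical store and a str view wrapped around it. The class name StrEnvironView, the plain dict standing in for the real environment, and the choice of UTF-8 are all assumptions made for illustration; this is not an existing os API. Here undecodable entries are simply treated as missing from the str view.

```python
from collections.abc import MutableMapping


class StrEnvironView(MutableMapping):
    """Hypothetical str view over a canonical bytes -> bytes mapping."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw            # canonical bytes mapping
        self._encoding = encoding  # assumed encoding for the view

    def __getitem__(self, key):
        raw_value = self._raw[key.encode(self._encoding)]
        try:
            return raw_value.decode(self._encoding)
        except UnicodeDecodeError:
            # Undecodable values are treated as missing in the str view.
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # Updating the str view also updates the bytes version.
        self._raw[key.encode(self._encoding)] = value.encode(self._encoding)

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        del self._raw[key.encode(self._encoding)]

    def __iter__(self):
        for raw_key, raw_value in self._raw.items():
            try:
                key = raw_key.decode(self._encoding)
                raw_value.decode(self._encoding)
            except UnicodeDecodeError:
                continue  # skip entries that cannot be shown as text
            yield key

    def __len__(self):
        return sum(1 for _ in self)


# Usage: the bytes dict stays canonical, the view follows it.
raw_env = {b"PATH": b"/usr/bin", b"WEIRD": b"\xff\xfe"}
env = StrEnvironView(raw_env)
env["HOME"] = "/home/example"
```

Raising KeyError is only one of the options mentioned above; a variant could decode with the ignore or replace error handler instead.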
I assume at some point we can stop and have sufficiently low-level interfaces that everyone can agree are in bytes only. Bytes aren't going away. How does Java deal with this? Its File class doesn't seem to deal in bytes at all. What would its listFiles() method do with undecodable filenames?
Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will fail with FileNotFoundException.
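The failure mode is easy to reproduce in Python by decoding with the replace error handler. This is only an analogy to the Java behaviour reported above, not a statement about how the JDK is implemented; the file name is made up and UTF-8 is an assumed file system encoding.

```python
# A file name containing a byte that is not valid UTF-8 (illustrative).
raw_name = b"report-\xff.txt"

# Lossy decoding: the bad byte becomes U+FFFD (REPLACEMENT CHARACTER).
decoded = raw_name.decode("utf-8", errors="replace")

# Encoding the result again does not give back the original bytes,
# so the name no longer refers to the file that is actually on disk.
round_tripped = decoded.encode("utf-8")
name_survived = round_tripped == raw_name
```

Once the original byte is gone there is no way to recover it, which is why opening the decoded name cannot find the file.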
IOW, Java hasn't solved the problem in the last 10 years. Marcin Kowalczyk did a more thorough analysis about a year ago in
http://mail.python.org/pipermail/python-3000/2007-September/010450.html
I can't say I like the Java solution. I would like to be able to write a robust backup tool in Python, even if the code needed to make it work everywhere isn't going to win any prizes (due to the need to use bytes on Unix, str on Windows).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)