Hi,
I am no longer convinced that raising a UnicodeEncodeError on unencodable characters is the right fix for issue #13247. The problem with this solution is that you have to wait until a user gets a UnicodeEncodeError.
I have yet another proposal: emit a warning when a bytes filename is used. This doesn't affect the default behaviour, but you can use -Werror to test whether your program is fully Unicode compliant on Windows (without having to test invalid filenames).
I don't know whether a BytesWarning or a DeprecationWarning is more appropriate. It depends on whether we plan to drop support for bytes filenames on Windows later (in Python 3.5 or later).
List of impacted functions:
os._getfinalpathname(bytes)
os._getfullpathname(bytes)
os._isdir(bytes)
os.access(bytes)
os.chdir(bytes)
os.chmod(bytes)
os.getcwdb()
os.link(bytes, bytes)
os.listdir(bytes)
os.lstat(bytes)
os.mkdir(bytes)
os.readlink(bytes)
os.rename(bytes, bytes)
os.rmdir(bytes)
os.stat(bytes)
os.symlink(bytes, bytes)
os.unlink(bytes)
os.utime(bytes, time)
Note: Unicode filenames are not affected by this change. For example, os.listdir(str) will not emit any warning.
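A minimal sketch of the idea, not the actual patch: the decorator name warn_on_bytes_path and the warning message are hypothetical, and the choice of DeprecationWarning is only one of the two categories under discussion. It wraps os.listdir so that a bytes argument triggers a warning while a str argument does not, and it turns warnings into errors in the same way -Werror would.

```python
import functools
import os
import warnings


def warn_on_bytes_path(func):
    """Hypothetical wrapper: warn when any positional argument is bytes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if any(isinstance(arg, bytes) for arg in args):
            warnings.warn(
                "bytes filenames are deprecated on Windows, use str instead",
                DeprecationWarning,
                stacklevel=2,
            )
        return func(*args, **kwargs)
    return wrapper


# Illustration only: a wrapped stand-in for os.listdir.
listdir = warn_on_bytes_path(os.listdir)


def check(path):
    """Return "warned" if listing `path` triggers the warning, else "ok"."""
    with warnings.catch_warnings():
        # Same effect as running the interpreter with -Werror.
        warnings.simplefilter("error")
        try:
            listdir(path)
        except DeprecationWarning:
            return "warned"
    return "ok"


print(check(b"."))  # bytes filename
print(check("."))   # Unicode filename
```

With warnings left at their default settings, the wrapped function behaves exactly like the original, which is the point of the proposal.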
Victor
On 29/10/2011 9:52 AM, Victor Stinner wrote:
Hi,
I am no longer convinced that raising a UnicodeEncodeError on unencodable characters is the right fix for issue #13247. The problem with this solution is that you have to wait until a user gets a UnicodeEncodeError.
I have yet another proposal: emit a warning when a bytes filename is used. This doesn't affect the default behaviour, but you can use -Werror to test whether your program is fully Unicode compliant on Windows (without having to test invalid filenames).
I don't know whether a BytesWarning or a DeprecationWarning is more appropriate. It depends on whether we plan to drop support for bytes filenames on Windows later (in Python 3.5 or later).
When previously discussing this issue, I was under the impression that the problem was unencodable bytes passed from the Python code to Windows - but the reverse is true - only the data coming back from Windows isn't encodable.
This changes my opinion significantly :) I don't think raising an error is the right choice - there are almost certainly use-cases where the current behaviour works OK and we would break them (eg, not all files in a directory are likely to be unencodable). As the data came externally, the only solution the programmer has is to change to the unicode version of the api - so we recommend the bytes version not be used by anyone, anytime - which means it is conceptually deprecated already.
Therefore, as you imply, I think the solution to this issue is to start the process of deprecating the bytes version of the api in py3k with a view to removing it completely - possibly with a less aggressive timeline than normal. In Python 2.7, I think documenting the issue and a recommendation to always use unicode is sufficient (ie, we can't deprecate it and a new BytesWarning seems gratuitous.)
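For reference, a small sketch of the two flavours of the API being discussed. It assumes nothing beyond the standard library; the file name example.txt is made up for the illustration. The type of the argument selects the type of the result.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that the directory listing is not empty.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # Unicode API: str in, list of str out.
    str_names = os.listdir(tmp)

    # Bytes API: bytes in, list of bytes out.
    bytes_names = os.listdir(os.fsencode(tmp))

print(str_names, bytes_names)
```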
Cheers,
Mark
Therefore, as you imply, I think the solution to this issue is to start the process of deprecating the bytes version of the api in py3k with a view to removing it completely - possibly with a less aggressive timeline than normal. In Python 2.7, I think documenting the issue and a recommendation to always use unicode is sufficient (ie, we can't deprecate it and a new BytesWarning seems gratuitous.)
That sounds all fine to me.
Regards, Martin
"Martin v. Löwis" writes:
Therefore, as you imply, I think the solution to this issue is to start the process of deprecating the bytes version of the api in py3k with a view to removing it completely
That sounds all fine to me.
As quoted above, deprecation of the bytes version of the API sounds fine to me, but isn't this going to run into the usual objections from the "we need bytes for efficiency" crowd? It's OK with me<wink> to say "in this restricted area you must convert to Unicode", but is that going to fly with that constituency?
As quoted above, deprecation of the bytes version of the API sounds fine to me, but isn't this going to run into the usual objections from the "we need bytes for efficiency" crowd? It's OK with me<wink> to say "in this restricted area you must convert to Unicode", but is that going to fly with that constituency?
I don't think this "we need bytes for efficiency" crowd actually exists. We are talking about file names here. The relevant crowd is the "we need bytes for correctness", and that crowd focuses primarily on Unix. It splits into the "we only care about Unix" crowd (A), the "we want correctness everywhere" crowd (B), and the "we want portable code" crowd (C). (A) can accept the deprecation. (B) will support it. Only (C) might protest, as we are going to break their code, hence the deprecation period.
Regards, Martin
On Sun, Oct 30, 2011 at 6:00 PM, "Martin v. Löwis" martin@v.loewis.de wrote:
As quoted above, deprecation of the bytes version of the API sounds fine to me, but isn't this going to run into the usual objections from the "we need bytes for efficiency" crowd? It's OK with me<wink> to say "in this restricted area you must convert to Unicode", but is that going to fly with that constituency?
I don't think this "we need bytes for efficiency" crowd actually exists.
I think that crowd does exist, but I've only ever seen them complain about URLs and other wire protocols (where turnaround time can matter a lot in terms of responsiveness of network applications for short requests, and encode()/decode() cycles can really add up). Filesystem access is dominated by I/O time, and there's often going to be some encoding or decoding going anyway (since the app and the filesystem have to get the data into a common format).
Cheers, Nick.
On 30/10/2011 09:00, "Martin v. Löwis" wrote:
As quoted above, deprecation of the bytes version of the API sounds fine to me, but isn't this going to run into the usual objections from the "we need bytes for efficiency" crowd? It's OK with me<wink> to say "in this restricted area you must convert to Unicode", but is that going to fly with that constituency?
I don't think this "we need bytes for efficiency" crowd actually exists. We are talking about file names here. The relevant crowd is the "we need bytes for correctness", and that crowd focuses primarily on Unix.
Oh, by the way, it is important to know that using Unicode filenames is the best way to write portable programs with Python 3. On UNIX, since Python 3.1, undecodable filenames don't raise Unicode errors: undecodable bytes are stored as surrogates (see PEP 383). So even if the computer is completely misconfigured, it "just works".
On Windows, you must use Unicode for filenames for correctness.
Anyway, with Python 3, it's easier to manipulate Unicode strings than byte strings.
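A short sketch of the PEP 383 mechanism mentioned above, using the surrogateescape error handler directly with an explicit UTF-8 codec so that it does not depend on the local configuration. The sample file name is made up. On UNIX, os.fsdecode() and os.fsencode() apply the same handler with the filesystem encoding.

```python
# Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Decoding does not fail: the undecodable byte becomes the lone
# surrogate U+DCE9 inside the resulting str.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", "surrogateescape")

print(ascii(name), roundtrip == raw)
```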
Martin finally agreed with me, I should hurry to implement my idea! :-)
Victor
On 29/10/2011 07:47, Mark Hammond wrote:
When previously discussing this issue, I was under the impression that the problem was unencodable bytes passed from the Python code to Windows - but the reverse is true - only the data coming back from Windows isn't encodable.
The undecodable filenames issue occurs mostly on os.listdir(bytes) and os.getcwdb().
The unencodable filenames issue affects the rest of my function list.
As the data came externally, the only solution the programmer has is to change to the unicode version of the api - so we recommend the bytes version not be used by anyone, anytime - which means it is conceptually deprecated already.
I proposed raising a Unicode error on undecodable filenames, instead of returning invalid filenames (with question marks), to force the developer to move to the Unicode API. But as I explained in my previous message, you have to wait for a user to hit the problem before being notified of it.
Terry J. Reedy is also concerned about backward compatibility (3.2 -> 3.3). Emitting a warning, disabled by default, is a softer solution :-)
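To illustrate why names with question marks are invalid, here is a rough sketch. It is only an approximation: cp1252 stands in for the Windows ANSI code page, and Python's "replace" error handler stands in for the conversion done by Windows, which may differ in detail. The file names are made up.

```python
# Two distinct Unicode file names that cp1252 cannot represent.
names = ["\u65e5\u672c.txt", "\u4e2d\u6587.txt"]

# Lossy conversion: every unencodable character becomes "?".
encoded = [n.encode("cp1252", "replace") for n in names]

# Strict conversion, the alternative discussed above: fail loudly instead.
try:
    names[0].encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(encoded, strict_failed)
```

Both names collapse to the same bytes, so the result no longer identifies either file.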
Therefore, as you imply, I think the solution to this issue is to start the process of deprecating the bytes version of the api in py3k with a view to removing it completely - possibly with a less aggressive timeline than normal.
If there is a warning, I don't really care about removing the bytes API before Python 4.
PendingDeprecationWarning can be used, or maybe a DeprecationWarning mentioning that the code will stay for a long time.
In Python 2.7, I think documenting the issue and a recommendation to always use unicode is sufficient (ie, we can't deprecate it and a new BytesWarning seems gratuitous.)
Sorry, I don't understand "gratuitous" here: do you mean that a new warning would be annoying, or that it is cheap and useful to add it to Python 2.7.x?
Victor
On 31/10/2011 8:39 AM, Victor Stinner wrote:
On 29/10/2011 07:47, Mark Hammond wrote:
When previously discussing this issue, I was under the impression that the problem was unencodable bytes passed from the Python code to Windows - but the reverse is true - only the data coming back from Windows isn't encodable.
The undecodable filenames issue occurs mostly on os.listdir(bytes) and os.getcwdb().
The unencodable filenames issue affects the rest of my function list.
As the data came externally, the only solution the programmer has is to change to the unicode version of the api - so we recommend the bytes version not be used by anyone, anytime - which means it is conceptually deprecated already.
I proposed raising a Unicode error on undecodable filenames, instead of returning invalid filenames (with question marks), to force the developer to move to the Unicode API. But as I explained in my previous message, you have to wait for a user to hit the problem before being notified of it.
Terry J. Reedy is also concerned about backward compatibility (3.2 -> 3.3). Emitting a warning, disabled by default, is a softer solution :-)
Right - and just to be clear, we are all now agreeing that the UnicodeDecodeError isn't appropriate and a warning will be issued instead?
Therefore, as you imply, I think the solution to this issue is to start the process of deprecating the bytes version of the api in py3k with a view to removing it completely - possibly with a less aggressive timeline than normal.
If there is a warning, I don't really care about removing the bytes API before Python 4.
Agreed - I was trying to say that I think we should start the deprecation process of the bytes API, so a [Pending]DeprecationWarning would then be appropriate. The actual timing of the removal isn't important.
PendingDeprecationWarning can be used, or maybe a DeprecationWarning mentioning that the code will stay for a long time.
In Python 2.7, I think documenting the issue and a recommendation to always use unicode is sufficient (ie, we can't deprecate it and a new BytesWarning seems gratuitous.)
Sorry, I don't understand "gratuitous" here: do you mean that a new warning would be annoying, or that it is cheap and useful to add it to Python 2.7.x?
I mean "Uncalled for; lacking good reason; unwarranted." IOW, I don't think we need to take any action for 2.7, apart from possibly documentation changes.
Mark
On 10/30/2011 5:39 PM, Victor Stinner wrote:
Terry J. Reedy is also concerned about backward compatibility (3.2 -> 3.3). Emitting a warning, disabled by default, is a softer solution :-)
The fact that Mark, Martin, and someone else, I believe, agree with you that the bytes api is not useful at all in 3.x and should go away reduces my concern. This fact does suggest that it is not worth changing anything to make those APIs easier to use. Instead, better to encourage people to not use those APIs in any 3.x code. Removal is ultimately, of course, the hardest solution.
On Saturday 29 October 2011 07:47:01, you wrote:
Therefore, as you imply, I think the solution to this issue is to start the process of deprecating the bytes version of the api in py3k with a view to removing it completely - possibly with a less aggressive timeline than normal. In Python 2.7, I think documenting the issue and a recommendation to always use unicode is sufficient (ie, we can't deprecate it and a new BytesWarning seems gratuitous.)
I wrote a patch to implement the deprecation: http://bugs.python.org/issue13374
Victor