Hello. Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c. I think we can simplify the code hugely. (This means dropping bytes support for os.stat etc on windows)
# I recently did it for winsound.PlaySound with MvL's approval
Thank you.
On 11/11/2010 16:07, Hirokazu Yamamoto wrote:
Hello. Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c. I think we can simplify the code hugely. (This means dropping bytes support for os.stat etc on windows)
# I recently did it for winsound.PlaySound with MvL's approval
+1 from me
TJG
On Thu, 11 Nov 2010 16:10:35 +0000, Tim Golden wrote:
On 11/11/2010 16:07, Hirokazu Yamamoto wrote:
Hello. Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c. I think we can simplify the code hugely. (This means dropping bytes support for os.stat etc on windows)
# I recently did it for winsound.PlaySound with MvL's approval
+1 from me
How do you support cross-platform code using bytes filenames? IIRC, it has already been argued that it was an important feature. Many filesystem-related utilities might prefer to handle filenames in bytes form. ("winsound" is a Windows-specific module so that wasn't a concern obviously) Regards Antoine.
How do you support cross-platform code using bytes filenames? IIRC, it has already been argued that it was an important feature. Many filesystem-related utilities might prefer to handle filenames in bytes form.
It would be a policy decision. However, I think it is hearsay that filesystem-related utilities might prefer byte file names. On Windows, some files are inaccessible if you constrain yourself to byte filenames, so once people learn about this limitation, I expect them to switch to Unicode filenames on Windows - for the same reason they use byte filenames on Unix (i.e. to be able to access all files correctly).
Regards, Martin
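The limitation described above can be illustrated with a minimal sketch. It assumes cp1252 as a stand-in for the ANSI code page (the real one depends on the Windows locale), and both the filename and the helper `to_ansi` are hypothetical.

```python
# Hypothetical filename whose characters do not exist in cp1252.
name = "日本語.txt"


def to_ansi(filename, codepage="cp1252"):
    """Return the filename as bytes in the given code page, or None if it cannot be expressed."""
    try:
        return filename.encode(codepage)
    except UnicodeEncodeError:
        return None


# Strict encoding fails: no byte string in this code page names the file.
strict_result = to_ansi(name)

# Lossy encoding "succeeds" but produces a different name altogether.
lossy_result = name.encode("cp1252", errors="replace")
```

With a str filename no such conversion is needed, which is why a wide-only API can reach every file.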
On Thu, 11 Nov 2010 20:44:52 +0100, "Martin v. Löwis" wrote:
How do you support cross-platform code using bytes filenames? IIRC, it has already been argued that it was an important feature. Many filesystem-related utilities might prefer to handle filenames in bytes form.
It would be a policy decision. However, I think it is hearsay that filesystem-related utilities might prefer byte file names.
One possible situation is when you receive filenames in bytes form from an external API or tool (or even the contents of a file). If you don't know the encoding, keeping the bytes form is obviously recommended. I don't know how often this happens. Regards Antoine.
On Thursday 11 November 2010 21:02:43 Antoine Pitrou wrote:
On Thu, 11 Nov 2010 20:44:52 +0100, "Martin v. Löwis" wrote:
How do you support cross-platform code using bytes filenames? IIRC, it has already been argued that it was an important feature. Many filesystem-related utilities might prefer to handle filenames in bytes form.
It would be a policy decision. However, I think it is hearsay that filesystem-related utilities might prefer byte file names.
One possible situation is when you receive filenames in bytes form from an external API or tool (or even the contents of a file). If you don't know the encoding, keeping the bytes form is obviously recommended.
I disagree with you: the filename stored in the binary content/network stream may be encoded with a different code page than the current Windows code page. The application has to decode the filename itself; the application has more information about the right encoding than Windows. Examples:
- MKV video stores filenames in utf-8
- ZIP stores filenames in cp437 or utf-8
- tar stores filenames... in the locale encoding (except for PAX format which uses utf-8)
- etc.
Victor
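A minimal sketch of the application decoding a stored filename itself, using the ZIP case from the list above. The helper `decode_archive_name` and its `is_utf8` flag are hypothetical; a real ZIP reader would take that flag from the archive metadata.

```python
def decode_archive_name(raw, is_utf8):
    """Decode a stored filename using the encoding the container format declares.

    Hypothetical helper: `is_utf8` stands for whatever the format uses to
    signal UTF-8 names; otherwise the legacy cp437 encoding is assumed.
    """
    return raw.decode("utf-8" if is_utf8 else "cp437")


# The same bytes give different names depending on the declared encoding.
stored = "é.txt".encode("utf-8")
as_utf8 = decode_archive_name(stored, is_utf8=True)
as_cp437 = decode_archive_name(stored, is_utf8=False)
```

Neither result depends on the local Windows code page, which is the point: only the application knows which encoding the container used.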
On Fri, 12 Nov 2010 13:13:08 +0100, Victor Stinner wrote:
On Thursday 11 November 2010 21:02:43 Antoine Pitrou wrote:
On Thu, 11 Nov 2010 20:44:52 +0100, "Martin v. Löwis" wrote:
How do you support cross-platform code using bytes filenames? IIRC, it has already been argued that it was an important feature. Many filesystem-related utilities might prefer to handle filenames in bytes form.
It would be a policy decision. However, I think it is hearsay that filesystem-related utilities might prefer byte file names.
One possible situation is when you receive filenames in bytes form from an external API or tool (or even the contents of a file). If you don't know the encoding, keeping the bytes form is obviously recommended.
I disagree with you: the filename stored in the binary content/network stream may be encoded with a different code page than the current Windows code page. The application has to decode the filename itself; the application has more information about the right encoding than Windows.
I'm not talking about Windows obviously. POSIX filenames are natively bytes, so if you get a bytes filename from an external source, it makes sense to reuse the bytes form. I think it would be a mistake to allow bytes filenames under POSIX but not under Windows. It makes porting harder.
- tar stores filenames... in the locale encoding (except for PAX format which uses utf-8)
So bytes filenames are useful at least for tar. I'm sure there are many other cases (actually, most kinds of configuration files containing paths would apply). Regards Antoine.
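For context on reusing byte filenames on POSIX: Python 3 can also carry undecodable bytes inside a str through the `surrogateescape` error handler. The sketch below applies that handler explicitly with UTF-8, which is an assumption standing in for the locale encoding; the filename is made up.

```python
# Hypothetical POSIX filename in Latin-1; these bytes are not valid UTF-8.
raw = b"caf\xe9.txt"

# The undecodable byte is smuggled into the str as a lone surrogate...
text = raw.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```

The round trip is lossless, so a str-based interface does not have to lose such filenames on POSIX.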
I'm not talking about Windows obviously. POSIX filenames are natively bytes, so if you get a bytes filename from an external source, it makes sense to reuse the bytes form.
I think it would be a mistake to allow bytes filenames under POSIX but not under Windows. It makes porting harder.
Not really. People who want to write portable code should use Unicode filenames everywhere, not byte filenames.
- tar stores filenames... in the locale encoding (except for PAX format which uses utf-8)
So bytes filenames are useful at least for tar.
No, they are not. The tarfile module decodes all file names on its own, IIUC. Regards, Martin
On Thursday 11 November 2010, Hirokazu Yamamoto wrote:
Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c. I think we can simplify the code hugely.
+1
MS Windows variants that only support the ANSI API (win9x) have been officially unsupported since 2.5 or 2.6. Further, this also eases porting to MS Windows CE, which I'd still like to see one day.
(This means dropping bytes support for os.stat etc on windows)
I disagree that not using the ANSI win32 API means dropping byte support for os.stat. I'd rather say that it means converting bytes at the earliest possible time and only using unicode internally. But I'm guessing a bit here, I haven't looked at the code for a while.
# I recently did it for winsound.PlaySound with MvL's approval
Interesting, is there a ticket associated with this? Also, was that on Python 3 or 2? Which commits?
Uli
On Thursday 11 November 2010 17:07:28 Hirokazu Yamamoto wrote:
Hello. Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c.
Even if I hate the MBCS encoding, because it replaces undecodable characters by similar glyphs by default, I'm not certain that it is a good idea to drop the bytes API. Can it be a problem to port programs from Python2 to Python3? Do major Python2 programs/libraries rely on the bytes API?
I think we can simplify the code hugely. (This means dropping bytes support for os.stat etc on windows)
Sure, it will divide the number of lines, of the code specific to Windows, by two. Victor
Even if I hate the MBCS encoding, because it replaces undecodable characters by similar glyphs by default, I'm not certain that it is a good idea to drop the bytes API. Can it be a problem to port programs from Python2 to Python3? Do major Python2 programs/libraries rely on the bytes API?
I don't actually know for a fact, but I expect that the answer is "no". The question is: where do file names typically come from? My guess is that they come from
a) hard-coded strings in the source code
b) command line arguments/environment variables
c) directory listings
[of course, there are other ways, like GUI input, getcwd(), etc]
In case a), you have filenames such as ".", e.g. as a parameter to listdir or walk. These will typically be regular strings in Python 2, which become Unicode strings in 3. You would actively need to put b"" prefixes into the code.
In case b), they will be Unicode strings in Python 3.
In case c), they will be Unicode strings if the argument is a Unicode string. So by induction, file names will typically be unicode. The exception will be libraries/applications which make deliberate attempts to get byte-oriented file names.
Regards, Martin
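Case c) can be checked directly: `os.listdir` returns names of the same type as its argument. A minimal sketch, using a temporary directory and a made-up file name so the result is predictable:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Create one file so the listing is not empty.
    open(os.path.join(directory, "example.txt"), "w").close()

    # A str argument yields str names...
    str_names = os.listdir(directory)

    # ...and only a deliberately bytes argument yields bytes names.
    bytes_names = os.listdir(os.fsencode(directory))
```

Code that starts from ordinary string literals therefore stays in str throughout unless it opts into bytes on purpose.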
On Thursday 11 November 2010 20:50:35 Martin v. Löwis wrote:
Even if I hate the MBCS encoding, because it replaces undecodable characters by similar glyphs by default, I'm not certain that it is a good idea to drop the bytes API. Can it be a problem to port programs from Python2 to Python3? Do major Python2 programs/libraries rely on the bytes API?
I don't actually know for a fact, but I expect that the answer is "no".
The question is: where do file names typically come from? My guess is that they come from a) hard-coded strings in the source code b) command line arguments/environment variables
[...]
In case b), they will be Unicode strings in Python 3.
But not necessarily with unicode semantics if I get the discussions about the environment topic right.
Additionally:
d) Over a socket (like the HTTP protocol) -> Bytes.
nd
Additionally:
d) Over a socket (like the HTTP protocol) -> Bytes.
Sure. However, you can't really expect that the bytes you receive over the socket are a meaningful filename on your local Windows installation. So it would be a bug in the application to not decode the bytes that you receive before using them as a file name. In a well-specified network protocol, you would know the encoding of the bytes; the IETF recommends using UTF-8 for all new protocols. Using a UTF-8 string as a filename on Windows will create mojibake.
Regards, Martin
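The mojibake effect is easy to reproduce. A minimal sketch, assuming cp1252 as the local ANSI code page and a made-up filename received as UTF-8 bytes:

```python
# Hypothetical filename as it would arrive over a UTF-8 protocol.
original = "résumé.txt"
wire_bytes = original.encode("utf-8")

# Treating the bytes as ANSI (cp1252 assumed) garbles each accented letter.
misread = wire_bytes.decode("cp1252")

# Decoding with the protocol's declared encoding recovers the name.
correct = wire_bytes.decode("utf-8")
```

Here `misread` is `rÃ©sumÃ©.txt`, because each two-byte UTF-8 sequence is read as two separate cp1252 characters.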
On Thursday 11 November 2010 20:50:35 you wrote:
Even if I hate the MBCS encoding, because it replaces undecodable characters by similar glyphs by default, I'm not certain that it is a good idea to drop the bytes API. Can it be a problem to port programs from Python2 to Python3? Do major Python2 programs/libraries rely on the bytes API?
I don't actually know for a fact, but I expect that the answer is "no".
The question is: where do file names typically come from? My guess is that they come from a) hard-coded strings in the source code b) command line arguments/environment variables c) directory listings [of course, there are other ways, like GUI input, getcwd(), etc]
In case a), you have filenames such as ".", e.g. as a parameter to listdir or walk. These will typically be regular strings in Python 2, which become Unicode strings in 3. You would actively need to put b"" prefixes into the code.
In case b), they will be Unicode strings in Python 3.
In case c), they will be Unicode strings if the argument is a Unicode string. So by induction, file names will be typically unicode. The exception will be libraries/applications which make deliberate attempts to get byte-oriented file names.
Ok, good answer. In this case, I vote +1 to completely remove the ANSI version from all Python modules. I consider the ANSI version as a compatibility layer for old applications written for MS-DOS or early versions of Windows. Even if these APIs are still widely used in C/C++ applications, the wide versions should always be preferred.
Victor
Ok, good answer. In this case, I vote +1 to completely remove the ANSI version from all Python modules.
I think caution is still necessary. So I propose to deprecate byte filenames on Windows in 3.2, with removal in 3.3. People who think this is a terrible mistake and breaks their applications with no hope of a sensible solution can then still intervene.
Regards, Martin
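A minimal sketch of what such a deprecation period could look like at the Python level. The helper `normalize_path`, the warning text, and the cp1252 stand-in for the ANSI code page are all assumptions for illustration, not the actual posixmodule.c change.

```python
import warnings

ANSI_CODEC = "cp1252"  # assumption: stand-in for the Windows ANSI code page


def normalize_path(path):
    """Hypothetical helper: accept str silently, accept bytes with a warning."""
    if isinstance(path, bytes):
        warnings.warn(
            "bytes filenames are deprecated on Windows; use str instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return path.decode(ANSI_CODEC)
    return path
```

Existing callers keep working during the deprecation release, and anyone who depends on bytes filenames gets a visible signal before removal.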
On Sat, Nov 13, 2010 at 5:46 AM, "Martin v. Löwis" wrote:
Ok, good answer. In this case, I vote +1 to completely remove the ANSI version from all Python modules.
I think caution is still necessary. So I propose to deprecate byte filenames on Windows in 3.2, with removal in 3.3. People who think this is a terrible mistake and breaks their applications with no hope of a sensible solution can then still intervene.
I was going to suggest much the same thing. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Nov 12, 2010 at 5:26 AM, Victor Stinner wrote:
On Thursday 11 November 2010 17:07:28 Hirokazu Yamamoto wrote:
Hello. Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c.
Even if I hate the MBCS encoding, because it replaces undecodable characters by similar glyphs by default, I'm not certain that it is a good idea to drop the bytes API. Can it be a problem to port programs from Python2 to Python3? Do major Python2 programs/libraries rely on the bytes API?
I think we can simplify the code hugely. (This means dropping bytes support for os.stat etc on windows)
Sure, it will divide the number of lines, of the code specific to Windows, by two.
Can we get most of the code cleanup benefit without the backwards compatibility risk by doing the decode from 'mbcs' on our side of the fence? That is, have code that was the C equivalent of:

arg_is_bytes = not isinstance(arg, str)
if arg_is_bytes:
    val = _decode_mbcs(arg)  # Decoding error checking here
else:
    val = arg
# Common processing using WIDE API
if arg_is_bytes:
    result = _encode_mbcs(wide_result)  # Encoding error checking here
else:
    result = wide_result

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
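The pattern proposed in this message can be made runnable as a rough sketch. Everything here is an assumption for illustration: cp1252 stands in for 'mbcs' (which exists only on Windows), and `_wide_basename` stands in for a function implemented purely on top of the wide API.

```python
import os.path

ANSI_CODEC = "cp1252"  # assumption: stand-in for 'mbcs'


def _wide_basename(path):
    """Stand-in for an implementation that only understands str (wide API)."""
    if not isinstance(path, str):
        raise TypeError("wide-only implementation requires str")
    return os.path.basename(path)


def basename(path):
    """Accept str or bytes, do the work in str, return the caller's type."""
    arg_is_bytes = isinstance(path, bytes)
    # Decode at the boundary; a strict decode raises on invalid input.
    val = path.decode(ANSI_CODEC) if arg_is_bytes else path
    wide_result = _wide_basename(val)
    # Encode the result back only if the caller passed bytes.
    return wide_result.encode(ANSI_CODEC) if arg_is_bytes else wide_result
```

For example, `basename(b"dir/caf\xe9.txt")` returns bytes while `basename("dir/file.txt")` returns str, and only one code path touches the underlying implementation.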
On Thursday 11 November 2010 23:01:32 you wrote:
Sure, it will divide the number of lines, of the code specific to Windows, by two.
Can we get most of the code cleanup benefit without the backwards compatibility risk by doing the decode from 'mbcs' on our side of the fence?
I created PyUnicode_FSDecoder, a ParseTuple converter used to work on unicode paths, instead of bytes paths. On Windows, this converter uses the mbcs encoding in strict mode, whereas the Windows converter uses the replace error handler to decode, and ignore to encode. So I don't think that we should use this converter on Windows.
That is, have code that was the C equivalent of:
arg_is_bytes = not isinstance(arg, str)
if arg_is_bytes:
    val = _decode_mbcs(arg)  # Decoding error checking here
else:
    val = arg
# Common processing using WIDE API
if arg_is_bytes:
    result = _encode_mbcs(wide_result)  # Encoding error checking here
else:
    result = wide_result
This doesn't make the code shorter, it may be longer than the actual code, and it is less compliant with the Windows native API... Victor
On 2010/11/12 4:26, Victor Stinner wrote:
On Thursday 11 November 2010 17:07:28 Hirokazu Yamamoto wrote:
Hello. Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c.
Even if I hate the MBCS encoding, because it replaces undecodable characters by similar glyphs by default, I'm not certain that it is a good idea to drop the bytes API.
On 2010/11/12 21:08, Victor Stinner wrote:
On Thursday 11 November 2010 23:01:32 you wrote:
Sure, it will divide the number of lines, of the code specific to Windows, by two.
Can we get most of the code cleanup benefit without the backwards compatibility risk by doing the decode from 'mbcs' on our side of the fence?
I created PyUnicode_FSDecoder, a ParseTuple converter used to work on unicode paths, instead of bytes paths. On Windows, this converter uses the mbcs encoding in strict mode, whereas the Windows converter uses the replace error handler to decode, and ignore to encode. So I don't think that we should use this converter on Windows.
That is, have code that was the C equivalent of:
arg_is_bytes = not isinstance(arg, str)
if arg_is_bytes:
    val = _decode_mbcs(arg)  # Decoding error checking here
else:
    val = arg
# Common processing using WIDE API
if arg_is_bytes:
    result = _encode_mbcs(wide_result)  # Encoding error checking here
else:
    result = wide_result
This doesn't make the code shorter, it may be longer than the actual code, and it is less compliant with the Windows native API...
Is it possible to implement a new PyArg_ParseTuple converter to use
PyUnicode_Decode(const char *s, Py_ssize_t size,
                 const char *encoding, /* mbcs */
                 const char *errors)   /* replace */
and use it?
On Saturday 13 November 2010 17:21:37 you wrote:
On 2010/11/12 4:26, Victor Stinner wrote:
On Thursday 11 November 2010 17:07:28 Hirokazu Yamamoto wrote:
Hello. Is it possible to remove Win32 ANSI API (ie: GetFileAttributesA) and only use Win32 WIDE API (ie: GetFileAttributesW)? Mainly in posixmodule.c.
Even if I hate the MBCS encoding, because it replaces undecodable characters by similar glyphs by default, I'm not certain that it is a good idea to drop the bytes API.
On 2010/11/12 21:08, Victor Stinner wrote:
On Thursday 11 November 2010 23:01:32 you wrote:
Sure, it will divide the number of lines, of the code specific to Windows, by two.
Can we get most of the code cleanup benefit without the backwards compatibility risk by doing the decode from 'mbcs' on our side of the fence?
I created PyUnicode_FSDecoder, a ParseTuple converter used to work on unicode paths, instead of bytes paths. On Windows, this converter uses the mbcs encoding in strict mode, whereas the Windows converter uses the replace error handler to decode, and ignore to encode. So I don't think that we should use this converter on Windows.
That is, have code that was the C equivalent of:
arg_is_bytes = not isinstance(arg, str)
if arg_is_bytes:
    val = _decode_mbcs(arg)  # Decoding error checking here
else:
    val = arg
# Common processing using WIDE API
if arg_is_bytes:
    result = _encode_mbcs(wide_result)  # Encoding error checking here
else:
    result = wide_result
This doesn't make the code shorter, it may be longer than the actual code, and it is less compliant with the Windows native API...
Is it possible to implement a new PyArg_ParseTuple converter to use
PyUnicode_Decode(const char *s, Py_ssize_t size,
                 const char *encoding, /* mbcs */
                 const char *errors)   /* replace */
and use it?
Yes, but how do you check if the input argument is a bytes or a str object with your PyArg_Parse converter? You should use "O" format and manually convert it to unicode, and then convert the result back to bytes (if the input was bytes). I don't think that it makes the code shorter.
The code is currently working. The question is if we have to drop the ANSI API now, later or never. It looks like the decision moves to "later" (deprecate in 3.2, remove in 3.3). I still think that dropping it now doesn't really hurt.
Victor
On Sun, 14 Nov 2010 01:06:55 +0100, Victor Stinner wrote:
The code is currently working. The question is if we have to drop the ANSI API now, later or never.
If the code is currently working and isn't a security hole, then we obviously don't "have to". Apparently several developers "want to", which is different.
It looks like the decision moves to "later" (deprecate in 3.2, remove in 3.3). I still think that drop now doesn't really hurt.
If you drop code without first deprecating it, chances are it will hurt someone. That's why having a deprecation period is the rule we usually follow (most of the time :-)). Regards Antoine.
On Sun, Nov 14, 2010 at 10:19 AM, Antoine Pitrou wrote:
On Sun, 14 Nov 2010 01:06:55 +0100, Victor Stinner wrote:
The code is currently working. The question is if we have to drop the ANSI API now, later or never.
If the code is currently working and isn't a security hole, then we obviously don't "have to". Apparently several developers "want to", which is different.
We should also keep in mind that *Microsoft* have chosen to keep the bytes Win32 APIs around, despite their flaws, all in the name of backwards compatibility. While the goal of nudging third party developers towards the superior Unicode APIs is an admirable one, it is still the case that there is a *lot* of ASCII-only code out there. E.g. applications could easily be storing filenames in an ASCII only datastore that provides them back to the application as bytes in 3.x.
It looks like the decision moves to "later" (deprecate in 3.2, remove in 3.3). I still think that drop now doesn't really hurt.
If you drop code without first deprecating it, chances are it will hurt someone. That's why having a deprecation period is the rule we usually follow (most of the time :-)).
Indeed. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
We should also keep in mind that *Microsoft* have chosen to keep the bytes Win32 APIs around, despite their flaws, all in the name of backwards compatibility.
Of course, Microsoft is in a different position. If they remove a functionality in some release, their users typically can't go back and continue to use the old version - at least not on the same computer. For Python, it's different: our users can go back to use an old version if the new one breaks their applications. And we do break applications from time to time, most notably with the introduction of Python 3.
While the goal of nudging third party developers towards the superior Unicode APIs is an admirable one, it is still the case that there is a *lot* of ASCII-only code out there.
The question is: is there also a lot of ASCII-only Python 3 software out there? And would developers of such software have difficulty porting it to a Unicode file name API?
E.g. applications could easily be storing filenames in an ASCII only datastore that provides them back to the application as bytes in 3.x.
That's speculation. My speculation would be that authors of such a datastore find that they can't even print the data anymore in a reasonable way, so they changed their API to return strings (i.e. decoding from ASCII) when they ported it to Python 3. They wouldn't even consider it a change, because it returned strings all the time, and now Python 3 has a different string type.
If you drop code without first deprecating it, chances are it will hurt someone. That's why having a deprecation period is the rule we usually follow (most of the time :-)).
I'm in favor of deprecating it first. Regards, Martin
On Sun, Nov 14, 2010 at 8:14 PM, "Martin v. Löwis" wrote:
I'm in favor of deprecating it first.
Aye. I've made the best case I could for keeping it, and even I don't find it terribly convincing. So deprecation for 3.2 sounds like a reasonable option.
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
If the code is currently working and isn't a security hole, then we obviously don't "have to". Apparently several developers "want to", which is different.
In case the motivation for that isn't clear: it would produce a significant code reduction, and therefore ease maintenance. Regards, Martin
On 2010/11/14 9:06, Victor Stinner wrote:
Yes, but how do you check if the input argument is a bytes or a str object with your PyArg_Parse converter? You should use "O" format and manually convert it to unicode, and then convert the result back to bytes (if the input was bytes). I don't think that it makes the code shorter.
The code is currently working. The question is if we have to drop the ANSI API now, later or never. It looks like the decision moves to "later" (deprecate in 3.2, remove in 3.3). I still think that drop now doesn't really hurt.
Victor
Humble thoughts... Is it possible a conversion from bytes (ANSI) to unicode fails on windows? If not, is it allowed to convert to unicode with PyUnicode_FSDecoder if function doesn't return str? For example, os.stat() takes str as arguments but doesn't return str.
# I noticed win_readlink() in Modules/posixmodule.c already unicode
# only. Maybe not so much problem? ;-)
Is it possible a conversion from bytes (ANSI) to unicode fails on windows?
It should fail sometimes, right? Not for windows-1252, but certainly for shift-jis (you know better than me). It seems that whether MultiByteToWideChar will fail depends on whether MB_ERR_INVALID_CHARS is given or not. I don't know what it will do if this flag is not given - my guess is that it fills in REPLACEMENT CHARACTER.
If not, is it allowed to convert to unicode with PyUnicode_FSDecoder if function doesn't return str? For example, os.stat() takes str as arguments but doesn't return str.
This I don't understand. os.stat doesn't return text at all - so what do you want to convert?
# I noticed win_readlink() in Modules/posixmodule.c already unicode
# only. Maybe not so much problem? ;-)
Well, readlink is new on Windows, and symlinks are not widespread. So there is no backwards compatibility concern here. Regards, Martin
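The difference between a strict and a lenient conversion can be sketched with Python's own codecs. This only illustrates the two error-handling modes; it makes no claim about what MultiByteToWideChar itself does without MB_ERR_INVALID_CHARS. The helper `decode_ansi` and the sample bytes are hypothetical.

```python
def decode_ansi(raw, codec="shift_jis", strict=True):
    """Decode bytes; strict mode raises on invalid input, lenient mode substitutes U+FFFD."""
    return raw.decode(codec, errors="strict" if strict else "replace")


# 0x81 is a Shift JIS lead byte; here it has no trail byte, so the input is invalid.
truncated = b"abc\x81"

try:
    decode_ansi(truncated)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The lenient mode never fails but silently alters the name.
lenient = decode_ansi(truncated, strict=False)
```

A strict conversion surfaces the problem to the caller, while a replacing one yields a filename that may not match any real file.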
participants (8)
- "Martin v. Löwis"
- André Malo
- Antoine Pitrou
- Hirokazu Yamamoto
- Nick Coghlan
- Tim Golden
- Ulrich Eckhardt
- Victor Stinner