[Python-ideas] Processing surrogates in

random832 at fastmail.us random832 at fastmail.us
Fri May 15 20:14:52 CEST 2015


On Thu, May 14, 2015, at 15:48, Andrew Barnert wrote:
> > Technically filesystem names (and other similar boundary APIs like
> > environ, anything ctypes, etc) on Windows can contain arbitrary
> > surrogates
> 
> Are you sure? I thought that, unless you're using Win95 or NT 3.1 or
> something, Win32 *W APIs are explicitly for Unicode characters (not code
> units),

Windows documentation often uses "Unicode" to mean UTF-16 and
"character" to mean WCHAR. The real point is that the APIs perform no
validation, so existing filenames on disk, user input into edit
controls, etc., can contain invalid surrogates. Basically nothing at
any point rejects them. I can create a file right now whose
filename consists of a single surrogate code unit. I can copy that
filename to the clipboard, paste it anywhere, create more files with it
in the filename or contents, etc. (Notepad, incidentally, will save a
UTF-16 file containing an invalid surrogate, but saving it as UTF-8
replaces the surrogate with U+FFFD. That is the one and only place I
could find where Windows rejects invalid surrogates.)
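
A minimal Python sketch of the behaviour described above, offered as an
illustration rather than anything Windows-specific: it assumes only that
a Python str may hold an unpaired surrogate code point. The name `lone`
and the helper `scrub` are hypothetical, and the substitution in `scrub`
merely imitates the U+FFFD replacement described for Notepad's UTF-8
save.

```python
# Hypothetical "filename": one unpaired low surrogate code unit.
lone = "\udc80"

# UTF-16 can carry the code unit unchanged if the codec is told not to
# validate it ("surrogatepass" error handler).
raw = lone.encode("utf-16-le", "surrogatepass")
round_tripped = raw.decode("utf-16-le", "surrogatepass")

# Strict UTF-8 refuses to encode an unpaired surrogate.
try:
    lone.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False


def scrub(s):
    """Replace every surrogate code point in s with U+FFFD.

    A surrogate code point inside a Python str is always unpaired,
    because a valid pair decodes to a single non-BMP code point.
    """
    return "".join(
        "\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s
    )
```

Here `round_tripped == lone` holds while `strict_utf8_ok` ends up False,
and `scrub("a\udc80b")` yields a string that encodes to UTF-8 without
error.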

> minus nulls and any relevant reserved characters (e.g., no
> slashes in filenames, no control characters in filenames except for
> substream names, etc.). That's what the Naming Files doc seems to imply.
> (Then again, there are other areas that seem confusing or
> misleading--e.g., where it tells you not to worry about normalization
> because once the string gets through Win32 and to the filesystem it's
> just a string of WCHARs, which sounds to me like that's exactly why you
> _should_ worry about normalization...)

Well, it depends on why you're worried about it. The lack of
normalization is great if you want to rely on the filename you just
saved coming back unchanged in a directory listing.
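
To make the trade-off concrete, here is a small Python sketch using the
standard unicodedata module. The filename is a made-up example. It shows
two spellings that render identically but are different code point
sequences, which is why a filesystem that stores names verbatim returns
exactly what was saved, and also why a caller who cares about visual
equality has to normalize on their own.

```python
import unicodedata

# Hypothetical filename using precomposed U+00E9 (NFC form).
composed = "caf\u00e9.txt"

# Same visible text in NFD form: "e" followed by combining U+0301.
decomposed = unicodedata.normalize("NFD", composed)

# The two names differ as stored strings...
names_differ = composed != decomposed

# ...and only compare equal once both are put in one normal form.
equal_after_nfc = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", composed)
)
```

With no normalization at the storage layer, both names could coexist as
separate entries, and each would be listed back byte for byte as it was
written.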

