On Tue, Sep 30, 2008 at 7:04 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> > I believe on disk it uses UTF-16.
>
> Which is made up of bytes. There may be byte sequences that are illegal UTF-16, but that's not what Martin said. I don't understand how there can be UTF-16 sequences which don't correspond to some sequence of bytes. How would they be represented in memory? Is this to do with the endianness of the UTF-16 sequence?
It has to do with the internal mapping between the ANSI and Unicode functions. On NT systems, CreateFileA will map the ANSI bytestring to a Unicode filename via the active code page, and call CreateFileW accordingly. The active code page cannot be set to something as useful as UTF-8, so given any actual code page (1252, 932, etc.) there are Unicode strings that cannot be represented with a bytestring provided to the ANSI function.

-- 
Michael Urman
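
To make the code page point concrete, here is a rough sketch that uses Python's own cp1252 and cp932 codecs as stand-ins for the Windows code pages. This is an assumption for illustration only: Windows does its own conversion internally, and its tables and best-fit behaviour can differ from Python's codecs in places. The helper name `representable` and the sample filenames are made up for the example.

```python
def representable(name: str, code_page: str) -> bool:
    """Return True if every character of `name` can be encoded in `code_page`.

    Approximation only: Python's codec stands in for the Windows code page.
    """
    try:
        name.encode(code_page)  # strict mode raises on unmappable characters
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical filenames: Latin with an accent, Japanese, Hebrew.
samples = (
    "caf\u00e9.txt",
    "\u65e5\u672c\u8a9e.txt",
    "\u05e9\u05dc\u05d5\u05dd.txt",
)

for name in samples:
    for code_page in ("cp1252", "cp932"):
        print(ascii(name), code_page, representable(name, code_page))
```

A name that fails this check for the active code page has no bytestring that the ANSI function could map back to it, so it is reachable only through the Unicode (W) function.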