[Python-Dev] Unicode strings as filenames

Neil Hodgson nhodgson@bigpond.net.au
Mon, 7 Jan 2002 20:30:11 +1100


[Replacing the other mail destinations as I didn't do a reply all last time
so python-dev dropped off. You may want to resend your last mail to
python-dev.]

> I don't think we can drop W9x support for Python 2.3, although I'm
> still waiting for comments on dropping W3.1 support...

   I wouldn't want to drop either.

> >    Sounds good to me. I'm moving back towards not using the 'utf-8'
> > system encoding but rather checking of Unicode arguments and handling
> > them explicitly even at the cost of code expansion.
>
> That is very good. I don't know what is best for the file name;
> perhaps it is acceptable to encode it with the file system default
> encoding (even if it ends up having question marks in it). Programs
> relying on the file name to be correct are broken, IMO.

   My thinking now is that there are two modules here, fileobject and
posixmodule which should be handled differently.

   posixmodule is just a library of calls with no state. IIRC there used to
be multiple modules, one per OS, and the correct one was chosen and exposed
as os. I think it is perfectly reasonable for there to be an extra 'ntos'
module that works only on NT, treats all arguments as Unicode (coercing
narrow strings up using the current locale) and always calls the wide APIs.
It would contain the same methods (when available) as os. NT-specific code
can use it directly, and sufficiently interested portable client code could
say something like

import os
if os.name == 'nt':
  import ntos  # proposed module; does not exist yet
  filesys = ntos
else:
  filesys = os

   This hides away all the code bloat from posix code, ensures there are no
regressions in posix while developing and debugging ntos, and allows ntos to
just convert all arguments into wide strings without worrying about 9x.
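
   A minimal sketch of that coercion, written in present-day Python rather
than the C the real module would need. The names widen and listdir_wide are
hypothetical, and locale.getpreferredencoding() is only an assumed stand-in
for "the current locale":

```python
import locale
import os


def widen(arg, encoding=None):
    """Coerce a narrow (bytes) argument up to a Unicode string.

    Unicode strings pass through untouched. The default encoding is an
    assumption standing in for "the current locale".
    """
    if isinstance(arg, bytes):
        return arg.decode(encoding or locale.getpreferredencoding(False))
    return arg


def listdir_wide(path):
    """Hypothetical ntos-style wrapper: widen first, then call the OS."""
    return os.listdir(widen(path))
```

   Every entry point in such a module would start with the same widening
step, so nothing below it ever has to branch on the argument type.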

   Maybe call the module osu instead, in case there are implementations on
other OSes such as OS X. The module could then expose an enquiry attribute:

if osu.working:
  filesys = osu
else:
  filesys = os
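
   Since osu might not even be importable on a given platform, the check
could be wrapped up roughly as follows. This is only a sketch: osu and its
working attribute are the hypothetical names from the snippet above, and
select_filesys is an invented helper, not an existing API.

```python
import importlib
import os


def select_filesys(candidate="osu"):
    """Return the wide-API module if it exists and reports itself working.

    "osu" and its "working" attribute are hypothetical names from the
    proposal; neither exists in the standard library.
    """
    try:
        module = importlib.import_module(candidate)
    except ImportError:
        # No such module on this platform: use the ordinary os module.
        return os
    if getattr(module, "working", False):
        return module
    return os


filesys = select_filesys()
```

   Client code then calls filesys.listdir() and friends without caring which
module it actually got.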

   fileobject is more complex because it holds two strings as state. The
mode can probably be assumed to be ASCII, so it can stay a narrow string
(although it does have to be widened to call _wfopen). The name is harder:
some client code may assume that it is always a narrow string and thus die
if given a file with a wide name.

> Looks very good indeed. When producing patches, you might want to
> check line endings: currently, your files are a mix of LF only (which
> was there before) and CRLF.

   OK.

> In open_the_file, you are still checking for utf-8; that should be
> removed also. It seems that open_the_file will always get an
> initialized filed, so passing name does not seem to be necessary: one
> could look at f_name.

   OK. So why are the name and mode passed when they are already available?

> I suggest that f_name stays as a byte string for the moment, and
> open_the_file gets an optional "original name" or "unicode name"
> argument, whatever is more convenient. If that is given, open_the_file
> should consider it, else it should fall back to f_name.

   If this is done then the unicode name should also be available as a field
of the object as those mangled "z??.html" strings are totally useless.
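
   To make the two-name idea concrete, here is a rough Python model of the
state involved. The real change would be in C inside fileobject; FileRecord,
unicode_name and name_for_open are hypothetical names, and the "ascii"
default is just an assumed file system encoding chosen to show the mangling.

```python
class FileRecord:
    """Hypothetical stand-in for the C file object's name state."""

    def __init__(self, name, mode="r", fs_encoding="ascii"):
        self.mode = mode
        if isinstance(name, str):
            # Keep the original Unicode name as its own field.
            self.unicode_name = name
            # Unencodable characters become "?", as in "z??.html".
            self.f_name = name.encode(fs_encoding, "replace")
        else:
            self.unicode_name = None
            self.f_name = name

    def name_for_open(self):
        """Prefer the Unicode name; fall back to the byte string f_name."""
        if self.unicode_name is not None:
            return self.unicode_name
        return self.f_name
```

   Old client code keeps seeing a narrow f_name, while anything that needs
the real name can read the Unicode field.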

   I'm now leaning towards making f_name wide, but I'd expect some
opposition from backwards compatibility advocates.

> In posixmodule, I cannot see the move towards passing Unicode objects
> directly, either - I guess you were talking about a future plan,
> above.

   Yes, I'm thinking ahead of the coding. Seeing where I'm already going or
about to go wrong.

> I cannot see the rationale for wfuncNull - wouldn't passing NULL
> as a function pointer be sufficient as well?

   Yes, must get used to thinking in C again. I don't think I have written C
for 8 years. WTF can't I declare variables just when I need them <incoherent
cursing and mumbling...>

   Neil