os.listdir, gets unicode, returns unicode... USUALLY?!?!?

Leo Kislov Leo.Kislov at gmail.com
Tue Nov 21 10:41:45 CET 2006

Martin v. Löwis wrote:
> Ross Ridge wrote:
> > Ross Ridge wrote:
> >> That would conflict with private use characters appearing in file
> >> names.
> >
> > Martin v. Löwis wrote:
> >> Not necessarily: they could get escaped.
> >
> > How?
> Suppose I use U+E001..U+E0FF as the PUA characters for unencodable
> bytes; U+E000 wouldn't be needed since \0 cannot be part of
> a file name in POSIX.
> Then I would use U+E000 for escaping. Each PUA character in the
> listed file name would get escaped with U+E000 in the Python
> string; when the file name is converted back to the system, it
> gets unescaped.
> Notice that I think this is a really unrealistic case - I expect
> that all file names containing PUA characters were deliberately
> crafted to investigate using PUA characters in file names.
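
For reference, here is a rough sketch of how I read the escaping scheme
quoted above. It is only an illustration, not anything Python actually
does: the names decode_name and encode_name are made up, the file system
is assumed to use UTF-8, and I lean on Python 3's "surrogateescape" error
handler purely as a convenient way to find the undecodable bytes.

```python
ESC = "\ue000"  # escape marker; free because byte \0 cannot occur in a POSIX name


def decode_name(raw: bytes) -> str:
    """Turn file-name bytes into a str (hypothetical helper).

    Undecodable bytes become U+E0xx; PUA characters that are genuinely
    present in the name are prefixed with U+E000.
    """
    out = []
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An undecodable byte: map byte 0xNN to U+E0NN.
            out.append(chr(0xE000 + (cp - 0xDC00)))
        elif 0xE000 <= cp <= 0xE0FF:
            # A real PUA character in the name: escape it.
            out.append(ESC + ch)
        else:
            out.append(ch)
    return "".join(out)


def encode_name(name: str) -> bytes:
    """Inverse of decode_name (hypothetical helper)."""
    out = bytearray()
    chars = iter(name)
    for ch in chars:
        if ch == ESC:
            literal = next(chars, None)
            if literal is None:
                raise ValueError("dangling escape character at end of name")
            out += literal.encode("utf-8")
        elif 0xE001 <= ord(ch) <= 0xE0FF:
            # Stand-in for a raw byte: restore it unchanged.
            out.append(ord(ch) - 0xE000)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With this, encode_name(decode_name(raw)) gives back the original bytes for
any byte string, but the resulting str is only meaningful to code that
knows about the convention.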

How will it interoperate with the non-Python world? Will these file names
ever escape the Python process?

The Unicode Consortium thinks "safe" UTF-8 is a bad idea:


[Lars Kristan]
> Which could be understood as "a proposal to amend UTF-8 to allow invalid
> sequences".

[Kenneth Whistler, Technical Director, The Unicode Consortium]
O.k., and as pointed out already, that simply won't fly. *Nobody*
in the UTC or WG2 is going to go for that. It would destroy
UTF-8, not fix it.

Kenneth Whistler on invalid file names:

And also: http://www.mail-archive.com/unicode@unicode.org/msg27167.html

[Lars Kristan]
> Should all
> filenames that do not conform to UTF-8 be declared invalid?

[Doug Ewell, the guy behind Unicode Technical Note #14]
If you have a UTF-8 file system, yes.

  -- Leo
