[Python-3000] Unicode and OS strings

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Thu Sep 13 21:26:15 CEST 2007


On Thu, 13-09-2007 at 19:08 +0200, "Martin v. Löwis"
wrote:

> Of course, if the input data already contains PUA characters,
> there would be an ambiguity. We can rule this out for most codecs,
> as they don't support PUA characters. The major exception would
> be UTF-8,

Most codecs other than UTF-8 don't have this problem.

Unicode people are generally allergic to any non-standard variant of
the Unicode encodings and consider such schemes a heresy. I use U+0000
escaping experimentally and optionally, but I'm not convinced that
anything like this is a good idea, and it should probably not be
enabled by default.

Mono uses U+0000 escaping too; I'm not sure whether all the details
agree. This escaping scheme has the advantage that it's compatible with
real UTF-8 for strings which contain no \x00 = U+0000. Most of the
applicable contexts guarantee that NUL does not occur, so the
interpretation of valid data is unchanged in both directions. My
encoder even rejects U+0000 escapes for bytes which would form valid
UTF-8 sequences, so no two Unicode strings encode to the same byte
string. The side effect is that not all U+0000 occurrences can be
encoded, but the contexts we are talking about don't allow U+0000
anyway.
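
For illustration, here is roughly how such a scheme could look in
Python. This is only my sketch of the idea, not the actual Kogut or
Mono code, and the names are made up:

    def _utf8_len(lead):
        # Expected length of a UTF-8 sequence with this lead byte,
        # 0 if the byte cannot start a valid sequence (or is NUL).
        if 0x01 <= lead <= 0x7F: return 1
        if 0xC2 <= lead <= 0xDF: return 2
        if 0xE0 <= lead <= 0xEF: return 3
        if 0xF0 <= lead <= 0xF4: return 4
        return 0

    def decode_escaped(data):
        # Decode bytes as UTF-8, turning NUL and undecodable bytes
        # into the pair U+0000 U+00xx, so that no input can fail.
        out = []
        i = 0
        while i < len(data):
            n = _utf8_len(data[i])
            if n:
                try:
                    out.append(data[i:i+n].decode('utf-8'))
                    i += n
                    continue
                except UnicodeDecodeError:
                    pass
            out.append('\x00' + chr(data[i]))
            i += 1
        return ''.join(out)

    def encode_escaped(text):
        # Inverse direction: U+0000 U+00xx becomes the raw byte xx.
        out = bytearray()
        i = 0
        while i < len(text):
            if text[i] == '\x00':
                if i + 1 == len(text) or ord(text[i+1]) > 0xFF:
                    raise ValueError('this U+0000 cannot be encoded')
                out.append(ord(text[i+1]))
                i += 2
            else:
                out.extend(text[i].encode('utf-8'))
                i += 1
        # Reject escapes which would spell out valid UTF-8, so that
        # no two Unicode strings encode to the same byte string.
        if decode_escaped(bytes(out)) != text:
            raise ValueError('escapes are ambiguous with real UTF-8')
        return bytes(out)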

> > I'm guessing one thing we need to do is
> > research how various systems decide what encoding to use.

This is the easy part; modern Unices have nl_langinfo(CODESET).
The hard part is deciding what to do when decoding fails.
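
In Python this would presumably look something like the following
(just a sketch; nl_langinfo is Unix-only and the exact spelling in
Py3k may differ):

    import locale

    # Respect the environment's locale settings, then ask which
    # codeset filenames, argv and I/O are expected to use.
    locale.setlocale(locale.LC_CTYPE, '')
    codeset = locale.nl_langinfo(locale.CODESET)
    print(codeset)   # e.g. 'UTF-8' or 'ISO-8859-2'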

[I will be absent between Friday and Monday.]

Here is what other environments do. I checked this over 2 years ago,
so something might have changed; in particular Mono now uses some form
of U+0000 escaping, which I need to investigate again. I checked both
directions, i.e. how undecodable filenames coming from the system are
interpreted, and what is done with unencodable filenames given by the
program. Everything is on Linux. Some behaviors are obviously awful.


Java (Sun)
----------

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.
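
For comparison, this replacement behavior is what Python's codecs give
with errors='replace' (only to show what the Java behavior corresponds
to, not a statement about what Py3k should do):

    # a) decoding: undecodable bytes become U+FFFD
    b'\xb1\xe6\xea'.decode('utf-8', 'replace')    # '\ufffd\ufffd\ufffd'

    # b) encoding: unencodable characters become '?'
    'za\u017c\u00f3\u0142\u0107'.encode('ascii', 'replace')   # b'za????'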


Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified
   UTF-8. Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are silently skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by "?".


C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.

a) Interpreting. If a filename cannot be converted, it is skipped in
   a directory listing.

   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
   The reality seems not to match this (mono-1.0.5).

b) Creating. If UTF-8 is used, U+0000 throws an exception
   (System.ArgumentException: Path contains invalid chars), paired
   surrogates are treated correctly, and an isolated surrogate causes
   an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external):
assertion failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try
again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, unpaired surrogates are converted to pseudo-UTF-8.
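
(For reference, "pseudo-UTF-8" here means that a lone surrogate is
laid out with the ordinary 3-byte UTF-8 pattern even though strict
UTF-8 forbids it; my illustration, not Mono's code:)

    cp = 0xD800                         # an unpaired high surrogate
    bytes([0xE0 | (cp >> 12),           # 1110xxxx
           0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx
           0x80 | (cp & 0x3F)])         # 10xxxxxx  -> b'\xed\xa0\x80'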

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are silently skipped.


Perl
----

Depending on the convention used by a particular function and on
imported packages, a Perl string is treated either as Perl-modified
Unicode (with character values up to 32 bits or 64 bits depending on
the architecture) or as an unspecified locale encoding. It has two
internal representations: ISO-8859-1 and Perl-modified UTF-8 (with
an extended range).

If every Perl string is assumed to be a Unicode string, then filenames
are effectively ISO-8859-1.

a) Interpreting. Characters up to U+00FF are used.

b) Creating. If the filename has no characters above 0xFF, it is
   converted to ISO-8859-1. Otherwise it is converted to Perl-modified
   UTF-8 (all characters, not just those above 0xFF).

Command line arguments and standard I/O are treated in the same way,
i.e. ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on
output, depending on the contents.

This behavior is modifiable by importing various packages and using
interpreter invocation flags. When Perl is told that command line
arguments are UTF-8, the behavior for strings which cannot be
converted is inconsistent: sometimes the string is treated as
ISO-8859-1, sometimes an error is signalled.


Haskell
-------

Haskell nominally uses Unicode. There is no conversion framework
standardized or implemented yet though. Implementations which support
more than 256 characters currently assume ISO-8859-1 for filenames,
command line arguments and all I/O, taking the lowest 8 bits of a
character code on output.


Common Lisp: CLISP
------------------

The Common Lisp standard doesn't say anything about string encodings.
In CLISP strings are UTF-32 (internally optimized as UCS-2 and
ISO-8859-1 when possible). Any character code up to U+10FFFF is
allowed, including isolated surrogates.

Filenames are assumed to be in the locale encoding.

a) Interpreting. If a byte cannot be converted, a condition is signaled.

b) Creating. If a character cannot be converted, a condition is
   signaled.


Kogut (my language)
-----

Strings are UTF-32 (internally optimized as ISO-8859-1 when possible).
Any character code up to U+10FFFF is allowed, including isolated
surrogates.

Filenames are assumed to be in the locale encoding; the encoding can be
overridden by a Kogut-specific environment variable. A program can
itself set the encoding to something else, perhaps locally during
execution of some code. It can use a conversion which puts U+FFFD / "?"
instead of throwing an exception on error, or which does something else.

a) Interpreting. If a byte cannot be converted, an exception is thrown.

b) Creating. If a character cannot be converted or if a name contains
   U+0000, an exception is thrown.

Command line arguments and standard I/O are treated in the same way.

There is an additional encoding, a modified UTF-8, which can be used
explicitly instead of true UTF-8: any byte string can be decoded, with
normally undecodable bytes and \0 escaped as the pair U+0000 U+00xx.
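
With the sketch from earlier in this message, the property that any
byte string can be decoded and encoded back looks like this (again
only an illustration):

    data = b'xxx\xb1\xe6\xea'           # ISO-8859-2 bytes, invalid UTF-8
    s = decode_escaped(data)            # 'xxx\x00\xb1\x00\xe6\x00\xea'
    assert encode_escaped(s) == data    # the original bytes survive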


GNOME
-----

GNOME uses UTF-8 internally, or sometimes byte strings in other
encodings. I guess filenames are passed as byte strings. AFAIK
sometimes filenames are expressed as URLs, even internally when it's
invisible to the user, and then various unsafe bytes are escaped as
two hex digits preceded by the percent sign. From the programmer's
point of view the original byte strings are generally used. Filename
encoding matters for the display though, so here I describe the user's
point of view.

If the environment variable G_FILENAME_ENCODING is present, it
specifies the encoding of filenames, unless it is @locale which means
the encoding of the locale. If it's not present but G_BROKEN_FILENAMES
is present, filenames are assumed to be in the locale encoding.
If neither variable is present, filenames are assumed to be in UTF-8.
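
Written out as code, my reading of that lookup order is roughly the
following (simplified to a single encoding; glib itself accepts a list
of encodings in G_FILENAME_ENCODING):

    import locale, os

    def gnome_filename_encoding():
        enc = os.environ.get('G_FILENAME_ENCODING')
        if enc is not None:
            # '@locale' stands for the encoding of the current locale.
            return (locale.nl_langinfo(locale.CODESET)
                    if enc == '@locale' else enc)
        if 'G_BROKEN_FILENAMES' in os.environ:
            return locale.nl_langinfo(locale.CODESET)
        return 'UTF-8'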

a) Interpreting. If a filename cannot be converted from the selected
   encoding, all non-ASCII bytes are shown as octal numbers preceded
   by the backslash, as hex numbers preceded by the percent sign, or
   as question marks, depending on the situation (I can observe all
   three cases in gedit). What is physically stored is the byte string
   and the file is opened successfully.

b) Creating. If a character cannot be represented, the application
   refuses to save the file until a good filename is entered.


Mozilla
-------

I don't know how it handles filenames internally. From the user's
point of view it matters how it presents a local directory listing.

Filenames are assumed to be in the locale encoding.

If a filename cannot be converted, it's skipped. If it can be
converted but contains characters like 0x80-0x9F in ISO-8859-2,
they are displayed as question marks and the file is inaccessible.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


