[Python-3000] Unicode and OS strings
Stephen J. Turnbull
stephen at xemacs.org
Sun Sep 16 09:13:29 CEST 2007
"Martin v. Löwis" writes:
> > What I'm suggesting is to provide a way for processes to record and
> > communicate that information without needing to provide a "source
> > encoding" slot for strings, and which is able to handle strings
> > containing unrecognized (including corrupt) characters from multiple
> > source encodings.
>
> Can you please (re-)state how that way would precisely work? I could
> not find that in the archives.
The basic idea is to allocate code points in private space as needed.
All points in private space would be initially "owned" by the Python
process.
When a codec encounters something it can't handle, whether it's a
valid character in a legacy encoding, a private use character in a
UTF, or an invalid sequence of code units, it throws an exception
specifying the offending character or code unit and the current coded
character set.  The handler either finds that (charset, codepoint)
tuple in the table, or assigns a fresh private use character, enters
it in the table with the tuple as key, and records the inverse
assignment in an inverse mapping table.
It may be that no charset can be assigned to the codepoint, in which
case None would be recorded as the charset, and instead of mapping
characters, the invalid code units would be individually mapped.
On output, if the codec can output in the recorded character set, it
does so, otherwise it throws an unencodable character exception.
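The decode and encode sides of this could be sketched with the
existing error-handler callback machinery (PEP 293).  Everything
below is illustrative, not an existing API: the handler name
"preserve", the flat table layout, and the choice of the BMP private
use range are all assumptions, and for simplicity the encode side
requires an exact charset match rather than consulting a None-charset
fallback.

```python
import codecs

# Private space "owned" by the process (assumed range: BMP PUA).
_PUA_START, _PUA_END = 0xE000, 0xF8FF
_next_pua = _PUA_START
_table = {}      # (charset, code unit) -> private use character
_inverse = {}    # private use character -> (charset, code unit)

def _allocate(charset, unit):
    """Return the private use character for (charset, unit),
    allocating a fresh code point on first sight."""
    global _next_pua
    key = (charset, unit)
    if key not in _table:
        if _next_pua > _PUA_END:
            raise UnicodeError("private use area exhausted")
        ch = chr(_next_pua)
        _next_pua += 1
        _table[key] = ch
        _inverse[ch] = key
    return _table[key]

def _preserve_decode(exc):
    """On decode errors, map each offending byte to a private use
    character keyed by (source encoding, byte value)."""
    repl = "".join(_allocate(exc.encoding, b)
                   for b in exc.object[exc.start:exc.end])
    return repl, exc.end

def _preserve_encode(exc):
    """On encode errors, emit the recorded byte if the private use
    character was allocated for this charset; otherwise re-raise
    (the 'unencodable character' case from the text)."""
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        charset, unit = _inverse.get(ch, (None, None))
        if charset != exc.encoding or unit is None:
            raise exc
        out.append(unit)
    return bytes(out), exc.end

def preserve(exc):
    if isinstance(exc, UnicodeDecodeError):
        return _preserve_decode(exc)
    if isinstance(exc, UnicodeEncodeError):
        return _preserve_encode(exc)
    raise exc

codecs.register_error("preserve", preserve)
```

With this in place, b"caf\xe9".decode("ascii", "preserve") yields
"caf" plus one private use character, the same byte always maps to
the same character, and encoding back to "ascii" round-trips, while
encoding to a charset other than the recorded one raises.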
This definitely requires that the Unicode codecs be modified to do the
right thing if they encounter private use characters in the input
stream or output stream.
Other codecs don't need to be modified, although ISO 2022-based codecs
(at least) would benefit from it. Some codecs (like ISO-8859 codecs)
will have implicit charsets (ASCII code points can't be errors for
them, so only GR matters), and can use codec-specific handlers that
know what the implicit charset is. (AIUI this would require that the
handler-specifying protocol be changed from an enumeration of the
available handlers to the ability to actually specify one.) The rest
can use the None charset, so that code units will be preserved.
Applications which wish to pass strings across process boundaries will
have to pass the table too. If they don't, then in general they can't
use this family of exception handlers.
To handle cases like Marcin's private encoding, and in general to
allow efficient IPC for processes that know they're going to get certain
private use characters in I/O, there should be an API to preallocate
specific code points. (Theoretically, dynamically allocated private
code points could be reallocated, but that would require translating
all existing strings, and I can't believe that would ever be worth
it.)
What happens if a string "escapes" without the table?
1. The application uses the preallocation API. Then the characters
it understands are handled normally, and dynamically allocated private
use characters are errors, anyway. I don't see how this makes things
worse.
2. The application doesn't use the preallocation API, but does know
about some private use characters. Then it will get confused by the
dynamic allocation, as Greg and Marcin point out, and users should be
advised not to use the new handler.
3. The application doesn't know about any private use characters.
Then dynamically allocated characters are exceptions anyway. I don't
see how this makes things worse.
Advantages:
1. Almost all the "interesting" information about the original
encoded source is preserved, including under string operations like
slicing and concatenation with strings from other sources.  (I can
quantify "almost all" more precisely if necessary.)
2. 100% Unicode conformance in the sense that if the internal
representation escapes, it's valid Unicode.
3. Efficient internal representation in the sense that applications
need not worry about invalid Unicode when doing string operations.
4. In 16-bit environments, up to 6400 non-BMP characters can be
mapped into the BMP private use area using the same algorithm,
achieving a "string is character array" representation at the expense
of slight overhead in I/O and one extra table reference in each
character property lookup. As Marcin points out, given that not all
composable characters have one-character NFC representations, we can't
guarantee that the user's notion of string length will equal the
number of characters in the string, but in practice I think that will
almost invariably work out. And if we're doing normalization, the
codec overhead becomes less important.
Disadvantages:
1. Unicode codecs will need to be modified, since they need to throw
exceptions on private use characters.
2. Other codecs will need to be modified to take advantage of this
handler, since AFAIK currently none of the available handlers can use
charset information, so I can't imagine the codecs provide it.
3. More overhead in exception-handling than James Knight's or Marcin
Kowalczyk's proposals.
4. Applications that know about some private use characters will need
to be modified to preallocate those characters before they can take
advantage of this handler.
In general, I don't think that the overhead should be weighted very
heavily against this proposal. Exception handlers impose a fair
amount of overhead anyway, AIUI. Furthermore, any application that
cares enough to keep track of the original code points will IMO be
hungry for any additional information that can help in exception
handling. This proposal provides as much as you can get, short of
buffering all the input.
HTH,