[Python-3000] string C API

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Thu Sep 14 14:44:28 CEST 2006


Nick Coghlan <ncoghlan at gmail.com> writes:

> Only the first such call on a given string, though - the idea
> is to use lazy decoding, not to avoid decoding altogether.
> Most manipulations (len, indexing, slicing, concatenation, etc)
> would require decoding to at least UCS-2 (or perhaps UCS-4).

Silently optimizing string recoding might change the way recoding
errors are reported: they might not be reported at all, even if the
string is malformed. Optimizations which change the semantics are bad.
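
To make the concern concrete, here is a minimal sketch in present-day
Python. The LazyText class and the two roundtrip helpers are
hypothetical names made up for this illustration; they are not part of
any proposed API. The sketch assumes a lazy string that returns its raw
bytes unchanged when asked to encode into the encoding it came from.

```python
class LazyText:
    """Hypothetical lazily decoded string: keeps the raw bytes until needed."""

    def __init__(self, raw: bytes, encoding: str) -> None:
        self.raw = raw
        self.encoding = encoding
        self._decoded = None

    def decoded(self) -> str:
        # Decoding happens only on first use; a malformed input fails here.
        if self._decoded is None:
            self._decoded = self.raw.decode(self.encoding)
        return self._decoded

    def encode(self, encoding: str) -> bytes:
        # "Optimized" path: same encoding, so the bytes are never validated.
        if encoding == self.encoding:
            return self.raw
        return self.decoded().encode(encoding)


def eager_roundtrip(raw: bytes) -> bytes:
    # Eager decoding validates the input and raises on malformed bytes.
    return raw.decode("utf-8").encode("utf-8")


def lazy_roundtrip(raw: bytes) -> bytes:
    # Lazy decoding passes malformed bytes through without any error.
    return LazyText(raw, "utf-8").encode("utf-8")


malformed = b"abc\xff"  # 0xff can never appear in valid UTF-8
```

With this sketch, eager_roundtrip(malformed) raises UnicodeDecodeError,
while lazy_roundtrip(malformed) returns the malformed bytes unchanged,
so the same program behaves differently depending on the optimization.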

I can imagine only a few cases where lazy decoding would be beneficial:

1. A whole input stream is copied to an output stream which uses the
   same encoding.

   Here the application might choose to copy binary streams instead.

2. A file name, user name, or similar token is obtained from the OS
   in one place and used in another place, especially on Unix, where
   such tokens use byte encodings (Windows prefers UTF-16).

   These cases can be optimized by other means:

   - Sometimes representing the token as a Python string can be
     avoided. For example, code which executes an action in a different
     directory and then returns to the original directory might
     represent the saved directory as a byte array.

   - Under the assumption that the system encoding is ASCII-compatible,
     calling the recoding machinery can be omitted for ASCII-only strings.
     This applies only to strings exchanged with the OS etc., not to
     stream contents which can use non-ASCII-compatible encodings.
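
The ASCII-only shortcut from the last point could look roughly like the
sketch below. The function name os_bytes_to_text and its parameters are
hypothetical, and the sketch assumes the system encoding is
ASCII-compatible, as stated above. In pure Python the fast path still
calls decode(); in a C implementation it would be a plain widening copy
that bypasses the codec machinery.

```python
def os_bytes_to_text(raw: bytes, system_encoding: str = "utf-8") -> str:
    """Hypothetical helper: convert a token received from the OS to text.

    Assumes system_encoding is ASCII-compatible, so pure-ASCII input
    means the same thing in every such encoding.
    """
    # bytes.isascii() is available in Python 3.7 and later.
    if raw.isascii():
        # Fast path: no need to look up or run the system codec.
        return raw.decode("ascii")
    # Slow path: full recoding through the system encoding.
    return raw.decode(system_encoding)
```

Because the fast path is taken only for input that is valid in every
ASCII-compatible encoding, it cannot hide a recoding error, unlike lazy
decoding.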

My language implementation has only two string representations:
ISO-8859-1 and UTF-32 (the narrow representation is used for all
strings where it's possible). This is completely transparent to the
high level semantics, like the fixnum/bignum split. I'm happy with
this choice.
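
As a rough illustration of that choice, written in Python rather than
in my implementation: the function choose_representation and its
return format are hypothetical, and the little-endian byte order for
the wide form is an arbitrary assumption made for the example.

```python
from typing import Tuple


def choose_representation(text: str) -> Tuple[str, bytes]:
    """Hypothetical sketch: pick the narrowest storage that fits the string."""
    # Every code point below 256 fits in one byte as ISO-8859-1.
    if all(ord(ch) < 256 for ch in text):
        return "latin-1", text.encode("latin-1")
    # Otherwise store four bytes per code point (byte order assumed here).
    return "utf-32", text.encode("utf-32-le")
```

Callers would see the same sequence of code points either way; only
the storage differs, much like a small integer versus a bignum.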

My text I/O buffers and recoding buffers use UTF-32 exclusively.
It would be too complicated to try to use a narrow representation
when the string is not processed as a whole. I believe this makes the
ASCII-only optimization significant (but I haven't measured it).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

