[Python-3000] string C API
ncoghlan at gmail.com
Fri Sep 15 15:29:58 CEST 2006
Martin v. Löwis wrote:
> Nick Coghlan wrote:
>> Only the first such call on a given string, though - the idea is to use
>> lazy decoding, not to avoid decoding altogether. Most manipulations
>> (len, indexing, slicing, concatenation, etc) would require decoding to
>> at least UCS-2 (or perhaps UCS-4).
> Ok. Then my objection is this: What about errors that occur in decoding?
> What happens if the bytes are not meaningful in the presumed encoding?
> ISTM that raising the exception lazily (which seems to be necessary)
> would be very confusing.
Yeah, it appears it would be necessary to at least *scan* the string when it
was first created in order to ensure it can be decoded without errors later on.
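
As a rough illustration of the scan-at-creation idea, here is a minimal Python sketch. The class name LazyText and its methods are hypothetical and are not part of any proposed C API. Python has no validate-only call, so the sketch does a throwaway decode as its scan; a real implementation would presumably scan without allocating.

```python
class LazyText:
    """Hypothetical wrapper: validate the bytes eagerly, decode on first use."""

    def __init__(self, raw, encoding="utf-8"):
        # Scan once up front so that malformed input fails here, at creation,
        # rather than at some later and unrelated call site.
        raw.decode(encoding)  # raises UnicodeDecodeError on bad input
        self._raw = raw
        self._encoding = encoding
        self._decoded = None  # filled in by the first operation that needs it

    def _text(self):
        # Lazy decoding: only the first call pays the decoding cost.
        if self._decoded is None:
            self._decoded = self._raw.decode(self._encoding)
        return self._decoded

    def __len__(self):
        return len(self._text())

    def __getitem__(self, index):
        return self._text()[index]
```

With this shape, LazyText(b"\xff\xfe") fails immediately in the constructor, so no exception can surface lazily from len() or indexing.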
I also realised there is another issue with an internal representation that
can change over the life of a string, which is that of thread-safety.
Since strings don't currently have any mutable internal state, it's possible
to freely share them between threads (without this property, the interning
behaviour would be doomed).
If strings could change the encoding of their internal buffers then they'd
have to use a read/write lock internally on all operations that might be
affected when the internal representation changes. Blech.
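
To make the locking cost concrete, here is a hedged Python sketch of a string-like object whose internal buffer changes on first use. MutableReprText is a hypothetical name, and a plain threading.Lock stands in for the read/write lock mentioned above, since the standard library does not provide one.

```python
import threading


class MutableReprText:
    """Hypothetical string whose internal buffer changes after creation."""

    def __init__(self, raw, encoding="utf-8"):
        self._lock = threading.Lock()  # stand-in for a read/write lock
        self._raw = raw
        self._encoding = encoding
        self._decoded = None

    def _text(self):
        # Every reader has to take the lock, because another thread may be
        # swapping the internal buffer at the same moment.
        with self._lock:
            if self._decoded is None:
                self._decoded = self._raw.decode(self._encoding)
                self._raw = None  # the representation has now changed
            return self._decoded

    def __len__(self):
        return len(self._text())
```

Even len() now pays for a lock acquisition, which is the overhead an immutable internal representation avoids.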
Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as
internal representations, and choosing which one to use when the string is
created.
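
One way to picture picking among latin-1, UCS-2 and UCS-4 once, up front, is to look at the highest code point in the text. The function name pick_representation and its return format are assumptions made for this sketch, not anything from the proposal.

```python
def pick_representation(text):
    """Return (name, bytes_per_char) for the narrowest fixed-width form.

    The choice is made once from the highest code point, so the
    representation never needs to change afterwards.
    """
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return ("latin-1", 1)   # every code point fits in one byte
    if highest < 0x10000:
        return ("UCS-2", 2)     # everything is inside the BMP
    return ("UCS-4", 4)         # at least one code point beyond the BMP
```

For example, pick_representation("abc") selects latin-1, while a single character outside the Basic Multilingual Plane forces UCS-4 for the whole string.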
Sure, certain applications that are just copying from one data stream to
another (both in the same encoding) may needlessly decode and then re-encode
the data, but if the application *knows* that this might happen (and has
reason to care about optimising the performance of this case), then the
application is free to decouple the "reading" and "decoding" steps, and just
transfer raw bytes between the streams.
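
A minimal sketch of that decoupling in Python, assuming both streams are opened in binary mode. The helper name copy_raw is hypothetical; shutil.copyfileobj is the standard library call that does the actual work.

```python
import io
import shutil


def copy_raw(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst without ever decoding them."""
    # Both streams are binary, so no str objects are created and the
    # encoding of the payload is irrelevant to the copy.
    shutil.copyfileobj(src, dst, chunk_size)


source = io.BytesIO("café\n".encode("utf-8"))
sink = io.BytesIO()
copy_raw(source, sink)
```

Decoding then happens only where the application actually needs text, for example by calling sink.getvalue().decode("utf-8").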
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia