[Python-3000] string C API
Nick Coghlan
ncoghlan at gmail.com
Fri Sep 15 17:15:27 CEST 2006
Jim Jewett wrote:
>> > ISTM that raising the exception lazily (which seems to be necessary)
>> > would be very confusing.
>
>> Yeah, it appears it would be necessary to at least *scan* the string
>> when it was first created in order to ensure it can be decoded without
>> errors later on.
>
> What happens today with strings? I think the answer is:
> "Nothing.
> They print something odd when printed.
> They may raise errors when explicitly recoded to unicode."
> Why is this a problem?
We don't have 8-bit strings lying around in Py3k. To convert bytes to
characters, they *must* be converted to unicode code points.
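To make that distinction concrete, here's a minimal Py3k-style sketch
(using only the builtin bytes and str types):

    raw = b"hello"              # bytes: a sequence of integers 0-255
    text = raw.decode("utf-8")  # str: a sequence of Unicode code points
    # raw + text                # TypeError in Py3k: bytes and str don't mix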
> I'm not so happy about the efficiency implication of the idea that
> *all* strings *must* be validated (let alone recoded).
Then always define latin-1 as the source encoding for your files - it will
just pass the bytes straight through.
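A quick sketch of the pass-through property:

    data = bytes(range(256))
    s = data.decode("latin-1")               # byte N -> code point N, never fails
    assert [ord(c) for c in s] == list(data)
    assert s.encode("latin-1") == data       # and it round-trips exactly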
>> Since strings don't currently have any mutable internal state, it's
>> possible to freely share them between threads (without this property,
>> the interning behaviour would be doomed).
>
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings. It
> might make sense to intern only strings that are in the same encoding
> as the source code. (Or whose values are limited to ASCII?)
Unicode strings don't have an encoding - they only store code points.
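A short illustration - the same string yields different byte sequences
under different codecs, so the encoding belongs to the bytes, not the str:

    s = "caf\u00e9"
    s.encode("utf-8")    # b'caf\xc3\xa9' - U+00E9 becomes two bytes
    s.encode("latin-1")  # b'caf\xe9'     - one byte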
>> If strings could change the encoding of their internal buffers then
>> they'd have to use a read/write lock internally on all operations that
>> might be affected when the internal representation changes. Blech.
>
> Why?
>
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable. Recoding that results
> in different bytes should not be in-place. Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.
>
> Anything keeping its own reference to the old databuffer (and old
> encoding) will continue to work, so immutability ==> the two buffers
> really are equivalent.
I admit that by using a separate Python object for the data buffer instead of
a pointer to raw memory, the incref/decref in the processing code becomes the
moral equivalent of a read lock, but consider the case where Thread A performs
an operation and decides "I need to recode the buffer to UCS-4" at the same
time that Thread B performs an operation and decides "I need to recode the
buffer to UCS-4".
To deal with that you would still want to be very careful with the incref
new/reassign/decref old step for switching in the new data buffer (probably
by using some form of atomic reassignment operation).
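Here's a rough Python-level sketch of that swap discipline (the names
RecodableString and _recode are hypothetical, and the real thing would be C
with an atomic pointer swap rather than a lock):

    import threading

    class RecodableString:
        # Hypothetical sketch, not the actual CPython design: the
        # (data, codec) pair lives in a single attribute so readers
        # always see a matched pair, and a lock serialises the swap
        # so two racing recoders don't both publish a new buffer.
        def __init__(self, data, codec):
            self._state = (data, codec)        # replaced as one unit
            self._swap_lock = threading.Lock()

        def _recode(self, target):
            data, codec = self._state          # snapshot a consistent pair
            if codec == target:
                return                         # another thread already recoded
            new_data = data.decode(codec).encode(target)
            with self._swap_lock:
                if self._state[1] != target:   # re-check under the lock
                    self._state = (new_data, target)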
And this style has some very serious overhead implications, as each string
would now require:
  - the string object, with a 32 or 64 bit pointer to the data buffer object
  - the data buffer object
String memory overhead would double, with an additional 32 or 64 bits
depending on the platform. This is a pretty significant increase when it
comes to identifier-length strings.
So still blech, even if you make the data buffer a separate Python object to
avoid the need for an actual read/write lock.
>> Sure, certain applications that are just copying from one data stream
>> to another (both in the same encoding) may needlessly decode and then
>> re-encode the data,
>
> Other than text editors, "certain" includes almost any application I
> have ever used, let alone written.
If you're reading text and you *know* it is ASCII data, then you can just set
the encoding to latin-1 (since that can simply copy the original bytes into
the string's internal buffer - the actual ascii codec has to check each byte
to see whether or not the high bit is set, so it would be slower, and would
blow up with a UnicodeDecodeError if the high bit was ever set).
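For instance, a runnable sketch of the difference:

    data = b"caf\xe9"                  # last byte has the high bit set

    print(data.decode("latin-1"))      # 'café' - bytes copied straight through
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        print(exc)                     # ordinal not in range(128)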
I suspect an awful lot of quick-and-dirty scripts written by native English
speakers will do exactly that.
>> but if the application *knows* that this might happen (and has reason
>> to care about optimising the performance of this case), then the
>> application is free to decouple the "reading" and "decoding" steps, and
>> just transfer raw bytes between the streams.
>
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe? Not so good.
No, the standard recipe becomes "handle bytes as bytes and text as
characters". If you know your source data is 8-bit text (or are happy to treat
it that way, even if it isn't), then use the latin-1 codec to decode the
original bytes directly to 8-bit characters.
Or just open the file in binary and read the data in as bytes instead of
characters.
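Side by side, the two idioms look like this ('notes.txt' is just a
hypothetical file name):

    with open("notes.txt", encoding="latin-1") as f:
        text = f.read()                # str: one code point per byte

    with open("notes.txt", "rb") as f:
        data = f.read()                # bytes: no decoding at all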
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org