[Python-3000] string C API
Josiah Carlson
jcarlson at uci.edu
Fri Sep 15 19:46:52 CEST 2006
"Jim Jewett" <jimjjewett at gmail.com> wrote:
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings. It
> might make sense to intern only strings that are in the same encoding
> as the source code. (Or whose values are limited to ASCII?)
Why? If the text hash function is defined on *code points*, then
interning, or really any arbitrary dictionary lookup, is the same as it
has always been.
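
To make that concrete, here is a rough sketch, not a proposal for the
actual C API. The Text class and its typecode argument are hypothetical
names made up for illustration; the point is only that the hash and the
equality test look at code points, never at the storage width.

```python
from array import array

class Text:
    """Hypothetical text object (illustration only, not a real API)."""

    def __init__(self, s, typecode):
        # 'B' = 1 byte per item, 'H' = at least 2, 'L' = at least 4.
        self.buf = array(typecode, (ord(c) for c in s))

    def __hash__(self):
        # Depends only on the sequence of code points, not on item size.
        return hash(tuple(self.buf))

    def __eq__(self, other):
        return isinstance(other, Text) and list(self.buf) == list(other.buf)

narrow = Text("abc", "B")
wide = Text("abc", "L")
table = {narrow: "interned"}
print(table[wide])  # found through the wide copy of the same text
```

With that definition, two objects holding the same code points land in
the same dictionary slot no matter how each one is stored.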
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable. Recoding that results
> in different bytes should not be in-place. Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.
What about never recoding? The benefit of the latin-1/ucs-2/ucs-4
method I previously described is that each of the encodings offers a
minimal representation of the code points that the text object contains.
Certain operations would require a bit of work to handle the comparison
of code points stored in an x-bit-wide representation with code points
stored in a y-bit-wide representation.
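
A rough sketch of what I mean, in Python rather than C. All of the
function names here (pack, pack_minimal, code_points, text_equal) are
made up for illustration, and the little-endian layout is an assumption,
not part of any agreed design.

```python
def pack(s, width):
    """Store each code point of s in `width` little-endian bytes."""
    return b"".join(ord(c).to_bytes(width, "little") for c in s)

def pack_minimal(s):
    """Pick the narrowest of 1, 2, or 4 bytes per code point that fits s."""
    top = max(map(ord, s), default=0)
    width = 1 if top <= 0xFF else 2 if top <= 0xFFFF else 4
    return width, pack(s, width)

def code_points(width, buf):
    """Yield the code points stored in a fixed-width buffer."""
    for i in range(0, len(buf), width):
        yield int.from_bytes(buf[i:i + width], "little")

def text_equal(a, b):
    """Compare two (width, buffer) pairs code point by code point."""
    (wa, ba), (wb, bb) = a, b
    if len(ba) // wa != len(bb) // wb:
        return False
    return all(x == y for x, y in zip(code_points(wa, ba), code_points(wb, bb)))
```

The "bit of work" is the widening step inside code_points: neither
buffer is recoded, each item is just read out as an integer before the
comparison.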
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe? Not so good.
Presumably there is going to be a mechanism to open files as bytes
(reads return bytes), and for things like web servers, file servers, etc.,
serving the content up as just a bunch of bytes is really the only thing
that makes sense.
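
Something along these lines, assuming a binary open mode where reads
return bytes (which is how open(path, "rb") behaves in Python 3 as
released). The serve_file name and the chunk size are hypothetical; the
sink can be anything with a write() method that accepts bytes.

```python
def serve_file(path, out, chunk_size=64 * 1024):
    """Copy a file to a binary sink without ever decoding it to text."""
    sent = 0
    with open(path, "rb") as f:      # binary mode: reads return bytes
        while True:
            chunk = f.read(chunk_size)
            if not chunk:            # empty bytes object means end of file
                break
            out.write(chunk)
            sent += len(chunk)
    return sent
```

No encoding is ever consulted on this path, so no boilerplate is needed
to "treat text as bytes"; the data simply never becomes text.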
- Josiah