
Greg Ewing writes:
> Would there be any sanity in having an option to compile Python with UTF-8 as the internal string representation?
Losing Py_UNICODE, as mentioned by Stefan Behnel (IIRC), is just the beginning of the pain. If Emacs's experience is any guide, the cost in speed and complexity of a variable-width internal representation is high. There are a number of tricks you can use, but in the natural implementation most operations that address text by character position (indexing, for example) become O(n). You can get around that with a position cache, of course, but that adds complexity and really cuts into the space saving (and worse, adds another chunk of memory that may or may not be paged in when you need it).

What we're considering for Emacs is a system where buffers come in 1-, 2-, and 4-octet widechar variants, with automatic translation depending on content. But the buffer is the primary random-access structure in Emacsen, so optimizing it is probably worth our effort. I doubt it would be worth it for Python, but my intuitions here are not reliable.
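
To make the indexing cost and the position-cache trade-off concrete, here is a rough Python sketch. It is not how Emacs or CPython implements anything. The names (utf8_offset, char_at, CachedUtf8, narrowest_width) and the stride parameter are invented for illustration, and the 1-octet case assumes a Latin-1 style range (code points below 0x100).

```python
def utf8_offset(buf, index, start_char=0, start_byte=0):
    """Byte offset of character `index`, scanning forward from a known
    (start_char, start_byte) position. Cost is O(index - start_char)."""
    pos = start_byte
    for _ in range(index - start_char):
        if pos >= len(buf):
            raise IndexError("character index out of range")
        pos += 1
        # Skip continuation bytes (bit pattern 10xxxxxx).
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
    if pos >= len(buf):
        raise IndexError("character index out of range")
    return pos


def char_at(buf, index, start_char=0, start_byte=0):
    """Decode the single character at character position `index`."""
    start = utf8_offset(buf, index, start_char, start_byte)
    end = start + 1
    while end < len(buf) and (buf[end] & 0xC0) == 0x80:
        end += 1
    return buf[start:end].decode("utf-8")


class CachedUtf8:
    """UTF-8 buffer plus a position cache: one byte offset stored for
    every `stride` characters. Lookups scan at most stride - 1
    characters, at the price of extra memory for the checkpoints."""

    def __init__(self, text, stride=64):
        self.buf = text.encode("utf-8")
        self.stride = stride
        self.checkpoints = []  # checkpoints[k] = byte offset of char k*stride
        chars = 0
        for pos, byte in enumerate(self.buf):
            if (byte & 0xC0) != 0x80:  # any non-continuation byte starts a char
                if chars % stride == 0:
                    self.checkpoints.append(pos)
                chars += 1
        self.length = chars

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError("character index out of range")
        k = index // self.stride
        return char_at(self.buf, index, k * self.stride, self.checkpoints[k])


def narrowest_width(text):
    """Smallest fixed width (in octets) able to hold every character,
    in the spirit of the 1-, 2-, and 4-octet buffers described above."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4
```

A smaller stride makes lookups cheaper but the cache bigger, which is exactly where the space saving of UTF-8 starts to leak away. The fixed-width alternative avoids the cache entirely, but a single wide character forces the whole buffer up to the next width.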