On Fri, Sep 7, 2018 at 5:49 PM Ronald Oussoren via capi-sig capi-sig@python.org wrote:
On 7 Sep 2018, at 10:32, M.-A. Lemburg mal@egenix.com wrote:
Note that UTF-8 is not a good internal representation for Unicode if you want fast indexing and slicing. This is why we use fixed-width code units to represent Unicode strings.
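As a rough illustration of the indexing cost (a sketch only, not CPython's or PyPy's actual implementation; `utf8_index` and `utf32_index` are hypothetical helper names), finding the i-th code point in UTF-8 bytes means scanning from the start, while a fixed-width encoding allows a direct offset calculation:

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return code point number i from UTF-8 bytes by scanning: O(n)."""
    if i < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Any byte that is not a continuation byte (10xxxxxx) starts a code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:
        # The requested code point was the last one in the buffer.
        return data[start:].decode("utf-8")
    raise IndexError("code point index out of range")


def utf32_index(data: bytes, i: int) -> str:
    """Return code point number i from UTF-32-LE bytes by offset: O(1)."""
    if i < 0 or 4 * i + 4 > len(data):
        raise IndexError("code point index out of range")
    return data[4 * i:4 * i + 4].decode("utf-32-le")


text = "a\u00e9\u20ac\U0001F600"  # code points of 1, 2, 3 and 4 UTF-8 bytes
utf8_data = text.encode("utf-8")
utf32_data = text.encode("utf-32-le")
third_utf8 = utf8_index(utf8_data, 2)
third_utf32 = utf32_index(utf32_data, 2)
```

Both calls return the same code point; the difference is only in how much of the buffer has to be examined to find it.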
This is completely off-topic for this discussion, but I wonder whether fast indexing and slicing are really necessary. Even with our current representation, slicing correctly is hard because of combining characters and emoji. Any changes in this regard would probably require changes to the string API and/or additional utilities.
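A small sketch of the problem, using only the standard library (`naive_truncate` is a hypothetical helper for illustration): slicing by code point can split what a user sees as a single character.

```python
import unicodedata


def naive_truncate(text: str, n: int) -> str:
    """Keep the first n code points, ignoring user-perceived characters."""
    return text[:n]


# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one character.
accented = "e\u0301"
# Thumbs-up followed by a skin tone modifier renders as one emoji.
thumbs = "\U0001F44D\U0001F3FD"

# Both strings are two code points long, so cutting after the first code
# point drops the accent or the skin tone.
cut_accented = naive_truncate(accented, 1)
cut_thumbs = naive_truncate(thumbs, 1)

# Normalization helps only where a precomposed form exists: NFC turns the
# accented pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", accented)

# The second code point of the accented pair is a combining mark.
is_combining = unicodedata.combining(accented[1]) != 0
```

Handling this correctly in general requires grapheme cluster segmentation, which the `str` type does not provide.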
Ronald
I agree with this. I'm interested in porting the UTF-8 internal representation from PyPy if it demonstrates good performance and memory efficiency.
Regards,
INADA Naoki songofacandy@gmail.com