
I wrote:
It's a big advantage to have only one string type; it makes many problems we've been discussing easier to talk about.
I think I should've been more explicit about what I meant here. I'll try to phrase it as an addendum to my proposal -- which suddenly is no longer just a narrow/wide string unification but narrow/wide/ultrawide, to really be ready for the future...

As someone else suggested in the discussion, I think it's good if we separate the encoding from the data type, meaning that wide strings are no longer tied to Unicode. This allows for double-byte encodings other than UCS-2 as well as for safe passing-through of binary goop, but that's not the main point. The main point is that this will make the behavior of (wide) strings more understandable and consistent.

The extended string type is simply a sequence of code points, allowing for 0-0xFF for narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for ultra-wide strings. Upcasting is always safe; downcasting may raise OverflowError. Depending on the encoding used, this comes as close as possible to the sequence-of-characters model.

The default character set should of course be Unicode -- and it should be obvious that this implies Latin-1 for narrow strings.

(Additionally: an encoding attribute suddenly makes a whole lot of sense again.)

Ok, y'all can shoot me now ;-)

Just
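
To make the width rule concrete, here is a rough sketch in Python of how such a type might behave. Everything in it is hypothetical: the names WIDTH_LIMITS and CodePointString, the cast() method and the "unicode" default are made up for illustration and are not part of any existing or proposed API. It only shows the code point ranges and the upcast/downcast behavior described above.

```python
# Hypothetical illustration only; none of these names exist in Python.
# Largest code point each width can hold, as given in the proposal.
WIDTH_LIMITS = {
    "narrow": 0xFF,
    "wide": 0xFFFF,
    "ultrawide": 0xFFFFFFFF,
}


class CodePointString:
    """A sequence of code points with a width and a separate encoding tag."""

    def __init__(self, code_points, width="narrow", encoding="unicode"):
        limit = WIDTH_LIMITS[width]
        code_points = tuple(code_points)
        for cp in code_points:
            if not 0 <= cp <= limit:
                raise OverflowError(
                    "code point %#x does not fit in a %s string" % (cp, width)
                )
        self.code_points = code_points
        self.width = width
        # The encoding is just an attribute; it does not affect storage.
        self.encoding = encoding

    def cast(self, width):
        """Return a copy with another width.

        Upcasting always succeeds because the ranges are nested;
        downcasting raises OverflowError if any code point is too large.
        """
        return CodePointString(self.code_points, width, self.encoding)


narrow = CodePointString([0x48, 0xE9])        # both fit in 0-0xFF
wide = narrow.cast("wide")                    # upcast: always safe
euro = CodePointString([0x20AC], "wide")      # needs more than 8 bits
try:
    euro.cast("narrow")                       # downcast: may fail
except OverflowError as exc:
    print("downcast refused:", exc)
```

Note that the encoding rides along unchanged through cast(): in this sketch the width only constrains which code points can be stored, and says nothing about what they mean.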