[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Neil Hodgson nyamatongwe at gmail.com
Wed Oct 26 07:49:39 CEST 2005


M.-A. Lemburg:

> You mean a slice that slices out the next <indextype> ?

   Yes.

> This sounds a lot like you'd want iterators for the various
> index types. Should be possible to implement on top of the
> proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc.

   Iterators may be helpful, but can also be too restrictive when the
processing is not completely iterative, such as peeking ahead or
looking behind to wrap at a word boundary in the display example.
There should be

   It was more that there may be less scope for error if there were a
move away from indexes to slices. The PEP provides ways to specify
what you want to examine or modify, but it looks to me like returning
indexes will lead to code repetition or additional variables, with an
increase in fragility.
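
   As a rough illustration of the difference, here is a sketch over
UTF-16 code units. The helper names utf16_units, next_codepoint_end
and codepoint_slice are made up for this example and are not taken
from the PEP.

```python
def utf16_units(text):
    """Return text as a list of UTF-16 code units (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def next_codepoint_end(units, start):
    """Index-style call: index just past the code point starting at start."""
    is_high = 0xD800 <= units[start] <= 0xDBFF
    has_low = (start + 1 < len(units)
               and 0xDC00 <= units[start + 1] <= 0xDFFF)
    return start + 2 if is_high and has_low else start + 1


def codepoint_slice(units, start):
    """Slice-style call: the code units of the code point starting at start."""
    return units[start:next_codepoint_end(units, start)]


units = utf16_units("a\U00010400b")  # U+10400 needs a surrogate pair

# Index style: the caller keeps an extra variable and slices by hand.
end = next_codepoint_end(units, 1)
piece_by_index = units[1:end]

# Slice style: one call, no intermediate index to get wrong.
piece_by_slice = codepoint_slice(units, 1)
```

   Both calls yield the same two code units here. The slice form just
leaves the caller with one less index to carry around.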

> Note that what most people refer to as "character" is a
> grapheme in Unicode speak.

   A grapheme-oriented string type may be worthwhile although you'd
probably have to choose a particular normalisation form to ease
processing.
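
   To show why a normalisation form would need to be fixed, here is a
small sketch using the standard unicodedata module. The choice of NFC
is only an assumption for the example, not a recommendation from this
thread.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# The same user-visible character, but the raw strings differ.
same_before = decomposed == precomposed

# After normalising both to one form, they compare equal.
nfc_a = unicodedata.normalize("NFC", decomposed)
nfc_b = unicodedata.normalize("NFC", precomposed)
same_after = nfc_a == nfc_b
```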

> Given that interpretation,
> "breaking" Unicode "characters" is something you won't
> ever work around with by using larger code units such
> as UCS4 compatible ones.

   I still think we can reduce the scope for errors.

> Furthermore, you should also note that surrogates (two
> code units encoding one code point) are part of Unicode
> life. While you don't need them when storing Unicode
> in UCS4 code units, they can still be part of the
> Unicode data and the programmer has to be aware of
> these.

   Many programmers can and will ignore surrogates. One day that may
bite them, but we can't close off text processing to those who have no
idea of what surrogates are, or directional marks, or that sorting is
locale-dependent, or have no understanding of the difference between
NFC and NFKD normalization forms.
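
   For readers unfamiliar with that last distinction, a minimal sketch
with the standard unicodedata module follows. The sample characters
are picked only for illustration.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# NFC applies canonical composition only, so the ligature is kept.
nfc = unicodedata.normalize("NFC", ligature)

# NFKD applies compatibility decomposition, so it becomes plain "fi".
nfkd = unicodedata.normalize("NFKD", ligature)

# NFKD also decomposes canonically: e-acute becomes 'e' + combining accent.
accented_nfkd = unicodedata.normalize("NFKD", "\u00e9")
```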

   Neil

