[XML-SIG] Indexing Unicode (Re: Issues with Unicode type)

M.-A. Lemburg mal@lemburg.com
Tue, 24 Sep 2002 16:20:12 +0200

Tom Emerson wrote:
> Fred L. Drake, Jr. writes:
>>I've just added a note to the docs for Python 2.2.2 and 2.3 that len()
>>returns the number of storage units, not abstract characters.  I don't
>>expect that to change given that it's been doing it that way since the
>>Unicode type was introduced.
> Since this appears to be a point of some confusion to people who
> aren't indoctrinated, perhaps a discussion needs to be put into the
> documentation (extracted from the PEP or written anew) about the
> reasons for this. Or is there a feeling that such detail doesn't
> belong in the "regular" docs?

It would probably help to at least raise the issue in the Unicode
docs and make the reader aware of the differences between code
units, code points and graphemes.

len() traditionally refers to the number of items stored in
a sequence, so in Unicode terms it returns the number of
code units stored in the Unicode object. The same is true
for indexing: u[i] will give you the i-th code unit, not
necessarily the i-th code point or even i-th grapheme.
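
A small sketch may make the difference concrete. It assumes UTF-16 storage, as used by narrow Python builds, and emulates the code units by encoding to UTF-16 (current Python 3 str objects index by code point instead). The helper name utf16_code_units is made up for this illustration.

```python
import struct

def utf16_code_units(s):
    """Return the 16-bit code units a UTF-16 based implementation would store for s."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# 'a' followed by U+10400, a code point outside the Basic Multilingual Plane
text = "a\U00010400"
units = utf16_code_units(text)

print(len(units))               # 3: the second code point needs a surrogate pair
print([hex(x) for x in units])  # ['0x61', '0xd801', '0xdc00']
print(hex(units[1]))            # 0xd801: index 1 is half of a code point
```

Under that assumption, len() reports 3 for a string of two code points, and indexing at position 1 or 2 yields a lone surrogate rather than a complete code point.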

Depending on how you view this, you could say that any
given Unicode implementation is a variable length encoding
of graphemes -- the talk I referenced earlier in this thread
has a slide explaining this.

It would be nice to have a Unicode indexing module which provides
indexing and length measuring methods based on more than just
code units.

Here's a PEP I started for this last year but which never
got finished:

Title: Unicode Indexing Helper Module
Version: $Revision: 1.0 $
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001


     This PEP proposes a new module "unicodeindex" which provides
     means to index Unicode objects in various higher level abstractions
     of "characters".

Problem and Terminology

     Unicode objects can be indexed just like string objects using what
     in Unicode terms is called a code unit as index basis.

     Code units are the storage entities used by the Unicode
     implementation to store a single Unicode information unit and do
     not necessarily map 1-1 to code points which are the smallest
     entities encoded by the Unicode standard. Python exposes code
     units to the programmer via the Unicode object indexing and slicing
     API, e.g. u[10] or u[12:15] refer to the code units at index 10
     and indices 12 to 14.

     These code points can sometimes be composed to form graphemes
     which are then displayed by the Unicode output device as one
     character. A word is then a sequence of characters separated by
     space characters or punctuation, a line is a sequence of code
     points separated by line breaking code point sequences.
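
     As a rough illustration of code points composing into graphemes,
     the following sketch treats every code point with a non-zero
     canonical combining class as belonging to the preceding base
     character. This is only an approximation (full grapheme cluster
     segmentation is defined by Unicode Standard Annex #29), and the
     function name approx_grapheme_starts is hypothetical.

```python
import unicodedata

def approx_grapheme_starts(s):
    """Return the indices at which an (approximate) grapheme starts.

    Assumption: a grapheme is a base code point followed by any number
    of combining marks (canonical combining class != 0).
    """
    return [i for i, ch in enumerate(s)
            if i == 0 or unicodedata.combining(ch) == 0]

# 'e' + COMBINING ACUTE ACCENT + 'a': three code points, two graphemes
s = "e\u0301a"
starts = approx_grapheme_starts(s)

print(len(s))       # 3 code points
print(starts)       # [0, 2]
print(len(starts))  # 2 graphemes under the approximation above
```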

     For addressing Unicode, there are basically five different methods
     by which you can reference the data:

     1. per code unit    (codeunit)
     2. per code point   (codepoint)
     3. per grapheme     (grapheme)
     4. per word         (word)
     5. per line         (line)

     The indexing type name is given in parentheses and used in the
     module interface.

Proposed Solution

     I propose to add a new module to the standard Python library which
     provides interfaces implementing the above indexing methods.

Module Interface

     The module should provide the following interfaces for all five
     indexing styles:

     next_<indextype>(u, index) -> integer

         Returns the Unicode object index for the start of the next
         <indextype> found after u[index] or -1 in case no next element
         of this type exists.

     prev_<indextype>(u, index) -> integer

         Returns the Unicode object index for the start of the previous
         <indextype> found before u[index] or -1 in case no previous
         element of this type exists.

     <indextype>_index(u, n) -> integer

         Returns the Unicode object index for the start of the n-th
         <indextype> element in u. Raises an IndexError in case no n-th
         element can be found.

     <indextype>_count(u, index) -> integer

         Counts the number of complete <indextype> elements found in
         u[:index] and returns the count as integer.

     <indextype>_start(u, index) -> integer

         Returns 1 or 0 depending on whether u[index] marks the start of an
         <indextype> element.

     <indextype>_end(u, index) -> integer

         Returns 1 or 0 depending on whether u[index] marks the end of an
         <indextype> element.

     <indextype>_slice(u, index) -> slice object or None

         Returns the slice pointing to the <indextype> element found in
         u at the given index or None in case no such element can be found
         at that position.

     Symbols used in the above definitions:

        <indextype>   one of: codeunit, codepoint, grapheme, word, line
        u             is the Unicode object
        index         the Unicode object index, e.g. 10 in u[10]
        n             is an integer

     Note that in Unicode terms, the Unicode object index refers to a
     code unit.
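
     A minimal sketch of how the codepoint variants of these interfaces
     might behave, not a reference implementation. It makes several
     assumptions that the draft does not spell out: u is modelled as a
     list of UTF-16 code units, n is 0-based, lone surrogates count as
     code points of their own, and prev_codepoint first steps back to
     the start of the code point containing u[index].

```python
def _is_high(unit):
    return 0xD800 <= unit <= 0xDBFF

def _is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF

def codepoint_start(u, index):
    """Return 1 if u[index] starts a code point, else 0."""
    # A low surrogate directly after a high surrogate continues a pair.
    if index > 0 and _is_low(u[index]) and _is_high(u[index - 1]):
        return 0
    return 1

def next_codepoint(u, index):
    """Index of the start of the next code point after u[index], or -1."""
    for i in range(index + 1, len(u)):
        if codepoint_start(u, i):
            return i
    return -1

def prev_codepoint(u, index):
    """Index of the start of the previous code point before u[index], or -1."""
    start = index
    while not codepoint_start(u, start):   # move to start of current element
        start -= 1
    for i in range(start - 1, -1, -1):
        if codepoint_start(u, i):
            return i
    return -1

def codepoint_index(u, n):
    """Index of the start of the n-th (0-based) code point in u."""
    count = 0
    for i in range(len(u)):
        if codepoint_start(u, i):
            if count == n:
                return i
            count += 1
    raise IndexError("no code point number %d" % n)

def codepoint_count(u, index):
    """Number of complete code points in u[:index]."""
    index = min(index, len(u))
    count = sum(codepoint_start(u, i) for i in range(index))
    # If the slice cuts a surrogate pair in half, the last one is incomplete.
    if 0 < index < len(u) and not codepoint_start(u, index):
        count -= 1
    return count

def codepoint_slice(u, index):
    """Slice covering the code point that contains u[index], or None."""
    if not 0 <= index < len(u):
        return None
    start = index
    while not codepoint_start(u, start):
        start -= 1
    end = next_codepoint(u, start)
    return slice(start, len(u) if end == -1 else end)

# 'a', U+10400 as a surrogate pair, 'b'
u = [0x61, 0xD801, 0xDC00, 0x62]
print(next_codepoint(u, 1))    # 3
print(prev_codepoint(u, 3))    # 1
print(codepoint_count(u, 2))   # 1
print(codepoint_slice(u, 2))   # slice(1, 3, None)
```

     The grapheme, word and line variants would follow the same
     pattern, with codepoint_start replaced by the corresponding
     boundary test.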


     This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/