PEP: Unicode Indexing Helper Module
M.-A. Lemburg
mal at lemburg.com
Fri Jul 13 08:04:16 EDT 2001
Please comment...
--
PEP: 0262 (?)
Title: Unicode Indexing Helper Module
Version: $Revision: 1.0 $
Author: mal at lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:
Abstract
This PEP proposes a new module "unicodeindex" which provides
means to index Unicode objects in terms of various higher-level
abstractions of "characters".
Problem and Terminology
Unicode objects can be indexed just like string objects, using what
in Unicode terms is called a code unit as the index basis.
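Code units are the storage entities used by the Unicode
implementation to store a single Unicode information unit. They do
not necessarily map 1-1 to code points, which are the smallest
entities encoded by the Unicode standard. A UTF-16 based
implementation, for example, stores code points above U+FFFF as
two code units (a surrogate pair).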
These code points can sometimes be composed to form graphemes,
which are then displayed by the Unicode output device as one
character. A word is then a sequence of characters separated by
space characters or punctuation; a line is a sequence of code
points separated by line breaking code point sequences.
For addressing Unicode, there are basically five different methods
by which you can reference the data:
1. per code unit (codeunit)
2. per code point (codepoint)
3. per grapheme (grapheme)
4. per word (word)
5. per line (line)
The indexing type name is given in parentheses and used in the
module interface.
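To make the difference between these levels concrete, the following
sketch counts the same text per code unit, code point and grapheme,
and a second text per word and line. It is written for present-day
Python 3, where str objects are indexed by code point. The helper
names are hypothetical and not part of the proposed module; code
units are assumed to be UTF-16 units, and the grapheme count uses a
simplified rule (a base code point plus any following combining
marks) rather than the full Unicode segmentation rules.

```python
import unicodedata

def count_codeunits(s):
    # Assumption: UTF-16 storage, so code points above U+FFFF
    # occupy two code units (a surrogate pair).
    return len(s.encode("utf-16-le")) // 2

def count_codepoints(s):
    # Python 3 str objects are sequences of code points.
    return len(s)

def count_graphemes(s):
    # Simplified rule: every code point that is not a combining
    # mark starts a new grapheme.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# "e" + COMBINING ACUTE ACCENT + MUSICAL SYMBOL G CLEF (U+1D11E)
text = "e\u0301\U0001D11E"
print(count_codeunits(text), count_codepoints(text), count_graphemes(text))
# prints: 4 3 2

# Words and lines, approximated with the built-in str methods.
prose = "two words\nnext line"
print(len(prose.split()), len(prose.splitlines()))
# prints: 4 2
```

The three numbers on the first output line differ because the accent
combines with the "e" into one grapheme, while the clef needs two
UTF-16 code units for a single code point.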
Proposed Solution
I propose to add a new module to the standard Python library which
provides interfaces implementing the above indexing methods.
Module Interface
The module should provide the following interfaces for all five
indexing styles:
next_<indextype>(u, index) -> integer
Returns the Unicode object index for the start of the next
<indextype> found after u[index] or -1 in case no next element
of this type exists.
prev_<indextype>(u, index) -> integer
Returns the Unicode object index for the start of the previous
<indextype> found before u[index] or -1 in case no previous
element of this type exists.
<indextype>_index(u, n) -> integer
Returns the Unicode object index for the start of the n-th
<indextype> element in u. Raises an IndexError in case no n-th
element can be found.
<indextype>_count(u, index) -> integer
Counts the number of complete <indextype> elements found in
u[:index] and returns the count as integer.
<indextype>_start(u, index) -> integer
Returns 1 or 0 depending on whether u[index] marks the start of
an <indextype> element.
<indextype>_end(u, index) -> integer
Returns 1 or 0 depending on whether u[index] marks the end of an
<indextype> element.
Used symbols:
<indextype> is one of: codeunit, codepoint, grapheme, word, line
u is the Unicode object
index is the Unicode object index
n is an integer
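As an illustration of the intended semantics, here is a minimal
sketch of the codepoint variants of these interfaces in present-day
Python 3. Because Python 3 strings are already indexed by code point,
the Unicode object u is modelled as a list of UTF-16 code units. The
helper to_codeunits(), the zero-based reading of n, and the treatment
of an index that falls inside a surrogate pair are assumptions of
this sketch; the PEP text does not specify them.

```python
import struct

def to_codeunits(s):
    # Hypothetical helper: model a Unicode object as UTF-16 code units.
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def _is_high(unit):
    return 0xD800 <= unit <= 0xDBFF

def _is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF

def codepoint_start(u, index):
    # 0 only for the trailing half of a well-formed surrogate pair.
    if _is_low(u[index]) and index > 0 and _is_high(u[index - 1]):
        return 0
    return 1

def codepoint_end(u, index):
    # 0 only for the leading half of a well-formed surrogate pair.
    if _is_high(u[index]) and index + 1 < len(u) and _is_low(u[index + 1]):
        return 0
    return 1

def next_codepoint(u, index):
    # First code point start strictly after u[index], or -1.
    for i in range(index + 1, len(u)):
        if codepoint_start(u, i):
            return i
    return -1

def prev_codepoint(u, index):
    # Assumption: step back to the start of the code point containing
    # u[index], then look for the start of the one before it.
    while index > 0 and not codepoint_start(u, index):
        index -= 1
    i = index - 1
    while i >= 0 and not codepoint_start(u, i):
        i -= 1
    return i  # -1 when there is no previous code point

def codepoint_index(u, n):
    # Assumption: n is zero-based.
    seen = -1
    for i in range(len(u)):
        if codepoint_start(u, i):
            seen += 1
            if seen == n:
                return i
    raise IndexError("no code point number %d" % n)

def codepoint_count(u, index):
    # Number of complete code points in u[:index].
    index = min(index, len(u))
    count = sum(codepoint_start(u, i) for i in range(index))
    if 0 < index < len(u) and not codepoint_start(u, index):
        count -= 1  # the slice cuts a surrogate pair in half
    return count

u = to_codeunits("a\U0001D11Eb")   # code units: 0x61, 0xD834, 0xDD1E, 0x62
print(next_codepoint(u, 1), prev_codepoint(u, 3), codepoint_count(u, 2))
# prints: 3 1 1
```

The same shape would carry over to the other index types by swapping
the start/end tests for grapheme, word or line boundary tests.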
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/