Please comment...
--

PEP: 0262 (?)
Title: Unicode Indexing Helper Module
Version: $Revision: 1.0 $
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:

Abstract

    This PEP proposes a new module "unicodeindex" which provides
    means to index Unicode objects in various higher level
    abstractions of "characters".

Problem and Terminology

    Unicode objects can be indexed just like string objects, using
    what in Unicode terms is called a code unit as the index basis.

    Code units are the storage entities used by the Unicode
    implementation to store a single Unicode information unit. They
    do not necessarily map 1-1 to code points, which are the
    smallest entities encoded by the Unicode standard.

    These code points can sometimes be composed to form graphemes,
    which are then displayed by the Unicode output device as one
    character. A word is then a sequence of characters separated by
    space characters or punctuation; a line is a sequence of code
    points separated by line breaking code point sequences.

    For addressing Unicode, there are basically five different
    methods by which you can reference the data:

    1. per code unit (codeunit)
    2. per code point (codepoint)
    3. per grapheme (grapheme)
    4. per word (word)
    5. per line (line)

    The indexing type name is given in parentheses and used in the
    module interface.

Proposed Solution

    I propose to add a new module to the standard Python library
    which provides interfaces implementing the above indexing
    methods.

Module Interface

    The module should provide the following interfaces for all five
    indexing styles:

    next_<indextype>(u, index) -> integer

        Returns the Unicode object index for the start of the next
        <indextype> found after u[index] or -1 in case no next
        element of this type exists.

    prev_<indextype>(u, index) -> integer

        Returns the Unicode object index for the start of the
        previous <indextype> found before u[index] or -1 in case no
        previous element of this type exists.

    <indextype>_index(u, n) -> integer

        Returns the Unicode object index for the start of the n-th
        <indextype> element in u. Raises an IndexError in case no
        n-th element can be found.

    <indextype>_count(u, index) -> integer

        Counts the number of complete <indextype> elements found in
        u[:index] and returns the count as integer.

    <indextype>_start(u, index) -> integer

        Returns 1 or 0 depending on whether u[index] marks the start
        of an <indextype> element.

    <indextype>_end(u, index) -> integer

        Returns 1 or 0 depending on whether u[index] marks the end
        of an <indextype> element.

    Used symbols:

    <indextype>  one of: codeunit, codepoint, grapheme, word, line
    u            is the Unicode object
    index        the Unicode object index
    n            is an integer

Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                 http://www.egenix.com/
Python Software:                      http://www.lemburg.com/python/
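
To make the proposed interface concrete, here is a minimal sketch of the "codepoint" variants only. It is not the PEP's implementation. It assumes the Unicode object is modelled as a list of UTF-16 code units (integers), because current Python 3 strings are already indexed by code point. The function names follow the PEP's naming pattern, but the 0-based meaning of n, the handling of unpaired surrogates, and the behaviour when index points into the middle of a surrogate pair are assumptions the draft does not settle.

```python
def _is_high(cu):
    # High (leading) surrogate range in UTF-16.
    return 0xD800 <= cu <= 0xDBFF


def _is_low(cu):
    # Low (trailing) surrogate range in UTF-16.
    return 0xDC00 <= cu <= 0xDFFF


def codepoint_start(u, index):
    """Return 1 if u[index] starts a code point, else 0."""
    if _is_low(u[index]) and index > 0 and _is_high(u[index - 1]):
        return 0  # second half of a surrogate pair
    return 1


def next_codepoint(u, index):
    """Code unit index of the next code point start after u[index], or -1."""
    i = index + 1
    while i < len(u) and not codepoint_start(u, i):
        i += 1
    return i if i < len(u) else -1


def prev_codepoint(u, index):
    """Code unit index of the previous code point start before u[index], or -1."""
    i = index
    # Assumption: first back up to the start of the code point holding u[index].
    while i > 0 and not codepoint_start(u, i):
        i -= 1
    i -= 1
    while i > 0 and not codepoint_start(u, i):
        i -= 1
    return i if i >= 0 else -1


def codepoint_index(u, n):
    """Code unit index of the n-th code point (assumed 0-based) in u."""
    starts = [i for i in range(len(u)) if codepoint_start(u, i)]
    if n < 0 or n >= len(starts):
        raise IndexError("no such code point")
    return starts[n]


def codepoint_count(u, index):
    """Number of complete code points in u[:index]."""
    index = min(index, len(u))
    count = 0
    i = 0
    while i < index:
        paired = _is_high(u[i]) and i + 1 < len(u) and _is_low(u[i + 1])
        width = 2 if paired else 1
        if i + width <= index:
            count += 1  # only count code points that fit entirely in u[:index]
        i += width
    return count


# "a", U+10000 (as a surrogate pair), "b"
u = [0x61, 0xD800, 0xDC00, 0x62]
```

With this data, next_codepoint(u, 1) skips the low surrogate and lands on code unit 3, while codepoint_count(u, 2) counts only the leading "a", since u[:2] cuts the surrogate pair in half.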
M.-A. Lemburg:
next_<indextype>(u, index) -> integer
Returns the Unicode object index for the start of the next <indextype> found after u[index] or -1 in case no next element of this type exists.
prev_<indextype>(u, index) -> integer ...
It's not clear to me from the description whether the term "object index" refers to a code unit index or an <indextype> index. A code unit index seems to make the most sense, but this should be explicit.

Neil
Good point. The "Unicode object index" refers to the index you use for slicing or indexing Unicode objects, i.e. as in "u[10]" or "u[12:15]". As such, it refers to the Unicode code unit as implemented by the Unicode implementation (and is application specific). I'll add a note to the PEP.

Thanks,
-- 
Marc-Andre Lemburg
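
The difference between the two index spaces discussed here can be illustrated with a short sketch. It assumes a UTF-16 storage model, which is only one possible implementation choice. The helper names utf16_code_units and codepoint_offsets are hypothetical and not part of the proposed module. The first exposes the code units behind a string; the second lists the code unit index at which each code point starts.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def codepoint_offsets(units):
    """Return the code unit index at which each code point starts."""
    offsets = []
    i = 0
    while i < len(units):
        offsets.append(i)
        # A high surrogate followed by a low surrogate forms one code point.
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
    return offsets


text = "a\U00010000b"
units = utf16_code_units(text)
offsets = codepoint_offsets(units)
```

Here units holds four code units but offsets holds only three entries, so the third code point ("b") sits at code unit index 3. Under the clarification above, that 3 is the kind of value the module functions would take and return.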