[ python-Feature Requests-1528154 ] New sequences for Unicode groups and block ranges needed

Mon Dec 4 10:27:55 CET 2006

Feature Requests item #1528154, was opened at 2006-07-25 06:44
Message generated for change (Comment added) made by effbot
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1528154&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Regular Expressions
Group: None
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: gmarketer (gmarketer)
Assigned to: Nobody/Anonymous (nobody)
Summary: New sequences for Unicode groups and block ranges needed

Initial Comment:
Special sequences consisting of "\" and another
character need to be added to the RE syntax to simplify
matching several Unicode classes, such as:
 * All uppercase letters
 * All lowercase letters

----------------------------------------------------------------------

>Comment By: Fredrik Lundh (effbot)
Date: 2006-12-04 10:27

Message:
Logged In: YES 
user_id=38376
Originator: NO

note that posix uses a special set syntax, [:name:], for this purpose:

[:alnum:]   [:cntrl:]   [:lower:]   [:space:]
[:alpha:]   [:digit:]   [:print:]   [:upper:]
[:blank:]   [:graph:]   [:punct:]   [:xdigit:]

adding a new character escape will probably break more existing
expressions, but no matter what syntax we choose, this is (micro-)PEP
territory.
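
To illustrate the set syntax above: the re module does not understand
[:name:] itself, so the following is only a rough sketch of how such
names could be expanded before compiling. The table POSIX_ASCII and the
helper expand_posix are hypothetical names, and the expansions are
ASCII-only approximations of the POSIX classes, not Unicode-aware ones.

```python
import re

# Hypothetical ASCII-only expansions of the twelve POSIX class names.
# Each value is text meant to sit inside an enclosing [...] set.
POSIX_ASCII = {
    "alnum": r"a-zA-Z0-9",
    "alpha": r"a-zA-Z",
    "blank": r" \t",
    "cntrl": r"\x00-\x1f\x7f",
    "digit": r"0-9",
    "graph": r"\x21-\x7e",
    "lower": r"a-z",
    "print": r"\x20-\x7e",
    "punct": r"\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e",
    "space": r" \t\n\r\f\v",
    "upper": r"A-Z",
    "xdigit": r"0-9A-Fa-f",
}


def expand_posix(pattern):
    """Replace every [:name:] in pattern with its ASCII expansion.

    Assumes [:name:] only appears inside a bracket expression, as in
    POSIX. An unknown name raises KeyError.
    """
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)


# "[[:alpha:][:digit:]]+" becomes "[a-zA-Z0-9]+" before compilation.
ident = re.compile(expand_posix(r"[[:alpha:][:digit:]]+"))
```

Because the expansion is done on the pattern text, an existing expression
that happens to contain a literal "[:" would change meaning, which is the
kind of compatibility concern raised above.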

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-09-10 12:36

Message:
Logged In: YES 
user_id=21627

If anything, I think Python should implement Unicode TR#18:

http://www.unicode.org/unicode/reports/tr18/

This does include the \p notation for property expressions,
e.g. \p{Ll} or \p{East Asian Width:Narrow}.

We currently don't include the Script property, so \p{Greek}
could not be implemented (we can, of course, add support for
the script property). I can't find anything in the report
that makes \p{IsGreek} valid, so we shouldn't support it.
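
Until something like \p{Ll} exists in the re module, a general category
can be emulated by generating an explicit character set from the
unicodedata module. This is a minimal sketch, not a proposed
implementation: category_class is a hypothetical helper, it covers only
the General_Category property (not Script or East Asian Width), and it
scans every code point once, so the result is cached.

```python
import re
import sys
import unicodedata
from functools import lru_cache


def _esc(cp):
    # \UXXXXXXXX escapes are accepted in str patterns and need no quoting.
    return f"\\U{cp:08x}"


@lru_cache(maxsize=None)
def category_class(*prefixes, negate=False):
    """Build a [...] set for code points whose category starts with a prefix.

    category_class("Ll") approximates \\p{Ll}; category_class("L") covers
    all letter categories; negate=True approximates \\P{...}.
    """
    parts = []
    run_start = None
    # Go one past the last code point so the final run is flushed.
    for cp in range(sys.maxunicode + 2):
        inside = (cp <= sys.maxunicode
                  and unicodedata.category(chr(cp)).startswith(prefixes))
        if inside and run_start is None:
            run_start = cp
        elif not inside and run_start is not None:
            last = cp - 1
            if last == run_start:
                parts.append(_esc(run_start))
            else:
                parts.append(f"{_esc(run_start)}-{_esc(last)}")
            run_start = None
    if not parts:
        raise ValueError("no code points match %r" % (prefixes,))
    return "[" + ("^" if negate else "") + "".join(parts) + "]"


lowercase = re.compile(category_class("Ll") + "+")
uppercase = re.compile(category_class("Lu") + "+")
```

The generated sets reflect whatever Unicode version the running
interpreter's unicodedata was built with.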

----------------------------------------------------------------------

Comment By: gmarketer (gmarketer)
Date: 2006-07-26 04:06

Message:
Logged In: YES 
user_id=1334865

We need to process several UTF-8 strings and need to use
regular expressions to match a pattern, e.g.:
r"[ANY_LANGUAGE_UPPERCASE_LETTER,0-9ANY_LANGUAGE_LOWERCASE_LETTER]+|NOT_ANY_LANGUAGE_CURRENCY"

We don't know how to implement this logic by hand.

Also, I found this logic implemented in Microsoft .NET
regular expressions:

\p{name}        Matches any character in the named character
class 'name'. Supported names are Unicode groups and block
ranges. For example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc
(currency). 

\P{name}        Matches text not included in the named
character class 'name'. 

We need the same logic in regular expressions.
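
For reference, the example pattern above can be approximated without any
new regex syntax by classifying characters with unicodedata. This is a
sketch under assumptions: tokenize, is_word_char and is_currency are
hypothetical names, "ANY_LANGUAGE_UPPERCASE/LOWERCASE_LETTER" is read as
the general categories Lu and Ll, and "CURRENCY" as the category Sc.

```python
import unicodedata


def is_word_char(ch):
    # Uppercase letter, lowercase letter, or ASCII digit 0-9.
    return ch in "0123456789" or unicodedata.category(ch) in ("Lu", "Ll")


def is_currency(ch):
    # Sc is the Unicode "Symbol, currency" general category.
    return unicodedata.category(ch) == "Sc"


def tokenize(data):
    """Split UTF-8 bytes the way findall would with the pattern above.

    Runs of word characters become one token; any other character that
    is not a currency symbol becomes a one-character token; currency
    symbols match neither alternative and are skipped.
    """
    text = data.decode("utf-8")
    tokens = []
    run = []
    for ch in text:
        if is_word_char(ch):
            run.append(ch)
            continue
        if run:
            tokens.append("".join(run))
            run = []
        if not is_currency(ch):
            tokens.append(ch)
    if run:
        tokens.append("".join(run))
    return tokens


tokens = tokenize("Цена 100€ ok".encode("utf-8"))
```

Decoding to a Unicode string first matters here: matching directly on the
UTF-8 bytes would treat each byte of a multi-byte character separately.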

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-07-25 09:45

Message:
Logged In: YES 
user_id=38388

Could you make your request a little more specific ?

We already have categories in the re module, so adding a
few more would be possible (patches are welcome !). However,
we do need to know why you need them and whether there are
other RE implementations that already have such special
matching characters, e.g. the Perl RE implementation.
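
As an illustration of the categories that already exist: \w, \d and \s
are Unicode-aware for str patterns in Python 3 (in Python 2 this needed
the re.UNICODE flag), and they can be combined to match "any letter".
The helper name letters below is hypothetical, and this idiom cannot
distinguish uppercase from lowercase, which is what the request is about.

```python
import re

# [^\W\d_] reads as: a word character that is neither a digit nor "_",
# which leaves letters of any script.
LETTERS = re.compile(r"[^\W\d_]+")


def letters(text):
    """Return the runs of letters found in text."""
    return LETTERS.findall(text)


words = letters("naïve café 42 foo_bar")
```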

----------------------------------------------------------------------
