[New-bugs-announce] [issue6561] Regex '\d' should not match unicode category 'No'.

Mark Dickinson report at bugs.python.org
Fri Jul 24 12:48:01 CEST 2009


New submission from Mark Dickinson <dickinsm at gmail.com>:

In Python 3, or in Python 2 with the re.UNICODE flag, it appears that 
the regex r'\d' matches all unicode characters with category either 'Nd' 
(Number, Decimal Digit) or 'No' (Number, Other), but not characters in 
category 'Nl' (Number, Letter):

Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>
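
To see the whole picture on a given build, and not just this one character, one could tally the general category of every code point that r'\d' matches. This is only a rough sketch: the helper name categories_matched_by is made up for illustration, and the result depends on the interpreter version and the Unicode database it ships with.

```python
import re
import sys
import unicodedata
from collections import Counter


def categories_matched_by(pattern):
    """Count, per Unicode general category, the code points matched by pattern.

    Hypothetical helper for illustration only; not part of the re module.
    """
    regex = re.compile(pattern)
    counts = Counter()
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        if regex.match(char):
            counts[unicodedata.category(char)] += 1
    return counts


if __name__ == "__main__":
    # On a build showing the behaviour reported here, both 'Nd' and 'No'
    # would be expected among the keys; 'Nl' should be absent either way.
    print(categories_matched_by(r'\d'))
```

The scan visits every code point once, so it takes a moment to run.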

I believe (but am not 100% sure) that r'\d' should only match characters 
in category 'Nd'.  To back up this belief:

(1) int and float currently accept characters in category 'Nd' but not 
'No'; it would seem useful for '\d' to match those characters that are 
accepted by int, so that e.g., something matched with '\d+' could be 
directly passed to int.  (This came up in a #python-dev discussion
about whether the Decimal type should accept other Unicode digits;
that's a separate issue, though.)

(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches 
only characters in category 'Nd'.

(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at 
http://unicode.org/unicode/reports/tr18/ recommends that '\d' should 
correspond to \p{gc=Decimal_Number}.
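
Point (1) can be checked directly with a few sample characters, one ASCII digit and one character each from 'Nd', 'No' and 'Nl'. A minimal sketch follows; int_accepts is a hypothetical helper written for this illustration, and the sample characters are my own choice apart from '\u2781' from the session above.

```python
import unicodedata


def int_accepts(char):
    """Return True if int() can parse the single character char."""
    try:
        int(char)
    except ValueError:
        return False
    return True


# ASCII digit, ARABIC-INDIC DIGIT THREE, the dingbat from the report,
# and ROMAN NUMERAL EIGHT.
for char in ('7', '\u0663', '\u2781', '\u2167'):
    print(unicodedata.category(char), unicodedata.name(char), int_accepts(char))
```

Only the two 'Nd' characters should be reported as accepted by int.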

Marc-André, do you have any opinion on this?

It's probably slightly dangerous to change this in 2.6 or 3.1, so I'm 
proposing that '\d' should be modified to accept only characters of 
category 'Nd' in 2.7 and 3.2.
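
For code that needs the proposed behaviour regardless of what '\d' does on a particular version, the category test can be applied by hand. The following is a sketch of that idea only, not of how the re module would implement the change; is_decimal_digit and decimal_runs are hypothetical names.

```python
import unicodedata
from itertools import groupby


def is_decimal_digit(char):
    """Return True only for characters in category 'Nd'."""
    return unicodedata.category(char) == 'Nd'


def decimal_runs(text):
    r"""Return the maximal runs of 'Nd' characters in text.

    Roughly what findall(r'\d+') would give if '\d' meant 'Nd' only.
    """
    return [''.join(group)
            for is_digit, group in groupby(text, key=is_decimal_digit)
            if is_digit]


# Every run consists solely of 'Nd' characters, so int() can parse it.
print([int(run) for run in decimal_runs('ab12\u2781\u0663\u0664x')])
```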

(Thanks Ezio Melotti for finding all the references above and doing Perl 
testing!)

----------
components: Extension Modules
messages: 90878
nosy: ezio.melotti, lemburg, marketdickinson
severity: normal
stage: test needed
status: open
title: Regex '\d' should not match unicode category 'No'.
type: behavior
versions: Python 2.7, Python 3.2

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6561>
_______________________________________

