[Python-3000] Regular expressions, py3k and unicode

Antoine Pitrou solipsis at pitrou.net
Sun Jun 29 16:15:14 CEST 2008


Mark Dickinson <dickinsm <at> gmail.com> writes:
> 
> Is there a quick way to convert a general Unicode digit to its
> ascii equivalent?  Having to run str(int(c)) on each numeric character
> sounds painful, and the Decimal constructor doesn't need to
> be any slower right now.

In C it looks like PyUnicode_EncodeDecimal() does the trick (it's used by float
and int conversion functions). What is the status of C-accelerated Decimal?
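
To illustrate what that conversion buys at the Python level: as far as I know, int() and float() on a py3k str accept any Unicode decimal digit, not only 0-9. A small sketch (the sample string is only an example, not taken from Decimal):

```python
# Arabic-Indic digits one, two, three (U+0661..U+0663), chosen only as a sample.
s = "\u0661\u0662\u0663"

# Both built-in conversions accept non-ASCII decimal digits in a str.
as_int = int(s)
as_float = float(s)

print(as_int, as_float)
```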

In plain Python I don't know. Perhaps you could keep the fast path for ASCII
strings and have a slow fallback for Unicode digits, or suggest exporting the
above C function as a str method.
(Or perhaps simply disallow non-ASCII digits by using [0-9] instead of
\d. I'm not sure anybody really cares.)
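
A rough sketch of both options, to make the trade-off concrete. The names to_ascii_digits, ASCII_NUMBER and UNICODE_NUMBER are made up for this example and are not part of Decimal; str.isascii() is assumed to be available (it is only in much later Python 3 releases), and the real Decimal parser is certainly more involved than this.

```python
import re
import unicodedata


def to_ascii_digits(s):
    """Return s with each Unicode decimal digit replaced by its ASCII digit."""
    # Fast path: an all-ASCII string needs no translation at all.
    if s.isascii():
        return s
    # Slow fallback: look up the decimal value of each character and keep
    # characters that are not decimal digits (signs, '.', 'e') unchanged.
    out = []
    for c in s:
        d = unicodedata.decimal(c, None)
        out.append(c if d is None else str(d))
    return "".join(out)


# The stricter alternative: [0-9] accepts ASCII digits only, whereas \d in a
# str pattern also matches other Unicode decimal digits.
ASCII_NUMBER = re.compile(r"[0-9]+\Z")
UNICODE_NUMBER = re.compile(r"\d+\Z")
```

With the first approach the cost is only paid by strings that actually contain non-ASCII characters; with the second, such strings are simply rejected by the pattern.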

> I'm more worried, perhaps
> needlessly, about what other unidentified problems might be
> lurking deep in the standard library.  Any use of '\d', '\w', '\s', etc.
> might potentially be a problem.

Yes, we should scan the standard library for this kind of pattern and
try to find out where it might cause a problem.
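
Such a scan could start as something like the following. This is only a sketch: scan_source and find_class_escapes are hypothetical helper names, and the check is purely textual, so it will also flag backslash sequences that have nothing to do with regular expressions. Every hit still needs to be reviewed by hand.

```python
import re
from pathlib import Path

# A literal backslash followed by d, w or s (either case) in source text.
CLASS_ESCAPE = re.compile(r"\\[dwsDWS]")


def scan_source(text):
    """Return (line_number, stripped_line) pairs for lines using such escapes."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(text.splitlines(), 1)
        if CLASS_ESCAPE.search(line)
    ]


def find_class_escapes(root):
    """Yield (path, line_number, line) for every hit in *.py files under root."""
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for lineno, line in scan_source(text):
            yield path, lineno, line
```

Pointing find_class_escapes at a checkout's Lib directory would give a first list of candidates to look at.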




