[Python-3000] Regular expressions, py3k and unicode
Antoine Pitrou
solipsis at pitrou.net
Sun Jun 29 16:15:14 CEST 2008
Mark Dickinson <dickinsm <at> gmail.com> writes:
>
> Is there a quick way to convert a general Unicode digit to its
> ascii equivalent? Having to run str(int(c)) on each numeric character
> sounds painful, and the Decimal constructor doesn't need to
> be any slower right now.
In C it looks like PyUnicode_EncodeDecimal() does the trick (it's used by float
and int conversion functions). What is the status of C-accelerated Decimal?
In plain Python I don't know; perhaps you could keep the fast path for ASCII
strings and have a slow fallback for Unicode digits. Or you could suggest
exporting the above C function as a str method.
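A minimal sketch of the fast-path/fallback idea in plain Python follows. The helper name to_ascii_digits is hypothetical, and this is only one possible approach, not what Decimal actually does: pure-ASCII input is returned untouched, and any other string is rewritten character by character with unicodedata.decimal().

```python
import unicodedata


def to_ascii_digits(s):
    """Return s with every Unicode decimal digit replaced by its ASCII digit.

    Hypothetical helper, for illustration only.
    """
    # Fast path: a pure-ASCII string cannot contain non-ASCII digits.
    try:
        s.encode("ascii")
        return s
    except UnicodeEncodeError:
        pass

    # Slow fallback: map each decimal digit to '0'..'9', keep the rest as-is.
    out = []
    for c in s:
        value = unicodedata.decimal(c, None)  # None if c is not a decimal digit
        out.append(c if value is None else str(value))
    return "".join(out)
```

With this, the per-character cost is only paid by strings that actually contain non-ASCII characters.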
(Or perhaps simply disallow non-ASCII digits by using [0-9] instead of
\d. I'm not sure anybody really cares.)
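To illustrate the difference, here is a small sketch, assuming a Python 3 interpreter where patterns are str objects. The re.ASCII flag shown as a third option is not mentioned in this message; it is included only as another way to get ASCII-only matching.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit, but not an ASCII one.
sample = "\u0663"

unicode_digit = re.compile(r"\d")         # any Unicode decimal digit
ascii_class = re.compile(r"[0-9]")        # ASCII digits only
ascii_flag = re.compile(r"\d", re.ASCII)  # \d restricted to [0-9]

matches_unicode = unicode_digit.fullmatch(sample) is not None
matches_class = ascii_class.fullmatch(sample) is not None
matches_flag = ascii_flag.fullmatch(sample) is not None
```

Only the first pattern accepts the sample; the other two reject it while still accepting "3".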
> I'm more worried, perhaps
> needlessly, about what other unidentified problems might be
> lurking deep in the standard library. Any use of '\d', '\w', '\s', etc.
> might potentially be a problem.
Yes, we should do a scan of the standard library for this kind of pattern and
try to find out where there might be a problem.
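Such a scan could start from something as crude as the sketch below. The function name find_class_escapes is hypothetical, and the check is a plain textual heuristic: it flags any line containing \d, \w or \s (or the uppercase forms), so it will also report hits in comments and in strings that are not regular expressions. Each hit still needs a manual review.

```python
import re

# Backslash followed by one of the character-class letters.
CLASS_ESCAPE = re.compile(r"\\[dwsDWS]")


def find_class_escapes(source):
    """Return (line number, stripped line) pairs that mention \\d, \\w or \\s.

    Hypothetical helper: a textual heuristic, not a real parser.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if CLASS_ESCAPE.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

To cover a source tree, one could read each .py file and pass its text to this function, then inspect the reported lines by hand.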