[Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky
alexander.belopolsky at gmail.com
Tue Nov 30 17:40:19 CET 2010
On Mon, Nov 29, 2010 at 2:38 PM, Alexander Belopolsky
<alexander.belopolsky at gmail.com> wrote:
..
>> Still, if it's not detrimental and it's not difficult to support,
>> then why do you care?
>
> It is difficult to support. A fix for issue10557 would be much
> simpler if we did not support non-European digits. I now added a
> patch that handles non-ascii digits, so you can see what's involved.
> Note that when the Unicode Consortium inevitably adds more Nd characters
> to the non-BMP planes, we will have to add surrogate-pair support to
> this code.
>
It turns out that this did in fact happen:
# Newly assigned in Unicode 3.1.0 (March, 2001)
..
1D7CE..1D7FF ; 3.1 # [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE
See http://unicode.org/Public/UNIDATA/DerivedAge.txt
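The range can be cross-checked against the database bundled with the interpreter. This is only a minimal sketch, assuming Python 3.3 or later (where a non-BMP character is a single code point); the names FIRST, LAST and count_math_digits are made up for illustration.

```python
import unicodedata

FIRST, LAST = 0x1D7CE, 0x1D7FF  # range quoted from DerivedAge.txt

def count_math_digits():
    """Count the code points in FIRST..LAST whose category is Nd."""
    count = 0
    for cp in range(FIRST, LAST + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd":
            # The block holds five styles of ten digits each,
            # so the digit values cycle through 0..9.
            assert unicodedata.digit(ch) == (cp - FIRST) % 10
            count += 1
    return count

print(count_math_digits())
```

The printed count should agree with the [50] in the DerivedAge.txt entry.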
And of course,
>>> unicodedata.digit('\U0001D7CE')
0
but
>>> int('\U0001D7CE')
..
UnicodeEncodeError: 'decimal' codec can't encode character '\ud835' ..
on a narrow Unicode build. (Note the character reported in the error message!)
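The '\ud835' in that message is the high half of the UTF-16 surrogate pair that a narrow build uses to store U+1D7CE. A minimal sketch of the arithmetic, assuming Python 3.3 or later so that ord() sees the whole code point (to_surrogate_pair is a hypothetical helper, not part of any library):

```python
def to_surrogate_pair(ch):
    """Return the (high, low) UTF-16 surrogates for a non-BMP character."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("BMP character; no surrogate pair needed")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogate_pair("\U0001D7CE")
print(hex(high), hex(low))  # 0xd835 0xdfce
```

The decimal codec on a narrow build therefore sees two code units, neither of which is a digit on its own, and reports the first one.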
If you think non-ASCII digits are not difficult to support, please
contribute to the following tracker issues:
http://bugs.python.org/issue10581
(Review and document string format accepted in numeric data type constructors)
http://bugs.python.org/issue10557
(Malformed error message from float())
http://bugs.python.org/issue10435
(Document unicode C-API in reST - Specifically, PyUnicode_EncodeDecimal)
http://bugs.python.org/issue8646
(PyUnicode_EncodeDecimal is undocumented)
http://bugs.python.org/issue6632
(Include more fullwidth chars in the decimal codec)
and back to the issue of user confusion
http://bugs.python.org/issue652104 [closed/invalid]
(int(u"\u1234") raises UnicodeEncodeError; reported by Guido van Rossum)