Python & Unicode decimal interpretation

Scott David Daniels scott.daniels at acm.org
Sat Dec 3 17:14:24 CET 2005


Martin v. Löwis wrote:
> Scott David Daniels wrote:
>> In reading over the source for CPython's PyUnicode_EncodeDecimal,
>> I see a dance to handle characters which are neither dec-equiv nor
>> in Latin-1.  Does anyone know about the intent of such a conversion?
> 
> To support this:
> 
>  >>> int(u"\N{DEVANAGARI DIGIT SEVEN}")
> 7
OK, that much I have handled.  I am fiddling with direct-to-number
conversions and wondering about cases like:
    >>> int(u"\N{DEVANAGARI DIGIT SEVEN}" + XXX
            + u"\N{DEVANAGARI DIGIT SEVEN}")

where XXX does not pass the digit test, and so must either:
     (A) be dropped, giving a result of 77
or   (B) get translated (e.g. to u'234'), giving 72347
or   (C) get translated (to u'2' + YYY + u'4'), where YYY will
         require further handling ...

I don't really understand how the "ignore" or "something_else"
error-handling cases could be triggered from Python source (that is,
where they would come from).  Are they only there for access from C
programs?

> In the "ignore" case, no output is produced at all, for the unencodable
> character; this is the same way that '?' would be treated (it is
> also unencodable).
If I understand you correctly, I can consider the digit stream to end
as soon as I hit a non-digit (apart from the letters that serve as
digits in bases 11-36).
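
A minimal sketch of that stop-at-the-first-non-digit rule for base 10,
assuming unicodedata.decimal() is an acceptable digit test
(leading_decimal_value is a hypothetical name, and bases 11-36 are left out):

```python
import unicodedata

def leading_decimal_value(text):
    """Accumulate decimal digits until the first non-digit.

    Returns (value, count), where count is the number of characters consumed.
    """
    value = 0
    consumed = 0
    for ch in text:
        # decimal() returns the supplied default for non-decimal characters.
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            break
        value = value * 10 + digit
        consumed += 1
    return value, consumed

SEVEN = u"\N{DEVANAGARI DIGIT SEVEN}"
whole = leading_decimal_value(SEVEN + SEVEN)
partial = leading_decimal_value(SEVEN + u"x" + SEVEN)
```

Returning the consumed count lets the caller decide whether trailing
characters are an error or are simply left for someone else to handle.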

> In the something_else case, a user-defined exception handler could
> treat the error in any way it liked, e.g. encoding all letters
> u'A' to digit '0'. This might be different from the way this error
> handler would treat '?'.
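
To make the two cases concrete, here is a sketch using codecs.register_error.
I am not aware of a way to pass an errors argument through int(), so this
encodes to ASCII with unicode.encode as a stand-in for the decimal encoder;
the handler name "letters_to_zero" and the sample text are made up for
illustration:

```python
import codecs

def letters_to_zero(exc):
    """Hypothetical handler: encode each unencodable letter as '0'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Letters become '0'; any other unencodable character is dropped.
    replacement = u"".join(u"0" if ch.isalpha() else u"" for ch in bad)
    # Resume encoding just past the unencodable run.
    return replacement, exc.end

codecs.register_error("letters_to_zero", letters_to_zero)

text = u"7\N{GREEK CAPITAL LETTER ALPHA}7"
ignored = text.encode("ascii", "ignore")          # no output for the letter
zeroed = text.encode("ascii", "letters_to_zero")  # letter encoded as '0'
```

With "ignore" the surviving digits read as 77, which is case (A) above; with
the custom handler they read as 707, which is closer to case (B).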

--Scott David Daniels
scott.daniels at acm.org