Python & Unicode decimal interpretation
"Martin v. Löwis"
martin at v.loewis.de
Sat Dec 3 13:31:53 EST 2005
Scott David Daniels wrote:
>> >>> int(u"\N{DEVANAGARI DIGIT SEVEN}")
>> 7
>
> OK, That much I have handled. I am fiddling with direct-to-number
> conversions and wondering about cases like
> >>> int(u"\N{DEVANAGARI DIGIT SEVEN}" + XXX
> + u"\N{DEVANAGARI DIGIT SEVEN}")
int() passes NULL as the error mode, which is equivalent to "strict". So
if the string contains an unencodable character, you get a UnicodeError.
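A minimal sketch of that strict behaviour as seen from Python code. It is written for a current Python 3 interpreter, where the failure surfaces as a plain ValueError; UnicodeError is itself a subclass of ValueError, so the same except clause should cover both. The helper name parse_or_none is made up for this illustration.

```python
# Hypothetical helper, for illustration only: returns None when int()
# rejects the string instead of letting the exception propagate.
def parse_or_none(text):
    try:
        return int(text)
    except ValueError:  # UnicodeError is a subclass of ValueError
        return None

seven = "\N{DEVANAGARI DIGIT SEVEN}"

# Two Devanagari digits are accepted and read as a decimal number.
print(parse_or_none(seven + seven))

# A character that is neither a decimal digit nor Latin-1 (here the
# euro sign) in the middle makes the whole conversion fail.
print(parse_or_none(seven + "\N{EURO SIGN}" + seven))
```

There is no way to ask int() for "ignore" from Python code; the error mode is not exposed as an argument.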
> I don't really understand how the "ignore" or "something_else"
> cases get caused by python source [where they come from]. Are they
> only there for C-program access?
Neither. This code is dead.
>> In the "ignore" case, no output is produced at all, for the unencodable
>> character; this is the same way that '?' would be treated (it is
>> also unencodable).
>
> If I understand you correctly -- I can consider the digit stream to stop
> as soon as I hit a non-digit (except for handling bases 11-36).
No. In "ignore" mode, a codec doesn't stop at the unencodable character.
Instead, it skips it, continuing with the next character.
I mistakenly said that this would happen to '?' (question mark) also;
this is incorrect: PyUnicode_EncodeDecimal copies all Latin-1 characters
to the output, latin-1-encoded. So '?' would appear in the output,
even in "ignore" mode.
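To make the difference concrete, here is a rough pure-Python model of the behaviour described above. It is not the C implementation: the function name encode_decimal_model is invented, only the "strict" and "ignore" modes are modelled, and the use of unicodedata.decimal to recognise digits is an assumption of this sketch.

```python
import unicodedata

def encode_decimal_model(text, errors="strict"):
    """Rough model (not the real C code) of the decimal encoding step."""
    out = bytearray()
    for i, ch in enumerate(text):
        digit = unicodedata.decimal(ch, None)
        if digit is not None:
            # Any Unicode decimal digit becomes the matching ASCII digit.
            out.append(ord("0") + digit)
        elif ord(ch) < 256:
            # Latin-1 characters, including '?', are copied through.
            out.append(ord(ch))
        elif errors == "ignore":
            # Skip the unencodable character and carry on with the next.
            continue
        else:
            raise UnicodeEncodeError(
                "decimal", text, i, i + 1, "invalid decimal Unicode string"
            )
    return bytes(out)

seven = "\N{DEVANAGARI DIGIT SEVEN}"
print(encode_decimal_model(seven + "?" + seven, "ignore"))
print(encode_decimal_model(seven + "\N{EURO SIGN}" + seven, "ignore"))
```

In this model the question mark survives in "ignore" mode because it is a Latin-1 character, while the euro sign is dropped and the digits on either side of it end up adjacent.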
Handling of bases is not done in the function at all. Instead, the
callers of PyUnicode_EncodeDecimal deal with number formats (base,
prefix, exponent syntax, etc.). They assume ASCII bytes.
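The same division of labour can be sketched in Python as two separate stages. The names to_ascii_digits and parse_int are hypothetical; stage one only normalises digits, and stage two is an ordinary ASCII number parser that alone knows about bases and exponents.

```python
import unicodedata

def to_ascii_digits(text):
    # Stage 1: replace each Unicode decimal digit with its ASCII
    # equivalent; every other character is passed through unchanged.
    return "".join(str(unicodedata.decimal(ch, ch)) for ch in text)

def parse_int(text, base=10):
    # Stage 2: the caller handles the number format on plain ASCII.
    return int(to_ascii_digits(text), base)

seven = "\N{DEVANAGARI DIGIT SEVEN}"
two = "\N{DEVANAGARI DIGIT TWO}"

print(parse_int(seven + "f", 16))             # hexadecimal digits "7f"
print(float(to_ascii_digits(seven + "e" + two)))  # exponent syntax "7e2"
```

Stage one never needs to know that "f" is a hexadecimal digit or that "e" introduces an exponent, which is why a direct-to-number conversion can treat base handling as a separate concern.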
Regards,
Martin