[issue6632] Include more fullwidth chars in the decimal codec

Marc-Andre Lemburg report at bugs.python.org
Thu Sep 24 10:40:15 CEST 2009


Marc-Andre Lemburg <mal at egenix.com> added the comment:

Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin at v.loewis.de> added the comment:
> 
>> The codec currently doesn't look at the base at all - and shouldn't
>> need to:
>>
>> It simply converts input characters that have a decimal digit value
>> associated with them, to the usual ASCII digits in preparation
>> for parsing them using the standard number parsing tools we have in
>> Python.
> 
> Right. And as such, it shouldn't stop with digit 9, but continue into
> digits a, b, c, and so on, as appropriate.

I don't think that's needed. The codec already passes those
through as-is.
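
To illustrate the behaviour described above, here is a minimal Python 3
sketch. It is not the codec's C implementation: the helper name
to_ascii_digits() is made up for this example, and unicodedata.decimal()
is used as a stand-in for the internal digit lookup.

```python
import unicodedata

def to_ascii_digits(text):
    """Map characters with a decimal digit value to ASCII digits.

    Hypothetical helper for illustration only. Characters without a
    decimal digit value (letters, signs, ...) are passed through as-is.
    """
    out = []
    for ch in text:
        # decimal() returns the default (None) when ch has no decimal value
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)

# Fullwidth digits one, two, three become plain ASCII digits
fullwidth = to_ascii_digits("\uff11\uff12\uff13")
# Devanagari digits one and two
devanagari = to_ascii_digits("\u0967\u0968")
# ASCII letters have no decimal digit value and are left untouched
letters = to_ascii_digits("ff")
```

The result can then be handed to int() with whatever base the caller
wants; the conversion step itself never needs to know the base.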

>> This is to support number representations using non-ASCII code
>> points for digits (e.g. Japanese or Sanskrit numbers)
> 
> Notice that it also supports bases other than 10:
> 
> 80
> 
> So calling it "decimal" is a misnomer.

Not really: _PyUnicode_ToDecimalDigit() is used for the
conversion and that API explicitly only returns integer
values for code points that map to the digits 0-9 - at
least that's how it was originally written (see the code
in Python 1.6 which makes this explicit).

If it returns values outside that range, that's a bug
and needs to be fixed, since it would cause the codec
to fail. It is designed to only work on digits, not
arbitrary decimals.
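
One way to check that claim from Python, as a sketch: the loop below
walks all code points and collects any whose decimal value falls outside
0-9. It uses unicodedata.decimal() on the assumption that it reflects
the same data as the C-level API; the function name is made up for this
example.

```python
import sys
import unicodedata

def decimal_values_out_of_range():
    """Return (code point, value) pairs whose decimal value is not 0-9.

    Hypothetical checker for illustration; an empty list means every
    code point with a decimal value maps into the range 0-9.
    """
    bad = []
    for cp in range(sys.maxunicode + 1):
        # None is returned for code points without a decimal value
        value = unicodedata.decimal(chr(cp), None)
        if value is not None and not 0 <= value <= 9:
            bad.append((cp, value))
    return bad

offenders = decimal_values_out_of_range()
```

A non-empty result would point to the kind of bug mentioned above.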

>> Also note that we already have a hex codec in Python 2.x
>> which converts between the hex representations of a string
>> and its regular form. This was removed in 3.x for some reason
>> I don't understand (probably just an oversight).
> 
> The hex codec doesn't have anything to do with number conversions,
> nor does it have anything to do with character encodings. Introducing
> it was a mistake in Python 2.x, which has been fixed in 3.x (by removing
> it and other similar "codecs", such as rot13).

That's your particular view of things. It's not mine and never
was the basis of the codec design.

Codecs in Python are open to work on arbitrary types and
it's entirely possible to have codecs that return the same type
as their input.
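
A small sketch of such a same-type codec, assuming a Python 3 version in
which the rot_13 codec can be reached through codecs.encode() and
codecs.decode() (str.encode() itself only produces bytes there):

```python
import codecs

# rot_13 maps text to text: input and output are both str
encoded = codecs.encode("Hello", "rot_13")
decoded = codecs.decode(encoded, "rot_13")

# The codec machinery does not force a str -> bytes conversion here
same_type = type(encoded) is type("Hello")
```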

The hex codec in Python 2.x is a very useful and handy
codec and it's used a lot.

It should be added back again - after all, even by your
restrictive view of codecs in Python only serving as a way to
do character encodings, it is a valid character encoding -
that of Latin-1 code points to a two-byte hex representation
and vice versa.
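
As a sketch of that mapping in Python 3 terms: the two helpers below are
made up for this example and use binascii rather than the codec itself,
so they only approximate what the 2.x hex codec did.

```python
import binascii

def latin1_to_hex(text):
    """Encode each Latin-1 code point as two hex characters.

    Hypothetical helper; raises UnicodeEncodeError for code points
    above U+00FF.
    """
    return binascii.hexlify(text.encode("latin-1")).decode("ascii")

def hex_to_latin1(hex_text):
    """Inverse of latin1_to_hex(): two hex characters per code point."""
    return binascii.unhexlify(hex_text.encode("ascii")).decode("latin-1")

hex_form = latin1_to_hex("Az")
round_trip = hex_to_latin1(latin1_to_hex("caf\xe9"))
```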

Just like rot-13 and most of the others that were apparently
removed (uu, base64, quoted-printable, zip, bz2).

BTW: I noticed that idna and punycode were not removed...
even though they fall into the same category as the hex
codec.

I guess we should have a discussion about this on python-dev.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6632>
_______________________________________

