Add a builtin method to 'int' for base/radix conversion

I believe int(s, base) needs an inverse function to allow string representation with different bases. An example use case is 'hashing' a counter like video ID's on youtube, you could use a regular int internally and publish a shorter base-62 id for links. This subject was discussed 2.5 years ago: http://mail.python.org/pipermail/python-dev/2006-January/059789.html I opened a feature request ticket: http://bugs.python.org/issue6783 Some of the questions that remain: 1. Whether this should be a method for int or a regular function in a standard library module like math. 2. What should the method/function be called? (base_convert, radix, etc) What do you guys think? RunThePun

How would it work for bases beyond 10? OK, so base-16 has a well known encoding, but what about base-100? Base-16 is case insensitive, but would Base-37 be case insensitive?… ISTM that the common encodings of bin, oct, and hex are already covered by built-in functions and for bases beyond that, you have decide what system to use for encoding symbols that are out of range. Do people really need trinary through septary that often? Is there a single, best-practices way of doing base-1013? Feels like YAGNI to me, but I could be wrong. — Carl Johnson

Doing some testing with int, I found: ValueError: int() base must be >= 2 and <= 36 That resolves a lot of questions (it has to be case insensitive), but takes away ability to do the motivating use case: writing out video ids in Base-62. —Carl On Sun, Aug 30, 2009 at 6:49 PM, Carl Johnson<cmjohnson.mailinglist@gmail.com> wrote:

On Sun, Aug 30, 2009 at 9:55 PM, Carl Johnson < cmjohnson.mailinglist@gmail.com> wrote:
If there are widely accepted and used non-controversial standard character -> digit mappings for bases greater than that I don't immediately see any reasons not to support them for ints but I do not think we should go beyond base 255. If someone wants base 256 they should use struct.pack() and beyond that I expect to find people fighting over encoding schemes. ;) Since this is the ideas list... Consider adding a digit character sequence or map passed to the conversion function defining which characters map to which digit number.

Carl Johnson wrote:
How would it work for bases beyond 10?
If you keep going through the alphabet you can get up to base 36. If anyone cares beyond that, it could take a string of characters to use for digits. With Unicode that would provide quite a wide range. :-) -- Greg

Yuvgoog Greenle wrote:
This has been coming up for years and always gets bogged down in a spelling argument (a method on int, a function in the math module and an update to the str.format mini language would be the current contenders). However, most of the actual real use cases for bases between 2 and 36 were dealt with by the addition of binary and octal output to string formatting so the impetus to do anything about it is now a lot lower. As far as bases between 37 and 62 go, that would involve first getting agreement on extending int() to handle those bases by allowing case sensitive digit parsing. Presumably that would use string lexical ordering so that int('a', 37) > int('A', 37) and int('b', 37) would raise an exception. That would only be intuitive to someone that knows how ASCII based alphanumeric ordering works though. Probably not an impossible sell, but you would want a compelling use case (converting long strings of digits into shorter link URLs is a pretty good one though - I've seen that in places other than youtube). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On 31 Aug 2009, at 15:00 , Nick Coghlan wrote: Yuvgoog Greenle wrote:
Though I'm not sure this is of much interest really: even Erlang (which provides pretty good base conversion tools: it supports literal integers of any base between 2 and 36) doesn't natively support bases beyond 36. A library would probably be better for those more conflictual (or less intuitive) ranges.

Masklinn wrote:
ASCII? Surely it should be Unicode! :-)
The default could cover only bases 2 to 36. Any base > 36 would require a user-supplied translation table.
It could permit a dict as the translation table when 'decoding' so that both 'A' and 'a' could be mapped to 10, if necessary.

Does anybody have any more use cases, ideas or suggestions? I'm getting the feeling this suggestion is +0 to most people and +1 for the rest. I'm pretty new to these mailing lists so does that mean a yes or a no? On Mon, Aug 31, 2009 at 6:47 PM, MRAB <python@mrabarnett.plus.com> wrote:

Yuvgoog Greenle wrote:
A generally lukewarm response means a maybe :) A positive response on python-ideas is still a maybe until the idea has subsequently also run the gauntlet of python-dev with actual code to back it up. In this case, the status quo is: str -> int (arbitrary base up to 36) via int() constructor (base "0" meaning Python literal format). int -> str via str() (for decimal output), hex(), oct(), bin() and string formatting So the currently unsupported use cases are limited to outputting numbers in bases between 3 and 36 that are not 8, 10 or 16. You're probably going to have a hard time convincing anyone that those additional use cases are worth putting much effort into supporting (and even then, they're probably better off as a 3rd party library that can add things like support for integers in bases up to 62). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Fri, Sep 11, 2009 at 4:16 PM, Yuvgoog Greenle <ubershmekel@gmail.com> wrote:
Just out of curiosity, I did a Google code search[*] for uses of the inverse operation: int(<some_string> ,n). I found a good handful of uses of int(s, 36), almost all apparently to do with turning integers into suitable id strings; there was also evidence that people have implemented the reverse 'integer -> base 36 string' conversion at least twice. I found no meaningful uses of any bases other than 2, 8, 10, 16, and 36. So the main use case seems to be serialization and deserialization of integers into some 'suitably nice' alphabet, and that alphabet is likely to be application-dependent. -0 for int.to_base(n) (2 <= n <= 36) or equivalent functionality in the core. +0 for a pair of library functions converting to and from base n, with explicitly given translation table. I agree with MRAB that an implicit digit set should only be allowed for 2 <= base <= 36, if at all. By the way, _PyLong_Format in Objects/longobject.c *does* contain code for general integer -> base b conversions, 2 <= b <= 36, but that code is currently unused (as far as I can tell). Mark [*] http://www.google.com/codesearch?hl=en&lr=&q=%5CWint%5Cs*%5C%28.*%5C%2C%5Cs*36%5Cs*%5C%29+lang%3Apython&sbtn=Search

Btw, when you say translation table, do you mean just a string? Because a translation table would need to be continuous from 0 to the base so a real dicitionary-esque table may be overkill. The only advantage of a table might be to convert certain digits into multiple bytes (some sort of ad-hoc unicode use case?). --yuv On Sun, Sep 13, 2009 at 9:06 PM, Mark Dickinson <dickinsm@gmail.com> wrote:

Yuvgoog Greenle wrote:
If the translation table is limited to a string, the function would be very limited. For example, it might be useful to use base-change function to convert between an IPv4 address and integer. The common representation of IPv4 address uses base-255 "number" (0.0.0.0-255.255.255.255)

On Mon, Sep 14, 2009 at 3:51 AM, Yuvgoog Greenle <ubershmekel@gmail.com> wrote:
Yes, sorry, I just meant a string (or possibly some other iterable of characters). Something like (3.x code): def encode_int(n, alphabet): if n < 0: raise ValueError("nonnegative integers only, please") base = len(alphabet) cs = [] while True: n, c = divmod(n, base) cs.append(alphabet[c]) if not n: break return ''.join(reversed(cs)) def decode_int(s, alphabet): base = len(alphabet) char_to_int = {c: i for i, c in enumerate(alphabet)} n = 0 for c in s: n = n * base + char_to_int[c] return n
This doesn't allow negative numbers. If negative numbers should be permitted, there are some decisions to be made there too. How are they represented? With a leading '-'? If so, then '-' should not be permitted in the alphabet. Should the negative sign character be user-configurable? One problem with allowing multi-character digits in encoding is that it complicates the decoding: parsing the digit string is no longer trivial. I don't see how to make this a viable option. I'm still only +0 (now leaning towards -0, having seen how easy this is to implement, and thinking about how much possible variation there might be in what's actually needed) on adding something like this. Mark

Mark Dickinson wrote:
I'd prefer the arguments to be: value, base, optional translation table. The translation table would default to 0-9, A-Z/a-z (when decoding, multiple characters could map to the same numeric value, eg 'A' => 10 and 'a' => 10, hence the ability to use a dict). The default translation table would work up to base 36; higher bases would raise a ValueError exception "translation table too small for base". Could a single translation table work both ways? A dict for decoding could contain {'A': 10, 'a': 10}, but how could you reverse that for encoding?

also, negative numbers aren't taboo. FTFY import string def mirror_dict(dict): # list because iteration doesn't support changing of the dict for key, val in list(dict.items()): dict[val] = key _lows = string.digits + string.ascii_lowercase default_alphabet = dict(zip(range(len(_lows)), _lows)) mirror_dict(default_alphabet) def encode_int(value, base, alphabet=default_alphabet): if value < 0: is_negative = True value = value*(-1) else: is_negative = False cs = [] while True: value, index = divmod(value, base) cs.insert(0, alphabet[index]) if not value: break if is_negative: cs.insert(0, '-') return ''.join(cs) def decode_int(str_num, base, alphabet=default_alphabet): char_to_int = {c: i for i, c in enumerate(alphabet)} value = 0 for c in str_num: value = value * base + char_to_int[c] return value alphabet = '1ilI|:' enc = encode_int(10**10, len(alphabet), alphabet) print('|IIli|l|ili||') print(enc) dec = decode_int(enc, len(alphabet), alphabet) print(dec) print(10000000000) print(encode_int(12345, 32)) print(encode_int(-12345, 32)) ----- The only problem with {'A': 10, 'a': 10} is that it's not reversible. If we wantted to encode, 10, what should be used, A or a? On Tue, Sep 15, 2009 at 7:48 PM, MRAB <python@mrabarnett.plus.com> wrote:

Excuse my 'obviously', let me clarify. The following alphabet is reversable: for encode: {0 : 'a', 1, 'b'} for encode/decode: {0: 'a', 1: 'b', 'a':0, 'b':1} You can have 'a' and 'A' coexist, but the encode will have only one option. for encode: :{0 : 'a', 1, 'b'} for encode/decode: {0: 'a', 1: 'b', 'a': 0, 'b': 1, 'A': 0, 'B': 1} So one could say the encoding alphabet is the canonical representation. The encoding alphabet is actually alphabet_dict[0],..., alphabet_dict[base - 1]. On Wed, Sep 16, 2009 at 10:34 AM, Arnaud Delobelle <arnodel@googlemail.com>wrote:

How would it work for bases beyond 10? OK, so base-16 has a well known encoding, but what about base-100? Base-16 is case insensitive, but would Base-37 be case insensitive?… ISTM that the common encodings of bin, oct, and hex are already covered by built-in functions and for bases beyond that, you have decide what system to use for encoding symbols that are out of range. Do people really need trinary through septary that often? Is there a single, best-practices way of doing base-1013? Feels like YAGNI to me, but I could be wrong. — Carl Johnson

Doing some testing with int, I found: ValueError: int() base must be >= 2 and <= 36 That resolves a lot of questions (it has to be case insensitive), but takes away ability to do the motivating use case: writing out video ids in Base-62. —Carl On Sun, Aug 30, 2009 at 6:49 PM, Carl Johnson<cmjohnson.mailinglist@gmail.com> wrote:

On Sun, Aug 30, 2009 at 9:55 PM, Carl Johnson < cmjohnson.mailinglist@gmail.com> wrote:
If there are widely accepted and used non-controversial standard character -> digit mappings for bases greater than that I don't immediately see any reasons not to support them for ints but I do not think we should go beyond base 255. If someone wants base 256 they should use struct.pack() and beyond that I expect to find people fighting over encoding schemes. ;) Since this is the ideas list... Consider adding a digit character sequence or map passed to the conversion function defining which characters map to which digit number.

Carl Johnson wrote:
How would it work for bases beyond 10?
If you keep going through the alphabet you can get up to base 36. If anyone cares beyond that, it could take a string of characters to use for digits. With Unicode that would provide quite a wide range. :-) -- Greg

Yuvgoog Greenle wrote:
This has been coming up for years and always gets bogged down in a spelling argument (a method on int, a function in the math module and an update to the str.format mini language would be the current contenders). However, most of the actual real use cases for bases between 2 and 36 were dealt with by the addition of binary and octal output to string formatting so the impetus to do anything about it is now a lot lower. As far as bases between 37 and 62 go, that would involve first getting agreement on extending int() to handle those bases by allowing case sensitive digit parsing. Presumably that would use string lexical ordering so that int('a', 37) > int('A', 37) and int('b', 37) would raise an exception. That would only be intuitive to someone that knows how ASCII based alphanumeric ordering works though. Probably not an impossible sell, but you would want a compelling use case (converting long strings of digits into shorter link URLs is a pretty good one though - I've seen that in places other than youtube). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On 31 Aug 2009, at 15:00 , Nick Coghlan wrote: Yuvgoog Greenle wrote:
Though I'm not sure this is of much interest really: even Erlang (which provides pretty good base conversion tools: it supports literal integers of any base between 2 and 36) doesn't natively support bases beyond 36. A library would probably be better for those more conflictual (or less intuitive) ranges.

Masklinn wrote:
ASCII? Surely it should be Unicode! :-)
The default could cover only bases 2 to 36. Any base > 36 would require a user-supplied translation table.
It could permit a dict as the translation table when 'decoding' so that both 'A' and 'a' could be mapped to 10, if necessary.

Does anybody have any more use cases, ideas or suggestions? I'm getting the feeling this suggestion is +0 to most people and +1 for the rest. I'm pretty new to these mailing lists so does that mean a yes or a no? On Mon, Aug 31, 2009 at 6:47 PM, MRAB <python@mrabarnett.plus.com> wrote:

Yuvgoog Greenle wrote:
A generally lukewarm response means a maybe :) A positive response on python-ideas is still a maybe until the idea has subsequently also run the gauntlet of python-dev with actual code to back it up. In this case, the status quo is: str -> int (arbitrary base up to 36) via int() constructor (base "0" meaning Python literal format). int -> str via str() (for decimal output), hex(), oct(), bin() and string formatting So the currently unsupported use cases are limited to outputting numbers in bases between 3 and 36 that are not 8, 10 or 16. You're probably going to have a hard time convincing anyone that those additional use cases are worth putting much effort into supporting (and even then, they're probably better off as a 3rd party library that can add things like support for integers in bases up to 62). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Fri, Sep 11, 2009 at 4:16 PM, Yuvgoog Greenle <ubershmekel@gmail.com> wrote:
Just out of curiosity, I did a Google code search[*] for uses of the inverse operation: int(<some_string> ,n). I found a good handful of uses of int(s, 36), almost all apparently to do with turning integers into suitable id strings; there was also evidence that people have implemented the reverse 'integer -> base 36 string' conversion at least twice. I found no meaningful uses of any bases other than 2, 8, 10, 16, and 36. So the main use case seems to be serialization and deserialization of integers into some 'suitably nice' alphabet, and that alphabet is likely to be application-dependent. -0 for int.to_base(n) (2 <= n <= 36) or equivalent functionality in the core. +0 for a pair of library functions converting to and from base n, with explicitly given translation table. I agree with MRAB that an implicit digit set should only be allowed for 2 <= base <= 36, if at all. By the way, _PyLong_Format in Objects/longobject.c *does* contain code for general integer -> base b conversions, 2 <= b <= 36, but that code is currently unused (as far as I can tell). Mark [*] http://www.google.com/codesearch?hl=en&lr=&q=%5CWint%5Cs*%5C%28.*%5C%2C%5Cs*36%5Cs*%5C%29+lang%3Apython&sbtn=Search

Btw, when you say translation table, do you mean just a string? Because a translation table would need to be continuous from 0 to the base so a real dicitionary-esque table may be overkill. The only advantage of a table might be to convert certain digits into multiple bytes (some sort of ad-hoc unicode use case?). --yuv On Sun, Sep 13, 2009 at 9:06 PM, Mark Dickinson <dickinsm@gmail.com> wrote:

Yuvgoog Greenle wrote:
If the translation table is limited to a string, the function would be very limited. For example, it might be useful to use base-change function to convert between an IPv4 address and integer. The common representation of IPv4 address uses base-255 "number" (0.0.0.0-255.255.255.255)

On Mon, Sep 14, 2009 at 3:51 AM, Yuvgoog Greenle <ubershmekel@gmail.com> wrote:
Yes, sorry, I just meant a string (or possibly some other iterable of characters). Something like (3.x code): def encode_int(n, alphabet): if n < 0: raise ValueError("nonnegative integers only, please") base = len(alphabet) cs = [] while True: n, c = divmod(n, base) cs.append(alphabet[c]) if not n: break return ''.join(reversed(cs)) def decode_int(s, alphabet): base = len(alphabet) char_to_int = {c: i for i, c in enumerate(alphabet)} n = 0 for c in s: n = n * base + char_to_int[c] return n
This doesn't allow negative numbers. If negative numbers should be permitted, there are some decisions to be made there too. How are they represented? With a leading '-'? If so, then '-' should not be permitted in the alphabet. Should the negative sign character be user-configurable? One problem with allowing multi-character digits in encoding is that it complicates the decoding: parsing the digit string is no longer trivial. I don't see how to make this a viable option. I'm still only +0 (now leaning towards -0, having seen how easy this is to implement, and thinking about how much possible variation there might be in what's actually needed) on adding something like this. Mark

Mark Dickinson wrote:
I'd prefer the arguments to be: value, base, optional translation table. The translation table would default to 0-9, A-Z/a-z (when decoding, multiple characters could map to the same numeric value, eg 'A' => 10 and 'a' => 10, hence the ability to use a dict). The default translation table would work up to base 36; higher bases would raise a ValueError exception "translation table too small for base". Could a single translation table work both ways? A dict for decoding could contain {'A': 10, 'a': 10}, but how could you reverse that for encoding?

also, negative numbers aren't taboo. FTFY import string def mirror_dict(dict): # list because iteration doesn't support changing of the dict for key, val in list(dict.items()): dict[val] = key _lows = string.digits + string.ascii_lowercase default_alphabet = dict(zip(range(len(_lows)), _lows)) mirror_dict(default_alphabet) def encode_int(value, base, alphabet=default_alphabet): if value < 0: is_negative = True value = value*(-1) else: is_negative = False cs = [] while True: value, index = divmod(value, base) cs.insert(0, alphabet[index]) if not value: break if is_negative: cs.insert(0, '-') return ''.join(cs) def decode_int(str_num, base, alphabet=default_alphabet): char_to_int = {c: i for i, c in enumerate(alphabet)} value = 0 for c in str_num: value = value * base + char_to_int[c] return value alphabet = '1ilI|:' enc = encode_int(10**10, len(alphabet), alphabet) print('|IIli|l|ili||') print(enc) dec = decode_int(enc, len(alphabet), alphabet) print(dec) print(10000000000) print(encode_int(12345, 32)) print(encode_int(-12345, 32)) ----- The only problem with {'A': 10, 'a': 10} is that it's not reversible. If we wantted to encode, 10, what should be used, A or a? On Tue, Sep 15, 2009 at 7:48 PM, MRAB <python@mrabarnett.plus.com> wrote:

Excuse my 'obviously', let me clarify. The following alphabet is reversable: for encode: {0 : 'a', 1, 'b'} for encode/decode: {0: 'a', 1: 'b', 'a':0, 'b':1} You can have 'a' and 'A' coexist, but the encode will have only one option. for encode: :{0 : 'a', 1, 'b'} for encode/decode: {0: 'a', 1: 'b', 'a': 0, 'b': 1, 'A': 0, 'B': 1} So one could say the encoding alphabet is the canonical representation. The encoding alphabet is actually alphabet_dict[0],..., alphabet_dict[base - 1]. On Wed, Sep 16, 2009 at 10:34 AM, Arnaud Delobelle <arnodel@googlemail.com>wrote:
participants (10)
-
Arnaud Delobelle
-
Carl Johnson
-
Greg Ewing
-
Gregory P. Smith
-
Lie Ryan
-
Mark Dickinson
-
Masklinn
-
MRAB
-
Nick Coghlan
-
Yuvgoog Greenle