int.to_base, int.from_base

Hey there,

The `int()` function allows you to specify a base to convert from, for example:
>>> int("foo", 26)
10788
Which is documented as:
The base defaults to 10. Valid bases are 0 and 2-36.
For other common bases, functions exist in the base64 module in the stdlib. I often need other bases, or bases with custom alphabets. Doing so involves a bit of code every time that I think could be generalized, so I'd like to propose an `int.to_base()` and an `int.from_base()`. These would not supersede or replace any current possibilities, but extend and simplify them.

The signature(s) I had in mind for now are akin to:
int.from_base(x, alphabet, padding_character)
and
int.to_base(alphabet, padding_character)
Has any discussion on this been had previously (I searched around a bit), and if not, would this make a decent PEP?

Regards,
Simon
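For illustration, here is a minimal sketch of what such a pair could do, written as free functions rather than `int` methods. The names `to_base` and `from_base`, the restriction to non-negative integers, and the error handling are assumptions made for this sketch; the `padding_character` argument is left out because the proposal does not spell out its semantics.

```python
def to_base(n: int, alphabet: str) -> str:
    """Render a non-negative integer using the characters of `alphabet` as digits."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    base = len(alphabet)
    if base < 2 or len(set(alphabet)) != base:
        raise ValueError("alphabet needs at least two distinct characters")
    digits = []
    while True:
        n, remainder = divmod(n, base)
        digits.append(alphabet[remainder])  # least significant digit first
        if n == 0:
            break
    return "".join(reversed(digits))


def from_base(text: str, alphabet: str) -> int:
    """Inverse of to_base: interpret `text` as digits drawn from `alphabet`."""
    if not text:
        raise ValueError("empty string is not a number")
    values = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in text:
        if ch not in values:
            raise ValueError(f"invalid digit {ch!r}")
        n = n * len(alphabet) + values[ch]
    return n
```

With the alphabet that `int()` itself uses for base 26 (`0-9` followed by `a-p`), `from_base("foo", ...)` agrees with `int("foo", 26)`.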

On Mon, 2 May 2022 at 14:49, Simon de Vlieger <cmdr@supakeen.com> wrote:
Let's not go as far as a PEP yet, and figure out a couple of things:

1) What's it like using existing tools?
2) How common is it to need something that's really clunky with existing tools?

The "alternate alphabet" case can be done by base converting and then replacing on the string. It's not the smoothest, so that counts as a bit of clunkiness; but it's also not all THAT common (I can recall doing it for SteamGuard 2FA codes, which are base 26 but avoid confusable digit/letter pairs, and that's about it).

When you say "other bases", do you mean beyond base 36? Do you have use-cases for anything >36 that isn't 64, 85, or 256? If so, how do you currently do this?

The CPython integer type is implemented in C for performance. If that's not a consideration, maybe this would be better done in the base64 module (which is where base 85 also lives), as a general tool for arbitrary ASCIIfication.

Can you link to your codebase where you 'often' do these kinds of conversions? Is it in a performance-critical area?

ChrisA
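The "convert, then replace on the string" approach can be sketched roughly as follows for the decoding direction. The 26-character alphabet below and the helper name `decode26` are illustrative assumptions, not taken from any particular codebase.

```python
import string

# Digits that int(s, 26) understands: 0-9 followed by a-p.
INT_DIGITS = (string.digits + string.ascii_lowercase)[:26]

# Hypothetical custom alphabet that skips easily confused characters.
CUSTOM = "23456789BCDFGHJKMNPQRTVWXY"

# Translation table from the custom digits to int()'s digits.
TO_INT_DIGITS = str.maketrans(CUSTOM, INT_DIGITS)


def decode26(code: str) -> int:
    """Map custom digits onto int()'s alphabet, then let int() do the math."""
    return int(code.translate(TO_INT_DIGITS), 26)
```

The reverse direction is where the clunkiness shows: there is no built-in that renders an integer in base 26, so the digits have to be produced by hand before they can be translated.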

On Mon, May 2, 2022, at 7:03 AM, Chris Angelico wrote:
I've mostly resorted to using str.maketrans and .replace as well.
Some examples I've encountered over the past year are Base58, as used in Bitcoin [1], Base45 [2], and Base91. My experience is likely skewed as I do take part in CTFs where obscureness is often part of the deal.
For my use cases it hasn't been especially performance critical. The base64 module might be a good place for this to live instead of the integer type. Perhaps the base64 module is in fact a better place, as converting to bytes is likely what's wanted instead of going to/from an integer first.
> Can you link to your codebase where you 'often' do these kinds of conversions? Is it in a performance-critical area?
I can't, but it hasn't been performance critical.

Regards,
Simon

[1] https://tools.ietf.org/id/draft-msporny-base58-01.html
[2] https://datatracker.ietf.org/doc/draft-faltstrom-base45/
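As a rough illustration of the bytes-oriented view, a Base58 encoder and decoder can be sketched as below. The alphabet is the one commonly used by Bitcoin; the function names `b58encode` and `b58decode` are hypothetical and do not exist in the base64 module. Leading zero bytes need special handling because they vanish when the bytes are read as an integer.

```python
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as Base58, keeping leading zero bytes as leading '1's."""
    n = int.from_bytes(data, "big")
    out = []
    while n:
        n, remainder = divmod(n, 58)
        out.append(B58_ALPHABET[remainder])
    # Each leading zero byte is represented by the first alphabet character.
    pad = len(data) - len(data.lstrip(b"\0"))
    return B58_ALPHABET[0] * pad + "".join(reversed(out))


def b58decode(text: str) -> bytes:
    """Decode Base58 text back to bytes (raises ValueError on bad characters)."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip(B58_ALPHABET[0]))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\0" * pad + body
```

Since the arithmetic is done on one large integer, this sketch is quadratic in the input length, which matters little for short identifiers but would for large payloads.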

On Mon, 2 May 2022 at 16:37, Simon de Vlieger <cmdr@supakeen.com> wrote:
As far as I can tell, these are all separate algorithms, and they don't really generalise well. Knowing all of the ones mentioned (45, 58, 64, 85, 91, 256), you still wouldn't be able to synthesize a (say) Base 73 encoding. So that suggests to me that these belong (if anywhere) in the base64 module, or perhaps in the codecs module (you can find base64 itself there as well).
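Base45 shows the difference: as far as I understand the Base45 specification, it does not treat the input as one big number but works on two-byte chunks, emitting three characters per chunk with the least significant digit first. A sketch under that assumption (the name `b45encode` is made up for this example):

```python
B45_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def b45encode(data: bytes) -> str:
    """Encode bytes two at a time; a trailing single byte yields two characters."""
    out = []
    for i in range(0, len(data), 2):
        chunk = data[i:i + 2]
        n = int.from_bytes(chunk, "big")
        n, c = divmod(n, 45)   # c: least significant base-45 digit
        e, d = divmod(n, 45)   # d: middle digit, e: most significant digit
        out.append(B45_ALPHABET[c] + B45_ALPHABET[d])
        if len(chunk) == 2:
            out.append(B45_ALPHABET[e])
    return "".join(out)
```

A generic "integer to base N with alphabet" helper would produce different output for the same input, which is why each of these encodings tends to need its own function.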
My experience is likely skewed as I do take part in CTFs where obscureness is often part of the deal.
I'm not familiar with the term CTF in this context; my brain assumes Capture The Flag, but maybe that's not it?
Yeah, that's the other reason - those kinds of encodings are often used for representing long strings, not numbers.
Cool. Then I would be inclined to push forward with this as additional functions in the base64 module, particularly when they have well-known use-cases (you mentioned Bitcoin for Base58; it would help if you can cite others).

ChrisA

On Mon, 2 May 2022 at 16:46, Serhiy Storchaka <storchaka@gmail.com> wrote:
I'm aware of PEP 313 for Roman, but not for the others. Was there a PEP when the int() constructor started to support other types of digits? I can't find one but it wouldn't surprise me.

In any case, I just said "yet" - there's no need to go straight to the PEP stage, even if it's necessary before the matter gets finally decided. (This was said with my PEP Editor hat on; every now and then, we get someone putting a PR on the peps repository to request a change in the language, and it's usually better to thrash things out on a mailing list first.)

ChrisA

On 02.05.2022 08:54, Chris Angelico wrote:
That was a consequence of PEP 100, the addition of Unicode to the language. There are now a lot more characters which represent digits than we had in the 8-bit world.

Just a word of warning: numeric bases are not necessarily the same as numeric encodings. The latter usually come with other formatting criteria in addition to representing numeric values, e.g. base64 is an encoding and not the same as representing numbers in base 64. We have the binascii module for the encodings.

-- Marc-Andre Lemburg
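A short illustration of the Unicode digit support mentioned above: `int()` accepts any characters that Unicode classifies as decimal digits, not only ASCII `0-9`. The variable names are just for this example.

```python
# ARABIC-INDIC DIGIT ONE, TWO, THREE (U+0661..U+0663)
arabic_indic = int("\u0661\u0662\u0663")

# FULLWIDTH DIGIT ONE, TWO, THREE (U+FF11..U+FF13)
fullwidth = int("\uff11\uff12\uff13")

print(arabic_indic, fullwidth)
```

Both calls give the same value as `int("123")`.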

On Mon, May 02, 2022 at 09:58:35AM +0200, Marc-Andre Lemburg wrote:
Correct. base64 is for encoding byte-strings, not numbers:
>>> binascii.hexlify(b"Hello world")
b'48656c6c6f20776f726c64'
Of course we can treat any byte string as a base-256 number, in which case "Hello world" has the value 87521618088882671231069284. There's no obvious collation/alphabet to use for base 64, but if we use (say) ASCII digits + uppercase + lowercase + "!@" then that "Hello world" number 875...284 above is:

4XbR6nl87TlScna (in base 64)

which is completely different from the base64 encoding.

By the way, in base 64 that "Hello world" number has:

* digital sum of 445;
* digital root of 4, with persistence of 3;
* digital product of 261040984907288205312;
* zero-free digital product root of 48, with persistence of 7.

There is absolutely no significance to any of this. I'm just geeking out :-)

-- Steve
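The contrast between "number in base 64" and "base64 encoding" can be reproduced with a few lines, assuming the digit alphabet described above (ASCII digits, then uppercase, then lowercase, then "!@"). The variable names are chosen for this sketch only.

```python
import base64
import string

data = b"Hello world"

# Treat the bytes as one big base-256 number.
n = int.from_bytes(data, "big")

# Digit alphabet for base 64: 0-9, A-Z, a-z, then "!" and "@".
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "!@"

# Peel off base-64 digits, least significant first, then reverse.
digits = []
m = n
while m:
    m, remainder = divmod(m, 64)
    digits.append(remainder)
digits.reverse()

as_number = "".join(ALPHABET[d] for d in digits)      # positional base 64
as_encoding = base64.b64encode(data).decode("ascii")  # the base64 encoding
digit_sum = sum(digits)

print(as_number)    # 4XbR6nl87TlScna
print(as_encoding)  # SGVsbG8gd29ybGQ=
print(digit_sum)    # 445
```

The two strings differ for two reasons: the alphabets are ordered differently, and base64 groups bits starting from the first byte and pads the end, while the positional conversion effectively pads at the front.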

participants (5)
- Chris Angelico
- Marc-Andre Lemburg
- Serhiy Storchaka
- Simon de Vlieger
- Steven D'Aprano