C API for converting Python integers to/from byte sequences
Python integers have arbitrary precision. For serialization and interoperation with other programs and libraries we need to represent them as fixed-width integers (little- and big-endian, signed and unsigned). In Python, we can use struct, array, memoryview and ctypes for some standard sizes, and the int methods int.to_bytes and int.from_bytes for non-standard sizes. In C, there is a C API for converting to/from the C types long, unsigned long, long long and unsigned long long. For other C types (signed and unsigned char, short, int) we need to use the C API for converting to long, and then truncate to the destination type with a check for overflow. For integer type aliases like pid_t we need to determine their size and signedness and use the corresponding C API or a wrapper. For non-standard integers (e.g. 24-bit), integers wider than long long, and arbitrary precision integers, everything is much more complicated. There are private C API functions _PyLong_AsByteArray and _PyLong_FromByteArray, but they are for internal use only.

I am planning to add public analogs of these private functions, but more powerful and convenient.

PyObject *PyLong_FromBytes(const void *buf, Py_ssize_t size, int byteorder, int signed)

Py_ssize_t PyLong_AsBytes(PyObject *o, void *buf, Py_ssize_t n, int byteorder, int signed, int *overflow)

PyLong_FromBytes() returns an int object. It only fails in case of a memory error or incorrect arguments (e.g. buf is NULL).

PyLong_AsBytes() writes bytes to the specified buffer; it does not allocate memory. If buf is NULL it returns the minimum size of the buffer needed to represent the integer. -1 is returned on error. If overflow is NULL, then OverflowError is raised; otherwise *overflow is set to +1 for overflowing the upper limit, -1 for overflowing the lower limit, and 0 for no overflow.

Now I have some design questions.

1. How to encode the byte order?

a) 1 -- little endian, 0 -- big endian
b) 0 -- little endian, 1 -- big endian
c) -1 -- little endian, +1 -- big endian, 0 -- native endian.

Do we need to reserve some values for mixed endians?

2. How to specify reduction modulo 2**(8*size) (as in PyLong_AsUnsignedLongMask)? Add one more flag to PyLong_AsBytes()? Use a special value for the signed argument: 0 -- unsigned, 1 -- signed, 2 (or -1) -- modulo? Or use some combination of signed and overflow?

3. How to specify saturation (as in PyNumber_AsSsize_t())? I.e. values less than the lower limit are replaced with the lower limit, and values greater than the upper limit are replaced with the upper limit. Same options as for (2): a separate flag, encoding it in signed (but we need two values here), or a combination of other parameters.

4. What exact names to use? PyLong_FromByteArray/PyLong_AsByteArray, PyLong_FromBytes/PyLong_AsBytes, or PyLong_FromBytes/PyLong_ToBytes?
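[Editor's note: to make the proposed semantics concrete, here is a minimal usage sketch. It assumes the draft signatures above and the +1 big / -1 little byte-order convention from question 1; none of these functions exist yet, and the names and parameters are exactly what is being discussed.]

    #include <Python.h>
    #include <stdint.h>

    /* Sketch only: PyLong_FromBytes/PyLong_AsBytes are the *proposed*
       functions.  Assume byteorder +1 == big endian, -1 == little endian. */
    static PyObject *
    roundtrip_example(void)
    {
        uint8_t in[4] = {0x12, 0x34, 0x56, 0x78};
        uint8_t out[4];

        /* Read a 4-byte big-endian unsigned value into a Python int. */
        PyObject *n = PyLong_FromBytes(in, sizeof(in), 1, 0);
        if (n == NULL) {
            return NULL;
        }
        /* Write it back as little endian; with a NULL overflow pointer,
           an OverflowError would be raised if the value did not fit. */
        if (PyLong_AsBytes(n, out, sizeof(out), -1, 0, NULL) < 0) {
            Py_DECREF(n);
            return NULL;
        }
        return n;  /* holds 0x12345678 */
    }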
Serhiy Storchaka writes:
Python integers have arbitrary precision. For serialization and interoperation with other programs and libraries we need to represent them [...]. [In the case of non-standard precisions,] [t]here are private C API functions _PyLong_AsByteArray and _PyLong_FromByteArray, but they are for internal use only.
I am planning to add public analogs of these private functions, but more powerful and convenient.
PyObject *PyLong_FromBytes(const void *buf, Py_ssize_t size, int byteorder, int signed)
Py_ssize_t PyLong_AsBytes(PyObject *o, void *buf, Py_ssize_t n, int byteorder, int signed, int *overflow)
I don't understand why such a complex API is useful as a public facility.

For example, I've often thought it would be amusing (and maybe useful) to have PyXEmacs, which would still be a Lisp application, but capable of delegating some computations to a subinterpreter. Now, XEmacs's native representations are none (i.e., overflow is signaled), GMP (GNU multiprecision), and MP (BSD multiprecision). XEmacs supports bigratios and bigfloats as well as bigints. So I might want PyLong_AsGMPInt and PyLong_AsGMPRatio as well as the corresponding functions for MP, and maybe even PyLong_AsGMPFloat.

The obvious way to write those is <library constructor>(str(python_integer)), I think. The same approach is easily generalized to Decimal and Fraction, although you might need to use format() instead of str(), and extract precision information to pass to the library constructor. How often would someone need something more performant?

I don't think PyLong_AsBytes would be useful to me, since I'll have to write pybytes_to_gmp_int etc., etc., anyway. I admit that this is both specific to me and at present imaginary :-), but it seems to me that most applications that can accept non-standard-precision integers will very likely be using such libraries.

In the unlikely event that an application needs to squeeze out that tiny bit of performance, I guess the library constructors all accept buffers of bytes, too, probably with a similarly complex API that can handle whatever the Python ABI throws at them. In which case why not just expose the internal functions? Is it at all likely that that representation would ever change? Are there likely to be applications that use idiosyncratic representations of high-precision integers that this API would handle directly?
On 08.08.21 07:08, Stephen J. Turnbull wrote:
Serhiy Storchaka writes:
Python integers have arbitrary precision. For serialization and interoperation with other programs and libraries we need to represent them [...]. [In the case of non-standard precisions,] [t]here are private C API functions _PyLong_AsByteArray and _PyLong_FromByteArray, but they are for internal use only.
I am planning to add public analogs of these private functions, but more powerful and convenient.
PyObject *PyLong_FromBytes(const void *buf, Py_ssize_t size, int byteorder, int signed)
Py_ssize_t PyLong_AsBytes(PyObject *o, void *buf, Py_ssize_t n, int byteorder, int signed, int *overflow)
I don't understand why such a complex API is useful as a public facility.
There are several goals:

1. Support conversion to/from all C integer types (char, short, int, long, long long, intN_t, intptr_t, intmax_t, wchar_t, wint_t and the corresponding unsigned types), POSIX integer types (pid_t, uid_t, off_t, etc.) and other platform- or library-specific integer types (like Tcl_WideInt in libtcl). Currently the only supported types are long, unsigned long, long long, unsigned long long, ssize_t and size_t. For other types you have to choose the most appropriate supertype (long or long long, sometimes providing several variants) and handle overflow manually. There are requests for PyLong_AsShort(), PyLong_AsInt32(), PyLong_AsMaxInt(), etc. It is better to provide a single universal function than to extend the API by several dozen functions.

2. Support different options for overflow handling. Different options are present in PyLong_AsLong(), PyLong_AsLongAndOverflow(), PyLong_AsUnsignedLongMask() and PyNumber_AsSsize_t(), but not all options are available for all types. There is no *AndOverflow() variant for unsigned types, size_t or ssize_t, and saturation is only available for ssize_t.

3. Support serialization of arbitrary precision integers. It is used in pickle and random, and can be used to support other binary data formats (a sketch of this pattern follows below).

All these goals can be achieved with a few universal functions.
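[Editor's note: as an illustration of goal 3, a sketch of the two-call pattern the draft semantics allow -- query the required size with buf == NULL, then convert. The functions are still only proposed.]

    #include <Python.h>

    /* Sketch: serialize an arbitrary-precision int with the proposed
       PyLong_AsBytes.  Arguments follow the draft: little endian (-1),
       signed (1), NULL overflow pointer => raise OverflowError (which
       cannot happen here, since we ask for the minimum size first). */
    static unsigned char *
    serialize_int(PyObject *n, Py_ssize_t *size_out)
    {
        Py_ssize_t size = PyLong_AsBytes(n, NULL, 0, -1, 1, NULL);
        if (size < 0) {
            return NULL;
        }
        unsigned char *buf = PyMem_Malloc(size);
        if (buf == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        if (PyLong_AsBytes(n, buf, size, -1, 1, NULL) < 0) {
            PyMem_Free(buf);
            return NULL;
        }
        *size_out = size;
        return buf;
    }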
So I might want PyLong_AsGMPInt and PyLong_AsGMPRatio as well as the corresponding functions for MP, and maybe even PyLong_AsGMPFloat. The obvious way to write those is <library constructor>(str(python_integer)), I think.
PyLong_AsGMPInt() cannot be added unless GMP is included in the Python interpreter, and that is very unlikely. Converting via the decimal representation is very inefficient, especially for very long integers (it has cubic complexity in the size of the integer). I think GMP supports more efficient conversions.
In the unlikely event that an application needs to squeeze out that tiny bit of performance, I guess the library constructors all accept buffers of bytes, too, probably with a similarly complex API that can handle whatever the Python ABI throws at them.
To use the library constructors that accept buffers of bytes, we need buffers of bytes. And the proposed functions provide just such an interface for converting Python integers to/from buffers of bytes.
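[Editor's note: for instance, pairing the proposed function with GMP's real mpz_import would avoid the decimal detour entirely. A sketch for non-negative values, with sign handling left out for brevity; PyLong_AsBytes is the proposed API, mpz_import is real GMP API.]

    #include <Python.h>
    #include <gmp.h>

    /* `out` must already be initialized with mpz_init by the caller.
       Non-negative values only, to keep the sign logic out of the sketch. */
    static int
    pylong_to_mpz(PyObject *n, mpz_t out)
    {
        Py_ssize_t size = PyLong_AsBytes(n, NULL, 0, -1, 0, NULL);
        if (size < 0) {
            return -1;
        }
        unsigned char *buf = PyMem_Malloc(size);
        if (buf == NULL) {
            PyErr_NoMemory();
            return -1;
        }
        if (PyLong_AsBytes(n, buf, size, -1, 0, NULL) < 0) {
            PyMem_Free(buf);
            return -1;
        }
        /* Least-significant word first (order -1), one byte per "word". */
        mpz_import(out, (size_t)size, -1, 1, 0, 0, buf);
        PyMem_Free(buf);
        return 0;
    }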
In which case why not just expose the internal functions?
If you mean _PyLong_FromByteArray/_PyLong_AsByteArray, it is because we should polish them before exposing them. They currently do not provide different options for overflow handling, and I think there may be a more convenient way to handle the common case of native byte order. The names of the functions and the number and order of parameters can be discussed; it is for such discussion that I opened this thread. If you have alternative proposals, please show them.
Is it at all likely that that representation would ever change?
They do not rely on the internal representation. They produce an implementation-independent representation.
On Sun, Aug 8, 2021 at 9:54 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
1. Support conversion to/from all C integer types (char, short, int, long, long long, intN_t, intptr_t, intmax_t, wchar_t, wint_t and corresponding unsigned types),
I suggest support for the "new" C sized types available in <stdint.h>. Why anyone would want to use `long`, which could be 32 or 64 bit depending on platform/compiler, rather than `int32_t` or `int64_t`, is still confusing to me. Granted, I only write a small amount of C extension code, but I have always used the sized types, as otherwise I have no idea what might happen on different platforms.

But in any case, thanks for doing this, it's a great idea.

-CHB

--
Christopher Barker, PhD (Chris)
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
On 8 Aug 2021, at 18:53, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 08.08.21 07:08, Stephen J. Turnbull wrote:
Serhiy Storchaka writes:
Python integers have arbitrary precision. For serialization and interoperation with other programs and libraries we need to represent them [...]. [In the case of non-standard precisions,] [t]here are private C API functions _PyLong_AsByteArray and _PyLong_FromByteArray, but they are for internal use only.
I am planning to add public analogs of these private functions, but more powerful and convenient.
PyObject *PyLong_FromBytes(const void *buf, Py_ssize_t size, int byteorder, int signed)
Py_ssize_t PyLong_AsBytes(PyObject *o, void *buf, Py_ssize_t n, int byteorder, int signed, int *overflow)
I don't understand why such a complex API is useful as a public facility.
There are several goals:
1. Support conversion to/from all C integer types (char, short, int, long, long long, intN_t, intptr_t, intmax_t, wchar_t, wint_t and the corresponding unsigned types), POSIX integer types (pid_t, uid_t, off_t, etc.) and other platform- or library-specific integer types (like Tcl_WideInt in libtcl). Currently the only supported types are long, unsigned long, long long, unsigned long long, ssize_t and size_t. For other types you have to choose the most appropriate supertype (long or long long, sometimes providing several variants) and handle overflow manually.
There are requests for PyLong_AsShort(), PyLong_AsInt32(), PyLong_AsMaxInt(), etc. It is better to provide a single universal function than to extend the API by several dozen functions.
But how would you convert from the buffer to the actual type you want? IMHO "pid_t a_pid; PyLong_AsBytes(val, &a_pid, sizeof(pid_t), …)" would be worse than having a number of aliases. The API is more cumbersome to use, and you lose the type checking from the C compiler.

Other than that, the variants you mention could in general just be aliases for conversion functions to/from the basic C types (a sketch of that idea follows below).

Ronald

—
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
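[Editor's note: Ronald's aliases could plausibly be thin static-inline wrappers over the proposed universal function. A hypothetical sketch -- PyLong_AsPid is invented here, and byteorder 0 assumes convention (c), native order.]

    #include <Python.h>
    #include <sys/types.h>

    /* Hypothetical alias: the C compiler checks the pointer type, and
       the size argument cannot be wrong.  pid_t is a signed integer
       type per POSIX. */
    static inline int
    PyLong_AsPid(PyObject *o, pid_t *value)
    {
        /* byteorder 0 == native, assuming convention (c). */
        return PyLong_AsBytes(o, value, sizeof(*value), 0, 1, NULL) < 0
               ? -1 : 0;
    }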
On 7 Aug 2021, at 19:22, Serhiy Storchaka <storchaka@gmail.com> wrote:
Python integers have arbitrary precision. For serialization and interoperation with other programs and libraries we need to represent them as fixed-width integers (little- and big-endian, signed and unsigned). In Python, we can use struct, array, memoryview and ctypes for some standard sizes, and the int methods int.to_bytes and int.from_bytes for non-standard sizes. In C, there is a C API for converting to/from the C types long, unsigned long, long long and unsigned long long. For other C types (signed and unsigned char, short, int) we need to use the C API for converting to long, and then truncate to the destination type with a check for overflow. For integer type aliases like pid_t we need to determine their size and signedness and use the corresponding C API or a wrapper. For non-standard integers (e.g. 24-bit), integers wider than long long, and arbitrary precision integers, everything is much more complicated. There are private C API functions _PyLong_AsByteArray and _PyLong_FromByteArray, but they are for internal use only.
I am planning to add public analogs of these private functions, but more powerful and convenient.
PyObject *PyLong_FromBytes(const void *buf, Py_ssize_t size, int byteorder, int signed)
Py_ssize_t PyLong_AsBytes(PyObject *o, void *buf, Py_ssize_t n, int byteorder, int signed, int *overflow)
PyLong_FromBytes() returns an int object. It only fails in case of a memory error or incorrect arguments (e.g. buf is NULL).
PyLong_AsBytes() writes bytes to the specified buffer; it does not allocate memory. If buf is NULL it returns the minimum size of the buffer needed to represent the integer. -1 is returned on error. If overflow is NULL, then OverflowError is raised; otherwise *overflow is set to +1 for overflowing the upper limit, -1 for overflowing the lower limit, and 0 for no overflow.
Now I have some design questions.
1. How to encode the byte order?
a) 1 -- little endian, 0 -- big endian
b) 0 -- little endian, 1 -- big endian
c) -1 -- little endian, +1 -- big endian, 0 -- native endian.
Use an enum and do not use 0 as a valid value to make mistakes easier to detect. I think you are right to have big endian, little endian and native endian. I do not think the numeric values of the enum matter (apart from avoiding 0).
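[Editor's note: one possible spelling of that suggestion; the names are hypothetical.]

    /* Hypothetical enum; 0 is deliberately not a valid value, so an
       uninitialized or zeroed argument is caught as an error. */
    typedef enum {
        PyLong_BIG_ENDIAN = 1,
        PyLong_LITTLE_ENDIAN = 2,
        PyLong_NATIVE_ENDIAN = 3
    } PyLong_ByteOrder;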
Do we need to reserve some values for mixed endians?
What is mixed endian? I would guess that its use would be application specific - so I assume you would not need to support it.
2. How to specify the reduction modulo 2**(8*size) (like in PyLong_AsUnsignedLongMask)?
Add one more flag to PyLong_AsBytes()? Use a special value for the signed argument: 0 -- unsigned, 1 -- signed, 2 (or -1) -- modulo? Or use some combination of signed and overflow?
3. How to specify saturation (like in PyNumber_AsSsize_t())? I.e. values less than the lower limit are replaced with the lower limit, values greater than the upper limit are replaced with the upper limit.
Same options as for (2): a separate flag, encoding it in signed (but we need two values here), or a combination of other parameters.
Maybe a single enum that has:
- signed (modulo)
- signed saturate
- unsigned (modulo)
- unsigned saturate
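[Editor's note: spelled out as code, such an enum might look like this; the names are hypothetical, and "modulo" is read as wrap-around rather than raising on overflow.]

    /* Hypothetical combined enum covering signedness and overflow
       behaviour in a single argument. */
    typedef enum {
        PyLong_SIGNED_MODULO = 1,
        PyLong_SIGNED_SATURATE = 2,
        PyLong_UNSIGNED_MODULO = 3,
        PyLong_UNSIGNED_SATURATE = 4
    } PyLong_ConvertMode;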
4. What exact names to use?
PyLong_FromByteArray/PyLong_AsByteArray, PyLong_FromBytes/PyLong_AsBytes, PyLong_FromBytes/PyLong_ToBytes?
Barry
On 2021-08-08 at 09:41:34 +0100, Barry Scott <barry@barrys-emacs.org> wrote:
What is mixed endian? I would guess that its use would be application specific - so I assume you would not need to support it.
Not AFAIK application specific, but hardware specific: https://en.wikipedia.org/wiki/Endianness#Mixed
On 08.08.21 11:41, Barry Scott wrote:
On 7 Aug 2021, at 19:22, Serhiy Storchaka <storchaka@gmail.com> wrote:

1. How to encode the byte order?
a) 1 -- little endian, 0 -- big endian
b) 0 -- little endian, 1 -- big endian
c) -1 -- little endian, +1 -- big endian, 0 -- native endian.
Use an enum and do not use 0 as a valid value to make mistakes easier to detect. I think you are right to have big endian, little endian and native endian. I do not think the numeric values of the enum matter (apart from avoiding 0).
There is a precedent for using +1/-1/0 for big/little/native in the UTF-16 and UTF-32 codecs. I think that using the same convention will be more error-proof.
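[Editor's note: applied to the proposed function, that convention would read like this -- a sketch only.]

    #include <Python.h>

    /* The +1/-1/0 convention from the UTF-16/UTF-32 codecs, applied to
       the proposed PyLong_AsBytes. */
    static int
    demo_byteorders(PyObject *n, unsigned char buf[8])
    {
        if (PyLong_AsBytes(n, buf, 8, -1, 1, NULL) < 0) return -1; /* little */
        if (PyLong_AsBytes(n, buf, 8, +1, 1, NULL) < 0) return -1; /* big */
        if (PyLong_AsBytes(n, buf, 8,  0, 1, NULL) < 0) return -1; /* native */
        return 0;
    }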
Maybe a single enum that has: signed (modulo) signed saturate unsigned (modulo) unsigned saturate
There is a problem with an enum -- the size of the type is not specified. It can be int, it can be 8 bits, it can even be less than 8 bits in a structure. Adding new members can change the size of the type. Therefore it is not stable for the ABI. But combining options for signedness and overflow handling (or providing a set of functions for the different kinds of overflow handling, because the overflow output parameter is not needed in all cases) may be the best option.
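[Editor's note: a flags-style alternative would sidestep the enum sizing problem, since the parameter stays a plain int. The names below are invented for illustration.]

    /* Hypothetical flag bits passed in an int parameter; an int has a
       fixed, ABI-stable size, unlike an enum type. */
    #define PYLONG_BYTES_SIGNED   0x01  /* two's complement, else unsigned */
    #define PYLONG_BYTES_MODULO   0x02  /* reduce modulo 2**(8*size) */
    #define PYLONG_BYTES_SATURATE 0x04  /* clamp to the lower/upper limit */

    /* e.g.: PyLong_AsBytes(o, buf, n, byteorder,
                            PYLONG_BYTES_SIGNED | PYLONG_BYTES_SATURATE, NULL) */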
I lack the relevant experience to have an opinion on most of this, but FWIW "PyLong_FromBytes/PyLong_ToBytes" seems clearest to me out of the options proposed.

On Sat, Aug 7, 2021 at 2:23 PM Serhiy Storchaka <storchaka@gmail.com> wrote:
Python integers have arbitrary precision. For serialization and interoperation with other programs and libraries we need to represent them as fixed-width integers (little- and big-endian, signed and unsigned). In Python, we can use struct, array, memoryview and ctypes for some standard sizes, and the int methods int.to_bytes and int.from_bytes for non-standard sizes. In C, there is a C API for converting to/from the C types long, unsigned long, long long and unsigned long long. For other C types (signed and unsigned char, short, int) we need to use the C API for converting to long, and then truncate to the destination type with a check for overflow. For integer type aliases like pid_t we need to determine their size and signedness and use the corresponding C API or a wrapper. For non-standard integers (e.g. 24-bit), integers wider than long long, and arbitrary precision integers, everything is much more complicated. There are private C API functions _PyLong_AsByteArray and _PyLong_FromByteArray, but they are for internal use only.
I am planning to add public analogs of these private functions, but more powerful and convenient.
PyObject *PyLong_FromBytes(const void *buf, Py_ssize_t size, int byteorder, int signed)
Py_ssize_t PyLong_AsBytes(PyObject *o, void *buf, Py_ssize_t n, int byteorder, int signed, int *overflow)
PyLong_FromBytes() returns an int object. It only fails in case of a memory error or incorrect arguments (e.g. buf is NULL).
PyLong_AsBytes() writes bytes to the specified buffer; it does not allocate memory. If buf is NULL it returns the minimum size of the buffer needed to represent the integer. -1 is returned on error. If overflow is NULL, then OverflowError is raised; otherwise *overflow is set to +1 for overflowing the upper limit, -1 for overflowing the lower limit, and 0 for no overflow.
Now I have some design questions.
1. How to encode the byte order?
a) 1 -- little endian, 0 -- big endian
b) 0 -- little endian, 1 -- big endian
c) -1 -- little endian, +1 -- big endian, 0 -- native endian.
Do we need to reserve some values for mixed endians?
2. How to specify the reduction modulo 2**(8*size) (like in PyLong_AsUnsignedLongMask)?
Add one more flag to PyLong_AsBytes()? Use a special value for the signed argument: 0 -- unsigned, 1 -- signed, 2 (or -1) -- modulo? Or use some combination of signed and overflow?
3. How to specify saturation (like in PyNumber_AsSsize_t())? I.e. values less than the lower limit are replaced with the lower limit, values greater than the upper limit are replaced with the upper limit.
Same options as for (2): a separate flag, encoding it in signed (but we need two values here), or a combination of other parameters.
4. What exact names to use?
PyLong_FromByteArray/PyLong_AsByteArray, PyLong_FromBytes/PyLong_AsBytes, PyLong_FromBytes/PyLong_ToBytes?
participants (7)

- 2QdxY4RzWzUUiLuE@potatochowder.com
- Barry Scott
- Christopher Barker
- Kyle Stanley
- Ronald Oussoren
- Serhiy Storchaka
- Stephen J. Turnbull