A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
According to the docs, the 'u' code is deprecated and will be removed in 4.0, but no alternative is suggested.
Why is this being deprecated, instead of keeping it and making it always 32 bits? It seems like useful functionality that can't be easily obtained another way.
-- Greg
Hi,
Internally, CPython has a _PyUnicodeWriter which is an efficient way
to create a string by appending substrings or characters.
_PyUnicodeWriter changes the internal storage format depending on
the characters' code points (ASCII or Latin-1: 1 byte/character, BMP: 2 b/c,
full UCS: 4 b/c). I tried once to expose it in Python, but I wasn't
convinced by the performance. The overhead of method calls was quite
significant, and I wasn't convinced by "writer += str" performance
either. Maybe I should try again. PyPy also has such an object. It
avoids the "str += str" hack in ceval.c to avoid very poor performance
(_PyUnicodeWriter also uses overallocation which can be controlled
with multiple parameters to reduce the number of reallocs).
Another alternative would be to add a "strarray" type similar to
the bytes/bytearray pair.
Is this what you are looking for? Or do you really need the array.array API?
Victor
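The storage widths Victor lists are the same ones that str objects use since PEP 393, so they can be observed from Python without touching the private _PyUnicodeWriter API. A minimal sketch, assuming CPython 3.3 or later; sys.getsizeof figures are implementation details, and the helper name bytes_per_char is made up for this illustration:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by growing a string of `ch` by n characters."""
    # The fixed object header cancels out in the difference.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


widths = {
    "ascii": bytes_per_char("a"),
    "latin1": bytes_per_char("\xe9"),         # é
    "BMP": bytes_per_char("\u210d"),          # ℍ
    "full UCS": bytes_per_char("\U0001f600"), # outside the BMP
}

for kind, width in widths.items():
    print(kind, width)
```

On CPython this should report 1, 1, 2 and 4 bytes per character respectively; other implementations may store strings differently.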
On 22.03.19 09:45, Victor Stinner wrote:
> Internally, CPython has a _PyUnicodeWriter which is an efficient way to create a string by appending substrings or characters. _PyUnicodeWriter changes the internal storage format depending on the characters' code points (ASCII or Latin-1: 1 byte/character, BMP: 2 b/c, full UCS: 4 b/c). I tried once to expose it in Python, but I wasn't convinced by the performance. The overhead of method calls was quite significant, and I wasn't convinced by "writer += str" performance either. Maybe I should try again. PyPy also has such an object. It avoids the "str += str" hack in ceval.c to avoid very poor performance (_PyUnicodeWriter also uses overallocation which can be controlled with multiple parameters to reduce the number of reallocs).
> Another alternative would be to add a "strarray" type similar to the bytes/bytearray pair.
Another alternative for a mutable string buffer and string builder is io.StringIO.
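A short sketch of both uses of io.StringIO mentioned above: as a string builder, and as a buffer that is initialised from a string and then modified. Note that StringIO can only overwrite or append; it has no way to insert or delete characters in the middle. The sample strings are arbitrary.

```python
import io

# Use as a string builder: write pieces, then collect the result once.
builder = io.StringIO()
for word in ("spam", "eggs", "ham"):
    builder.write(word)
    builder.write(" ")
built = builder.getvalue()

# Use as a mutable buffer: initialise from a string, overwrite in place.
buf = io.StringIO("hello world")
buf.seek(6)                # position of the "w"
buf.write("WORLD")         # overwrites five characters
buf.seek(0, io.SEEK_END)   # move to the end before appending
buf.write("!")
edited = buf.getvalue()

print(repr(built), repr(edited))
```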
On Fri, Mar 22, 2019 at 4:38 PM Greg Ewing wrote:
> A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
> According to the docs, the 'u' code is deprecated and will be removed in 4.0, but no alternative is suggested.
> Why is this being deprecated, instead of keeping it and making it always 32 bits? It seems like useful functionality that can't be easily obtained another way.
I think it's because not many use cases were found
when implementing PEP 393.
If there are enough use cases to keep it in the stdlib, I'm OK
with un-deprecating it and making it always 32-bit (int32_t).
--
Inada Naoki
FYI, I have created an issue on bugs.python.org about adding a deprecation warning
for array('u').
https://bugs.python.org/issue36299
I created a PR to change Py_UNICODE to Py_UCS4 instead of deprecating it.
https://github.com/python/cpython/pull/12497
Then I found that the same change had been made and reverted in the past.
https://github.com/python/cpython/commit/62bb394729a167a46d950954c4aed5f3ba7...
The issue for the revert is this.
https://bugs.python.org/issue13072
--
Inada Naoki
On Fri, Mar 22, 2019 at 08:31:33PM +1300, Greg Ewing wrote:
> A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
> According to the docs, the 'u' code is deprecated and will be removed in 4.0, but no alternative is suggested.
> Why is this being deprecated, instead of keeping it and making it always 32 bits? It seems like useful functionality that can't be easily obtained another way.
I can't answer any of those questions, but perhaps the poster can do this instead:
    py> a = array('L', 'ℍℰâѵÿ Ϻεταł'.encode('utf-32be'))
    py> a
    array('L', [220266496, 807469056, 3791650816, 1963196416, 4278190080, 536870912, 4194500608, 3036872704, 3288530944, 2969763840, 1107361792])
Getting the string out again is no harder:
    py> bytes(a).decode('utf-32be')
    'ℍℰâѵÿ Ϻεταł'
But having said that, it would be nice to have an array code which treated the values as single UTF-32 characters:
    array('?', ['ℍ', 'ℰ', 'â', 'ѵ', 'ÿ', ' ', 'Ϻ', 'ε', 'τ', 'α', 'ł'])
if for no other reason than it looks nicer than a bunch of 32 bit ints.
-- Steven
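Two caveats apply to the session above. The integers shown are byte-swapped code points (220266496 is 0x0D210000, while ℍ is U+210D), which is what happens when big-endian bytes are loaded on a little-endian machine. Also, 'L' is 8 bytes wide on many 64-bit Unix builds, where a 44-byte initialiser would be rejected. A variant that sidesteps both by storing ord() values directly might look like this; it is only a sketch, and the helper name four_byte_code is made up for the example:

```python
from array import array


def four_byte_code():
    """Return an unsigned type code whose items can hold any code point."""
    for code in ("I", "L"):
        if array(code).itemsize >= 4:
            return code
    raise RuntimeError("no suitable array type code")  # 'L' is at least 4 bytes


text = 'ℍℰâѵÿ Ϻεταł'

# Each item is the code point itself, independent of byte order.
chars = array(four_byte_code(), map(ord, text))

# The array is mutable and resizable, unlike str.
chars[0] = ord('H')
chars.append(ord('!'))

# Getting the string back out.
result = ''.join(map(chr, chars))
print(result)
```

The cost is that the elements are still plain integers, so the "looks nicer" point above remains unaddressed.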
On Fri, 22 Mar 2019 20:31:33 +1300
Greg Ewing wrote:
> A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
TBH, I think anyone trying to use array.array should be directed to Numpy these days. The only reason for array.array being here is that it predates Numpy. Otherwise we'd never have added it.
Regards
Antoine.
Antoine Pitrou wrote on 22.03.19 at 11:39:
> On Fri, 22 Mar 2019 20:31:33 +1300 Greg Ewing wrote:
>> A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
> TBH, I think anyone trying to use array.array should be directed to Numpy these days. The only reason for array.array being here is that it predates Numpy. Otherwise we'd never have added it.
Well, maybe it wouldn't get *added* these days anymore, with pip+PyPI nicely in place. But being there already, it makes for a nice and efficient "batteries included" list replacement for simple data that would otherwise waste a lot of object memory.
Stefan
On Fri, 22 Mar 2019 12:51:49 +0100
Stefan Behnel wrote:
> Antoine Pitrou wrote on 22.03.19 at 11:39:
>> On Fri, 22 Mar 2019 20:31:33 +1300 Greg Ewing wrote:
>>> A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
>> TBH, I think anyone trying to use array.array should be directed to Numpy these days. The only reason for array.array being here is that it predates Numpy. Otherwise we'd never have added it.
> Well, maybe it wouldn't get *added* these days anymore, with pip+PyPI nicely in place. But being there already, it makes for a nice and efficient "batteries included" list replacement for simple data that would otherwise waste a lot of object memory.
It's not really "batteries included". array.array() supports almost no useful operations. It's a bare-bones container for which you have to implement every useful feature by yourself.
(yes, you can use generic mutable sequence algorithms such as heapq or random.shuffle; how often do you need to heapify or shuffle an array of unicode codepoints?)
Also, when using a unicode array, there's no substantial memory win compared to a single str object. You may actually be losing some, because of the flexible str representation.
Regards
Antoine.
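The memory point can be checked roughly with sys.getsizeof. This is a sketch under assumptions: it uses CPython, where getsizeof reports the flexible str representation and the array's buffer, and it stores code points as unsigned integers in an 'L' array as a stand-in for a 32-bit character array (4 or 8 bytes per item depending on the platform).

```python
import sys
from array import array

text = "plain ASCII text " * 1000   # 17000 characters, all ASCII

# A str holding only ASCII uses one byte per character on CPython.
str_size = sys.getsizeof(text)

# A fixed-width array spends at least four bytes on every character.
array_size = sys.getsizeof(array("L", map(ord, text)))

print(str_size, array_size)
```

For text made up entirely of characters outside the BMP the two sizes come out much closer, since str then also needs four bytes per character.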
Antoine Pitrou wrote:
> TBH, I think anyone trying to use array.array should be directed to Numpy these days. The only reason for array.array being here is that it predates Numpy.
Numpy is a huge dependency to pull in when you don't need all the heavyweight array machinery.
Also, numpy arrays behave very differently from Python sequences in many ways. For example, they don't have a flexible size, and you can't concatenate them using +. Arrays from the array module follow the sequence protocol much more closely, and that makes them valuable, IMO.
-- Greg
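To illustrate the sequence-protocol point, here is a small sketch of list-like operations that array.array supports; the values are arbitrary.

```python
from array import array

a = array("i", [1, 2, 3])

# Concatenation with + produces a new array, as with lists.
b = a + array("i", [4, 5])

# The size is flexible: append, extend, delete and insert all work.
b.append(6)
b.extend([7, 8])
del b[0]
b.insert(0, 0)

# Slice assignment works too, as long as the right-hand side is an
# array with the same type code.
b[2:4] = array("i", [30, 40])

print(b.tolist())
```

With numpy, + on two arrays is element-wise addition rather than concatenation, which is the behavioural difference Greg refers to.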
On 22.03.19 09:31, Greg Ewing wrote:
> A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
> According to the docs, the 'u' code is deprecated and will be removed in 4.0, but no alternative is suggested.
> Why is this being deprecated, instead of keeping it and making it always 32 bits? It seems like useful functionality that can't be easily obtained another way.
Making it always 32 bits would be a compatibility-breaking change. Currently array('u') represents a wchar_t string, and many APIs on Windows require it. But we can add a new code, e.g. 'U', for UCS4.
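As far as I know, a separate code along these lines was eventually added: Python 3.13 introduced the 'w' type code, documented as Py_UCS4, rather than the 'U' spelling floated here. The sketch below assumes that and falls back to an array of code points on older versions; the fallback branch is just one possible workaround, not something proposed in this thread.

```python
from array import array, typecodes

text = "ℍℰâѵÿ"

if "w" in typecodes:
    # Python 3.13+: a character array with a fixed 4-byte item size.
    chars = array("w", text)
    chars[0] = "H"
    result = chars.tounicode()
else:
    # Older versions: store code points as unsigned integers instead.
    chars = array("L", map(ord, text))
    chars[0] = ord("H")
    result = "".join(map(chr, chars))

print(result)
```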
On Fri, 22 Mar 2019 13:27:08 +0200
Serhiy Storchaka wrote:
> On 22.03.19 09:31, Greg Ewing wrote:
>> A poster on comp.lang.python is asking about array.array('u'). He wants an efficient mutable collection of unicode characters that can be initialised from a string.
>> According to the docs, the 'u' code is deprecated and will be removed in 4.0, but no alternative is suggested.
>> Why is this being deprecated, instead of keeping it and making it always 32 bits? It seems like useful functionality that can't be easily obtained another way.
> Making it always 32 bits would be a compatibility-breaking change. Currently array('u') represents a wchar_t string, and many APIs on Windows require it.
The question is: why would you use an array.array() with a Windows C API?
Regards
Antoine.
On 22.03.19 13:33, Antoine Pitrou wrote:
> On Fri, 22 Mar 2019 13:27:08 +0200 Serhiy Storchaka wrote:
>> Making it always 32 bits would be a compatibility-breaking change. Currently array('u') represents a wchar_t string, and many APIs on Windows require it.
> The question is: why would you use an array.array() with a Windows C API?
I do not. But maybe it is used. And changing the width of the 'u' code would break such a use case.
On Fri, 22 Mar 2019 16:11:45 +0200
Serhiy Storchaka wrote:
> On 22.03.19 13:33, Antoine Pitrou wrote:
>> On Fri, 22 Mar 2019 13:27:08 +0200 Serhiy Storchaka wrote:
>>> Making it always 32 bits would be a compatibility-breaking change. Currently array('u') represents a wchar_t string, and many APIs on Windows require it.
>> The question is: why would you use an array.array() with a Windows C API?
> I do not. But maybe it is used. And changing the width of the 'u' code would break such a use case.
Right. But changing the width is not what I had in mind. Just keep it identical until we finally decide to remove it.
Regards
Antoine.
On 22Mar2019 0433, Antoine Pitrou wrote:
> The question is: why would you use an array.array() with a Windows C API?
I started replying to this with a whole lot of examples, and eventually convinced myself that you wouldn't (or shouldn't).
That said, I see value in having a type for array/struct/memoryview that "is the same size as wchar_t", since that will avoid people having to guess (the docs for array in particular are deliberately vague about the actual size of the various types).
This is not the same as "UCS-4" - that's a very Linux-centric point of view. Decoupling it from Py_UNICODE is fine though, since that type will have no meaning eventually. But the PyUnicode_*WideChar APIs are not going away, which means that wchar_t still exists and has to have a known size at compile time.
Cheers,
Steve
On Fri, 22 Mar 2019 at 16:12, Steve Dower wrote:
> On 22Mar2019 0433, Antoine Pitrou wrote:
>> The question is: why would you use an array.array() with a Windows C API?
> I started replying to this with a whole lot of examples, and eventually convinced myself that you wouldn't (or shouldn't).
> That said, I see value in having a type for array/struct/memoryview that "is the same size as wchar_t", since that will avoid people having to guess (the docs for array in particular are deliberately vague about the actual size of the various types).
This is pretty much what ctypes provides for dealing with unicode?
https://docs.python.org/3/library/ctypes.html#ctypes.create_unicode_buffer
Seems a fine place to have things that help with win32 api interactions.
Martin
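A brief sketch of what that looks like in practice. ctypes.create_unicode_buffer returns a mutable array of c_wchar, so the item size follows the platform's wchar_t (typically 2 bytes on Windows and 4 on Linux and macOS). The sample text is arbitrary.

```python
import ctypes

# Size of wchar_t on this platform, without having to guess.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# A mutable wide-character buffer initialised from a string; one extra
# slot is allocated for the terminating NUL.
buf = ctypes.create_unicode_buffer("hello")
buf[0] = "J"

value = buf.value            # contents up to the NUL terminator
total = ctypes.sizeof(buf)   # buffer size in bytes, terminator included

print(wchar_size, value, len(buf), total)
```

Unlike array.array, the buffer has a fixed length once created.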
On 25Mar2019 0812, Martin (gzlist) wrote:
> On Fri, 22 Mar 2019 at 16:12, Steve Dower wrote:
>> On 22Mar2019 0433, Antoine Pitrou wrote:
>>> The question is: why would you use an array.array() with a Windows C API?
>> I started replying to this with a whole lot of examples, and eventually convinced myself that you wouldn't (or shouldn't).
>> That said, I see value in having a type for array/struct/memoryview that "is the same size as wchar_t", since that will avoid people having to guess (the docs for array in particular are deliberately vague about the actual size of the various types).
> This is pretty much what ctypes provides for dealing with unicode?
> https://docs.python.org/3/library/ctypes.html#ctypes.create_unicode_buffer
> Seems a fine place to have things that help with win32 api interactions.
Sure, though there are other reasons to deal with "pure" data that would benefit from having the data type in array. I don't need to directly refer to an existing buffer in memory, just to easily create/reinterpret bytes (for which memoryview is often better, though it inherits its codes from struct, which has no 'u' code, which is probably why I end up using array instead ;) )
Also, I keep finding that every time I deploy Python these days, it's critical to remove ctypes to reduce the attack surface area (I'm getting it into more and more high value systems where the rules are more strict). So I'm a big fan of treating ctypes as optional.
Cheers,
Steve
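For the "reinterpret bytes" case without ctypes, memoryview.cast can view a bytes object as native integers using struct codes. Since struct has no character code wider than one byte, the result is code points rather than characters. The sketch assumes that the native 'I' format is 4 bytes wide, which is common but not guaranteed, and it checks that assumption explicitly.

```python
import struct
import sys

# Assumption: the native unsigned int is 4 bytes, matching UTF-32 units.
assert struct.calcsize("I") == 4

# Encode in the machine's own byte order so the cast yields code points.
codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
raw = "héllo".encode(codec)

# Reinterpret the bytes as unsigned 32-bit integers without copying.
view = memoryview(raw).cast("I")
codes = view.tolist()

print(codes, "".join(map(chr, codes)))
```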
Serhiy Storchaka wrote:
> Making it always 32 bits would be a compatibility-breaking change. Currently array('u') represents a wchar_t string, and many APIs on Windows require it.
Ah, I see. It would be helpful if the array module docs made that clear. At one point the 3.x docs said that it depended on whether you had a wide or narrow unicode Python build, which confused me, because I thought that distinction had gone away in Python 3.
> But we can add a new code, e.g. 'U', for UCS4.
+1.
-- Greg
participants (9)
- Antoine Pitrou
- Greg Ewing
- Inada Naoki
- Martin (gzlist)
- Serhiy Storchaka
- Stefan Behnel
- Steve Dower
- Steven D'Aprano
- Victor Stinner