O(1) deletes from the front of bytearray (was: Re: Adding bytes.frombuffer() constructor to PEP 467 (was: [Python-ideas] Adding bytes.frombuffer() constructor)
On Tue, Oct 11, 2016 at 10:53 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 12.10.16 08:03, Nathaniel Smith wrote:
On Tue, Oct 11, 2016 at 9:08 PM, INADA Naoki <songofacandy@gmail.com> wrote:
From Python 3.4, bytearray is good solution for I/O buffer, thanks to #19087 [1]. Actually, asyncio uses bytearray as I/O buffer often.
Whoa what?! This is awesome, I had no idea that bytearray had O(1) deletes at the front. I literally reimplemented this myself on type of bytearray for some 3.5-only code recently because I assumed bytearray had the same asymptotics as list, and AFAICT this is totally undocumented. Shouldn't we write this down somewhere? Maybe here? -> https://docs.python.org/3/library/functions.html#bytearray
I afraid this is CPython implementation detail (like string concatenation optimization). Other implementations can have O(N) deletes at the front of bytearray.
Well, it shouldn't be :-). The problem with the string concatenation optimization is that to work, it requires both the use of refcounting GC and that you get lucky with how the underlying malloc implementation happened to lay things out in memory. Obviously it shouldn't be part of the language spec. But amortized O(1) deletes from the front of bytearray are totally different, and more like amortized O(1) appends to list: there are important use cases[1] that simply cannot be implemented without some feature like this, and putting the implementation inside bytearray is straightforward, deterministic, and more efficiently than hacking together something on top. Python should just guarantee it, IMO. -n [1] My use case is parsing HTTP out of a receive buffer. If deleting the first k bytes of an N byte buffer is O(N), then not only does parsing becomes O(N^2) in the worst case, but it's the sort of O(N^2) that random untrusted network clients can trigger at will to DoS your server. -- Nathaniel J. Smith -- https://vorpus.org
On 12.10.16 09:31, Nathaniel Smith wrote:
But amortized O(1) deletes from the front of bytearray are totally different, and more like amortized O(1) appends to list: there are important use cases[1] that simply cannot be implemented without some feature like this, and putting the implementation inside bytearray is straightforward, deterministic, and more efficiently than hacking together something on top. Python should just guarantee it, IMO.
-n
[1] My use case is parsing HTTP out of a receive buffer. If deleting the first k bytes of an N byte buffer is O(N), then not only does parsing becomes O(N^2) in the worst case, but it's the sort of O(N^2) that random untrusted network clients can trigger at will to DoS your server.
Deleting from buffer can be avoided if pass the starting index together with the buffer. For example: def read_line(buf: bytes, start: int) -> (bytes, int): try: end = buf.index(b'\r\n', start) except ValueError: return b'', start return buf[start:end], end+2
On Wed, Oct 12, 2016 at 12:17 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 12.10.16 09:31, Nathaniel Smith wrote:
But amortized O(1) deletes from the front of bytearray are totally different, and more like amortized O(1) appends to list: there are important use cases[1] that simply cannot be implemented without some feature like this, and putting the implementation inside bytearray is straightforward, deterministic, and more efficiently than hacking together something on top. Python should just guarantee it, IMO.
-n
[1] My use case is parsing HTTP out of a receive buffer. If deleting the first k bytes of an N byte buffer is O(N), then not only does parsing becomes O(N^2) in the worst case, but it's the sort of O(N^2) that random untrusted network clients can trigger at will to DoS your server.
Deleting from buffer can be avoided if pass the starting index together with the buffer. For example:
def read_line(buf: bytes, start: int) -> (bytes, int): try: end = buf.index(b'\r\n', start) except ValueError: return b'', start
return buf[start:end], end+2
It's more complicated than that -- the right algorithm is the one that Antoine implemented in 3.4. But yes, having implemented this by hand, I am aware that it can be implemented by hand :-). My point is that forcing everyone who writes network code in Python to do that is silly, especially given that CPython's apparently been shipping this feature for years. -n -- Nathaniel J. Smith -- https://vorpus.org
[1] My use case is parsing HTTP out of a receive buffer. If deleting the first k bytes of an N byte buffer is O(N), then not only does parsing becomes O(N^2) in the worst case, but it's the sort of O(N^2) that random untrusted network clients can trigger at will to DoS your server.
Deleting from buffer can be avoided if pass the starting index together with the buffer. For example:
def read_line(buf: bytes, start: int) -> (bytes, int): try: end = buf.index(b'\r\n', start) except ValueError: return b'', start
return buf[start:end], end+2
In case of asyncio, we can't assume the order of append and consume. For example, stream processing HTTP chunked response. Append to receive buffer and consume a chunk in buffer can happen in arbitrary order. That's why bytes is not good for receive buffer. Efficient append is "must have". For example, Torando implements receive buffer by deque of bytes. See this code. https://github.com/tornadoweb/tornado/blob/master/tornado/iostream.py#L784-L... When Tornado drop Python 2.7 support, they can use bytearray, and iostream can be more simple and fast. So I hope "amortized O(1) deletion from the front" is language spec, at least for Python 3.5+ -- INADA Naoki <songofacandy@gmail.com>
2016-10-12 10:01 GMT+02:00 Nathaniel Smith <njs@pobox.com>:
It's more complicated than that -- the right algorithm is the one that Antoine implemented in 3.4. (...) My point is that forcing everyone who writes network code in Python to do that is silly, especially given that CPython's apparently been shipping this feature for years.
"For years" means since March 2014, Python 3.4.0 release, so 2 years ago. We can document the optimization as a CPython implementation detail and explain that it's only in Python >= 3.4. So an application which should work on Python 2.7 as well cannot rely on this optimization for example. Victor
On Wed, Oct 12, 2016 at 4:55 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-10-12 10:01 GMT+02:00 Nathaniel Smith <njs@pobox.com>:
It's more complicated than that -- the right algorithm is the one that Antoine implemented in 3.4. (...) My point is that forcing everyone who writes network code in Python to do that is silly, especially given that CPython's apparently been shipping this feature for years.
"For years" means since March 2014, Python 3.4.0 release, so 2 years ago.
We can document the optimization as a CPython implementation detail and explain that it's only in Python >= 3.4.
So an application which should work on Python 2.7 as well cannot rely on this optimization for example.
The proposal is that it should be documented as being part of the language spec starting in 3.4 (or whatever). So applications that support Python 2.7 can't rely on it, sure. But if I have an application that requires, say, 3.5+ but I don't want to depend on CPython-only implementation details, then I'm still allowed to use it. AFAIK basically the only project that would be affected by this is PyPy, and I when I asked on #pypy they said: <cfbolz> njs`: I think we either plan to or already support this so I'm not sure why this is controversial. -n -- Nathaniel J. Smith -- https://vorpus.org
On Wed, Oct 12, 2016 at 3:28 AM, INADA Naoki <songofacandy@gmail.com> wrote:
When Tornado drop Python 2.7 support, they can use bytearray, and iostream can be more simple and fast.
FYI 2.7 does have bytearray. (You still have to implement the O(1) deletion part as a layer on top, like Victor points out, but I suspect that'd still be dramatically simpler than what they're doing now...) -n -- Nathaniel J. Smith -- https://vorpus.org
AFAIK basically the only project that would be affected by this is PyPy, and I when I asked on #pypy they said:
<cfbolz> njs`: I think we either plan to or already support this
so I'm not sure why this is controversial.
FYI, I had filed the feature request on PyPy issue tracker. https://bitbucket.org/pypy/pypy/issues/2367/del-bytearray-n-should-be-optimi... In case of micropython, it doesn't support deletion for now. https://github.com/micropython/micropython/blob/b9672bcbe8dd29b61326af8eb026... -- INADA Naoki <songofacandy@gmail.com>
On Wed, 12 Oct 2016 14:14:48 -0700 Nathaniel Smith <njs@pobox.com> wrote:
The proposal is that it should be documented as being part of the language spec starting in 3.4 (or whatever). So applications that support Python 2.7 can't rely on it, sure. But if I have an application that requires, say, 3.5+ but I don't want to depend on CPython-only implementation details, then I'm still allowed to use it.
AFAIK basically the only project that would be affected by this is PyPy, and I when I asked on #pypy they said:
<cfbolz> njs`: I think we either plan to or already support this
so I'm not sure why this is controversial.
The main reason it could be controversial is if someone finds out another optimization that is made difficult or impossible by the O(1) delete-at-front guarantee. That sounds unlikely, though, since the general structure of a bytearray is constrained by other factors as well (such as the C API and the buffer API). By the way, to answer another question, the reason this wasn't made part of the spec originally is that the feature in itself was already contentious (see issue tracker discussion). Now that more people seem to get interested in network programming, they seem to understand the point ;-) Regards Antoine.
On 13/10/2016 11:41, Serhiy Storchaka wrote:
On 13.10.16 00:14, Nathaniel Smith wrote:
AFAIK basically the only project that would be affected by this is PyPy,
And MicroPython.
And Jython, except that from the start its implementation of bytearray deferred resizing until the proportion unused space reaches some limit. I think that should make it O(log N) on average to delete (or add) a byte, at either end of a buffer of size N,. However, observations with timeit() look constant up to the point I run out of heap. Jeff Allen
The proposal is that it should be documented as being part of the language spec starting in 3.4 (or whatever).
Is the performance characteristics of any object part of the language spec? I.e if someone wrote an implementation with an O(n) insert dict, it would suck, but wouldn't it still be Python? -CHB
So applications that support Python 2.7 can't rely on it, sure. But if I have an application that requires, say, 3.5+ but I don't want to depend on CPython-only implementation details, then I'm still allowed to use it.
AFAIK basically the only project that would be affected by this is PyPy, and I when I asked on #pypy they said:
<cfbolz> njs`: I think we either plan to or already support this
so I'm not sure why this is controversial.
-n
-- Nathaniel J. Smith -- https://vorpus.org _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/chris.barker%40noaa.gov
On Wed, Oct 19, 2016 at 2:57 AM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
The proposal is that it should be documented as being part of the language spec starting in 3.4 (or whatever).
Is the performance characteristics of any object part of the language spec?
I.e if someone wrote an implementation with an O(n) insert dict, it would suck, but wouldn't it still be Python?
This exact question came up when MicroPython wanted to represent Unicode strings internally as UTF-8. It was decided that having O(n) indexing/slicing was acceptable, and was something that the implementation could judge the value of. (Since uPy is designed for smaller systems than CPython is, O(n) is less significant.) ChrisA
participants (8)
-
Antoine Pitrou
-
Chris Angelico
-
Chris Barker - NOAA Federal
-
INADA Naoki
-
Jeff Allen
-
Nathaniel Smith
-
Serhiy Storchaka
-
Victor Stinner