[Python-Dev] Encoding variable-length integers/counts in pickle
Andrew McLean
lists at andros.org.uk
Tue Jul 10 06:03:03 EDT 2018
Google Protocol Buffers use something similar, which they call "
<https://developers.google.com/protocol-buffers/docs/encoding#top_of_page>Base
128 Varints"
https://developers.google.com/protocol-buffers/docs/encoding#varints
I prefer the way this handles negative numbers.
- Andrew
On 10 July 2018 at 02:53, MRAB <python at mrabarnett.plus.com> wrote:
> In the regex module I use an encoding scheme when pickling pattern objects
> which is based on the way MIDI encodes variable-length integers, and I
> think it might have a use in a pickle protocol.
>
> In the basic format, an integer is split up into 7-bit chunks, each chunk
> put into a byte, and the most-significant bit of the byte used to signal
> whether the value continues into the following byte.
>
> And integer must be encoded into the minimum number of bytes, so an
> encoded sequence of bytes would never start with 0x80.
>
> MIDI deviates from the basic idea in order to reduce the number of bytes,
> so as sequence of bytes in MIDI _can_ start with x080; this is fine for
> MIDI because it doesn't need to represent negative integers.
>
> The format I'm suggesting uses an initial 0x80 as a way of letting it
> encode negative integers.
>
> Here are a couple of Python functions that summarise the encoding and
> decoding (minus obvious optimisations for simplicity):
>
>
> def encode_varint(value: int) -> List[int]:
> negative = value < 0
> encoded = []
>
> if negative:
> final = -1
> else:
> final = 0
>
> while True:
> encoded.append(0x80 | (value & 0x7F))
> value >>= 7
>
> if value == final:
> break
>
> if negative:
> encoded.append(0x80)
>
> encoded.reverse()
>
> encoded[-1] &= 0x7F
>
> return encoded
>
>
> def decode_varint(encoded: Iterable[int]) -> int:
> byte = next(encoded)
>
> if byte == 0x80:
> value = -1
> byte = next(encoded)
> else:
> value = 0
>
> value = (value << 7) | (byte & 0x7F)
>
> while (byte & 0x80) != 0:
> byte = next(encoded)
> value = (value << 7) | (byte & 0x7F)
>
> return value
>
>
> The advantage of encoding integers in this way is that there's no limit to
> their size, so there's no need to add a new protocol to support larger
> counts.
>
> They can also make pickles smaller.
>
> Example:
>
> # Pickle (None, )
> 0: \x80 PROTO 4
> 2: \x95 FRAME 4
> 11: N NONE
> 12: \x85 TUPLE1
> 13: \x94 MEMOIZE (as 0)
> 14: . STOP
>
> Here, FRAME takes an argument of 8 bytes. If you replaced FRAME with a
> version that accepted a variable-length count, you could reduce that
> argument to 1 byte.
>
> You also wouldn't need to have different fixed-length versions of an
> opcode, e.g. BINUNICODE and SHORT_BINUNICODE.
>
> Whether you do anything with this is entirely up to the core devs, I just
> thought someone might find it useful.
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/lists%
> 40andros.org.uk
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20180710/117c6eb9/attachment.html>
More information about the Python-Dev
mailing list