[Python-Dev] Encoding variable-length integers/counts in pickle
MRAB
python at mrabarnett.plus.com
Mon Jul 9 21:53:47 EDT 2018
In the regex module I use an encoding scheme when pickling pattern
objects which is based on the way MIDI encodes variable-length integers,
and I think it might have a use in a pickle protocol.
In the basic format, an integer is split up into 7-bit chunks, each
chunk put into a byte, and the most-significant bit of the byte used to
signal whether the value continues into the following byte.
And integer must be encoded into the minimum number of bytes, so an
encoded sequence of bytes would never start with 0x80.
MIDI deviates from the basic idea in order to reduce the number of
bytes, so as sequence of bytes in MIDI _can_ start with x080; this is
fine for MIDI because it doesn't need to represent negative integers.
The format I'm suggesting uses an initial 0x80 as a way of letting it
encode negative integers.
Here are a couple of Python functions that summarise the encoding and
decoding (minus obvious optimisations for simplicity):
def encode_varint(value: int) -> List[int]:
negative = value < 0
encoded = []
if negative:
final = -1
else:
final = 0
while True:
encoded.append(0x80 | (value & 0x7F))
value >>= 7
if value == final:
break
if negative:
encoded.append(0x80)
encoded.reverse()
encoded[-1] &= 0x7F
return encoded
def decode_varint(encoded: Iterable[int]) -> int:
byte = next(encoded)
if byte == 0x80:
value = -1
byte = next(encoded)
else:
value = 0
value = (value << 7) | (byte & 0x7F)
while (byte & 0x80) != 0:
byte = next(encoded)
value = (value << 7) | (byte & 0x7F)
return value
The advantage of encoding integers in this way is that there's no limit
to their size, so there's no need to add a new protocol to support
larger counts.
They can also make pickles smaller.
Example:
# Pickle (None, )
0: \x80 PROTO 4
2: \x95 FRAME 4
11: N NONE
12: \x85 TUPLE1
13: \x94 MEMOIZE (as 0)
14: . STOP
Here, FRAME takes an argument of 8 bytes. If you replaced FRAME with a
version that accepted a variable-length count, you could reduce that
argument to 1 byte.
You also wouldn't need to have different fixed-length versions of an
opcode, e.g. BINUNICODE and SHORT_BINUNICODE.
Whether you do anything with this is entirely up to the core devs, I
just thought someone might find it useful.
More information about the Python-Dev
mailing list