[Python-Dev] Encoding variable-length integers/counts in pickle

MRAB python at mrabarnett.plus.com
Mon Jul 9 21:53:47 EDT 2018


In the regex module I use an encoding scheme when pickling pattern 
objects which is based on the way MIDI encodes variable-length integers, 
and I think it might have a use in a pickle protocol.

In the basic format, an integer is split up into 7-bit chunks, each 
chunk put into a byte, and the most-significant bit of the byte used to 
signal whether the value continues into the following byte.

And integer must be encoded into the minimum number of bytes, so an 
encoded sequence of bytes would never start with 0x80.

MIDI deviates from the basic idea in order to reduce the number of 
bytes, so as sequence of bytes in MIDI _can_ start with x080; this is 
fine for MIDI because it doesn't need to represent negative integers.

The format I'm suggesting uses an initial 0x80 as a way of letting it 
encode negative integers.

Here are a couple of Python functions that summarise the encoding and 
decoding (minus obvious optimisations for simplicity):


def encode_varint(value: int) -> List[int]:
     negative = value < 0
     encoded = []

     if negative:
         final = -1
     else:
         final = 0

     while True:
         encoded.append(0x80 | (value & 0x7F))
         value >>= 7

         if value == final:
             break

     if negative:
         encoded.append(0x80)

     encoded.reverse()

     encoded[-1] &= 0x7F

     return encoded


def decode_varint(encoded: Iterable[int]) -> int:
     byte = next(encoded)

     if byte == 0x80:
         value = -1
         byte = next(encoded)
     else:
         value = 0

     value = (value << 7) | (byte & 0x7F)

     while (byte & 0x80) != 0:
         byte = next(encoded)
         value = (value << 7) | (byte & 0x7F)

     return value


The advantage of encoding integers in this way is that there's no limit 
to their size, so there's no need to add a new protocol to support 
larger counts.

They can also make pickles smaller.

Example:

     # Pickle (None, )
     0: \x80 PROTO      4
     2: \x95 FRAME      4
    11: N    NONE
    12: \x85 TUPLE1
    13: \x94 MEMOIZE    (as 0)
    14: .    STOP

Here, FRAME takes an argument of 8 bytes. If you replaced FRAME with a 
version that accepted a variable-length count, you could reduce that 
argument to 1 byte.

You also wouldn't need to have different fixed-length versions of an 
opcode, e.g. BINUNICODE and SHORT_BINUNICODE.

Whether you do anything with this is entirely up to the core devs, I 
just thought someone might find it useful.


More information about the Python-Dev mailing list