creating very small types

Bengt Richter bokr at oz.net
Thu Apr 28 13:25:28 EDT 2005


On Thu, 28 Apr 2005 05:07:34 GMT, bokr at oz.net (Bengt Richter) wrote:

... some not quite correct code ;-/
(I copy/pasted and created an illusion. My code dict has no EOS, so
I decode pad zero bits as code that a single zero stands for ('a' in this case)
so that was an oversight. I should have extended the coding some more or reserved
perhaps 'F' as EOString, and tested for EOS in the decode stream.

>
This is revised, sorry.
You can make a 'way more efficient decoder as an exercise ;-)
Hint: if you make a dict of all the 2**width integers as keys where
width is your widest code (4 here for 2**4), then you can do a single
mask of 2**width-1 and look up the translation to (char, codewidth) directly,
according to the codewidth least significant bits, if you make all the states
of the other bits into keys that map to the same (char,codewidth).
So for example all the even values of out 2**16 would have to map to ('a', 1)
and all the values with 2 LSBs of 01 (of which there are 4) would map to ('b',2)
and so forth. Thus the incrementing of width and trying one thing after
another is not necessary. I think that will work ...

Well, now Andrea has something to do ;-)
>----< trivcodec.py >------------------------------------------------
#trivcodec.py
EOS = ''
import itertools
def encode(stream):
    outchar = count = 0
    for char in itertools.chain(stream, [EOS]):
        bits, width = codes[char]
        outchar |= (bits<<count)
        count += width
        while count >= 8:
            yield chr(outchar&0xff)
            outchar >>= 8
            count -= 8
    if count:
        yield chr(outchar)

def decode(stream):
    codebuf = count = 0
    width = 1; char = None
    for codebyte in stream:
        codebyte = ord(codebyte)
        codebuf |= (codebyte<<count)
        count += 8
        if width>count:
            continue
        while width <= count:
            code = (codebuf&((1<<width)-1), width)
            if code not in chars:
                width += 1
                continue
            char = chars[code]
            if char == EOS: break
            yield char
            codebuf >>= width
            count -= width
            width = 1
        if char == EOS: break
        width = 1 

#trivial encoding dict: a=0 b=01 C=0011 D=0111 E=1011 EOS=1111
codes = {'a':(0x00,1), 'b':(0x01, 2), 
         'C':(0x3+0x4*0,4), 'D':(0x3+0x4*1,4),
         'E':(0x3+0x4*2,4), EOS:(0x3+0x4*3,4)}
chars = dict([(v,k) for k,v in codes.items()]) # reverse

def test(*tests):
    if not tests: tests = ['abaDECbaabbD']
    for charstream in tests:
        print
        codestream = ''.join(list(encode(charstream)))
        print '%r [%s] -(encode)-> %r [%s]' % (
            charstream, len(charstream), codestream, len(codestream))
        recovered = ''.join(list(decode(codestream)))
        print '%r [%s] -(decode)-> %r [%s]' % (
            codestream, len(codestream), recovered, len(recovered))
     
if __name__ == '__main__':
    import sys
    test(*sys.argv[1:])
>--------------------------------------------------------------------
>
>Result:
Not really. Tack enough a's on the recovered chars to account for zero fill
of the last encoded byte ;-/
>
>[22:02] C:\pywk\clp>py24 trivcodec.py
>
>'abaFECbaabbDF' [13] -(encode)-> '\xf2;Q\xf7' [4]
>'\xf2;Q\xf7' [4] -(decode)-> 'abaFECbaabbDF' [13]

Fine, because the [4] bytes were totally filled.
>
>[22:02] C:\pywk\clp>py24 trivcodec.py  a b C D E F
>
>'a' [1] -(encode)-> '\x00' [1]
>'\x00' [1] -(decode)-> 'a' [1]
Not so fine. There is no character serving as EOS mark.
>
>'b' [1] -(encode)-> '\x01' [1]
>'\x01' [1] -(decode)-> 'b' [1]
>
>'C' [1] -(encode)-> '\x03' [1]
>'\x03' [1] -(decode)-> 'C' [1]
>
>'D' [1] -(encode)-> '\x07' [1]
>'\x07' [1] -(decode)-> 'D' [1]
>
>'E' [1] -(encode)-> '\x0b' [1]
>'\x0b' [1] -(decode)-> 'E' [1]
>
>'F' [1] -(encode)-> '\x0f' [1]
>'\x0f' [1] -(decode)-> 'F' [1]

That really produced:

[ 8:03] C:\pywk\clp>py24  trivcodec.py a b C D E F

'a' [1] -(encode)-> '\x00' [1]
'\x00' [1] -(decode)-> 'aaaaaaaa' [8]

'b' [1] -(encode)-> '\x01' [1]
'\x01' [1] -(decode)-> 'baaaaaa' [7]

'C' [1] -(encode)-> '\x03' [1]
'\x03' [1] -(decode)-> 'Caaaa' [5]

'D' [1] -(encode)-> '\x07' [1]
'\x07' [1] -(decode)-> 'Daaaa' [5]

'E' [1] -(encode)-> '\x0b' [1]
'\x0b' [1] -(decode)-> 'Eaaaa' [5]

'F' [1] -(encode)-> '\x0f' [1]
'\x0f' [1] -(decode)-> 'Faaaa' [5]

So we need to assign a code as the EOS.
And probably translate that to '' as the return char,
and end on that.

So now it does (having given up 1111 for EOS instead of F):

[10:22] C:\pywk\clp>py24  trivcodec.py

'abaDECbaabbD' [12] -(encode)-> 'r;Q\xf7' [4]
'r;Q\xf7' [4] -(decode)-> 'abaDECbaabbD' [12]

[10:22] C:\pywk\clp>py24  trivcodec.py a b C D E

'a' [1] -(encode)-> '\x1e' [1]
'\x1e' [1] -(decode)-> 'a' [1]

'b' [1] -(encode)-> '=' [1]
'=' [1] -(decode)-> 'b' [1]

'C' [1] -(encode)-> '\xf3' [1]
'\xf3' [1] -(decode)-> 'C' [1]

'D' [1] -(encode)-> '\xf7' [1]
'\xf7' [1] -(decode)-> 'D' [1]

'E' [1] -(encode)-> '\xfb' [1]
'\xfb' [1] -(decode)-> 'E' [1]

Goes to show you, eh? Slice and dice similar lines carefully ;-)

Regards,
Bengt Richter



More information about the Python-list mailing list