compressing short strings?

inhahe inhahe at gmail.com
Wed May 21 11:15:45 EDT 2008


i don't see anybody mentioning huffman coding.  i think it just works per 
symbol (typically per byte), so it's not as tight as gzip or the like, but it 
sounds like it would be easy to implement and wouldn't require any corpus-wide 
compression information, apart from a character frequency count if you wanted 
the codes to be optimal.

<i don't know anything about compression>
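
here's a minimal per-character huffman sketch of what i mean (frequencies are 
counted from the string itself here; for the database case you'd count them 
once over all the strings and share the resulting code table):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a per-character Huffman code table from character frequencies."""
    freq = Counter(text)
    # Heap entries are (weight, tiebreak, tree); the unique tiebreak keeps
    # the heap from ever comparing trees to each other.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct character
        return {heap[0][2]: "0"}
    i = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, i, (t1, t2)))
        i += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

def encode(text, codes):
    """Encode text as a string of '0'/'1' bits (packing into bytes omitted)."""
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    """Decode a bit string; works because Huffman codes are prefix-free."""
    inv = {v: k for k, v in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return "".join(out)
```

(packing the bit string into actual bytes is left out; that part is just 
bookkeeping.)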

"Paul Rubin" <http://phr.cx@NOSPAM.invalid> wrote in message 
news:7xy7658k8o.fsf_-_ at ruckus.brouhaha.com...
>I have a lot of short English strings I'd like to compress in order to
> reduce the size of a database.  That is, I'd like a compression
> function that takes a string like (for example) "George Washington"
> and returns a shorter string, with luck maybe 6 bytes or so.  One
> obvious idea is take the gzip function, compress some large text
> corpus with it in streaming mode and throw away the output (but
> setting up the internal state to model the statistics of English
> text), then put in "George Washington" and treat the additional output
> as the compressed string.  Obviously to get reasonable speed there
> would have to be a way to save the internal state after initializing
> from the corpus.
>
> Anyone know if this has been done and if there's code around for it?
> Maybe I'm better off with freezing a dynamic Markov model?  I think
> there's DMM code around but am not sure where to look.
>
> Thanks. 
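
incidentally, something close to the priming idea above exists in zlib as a 
"preset dictionary" (exposed in python via the zdict parameter, though only 
from python 3.3 on).  a rough sketch, with a made-up corpus of president 
names standing in for real statistics drawn from the database:

```python
import zlib

# Hypothetical shared corpus: in practice you'd build this from text that
# resembles the strings being compressed, and keep it with the database.
CORPUS = (b"George Washington John Adams Thomas Jefferson "
          b"James Madison James Monroe Andrew Jackson Abraham Lincoln")

def compress(s):
    # Raw deflate (wbits=-15) skips the zlib header and checksum, which
    # matters when the whole output is only a handful of bytes.
    c = zlib.compressobj(level=9, wbits=-15, zdict=CORPUS)
    return c.compress(s.encode("utf-8")) + c.flush()

def decompress(data):
    # The decompressor must be given the exact same dictionary.
    d = zlib.decompressobj(wbits=-15, zdict=CORPUS)
    return (d.decompress(data) + d.flush()).decode("utf-8")
```

a string that appears in the dictionary compresses to just a back-reference, 
so with luck you get the "6 bytes or so" mentioned above; strings unlike the 
corpus won't shrink much.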




