unicode and hashlib

Jeff H dundeemt at gmail.com
Sun Nov 30 03:54:10 CET 2008

On Nov 29, 12:23 pm, Scott David Daniels <Scott.Dani... at Acm.Org> wrote:
> Scott David Daniels wrote:
> ...
> > If you now, and for all time, decide that the only source you will take
> > is cp1252, perhaps you should decode to cp1252 before hashing.
> Of course my dyslexia sticks out here as I get encode and decode exactly
> backwards -- Marc 'BlackJack' Rintsch has it right.
> Characters (a concept) are "encoded" to a byte format (representation).
> Bytes (a precise representation) are "decoded" to characters (a format
> with semantics).
> --Scott David Daniels
> Scott.Dani... at Acm.Org

Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object as
'ascii' (my default encoding), and since the string contained
characters above 128 - shhh'boom.  So once I have character strings
decoded internally to unicode objects, I should encode them as 'utf-8'
before handing them to anything that would otherwise guess at the
proper encoding (e.g. hashlib).

>>> import hashlib
>>> a = 'André'                # cp1252-encoded byte string
>>> b = unicode(a, 'cp1252')   # decode bytes -> unicode object
>>> b
u'Andr\xe9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
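For anyone reading this later: in Python 3 (where str is already unicode and the old byte strings are bytes), the same mistake surfaces as an explicit TypeError instead of a silent ascii encode, and the fix is identical - encode before hashing. A minimal sketch (the cp1252 input is written as a bytes literal here, since that's what you'd read from a file):

```python
import hashlib

raw = b'Andr\xe9'               # cp1252-encoded bytes, e.g. read from a file
text = raw.decode('cp1252')     # decode bytes -> str (unicode)

try:
    hashlib.md5(text)           # hashlib refuses str: it needs bytes
except TypeError:
    pass                        # "Strings must be encoded before hashing"

# encode explicitly, so nothing has to guess at an encoding:
digest = hashlib.md5(text.encode('utf-8')).hexdigest()
print(digest)
```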

Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, which each come in two
variants (big- and little-endian), and which variant you get can
depend on the installed software and/or the processor. utf-8, unlike
-16/-32, produces the same bytes irrespective of software or hardware.
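A quick way to see the variant problem (a Python 3 sketch; the same point holds in Python 2 with a u'' literal):

```python
s = 'André'

# utf-16 comes in big- and little-endian flavours; the bare name
# 'utf-16' prepends a byte-order mark (BOM) and uses native byte order:
le = s.encode('utf-16-le')   # little-endian, no BOM
be = s.encode('utf-16-be')   # big-endian, no BOM - different bytes!
bom = s.encode('utf-16')     # BOM + platform-dependent byte order

# utf-8 has exactly one byte sequence for the same text, everywhere:
u8 = s.encode('utf-8')
print(u8)
```

So two machines hashing the "same" utf-16 string can disagree, while utf-8 digests stay reproducible.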

decode vs encode
You decode from a byte string (in some character encoding) to a unicode object.
You encode from a unicode object to a byte string in a specified character encoding.
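In Python 3 terms (where unicode objects are simply str and byte strings are bytes), that rule is a round trip:

```python
raw = b'Andr\xe9'            # bytes in some known encoding (here cp1252)

text = raw.decode('cp1252')  # decode: bytes -> str (unicode)
out = text.encode('utf-8')   # encode: str -> bytes in a chosen encoding

print(text)   # André
print(out)    # b'Andr\xc3\xa9'
```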

Please correct me if you see something wrong and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
