unicode and hashlib
dundeemt at gmail.com
Sun Nov 30 03:54:10 CET 2008
On Nov 29, 12:23 pm, Scott David Daniels <Scott.Dani... at Acm.Org> wrote:
> Scott David Daniels wrote:
> > If you now, and for all time, decide that the only source you will take
> > is cp1252, perhaps you should decode to cp1252 before hashing.
> Of course my dyslexia sticks out here as I get encode and decode exactly
> backwards -- Marc 'BlackJack' Rintsch has it right.
> Characters (a concept) are "encoded" to a byte format (representation).
> Bytes (a precise representation) are "decoded" to characters (a format
> with semantics).
> --Scott David Daniels
> Scott.Dani... at Acm.Org
Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object
as 'ascii' (my default encoding), and since my data contained characters
>128 - shhh'boom. So once I have character strings transformed
internally to unicode objects, I should encode them in 'utf-8' before
handing them to anything that would otherwise guess at the proper
encoding for further processing (e.g. hashlib).
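A minimal sketch of the failure and the fix (modern Python 3 syntax; in 2008-era Python 2 the implicit ascii encode raised UnicodeEncodeError instead of TypeError, but the cause was the same):

```python
import hashlib

text = "naïve"  # a string containing a character above 128

# Passing the string directly fails: hashlib needs bytes, and there is
# no safe default way for it to encode arbitrary characters for you.
try:
    hashlib.md5(text)
except TypeError as exc:
    print("hashing str failed:", exc)

# Encoding explicitly to UTF-8 first makes the result well-defined
# and reproducible on any machine.
digest = hashlib.md5(text.encode("utf-8")).hexdigest()
print(digest)
```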
Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, which both have two variants
(big- and little-endian), and which variant you get can depend on
installed software and/or processors. utf-8, unlike -16/-32, stays
reliable and reproducible irrespective of software or hardware.
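The variant problem is easy to see by encoding a single character both ways (a sketch; byte values checked against the Unicode tables for U+00E9):

```python
text = "é"  # U+00E9

# UTF-16 has two byte orders, and the plain "utf-16" codec picks the
# native one and prepends a byte-order mark, so output is machine-dependent.
print(text.encode("utf-16-le"))  # b'\xe9\x00'
print(text.encode("utf-16-be"))  # b'\x00\xe9'

# UTF-8 has exactly one byte sequence per character, everywhere:
print(text.encode("utf-8"))      # b'\xc3\xa9'
```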
decode vs encode
You decode from a byte string (in some character set) to a unicode object.
You encode from a unicode object to a specified character set.
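A round-trip sketch of that rule, using the cp1252 example from earlier in the thread (cp1252 byte 0xE9 is "é"):

```python
raw = b"\xe9"                       # bytes as they arrived (cp1252)

text = raw.decode("cp1252")         # decode: bytes -> unicode object
assert text == "é"

utf8_bytes = text.encode("utf-8")   # encode: unicode object -> bytes
assert utf8_bytes == b"\xc3\xa9"
```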
Please correct me if you see something wrong and thank you for your
advice and direction.
u'unicordial-ly yours. ;)'