unicode, bytes redux

willie willie at jamots.com
Mon Sep 25 02:37:58 EDT 2006


(beating a dead horse)

Is it too ridiculous to suggest that it'd be nice
if the unicode object remembered the
encoding of the string it was decoded from,
so that it's feasible to calculate the number
of bytes that make up its code points?

# U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"

u = buf.decode('UTF-8')

# ... later ...

u.bytes()   # proposed method, would return 3

(bytes() would go through each code point and add up
the number of bytes each character takes
in the encoding the string was decoded from)
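
For what it's worth, something close to this can be approximated today
without touching the built-in type. Below is a minimal sketch in Python 3
syntax (where the decoded type is str and the raw buffer is a bytes
literal). The class name EncodedStr and its bytes() method are made up
for illustration and are not part of any library. It assumes a stateless
encoding without a byte order mark, such as UTF-8. With a codec like
'utf-16', encoding one character at a time would count a BOM for every
character.

```python
class EncodedStr(str):
    """Hypothetical str subclass that remembers its source encoding."""

    def __new__(cls, raw, encoding):
        # Decode the raw bytes, then keep the encoding name on the instance.
        self = super().__new__(cls, raw.decode(encoding))
        self.encoding = encoding
        return self

    def bytes(self):
        # Re-encode each code point on its own and add up the lengths.
        # Assumes a stateless, BOM-less codec (see the note above).
        return sum(len(ch.encode(self.encoding)) for ch in self)


# U+270C encoded as UTF-8: 11100010 10011100 10001100
buf = b"\xE2\x9C\x8C"

u = EncodedStr(buf, "utf-8")

# ... later ...

print(u.bytes())
```

Note that the count comes from re-encoding, not from the original buffer,
so it only matches the input when the codec round-trips exactly. Results of
slicing or concatenation are plain str objects and lose the remembered
encoding, which is part of why this would be more convenient in the
built-in type.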


