unicode, bytes redux
Paul Rubin
Mon Sep 25 03:45:29 EDT 2006
willie <willie at jamots.com> writes:
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)
Duncan Booth explains why that doesn't work. But I don't see any big
problem with a byte count function that lets you specify an encoding:
    u = buf.decode('UTF-8')
    # ... later ...
    u.bytes('UTF-8') -> 3
    u.bytes('UCS-4') -> 4
That avoids creating a new encoded string in memory and, for
fixed-width encodings such as UCS-4, avoids having to scan the unicode
string to add up the lengths: the count is just four times the number
of code points.
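
Here is a rough sketch of how such a byte count could work, written for
present-day Python 3 as a standalone function rather than a method. The
names byte_count and _utf8_width are made up for illustration; Python
strings have no bytes() method like the one proposed above. It assumes
'UCS-4' means four bytes per code point with no byte order mark, and it
only special-cases UTF-8 and UCS-4. Anything else falls back to actually
encoding the string.

```python
def _utf8_width(code_point):
    """Number of bytes UTF-8 uses for one code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def byte_count(text, encoding):
    """Hypothetical helper: encoded size of text in bytes.

    Avoids building the encoded string for the encodings it knows about.
    """
    name = encoding.upper().replace('_', '-')
    if name == 'UCS-4':
        # Fixed width (assumed: no byte order mark), so no scan is needed.
        return 4 * len(text)
    if name == 'UTF-8':
        # Variable width: scan the code points and add up their sizes.
        return sum(_utf8_width(ord(ch)) for ch in text)
    # Unknown encoding: fall back to encoding it, which does allocate.
    return len(text.encode(encoding))


u = b"\xE2\x9C\x8C".decode('UTF-8')   # U+270C
print(byte_count(u, 'UTF-8'))         # 3
print(byte_count(u, 'UCS-4'))         # 4
```

The UTF-8 branch still has to walk the string, but it only keeps a
running total instead of allocating a second copy of the data.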