unicode, bytes redux

Mon Sep 25 04:11:52 EDT 2006

Paul Rubin wrote:
> Duncan Booth explains why that doesn't work.  But I don't see any big
> problem with a byte count function that lets you specify an encoding:
> 
>      u = buf.decode('UTF-8')
>      # ... later ...
>      u.bytes('UTF-8') -> 3
>      u.bytes('UCS-4') -> 4
> 
> That avoids creating a new encoded string in memory, and for some
> encodings, avoids having to scan the unicode string to add up the
> lengths.

It requires a fairly large change to code and API for a relatively 
uncommon problem. How often do you need to know how many bytes an 
encoded Unicode string takes up without needing the encoded string itself?
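
For reference, here is a minimal sketch of how the count can be obtained with what already exists. It is written for Python 3 and is only an illustration. The name encoded_length and the chunk_size parameter are made up for this example and are not part of any proposed or existing API. It feeds the string through an incremental encoder a slice at a time, so only one encoded chunk is in memory at once. It still scans the whole string. 'utf-32-be' stands in for the 'UCS-4' of the quoted example, on the assumption that a BOM-free four-byte encoding was meant.

```python
import codecs


def encoded_length(text, encoding="utf-8", chunk_size=4096):
    """Return the number of bytes `text` occupies in `encoding`.

    Hypothetical helper: encodes `text` one slice at a time, so the full
    encoded string is never held in memory at once.
    """
    encoder = codecs.getincrementalencoder(encoding)()
    total = 0
    for start in range(0, len(text), chunk_size):
        total += len(encoder.encode(text[start:start + chunk_size]))
    # Flush any state the encoder is still holding.
    total += len(encoder.encode("", final=True))
    return total


u = b"\xe2\x82\xac".decode("utf-8")    # a single character, U+20AC
print(encoded_length(u, "utf-8"))      # 3
print(encoded_length(u, "utf-32-be"))  # 4
```

For short strings, len(u.encode('utf-8')) gives the same number and is simpler. The chunked version only helps when the temporary encoded copy is too large to be acceptable.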