About size of Unicode string

Mon Jun 6 16:02:40 EDT 2005

Frank Abel Cancio Bello wrote:
> Can I get how many bytes have a string object independently of its encoding?
> Is the "len" function the right way of get it?

No.  len(unicode_string) returns the number of characters in the
unicode_string.

The number of bytes depends on how the Unicode characters are
represented.  Different encodings use different numbers of bytes.

>>> u = u"G\N{Latin small letter A with ring above}"
>>> u
u'G\xe5'
>>> len(u)
2
>>> u.encode("utf-8")
'G\xc3\xa5'
>>> len(u.encode("utf-8"))
3
>>> u.encode("latin1")
'G\xe5'
>>> len(u.encode("latin1"))
2
>>> u.encode("utf16")
'\xfe\xff\x00G\x00\xe5'
>>> len(u.encode("utf16"))
6
>>> 
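
The same comparison can be wrapped in a small helper.  This is only
a sketch: the name encoded_size is made up for illustration and is
not a standard library function.  It encodes the string and counts
the resulting bytes, which is the only reliable way to get a size.

```python
def encoded_size(text, encoding):
    """Return how many bytes `text` occupies once encoded with `encoding`."""
    return len(text.encode(encoding))

# The same two-character string as in the session above.
text = u"G\xe5"

# Byte counts for a few encodings; the character count stays at 2.
sizes = {}
for name in ("utf-8", "latin-1", "utf-16"):
    sizes[name] = encoded_size(text, name)
```

The "utf-16" codec prepends a two-byte byte order mark, which is why
that size is 6 rather than 4.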

> Laci look the following code:
> 
> 	import urllib2
> 	request = urllib2.Request(url= 'http://localhost:6000')
> 	data = 'data to send\n'.encode('utf_8')
> 	request.add_data(data)
> 	request.add_header('content-length', str(len(data)))
> 	request.add_header('content-encoding', 'UTF-8')
> 	file = urllib2.urlopen(request)
> 
> Is always true that "the size of the entity-body" is "len(data)"
> independently of the encoding of "data"?

For this case it is true because the logical length of 'data'
(which is a byte string) is equal to the number of bytes in the
string, and the utf-8 encoding of a byte string with character
values in the range 0-127, inclusive, is unchanged from the
original string.

In general, for example when 'data' is a unicode string, no.

len() returns the logical length of 'data'.  That number does
not need to be the number of bytes used to represent 'data'.
To get the bytes you must encode the object.
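
For the content-length case, one way to stay safe is to encode
first and measure the encoded bytes, never the original string.
A minimal sketch, assuming the body is sent as UTF-8; the function
name build_body_and_length is hypothetical and no request is made:

```python
def build_body_and_length(text, encoding="utf-8"):
    """Encode `text` and return (body_bytes, content_length_string)."""
    body = text.encode(encoding)
    # Measure the encoded bytes, not the characters in `text`.
    return body, str(len(body))

# 17 characters, one of which needs two bytes in UTF-8.
message = u"data to send: G\xe5\n"
body, content_length = build_body_and_length(message)
```

Here len(message) is 17 but the body is 18 bytes long, so using
len(message) for the header would be off by one.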

				Andrew
				dalke at dalkescientific.com