[Tutor] simple unicode question

Sat Aug 23 03:56:53 CEST 2014

On Fri, Aug 22, 2014 at 02:10:21PM -0700, Albert-Jan Roskam wrote:
> Hi,
> 
> I have data that is either floats or byte strings in utf-8. I need to 
> cast both to unicode strings. I am probably missing something simple, 
> but.. in the code below, under "float", why does [B] throw an error 
> but [A] does not?

Unicode in Python 2 is a little more confusing than in Python 3. But 
let's see what is going on:

> >>> value = 1.0
> >>> unicode(value)      # [A]
> u'1.0'

This works for the same reason that str(value) works. The float 1.0
has exactly one standard string form, whether as a text or a byte
string, namely "1.0".

(Well, to be absolutely pedantic, Python could format numbers using
other numeral systems, like ۱.۰ in Eastern Arabic digits, but it
doesn't.)

> >>> unicode(value, sys.getdefaultencoding())  # [B]
> 
> Traceback (most recent call last):
>   File "<pyshell#22>", line 1, in <module>
>     unicode(value, sys.getdefaultencoding())
> TypeError: coercing to Unicode: need string or buffer, float found

Here, on the other hand, unicode sees that you are providing a second
argument, an encoding, so it treats the first argument as encoded bytes
to be decoded. It expects a string or buffer object, but gets a float,
so it raises a TypeError.
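
For data that mixes floats and UTF-8 byte strings, one way to handle
both cases is to decode only the bytes and convert everything else
directly. This is only a sketch: to_text is a made-up helper name, and
it assumes every byte string really is UTF-8. It is written so that it
should run on both Python 2 and Python 3.

```python
# Hypothetical helper, not part of the original post.
text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_text(value, encoding="utf-8"):
    """Return value as a text (Unicode) string."""
    if isinstance(value, bytes):  # bytes is an alias for str on Python 2
        # Bytes have to be decoded, so an encoding is needed here.
        return value.decode(encoding)
    # Floats (and anything else) are not encoded bytes, so no encoding
    # argument is passed.
    return text_type(value)


print(repr(to_text(1.0)))
print(repr(to_text(b"caf\xc3\xa9")))
```

The encoding argument is only ever used on the bytes branch, which is
the same distinction unicode() itself makes.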

You're probably thinking something along these lines:

"Unicode strings need an encoding, so I want to convert 1.0 into Unicode 
using the ASCII encoding (or the UTF-8 encoding)"

but that's not how it works. Unicode strings DON'T need an encoding. The
Unicode string "1.0", or for that matter "π÷⇒Ж", is a string of exactly
those characters and nothing else. In the same way that ASCII defines
128 characters, including A, B, C, ... Unicode defines (up to) 1114112
code points. There's no need to specify an encoding, because a Unicode
string is already made of characters, not bytes.

You only need to use an encoding when converting from Unicode to bytes
(encoding), or vice versa (decoding).
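
A small sketch of that round trip, using escapes so it should run
unchanged on Python 2 and 3. The variable names are just for
illustration:

```python
# Four characters: pi, division sign, double arrow, Cyrillic Zhe.
text = u"\u03c0\u00f7\u21d2\u0416"

# Unicode -> bytes: this is where an encoding comes in.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(len(text))         # 4 characters
print(len(utf8_bytes))   # 9 bytes in UTF-8
print(len(utf16_bytes))  # 8 bytes in UTF-16 (little-endian, no BOM)

# bytes -> Unicode: decoding with the matching encoding gives back
# the same text either way.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```

The text is the same four characters throughout; only the byte
representation depends on the encoding you pick.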

-- 
Steven