[Tutor] simple unicode question

Danny Yoo dyoo at hashcollision.org
Sat Aug 23 02:53:14 CEST 2014


On Fri, Aug 22, 2014 at 2:10 PM, Albert-Jan Roskam
<fomcl at yahoo.com.dmarc.invalid> wrote:
> Hi,
>
> I have data that is either floats or byte strings in utf-8. I need to cast both to unicode strings.


Just to be sure, I'm parsing the problem statement above as:

    data ::= float
                | utf-8-encoded-byte-string

because the alternative way to parse the statement in English:

    data ::= float-in-utf-8
                | byte-string-in-utf-8

doesn't make any technical sense.  :P



> I am probably missing something simple, but.. in the code below, under "float", why does [B] throw an error but [A] does not?
>

> # float: cannot explicitly give encoding, even if it's the default
> >>> value = 1.0
> >>> unicode(value)      # [A]
> u'1.0'
> >>> unicode(value, sys.getdefaultencoding())  # [B]
>
> Traceback (most recent call last):
>   File "<pyshell#22>", line 1, in <module>
>     unicode(value, sys.getdefaultencoding())
> TypeError: coercing to Unicode: need string or buffer, float found


Yeah.  Unfortunately, you're right: this doesn't make too much sense.

What's happening is that the standard library overloads two
_different_ behaviors onto the same function, unicode(). Which one you
get depends on whether we pass in a single argument or two.  I would
not try to reconcile the two uses into a single behavior: treat them
as two distinct behaviors.

Reference: https://docs.python.org/2/library/functions.html#unicode


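Specifically, the two-argument case is meant for when you've got an
uninterpreted source of bytes that should be decoded to Unicode using
the provided encoding.  The one-argument case instead converts an
object to its string representation, much like str() does.  That is
why [A] works on a float and [B] does not: a float is not a source of
bytes, so there is nothing to decode.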
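
To make the difference concrete, here is a small sketch of the two
forms side by side.  The try/except shim at the top is my own addition
(an assumption, not part of the original question) so that the snippet
also runs on Python 3, where str plays the role of unicode; the variable
names are purely illustrative.

```python
# Compatibility shim (an assumption for illustration): on Python 3,
# str plays the role that unicode plays on Python 2.
try:
    unicode
except NameError:
    unicode = str

# 1-arg form: converts an object to its text representation.
as_text = unicode(1.0)

# 2-arg form: decodes raw bytes using the named encoding.
decoded = unicode(b"caf\xc3\xa9", "utf-8")

# The 2-arg form rejects input that is not a source of bytes.
try:
    unicode(1.0, "utf-8")
    two_arg_float_failed = False
except TypeError:
    two_arg_float_failed = True
```

The last call fails with TypeError for the same reason [B] does: the
second argument asks for decoding, and a float has no bytes to decode.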


So for your problem statement, the function should look something like:

###############################
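def convert(data):
    # Floats: the 1-arg form gives the text representation, e.g. u'1.0'.
    if isinstance(data, float):
        return unicode(data)
    # Byte strings: the 2-arg form decodes the bytes as UTF-8.
    if isinstance(data, bytes):
        return unicode(data, "utf-8")
    raise ValueError("Unexpected data", data)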
###############################

where you must use unicode with either the 1-arg or 2-arg variant
based on your input data.
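
As a rough usage sketch, here is convert() applied to a mixed list.
The sample values and the names samples, results and rejected are made
up for illustration, and the shim at the top is again only there so the
snippet also runs on Python 3.

```python
# Compatibility shim (an assumption for illustration): on Python 3,
# str plays the role that unicode plays on Python 2.
try:
    unicode
except NameError:
    unicode = str


def convert(data):
    if isinstance(data, float):
        return unicode(data)
    if isinstance(data, bytes):
        return unicode(data, "utf-8")
    raise ValueError("Unexpected data", data)


# Hypothetical sample data: floats mixed with UTF-8 encoded byte strings.
samples = [1.0, b"na\xc3\xafve", 2.5]
results = [convert(item) for item in samples]

# Anything that is neither a float nor a byte string is rejected.
try:
    convert(42)
    rejected = False
except ValueError:
    rejected = True
```

Note that an int such as 42 is rejected here; if your data can contain
ints as well, you would need to widen the first isinstance check.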

