Why does the "".join(r) do this?

Peter Otten __peter__ at web.de
Fri May 21 04:04:51 EDT 2004


Jim Hefferon wrote:

> Peter Otten <__peter__ at web.de> wrote
>> So why doesn't it just concatenate? Because there is no way of knowing
>> how to properly decode chr(174) or any other non-ascii character to
>> unicode:
>> 
>> >>> chr(174).decode("latin1")
>> u'\xae'
>> >>> chr(174).decode("latin2")
>> u'\u017d'
>> >>>
> 
> Forgive me, Peter, but you've only rephrased my question: I'm going to
> decode them later, so why does the concatenator insist on decoding
> them now?  As I understand it (perhaps this is my error),
> encoding/decoding is stuff that you do external to manipulating the
> arrays of characters.

Perhaps another example will help in addition to the answers already given:

>>> 1 + 2.0
3.0

In the above, 1 is converted to 1.0 before it can be added to 2.0, i.e. we
effectively have

>>> float(1) + 2.0
3.0

In the same spirit

>>> u"a" + "b"
u'ab'

"b" is converted to unicode before u"a" and u"b" can be concatenated. The
same goes for string formatting:

>>> "a%s" % u"b"
u'ab'
>>> u"a%s" % "b"
u'ab'
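As an aside for modern readers (not part of the original 2004 exchange): Python 3 later dropped this implicit coercion altogether, so mixing the two types raises an error instead of silently decoding:

```python
# Python 3: str and bytes never mix implicitly; the TypeError forces
# an explicit .decode()/.encode() at the point of combination.
try:
    "a" + b"b"
except TypeError as exc:
    print("mixing str and bytes fails:", exc)
```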

The following might be the conversion function:

>>> def tounicode(s, encoding="ascii"):
...     return s.decode(encoding)
...
>>> u"a" + tounicode("b")
u'ab'

Of course it will fail for non-ASCII characters in the string to be
converted. Why not allow strings with all 256 byte values, then? Again, as
stated in my post above, that would be ambiguous:

>>> u"a" + tounicode(chr(174), "latin1")
u'a\xae'
>>> u"a" + tounicode(chr(174), "latin2")
u'a\u017d'
>>>
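The same ambiguity can be shown in today's Python 3, where the str/unicode pair became bytes/str (a side note for modern readers, not from the original thread):

```python
# The single byte 0xAE decodes to different characters depending on
# which encoding we assume -- there is no one right answer.
b = bytes([174])                 # b'\xae'
print(b.decode("latin1"))        # U+00AE, REGISTERED SIGN
print(b.decode("latin2"))        # U+017D, LATIN CAPITAL LETTER Z WITH CARON
```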

By the way, in the real conversion routine the encoding isn't hardcoded; see
sys.getdefaultencoding()/sys.setdefaultencoding() for the details. You
_could_ therefore modify site.py to assume e.g. latin1 as the encoding of
8-bit strings. The practical benefit of that is limited, though, as you
cannot make assumptions about machines not under your control, and so you
are stuck with ascii as the least common denominator for scripts meant to be
portable - which brings us back to:

>> Use either unicode or str, but don't mix them. That should keep you out
>> of trouble.

Or make all conversions explicit with the str.decode()/unicode.encode()
methods.
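A minimal sketch of that discipline, in Python 3 syntax with a made-up byte string (decode once at the input boundary, work in a single text type, encode once on the way out):

```python
raw = b"caf\xe9"              # bytes arriving from outside; encoding known (here: latin1)
text = raw.decode("latin1")   # explicit decode at the boundary
banner = "menu: " + text      # all internal work stays in the text type
out = banner.encode("utf-8")  # explicit encode on output
print(out)
```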

> Well, I got this string as the filename of some kind of Macintosh file
> (I'm on Linux but I'm working with an archive that contains some pre-X
> Mac stuff) while calling some os and os.path functions.  So I'm taking
> strings from a Python library function (and using % to stuff them into
> strings that will end up on the web, which should preserve
> unicode-type-ness, right?) and then .join-ing them.
> 
> I didn't go into the whole story when posting, because I tried to boil
> the question down.  Perhaps I should have.

While details often help to identify a problem that differs from the
poster's guess, unicode handling is fairly general, and it was rather my
post that was lacking clarity.

Peter



