[Tutor] What is ™
Steven D'Aprano
steve at pearwood.info
Fri Dec 23 00:30:25 CET 2011
bob gailer wrote:
> >>> "™"
> '\xe2\x84\xa2'
>
> What is this hex string?
Presumably you are running Python 2, yes? The following explanation assumes
that you are.
You have just run smack bang into a collision between text and bytes, and in
particular, the fact that your console is probably Unicode aware, but Python's
so-called strings are by default bytes and not text.
When you enter "™", your console is more than happy to allow you to enter a
Unicode trademark character[1] and put it in between " " delimiters. This
creates a plain bytes string. But the ™ character is not a byte, and shouldn't
be treated as one. Arguably Python should raise an error, but instead it
quietly stores whatever bytes your console sends for that character, in the
console's encoding. (Probably UTF-8.) The three hex bytes you actually get are
the TM character in that encoding.
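
To see that these really are three separate bytes rather than one character,
here is a small sketch. The b'' prefix and the use of bytearray are my own
choices, not part of the original session; they are there only so the snippet
behaves the same on Python 2.6+ and Python 3.

```python
# The three bytes from the original session, written as an explicit bytes literal.
raw = b'\xe2\x84\xa2'

# bytearray yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(raw))
hex_values = ['%02x' % b for b in byte_values]

print(len(raw))       # 3
print(byte_values)    # [226, 132, 162]
print(hex_values)     # ['e2', '84', 'a2']
```

The length is 3 because a byte string counts bytes, not characters.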
Python 2 does have proper text strings, but you have to write them as unicode
strings, with a u prefix:
py> s = u"™"
py> len(s)
1
py> s
u'\u2122'
py> print s
™
py> s.encode('utf-8')
'\xe2\x84\xa2'
Notice that encoding the trademark character to UTF-8 gives the same sequence
of bytes as you got from the plain "™" string, which supports my guess that
UTF-8 is the encoding in use.
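
The match is meaningful because other encodings produce different bytes for
the same character. A rough sketch, where the choice of UTF-16-LE, cp1252 and
ASCII for comparison is mine and not taken from the original session:

```python
tm = u'\u2122'                     # TRADE MARK SIGN as a text string

utf8 = tm.encode('utf-8')          # three bytes: e2 84 a2
utf16 = tm.encode('utf-16-le')     # two bytes: 22 21
cp1252 = tm.encode('cp1252')       # one byte: 99

# ASCII has no trademark character at all, so encoding fails.
try:
    tm.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(utf8), repr(utf16), repr(cp1252), ascii_ok)
```

Only UTF-8 gives the e2 84 a2 sequence seen in the question.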
If you take the three-byte string and decode it using UTF-8, you will get the
trademark character back.
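
A minimal sketch of that round trip, written with explicit b'' and u''
prefixes (my addition) so that it runs unchanged on Python 2.6+ and Python 3:

```python
raw = b'\xe2\x84\xa2'          # the bytes from the original session

text = raw.decode('utf-8')     # bytes -> text
round_trip = text.encode('utf-8')   # text -> bytes again

print(len(raw))                # 3
print(len(text))               # 1
print(text == u'\u2122')       # True
print(round_trip == raw)       # True
```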
If all my talk of encodings doesn't mean anything to you, you should read this:
http://www.joelonsoftware.com/articles/Unicode.html
[1] Assuming your console is set to use the same encoding as my mail client is
using. Otherwise I'm seeing something different to you.
--
Steven