[Tutor] unicode decode/encode issue

Steven D'Aprano steve at pearwood.info
Mon Sep 26 13:31:42 EDT 2016

On Mon, Sep 26, 2016 at 12:59:04PM -0400, bruce wrote:

> When I look at the input content, I have :
>  u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> So, any pointers on replacing the \u2013 with a simple '-' (dash) (or I
> could even handle just a ' ' (space)

You misinterpret what you see. \u2013 *is* a dash (it's an en-dash):

py> import unicodedata
py> unicodedata.name(u'\u2013')
'EN DASH'

Try printing the string, and you will see what it looks like:

py> content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
py> print content
English 120 Course Syllabus – Fall – 2006
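
As for the replacement you asked about, here is a minimal sketch. It
assumes a plain ASCII hyphen is an acceptable stand-in for the en-dash;
the names EN_DASH and cleaned are only illustrative:

```python
# Sketch: swap each en-dash (U+2013) for an ASCII hyphen-minus.
# EN_DASH and cleaned are illustrative names, not from the original post.
EN_DASH = u'\u2013'

content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
cleaned = content.replace(EN_DASH, u'-')

# The result contains only ASCII characters.
print(cleaned)
```

Passing u' ' instead of u'-' as the second argument gives the space
variant.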

Python strings include a lot of escape codes. Simple byte strings include:

\t tab
\n newline
\r carriage return
\0 ASCII null byte

plus escape codes for hex codes:

\xDD (two digit hex code, between hex 00 and hex FF)

That lets you enter any byte between (decimal) 0 and 255. For example:

\x20

is the hex code 20 (decimal 32), which is a space:

py> '\x20' == ' '
True
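
A quick way to convince yourself that each escape code stands for a
single character is to look at its ordinal value. This is only a sketch;
the dictionary names escapes and codes are made up for the illustration:

```python
# Each escape code in the source text becomes exactly one character.
# The names "escapes" and "codes" are illustrative only.
escapes = {
    'tab': '\t',
    'newline': '\n',
    'carriage return': '\r',
    'null': '\0',
    'space': '\x20',      # hex 20 is decimal 32
}

# Map each name to the ordinal (numeric code) of its character.
codes = dict((name, ord(ch)) for name, ch in escapes.items())

for name in sorted(codes):
    print("%s -> %d" % (name, codes[name]))
```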

Unicode strings allow the same escape codes as byte strings, plus 
special Unicode escape codes:

\uDDDD (four digit hex codes, for codes between 0 and 65535)

\UDDDDDDDD (eight digit hex codes, for codes between 0 and 1114111)

\N{name}  (Unicode names)
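
All three forms can name the same character. As a small sketch (the
variable names are just for illustration), here is the en-dash written
each way:

```python
import unicodedata

# Three spellings of the same character, the en-dash (U+2013).
# The variable names are illustrative only.
four_digit = u'\u2013'
eight_digit = u'\U00002013'
by_name = u'\N{EN DASH}'

# All three literals produce the identical one-character string.
same = (four_digit == eight_digit == by_name)
print(same)
print(len(four_digit))
print(unicodedata.name(four_digit))
```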

Remember to print the string to see the actual characters rather than 
the escape codes.

