[Tutor] unicode decode/encode issue

Steven D'Aprano steve at pearwood.info
Mon Sep 26 13:31:42 EDT 2016


On Mon, Sep 26, 2016 at 12:59:04PM -0400, bruce wrote:

> When I look at the input content, I have :
> 
>  u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> 
> So, any pointers on replacing the \u2013 with a simple '-' (dash) (or I
> could even handle just a ' ' (space)

You misinterpret what you see. \u2013 *is* a dash (it's an en-dash):

py> import unicodedata
py> unicodedata.name(u'\u2013')
'EN DASH'

Try printing the string, and you will see what it looks like:

py> content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
py> print content
English 120 Course Syllabus – Fall – 2006
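
If you still want a plain ASCII hyphen in place of the en-dash, one
possible approach is the string's replace method. This is only a sketch:
the names "cleaned", "dash_map" and "also_cleaned" are mine, not from
your code, and I use the print() call form so that the snippet runs on
both Python 2 and Python 3.

```python
content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'

# Replace every en-dash (U+2013) with an ASCII hyphen-minus.
cleaned = content.replace(u'\u2013', u'-')
print(cleaned)

# To handle several dash-like characters at once, a translation table
# keyed by code point is one option (here: en-dash and em-dash).
dash_map = {0x2013: u'-', 0x2014: u'-'}
also_cleaned = content.translate(dash_map)
print(also_cleaned == cleaned)
```

To get a space instead, pass u' ' as the replacement.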


Python string literals support a lot of escape codes. Simple byte
strings include:

\t tab
\n newline
\r carriage return
\0 ASCII null byte
etc.

plus escape codes for hex codes:

\xDD (two digit hex code, between hex 00 and hex FF)

That lets you enter any byte between (decimal) 0 and 255. For example:

\x20

is the hex code 20 (decimal 32), which is a space:

py> '\x20' == ' '
True
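
A few more checks along the same lines, as a small sketch. Each escape
stands for exactly one character in the string, however many characters
it takes to type:

```python
# '\x20' is typed with four characters but is a single space.
space = '\x20'
print(len(space))          # 1
print(ord(space))          # 32

# The named escapes are shorthand for particular hex codes.
print('\t' == '\x09')      # True
print('\n' == '\x0a')      # True
print('\x41' == 'A')       # True
```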


Unicode strings allow the same escape codes as byte strings, plus 
special Unicode escape codes:

\uDDDD (four digit hex codes, for codes between 0 and 65535)

\UDDDDDDDD (eight digit hex codes, for codes between 0 and 1114111)

\N{name}  (Unicode names)
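
As a quick sketch, here are all three forms used to spell the same
en-dash. The variable names are just for illustration:

```python
import unicodedata

four_digit = u'\u2013'        # \uDDDD form
eight_digit = u'\U00002013'   # \UDDDDDDDD form
by_name = u'\N{EN DASH}'      # \N{name} form

# All three spellings produce the same one-character string.
print(four_digit == eight_digit == by_name)   # True
print(len(four_digit))                        # 1
print(hex(ord(four_digit)))                   # 0x2013
print(unicodedata.name(by_name))              # EN DASH
```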


Remember to print the string to see the escape codes rendered as the
actual characters they stand for.
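
To convince yourself that the escape code is only a way of writing the
character, not what is stored, here is a minimal sketch. The name
"escaped" is mine, and I build the escaped form by hand with the
unicode_escape codec so that the result is the same on Python 2 and
Python 3:

```python
content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'

# Build the escaped spelling of the string as plain ASCII text.
escaped = content.encode('unicode_escape').decode('ascii')
print(escaped)

# The string itself holds one character per en-dash, not six.
print(content.count(u'\u2013'))        # 2
print(len(content), len(escaped))      # 41 51
```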



-- 
Steve

