[Tutor] unicode decode/encode issue
Steven D'Aprano
steve at pearwood.info
Mon Sep 26 13:31:42 EDT 2016
On Mon, Sep 26, 2016 at 12:59:04PM -0400, bruce wrote:
> When I look at the input content, I have :
>
> u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>
> So, any pointers on replacing the \u2013 with a simple '-' (dash) (or I
> could even handle just a ' ' (space)
You misinterpret what you see. \u2013 *is* a dash (it's an en-dash):
py> import unicodedata
py> unicodedata.name(u'\u2013')
'EN DASH'
Try printing the string, and you will see what it looks like:
py> content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
py> print content
English 120 Course Syllabus – Fall – 2006
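If you do still want a plain ASCII hyphen in place of the en-dash, the
string's replace method should be enough. This is only a minimal sketch:
the name ascii_dashes is made up for illustration, and it assumes the
en-dash is the only non-ASCII character you care about. It is written so
that it runs on Python 2.7 and on Python 3.3 or later.

```python
# The same text as in the question; \u2013 is a single character, the en-dash.
content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'

# replace() returns a new string with every en-dash swapped for an
# ASCII hyphen-minus; the original string is left unchanged.
ascii_dashes = content.replace(u'\u2013', u'-')

print(ascii_dashes)  # English 120 Course Syllabus - Fall - 2006
```

Passing u' ' as the second argument instead would give you the space
you mentioned as an alternative.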
Python strings include a lot of escape codes. Simple byte strings
include:
\t tab
\n newline
\r carriage return
\0 ASCII null byte
etc.
plus escape codes for hex codes:
\xDD (two digit hex code, between hex 00 and hex FF)
That lets you enter any byte between (decimal) 0 and 255. For example:
\x20
is the hex code 20 (decimal 32), which is a space:
py> '\x20' == ' '
True
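A few more checks along the same lines may help. This is just an
illustrative sketch (the names tab and letters are invented here): each
escape code in the source stands for exactly one character in the string.

```python
# An escape code is several characters in the source, but one in the string.
tab = '\t'
assert len(tab) == 1

# \x20 is hex 20, which is decimal 32, the code for a space.
assert ord('\x20') == 32

# Hex escapes can spell ordinary letters too: hex 41 is 'A', hex 42 is 'B'.
letters = '\x41\x20\x42'
print(letters)  # A B
```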
Unicode strings allow the same escape codes as byte strings, plus
special Unicode escape codes:
\uDDDD (four digit hex codes, for codes between 0 and 65535)
\UDDDDDDDD (eight digit hex codes, for codes between 0 and 1114111)
\N{name} (Unicode names)
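To tie this back to the en-dash: as far as the resulting string goes,
the three Unicode escape forms are interchangeable. The sketch below
(the variable names are only for illustration) spells the same character
all three ways and compares them.

```python
import unicodedata

# Three spellings of the same single character, the en-dash.
four_digit = u'\u2013'
eight_digit = u'\U00002013'
by_name = u'\N{EN DASH}'

assert four_digit == eight_digit == by_name
assert len(by_name) == 1

print(unicodedata.name(four_digit))  # EN DASH
print(hex(ord(four_digit)))          # 0x2013
```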
Remember to print the string if you want to see the actual characters
rather than the escape codes.
--
Steve