[Tutor] unicode decode/encode issue

Mon Sep 26 13:54:44 EDT 2016

bruce wrote:

> Hi.
> 
> I've got a "basic" situation that should be simple. So it must be a user
> (me) issue!
> 
> 
> I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> ascii. I'm not worried about any cruft that might get stripped out as the
> data is generated from a US site. (It's a college/class dataset).
> 
> I know this is a unicode issue. I know I need to have a much more
> robust/pythonic/correct approach. I will later, but for now, just want to
> resolve this issue, and get it off my plate so to speak.
> 
> I've looked at stackoverflow, as well as numerous other sites, so I turn
> to the group for a pointer or two...
> 
> The unicode character that I'm dealing with is u'\u2013'
> 
> The basic things I've done up to now are:
> 
>   s=content
>   s=ascii_strip(s)
>   s=s.replace('\u2013', '-')
>   s=s.replace(u'\u2013', '-')
>   s=s.replace(u"\u2013", "-")
>   s=re.sub(u"\u2013", "-", s)
>   print repr(s)
> 
> When I look at the input content, I have :
> 
>  u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> 
> So, any pointers on replacing the \u2013 with a simple '-' (dash)? (Or I
> could even handle just a ' ' (space).)

I suppose you want to replace the EN DASH (U+2013) with HYPHEN-MINUS
(U+002D). For that both

>   s=s.replace(u'\u2013', '-')
>   s=s.replace(u"\u2013", "-")

should work (the Python interpreter sees no difference between single and
double quotes). Let's try:

>>> s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>>> t = s.replace(u"\u2013", "-")
>>> s == t
False
>>> s
u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>>> t
u'English 120 Course Syllabus - Fall - 2006'

So it looks like you did not actually try the code you posted.
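
As an aside, the first variant in your list, the one with '\u2013' and no u
prefix, is a different thing in Python 2: there the literal is six ordinary
characters (backslash, u, 2, 0, 1, 3), so replace() finds nothing to replace.
A small sketch of the difference; the names six_chars and en_dash are mine,
and the backslash is written out explicitly so the example behaves the same
on Python 2 and Python 3:

```python
s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'

# What Python 2 sees for '\u2013' without the u prefix: six separate
# characters. The doubled backslash spells that out explicitly.
six_chars = u'\\u2013'
# The actual EN DASH: a single code point.
en_dash = u'\u2013'

print(len(six_chars))
print(len(en_dash))

# Searching for the six-character text finds nothing, so s comes back as is.
unchanged = s.replace(six_chars, u'-')
# Searching for the real EN DASH replaces both occurrences.
changed = s.replace(en_dash, u'-')

print(unchanged == s)
print(changed)
```

That alone does not explain your result, since the u-prefixed variants that
follow it do the replacement correctly.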

To remove all non-ASCII code points you can use encode():

>>> s.encode("ascii", "ignore")
'English 120 Course Syllabus  Fall  2006'

(Note that the result is a byte string, not a unicode string, and that the
dashes are dropped rather than replaced.)
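
If you want to keep a hyphen where the dash was and still end up with pure
ASCII, the two steps can be combined: replace the characters you care about
first, then let encode() drop whatever is left. A minimal sketch, assuming
that is the behaviour you are after; to_ascii and its replacements table are
hypothetical names of mine, and the EM DASH entry is an assumption about what
else might show up in your pages:

```python
def to_ascii(text, replacements=None):
    """Return an ASCII-only text string (hypothetical helper).

    Characters listed in `replacements` are substituted first; any other
    non-ASCII character is silently dropped.
    """
    if replacements is None:
        # Assumed defaults: EN DASH and EM DASH both become HYPHEN-MINUS.
        replacements = {u'\u2013': u'-', u'\u2014': u'-'}
    for old, new in replacements.items():
        text = text.replace(old, new)
    # encode() with "ignore" discards the remaining non-ASCII characters and
    # returns a byte string; decode() turns it back into a text string.
    return text.encode('ascii', 'ignore').decode('ascii')


s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
result = to_ascii(s)
print(result)
```

Leave off the final decode('ascii') if a byte string is what you need.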