[Tutor] unicode decode/encode issue
Peter Otten
__peter__ at web.de
Mon Sep 26 13:54:44 EDT 2016
bruce wrote:
> Hi.
>
> I've got a "basic" situation that should be simple. So it must be a user
> (me) issue!
>
>
> I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> ascii. I'm not worried about any cruft that might get stripped out as the
> data is generated from a us site. (It's a college/class dataset).
>
> I know this is a unicode issue. I know I need to have a much more
> robust/pythonic/correct approach. I will later, but for now, just want to
> resolve this issue, and get it off my plate so to speak.
>
> I've looked at stackoverflow, as well as numerous other sites, so I turn
> to the group for a pointer or two...
>
> The unicode that I'm dealing with is 'u\2013'
>
> The basic things I've done up to now are:
>
> s=content
> s=ascii_strip(s)
> s=s.replace('\u2013', '-')
> s=s.replace(u'\u2013', '-')
> s=s.replace(u"\u2013", "-")
> s=re.sub(u"\u2013", "-", s)
> print repr(s)
>
> When I look at the input content, I have :
>
> u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>
> So, any pointers on replacing the \u2013 with a simple '-' (dash) (or I
> could even handle just a ' ' (space)
I suppose you want to replace the EN DASH (U+2013) with HYPHEN-MINUS (U+002D). For that both
> s=s.replace(u'\u2013', '-')
> s=s.replace(u"\u2013", "-")
should work (the Python interpreter sees no difference between the two).
Let's try:
>>> s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>>> t = s.replace(u"\u2013", "-")
>>> s == t
False
>>> s
u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>>> t
u'English 120 Course Syllabus - Fall - 2006'
So it looks like you did not actually try the code you posted.
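If the replacement is needed in more than one place, it can be wrapped in a small function. The following is only a sketch: the names EN_DASH and dash_to_hyphen are made up for illustration, and the u prefix is kept so that the snippet should run unchanged on Python 2 and on Python 3.3 or later.

```python
EN_DASH = u"\u2013"  # the character shown as \u2013 in the repr() output


def dash_to_hyphen(text):
    """Return text with every EN DASH replaced by an ASCII HYPHEN-MINUS.

    dash_to_hyphen is a hypothetical helper name, not a library function.
    """
    return text.replace(EN_DASH, u"-")


s = u"English 120 Course Syllabus \u2013 Fall \u2013 2006"
t = dash_to_hyphen(s)
print(repr(t))
```

Note that replace() returns a new string and leaves s untouched, so the result has to be assigned to a name, as in the session above.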
To remove all non-ASCII code points you can use encode():
>>> s.encode("ascii", "ignore")
'English 120 Course Syllabus  Fall  2006'
(Note that the result is a byte string)
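The second argument of encode() selects the error handler. Here is a short sketch comparing "ignore" with "replace" and the default "strict"; the variable names are arbitrary and chosen only for this illustration.

```python
s = u"English 120 Course Syllabus \u2013 Fall \u2013 2006"

# "ignore" silently drops every code point outside ASCII. The spaces on
# both sides of each dash survive, so two spaces end up side by side.
dropped = s.encode("ascii", "ignore")

# "replace" puts a question mark where each such code point was.
marked = s.encode("ascii", "replace")

# The default handler, "strict", raises instead of losing data.
try:
    s.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# encode() returns a byte string; decode() turns it back into text.
text_again = dropped.decode("ascii")

# Optionally collapse the runs of spaces left behind by "ignore".
tidy = u" ".join(text_again.split())
print(repr(tidy))
```

Whether "ignore" or "replace" is the better fit depends on whether you want to see afterwards that something was removed.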