[Tutor] unicode decode/encode issue

Mon Sep 26 17:27:54 EDT 2016

Hey folks. (peter!)

Thanks for the reply.

I wound up doing:

  #s=s.replace('\u2013', '-')
  #s=s.replace(u'\u2013', '-')
  #s=s.replace(u"\u2013", "-")
  #s=re.sub(u"\u2013", "-", s)
  s=s.encode("ascii", "ignore")
  s=s.replace(u"\u2013", "-")
  s=s.replace("–", "-")  ##<<< this was actually in the raw content, apparently

  print repr(s)

The text no longer has the unicode 'dash'
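
For anyone reading this later, the order of the calls seems to matter here. This is only a small sketch using the sample string from my original post, not the real page content: encoding with "ignore" first drops the dash completely, so the replace calls after it have nothing left to match.

```python
s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'

# Encoding first throws the dash away, leaving a double space behind.
stripped = s.encode("ascii", "ignore")

# Replacing first keeps a plain hyphen where the dash used to be.
replaced = s.replace(u"\u2013", u"-").encode("ascii", "ignore")

print(repr(stripped))
print(repr(replaced))
```

So if the hyphen should survive, the replace probably has to come before the encode.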

I'll revisit and simplify later. One or two of the lines above can probably
be removed while still resolving the unicode issue.
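
A rough idea of what the simplified version might look like. The helper name dash_to_ascii is made up for illustration, and I'm assuming that when the content arrives as raw bytes it is UTF-8 encoded (which would explain the literal dash showing up in the raw content):

```python
def dash_to_ascii(content, encoding="utf-8"):
    # Hypothetical helper; assumes byte input is UTF-8 encoded.
    if isinstance(content, bytes):
        content = content.decode(encoding)
    # Swap the dash (U+2013) for a plain hyphen before anything is dropped.
    content = content.replace(u"\u2013", u"-")
    # Discard whatever non-ascii characters remain.
    return content.encode("ascii", "ignore")

raw = u'Syllabus \u2013 Fall'.encode("utf-8")
print(repr(dash_to_ascii(raw)))
print(repr(dash_to_ascii(u'Syllabus \u2013 Fall')))
```

Both calls should give the same byte string, whether the input was already unicode or still raw bytes.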

Thanks

On Mon, Sep 26, 2016 at 1:54 PM, Peter Otten <__peter__ at web.de> wrote:

> bruce wrote:
>
> > Hi.
> >
> > I've got a "basic" situation that should be simple. So it must be a user
> > (me) issue!
> >
> >
> > I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> > ascii. I'm not worried about any cruft that might get stripped out as the
> > data is generated from a us site. (It's a college/class dataset).
> >
> > I know this is a unicode issue. I know I need to have a much more
> > robust/pythonic/correct approach. I will later, but for now, just want to
> > resolve this issue, and get it off my plate so to speak.
> >
> > I've looked at stackoverflow, as well as numerous other sites, so I turn
> > to the group for a pointer or two...
> >
> > The unicode that I'm dealing with is u'\u2013'
> >
> > The basic things I've done up to now are:
> >
> >   s=content
> >   s=ascii_strip(s)
> >   s=s.replace('\u2013', '-')
> >   s=s.replace(u'\u2013', '-')
> >   s=s.replace(u"\u2013", "-")
> >   s=re.sub(u"\u2013", "-", s)
> >   print repr(s)
> >
> > When I look at the input content, I have :
> >
> >  u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> >
> > So, any pointers on replacing the \u2013 with a simple '-' (dash)? (Or I
> > could even handle just a ' ' (space).)
>
> I suppose you want to replace the EN DASH with HYPHEN-MINUS. For that both
>
> >   s=s.replace(u'\u2013', '-')
> >   s=s.replace(u"\u2013", "-")
>
> should work (the Python interpreter sees no difference between the two).
> Let's try:
>
> >>> s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> >>> t = s.replace(u"\u2013", "-")
> >>> s == t
> False
> >>> s
> u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> >>> t
> u'English 120 Course Syllabus - Fall - 2006'
>
> So it looks like you did not actually try the code you posted.
>
> To remove all non-ascii codepoints you can use encode():
>
> >>> s.encode("ascii", "ignore")
> 'English 120 Course Syllabus  Fall  2006'
>
> (Note that the result is a byte string)