Re: getting rid of —

Fri Jul 3 14:34:21 EDT 2009

On 3 Jul., 18:54, MRAB <pyt... at mrabarnett.plus.com> wrote:
> Tep wrote:
> > On 3 Jul., 16:58, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
> >> "Tep" <petshm... at googlemail.com> wrote in message
>
> >>news:46d36544-1ea2-4391-8922-11b8127a2fef at o6g2000yqj.googlegroups.com...
>
> >>> On 3 Jul., 06:40, Simon Forman <sajmik... at gmail.com> wrote:
> >>>> On Jul 2, 4:31 am, Tep <petshm... at googlemail.com> wrote:
> >> [snip]
> >>>>>>>> how can I replace '—' sign from string? Or do split at that
> >>>>>>>> character?
> >>>>>>>> Getting unicode error if I try to do it:
> >>>>>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1:
> >>>>>>>> ordinal not in range(128)
> >>>>>>>> Thanks, Pet
> >>>>>>>> script is # -*- coding: UTF-8 -*-
> >> [snip]
> >>>> I just tried a bit of your code above in my interpreter here and it
> >>>> worked fine:
> >>>> |>>> data = 'foo — bar'
> >>>> |>>> data.split('—')
> >>>> |['foo ', ' bar']
> >>>> |>>> data = u'foo — bar'
> >>>> |>>> data.split(u'—')
> >>>> |[u'foo ', u' bar']
> >>>> Figure out the smallest piece of "html source code" that causes the
> >>>> problem and include that with your next post.
> >>> The problem was that I had converted the "html source code" to a unicode
> >>> object and didn't encode it back to utf-8 before using split...
> >>> Thanks for the help, and sorry for the not-so-smart question
> >>> Pet
> >> You'd still benefit from posting some code.  You shouldn't be converting
>
> > I've posted code below
>
> >> back to utf-8 to do a split, you should be using a Unicode string with split
> >> on the Unicode version of the "html source code".  Also make sure your file
> >> is actually saved in the encoding you declare.  I print the encoding of your
> >> symbol in two encodings to illustrate why I suspect this.
>
> > File was indeed in windows-1252, I've changed this. For errors see
> > below
>
> >> Below, assume "data" is your "html source code" as a Unicode string:
>
> >> # -*- coding: UTF-8 -*-
> >> data = u'foo — bar'
> >> print repr(u'—'.encode('utf-8'))
> >> print repr(u'—'.encode('windows-1252'))
> >> print data.split(u'—')
> >> print data.split('—')
>
> >> OUTPUT:
>
> >> '\xe2\x80\x94'
> >> '\x97'
> >> [u'foo ', u' bar']
> >> Traceback (most recent call last):
> >>   File
> >> "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
> >> line 427, in ImportFile
> >>     exec codeObj in __main__.__dict__
> >>   File "<auto import>", line 1, in <module>
> >>   File "x.py", line 6, in <module>
> >>     print data.split('—')
> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0:
> >> ordinal not in range(128)
>
> >> Note that using the Unicode string in split() works.  Also note the decode
> >> byte in the error message when using a non-Unicode string to split the
> >> Unicode data.  In your original error message the decode byte that caused an
> >> error was 0x97, which is 'EM DASH' in Windows-1252 encoding.  Make sure to
> >> save your source code in the encoding you declare.  If I save the above
> >> script in windows-1252 encoding and change the coding line to windows-1252 I
> >> get the same results, but the decode byte is 0x97.
>
> >> # coding: windows-1252
> >> data = u'foo — bar'
> >> print repr(u'—'.encode('utf-8'))
> >> print repr(u'—'.encode('windows-1252'))
> >> print data.split(u'—')
> >> print data.split('—')
>
> >> '\xe2\x80\x94'
> >> '\x97'
> >> [u'foo ', u' bar']
> >> Traceback (most recent call last):
> >>   File
> >> "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
> >> line 427, in ImportFile
> >>     exec codeObj in __main__.__dict__
> >>   File "<auto import>", line 1, in <module>
> >>   File "x.py", line 6, in <module>
> >>     print data.split('—')
> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0:
> >> ordinal not in range(128)
>
> >> -Mark
>
> > #! /usr/bin/python
> > # -*- coding: UTF-8 -*-
> > import urllib2
> > import re
> > def getTitle(input):
> >     title = re.search('<title>(.*?)</title>', input)
>
> The input is Unicode, so it's probably better for the regular expression
> to also be Unicode:
>
>      title = re.search(u'<title>(.*?)</title>', input)
>
> (In the current implementation it actually doesn't matter.)
>
> >     title = title.group(1)
> >     print "FULL TITLE", title.encode('UTF-8')
> >     parts = title.split(' — ')
>
> The title is Unicode, so the string with which you're splitting should
> also be Unicode:
>
>      parts = title.split(u' — ')
>

Oh, so simple. I'm new to Python and still feel uncomfortable with
Unicode stuff.

Thanks to all for help!
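
For anyone reading this on Python 3, here is a minimal sketch of the same advice: decode the page bytes once, then keep both the text and the separator as text. The names get_title, SEPARATOR and sample_page are made up for illustration and are not from the script quoted in this thread. The page is a hard-coded sample instead of a real download. Note that Python 3 raises TypeError rather than UnicodeDecodeError when text and bytes are mixed in split().

```python
import re

# U+2014 EM DASH surrounded by spaces, written as an escape so the
# example does not depend on how this file is saved.
SEPARATOR = " \u2014 "


def get_title(raw_html, encoding="utf-8"):
    """Hypothetical helper: return the part of <title> before the dash."""
    # Decode once, at the boundary; everything after this line is text.
    text = raw_html.decode(encoding)
    match = re.search(r"<title>(.*?)</title>", text, re.DOTALL)
    if match is None:
        return None
    # Text split with a text separator: no implicit decoding takes place.
    return match.group(1).split(SEPARATOR)[0]


# Sample bytes standing in for response.read(); assumed to be UTF-8.
sample_page = "<html><title>foo \u2014 bar</title></html>".encode("utf-8")
print(get_title(sample_page))

# Mixing text with a bytes separator fails immediately on Python 3.
try:
    "foo \u2014 bar".split(SEPARATOR.encode("utf-8"))
    mixed_types_rejected = False
except TypeError:
    mixed_types_rejected = True
print(mixed_types_rejected)
```

The design choice is the same as in the replies above: convert at the edges of the program and never split text with a byte string.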

>
>
> >     return parts[0]
>
> > def getWebPage(url):
> >     user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
> >     headers = { 'User-Agent' : user_agent }
> >     req = urllib2.Request(url, '', headers)
> >     response = urllib2.urlopen(req)
> >     the_page = unicode(response.read(), 'UTF-8')
> >     return the_page
>
> > def main():
> >     url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
> >     title = getTitle(getWebPage(url))
> >     print title[0]
>
> > if __name__ == "__main__":
> >     main()
>
> > Traceback (most recent call last):
> >   File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
> >     main()
> >   File "C:\user\Projects\test\src\new_main.py", line 24, in main
> >     title = getTitle(getWebPage(url))
> > FULL TITLE Ð‘Ð°Ñ…Ñ€ÐµÐ¹Ð½ â€” Ð£Ð¸ÐºÐ¸Ð¿ÐµÐ´Ð¸Ñ
> >   File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
> >     parts = title.split(' â€” ')
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
> > 1: ordinal not in range(128)
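
The byte values in the two error messages (0xe2 and 0x97) and the garbled dash in the traceback above can be reproduced with a short Python 3 sketch. This is an illustration under the assumption that the console displayed UTF-8 bytes as windows-1252; the variable names are hypothetical and not from the thread.

```python
EM_DASH = "\u2014"  # U+2014 EM DASH

# The same character is three bytes in UTF-8 and one byte in windows-1252.
utf8_bytes = EM_DASH.encode("utf-8")
cp1252_bytes = EM_DASH.encode("windows-1252")
print(repr(utf8_bytes))
print(repr(cp1252_bytes))

# The first byte is what the 'ascii' codec complains about in each case.
print(hex(utf8_bytes[0]), hex(cp1252_bytes[0]))

# Reading UTF-8 bytes as if they were windows-1252 yields three unrelated
# characters in place of the dash (assumed cause of the garbled output).
garbled = utf8_bytes.decode("windows-1252")
print(len(garbled))

# The damage is reversible as long as no byte was dropped along the way.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == EM_DASH)
```

So a 0x97 in the error message points at a file saved as windows-1252, while 0xe2 points at UTF-8 bytes being decoded implicitly as ASCII.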