[Tutor] Removing GB pound symbols from Beautiful soup output

Ethan Wei think123 at gmail.com
Fri Jul 16 17:29:06 CEST 2010


1st. What encoding does the BBC's RSS page use? UTF-8 or something else?
2nd. Confirm that your source file's encoding and your document's encoding match the RSS page's encoding.
3rd. Add fromEncoding to your BeautifulSoup instance.

>  ex. soup = BeautifulSoup(html,fromEncoding="utf-8")
>
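
If the goal is simply to get rid of the non-ASCII characters, as the original question asks, here is a minimal sketch using only the standard library. It assumes the feed really is UTF-8 (check step 1 first), and it is written for Python 3; the same idea applies to unicode strings in Python 2. The name to_ascii and the REPLACEMENTS table are made up for illustration and are not part of BeautifulSoup.

```python
import unicodedata

# Hypothetical substitution table: map characters you want to keep
# in readable form to an ASCII stand-in. Extend it as needed.
REPLACEMENTS = {u"\xa3": u"GBP "}  # u"\xa3" is the pound sign


def to_ascii(raw, encoding="utf-8"):
    """Decode raw feed bytes, then reduce the text to plain ASCII."""
    # Assumption: the bytes are in `encoding` (UTF-8 by default).
    text = raw.decode(encoding)
    # Substitute the characters we have an explicit replacement for.
    for char, substitute in REPLACEMENTS.items():
        text = text.replace(char, substitute)
    # Split accented letters into base letter + combining mark,
    # then drop whatever is still not ASCII.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


sample = u"Caf\xe9 prices rise to \xa33.50".encode("utf-8")
print(to_ascii(sample))
```

Decoding first and only converting to ASCII at the very end avoids the UnicodeEncodeError, because nothing is left to the default 'ascii' codec. If you would rather see a marker where a character was dropped, "replace" can be used in place of "ignore".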


2010/7/16 Andy <cheesman at titan.physx.u-szeged.hu>

> Dear Nice people
>
> I've been using Beautiful Soup to filter the BBC's RSS feed. However,
> recently the BBC changed the feed and it is causing me problems with
> the pound (money) symbol. The initial error was "UnicodeEncodeError: 'ascii'
> codec can't encode character u'\xa3'", which means that the default encoding
> can't process this (unicode) character. I was having similar problems with
> HTML characters appearing, but I used a simple regex system to
> remove/substitute them with something suitable.
> I tried applying the same approach and made a generic regex pattern
> (re.compile(u"""\u\[A-Fa-f0-9\]\{4\}""")), but this fails because it doesn't
> follow the standard pattern for ASCII. I'm not sure that I 100% understand
> the unicode system, but is there a simple way to remove/substitute these
> non-ASCII strings?
>
> Thanks for any help!
>
> Andy