extracting from web pages but got disordered words sometimes
Frank Potter
could.net at gmail.com
Sat Jan 27 21:33:49 EST 2007
Thank you. I tried again and figured it out.
The problem was in how I used Beautiful Soup. I worked with it a year ago,
also on Chinese HTML pages, and no errors occurred. I read the old code
and found the difference: convert the page to unicode before feeding it
to Beautiful Soup, and then everything works.
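
For anyone reading later, here is a minimal sketch of that fix in Python 3. The encoding (GB18030), the helper name `to_unicode`, and the sample markup are all assumptions for illustration; the thread does not say which encoding the real pages used, so check the page's declared charset first.

```python
# Sketch only: decode the raw bytes to text *before* handing them to a parser.
# ASSUMPTION: the page is GB18030-encoded (a superset of GB2312/GBK).
# `to_unicode` is a hypothetical helper, not part of Beautiful Soup.

def to_unicode(raw_bytes, encoding="gb18030"):
    """Decode downloaded bytes into a str, replacing undecodable bytes."""
    return raw_bytes.decode(encoding, errors="replace")

# Stand-in for bytes fetched from a web server (hypothetical sample page).
sample_html = "<html><head><title>博客 - - 你好</title></head></html>"
raw_page = sample_html.encode("gb18030")

page_text = to_unicode(raw_page)

# page_text is now a unicode string and can be passed to the parser,
# e.g. BeautifulSoup(page_text, "html.parser") if bs4 is installed.
```

Decoding explicitly means the parser never has to guess the encoding, which is the likely source of the scrambled characters.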
On Jan 28, 3:26 am, "Paul McGuire" <p... at austin.rr.com> wrote:
> After looking at the pyparsing results, I think I see the problem with
> your original code. You are selecting only the characters after the
> rightmost "-" character, but you really want to select everything to
> the right of "- -". In some of the titles, the encoded Chinese
> includes a "-" character, so you are chopping off everything before
> that.
>
> Try changing your code to:
> title=full_title.split("- -")[1]
>
> I think then your original program will work.
>
> -- Paul
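
To illustrate the difference Paul describes, here is a small sketch with a made-up title (the real titles are not shown in this thread). Passing `1` as the maximum number of splits and stripping whitespace are my own additions, not part of his suggestion.

```python
# Hypothetical title: a site name, the "- -" separator, then a title
# that itself contains a "-" character.
full_title = "Site Name - - part one-part two"

# Taking only what follows the rightmost "-" chops off part of the title.
after_rightmost_dash = full_title.split("-")[-1]

# Splitting on the "- -" separator keeps everything to its right.
# maxsplit=1 guards against a second "- -" inside the title itself.
title = full_title.split("- -", 1)[1].strip()

print(after_rightmost_dash)  # part two
print(title)                 # part one-part two
```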