On 12/24/14 14:13, Jason Williams wrote:
I want to be able to grab the titles of articles from a webpage. I wrote my script using the following XPath:

    en_tree.xpath('//a[@class="pubSectionTitle"]')

to grab from the following example XML:

    <a href='/en/publications/magazines/wp20141201/ancient-city-timgad/' class="pubSectionTitle" title="Timgad—A Buried City Reveals Its Secrets">
    Timgad<wbr />—A Buried City Reveals Its Secrets </a>

When I encounter the above example and continue with my script (see the code below), I only get 'Timgad', not the entire title.

Thanks for any help, I'm very inexperienced with this!

    en_toc = en_tree.xpath('//a[@class="pubSectionTitle"]')
    for title in chs_toc:
        entry = title.text.strip()
        en_titles.append(entry)
Try

    entry = title.text_content()

The rest of the text is technically in the tail of the wbr tag, so

    entry = (title.text + title[0].tail).strip()

would return what you want. But don't use the above in production; you need proper error handling, etc. Just use text_content().

hth
burak
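
To make the text/tail point concrete, here is a minimal sketch of the behaviour on the example link. It is an illustration under assumptions, not the setup from the thread: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, the wrapping <div> and the variable names are made up, and "".join(link.itertext()) stands in for text_content(), which is a method of lxml.html elements. ElementTree and lxml share the same text/tail model, so the split of the title should look the same in both.

```python
import xml.etree.ElementTree as ET

# The example link from the question, wrapped in a hypothetical <div>.
# \u2014 is the long dash that appears in the title.
snippet = (
    '<div>'
    '<a href="/en/publications/magazines/wp20141201/ancient-city-timgad/" '
    'class="pubSectionTitle" '
    'title="Timgad\u2014A Buried City Reveals Its Secrets">'
    ' Timgad<wbr />\u2014A Buried City Reveals Its Secrets </a>'
    '</div>'
)

root = ET.fromstring(snippet)
link = root.find('.//a[@class="pubSectionTitle"]')

# .text only holds the text before the first child element (<wbr/>).
text_only = link.text.strip()

# The remainder of the title is stored as the tail of the <wbr/> child.
wbr_tail = link[0].tail

# Joining every text fragment under the element recovers the full title.
full_title = "".join(link.itertext()).strip()

print(repr(text_only))
print(repr(wbr_tail))
print(repr(full_title))
```

Joining all text fragments avoids having to know how many child elements the link contains, which is why it is more robust than adding title.text and title[0].tail by hand.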
participants (2)
- Burak Arslan
- Jason Williams