[CentralOH] Screen Scraping Presentation

Thomas Winningham winningham at gmail.com
Tue Dec 10 23:56:38 CET 2013


Oh geez now I've done it :P

Honestly, I don't do too much of this. I had only heard about Nutch once, and
when I googled it I came across Scrapy. I did have some success with PyQuery
earlier in the year, but I really only needed one thing and was thinking
about the problem in terms of CSS3 selectors, so it fit. I can't say much
about its performance. I've used lxml for some broken HTML and XML, since
its forgiving parser is nice depending on how muddy the water is, but I
don't know how much I could say about these things that you couldn't
google.
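
To give an idea of what "forgiving" means here, below is a minimal sketch,
assuming lxml is installed. The markup and the extract_links name are made up
for illustration. It uses XPath rather than CSS selectors only because lxml's
cssselect() needs the separate cssselect package.

```python
from lxml import html

# Deliberately sloppy markup (made up for this example):
# none of the <li> tags are closed, and neither is the <ul>.
BROKEN = """
<ul id="talks">
  <li class="talk"><a href="/scrapy">Scrapy</a>
  <li class="talk"><a href="/pyquery">PyQuery</a>
  <li class="other">lunch
"""


def extract_links(markup):
    """Return (text, href) pairs for links inside <li class="talk"> items."""
    # lxml.html.fromstring repairs the tree instead of raising on bad markup.
    doc = html.fromstring(markup)
    return [
        (a.text_content().strip(), a.get("href"))
        for a in doc.xpath('//li[@class="talk"]/a')
    ]


if __name__ == "__main__":
    for text, href in extract_links(BROKEN):
        print(text, href)
```

With PyQuery the same query would be written in CSS style, something like
li.talk a, which is the part that fit how I was thinking about the problem.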

I'll play around and try to come up with something then perhaps heh :P


On Tue, Dec 10, 2013 at 5:50 PM, Eric Miller <miller.eric.t at gmail.com> wrote:

> +1, would love to see this.
>
>
> On Tue, Dec 10, 2013 at 5:45 PM, Chris Folsom <jcfolsom at pureperfect.com> wrote:
>
>>
>> Sounds awesome. I would definitely attend. Currently using Nutch + Regexp.
>>
>>  -------- Original Message --------
>> Subject: [CentralOH] Screen Scraping Presentation
>> From: jep200404 at columbus.rr.com
>> Date: Tue, December 10, 2013 5:37 pm
>> To: "Mailing list for Central Ohio Python User Group (COhPy)"
>> <centraloh at python.org>
>>
>> On Tue, 10 Dec 2013 17:09:26 -0500, Thomas Winningham
>> <winningham at gmail.com> wrote:
>>
>> > Is lxml or pyquery what should be talked about instead of beautiful soup?
>> > I know lxml has a beautiful soup mode, but CSS3 selectors are so very nice.
>>
>> Please give a presentation on other ways of screen scraping
>> (that might even be better).
>>
>> How many other folks would like to see his presentation?
>>
>> _______________________________________________
>> CentralOH mailing list
>> CentralOH at python.org
>> https://mail.python.org/mailman/listinfo/centraloh
>>