Trying to parse matchup.io (lxml, SGMLParser, urlparse)
jerry.rocteur at gmail.com
Sun Jan 18 13:07:37 CET 2015
I'm trying to parse https://matchup.io/players/rocteur/friends
The part of the page body I'm interested in contains blocks exactly like this:
<a href="/players/mizucci0"><img alt="mizucci0" class="media__avatar"
I wanted to do it in Python as I'm learning, and I looked at the different
modules, but it isn't easy for me to work out the best way to do this:
most tutorials I see use complicated classes, and I just want to
parse this one paragraph at a time (as I would do in Perl) and print:
1 mizuho 26648 35315
2 xxxxxx 99999 99999
3 xxxxxx 99999 99999
etc. (In the above case I'm ignoring 818.7 and Miles.)
The best way I found so far is this:
import requests
from lxml import html
page = requests.get("https://matchup.io/players/rocteur/friends/week/")
tree = html.fromstring(page.text)
a = tree.xpath('//span/text()')
b = tree.xpath('//td/text()')
And then manipulating the indices:
print "%s %s %s %s" % (a[usern], a[users], b[tots], b[weekb])
tots += 4
weekb += 4
usern += 2
users += 2
But it isn't very scientific ;-)
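To make the index stepping concrete, here is a minimal sketch of the loop it implies. The function name step_rows, the starting offsets (0, 1, 0, 1) and the placeholder lists are assumptions for illustration only; they are not values taken from the real page.

```python
def step_rows(a, b, usern=0, users=1, tots=0, weekb=1):
    """Collect (a[usern], a[users], b[tots], b[weekb]) tuples, stepping
    through a two items at a time and through b four at a time.
    The default starting offsets are hypothetical."""
    rows = []
    # Stop as soon as any index would run past the end of its list.
    while max(usern, users) < len(a) and max(tots, weekb) < len(b):
        rows.append((a[usern], a[users], b[tots], b[weekb]))
        tots += 4
        weekb += 4
        usern += 2
        users += 2
    return rows


# Placeholder lists standing in for the two xpath() results (not real page data).
a = ["name1", "nick1", "name2", "nick2"]
b = ["t1", "w1", "x", "y", "t2", "w2", "x", "y"]

for row in step_rows(a, b):
    print("%s %s %s %s" % row)
```

The weakness is visible here: the output is only right as long as the two flat lists stay perfectly aligned, so one extra span or td anywhere on the page shifts every row after it.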
Which module would you use, and what would you suggest is the best way to do it?
Thanks very much in advance. I haven't done a lot of HTML parsing. I
would much prefer using web services and an API, but unfortunately they
don't have one.