[Tutor] search/match file position q
Peter Otten
__peter__ at web.de
Tue Oct 7 12:49:39 CEST 2014
Clayton Kirkwood wrote:
> I was trying to keep it generic.
> Wrapped data file:
> <tr data-row-symbol="SWKS"><td class="col-symbol
> txt"><span class="wrapper "
> data-model="name:DatumModel;id:null;" data-tmpl=""><a
> data-ylk="cat:portfolio;cpos:1"
> href="http://finance.yahoo.com/q?s=SWKS"
> data-rapid_p="18">SWKS</a></span></td><td
> class="col-fiftytwo_week_low cell-raw:23.270000"><span
> class="wrapper "
> data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
> data-tmpl="change:yfin.datum">23.27</span></td><td
> class="col-prev_close cell-raw:58.049999"><span
> class="wrapper " data-model="name:DatumMo
Doesn't Yahoo make the data available as CSV? That would be the way to go
then.
Anyway, regular expressions are definitely the wrong tool here, and reading
the file one line at a time only makes it worse.
> import re, os
> line_in = file.readline()  # read in humongous html line
> stock = re.search('\s*<tr data-row-symbol="([A-Z]+)">', line_in)
> # scan to SWKS"> in data line, stock should be SWKS
> low_52 = re.search('.+wk52:low.+([\d\.]+)<', line_in)
> # want to pick up from SWKS">, low_52 should be 23.27
>
> I am trying to figure out if each re.match starts scanning at the
> beginning of the same line over and over or does each scan start at the
> end of the last match. It appears to start over??
>
> This is stock:
> <_sre.SRE_Match object; span=(0, 47), match=' <tr data-row-symbol="SWKS">'>
> This is low_52:
> <_sre.SRE_Match object; span=(0, 502875), match=' <tr data-row-symbol="SWKS"><t>
> If necessary, how do I pick up and move forward to the point right after
> the previous match? File.tell() and file.__sizeof__(), don't seem to play
> a useful role.
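To answer the question as asked: yes, every re.search() call starts at the
beginning of the string you pass in; the re module keeps no memory of earlier
matches. The span of (0, 502875) for low_52 also shows that the greedy ".+"
swallows nearly the whole line. If you do stay with regular expressions, a
compiled pattern's search() method accepts a start position, so you can resume
at the end of the previous match. A minimal sketch, assuming a shortened
stand-in for your line (the sample string and the pattern names are mine, not
from your script):

```python
import re

# Shortened stand-in for the one huge HTML line from the question.
line_in = (
    ' <tr data-row-symbol="SWKS"><td class="col-fiftytwo_week_low">'
    '<span data-model="id:SWKS:qsi:wk52:low;">23.27</span></td>'
    '<td class="col-prev_close">'
    '<span data-model="id:SWKS:qsi:prev;">58.05</span></td></tr>'
)

symbol_re = re.compile(r'<tr data-row-symbol="([A-Z]+)">')
# Non-greedy ".*?" stops at the first number after the marker
# instead of running on to the end of the line.
low_re = re.compile(r'wk52:low.*?>([\d.]+)<')

stock = symbol_re.search(line_in)
# Second argument is the start position: resume where the
# previous match ended instead of rescanning from index 0.
low_52 = low_re.search(line_in, stock.end())

print(stock.group(1))   # SWKS
print(low_52.group(1))  # 23.27
```

file.tell() cannot help here because it reports a position in the file, not in
the string that readline() already returned.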
You should try BeautifulSoup. Let's play:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""<tr data-row-symbol="SWKS"><td class="col-symbol
txt"><span class="wrapper " data-model="name:DatumModel;id:null;"
data-tmpl=""><a data-ylk="cat:portfolio;cpos:1"
href="http://finance.yahoo.com/q?s=SWKS"
data-rapid_p="18">SWKS</a></span></td><td
class="col-fiftytwo_week_low cell-raw:23.270000"><span class="wrapper "
data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
data-tmpl="change:yfin.datum">23.27</span></td><td
class="col-prev_close cell-raw:58.049999">""")
>>> soup.find("tr")
<tr data-row-symbol="SWKS"><td class="col-symbol txt"><span class="wrapper "
data-model="name:DatumModel;id:null;" data-tmpl=""><a data-rapid_p="18"
data-ylk="cat:portfolio;cpos:1"
href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span></td><td
class="col-fiftytwo_week_low cell-raw:23.270000"><span class="wrapper "
data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
data-tmpl="change:yfin.datum">23.27</span></td><td
class="col-prev_close cell-raw:58.049999"></td></tr>
>>> tr = soup.find("tr")
>>> tr["data-row-symbol"]
'SWKS'
>>> tr.find_all("span")
[<span class="wrapper " data-model="name:DatumModel;id:null;"
data-tmpl=""><a data-rapid_p="18" data-ylk="cat:portfolio;cpos:1"
href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span>, <span
class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
data-tmpl="change:yfin.datum">23.27</span>]
>>> span = tr.find_all("span")[1]
>>> span["data-model"]
'name:DatumModel;id:SWKS:qsi:wk52:low;'
>>> span.text
'23.27'
Note that normally soup would hold the complete HTML document, so you'd need a
few more steps to get from it to the element of interest.
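For a whole page, those extra steps might look like the loop below. This is
only a sketch: the two-row table is a cut-down stand-in for the real page, the
"XYZ" row and its value are invented sample data, and I'm assuming every row of
interest carries a data-row-symbol attribute and a span whose data-model
contains "wk52:low", as in your excerpt.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the full page; the XYZ row is made up.
html = """<table>
<tr data-row-symbol="SWKS">
<td><span data-model="name:DatumModel;id:null;">SWKS</span></td>
<td><span data-model="name:DatumModel;id:SWKS:qsi:wk52:low;">23.27</span></td>
</tr>
<tr data-row-symbol="XYZ">
<td><span data-model="name:DatumModel;id:null;">XYZ</span></td>
<td><span data-model="name:DatumModel;id:XYZ:qsi:wk52:low;">1.5</span></td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

low_52 = {}
# attrs={...: True} selects only rows that have the attribute at all.
for tr in soup.find_all("tr", attrs={"data-row-symbol": True}):
    symbol = tr["data-row-symbol"]
    for span in tr.find_all("span"):
        # span.get() returns the default for spans without data-model.
        if "wk52:low" in span.get("data-model", ""):
            low_52[symbol] = float(span.text)
            break

print(low_52)  # {'SWKS': 23.27, 'XYZ': 1.5}
```

Keying on the data-model attribute rather than on the position of the span in
the row should survive columns being added or reordered.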