[Tutor] search/match file position q

Peter Otten __peter__ at web.de
Tue Oct 7 12:49:39 CEST 2014


Clayton Kirkwood wrote:

> I was trying to keep it generic.
> Wrapped data file:
>                    <tr data-row-symbol="SWKS"><td class="col-symbol
>                    txt"><span class="wrapper "
>                    data-model="name:DatumModel;id:null;" data-tmpl=""><a
>                    data-ylk="cat:portfolio;cpos:1"
>                    href="http://finance.yahoo.com/q?s=SWKS"
>                    data-rapid_p="18">SWKS</a></span></td><td
>                    class="col-fiftytwo_week_low cell-raw:23.270000"><span
>                    class="wrapper "
>                    data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
>                    data-tmpl="change:yfin.datum">23.27</span></td><td
>                    class="col-prev_close cell-raw:58.049999"><span
>                    class="wrapper " data-model="name:DatumMo

Doesn't Yahoo make the data available as CSV? That would be the way to go 
then.
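
If a CSV export is available, the csv module from the standard library does the parsing for you. A minimal sketch, assuming a hypothetical export whose header names are symbol, fiftytwo_week_low and prev_close (made up for illustration, the real column names may well differ):

```python
import csv
import io

# Hypothetical CSV export; the header names are assumptions for illustration.
# The values are the ones visible in the HTML snippet quoted above.
data = io.StringIO(
    "symbol,fiftytwo_week_low,prev_close\n"
    "SWKS,23.27,58.049999\n"
)

# DictReader maps every row to the names found in the header line.
lows = {}
for row in csv.DictReader(data):
    lows[row["symbol"]] = float(row["fiftytwo_week_low"])

print(lows)
```

With a real file you would pass the open file object to csv.DictReader instead of the io.StringIO stand-in.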

Anyway, regular expressions are definitely the wrong tool here, and reading 
the file one line at a time only makes it worse.

> import re, os
>     line_in = file.readline()    # read in humongous html line
>         stock = re.search('\s*<tr data-row-symbol="([A-Z]+)">', line_in)
>         # scan to SWKS"> in data line, stock should be SWKS
>         low_52 = re.search('.+wk52:low.+([\d\.]+)<', line_in)
>         # want to pick up from SWKS">, low_52 should be 23.27
> 
> I am trying to figure out if each re.match starts scanning at the
> beginning of the same line over and over or does each scan start at the
> end of the last match. It appears to start over??
> 
> This is stock:
> <_sre.SRE_Match object; span=(0, 47), match='                    <tr
> data-row-symbol="SWKS">'>
> This is low_52:
> <_sre.SRE_Match object; span=(0, 502875), match='                    <tr
> data-row-symbol="SWKS"><t>
>
> If necessary, how do I pick up and move forward to the point right after
> the previous match?  File.tell() and file.__sizeof__(), don't seem to play
> a useful role.
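
To answer the question as asked: yes, every re.search() call starts at the beginning of the string you pass in; it knows nothing about earlier matches. The leading greedy `.+` in the second pattern is most likely why that match spans (0, 502875). If you do want to continue after a previous match, the search() method of a compiled pattern accepts a start position, and match.end() supplies one. A minimal sketch on a shortened stand-in for the real line (the names symbol_re and low_re and the simplified pattern are assumptions for illustration):

```python
import re

# Shortened stand-in for the real line; only the structure matters here.
line_in = (
    '<tr data-row-symbol="SWKS"><td><span '
    'data-model="name:DatumModel;id:SWKS:qsi:wk52:low;">23.27</span></td>'
    '<td class="col-prev_close cell-raw:58.049999"></td></tr>'
)

symbol_re = re.compile(r'<tr data-row-symbol="([A-Z]+)">')
# The non-greedy .*? stops at the first number that follows wk52:low.
low_re = re.compile(r'wk52:low.*?>([\d.]+)<')

stock = symbol_re.search(line_in)
# Start the second search where the first match ended.
low_52 = low_re.search(line_in, stock.end())

print(stock.group(1), low_52.group(1))
```

The file position never enters into it, which is why file.tell() is of no help: both searches operate on the string that readline() already returned.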

You should try BeautifulSoup. Let's play:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""<tr data-row-symbol="SWKS"><td
... class="col-symbol txt"><span class="wrapper "
... data-model="name:DatumModel;id:null;" data-tmpl=""><a
... data-ylk="cat:portfolio;cpos:1"
... href="http://finance.yahoo.com/q?s=SWKS"
... data-rapid_p="18">SWKS</a></span></td><td
... class="col-fiftytwo_week_low cell-raw:23.270000"><span
... class="wrapper "
... data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
... data-tmpl="change:yfin.datum">23.27</span></td><td
... class="col-prev_close cell-raw:58.049999">""")
>>> soup.find("tr")
<tr data-row-symbol="SWKS"><td class="col-symbol txt"><span class="wrapper "
data-model="name:DatumModel;id:null;" data-tmpl=""><a data-rapid_p="18"
data-ylk="cat:portfolio;cpos:1"
href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span></td><td
class="col-fiftytwo_week_low cell-raw:23.270000"><span class="wrapper "
data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
data-tmpl="change:yfin.datum">23.27</span></td><td
class="col-prev_close cell-raw:58.049999"></td></tr>
>>> tr = soup.find("tr")
>>> tr["data-row-symbol"]
'SWKS'
>>> tr.find_all("span")
[<span class="wrapper " data-model="name:DatumModel;id:null;"
data-tmpl=""><a data-rapid_p="18" data-ylk="cat:portfolio;cpos:1"
href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span>, <span
class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
data-tmpl="change:yfin.datum">23.27</span>]
>>> span = tr.find_all("span")[1]
>>> span["data-model"]
'name:DatumModel;id:SWKS:qsi:wk52:low;'
>>> span.text
'23.27'

Note that normally soup would hold the complete HTML document, and you'd need 
a few more find()/find_all() steps to get to the element of interest.
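
Those extra steps might look roughly like this. It is only a sketch on a made-up miniature page: the second row with the symbol ABCD is invented for illustration, and the real page surely has more nesting.

```python
from bs4 import BeautifulSoup

# Made-up miniature page; the real one has many more rows and columns.
# The ABCD row is invented to show the loop handling more than one stock.
html = """
<html><body><table>
<tr data-row-symbol="SWKS">
  <td><span data-model="name:DatumModel;id:SWKS:qsi:wk52:low;">23.27</span></td>
</tr>
<tr data-row-symbol="ABCD">
  <td><span data-model="name:DatumModel;id:ABCD:qsi:wk52:low;">1.50</span></td>
</tr>
</table></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

lows = {}
# attrs={"data-row-symbol": True} selects only rows that carry the attribute.
for tr in soup.find_all("tr", attrs={"data-row-symbol": True}):
    for span in tr.find_all("span"):
        # Pick the span whose data-model id ends in wk52:low.
        if span.get("data-model", "").endswith("wk52:low;"):
            lows[tr["data-row-symbol"]] = float(span.text)

print(lows)
```

Selecting the span by its data-model attribute rather than by position keeps the loop working if the column order changes.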


