[Tutor] search/match file position q

Clayton Kirkwood crk at godblessthe.us
Tue Oct 7 17:47:37 CEST 2014



!-----Original Message-----
!From: Tutor [mailto:tutor-bounces+crk=godblessthe.us at python.org] On
!Behalf Of Peter Otten
!Sent: Tuesday, October 07, 2014 3:50 AM
!To: tutor at python.org
!Subject: Re: [Tutor] search/match file position q
!
!Clayton Kirkwood wrote:
!
!> I was trying to keep it generic.
!> Wrapped data file:
!>                    <tr data-row-symbol="SWKS"><td class="col-symbol
!>                    txt"><span class="wrapper "
!>                    data-model="name:DatumModel;id:null;"
!>                    data-tmpl=""><a
!>                    data-ylk="cat:portfolio;cpos:1"
!>                    href="http://finance.yahoo.com/q?s=SWKS"
!>                    data-rapid_p="18">SWKS</a></span></td><td
!>                    class="col-fiftytwo_week_low
!>                    cell-raw:23.270000"><span
!>                    class="wrapper "
!>                    data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!>                    data-tmpl="change:yfin.datum">23.27</span></td><td
!>                    class="col-prev_close cell-raw:58.049999"><span
!>                    class="wrapper " data-model="name:DatumMo
!
!Doesn't Yahoo make the data available as CSV? That would be the way to
!go then.


Yes, Yahoo provides a few columns as CSV, but I need maybe 15 fields that
aren't included. Besides, what fun would that be? I try to find tasks that
allow me to expand my knowledge"<)))

!
!Anyway, regular expressions are definitely the wrong tool here, and
!reading the file one line at a time only makes it worse.


Why does that only make it worse? I don't think reading char by char would
be helpful. The line happens to be very long, and I don't have a way of
peeking around the corner to the next line, so to speak. If I broke it into
shorter strings, it would be much more onerous to jump past the end of the
current string into potentially many following ones.
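
One likely reason, offered as an assumption since the quoted message does not
spell it out: a pattern applied to one line at a time can never match text
that straddles a line break, whereas the same pattern applied to the whole
file can. A minimal sketch, using a made-up two-row sample (the "ABCD" row and
its value are hypothetical) in place of the real data file:

```python
import io
import re

# Hypothetical sample: each <td> tag is deliberately split across a line break.
sample = io.StringIO(
    '<tr data-row-symbol="SWKS"><td\n'
    'class="col-prev_close cell-raw:58.049999"></td></tr>\n'
    '<tr data-row-symbol="ABCD"><td\n'
    'class="col-prev_close cell-raw:12.5"></td></tr>\n'
)

text = sample.read()  # the whole document as one string, newlines included

# \s+ also matches a newline, so a tag split across lines is still found.
row_re = re.compile(
    r'<tr data-row-symbol="([A-Z]+)"><td\s+'
    r'class="col-prev_close cell-raw:([\d.]+)"'
)

# Whole-text scan: finditer walks forward through the string match by match.
rows = [(m.group(1), float(m.group(2))) for m in row_re.finditer(text)]

# Line-at-a-time scan: no single line contains a complete match.
per_line = [m for line in text.splitlines() for m in row_re.finditer(line)]

print(rows)
print(len(per_line))
```

With a real file, `open(path).read()` would take the place of the `io.StringIO`
stand-in.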


!
!> import re, os
!> line_in = file.readline()
!> # read in humongous html line
!> stock = re.search('\s*<tr data-row-symbol="([A-Z]+)">', line_in)
!> # scan to SWKS"> in data line; stock should be SWKS
!> low_52 = re.search('.+wk52:low.+([\d\.]+)<', line_in)
!> # want to pick up from SWKS">; low_52 should be 23.27
!>
!> I am trying to figure out if each re.match starts scanning at the
!> beginning of the same line over and over or does each scan start at
!> the end of the last match. It appears to start over??
!>
!> This is stock:
!> <_sre.SRE_Match object; span=(0, 47), match='                    <tr data-row-symbol="SWKS">'>
!> This is low_52:
!> <_sre.SRE_Match object; span=(0, 502875), match='                    <tr data-row-symbol="SWKS"><t>
!> If necessary, how do I pick up and move forward to the point right
!> after the previous match?  File.tell() and file.__sizeof__(), don't
!> seem to play a useful role.
!
!You should try BeautifulSoup. Let's play:
!
!>>> from bs4 import BeautifulSoup
!>>> soup = BeautifulSoup("""<tr data-row-symbol="SWKS"><td class="col-symbol
!txt"><span class="wrapper " data-model="name:DatumModel;id:null;"
!data-tmpl=""><a data-ylk="cat:portfolio;cpos:1"
!href="http://finance.yahoo.com/q?s=SWKS"
!data-rapid_p="18">SWKS</a></span></td><td class="col-fiftytwo_week_low
!cell-raw:23.270000"><span class="wrapper "
!data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!data-tmpl="change:yfin.datum">23.27</span></td><td class="col-prev_close
!cell-raw:58.049999">""")
!>>> soup.find("tr")
!<tr data-row-symbol="SWKS"><td class="col-symbol txt"><span
!class="wrapper "
!data-model="name:DatumModel;id:null;" data-tmpl=""><a data-rapid_p="18"
!data-ylk="cat:portfolio;cpos:1"
!href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span></td><td
!class="col-fiftytwo_week_low cell-raw:23.270000"><span class="wrapper "
!data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!data-tmpl="change:yfin.datum">23.27</span></td><td class="col-prev_close
!cell-raw:58.049999"></td></tr>
!>>> tr = soup.find("tr")
!>>> tr["data-row-symbol"]
!'SWKS'
!>>> tr.find_all("span")
![<span class="wrapper " data-model="name:DatumModel;id:null;"
!data-tmpl=""><a data-rapid_p="18" data-ylk="cat:portfolio;cpos:1"
!href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span>, <span
!class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
!data-tmpl="change:yfin.datum">23.27</span>]
!>>> span = tr.find_all("span")[1]
!>>> span["data-model"]
!'name:DatumModel;id:SWKS:qsi:wk52:low;'
!>>> span.text
!'23.27'
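
Where installing BeautifulSoup is not an option, the same extraction can be
approximated with the standard library's html.parser module. This swaps in a
different tool from the one used in the session above. It is a rough sketch
only: the class name SpanCollector and its attributes are made up for
illustration, and the fragment is the row quoted above.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Hypothetical helper: record the row symbol and each span's text."""

    def __init__(self):
        super().__init__()
        self.symbol = None   # value of data-row-symbol on the <tr>
        self.spans = []      # (data-model, text) pairs, in document order
        self._model = None   # data-model of the span currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and "data-row-symbol" in attrs:
            self.symbol = attrs["data-row-symbol"]
        elif tag == "span":
            self._model = attrs.get("data-model")
            self._text = []

    def handle_data(self, data):
        # Text inside nested tags (such as the <a>) is collected as well.
        if self._model is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._model is not None:
            self.spans.append((self._model, "".join(self._text)))
            self._model = None


fragment = (
    '<tr data-row-symbol="SWKS"><td class="col-symbol txt">'
    '<span class="wrapper " data-model="name:DatumModel;id:null;" data-tmpl="">'
    '<a data-ylk="cat:portfolio;cpos:1" '
    'href="http://finance.yahoo.com/q?s=SWKS" data-rapid_p="18">SWKS</a>'
    '</span></td><td class="col-fiftytwo_week_low cell-raw:23.270000">'
    '<span class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;" '
    'data-tmpl="change:yfin.datum">23.27</span></td>'
    '<td class="col-prev_close cell-raw:58.049999">'
)

parser = SpanCollector()
parser.feed(fragment)
print(parser.symbol)
print(parser.spans)
```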


So, what makes regex wrong for this job? The question still remains: does
each search start at the beginning of the line every time, or does it step
forward from the last match? I will check out BeautifulSoup, as suggested in
a subsequent mail, but I'd still like to finish this process:<}}

Clayton
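
On the position question: re.search(pattern, string) keeps no state between
calls, so every call scans from index 0 of the string it is given. A compiled
pattern's search() method accepts an optional start position, which can be set
to the end of the previous match. The sketch below assumes a shortened
stand-in for the real 500,000-character line; the names stock_re and low_52_re
are made up for illustration. It also shows why the original low_52 pattern
spanned almost the whole line: both ".+" parts are greedy.

```python
import re

# Shortened, hypothetical stand-in for the very long data line.
line_in = (
    '                    <tr data-row-symbol="SWKS"><td class="col-symbol txt">'
    '<span class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;" '
    'data-tmpl="change:yfin.datum">23.27</span></td>'
    '<td class="col-prev_close cell-raw:58.049999"></td></tr>'
)

stock_re = re.compile(r'<tr data-row-symbol="([A-Z]+)">')
# Non-greedy .*? stops at the first number after wk52:low.
low_52_re = re.compile(r'wk52:low.*?>([\d.]+)<')

stock = stock_re.search(line_in)
# Resume scanning where the previous match ended instead of at index 0.
low_52 = low_52_re.search(line_in, stock.end())

# The pattern from the original post: the leading .+ forces the match to
# begin at index 0, and the second .+ swallows as much as it can, leaving
# only the last digit for the group.
greedy = re.search(r'.+wk52:low.+([\d\.]+)<', line_in)

print(stock.group(1))
print(low_52.group(1), low_52.start() >= stock.end())
print(greedy.start(), greedy.group(1))
```

For walking through many rows, pattern.finditer(line_in) yields successive
non-overlapping matches without any manual position bookkeeping.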



