[Tutor] search/match file position q

Dave Angel davea at davea.name
Tue Oct 7 14:39:41 CEST 2014


"Clayton Kirkwood" <crk at godblessthe.us> Wrote in message:
> I was trying to keep it generic.
> Wrapped data file:
>                    <tr data-row-symbol="SWKS"><td class="col-symbol txt"><span class="wrapper " data-model="name:DatumModel;id:null;" data-tmpl=""><a data-ylk="cat:portfolio;cpos:1" href="http://finance.yahoo.com/q?s=SWKS" data-rapid_p="18">SWKS</a></span></td><td class="col-fiftytwo_week_low cell-raw:23.270000"><span class="wrapper " data-model="name:DatumModel;id:SWKS:qsi:wk52:low;" data-tmpl="change:yfin.datum">23.27</span></td><td class="col-prev_close cell-raw:58.049999"><span class="wrapper " data-model="name:DatumMo
> 
> 
> import re, os
>     line_in = file.readline()    # read in humongous html line
>         # scan to SWKS"> in data line; stock should be SWKS
>         stock = re.search('\s*<tr data-row-symbol="([A-Z]+)">', line_in)
>         # want to pick up from SWKS">; low_52 should be 23.27
>         low_52 = re.search('.+wk52:low.+([\d\.]+)<', line_in)
> 
> I am trying to figure out if each re.match starts scanning at the beginning of the same line over and over or does each scan start at the end of the last match. It appears to start over??
> 
> This is stock:
> <_sre.SRE_Match object; span=(0, 47), match='                    <tr data-row-symbol="SWKS">'> 
> This is low_52:
> <_sre.SRE_Match object; span=(0, 502875), match='                    <tr data-row-symbol="SWKS"><t>
> If necessary, how do I pick up and move forward to the point right after the previous match?  File.tell() and file.__sizeof__(), don't seem to play a useful role.
> 

The best solution is ANYTHING but HTML scraping.  If the website
 offers an API or a download format like CSV, use it. HTML is too
 prone to changing at the whim of the developers.

If you must use HTML, get Beautiful Soup. Regex can break
 suddenly even if the developers don't change anything. Regex
 should only be used on HTML if you're the one generating the
 website, and you coordinate it to be parseable.
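
To show what parsing rather than pattern matching looks like, here
 is a minimal sketch. Beautiful Soup is a third-party package, so
 this uses the standard library's html.parser as a stand-in; it is
 not Beautiful Soup's API. The class name RowParser, the attribute
 names it keeps, and the trimmed sample string are my own
 assumptions based on the row you posted.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the symbol and 52-week low from rows shaped like the sample."""

    def __init__(self):
        super().__init__()
        self.rows = {}         # symbol -> 52-week low, kept as text
        self._symbol = None    # symbol of the row currently being read
        self._in_low = False   # True while inside the wk52:low span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr' and 'data-row-symbol' in attrs:
            self._symbol = attrs['data-row-symbol']
        elif tag == 'span' and 'wk52:low' in (attrs.get('data-model') or ''):
            self._in_low = True

    def handle_data(self, data):
        if self._in_low and self._symbol:
            self.rows[self._symbol] = data.strip()

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_low = False


# Trimmed copy of the posted row, closed off by hand for the example.
sample = ('<tr data-row-symbol="SWKS"><td class="col-symbol txt">'
          '<span class="wrapper " data-model="name:DatumModel;id:null;">'
          '<a href="http://finance.yahoo.com/q?s=SWKS">SWKS</a></span></td>'
          '<td class="col-fiftytwo_week_low cell-raw:23.270000">'
          '<span class="wrapper " '
          'data-model="name:DatumModel;id:SWKS:qsi:wk52:low;">'
          '23.27</span></td></tr>')

parser = RowParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

The parser keys off tag and attribute names rather than character
 positions, so extra whitespace or reordered attributes should not
 break it the way they can break a regex.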

If regex were the best solution, you could read the following
 example pasted from the online docs. The re functions search a
 string, not a file, and each call starts at the beginning of the
 string you pass it, so file position is irrelevant.  The span
 numbers in the example below can be used to subscript your
 string, either for saving the results or for removing the part
 already searched.
 

Something like
  line_in = line_in[span[1]:]   # span is (start, end); resume at end
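
As a rough sketch of picking up after the previous match, here are
 two options: slice the string at the match's end, or compile the
 pattern and pass a starting position to its search() method. The
 sample string is a shortened, made-up stand-in for your line, and
 the low-price pattern is my own guess at what you want, not your
 original one.

```python
import re

# Shortened, hypothetical stand-in for the real line of HTML.
line_in = ('   <tr data-row-symbol="SWKS"><td>x</td>'
           '<span data-model="id:SWKS:qsi:wk52:low;">23.27</span>')

stock = re.search(r'\s*<tr data-row-symbol="([A-Z]+)">', line_in)

# Option 1: slice off everything up to the end of the first match.
rest = line_in[stock.end():]
low_a = re.search(r'wk52:low[^>]*>([\d.]+)<', rest)

# Option 2: keep the string intact and tell search() where to start.
low_pat = re.compile(r'wk52:low[^>]*>([\d.]+)<')
low_b = low_pat.search(line_in, stock.end())

print(stock.group(1), low_a.group(1), low_b.group(1))
```

With option 1 the spans of later matches are relative to the
 sliced string; with option 2 they stay relative to the original
 line, which is easier if you want to keep subscripting it.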

Ref: https://docs.python.org/3.4/howto/regex.html

findall() has to create the entire list before it can be returned
 as the result. The finditer() method returns a sequence of match
 object instances as an iterator:

>>> import re
>>> p = re.compile(r'\d+')
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator  
<callable_iterator object at 0x...>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
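
Applied to data like yours, the same idea might look like the
 sketch below. The two rows are made up for illustration, and the
 names row_pat and found are only placeholders.

```python
import re

# Two made-up rows standing in for the real page.
html = ('<tr data-row-symbol="SWKS"><td>...</td></tr>'
        '<tr data-row-symbol="AAPL"><td>...</td></tr>')

row_pat = re.compile(r'<tr data-row-symbol="([A-Z]+)">')

# Each match object carries both the captured text and its position,
# and finditer() resumes after the previous match on its own.
found = [(m.group(1), m.span()) for m in row_pat.finditer(html)]
for symbol, span in found:
    print(symbol, span)
```

No slicing or bookkeeping is needed here, because the iterator
 keeps track of where the last match ended.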



-- 
DaveA


