[Chicago] Parsing Metra's Online Schedule
Massimo Di Pierro
mdipierro at cs.depaul.edu
Thu Apr 10 01:57:28 CEST 2008
To avoid confusion: this is not the program for parsing the Metra schedule.
It is a general-purpose scraper for extracting page layout, not
page content.
On Apr 9, 2008, at 5:23 PM, Cosmin Stejerean wrote:
> Thanks. I added your code to
> http://github.com/cosmin/metratime/commit/
> 60a7723281c98a5297b31222493bf2e9a8bf10e4
> This should work for most pages - as far as I remember there was a
> single page in the schedule that had some odd exception to it
> (something like an HTML comment in the middle of the schedule). But
> last time I looked at it was probably a year ago so it might have
> changed.
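If that stray comment is still there, one defensive option is to strip HTML comments from the page before parsing. A minimal sketch (the sample markup below is made up for illustration, not taken from Metra's site):

```python
import re

# Remove HTML comments anywhere in the page, including ones dropped
# into the middle of a <pre> block, so they can't split a schedule row.
html = "<pre>5:30  6:15\n<!-- note -->7:00  7:45</pre>"
cleaned = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
print(cleaned)  # <pre>5:30  6:15\n7:00  7:45</pre>
```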
>
> I'll try to get all the data parsed out sometime today and probably
> publish it in JSON format if anyone else wants to do something clever
> with it (and I won't require the use of my web framework in exchange).
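Rows in the shape the parser below yields (train numbers, then AM/PM markers, then one row per station) could be serialized to JSON roughly like this; the key names "trains", "ampm", and "stops" are my own invention, not an agreed format:

```python
import json

# Sample rows mimicking get_rows() output: two header rows whose first
# cell is a spacer, then station rows of [town, time, time, ...].
rows = [
    [' ', '2101', '2103'],
    [' ', 'AM', 'AM'],
    ['Ogilvie', '5:30', '6:15'],
]
schedule = {
    'trains': rows[0][1:],
    'ampm': rows[1][1:],
    'stops': [{'station': r[0], 'times': r[1:]} for r in rows[2:]],
}
print(json.dumps(schedule))
```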
>
> - Cosmin
>
> On Wed, Apr 9, 2008 at 4:52 PM, Feihong Hsu <hsu.feihong at yahoo.com>
> wrote:
>> I don't ride Metra anymore so I don't have much motivation to help
>> with Cosmin's Metra Schedule App. However, I happened to have some
>> free time on my hands (I'm unemployed), so I thought it might be fun
>> to make a rudimentary parser. Surprisingly, I hardly did any real
>> HTML parsing, since Metra's pages actually use PRE tags instead of
>> TABLE tags to display the tabular parts. So it devolved into typical
>> regex hacking.
>>
>> The following code should not be construed as a complete solution.
>> All it does is parse the text inside the 3 PRE tags and put the data
>> into a single 2D matrix. I also included a small function that
>> creates an HTML table out of the data. I only tested my code on a
>> single page so far, but I think all the schedule pages are pretty
>> much the same. Basically, you can use this as a starting point.
>>
>> P.S. You need lxml to run the code.
>>
>> ------------------------------------------------------------
>> import re
>> import lxml.html as lh
>>
>> def get_rows(tree):
>>     texts = [n.text_content() for n in tree.xpath('//pre')]
>>
>>     trainNumRow = [' ']
>>     ampmRow = [' ']
>>     timeRows = []  # list of lists
>>
>>     for i, text in enumerate(texts):
>>         lines = [line for line in text.split('\n') if line.strip()]
>>
>>         trainNums = lines[0].split()
>>         trainNumRow += trainNums
>>         ampmRow += lines[1].split()
>>
>>         for j, line in enumerate(lines[2:]):
>>             matches = [m for m in re.finditer(r"x?\d+\:\d+|.---|\|", line)]
>>             if len(matches) != len(trainNums):
>>                 break
>>
>>             pos = matches[0].start()
>>             town = line[:pos].strip()
>>             times = [m.group() for m in matches]
>>
>>             if j >= len(timeRows):
>>                 timeRows.append([])
>>
>>             timeRows[j] += [town] + times if i == 0 else times
>>
>>     yield trainNumRow
>>     yield ampmRow
>>     for row in timeRows:
>>         yield row
>>
>> def make_table_file(filename, rows):
>>     import codecs
>>     fout = codecs.open(filename, 'w', 'utf-8')
>>     fout.write('<table border="1">')
>>     for row in rows:
>>         fout.write('<tr>')
>>         for v in row:
>>             if v.endswith('---'):  # get rid of the stupid \x97 char
>>                 v = '----'
>>             fout.write('<td>%s</td>' % v)
>>         fout.write('</tr>')
>>     fout.write('</table>')
>>     fout.close()
>>
>> if __name__ == '__main__':
>>     tree = lh.parse('test.html')
>>     rows = get_rows(tree)
>>     make_table_file('table.html', rows)
>> _______________________________________________
>> Chicago mailing list
>> Chicago at python.org
>> http://mail.python.org/mailman/listinfo/chicago
>>
>
>
>
> --
> Cosmin Stejerean
> http://blog.offbytwo.com