[Chicago] Parsing Metra's Online Schedule

Massimo Di Pierro mdipierro at cs.depaul.edu
Thu Apr 10 01:57:28 CEST 2008


To avoid confusion: this is not the program for parsing the Metra
schedule. It is a general-purpose scraper for extracting page layout,
not page content.


On Apr 9, 2008, at 5:23 PM, Cosmin Stejerean wrote:

> Thanks. I added your code to
> http://github.com/cosmin/metratime/commit/60a7723281c98a5297b31222493bf2e9a8bf10e4
> This should work for most pages - as far as I remember there was a
> single page in the schedule that had some odd exception to it
> (something like an HTML comment in the middle of the schedule). But
> last time I looked at it was probably a year ago so it might have
> changed.
>
> I'll try to get all the data parsed out sometime today and probably
> publish it in JSON format if anyone else wants to do something clever
> with it (and I won't require the use of my web framework in exchange).
>
> - Cosmin
>
> On Wed, Apr 9, 2008 at 4:52 PM, Feihong Hsu <hsu.feihong at yahoo.com>  
> wrote:
>> I don't ride Metra anymore so I don't have much motivation to help
>>  with Cosmin's Metra Schedule App. However, I happened to have some
>>  free time on my hands (I'm unemployed), so I thought it might be fun
>>  to make a rudimentary parser. Surprisingly, I hardly did any real
>>  HTML parsing, since Metra's pages actually use PRE tags instead of
>>  TABLE tags to display the tabular parts. So it devolved into typical
>>  regex hacking.
>>
>>  The following code should not be construed as a complete solution.
>>  All it does is parse the text inside the 3 PRE tags and put the data
>>  into a single 2D matrix. I also included a small function that
>>  creates an HTML table out of the data. I only tested my code on a
>>  single page so far, but I think all the schedule pages are pretty
>>  much the same. Basically, you can use this as a starting point.
>>
>>  P.S. You need lxml to run the code.
>>
>>  ------------------------------------------------------------
>>  import re
>>  import lxml.html as lh
>>
>>  def get_rows(tree):
>>     texts = [n.text_content() for n in tree.xpath('//pre')]
>>
>>     trainNumRow = [' ']
>>     ampmRow = [' ']
>>     timeRows = []   # list of lists
>>
>>     for i, text in enumerate(texts):
>>         lines = [line for line in text.split('\n')
>>                  if line.strip()]
>>
>>         trainNums = lines[0].split()
>>         trainNumRow += trainNums
>>         ampmRow += lines[1].split()
>>
>>         for j, line in enumerate(lines[2:]):
>>             matches = [m for m in re.finditer(r"x?\d+\:\d+|.---|\|", line)]
>>             if len(matches) != len(trainNums):
>>                 break
>>
>>             pos = matches[0].start()
>>             town = line[:pos].strip()
>>             times = [m.group() for m in matches]
>>
>>             if j >= len(timeRows):
>>                 timeRows.append([])
>>
>>             timeRows[j] += [town]+times if i==0 else times
>>
>>     yield trainNumRow
>>     yield ampmRow
>>     for row in timeRows:
>>         yield row
>>
>>  def make_table_file(filename, rows):
>>     import codecs
>>     fout = codecs.open(filename, 'w', 'utf-8')
>>     fout.write('<table border="1">')
>>     for row in rows:
>>         fout.write('<tr>')
>>         for v in row:
>>             if v.endswith('---'):   # get rid of the stupid \x97 char
>>                 v = '----'
>>             fout.write('<td>%s</td>' % v)
>>         fout.write('</tr>')
>>     fout.write('</table>')
>>     fout.close()
>>
>>  if __name__ == '__main__':
>>     tree = lh.parse('test.html')
>>     rows = get_rows(tree)
>>     make_table_file('table.html', rows)
>>
>>  _______________________________________________
>>  Chicago mailing list
>>  Chicago at python.org
>>  http://mail.python.org/mailman/listinfo/chicago
>>
>
>
>
> --
> Cosmin Stejerean
> http://blog.offbytwo.com
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> http://mail.python.org/mailman/listinfo/chicago
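
For anyone following along, here is a minimal sketch of the regex step
from Feihong's get_rows() above, applied one line at a time, with the
result dumped as JSON along the lines Cosmin mentions. The sample text is
made up (it is not copied from a Metra page), and the names parse_line and
CELL, as well as the town/times JSON layout, are illustrative choices and
not anything taken from the metratime repository.

```python
import json
import re

# Same alternatives as the pattern in the quoted get_rows():
# a time such as "5:40" or "x6:05", any one character followed by "---",
# or a bare "|".
CELL = re.compile(r"x?\d+:\d+|.---|\|")


def parse_line(line):
    """Split one schedule line into (town, cells); return None if no cells."""
    matches = list(CELL.finditer(line))
    if not matches:
        return None
    # Everything before the first matched cell is treated as the town name.
    town = line[:matches[0].start()].strip()
    # Normalize dash cells the same way the quoted make_table_file() does.
    cells = ["----" if m.group().endswith("---") else m.group()
             for m in matches]
    return town, cells


# Hypothetical sample text for illustration only.
SAMPLE = """\
Aurora          5:40   6:05   ----
Naperville      5:52    |     6:30
Chicago         6:45  x7:00   7:20
"""

rows = [parsed for parsed in map(parse_line, SAMPLE.splitlines()) if parsed]
schedule = [{"town": town, "times": times} for town, times in rows]

if __name__ == "__main__":
    print(json.dumps(schedule, indent=2))
```

Unlike get_rows(), this sketch does not check the number of cells against
the train-number header row, so a real page would still need that check
(and the handling for the odd page Cosmin remembers).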


