extract Infobox contents

Tue Apr 7 07:46:18 EDT 2009

On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain  
> <anishchapagain at gmail.com> wrote:
> 
> > Hi,
> > I was trying to extract Wikipedia Infobox contents, which are in the
> > format given below, from a URL page opened in Python.
> >
> > {{ Infobox Software
> > | name                   = Bash
> > | logo                   = [[Image:bash-org.png|165px]]
> > | screenshot             = [[Image:Bash demo.png|250px]]
> > | caption                = Screenshot of bash and [[Bourne shell|sh]] sessions demonstrating some features
> > | developer              = [[Chet Ramey]]
> > | latest release version = 4.0
> > | latest release date    = {{release date|mf=yes|2009|02|20}}
> > | programming language   = [[C (programming language)|C]]
> > | operating system       = [[Cross-platform]]
> > | platform               = [[GNU]]
> > | language               = English, multilingual ([[gettext]])
> > | status                 = Active
> > | genre                  = [[Unix shell]]
> > | source model           = [[Free software]]
> > | license                = [[GNU General Public License]]
> > | website                = [http://tiswww.case.edu/php/chet/bash/bashtop.html Home page]
> > }} // up to this line
> >
> > I need to extract all data between {{ Infobox ...to }}
> >
> > Thanks if anyone can help.
> > I am trying with
> >
> > s1='{{ Infobox'
> > s2=len(s1)
> > pos1=data.find("{{ Infobox")
> > pos2=data.find("\n",pos1)
> >
> > pat1=data.find("}}")
> >
> > but I end up getting only the first line.
> 
> How are you getting your data?  Assuming that you can arrange to get
> it one line at a time, here's a quick and dirty way to extract the
> infoboxes on a page.
> 
> infoboxes = []
> infobox = []
> reading_infobox = False
> 
> for line in feed_me_lines_somehow():
>     if line.startswith("{{ Infobox"):
>         reading_infobox = True
>     if reading_infobox:
>         infobox.append(line)
>         # Only a closing line seen while inside an infobox ends it.
>         if line.startswith("}}"):
>             reading_infobox = False
>             infoboxes.append(infobox)
>             infobox = []
> 
> You end up with 'infoboxes' containing a list of all the infoboxes
> on the page, each held as a list of the lines of their content.
> For safety's sake you really should be using regular expressions
> rather than 'startswith', but I leave that as an exercise for the
> reader :-)
> 

I agree that startswith isn't the right option, but for matching two
constant characters, I don't think re is necessary.  I'd just do:

if '}}' in line:
    pass

Then, as the saying goes, you only have one problem.
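
One caveat with either test: the sample infobox has a nested template on
the "latest release date" line, so a check like '}}' in line would also
fire there. A minimal sketch of one way around that, assuming the whole
page source is available as a single string (as `data` is in the original
post), is to count brace depth instead. The function name
`extract_infoboxes` and the sample text are made up for illustration, and
triple-brace parameters such as {{{x}}} are not handled.

```python
def extract_infoboxes(text, marker="{{ Infobox"):
    """Return each infobox in `text`, from its marker to the matching '}}'."""
    boxes = []
    start = text.find(marker)
    while start != -1:
        depth = 0
        i = start
        end = -1
        while i < len(text) - 1:
            pair = text[i:i + 2]
            if pair == "{{":
                depth += 1          # entering a (possibly nested) template
                i += 2
            elif pair == "}}":
                depth -= 1          # leaving a template
                i += 2
                if depth == 0:      # this '}}' closes the infobox itself
                    end = i
                    break
            else:
                i += 1
        if end == -1:
            break                   # unbalanced braces: stop rather than guess
        boxes.append(text[start:end])
        start = text.find(marker, end)
    return boxes


# Hypothetical sample, cut down from the infobox in the original post.
sample = (
    "intro\n"
    "{{ Infobox Software\n"
    "| name = Bash\n"
    "| latest release date = {{release date|mf=yes|2009|02|20}}\n"
    "| status = Active\n"
    "}}\n"
    "trailing {{other}} text\n"
)

for box in extract_infoboxes(sample):
    print(box)
```

Calling box.splitlines() on each result gives the same list-of-lines
shape as the line-by-line version quoted above.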

Cheers,
Cliff