extract Infobox contents

J. Cliff Dyer jcd at sdf.lonestar.org
Wed Apr 8 17:02:30 EDT 2009


On Wed, 2009-04-08 at 01:57 +0100, Rhodri James wrote:
> On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer  
> <jcd at sdf.lonestar.org> wrote:
> 
> > On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
> >> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
> >> <anishchapagain at gmail.com> wrote:
> >>
> >> > Hi,
> >> > I was trying to extract wikipedia Infobox contents which is in format
> >> > like given below, from the opened URL page in Python.
> >> >
> >> > {{ Infobox Software
> >> > | name                   = Bash
> [snip]
> >> > | latest release date    = {{release date|mf=yes|2009|02|20}}
> >> > | programming language   = [[C (programming language)|C]]
> >> > | operating system       = [[Cross-platform]]
> >> > | platform               = [[GNU]]
> >> > | language               = English, multilingual ([[gettext]])
> >> > | status                 = Active
> [snip some more]
> >> > }} //upto this line
> >> >
> >> > I need to extract all data between {{ Infobox ...to }}
> 
> [snip still more]
> 
> >> You end up with 'infoboxes' containing a list of all the infoboxes
> >> on the page, each held as a list of the lines of their content.
> >> For safety's sake you really should be using regular expressions
> >> rather than 'startswith', but I leave that as an exercise for the
> >> reader :-)
> >>
> >
> > I agree that startswith isn't the right option, but for matching two
> > constant characters, I don't think re is necessary.  I'd just do:
> >
> > if '}}' in line:
> >     pass
> >
> > Then, as the saying goes, you only have one problem.
> 
> That would be the problem of matching lines like:
> 
>   | latest release date    = {{release date|mf=yes|2009|02|20}}
> 
> would it? :-)
> 

That's the one.
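
To make that failure concrete, here is a small sketch. The two sample
lines are copied from the infobox quoted above; the variable names are
only illustrative.

```python
# Sample lines from the quoted infobox; the names are illustrative only.
field_line = "| latest release date    = {{release date|mf=yes|2009|02|20}}"
closing_line = "}} //upto this line"

# The substring test is true for BOTH lines, so a loop that stops on it
# would end the infobox at the release-date field, not at the real close.
substring_hits = ["}}" in field_line, "}}" in closing_line]
print(substring_hits)
```

Both tests come back true, which is exactly why the substring check on
its own cannot tell a nested template from the end of the infobox.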

> A quick bit of timing suggests that:
> 
>    if line.lstrip().startswith("}}"):
>        pass
> 
> is what we actually want.
> 

Indeed.  Thanks.
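
For the archive, a minimal sketch of how that test might fit into the
whole extraction. It is not the code snipped from earlier in the
thread. The function name extract_infoboxes is made up for
illustration, and it assumes, as in the sample, that "{{ Infobox"
opens a line and the closing "}}" starts a line of its own. Nested
templates that span several lines are not handled.

```python
def extract_infoboxes(text):
    """Return a list of infoboxes, each held as a list of its content lines.

    Assumption: "{{ Infobox" begins a line and the closing "}}" begins a
    later line, as in the sample from the original post.
    """
    infoboxes = []
    current = None  # lines of the infobox being collected, or None
    for line in text.splitlines():
        stripped = line.lstrip()
        if current is None:
            # Look for an opening "{{" followed (after optional spaces)
            # by the word "Infobox".
            if stripped.startswith("{{") and \
                    stripped[2:].lstrip().startswith("Infobox"):
                current = []
        elif stripped.startswith("}}"):
            # A line that *starts* with "}}" closes the infobox; a "}}"
            # later in a field line (a nested template) does not.
            infoboxes.append(current)
            current = None
        else:
            current.append(line)
    return infoboxes
```

Called on the page text, it returns one list of lines per infobox, and
the "latest release date" line stays inside the box instead of ending
it.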



