extract Infobox contents
J. Clifford Dyer
jcd at sdf.lonestar.org
Tue Apr 7 07:46:18 EDT 2009
On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
> <anishchapagain at gmail.com> wrote:
>
> > Hi,
> > I am trying to extract Wikipedia Infobox contents, in the format
> > shown below, from a page opened by URL in Python.
> >
> > {{ Infobox Software
> > | name = Bash
> > | logo = [[Image:bash-org.png|165px]]
> > | screenshot = [[Image:Bash demo.png|250px]]
> > | caption = Screenshot of bash and [[Bourne shell|sh]] sessions demonstrating some features
> > | developer = [[Chet Ramey]]
> > | latest release version = 4.0
> > | latest release date = {{release date|mf=yes|2009|02|20}}
> > | programming language = [[C (programming language)|C]]
> > | operating system = [[Cross-platform]]
> > | platform = [[GNU]]
> > | language = English, multilingual ([[gettext]])
> > | status = Active
> > | genre = [[Unix shell]]
> > | source model = [[Free software]]
> > | license = [[GNU General Public License]]
> > | website = [http://tiswww.case.edu/php/chet/bash/bashtop.html Home page]
> > }} //upto this line
> >
> > I need to extract all data between {{ Infobox ...to }}
> >
> > Thanks if anyone can help.
> > I am trying with
> >
> > s1='{{ Infobox'
> > s2=len(s1)
> > pos1=data.find("{{ Infobox")
> > pos2=data.find("\n",pos1)
> >
> > pat1=data.find("}}")
> >
> > but am ending up getting one line at top only.
>
> How are you getting your data? Assuming that you can arrange to get
> it one line at a time, here's a quick and dirty way to extract the
> infoboxes on a page.
>
> infoboxes = []
> infobox = []
> reading_infobox = False
>
> for line in feed_me_lines_somehow():
>     if line.startswith("{{ Infobox"):
>         reading_infobox = True
>     if reading_infobox:
>         infobox.append(line)
>         if line.startswith("}}"):
>             reading_infobox = False
>             infoboxes.append(infobox)
>             infobox = []
>
> You end up with 'infoboxes' containing a list of all the infoboxes
> on the page, each held as a list of the lines of their content.
> For safety's sake you really should be using regular expressions
> rather than 'startswith', but I leave that as an exercise for the
> reader :-)
>
I agree that startswith isn't the right option, but for matching two
constant characters, I don't think re is necessary. I'd just do:
    if '}}' in line:
        pass
Then, as the saying goes, you only have one problem.
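
One caveat with a plain substring test: the sample infobox contains a
nested template ({{release date|...}}) whose closing braces sit on a line
inside the box, so a bare '}}' check can end the box early. As a rough
sketch only, assuming the whole page text is already in a single string
(called `data` in the original post) and that boxes open with the literal
text "{{ Infobox", one could track brace depth instead. The function name
extract_infoboxes is made up for illustration, and this ignores other
wikitext constructs such as triple braces.

```python
def extract_infoboxes(data):
    """Return each '{{ Infobox ... }}' block found in data as a string.

    Hypothetical helper: counts '{{' / '}}' pairs so that nested
    templates inside the infobox do not terminate it early.
    """
    marker = "{{ Infobox"  # assumed opening text, as in the sample
    boxes = []
    start = data.find(marker)
    while start != -1:
        depth = 0
        end = -1
        i = start
        while i < len(data) - 1:
            pair = data[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    end = i  # index just past the matching '}}'
                    break
            else:
                i += 1
        if end == -1:
            break  # unbalanced braces: give up on this box
        boxes.append(data[start:end])
        start = data.find(marker, end)
    return boxes


sample = (
    "intro\n"
    "{{ Infobox Software\n"
    "| name = Bash\n"
    "| latest release date = {{release date|mf=yes|2009|02|20}}\n"
    "| status = Active\n"
    "}}\n"
    "trailing text\n"
)
boxes = extract_infoboxes(sample)
for box in boxes:
    print(box)
```

Each entry in the returned list is the full text of one infobox, which
can then be split on newlines if per-line handling is wanted.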
Cheers,
Cliff