Text over multiple lines

William Park opengeometry at yahoo.ca
Mon Jun 21 07:06:50 CEST 2004

Rigga <Rigga at hasnomail.com> wrote:
> On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:
> > Rigga <Rigga at hasnomail.com> writes:
> >> I am using the HTMLParser to parse a web page, part of the routine
> >> I need to write (I am new to Python) involves looking for a
> >> particular tag and once I know the start and the end of the tag
> >> then to assign all the data in between the tags to a variable, this
> >> is easy if the tag starts and ends on the same line however how
> >> would I go about doing it if it's split over two or more lines?
> > 
> > I often have variants of this problem too. The simplest way to make
> > it work is to read all the HTML in at once with a single call to
> > file.read(), and then use a regular expression. Note that you
> > probably don't need re.MULTILINE, although you should take a look at
> > what it means in the docs just to know.
> > 
> > This works fine as long as you expect your files to be relatively
> > small (under a meg or so).
> I'm reading the entire file into a variable at the moment and passing
> it through HTMLParser.  I have run into another problem that I am
> having a hard time working out. My data is in this format:
>         <TD><SPAN class=qv id=EmployeeNo
>         title="Employee Number">123456</SPAN></TD></TR>
> Sometimes the data is spread over 3 lines like:
>         <TD><SPAN class=qv id=BusinessName
>         title="Business Name">Some Shady Business
>         Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>
> The data I need to get is the data enclosed in quotes after the word
> title= and the data after the > and before the </SPAN, which in the case
> above would be: Some Shady Business Group Ltd.


1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is

	<SPAN class=qv id=BusinessName
	title="Business Name">Some Shady Business
	Group Ltd.</SPAN>

    with parenthesized groups giving

	submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
	submatch[2]='Some Shady Business\nGroup Ltd.'

2. Split submatch[1] into

	title="Business Name"


Python solution:

    Write a Python script.
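
    A minimal sketch of steps 1 and 2 above in Python, using only the
    standard re module.  The names extract_spans, SPAN_RE and ATTR_RE are
    illustrative and not from the original post.  It assumes the tag is
    written as uppercase SPAN, that no attribute value contains '>', and
    that attributes look like name=value or name="value".

```python
import re

text = '''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'''

# Step 1: [^>]* and [^<]* are negated character classes, so they match
# newlines as well; neither re.DOTALL nor re.MULTILINE is needed.
SPAN_RE = re.compile(r'<SPAN ([^>]*)>([^<]*)</SPAN>')

# Step 2: split the attribute string into name=value pairs, where the
# value is either double-quoted or a bare run of non-space characters.
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def extract_spans(html):
    """Return a list of (attributes dict, body text) for each SPAN."""
    results = []
    for m in SPAN_RE.finditer(html):
        attrs = {}
        for name, quoted, bare in ATTR_RE.findall(m.group(1)):
            attrs[name] = quoted or bare
        # Collapse line breaks and runs of spaces in the body.
        body = ' '.join(m.group(2).split())
        results.append((attrs, body))
    return results


for attrs, body in extract_spans(text):
    print(attrs.get('title'), '|', body)
```

    Reading the whole file with a single read() call and passing the
    string to extract_spans() handles tags split over any number of lines.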

Bash solution:

    First, you need my patched Bash which can be found at


    You need to patch the Bash shell, and compile.  It has many Python
    features, particularly regex and array.  Shell solution is

	text='<TD><SPAN class=qv id=BusinessName
	title="Business Name">Some Shady Business
	Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'

	newf () {	# Usage: newf match submatch1 submatch2
	    eval $2	# --> class, id, title
	    echo $title > title
	    echo $3 > name
	}
	array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
	cat title
	cat name

    I can explain the steps, but it's rather long. :-)

William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
No, I will not fix your computer!  I'll reformat your harddisk, though.
