Text over multiple lines

William Park opengeometry at yahoo.ca
Mon Jun 21 01:06:50 EDT 2004


Rigga <Rigga at hasnomail.com> wrote:
> On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:
> 
> > Rigga <Rigga at hasnomail.com> writes:
> >> I am using the HTMLParser to parse a web page, part of the routine
> >> I need to write (I am new to Python) involves looking for a
> >> particular tag and once I know the start and the end of the tag
> >> then to assign all the data in between the tags to a variable, this
> >> is easy if the tag starts and ends on the same line however how
> >> would I go about doing it if it's split over two or more lines?
> > 
> > I often have variants of this problem too. The simplest way to make
> > it work is to read all the HTML in at once with a single call to
> > file.read(), and then use a regular expression. Note that you
> > probably don't need re.MULTILINE, although you should take a look at
> > what it means in the docs just to know.
> > 
> > This works fine as long as you expect your files to be relatively
> > small (under a meg or so).
> 
> I'm reading the entire file into a variable at the moment and passing
> it through HTMLParser.  I have run into another problem that I am
> having a hard time working out.  My data is in this format:
> 
>         <TD><SPAN class=qv id=EmployeeNo
>         title="Employee Number">123456</SPAN></TD></TR>
> 
> Sometimes the data is spread over 3 lines like:
> 
>         <TD><SPAN class=qv id=BusinessName
>         title="Business Name">Some Shady Business
>         Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>
> 
> The data I need to get is the data enclosed in quotes after the word
> title= and the data after the > and before the </SPAN, which in the
> case above would be: Some Shady Business Group Ltd.

Approach:

1. Match the regex '<SPAN ([^>]*)>([^<]*)</SPAN>' against the text.  It matches

	<SPAN class=qv id=BusinessName
	title="Business Name">Some Shady Business
	Group Ltd.</SPAN>

    with the parenthesized groups giving

	submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
	submatch[2]='Some Shady Business\nGroup Ltd.'

2. Split submatch[1] on whitespace outside the quotes, into

	class=qv
	id=BusinessName
	title="Business Name"
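
For comparison, here is a minimal sketch of the two steps above in Python.
It assumes Python 3 and that the whole page is already in one string.  The
names SPAN_RE and parse_spans are made up for this illustration, and using
shlex.split for step 2 is one convenient choice, not the only way to split
the attributes.  The pattern is case-sensitive and expects every attribute
to have the form name=value.

```python
import re
import shlex

# Step 1: [^>]* and [^<]* also match newlines, so a tag that spans several
# lines is matched without any special flags.
SPAN_RE = re.compile(r'<SPAN ([^>]*)>([^<]*)</SPAN>')


def parse_spans(html):
    """Return one dict per <SPAN ...>text</SPAN> found in html."""
    results = []
    for match in SPAN_RE.finditer(html):
        # Step 2: shlex.split breaks on whitespace (newlines included) and
        # strips the double quotes around values such as "Business Name".
        # Assumption: every attribute looks like name=value.
        attrs = dict(item.split('=', 1) for item in shlex.split(match.group(1)))
        # Collapse the line break inside the text into a single space.
        attrs['text'] = ' '.join(match.group(2).split())
        results.append(attrs)
    return results


text = '''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'''

for span in parse_spans(text):
    print(span['title'])   # Business Name
    print(span['text'])    # Some Shady Business Group Ltd.
```

Each dict also carries the class and id attributes, so the same loop can
pick out EmployeeNo, BusinessName, and so on by checking span['id'].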

Homework:

    Write a Python script.

Bash solution:

    First, you need my patched Bash which can be found at

	http://freshmeat.net/projects/bashdiff/

    Apply the patch to the Bash source and compile it.  The patched shell
    has many Python-like features, particularly regex and array handling.
    The shell solution is

	text='<TD><SPAN class=qv id=BusinessName
	title="Business Name">Some Shady Business
	Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'

	newf () {	# Usage: newf match submatch1 submatch2
	    eval $2	# --> class, id, title
	    echo $title > title
	    echo $3 > name
	}
	x=()
	array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
	cat title
	cat name

    I can explain the steps, but it's rather long. :-)

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
No, I will not fix your computer!  I'll reformat your harddisk, though.


