Beginner Q. interrogate html object OR file search?

Mark G markgrahamnz at gmail.com
Thu Dec 3 00:32:38 EST 2009


On Dec 3, 4:19 pm, inhahe <inh... at gmail.com> wrote:
> or i guess you could go the middle-way and just use regex.
> people generally say don't use regex for html (regex can't do the
> nesting), but it's what i would do in this case.
> though i don't exactly understand the question, re the html file
> parsing script you say you have already, or how the date is 'modified
> from' the meta-data.
>
> On Wed, Dec 2, 2009 at 10:24 PM, Mark G <markgraha... at gmail.com> wrote:
> > Hi all,
>
> > I am new to python and don't yet know the libraries well. What would
> > be the best way to approach this problem: I have a html file parsing
> > script - the file sits on my harddrive. I want to extract the date
> > modified from the meta-data. Should I read through lines of the file
> > doing a string.find to look for the character patterns of the meta-
> > tag, or should I use a DOM type library to retrieve the html element I
> > want? Which is best practice? which occupies least code?
>
> > Regards, Mark
> > --
> >http://mail.python.org/mailman/listinfo/python-list
>
>

I'm tempted to use regex too. I have done a bit of perl & bash, and
that is how I would do it with those.

However, I thought there would be a smarter way to do it with
libraries. I have done some digging through the libraries and think I
can do it with xml.dom.minidom. Here is what I want to try...

	# if html file already exists, inherit the created date
        # 'output' is the filename for the parsed file
	html_xml = xml.dom.minidom.parse(output)
	for node in html_xml.getElementsByTagName('meta'):  # visit every
node <meta />
		#debug print node.toxml()
		metatag_type = nodes.attributes["name"]
		if metatag_type.name == "DC.Date.Modified":
			metatag_date = nodes.attributes["content"]
			date_created = metatag_date.value()
	print date_created

I haven't quite got up to hear in my debugging. I'll let you know if
it works...

RE: your questions. 1) I already have the script in bash - wanting to
convert to Python :) I'm half way through
I want to extract the value of the tag <metadata
name="DC.Date.Modified" value="2009-11-17">




More information about the Python-list mailing list