Beginner Q. interrogate html object OR file search?
markgrahamnz at gmail.com
Thu Dec 3 06:32:38 CET 2009
On Dec 3, 4:19 pm, inhahe <inh... at gmail.com> wrote:
> or i guess you could go the middle-way and just use regex.
> people generally say don't use regex for html (regex can't do the
> nesting), but it's what i would do in this case.
> though i don't exactly understand the question, re the html file
> parsing script you say you have already, or how the date is 'modified
> from' the meta-data.
> On Wed, Dec 2, 2009 at 10:24 PM, Mark G <markgraha... at gmail.com> wrote:
> > Hi all,
> > I am new to python and don't yet know the libraries well. What would
> > be the best way to approach this problem: I have a html file parsing
> > script - the file sits on my harddrive. I want to extract the date
> > modified from the meta-data. Should I read through lines of the file
> > doing a string.find to look for the character patterns of the meta-
> > tag, or should I use a DOM type library to retrieve the html element I
> > want? Which is best practice? which occupies least code?
> > Regards, Mark
> > --
I'm tempted to use regex too. I have done a bit of perl & bash, and
that is how I would do it with those.
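
For comparison, the regex route could look roughly like the sketch below. It is only an illustration under stated assumptions: the tag is written with name before content, both values are quoted, and the whole file has been read into one string. The names META_DATE_RE and find_date_modified are made up for this example.

```python
import re

# Assumption: the tag looks like <meta name="DC.Date.Modified" content="...">,
# with name before content and both values quoted. Other layouts will not match.
META_DATE_RE = re.compile(
    r'<meta\s+name=["\']DC\.Date\.Modified["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def find_date_modified(html_text):
    """Return the content value of the first matching meta tag, or None."""
    match = META_DATE_RE.search(html_text)
    return match.group(1) if match else None
```

This is short, but it silently returns None if the attributes appear in a different order, which is the usual argument against regex for html.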
However, I thought there would be a smarter way to do it with
libraries. I have done some digging through the libraries and think I
can do it with xml.dom.minidom. Here is what I want to try...
import xml.dom.minidom

# if html file already exists, inherit the created date
# 'output' is the filename for the parsed file
# note: minidom only parses well-formed XML, so the file must be valid XHTML
html_xml = xml.dom.minidom.parse(output)
for node in html_xml.getElementsByTagName('meta'):  # visit every <meta /> node
    #debug print node.toxml()
    # getAttribute returns "" when the attribute is missing, so <meta>
    # tags without a name (e.g. http-equiv ones) are simply skipped
    if node.getAttribute("name") == "DC.Date.Modified":
        date_created = node.getAttribute("content")
I haven't quite got up to here in my debugging. I'll let you know if
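
If the file turns out not to be well-formed XML, minidom will raise a parse error. A possible alternative, sketched here and not tested against the real files, is the standard library's tolerant HTML parser. The sketch assumes Python 3, where the module is html.parser (in Python 2 it is named HTMLParser). The class name MetaDateParser, the function extract_date_modified, and the sample markup with its date are invented for illustration.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class MetaDateParser(HTMLParser):
    """Remember the content of the first <meta name="DC.Date.Modified">."""

    def __init__(self, wanted_name="DC.Date.Modified"):
        super().__init__()
        self.wanted_name = wanted_name
        self.date_modified = None

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing <meta ... /> tags also end up
        # here, because the default handle_startendtag calls this method.
        if tag != "meta" or self.date_modified is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == self.wanted_name:
            self.date_modified = attr_map.get("content")


def extract_date_modified(html_text):
    """Return the DC.Date.Modified value found in html_text, or None."""
    parser = MetaDateParser()
    parser.feed(html_text)
    parser.close()
    return parser.date_modified


# Made-up sample: note the unclosed <p> and <br>, which minidom would reject.
sample = """<html><head>
<meta http-equiv="Content-Type" content="text/html">
<meta name="DC.Date.Modified" content="2009-12-03">
<title>demo</title></head><body><p>unclosed paragraph<br></body></html>"""
print(extract_date_modified(sample))
```

For a file on disk, the text would be read first, e.g. extract_date_modified(open(output).read()), where 'output' is the filename as in the snippet above.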
RE: your questions. 1) I already have the script in bash - wanting to
convert to Python :) I'm half way through
I want to extract the value of the tag <metadata