Taking data from a text file to parse html page

Anthra Norell anthra.norell at tiscalinet.ch
Fri Aug 25 15:56:34 EDT 2006


You presumably write your own programs (program_name.py), then import and run them. You can put SE.PY and SEL.PY into the same
directory as those programs. That's all.
      Or, if you prefer to keep other people's stuff in a different directory, just make sure that directory is in "sys.path",
because that is where import looks. Check whether the directory is in the sys.path list:

>>> import sys
>>> sys.path
['C:\\Python24\\Lib\\idlelib', 'C:\\', 'C:\\PYTHON24\\DLLs', 'C:\\PYTHON24\\lib', 'C:\\PYTHON24\\lib\\plat-win',
'C:\\PYTHON24\\lib\\lib-tk'     (... etc)    ]

Supposing it isn't there, add it:

>>> sys.path.append ('/python/code/other_peoples_stuff')
>>> import SE
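
The check-and-append steps can also be wrapped in a small helper, so the directory is never added twice. This is only a sketch: ensure_on_path is a made-up name, and the directory is the same example path as above, not a real location.

```python
import sys

def ensure_on_path(directory):
    """Append directory to sys.path unless it is already listed."""
    if directory not in sys.path:
        sys.path.append(directory)
    return sys.path

# Example path from the message above; substitute the directory
# that actually holds the SE files.
ensure_on_path('/python/code/other_peoples_stuff')

# import SE   # should now be found, provided the module file is in that directory
```

On a case-sensitive file system the files probably need a lowercase extension (SE.py, SEL.py) before import or the setup script can find them, which is likely what the "file SE.py (for module SE) not found" messages quoted below are about.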

That should do it. Let me know if it works. Else just keep asking.

Frederic


----- Original Message -----
From: "DH" <dylanhughes at gmail.com>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Friday, August 25, 2006 4:40 AM
Subject: Re: Taking data from a text file to parse html page


> SE looks very helpful... I'm having a hell of a time installing it
> though:
>
> -----------------------------------------------------------------------------------------
>
> foo at foo:~/Desktop/SE-2.2$ sudo python SETUP.PY install
> running install
> running build
> running build_py
> file SEL.py (for module SEL) not found
> file SE.py (for module SE) not found
> file SEL.py (for module SEL) not found
> file SE.py (for module SE) not found
>
> ------------------------------------------------------------------------------------------
> Anthra Norell wrote:
> > You may also want to look at this stream editor:
> >
> > http://cheeseshop.python.org/pypi/SE/2.2%20beta
> >
> > It allows multiple replacements in a definition format of utmost simplicity:
> >
> > >>> your_example = '''
> > <div><p><em>"Python has been an important part of Google since the
> > beginning, and remains so as the system grows and evolves.
> > "</em></p>
> > <p>-- Peter Norvig, <a class="reference"
> > '''
> > >>> import SE
> > >>> Tag_Stripper = SE.SE ('''
> >          "~<(.|\n)*?>~="   # This pattern finds all tags and deletes them (replaces with nothing)
> >          "~<!--(.|\n)*?-->~="   # This pattern deletes comments entirely even if they nest tags
> >          ''')
> > >>> print Tag_Stripper (your_example)
> >
> > "Python has been an important part of Google since the
> > beginning, and remains so as the system grows and evolves.
> > "
> > -- Peter Norvig, <a class="reference"
> >
> > Now you see a tag fragment. So you add another deletion to the Tag_Stripper (***):
> >
> > Tag_Stripper = SE.SE ('''
> >          "~<(.|\n)*?>~="   # This pattern finds all tags and deletes them (replaces with nothing)
> >          "~<!--(.|\n)*?-->~="   # This pattern deletes comments entirely even if they nest tags
> >          "<a class\="reference"="    # *** This deletes the fragment
> >          # "-- Peter Norvig, <a class\="reference"="  # Or like this if Peter Norvig has to go too
> >        ''')
> > >>> print Tag_Stripper (your_example)
> >
> > "Python has been an important part of Google since the
> > beginning, and remains so as the system grows and evolves.
> > "
> > -- Peter Norvig,
> >
> > The quotation marks (&quot; entities in the HTML source) you can either translate or delete:
> >
> > Tag_Stripper = SE.SE ('''
> >          "~<(.|\n)*?>~="   # This pattern finds all tags and deletes them (replaces with nothing)
> >          "~<!--(.|\n)*?-->~="   # This pattern deletes comments entirely even if they nest tags
> >          "<a class\="reference"="    # This deletes the fragment
> >          # "-- Peter Norvig, <a class=\\"reference\\"="  # Or like this if Peter Norvig has to go too
> >          htm2iso.se     # This is a file (contained in the SE package) that translates all ampersand codes.
> >                               # Naming the file is all you need to do to include the replacements which it defines.
> >        ''')
> >
> > >>> print Tag_Stripper (your_example)
> >
> > 'Python has been an important part of Google since the
> > beginning, and remains so as the system grows and evolves.
> > '
> > -- Peter Norvig,
> >
> > If instead of "htm2iso.se" you write "&quot;=" you delete the quotation marks and your output will be:
> >
> > Python has been an important part of Google since the
> > beginning, and remains so as the system grows and evolves.
> >
> > -- Peter Norvig,
> >
> > Your Tag_Stripper also does files:
> >
> > >>> print Tag_Stripper ('my_file.htm', 'my_file_without_tags')
> > 'my_file_without_tags'
> >
> >
> > A stream editor is not a substitute for a parser. It does, however, handle simple translation jobs like this one more
> > economically, because a parser does a lot of work that you don't need here.
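
For comparison, the two tag-deleting definitions above map roughly onto the standard re module as follows. This is a sketch of the same idea, not a description of how SE works internally; strip_tags and the sample string are made up for illustration.

```python
import re

# Comments are removed first, so tags nested inside a comment go with it.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Non-greedy match from "<" to the next ">", across line breaks.
TAG = re.compile(r"<.*?>", re.DOTALL)

def strip_tags(text):
    """Delete HTML comments, then any remaining complete tags."""
    text = COMMENT.sub("", text)
    return TAG.sub("", text)

sample = '<div><p><em>Python</em></p>\n<p>-- Peter Norvig, <a class="reference"\n'
print(strip_tags(sample))
```

As with the first Tag_Stripper, the unterminated <a class="reference" fragment survives, because it has no closing ">" for the tag pattern to match.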
> >
> > Regards
> >
> > Frederic
> >
> >
> > ----- Original Message -----
> > From: "DH" <dylanhughes at gmail.com>
> > Newsgroups: comp.lang.python
> > To: <python-list at python.org>
> > Sent: Thursday, August 24, 2006 7:41 PM
> > Subject: Re: Taking data from a text file to parse html page
> >
> >
> > > I found this
> > >
> > > http://groups.google.com/group/comp.lang.python/browse_thread/thread/d1bda6ebcfb060f9/ad0ac6b1ac8cff51?lnk=gst&q=replace+text+file&rnum=8#ad0ac6b1ac8cff51
> > >
> > > Credit Jeremy Moles
> > > -----------------------------------------------
> > >
> > > finds = ("{", "}", "(", ")")
> > > lines = file("foo.txt", "r").readlines()
> > >
> > > cleaned = []
> > > for line in lines:
> > >         for find in finds:
> > >                 # str.replace returns a new string, so keep the result
> > >                 line = line.replace(find, "")
> > >         cleaned.append(line)
> > >
> > > print cleaned
> > >
> > > -----------------------------------------------
> > >
> > > I want something like
> > > -----------------------------------------------
> > >
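> > > # read the keywords once, without their trailing newlines
> > > finds = [f.rstrip("\n") for f in file("replace.txt") if f.strip()]
> > > lines = file("foo.txt", "r").readlines()
> > >
> > > cleaned = []
> > > for line in lines:
> > >         for find in finds:
> > >                 # str.replace returns a new string, so keep the result
> > >                 line = line.replace(find, "")
> > >         cleaned.append(line)
> > >
> > > print cleaned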
> > >
> > > -----------------------------------------------
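
A sketch of what the request above seems to call for: read the keywords from a file once, then delete each of them from the page text. The function names are made up, and the sketch assumes one keyword per line in replace.txt. Note that str.replace returns a new string instead of changing the old one, so the result has to be assigned back.

```python
def load_keywords(path):
    """Read one keyword per line; blank lines are skipped."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]

def remove_keywords(text, keywords):
    """Return text with every non-empty keyword deleted."""
    # Longest first, so a short keyword cannot break up a longer one.
    for keyword in sorted(keywords, key=len, reverse=True):
        if keyword:
            text = text.replace(keyword, "")   # replace() returns a new string
    return text

# Possible usage with the file names from the original question:
# keywords = load_keywords("replace.txt")
# with open("index.html") as page:
#     cleaned = remove_keywords(page.read(), keywords)
# with open("something.txt", "w") as out:
#     out.write(cleaned)
```

Working on the whole text rather than line by line also lets a keyword match across a line break, provided the keyword itself contains none.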
> > >
> > >
> > >
> > > Fredrik Lundh wrote:
> > > > DH wrote:
> > > >
> > > > > I have a plain text file containing the html and words that I want
> > > > > removed(keywords) from the html file, after processing the html file it
> > > > > would save it as a plain text file.
> > > > >
> > > > > So the program would import the keywords, remove them from the html
> > > > > file and save the html  file as something.txt.
> > > > >
> > > > > I would post the data but it's secret. I can post an example:
> > > > >
> > > > > index.html (html page)
> > > > >
> > > > > "
> > > > > <div><p><em>"Python has been an important part of Google since the
> > > > > beginning, and remains so as the system grows and evolves.
> > > > > "</em></p>
> > > > > <p>-- Peter Norvig, <a class="reference"
> > > > > "
> > > > >
> > > > > replace.txt (keywords)
> > > > > "
> > > > > <div id="quote" class="homepage-box">
> > > > >
> > > > > <div><p><em>"
> > > > >
> > > > > "</em></p>
> > > > >
> > > > > <p>-- Peter Norvig, <a class="reference"
> > > > >
> > > > > "
> > > > >
> > > > > something.txt(file after editing)
> > > > >
> > > > > "
> > > > >
> > > > > Python has been an important part of Google since the beginning, and
> > > > > remains so as the system grows and evolves.
> > > > > "
> > > >
> > > > reading and writing files is described in the tutorial; see
> > > >
> > > >      http://pytut.infogami.com/node9.html
> > > >
> > > > (scroll down to "Reading and Writing Files")
> > > >
> > > > to do the replacement, you can use repeated calls to the "replace" method
> > > >
> > > >      http://pyref.infogami.com/str.replace
> > > >
> > > > but that may cause problems if the replacement text contains things that
> > > > should be replaced.  for an efficient way to do a "parallel" replace, see:
> > > >
> > > >      http://effbot.org/zone/python-replace.htm#multiple
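
One common way to do such a "parallel" replace is a single regular expression built from all the keys, with a callback that looks up each match. The sketch below illustrates that idea under its own assumptions (a non-empty dictionary of plain strings); it is not copied from the page linked above, and multiple_replace is a name chosen here.

```python
import re

def multiple_replace(text, replacements):
    """Apply all replacements in one pass; replaced text is not rescanned."""
    # Assumes replacements is a non-empty dict with non-empty string keys.
    # Longer keys first, so they win over shorter keys that are their prefixes.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: replacements[match.group(0)], text)

print(multiple_replace("a b", {"a": "b", "b": "a"}))   # prints: b a
```

Repeated calls to replace would turn "a b" into "b b" and then into "a a"; the single pass swaps the two letters as intended.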
> > > >
> > > >
> > > > </F>
> > >
> > > --
> > > http://mail.python.org/mailman/listinfo/python-list
>



