[Tutor] HTML --> TXT?
Deirdre Saoirse
deirdre@deirdre.net
Wed, 29 Mar 2000 09:24:59 -0800 (PST)
On Wed, 29 Mar 2000, Curtis Larsen wrote:
> Is there a fairly simple Python-ish way to convert an HTML file to text?
>
> I'm not even talking about reformatting the text (where applicable)
> based on the tags -- though that'd be pretty cool -- I'm just looking
> for a good way to rip out the tags and make the file more
> human-friendly. HTMLLIB and URLLIB don't seem to have it -- is there
> another module that does this?
Funny you should ask. I had just done this late last night for my own
amusement. :) I picked a very simple way of solving the problem, which may
not be an optimal way, but it was good enough for my purposes.
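#!/usr/bin/python

def untag(line):
    """Return line with everything between '<' and '>' removed.

    The state is reset for each call, so a tag that is split across
    two lines is not handled.
    """
    startTag = '<'
    endTag = '>'
    copyChar = True    # False while we are inside a tag
    result = ''
    for i in line:
        if i == startTag:
            copyChar = False
        elif i == endTag:
            copyChar = True
        elif copyChar:
            result = result + i
    return result

def untaglines(file):
    """Strip the tags from every line of an open file; return the text."""
    result = ''
    lines = file.readlines()    # read from the file object passed in
    for i in lines:
        result = result + untag(i)
    return result

if __name__ == '__main__':
    import sys
    fname = sys.argv[1]
    f = open(fname, 'r')
    result = untaglines(f)
    f.close()
    print(result)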
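
The character scan above leaves entities such as &amp; untouched and keeps
the contents of <script> and <style> elements, since it only drops what sits
between '<' and '>'. If that matters, one possible alternative is to let a
real parser do the work. The following is only a rough sketch, assuming
Python 3 and its standard html.parser module; the names TextExtractor and
html_to_text are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text, skipping <script>/<style> bodies."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; in normal text
        super().__init__(convert_charrefs=True)
        self.parts = []        # pieces of text seen so far
        self.skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string (illustrative only)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()    # flush any text still buffered at end of input
    return "".join(parser.parts)
```

To use it on a file, read the whole file into a string and pass that to
html_to_text(). Unlike the line-by-line scan, this also copes with tags that
span several lines, because the parser sees the document as one string.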
--
_Deirdre * http://www.linuxcabal.org * http://www.deirdre.net
"The year after I was born, we walked on the moon. Now, 31 years later,
it's considered an impressive feat of science to grow tomatoes in low
Earth orbit." -- John Miles <ke5fx@qsl.net>