[Tutor] HTML --> TXT?

Deirdre Saoirse deirdre@deirdre.net
Wed, 29 Mar 2000 09:24:59 -0800 (PST)


On Wed, 29 Mar 2000, Curtis Larsen wrote:

> Is there a fairly simple Python-ish way to convert an HTML file to text?
> 
> I'm not even talking about reformatting the text (where applicable)
> based on the tags -- though that'd be pretty cool -- I'm just looking
> for a good way to rip out the tags and make the file more
> human-friendly.  HTMLLIB and URLLIB don't seem to have it -- is there
> another module that does this?

Funny you should ask. I had just done this late last night for my own
amusement. :) I picked a very simple way of solving the problem, which may
not be an optimal way, but it was good enough for my purposes.


#!/usr/bin/python

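def untag(line):
	# Copy only the characters that fall outside <...> pairs.
	# The state resets on every call, so a tag that is split
	# across two lines is not handled.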
	true = 1
	false = 0
	startTag = '<'
	endTag = '>'
	copyChar = true
	result = ''
	
	for i in line:
		if i == startTag:
			copyChar = false
		elif i == endTag:
			copyChar = true
		elif copyChar == true:
			result = result + i

	return result


def untaglines(infile):
	# Strip the tags from every line of an already-open file object.
	result = ''
	
	lines = infile.readlines()
	
	for i in lines:
		result = result + untag(i)
	
	return result
	
if __name__ == '__main__':
	import sys
	fname = sys.argv[1]
	
	f = open(fname, 'r')
	
	result = untaglines(f)
	print(result)
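
One caveat with the script above: untag() starts fresh on every line, so a
tag that wraps across two lines leaks into the output, and entities such
as &amp; are left as they are. Below is a rough sketch of one way around
that, assuming a current Python 3 and its standard html.parser module
(neither is what the script above uses). The names TextExtractor and
html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, skipping <script>/<style> bodies."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into '&'.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)
```

Because the whole document is fed to the parser as one string, it does not
matter where the line breaks fall inside a tag. To use it on a file,
something like html_to_text(open(fname).read()) should do.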

-- 
_Deirdre   *   http://www.linuxcabal.org   *   http://www.deirdre.net
"The year after I was born, we walked on the moon. Now, 31 years later,
it's considered an impressive feat of science to grow tomatoes in low 
Earth orbit."                           -- John Miles <ke5fx@qsl.net>