question about

Felix Thibault felixt at dicksonstreet.com
Mon Dec 20 02:22:45 EST 1999


My basic question is: 
	If I have a large nested dictionary whose keys are different slices of a
	large string, which are all in a much smaller set of strings, is there an
	advantage to building my dictionary like:

		dict[names[text[leftslice:rightslice]]] = value 

	instead of:
	
		dict[text[leftslice:rightslice]] = value

	so that all keys 'spam' refer to the same string object instead of
	separate but identical ones?

Thanks-
	Felix

Here are the details:

I have a class which takes in an html document like:

#test.html------------------------------------------------
<HTML>
<HEAD>
<TITLE>T E S T D O C</TITLE>
</HEAD>
<BODY>
<H1>This is only a test</H1>
<UL>
    <LI>This is only a list
    <LI>Me too!
</UL>
</BODY>
</BODY>
</HTML>

and returns a UserDict object like:

{'html': {'<range>': (0, 28),
          'body': {'<range>': (10, 24),
                   'h1': {'<range>': (12, 14)},
                   'ul': {'<range>': (16, 22),
                          'li': {1: {'<range>': (20, 20)},
                                 0: {'<range>': (18, 18)}}}},
          'head': {'title': {'<range>': (4, 6)},
                   '<range>': (2, 8)}}}

where the '<range>'s are the indexes of the tuples in a TextTools taglist
that go with <tag> and </tag> (if it exists). The dictionary is made by
recursion over a list like [ ['html', 0, 28],...]. Originally I took the
name in each of these sublists from the slice of the document where the
tag name is:

	name = lowertext[tagtuple[3][0][1]:tagtuple[3][0][2]]

but since the TextTools documentation of the tagging engine begins "Marking
certain parts of a text should not involve storing hundreds of small
strings," this always bothered me. I already had a function that returned a
dictionary with all the html tag names and using it to change the line
above to:
	
	name = id[lowertext[tagtuple[3][0][1]:tagtuple[3][0][2]]]

didn't seem to cost me anything, performance-wise. The document I've been
using for benchmarking has 998 tags, and about 20 tagnames. Am I gaining
anything by making the dictionary the new way versus the original way?
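
One way to see what the lookup changes is to count the distinct string
objects each approach keeps alive. This is only a minimal sketch, not the
class described above: `lowertext`, `positions`, the contents of `names`
and the helper `distinct_objects` are made-up stand-ins, and the count
printed for plain slices reflects CPython behaviour rather than a language
guarantee. `sys.intern` is shown as a possible alternative to keeping a
table by hand.

```python
import sys

# Hypothetical stand-ins for the post's data: a lowercased document and the
# (left, right) slice positions of each tag name inside it.
lowertext = "<ul><li>a<li>b<li>c</ul>"
positions = [(1, 3), (5, 7), (10, 12), (15, 17)]

# Canonical table mapping each known tag name to one shared string object.
# This plays the role of `names` (called `id` later in the post); it is
# named `names` here so the builtin id() stays usable.
names = {}
for tag in ("html", "head", "title", "body", "h1", "ul", "li"):
    names[tag] = tag


def distinct_objects(strings):
    """Count how many separate string objects the list refers to."""
    return len({id(s) for s in strings})


# Original way: keep whatever string object each slice produces.
sliced = [lowertext[left:right] for left, right in positions]

# New way: use the slice only as a lookup key and keep the canonical object.
shared = [names[lowertext[left:right]] for left, right in positions]

# Alternative (assumption: any tag name may be added, not only known ones):
# let the interpreter's intern table supply the canonical object.
interned = [sys.intern(lowertext[left:right]) for left, right in positions]

print(sliced)
print(distinct_objects(sliced),
      distinct_objects(shared),
      distinct_objects(interned))
```

Both ways build equal keys, so lookups behave the same. The difference is
only in how many string objects stay alive: with the table, the temporary
slices can be freed and every 'li' key points at one object. For a document
with 998 tags and about 20 tagnames, that would mean roughly 20 name
strings instead of up to 998, which is a small saving next to the nested
dictionaries themselves.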




