Binary search tree
Scott SA
pydev at rscorp.ab.ca
Mon Nov 12 04:17:28 EST 2007
On 11/12/07, Michel Albert (exhuma at gmail.com) wrote:
>On Nov 9, 11:45 pm, Bruno Desthuilliers
><bdesth.quelquech... at free.quelquepart.fr> wrote:
>> maxim.no... at gmail.com wrote:
>>
>> > Hi,
>>
>> > I have to get a list of URLs one by one and find the URLs that
>> > appear more than once (none can appear more than twice).
>>
>> > I thought of putting them into a binary search tree; that way
>> > they'll be sorted and I'll be able to check whether a URL already
>> > exists.
>>
>> What about a set?
>>
>> s = set()
>> for url in urls:
>>     if url in s:
>>         print "already have ", url
>>     else:
>>         s.add(url)
>
>Interesting. For this case I have usually used dicts. As in:
>
>d = {}
>for url in urls:
>    if url in d:
>        print "already have ", url
>    else:
>        d[url] = 1
>
>Now, I can see that this method carries some superfluous data (the `1`
>assigned to each key), so I suppose it is less memory-efficient. But is
>it any slower? Both implementations use a hash of the URL to access the
>data. Just asking out of curiosity ;)
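For what it's worth, both a set lookup and a dict key lookup are a single
hash probe, so I wouldn't expect a measurable difference either way. If
you're curious, a rough timeit sketch along these lines (made-up URL
strings, numbers will vary by machine) would tell you on your own box:

import timeit

# build a synthetic workload: 1000 URLs, each appearing twice
setup = "urls = ['url%d' % i for i in range(1000)] * 2"

set_stmt = '''
s = set()
for url in urls:
    if url not in s:
        s.add(url)
'''

dict_stmt = '''
d = {}
for url in urls:
    if url not in d:
        d[url] = 1
'''

print "set :", timeit.Timer(set_stmt, setup).timeit(number=100)
print "dict:", timeit.Timer(dict_stmt, setup).timeit(number=100)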
I'm pretty new at Python, but I found this recipe in a cookbook (thanks, O'Reilly, for a useful book). I also found it online later, when the cookbook wasn't handy:
<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66516>
<http://mail.python.org/pipermail/tutor/2005-March/036966.html>
... google for more
It uses the 'setdefault' function and works _really_ well.
<http://docs.python.org/lib/typesmapping.html>
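For reference, d.setdefault(key, default) returns d[key] if the key is
already there, and otherwise inserts default and returns that new value;
a quick interactive session shows the behaviour:

>>> d = {}
>>> d.setdefault('a', []).append(1)
>>> d
{'a': [1]}
>>> d.setdefault('a', []).append(2)
>>> d
{'a': [1, 2]}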
Basically it looks like this:

urls = ['www.somewhere.com', 'www.somewherelse.com', 'www.somewherelse.com']

unique_urls = {}
for url in urls:
    unique_urls.setdefault(url)

... which creates a dict of None's*:

{'www.somewhere.com': None, 'www.somewherelse.com': None}
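As an aside, dict.fromkeys() builds the same dict of None's in a single
call, though the explicit loop generalizes better:

unique_urls = dict.fromkeys(urls)   # same dict of None's, in one step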
and "harvesting" is trivial:
url_keys = unique_urls.keys()
If you would find an index and count useful, you could do something like
this instead (starting from a fresh dict, since the None values above
would break the append):

unique_urls = {}
for idx in range(len(urls)):
    unique_urls.setdefault(urls[idx], []).append(idx)

... which gives:

{'www.somewhere.com': [0], 'www.somewherelse.com': [1, 2]}
Later, the URLs, their index positions, and hit counts can be recovered
with:

url_keys = unique_urls.keys()
for url_key in url_keys:
    url_idxs = unique_urls[url_key]
    url_hits = len(url_idxs)
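Putting it together for the original problem (reporting only the URLs
that show up more than once), a minimal sketch could be, using the same
made-up sample list:

urls = ['www.somewhere.com', 'www.somewherelse.com', 'www.somewherelse.com']

# map each URL to the list of positions where it was seen
unique_urls = {}
for idx, url in enumerate(urls):
    unique_urls.setdefault(url, []).append(idx)

# a URL with more than one recorded index is a duplicate
for url, idxs in unique_urls.items():
    if len(idxs) > 1:
        print "duplicate:", url, "seen at", idxs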
Depending upon the data you may have associated with your URLs, or whatever else you're working with, this is a very sweet 'shortcut'.
I hope this helps,
Scott
* To be truly Python-esque, 'habits' would be employed and the value would have been 'Nun', but I guess the namesake can only be carried so far (^8