Need simple algorithm for newbie

Alex Martelli aleax at aleax.it
Mon Nov 4 07:48:49 EST 2002


Jason Tudisco wrote:

> I have a list of domains... Some of the domain names in the list look
> like this:
> 
> groups.goodle.com
> 
> The information I want is just google.com. I need to know the best way
> to do this.. for .com .net .org only.. and to strip the rest of the
> garbage.. like in this case.. get rid of groups in groups.google.com

This is slightly underspecified.  If what you want is: given a
list of strings, get for each the longest trailing substring 
that contains just one dot, assuming each string has at least
one dot:

newlist = []
for item in oldlist:
    pieces = item.split('.')
    newlist.append('.'.join(pieces[-2:]))

which can easily be compacted into a list-comprehension:

newlist = [ ''.join(item.split('.')[-2:]) for item in oldlist ]


This, however, has nothing to say about the "for .com .net
.org only" clause in your specs.  If that means you only want
in the newlist those items which end in .com, .net, or .org,
then:

of_interest = dict(zip('com net org'.split(), range(3))
newlist = []
for item in oldlist:
    pieces = item.split('.')
    if pieces[-1] in of_interest:
        newlist.append('.'.join(pieces[-2:]))

this is getting a bit too hairy for it to make sense to
squash it into a list-comprehension.


> I need to parse though a huge list so it has to be optimized algorithm
> 
> No need to write complete code.. Just get me in the right direccion..
> Still learning python and I am not sure what would be the fastest way
> to go about it..

I'm not sure either -- I just doubt there's much performance to
be gained by any of the several other possible approaches, such
as finding the second '.' from the right, RE's, and so on.  No
matter how huge is your list, saveing (say) 10% of runtime is no
big deal, and I don't think there can be much more than this in
terms of optimization one way or another.  The right approach
is: first, try the simplest thing that can possibly work (and
things can't be much simpler than the above snippets); measure
what performance you get this way; if this is unacceptable to
the customer, whip out the hotshot profiler to find out where
the time is going, and focus on the actual bottleneck (may not
be where you'd expect it to be: always MEASURE rather than GUESS).

Remember that, for just about any code, being inside a function
is a substantial speedup (because local variables are looked up
much faster than global ones), and so is using Python 2.3 (not
acceptable for production code yet -- it's still a pre-alpha from
CVS -- but, very promising for the near-future), or, if you're
stuck with 2.2.2 (as you should be for production-level code),
so it using -O on your command line.  These are "optimization
tricks" that aren't tricky, cost just about nothing, and get you
a few percent speedup (up to 10% or even more in some cases)
as easily as apple pie.


Alex




More information about the Python-list mailing list