[Python-Dev] RE: companies data for sorting comparisons

Tim Peters tim@zope.com
Sat, 10 Aug 2002 02:56:10 -0400


Update:  With the last batch of checkins, all sorts on Kevin's company
database are faster (a little to a killer lot) under 2.3a0 than under 2.2.1.

A reminder of what this looks like:

> A record looks like this after running his script to turn them
> into Python dicts:
>
>   {'Address': '395 Page Mill Road\nPalo Alto, CA 94306',
>    'Company': 'Agilent Technologies Inc.',
>    'Exchange': 'NYSE',
>    'NumberOfEmployees': '41,000',
>    'Phone': '(650) 752-5000',
>    'Profile': 'http://biz.yahoo.com/p/a/a.html',
>    'Symbol': 'A',
>    'Web': 'http://www.agilent.com'}
>
> It appears to me that the XML file is maintained by hand, in order
> of ticker symbol.  But people make mistakes when alphabetizing
> by hand, and there are 37 indices i such that
>
>     data[i]['Symbol'] > data[i+1]['Symbol']
>
> So it's "almost sorted" by that measure ...
> The proper order of Yahoo profile URLs is also strongly correlated
> with ticker symbol, while both the company name and web address
> look weakly correlated
> [and Address, NumberOfEmployees, and Phone are essentially
>  randomly ordered]
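
The "37 indices" measure quoted above is just a count of adjacent pairs that
are out of order.  A minimal sketch of that count, assuming data is a list of
dicts that all carry the field (count_descents is a made-up name, not
something from Kevin's script):

```python
def count_descents(data, fieldname):
    """Count indices i where data[i][fieldname] > data[i+1][fieldname].

    Hypothetical helper; assumes every record has `fieldname`.
    A fully sorted list gives 0, so a small count means "almost sorted".
    """
    return sum(1 for a, b in zip(data, data[1:])
               if a[fieldname] > b[fieldname])


sample = [{'Symbol': 'A'}, {'Symbol': 'C'}, {'Symbol': 'B'}, {'Symbol': 'D'}]
descents = count_descents(sample, 'Symbol')   # only the 'C' > 'B' pair counts
```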

Here are the latest (and I expect the last) timings, in milliseconds per
sort, on the list of (key, index, record) tuples

    values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]

[I wrote a little generator to simulate 2.3's enumerate() in 2.2.1]
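
A stand-in along those lines might look like the sketch below.  This is a
guess at the shape of such a generator, not the code actually used for the
timings, and enumerate_compat is a made-up name:

```python
def enumerate_compat(iterable):
    """Yield (index, item) pairs, like 2.3's enumerate().

    Hypothetical stand-in.  Under 2.2.1 the module would also need
    `from __future__ import generators` at the top for `yield` to work.
    """
    i = 0
    for item in iterable:
        yield i, item
        i += 1


pairs = list(enumerate_compat(['A', 'AA', 'AAPL']))
```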

There are 6635 companies in the database, but not all fields are present in
all records; .get() plugs in a key of None for those cases, and the index
keeps equal-key cases from falling back on an expensive dict comparison to
break the tie (each record x is a dict!):

Sorting on field 'Address'
    2.2.1:  41.57
    2.3a0:  40.96

Sorting on field 'Company'
    2.2.1:  40.14
    2.3a0:  29.79

Sorting on field 'Exchange'
    2.2.1:  53.83
    2.3a0:  24.79

Sorting on field 'NumberOfEmployees'
    2.2.1:  47.89
    2.3a0:  45.74

Sorting on field 'Phone'
    2.2.1:  48.09
    2.3a0:  47.15

Sorting on field 'Profile'
    2.2.1:  58.41
    2.3a0:   8.77

Sorting on field 'Symbol'
    2.2.1:  40.78
    2.3a0:   6.30

Sorting on field 'Web'
    2.2.1:  46.79
    2.3a0:  35.64
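
For anyone who wants to try the (key, index, record) decoration on a current
Python, here is a rough sketch.  It is an adaptation, not the code that was
timed: Python 3 refuses to order None against a string, so the sketch puts an
extra "key is present" flag in front to send missing keys first, which is
where the 2.x comparison rules put None.  The names decorate and
sort_by_field are made up for illustration.

```python
def decorate(data, fieldname):
    """Build (has_key, key, index, record) tuples for sorting.

    Assumption: the leading has_key flag is only there so that Python 3
    never has to compare None with a str; the original used
    (key, index, record) directly.
    """
    decorated = []
    for i, x in enumerate(data):
        key = x.get(fieldname)          # None when the field is missing
        decorated.append((key is not None, key, i, x))
    return decorated


def sort_by_field(data, fieldname):
    """Return the records ordered by fieldname, ties kept in input order."""
    decorated = decorate(data, fieldname)
    # The index is unique, so tuple comparison is always settled before it
    # reaches the record dicts; the dicts themselves are never compared.
    decorated.sort()
    return [x for _, _, _, x in decorated]


sample = [{'Symbol': 'B', 'Web': 'b.com'},
          {'Symbol': 'A'},
          {'Symbol': 'C', 'Web': 'a.com'},
          {'Symbol': 'D'}]
by_web = [x['Symbol'] for x in sort_by_field(sample, 'Web')]
```

The two records with no 'Web' field come out first, in their original
relative order, because the index settles the tie.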

This may have been sorted more times by now than any other database on Earth
<wink>.