[Chicago] Closest Index

Sun Jan 6 23:33:52 CET 2013

On Sun, Jan 6, 2013 at 7:32 AM, Oren Livne <livne at uchicago.edu> wrote:
> Hi Brian,
>
> I would love to! Unfortunately I can never attend on Thursday nights due to
> another obligation. If I ever get the chance I'll let you know. In fact I
> think the discussion should be expanded more generally to python problems
> arising in genetic applications.
>
> Shelia: the data sets are public. The A-array is in each of the files of
> http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz
>
> The B-arrays are the subset of positions on the product
> http://www.affymetrix.com/browse/products.jsp?productId=131532&navMode=34000&navAction=jump&aId=productsNav#1_1
> I don't know if they have a public download for their marker list. Or maybe
> the AWS data set has them - look for Affymetrix chip 5.0 or 6.0.
>
> Yes, there would be natural applications for map-reduce parallelization. Not
> this particular task, but other far-more extensive tasks. Would be great to
> discuss in the ChiPy meeting. This is truly a great mailing list.

Hmm, since you are already in the large-data-set regime, you might want
to look into the trie structure [1]. Its lookup cost grows with the
length of the key rather than with the number of stored entries, so it
does better in big-O terms, but with a larger constant factor. Since
your data is in the millions (billions?), this might be worth it,
especially since this is genome data (which I'm guessing is stored in
a hash-table-type structure).
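
To make the idea concrete, here is a rough sketch of a trie in Python,
assuming the positions are non-negative integers and keying on their
decimal digits. The class name DigitTrie and the sample positions are
made up for illustration and are not taken from the data sets above.
It only does exact lookups; a closest-match search would need extra
traversal logic on top of this.

```python
class DigitTrie:
    """Trie keyed on the decimal digits of non-negative integers."""

    END = "$"  # sentinel key marking the end of a stored position

    def __init__(self):
        self.root = {}

    def insert(self, position, index):
        """Store `index` under the digits of `position`."""
        node = self.root
        for digit in str(position):
            # Walk down one level per digit, creating nodes as needed.
            node = node.setdefault(digit, {})
        node[self.END] = index

    def lookup(self, position):
        """Return the stored index for `position`, or None if absent."""
        node = self.root
        for digit in str(position):
            node = node.get(digit)
            if node is None:
                return None
        return node.get(self.END)


# Hypothetical positions standing in for an A-array.
positions = [1050, 1057, 2300]
trie = DigitTrie()
for i, pos in enumerate(positions):
    trie.insert(pos, i)

print(trie.lookup(1057))  # prints 1
```

Each lookup touches one node per digit, so the cost depends on how
many digits a position has, not on how many positions are stored. The
nested dicts are where the larger constant factor comes from.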

[1] - http://en.wikipedia.org/wiki/Trie