Greetings,
We've released a new version of PySnpTools (0.4.19). Here are the new features:
* Bug fixed in the "Bgen" reader<https://fastlmm.github.io/PySnpTools/#module-pysnptools.distreader>. It has now been tested on files as large as 487,400 individuals x 4,840,000 SNPs (the size of the UK Biobank imputed genotype data). It should work with even larger files.
* New option for reading from "Bed"<https://fastlmm.github.io/PySnpTools/#snpreader-bed> files directly into "int8" arrays. This saves 3x memory and time compared to reading into a "float32" array and converting. (A short usage sketch follows this list.)
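For example, reading in the new way might look like the following sketch. The file names are placeholders, and it assumes the int8 option is exposed through the dtype argument of read():

    import numpy as np
    from pysnptools.snpreader import Bed
    from pysnptools.distreader import Bgen

    # Dosages from a (placeholder) BGEN file; indexing reads only a slice.
    bgen = Bgen("example.bgen")
    dist_data = bgen[:100, :1000].read()

    # Genotypes from a (placeholder) PLINK Bed file, read directly as int8
    # instead of reading float32 and converting afterwards.
    bed = Bed("example.bed", count_A1=False)
    snp_data = bed.read(dtype="int8")
    assert snp_data.val.dtype == np.int8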
- Carl
Carl Kadie, Ph.D.
FaST-LMM & PySnpTools Team
(Microsoft Research, retired)
https://fastlmm.github.io/
Join the FaST-LMM user discussion and announcement list via email<mailto:fastlmm-user-join@python.org?subject=Subscribe> (or use web sign up<https://mail.python.org/mailman3/lists/fastlmm-user.python.org>)
I’m happy to announce new releases of FaST-LMM<https://pypi.org/project/fastlmm/> and PySnpTools<https://pypi.org/project/pysnptools/>. (These releases have been my “work” since I retired last summer.)
The new releases update both packages to work with the newest versions of Pandas, NumPy, and scikit-learn.
The new FaST-LMM release includes single_snp_scale, which allows FaST-LMM to use a cluster and scale to 1 million individuals. See Kadie and Heckerman, bioRxiv 2018<https://www.biorxiv.org/content/10.1101/154682v2> for background. Similar tools would require 100,000 computers to scale this much, but FaST-LMM needs “only” a cluster of 100 computers. (The code can run on any cluster, but to run on a particular cluster we must create a module detailing how to automate batch jobs and move files.)
The new PySnpTools release adds support for cluster-sized data, including the following (a brief usage sketch follows the list):
* snpreader.SnpGen<https://fastlmm.github.io/PySnpTools/#pysnptools.snpreader.SnpGen>: Generate synthetic SNP data on the fly.
* snpreader.SnpMemMap<https://fastlmm.github.io/PySnpTools/#pysnptools.snpreader.SnpMemMap>: Support larger in-memory data via on-disk memory mapping.
* snpreader.DistributedBed<https://fastlmm.github.io/PySnpTools/#pysnptools.snpreader.DistributedBed>: Split Bed<https://fastlmm.github.io/PySnpTools/#pysnptools.snpreader.Bed>-like data into multiple files for more efficient cluster use.
* util.mapreduce1<https://fastlmm.github.io/PySnpTools/#module-pysnptools.util.mapreduce1>: Run loops in parallel on multiple processes, threads, or clusters.
* util.filecache<https://fastlmm.github.io/PySnpTools/#module-pysnptools.util.filecache>: Automatically copy files to and from any remote storage.
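As a rough sketch of how two of these pieces fit together (the constructor arguments and runner settings below are assumptions based on the linked documentation, not a verbatim recipe):

    from pysnptools.snpreader import SnpGen
    from pysnptools.util.mapreduce1 import map_reduce
    from pysnptools.util.mapreduce1.runner import LocalMultiProc

    # Synthetic genotypes generated on the fly; sizes here are illustrative.
    snp_gen = SnpGen(seed=332, iid_count=1_000, sid_count=5_000)

    def mean_of_block(block_index, block_size=500):
        # Read one block of SNPs and return its mean genotype value.
        start = block_index * block_size
        snp_data = snp_gen[:, start:start + block_size].read()
        return snp_data.val.mean()

    # Run the blocks on multiple processes; other runners target threads or clusters.
    block_means = map_reduce(
        range(10),
        mapper=mean_of_block,
        runner=LocalMultiProc(taskcount=4),
    )
    print(block_means)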
FaST-LMM and PySnpTools were originally developed and open sourced at Microsoft Research. Active development is now based at https://fastlmm.github.io/.
Roadmap:
I plan to continue working on FaST-LMM and PySnpTools. We’d like to run a giant job on real, rather than synthetic, data. We’d like to compare it to other fast methods that we suspect sacrifice accuracy. I’d like to port it from Python 2 to Python 3. (More to-dos: analyze multiple traits in one run, analyze pairs of DNA locations using the single-DNA-location tools, …)
Contacts:
Email the developers at fastlmm-dev@python.org<mailto:fastlmm-dev@python.org>.
Join<mailto:fastlmm-user-join@python.org?subject=Subscribe> the user discussion and announcement list (or use web sign up<https://mail.python.org/mailman3/lists/fastlmm-user.python.org>).
Yours,
Carl
Carl Kadie, Ph.D.
FaST-LMM Team
Greetings,
I'm happy to announce that the latest versions of PySnpTools and Bed-Reader now work with Python 3.9 (as well as 3.7 and 3.8). See https://fastlmm.github.io/.
(FaST-LMM still supports only Python 3.7 and 3.8. We're waiting for Anaconda to release a Python 3.9 Miniconda.)
- Carl
Carl Kadie, Ph.D.
FaST-LMM & PySnpTools Team<https://fastlmm.github.io/>
(Microsoft Research, retired)
https://www.linkedin.com/in/carlk/
Greetings,
The latest version of FaST-LMM<https://pypi.org/project/fastlmm/> includes GPU support. Specifically, you can do GWAS on a GPU via the single_snp function.
To use it:
* You must have an NVIDIA GPU.
* Install the cupy library into your Python environment (which requires the CUDA library). See https://cupy.dev/ for details.
* When you call single_snp, set the new 'xp' parameter to 'cupy'. Alternatively, set the ARRAY_MODULE environment variable to cupy. (A short example follows this list.)
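For example, a minimal call might look like this (the input file names are placeholders; 'xp' and ARRAY_MODULE are the options described above):

    import os
    from fastlmm.association import single_snp

    # Option 1: ask single_snp to use cupy (GPU) arrays directly.
    results_df = single_snp(test_snps="test.bed", pheno="pheno.txt", xp="cupy")

    # Option 2: select the array module via an environment variable instead.
    os.environ["ARRAY_MODULE"] = "cupy"
    results_df = single_snp(test_snps="test.bed", pheno="pheno.txt")

    print(results_df.head())  # results come back as a Pandas DataFrame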
In my experience, running on a GPU is faster than running on a single CPU, but not as fast as running on multiple CPUs. On my computer (with 6 processors and an RTX 2060 GPU), I was able, with difficulty, to get a 15% to 20% speed-up by splitting the work between the GPU and the CPUs.
If you have multiple GPUs or much faster GPUs than mine, let us know; we may be able to advise you on getting more of a speed-up.
Aside: I have a new article in Towards Data Science (medium.com) about writing GPU-optional Python code (free link<https://towardsdatascience.com/gpu-optional-python-be36a02b634d?sk=38843937…>)
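The gist of that GPU-optional pattern, as a generic sketch (not code taken from the article):

    # Use cupy when it is available; otherwise fall back to NumPy on the CPU.
    import numpy as np
    try:
        import cupy as xp
    except ImportError:
        xp = np

    a = xp.arange(1_000_000, dtype=xp.float32)
    total = float(a.sum())  # the same code runs on GPU or CPU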
- Carl
Carl Kadie, Ph.D.
FaST-LMM & PySnpTools Team<https://fastlmm.github.io/>
(Microsoft Research, retired)
https://www.linkedin.com/in/carlk/
Join the FaST-LMM user discussion and announcement list via email<mailto:fastlmm-user-join@python.org?subject=Subscribe> (or use web sign up<https://mail.python.org/mailman3/lists/fastlmm-user.python.org>)