[Chicago] topics!
Clyde Forrester
clydeforrester at gmail.com
Wed Jan 4 19:21:46 CET 2012
"I had to solve some interesting problems..."
Not being able to suck a 240MB file into memory, clean it, join it
(another 240MB), and make a reverse complement (another 240MB), was one
of the problems. So I went the other way and used an algorithm which
would act one character at a time, in a single pass, allowing for
multiple framings and such. The advantage is that it is very memory
efficient. It also wound up being very scalable.
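For anyone who wants to picture the one-character-at-a-time idea, here is a minimal sketch in Python. It is not Clyde's actual program. The class name StreamCounter, the choice to skip line breaks, and the choice to reset the window on any non-ACGT character (such as N) are all assumptions made for illustration. It keeps only a window as long as the pattern, so memory use does not grow with the file, and it checks the forward pattern and its reverse complement in the same pass.

```python
from collections import deque

# Watson-Crick base pairing, used to build the reverse complement.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case ACGT string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


class StreamCounter:
    """Hypothetical helper: counts a pattern one character at a time."""

    def __init__(self, pattern):
        self.forward = pattern.upper()
        self.reverse = reverse_complement(self.forward)
        # The deque drops its oldest character automatically once full.
        self.window = deque(maxlen=len(self.forward))
        self.forward_hits = 0
        self.reverse_hits = 0

    def feed(self, ch):
        """Consume a single character from the input stream."""
        if ch in "\r\n":
            return  # assumption: line breaks only wrap the sequence
        ch = ch.upper()
        if ch not in COMPLEMENT:
            # Assumption: anything else (e.g. N) breaks a match.
            self.window.clear()
            return
        self.window.append(ch)
        if len(self.window) == self.window.maxlen:
            current = "".join(self.window)
            if current == self.forward:
                self.forward_hits += 1
            if current == self.reverse:
                self.reverse_hits += 1
```

Because every new character produces a fresh window, a match is found at any starting offset, including one that straddles a line break. The cleaning and joining steps happen implicitly inside feed(), so no second or third copy of the data is ever built.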
As for the file layout: there are 25 text files. I loop through each
file, each line, each character, in a single pass. It makes for a good
textbook example.
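The file, line, character nesting can be sketched roughly like this in Python. The function names iter_bases and count_per_file are made up for the example, and skipping lines that start with ">" assumes the text files carry FASTA-style header lines, which the post does not actually say.

```python
def iter_bases(path):
    """Yield sequence characters from one text file, one at a time."""
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # assumption: FASTA-style header line
            for ch in line.rstrip("\r\n"):
                yield ch


def count_per_file(paths, pattern):
    """Return {path: hits} for an exact pattern, one pass per file."""
    pattern = pattern.upper()
    totals = {}
    for path in paths:
        window = ""  # never longer than the pattern
        hits = 0
        for ch in iter_bases(path):
            # Slide the window forward by one character.
            window = (window + ch.upper())[-len(pattern):]
            if window == pattern:
                hits += 1
        totals[path] = hits
    return totals
```

Only one line of one file is held in memory at any moment, which is why this style scales to inputs much larger than RAM. The window is reset for each file, so a match is never counted across a file boundary.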
The downside, for now, is that I can't do fuzzy matches the way I would
like to. To solve that, I will probably build a machine with 16GB of
memory, which will enable me to suck in the largest file at least 3
times over. Sometimes brute force is the path of least resistance. Wake
me up when I can afford it.
Clyde
Joshua Herman wrote:
> Clyde forgot to mention that since he couldn't load the whole human
> genome into memory he actually searches through the file on disk. At
> least when I talked with him I think that is what it does.
> ---Profile:---
> http://www.google.com/profiles/zitterbewegung
>
> On Wed, Jan 4, 2012 at 10:31 AM, Clyde Forrester
> <clydeforrester at gmail.com> wrote:
>> I recently wrote a program to count the occurrences of "GATACCA" in the
>> human genome. I can do a brief talk on that. I had to solve some interesting
>> problems, and it provides an interesting example of text file reading,
>> compound lists, and some objectish methods.
>>
>> Clyde