[Chicago] topics!

Clyde Forrester clydeforrester at gmail.com
Wed Jan 4 19:21:46 CET 2012


"I had to solve some interesting problems..."

Not being able to suck a 240MB file into memory, clean it, join it 
(another 240MB), and make a reverse complement (another 240MB) was one 
of the problems. So I went the other way and used an algorithm that 
acts on one character at a time, in a single pass, allowing for 
multiple framings and such. The advantage is that it is very memory 
efficient. It also wound up being very scalable.
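
To illustrate the one-character-at-a-time idea, here is a minimal Python sketch. It is not the actual program: the StreamCounter class, its feed method, and the fixed-size window are assumptions made for illustration, and "multiple framings" is read here simply as checking every overlapping window. Memory use stays proportional to the pattern length, not the file size.

```python
from collections import deque

# Watson-Crick base pairing used to build the reverse complement.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the complement of seq, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


class StreamCounter:
    """Hypothetical single-pass counter: holds only the last len(pattern) characters."""

    def __init__(self, pattern):
        # Match the pattern on either strand without storing a second copy of the data.
        self.targets = {pattern, reverse_complement(pattern)}
        self.window = deque(maxlen=len(pattern))
        self.count = 0

    def feed(self, ch):
        # Appending to a full deque silently drops the oldest character.
        self.window.append(ch.upper())
        if len(self.window) == self.window.maxlen and "".join(self.window) in self.targets:
            self.count += 1


counter = StreamCounter("GATACCA")
for ch in "ttGATACCAggTGGTATCaa":  # made-up test string, not genome data
    counter.feed(ch)
print(counter.count)
```

Because the counter never looks at more than seven characters at once, the same loop works on a short string or on a file of any size.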

As for the file layout: there are 25 text files. I loop through each 
file, each line, each character, in a single pass. It makes for a good 
textbook example.
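
The file, line, character nesting described above could look roughly like the following Python sketch. The directory name genome/*.txt, the iter_bases helper, and the skipping of lines that start with ">" (a FASTA-style header) are all assumptions; the actual file names and layout are not given here.

```python
import glob


def iter_bases(paths):
    """Yield one character at a time from each file, in order, in a single pass."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                # Assumption: header lines begin with ">" and carry no sequence data.
                if line.startswith(">"):
                    continue
                # strip() drops the newline so only sequence characters are yielded.
                for ch in line.strip():
                    yield ch


# Hypothetical location of the text files; an empty match simply yields nothing.
paths = sorted(glob.glob("genome/*.txt"))
total = sum(1 for _ in iter_bases(paths))
print(total)
```

Only one line is held in memory at a time. A matcher fed from this generator would also need to reset its state at each file boundary so that a match cannot straddle two files.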

The downside, for now, is that I can't do fuzzy matches the way I would 
like to. To solve that, I will probably build a machine with 16GB of 
memory, which will enable me to suck in the largest file at least 3 
times over. Sometimes brute force is the path of least resistance. Wake 
me up when I can afford it.
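
For comparison, a brute-force in-memory version might look like the Python sketch below. What "fuzzy" should mean is not spelled out here, so this assumes it means tolerating up to a fixed number of mismatched characters. The fuzzy_count function and the sample lines are hypothetical.

```python
# Translation table for complementing bases in one pass over a string.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the result (a full second copy in memory)."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]


def fuzzy_count(sequence, pattern, max_mismatches=1):
    """Count windows that differ from pattern in at most max_mismatches positions."""
    width = len(pattern)
    hits = 0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits += 1
    return hits


# Made-up lines standing in for a cleaned file; joining them makes another full copy.
lines = ["ttGATACCAgg", "GATTCCAaa"]
sequence = "".join(lines).upper()
print(fuzzy_count(sequence, "GATACCA", max_mismatches=0))
print(fuzzy_count(sequence, "GATACCA", max_mismatches=1))
print(fuzzy_count(reverse_complement(sequence), "GATACCA", max_mismatches=0))
```

Each step (read, join, reverse complement) keeps a complete copy of the sequence, which is where the memory goes.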

Clyde

Joshua Herman wrote:
> Clyde forgot to mention that since he couldn't load the whole human
> genome into memory he actually searches through the file on disk. At
> least when I talked with him I think that is what it does.
> ---Profile:---
> http://www.google.com/profiles/zitterbewegung
> 
> On Wed, Jan 4, 2012 at 10:31 AM, Clyde Forrester
> <clydeforrester at gmail.com> wrote:
>> I recently wrote a program to count the occurrences of "GATACCA" in the
>> human genome. I can do a brief talk on that. I had to solve some interesting
>> problems, and it provides an interesting example of text file reading,
>> compound lists, and some objectish methods.
>>
>> Clyde


