[Tutor] database question [files can be iterated on]

Mon Jul 28 15:31:47 2003

On Mon, 28 Jul 2003 jpollack@socrates.Berkeley.EDU wrote:

> > Out of curiosity, do you mean GBrowse from the GMOD project?
> >
>      http://gmod.sourceforge.net/
>
> No, the Genome Browser at UCSC.  I'll have a look at the site above... is
> it a collection of python tools for bioinformatics?

Unfortunately not Python.  GMOD is an umbrella project headed by many of
the main biological databases.  I'm working on the Publication Literature
component ("Pubsearch"), which is mostly in Java (with some Jython and
Perl sprinkled in there...)

I'm doing a CC to the rest of the Tutor group, since your next question is
a very good one:

> > ###
> > for line in myfile:
> >     print line
> > ###
> >
> > will work perfectly well regardless of file size, because we only keep a
> > single line in memory at a time.
>
> Ok.  Stupid Question time.  in the above example, doesn't "myfile" have
> to be read in as a file ahead of time?  (e.g.
> myfile=open('blahblah.txt').read() ?)

No, it's actually perfectly fine to just open up a file:

###
myfile = open('blahblah.txt')
###

myfile is an open file, and we can think of it as a source of information.
We can then progressively read its information from it in a loop.  We can
do it line by line like this:

###
for line in myfile:
    ....
###

It's important to see that this process is very different from:

###
myfile = open('blahblah.txt').read()
###

This can be confusing because the variable is named 'myfile', yet it's
not a file at all: it's a string holding the contents of that file.
Hmmm... that sounds confusing.  *grin*

Ok, let's try this.  It helps if we split up the expression into two
lines:

###
myfile = open('blahblah.txt')
contents = myfile.read()
###

In this case, 'myfile' is the open file, and 'contents' is the string we
get back when we ask Python to read the whole contents of myfile at once.
Does the distinction make sense?
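
Here's a small sketch of that distinction that can be run as-is.  It's
written for a current Python 3 interpreter (my assumption, since print and
a few other things differ from 2.2), and the temporary file is only a
stand-in for 'blahblah.txt':

```python
import os
import tempfile

# Make a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first line\nsecond line\n")

myfile = open(path)        # an open file object; nothing has been read yet
contents = myfile.read()   # one string holding the entire file
myfile.close()

is_string = isinstance(contents, str)
print(is_string)           # True
print(len(contents))       # 23

os.remove(path)
```

So 'myfile' is the source we read from, and 'contents' is what we got out
of it.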

The same reasoning applies to:

###
myfile = open('blahblah.txt')
lines = myfile.readlines()
###

and is something we should try to avoid if we expect our files to be
huge.  'lines' here will become a list of every line in myfile, so the
whole file sits in memory at once.
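
To make the contrast concrete, here's a rough sketch (again assuming
Python 3, with a made-up 1000-line temporary file) that gets the same
answers both ways.  Only the readlines() version has to hold every line
at once:

```python
import os
import tempfile

# Build a sample file with 1000 short lines.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(1000):
        f.write("record %d\n" % i)

# readlines(): all 1000 lines are held in one list.
myfile = open(path)
lines = myfile.readlines()
myfile.close()

# Looping over the file: we only look at the current line.
count = 0
longest = 0
myfile = open(path)
for line in myfile:
    count += 1
    longest = max(longest, len(line))
myfile.close()

print(len(lines), count, longest)   # 1000 1000 11

os.remove(path)
```

For a 1000-line file the difference doesn't matter, but the loop keeps
working the same way when the file is too big to fit in memory.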

The key idea is that we'd like to treat our file as an "iterator", where
we only need to pay attention to the current line, or get the "next" line.
This approach saves memory, since it only has to hold a single line at a
time.
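
As a sketch of what "get the next line" means, we can ask an open file
for one line at a time by hand.  This assumes Python 3, where the builtin
is spelled next(); the two-line temporary file is just for illustration:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("alpha\nbeta\n")

myfile = open(path)
first = next(myfile)     # 'alpha\n'
second = next(myfile)    # 'beta\n'

# When there are no more lines, the file signals it with StopIteration.
try:
    next(myfile)
    exhausted = False
except StopIteration:
    exhausted = True
myfile.close()

print(repr(first), repr(second), exhausted)   # 'alpha\n' 'beta\n' True

os.remove(path)
```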

Python's files support iteration, and most container-like things (lists,
strings, dictionaries) can be turned into iterators by using the iter()
builtin function.  Explicitly, this looks like:

###
myfile = open('blahblah.txt')
for line in iter(myfile):
    ...
###

And the 'for' loop has been designed to "march" across any iterator, so
things work out.  In fact, as of Python 2.2, it implicitly does an iter()
on the object it's marching across.  So we can shorten it to:

###
myfile = open('blahblah.txt')
for line in myfile:
    ...
###
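
Here's a quick sketch checking that the explicit and implicit forms really
do the same thing.  As before, this assumes Python 3 and uses a small
temporary file of my own invention:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("one\ntwo\nthree\n")

# Explicit iter() call.
myfile = open(path)
explicit = []
for line in iter(myfile):
    explicit.append(line)
myfile.close()

# Implicit: the 'for' loop calls iter() for us.
myfile = open(path)
same_object = iter(myfile) is myfile   # a file is its own iterator
implicit = []
for line in myfile:
    implicit.append(line)
myfile.close()

# iter() works on other things too, such as lists.
it = iter([10, 20, 30])
first = next(it)    # 10
rest = list(it)     # [20, 30], picking up where the iterator left off

print(explicit == implicit, same_object, first, rest)
# True True 10 [20, 30]

os.remove(path)
```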

David Mertz has written a small introduction to the concept of an
iterator:

    http://www-106.ibm.com/developerworks/library/l-pycon.html

Please feel free to ask questions about this.

> What are you working on at Berkeley?  It's always good to know some
> locals.  :)

I'm actually working at the Carnegie Institution of Washington:

    http://carnegiedpb.stanford.edu/

So I'm on the other side of the Bay at the moment.  Sorry!  But we're
going to have a BayPiggies meeting sometime next month; drop by if you
have time:

    http://www.baypiggies.net/

> Thanks for the reply, I shall try and get mySQL working along with the
> python interface.  Incidentally, do you know if mySQL has a Windows
> version?

You can check on:

    http://mysql.com/

I'm pretty sure that the MySQL 4 server can run on Windows.  But how well
it runs is another question.  *grin*

Best of wishes to you!