[Tutor] little something in the way of file parsing

Fri, 19 Jul 2002 11:49:55 -0700 (PDT)

So in Debian we have a file called the 'available' file.  It lists the packages
that are in a particular Debian release and looks like this:

<snip>
Package: telnet
Priority: standard
Section: net
Installed-Size: 208
Maintainer: Herbert Xu <herbert@debian.org>
Architecture: i386
Source: netkit-telnet
Version: 0.17-18
Replaces: netstd
Provides: telnet-client
Depends: libc6 (>= 2.2.4-4), libncurses5 (>= 5.2.20020112a-1)
Filename: pool/main/n/netkit-telnet/telnet_0.17-18_i386.deb
Size: 70736
MD5sum: 7eb82b4facdabe95a8235993abe210f6
Description: The telnet client.
 The telnet command is used for interactive communication with another host
 using the TELNET protocol.
Task: unix-server

Package: gnushogi
Priority: optional
Section: games
Installed-Size: 402
Maintainer: Brian Mays <brian@debian.org>
Architecture: i386
Version: 1.3-3
Depends: libc6 (>= 2.2.4-4), libncurses5 (>= 5.2.20020112a-1)
Suggests: xshogi
Filename: pool/main/g/gnushogi/gnushogi_1.3-3_i386.deb
Size: 228332
MD5sum: 4a7bf0a6cce8436c6d74438a2d613152
Description: A program to play shogi, the Japanese version of chess.
 Gnushogi plays a game of Japanese chess (shogi) against the user or it
 plays against itself.  Gnushogi is an modified version of the gnuchess
 program.  It has a simple alpha-numeric board display, or it can use
 the xshogi program under the X Window System.
</snip>

It goes on like that, one stanza after another, separated by blank lines.  As
preparation for a tool that would let me do better things than grep on it, I
wrote the following bit of Python.

I am posting this because it shows some of the powers python has for rapid
coding and parsing.  Thought some of the lurkers might enjoy it.  Hope someone
learns something from it.

<code>

#!/usr/bin/python

import string

class Package:
    pass

availfile = '/var/lib/dpkg/available'

fd = open(availfile)

package_list = []
package = ''

# note I use the readline() idiom because there are currently 10 thousand plus
# entries in the file, which equates to some 90,000 lines.

while 1:
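    raw = fd.readline()
    line = string.rstrip(raw)
    if not line:                     # a blank line or end of file ends a stanza
        if package:                  # nothing to store if no stanza is pending
            package_list.append(package)
            package = ''             # so repeated blank lines do not re-append
        if not raw: break            # end of file
        continue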

    if line[0] == ' ':               # continuation line of the long description
        if not hasattr(package, 'description'):
            setattr(package, 'description', '')
        package.description += line[1:] + '\n'   # keep the line breaks
        continue

    # the depends line occasionally has a line like
    # Depends: zlib1g (>= 1:1.1.3) which would break the split() so I use the
    # optional maxsplit option to ask for only the first colon
    tag, value = string.split(line, ':', 1)
    value = value[1:]

    tag = string.lower(tag)
    if tag == 'package':             # start a new package
        package = Package()

    # in the Description field, the first line (the one carrying the
    # Description: tag) is the short 'synopsis'; the indented lines that follow
    # are read as paragraphs of a longer description.  paragraphs are separated
    # by a line holding a single '.' to make parsing easier
    if tag == 'description':
        tag = 'short'                # rename tag to allow description as long

    setattr(package, tag, value)

priorities = {}
sections = {}
maintainers = {}
sources = {}
tasks = {}

for package in package_list:
    priorities.setdefault(package.priority, []).append(package)
    sections.setdefault(package.section, []).append(package)
    maintainers.setdefault(package.maintainer, []).append(package)
    if hasattr(package, 'source'):
        sources.setdefault(package.source, []).append(package)
    if hasattr(package, 'task'):
        tasks.setdefault(package.task, []).append(package)

print 'Summary:'
print '%d packages' % len(package_list)
print '%d sources' % len(sources)
print '%d priorities' % len(priorities)
print '%d sections' % len(sections)
print '%d maintainers' % len(maintainers)
print '%d tasks' % len(tasks)
</code>

At this point I have a list of Package instances and several dictionaries holding
lists of these same packages.  There is only one instance of each package in
memory though; the list and dictionary entries are all references to it, and
Python takes care of the bookkeeping.  Most handy.
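
The single-instance claim can be checked with an identity test.  What follows is
a minimal sketch in Python 3 (the script above is Python 2).  It builds two
Package objects by hand instead of parsing the real file, and make_package is a
helper invented for the sketch; the values come from the sample stanzas.

```python
class Package:
    pass

def make_package(name, section, maintainer):
    # Hypothetical helper for this sketch only; the script above fills
    # Package objects from the 'available' file instead.
    package = Package()
    package.package = name
    package.section = section
    package.maintainer = maintainer
    return package

package_list = [
    make_package('telnet', 'net', 'Herbert Xu <herbert@debian.org>'),
    make_package('gnushogi', 'games', 'Brian Mays <brian@debian.org>'),
]

# Same grouping idiom as the script: one dictionary per field.
sections = {}
maintainers = {}
for package in package_list:
    sections.setdefault(package.section, []).append(package)
    maintainers.setdefault(package.maintainer, []).append(package)

telnet = package_list[0]

# "is" compares identity, so this checks for the very same object, not a copy.
same_object = (sections['net'][0] is telnet and
               maintainers['Herbert Xu <herbert@debian.org>'][0] is telnet)
print(same_object)

# A change made through one container is visible through the others.
sections['net'][0].priority = 'standard'
print(telnet.priority)
```

Because every container holds a reference rather than a copy, adding another
index dictionary costs one list entry per package, not another copy of the data.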

I could now add a gui to this which would show a tree of maintainers, sections,
tasks, whatever.  Or I could simply walk the package list and display the
synopsis.
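
Walking the list for the synopsis might look roughly like this.  It is a sketch
in Python 3 with hand-built packages; make_package and synopsis_lines are names
made up for the example, and the 'mystery' entry is invented to show a stanza
with no Description field.

```python
class Package:
    pass

def make_package(name, short=None):
    # Hypothetical helper for this sketch; the script fills these from the file.
    package = Package()
    package.package = name
    if short is not None:
        package.short = short        # the script stores the synopsis as 'short'
    return package

def synopsis_lines(package_list):
    # Build one "name synopsis" line per package, sorted by package name.
    lines = []
    for package in sorted(package_list, key=lambda p: p.package):
        # getattr with a default covers a stanza that had no Description field.
        short = getattr(package, 'short', '(no synopsis)')
        lines.append('%-12s %s' % (package.package, short))
    return lines

package_list = [
    make_package('telnet', 'The telnet client.'),
    make_package('gnushogi',
                 'A program to play shogi, the Japanese version of chess.'),
    make_package('mystery'),         # made-up entry without a Description
]

lines = synopsis_lines(package_list)
for line in lines:
    print(line)
```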

Or fun things like who maintains the most packages, or which section has the
fewest.  Or which maintainer is most important to Debian (the one with the
most packages in the most critical section).
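
The first two of those questions come down to finding the longest or shortest
list in one of the dictionaries.  A sketch in Python 3, assuming dictionaries
shaped like the ones the script builds; the maintainer and package names below
are placeholders, and plain strings stand in for Package objects to keep it
short.

```python
def largest_group(groups):
    # Key whose list is longest; on a tie, max() keeps the first one it sees.
    return max(groups, key=lambda key: len(groups[key]))

def smallest_group(groups):
    # Key whose list is shortest; ties are resolved the same way by min().
    return min(groups, key=lambda key: len(groups[key]))

# Placeholder data in the same shape as the script's dictionaries.
maintainers = {
    'maintainer-a': ['pkg1', 'pkg2', 'pkg3'],
    'maintainer-b': ['pkg4'],
}
sections = {
    'net': ['pkg1', 'pkg2', 'pkg4'],
    'games': ['pkg3'],
}

busiest = largest_group(maintainers)
smallest = smallest_group(sections)
print('%s maintains the most packages (%d)' % (busiest, len(maintainers[busiest])))
print('%s is the smallest section (%d)' % (smallest, len(sections[smallest])))
```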

What I like about this solution is the empty Package class which gets filled as
we parse.  This makes it easy for the program to grow and change as the file
format changes (if that is needed).
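
To see why the empty class copes with format changes, here is a cut-down sketch
in Python 3.  parse_stanza is a simplified stand-in for the loop above, and the
Homepage field and its URL are made up for the example, standing in for any
field that might be added to the file later.

```python
class Package:
    pass

def parse_stanza(text):
    # Turn one stanza into a Package; each "Tag: value" line becomes an
    # attribute named after the lowercased tag.
    package = Package()
    for line in text.splitlines():
        if not line or line[0] == ' ':
            continue                 # continuation lines are skipped in this sketch
        tag, value = line.split(':', 1)   # split on the first colon only
        setattr(package, tag.lower(), value.strip())
    return package

stanza = """Package: telnet
Installed-Size: 208
Homepage: http://example.invalid/telnet
"""

package = parse_stanza(stanza)

# The new field is picked up without touching the class or the parser.
print(package.homepage)

# Tags containing '-' are not valid identifiers, so they need getattr().
print(getattr(package, 'installed-size'))

# vars() lists whatever tags the stanza happened to contain.
fields = sorted(vars(package))
print(fields)
```

The price of this flexibility is that a misspelled or missing tag only shows up
as an AttributeError when something tries to read it, so hasattr() or getattr()
with a default is worth using for optional fields.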

All told this is about 30 minutes of work.