[Chicago] Python Development in Chicago

Michael Urman murman at gmail.com
Sun Oct 28 02:55:18 CET 2007


I feel like hearing myself type, so I'm going to give this a thorough
washing. I don't pretend to define pythonic, so others may well
disagree with me. The biggest thing I see missing is the use of
iterators. They really are the best thing since the last best thing
since sliced bread. The final major issue, which I do not address, is
the use of raw text output for XML creation. This is always a bad idea,
since XML is for all practical purposes a binary format: hand-rolled
markup gets escaping and encoding wrong sooner or later. But I'm too
lazy to look back up how to generate XML with a library (xml.etree is
generally a good choice), so I'll leave that alone.
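
Okay, I lied a little. A minimal, untested sketch with xml.etree might
look something like the below; the function name and the idea of
feeding it (source, target) pairs are mine, not Kevin's:

    from xml.etree import ElementTree as ET

    def edges_to_graphml(edges, graph_id='G'):
        # edges is any iterable of (source, target) string pairs
        graphml = ET.Element('graphml')
        graph = ET.SubElement(graphml, 'graph', id=graph_id)
        for source, target in edges:
            ET.SubElement(graph, 'edge', source=source, target=target)
        return ET.ElementTree(graphml)

    # edges_to_graphml([('1', '2'), ('2', '3')]).write(open('out.xml', 'w'))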

On 10/26/07, Kevin L. Stern <kevin.l.stern at gmail.com> wrote:
> import re, mmap, os
>
> class token:
>         def __init__(self):
>                 self.type = None
>                 self.data = None

The way you use class token later makes the explicit initialization of
self.type and self.data unnecessary. I would suggest either adding type
and data as parameters to the __init__ function (saves lines later),
defining defaults in the class body and omitting __init__ (simplifies
this class), or even omitting the body completely (simplifies it more).
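
For reference, the first option - the one my suggested generator below
assumes - is just a quick sketch like this:

    class token:
        def __init__(self, type, data):
            # type is one of 'graph', 'subgraph', 'endgroup', 'edge';
            # data is the identifier, the edge endpoints, or None
            self.type = type
            self.data = data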

Class tokenizer seems a little multifaceted - does it tokenize, or
does it process tokens? Since it actually is doing both, is it instead
your main program class? I'll take it in pieces.

> class tokenizer:
>         def __init__(self, inmap, out):
>                 self.inmap = inmap
>                 self.out = out
>
>         def nextToken(self):
>                 line = inmap.readline()
>                 if re.search("^graph\s.*\s{$", line):
>                         ident = line[6:len(line)-3]
>                         result = token()
>                         result.type = 'graph'
>                         result.data = ident
>                         return result
>                 elif re.search("^\s*subgraph.*{$", line):
>                         parse = re.search("subgraph\s.*\s{$", line).group()
>                         ident = line[10:len(line)-3]
>                         result = token()
>                         result.type = 'subgraph'
>                         result.data = ident
>                         return result
>                 elif re.search("^\s*}$", line):
>                         result = token()
>                         result.type = 'endgroup'
>                         return result
>                 elif re.search("^\s*n\d+\s--\sn\d+;$", line):
>                         parse = re.search("n\d+\s--\sn\d+", line).group()
>                         split = parse.partition('--')
>                         first = re.search("\d+", split[0]).group()
>                         last = re.search("\d+", split[2]).group()
>                         result = token()
>                         result.type = 'edge'
>                         result.data = [first,last]
>                         return result
>                 return None

I would suggest refactoring the nextToken function as a generator
function. Assuming a token __init__ which takes type and data, it
might look something like this (untested):

    def tokenize_att(source):
        """tokenize_att iterates over a source and returns an iterator
        over the resulting tokens"""
        for line in source:
            # strip surrounding whitespace and the trailing ';' on edge lines
            line = line.strip().rstrip(';')
            if line.startswith('graph') and line.endswith('{'):
                yield token('graph', line.split()[1])
            elif line.startswith('subgraph') and line.endswith('{'):
                yield token('subgraph', line.split()[1])
            elif line == '}':
                yield token('endgroup', None)
            elif line.startswith('n'):
                parts = map(str.strip, line.split('--'))
                yield token('edge', (parts[0][1:], parts[1][1:]))

The key points here are:
   * taking an iterable (probably a file) instead of looking it up on self
   * use of startswith and split instead of a regexp
   * rstrip(';') so the trailing semicolon doesn't end up in the edge ids
   * lack of data verification; you might want to check the length of
     the split lines, for example
   * edge token could be yield token('edge', tuple([p[1:] for p in
     parts])) if you're feeling punchy
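
To see what comes out, here's a quick sketch at the interactive prompt;
the sample lines are ones I made up to match the ATT format, and it
assumes the two-argument token above:

    sample = ['graph G {', '  subgraph cluster0 {', '    n1 -- n2;', '  }', '}']
    for t in tokenize_att(sample):
        print t.type, t.data
    # graph G
    # subgraph cluster0
    # edge ('1', '2')
    # endgroup None
    # endgroup None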

>         def processToken(self, t):
>                 if not t:
>                         return
>                 if t.type == 'graph':
>                         self.out.write('<graph id="%s">\n' % t.data)
>                 elif t.type == 'subgraph':
>                         self.out.write('<graph id="%s">\n' % t.data)
>                         self.sg += 1
>                 elif t.type == 'endgroup':
>                         self.out.write ('</graph>\n')
>                         if self.sg > 0:
>                                 self.sg -= 1
>                 elif t.type == 'edge':
>                         self.out.write('<edge source="%s" target="%s"/>\n' % (t.data[0], t.data[1]))

Similarly, processToken could become a generator function, tokens_to_xml:

    def tokens_to_xml(tokens):
        """iterate over tokens, and return an iterator providing lines of xml"""
        for token in tokens:
            if token.type in ('graph', 'subgraph'):
                yield '<graph id="%s">' % token.data
            elif token.type == 'endgroup':
                yield '</graph>'
            elif token.type == 'edge':
                yield '<edge source="%s" target="%s"/>' % token.data

Key points:
   * dropped use of self.sg since it didn't seem to do anything
   * correspondingly collapsed graph and subgraph; it's easy enough to undo this
   * use of a tuple instead of an array for edge lets it be passed to % directly
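
And because both halves are generators, they chain together naturally
(continuing the made-up sample from above):

    for line in tokens_to_xml(tokenize_att(sample)):
        print line
    # <graph id="G">
    # <graph id="cluster0">
    # <edge source="1" target="2"/>
    # </graph>
    # </graph>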

>
>         def go(self):
>                 self.sg = 0
>                 self.out.write("""<?xml version="1.0" encoding="UTF-8"?>
> <graphml>
> """)
>                 while self.inmap.tell() <  self.inmap.size ():
>                         lex.processToken(lex.nextToken())
>
>                 self.out.write("</graphml>")

go feels a little weird, probably mostly because it's referring to lex
which came out of nowhere. Assuming that's self, it's okay, but still
odd. How about att_to_graphml:

    def att_to_graphml(infile, outfile):
        print >> outfile, '<?xml version="1.0" encoding="UTF-8"?>'
        print >> outfile, '<graphml>'
        for xml in tokens_to_xml(tokenize_att(infile)):
            print >> outfile, xml
        print >> outfile, '</graphml>'

Key points:
   * refactoring input/output to parameters - now tokenizer's __init__
     is useless; in fact the whole class feels like overkill
   * contentious use of print >> out instead of out.write
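
If the print >> form bothers you, the write version is about as short -
just a sketch of the alternative, nothing deeper:

    def att_to_graphml(infile, outfile):
        outfile.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        outfile.write('<graphml>\n')
        for xml in tokens_to_xml(tokenize_att(infile)):
            outfile.write(xml + '\n')
        outfile.write('</graphml>\n')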

> try:
>         infile = "ug.txt"
>         insize = os.path.getsize(infile)
>         fd = open(infile, "r+")
>         inmap = mmap.mmap(fd.fileno(), insize, None, mmap.ACCESS_READ)
>         outfile = "out.txt"
>         out = open(outfile, "r+")
>         lex = tokenizer(inmap, out)
>         lex.go()
> except IOError:
>         print "IO Error Occurred"
> finally:
>         inmap.close()
>         out.close()

This is way too overengineered for the common case. You might need
stuff like mmap occasionally, but not normally. Let's make it easy:

    att_to_graphml(open("ug.txt"), open("out.txt", "w"))

Or if you really need to worry about the files being left open a little longer:

    infile = open("ug.txt")
    outfile = open("out.txt", "w")
    try:
        att_to_graphml(infile, outfile)
    finally:
        infile.close()
        outfile.close()

Key points:
   * simpler file handling - no longer need os or mmap (and with re gone
     above, no imports are necessary at all)
   * no tearing down things in the finally clause that might not have
     been set up
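
(And if memory serves, Python 2.5's with statement can handle the same
cleanup, given the __future__ import - an untested aside:)

    from __future__ import with_statement

    with open("ug.txt") as infile:
        with open("out.txt", "w") as outfile:
            att_to_graphml(infile, outfile)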


That's my version of pythonic. When I originally read your code, the
only things I noticed were the regexps, mmap, and the funky use of
line[6:len(line)-3] Tim mentioned. So despite the complete difference
between your code and my version of pythonic, it didn't feel
newbie-ish at all.

Michael, delurking just a little
-- 
Michael Urman

