Python Regular Expressions

Wed Jun 22 11:05:39 EDT 2011

Andy Barnes wrote:

> Hi,
> 
> I am hoping someone here can help me with a problem I'd like to
> resolve with Python. I have used it before for some other projects but
> have never needed to use Regular Expressions before. It's quite
> possible I am following completley the wrong tack for this task (so
> any advice appreciated).
> 
> I have a source file in csv format that follows certain rules. I
> basically want to parse the source file and spit out a second file
> built from some rules and the content of the first file.
> 
> Source File Format:
> 
> Name, Type, Create, Study, Read, Teach, Prerequisite
> # column headers
> 
> Distil Mana, Lore, n/a, 70, 38, 21
> Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
> Talismantic Lore, Lore, n/a, 150, 100, 50
> Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore,
> Theurgic Lore
> 
> The input file I have has over 700 unique entries. I have tried to
> cover the four main exceptions above. Before I detail them - this is
> what I would like the above input file, to be output as (dot
> diagramming language incase anyone recognises it):
> 
> Name, Type, Create, Study, Read, Teach, Prerequisite
> # column headers
> 
> DistilMana [label="{ Distil Mana |{Type|7}|{70|38|21}}"];
> TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
> DistilMana -> TheurgicLore;
> TalismanticLore [label="{ Talismantic Lore |{Lore|n/a}|{150|100|
> 50}}"];
> AdvanvedTalismanticLore [label="{ Advanced Talismantic Lore |{Lore|n/
> a}|{100|60|30}}"];
> TalismanticLore -> AdvanvedTalismanticLore;
> TheurgicLore -> AdvanvedTalismanticLore;
> 
> It's quite a complicated find and replace operation that can be broken
> down into some easy stages. The main thing the sample above showed was
> that some of the entries won't list any prerequisits - these only need
> the descriptor entry creating. Some of them have more than one
> prerequisite. A line is needed for each prerequisite listed, linking
> it to it's parent.
> 
> You can also see that the 'name' needs to have spaces removed and it's
> repeated a few times in the process. I Hope it's easy to see what I am
> trying to achieve from the above. I'd be very happy to accept
> assistance in automating the conversion of my ever expanding csv file,
> into the dot format described above.

Forget about regexes. If there's any complexity it's in writing the output 
rather than reading the input file. You can tackle that by putting your data 
into a dictionary and using a format string:

import sys

def camelized(s):
    return "".join(s.split())

template = """%(camel)s [label="{ %(name)s |{%(type)s|%(create)s}|
{%(study)s|%(read)s|%(teach)s}}"];"""

def process(instream, outstream):
    instream = (line for line in instream if not (line.isspace() or 
line.startswith("#")))
    rows = (map(str.strip, line.split(",")) for line in instream)
    headers = map(str.lower, next(rows))
    for row in rows:
        rowdict = dict(zip(headers, row))
        camel = rowdict["camel"] = camelized(rowdict["name"])
        print template % rowdict
        for for_lack_of_better_name in row[len(headers)-1:]:
            print "%s -> %s;" % (camelized(for_lack_of_better_name), camel)

if __name__ == "__main__":
    from StringIO import StringIO
    instream = StringIO("""\
Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

Distil Mana, Lore, n/a, 70, 38, 21
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
Talismantic Lore, Lore, n/a, 150, 100, 50
Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore, 
Theurgic Lore
""")
    process(instream, sys.stdout)