Using re to get data from text file: SOLVED

Jocknerd jocknerd1 at yahoo.com
Fri Sep 10 20:15:54 CEST 2004


On Fri, 10 Sep 2004 14:53:32 +0000, William Park wrote:

> Jocknerd <jocknerd1 at yahoo.com> wrote:
>> I'm a Python newbie and I'm having trouble with Regular Expressions when
>> reading in a text file.  Here is a sample layout of the input file:
>> 
>> 09/04/2004  Virginia              44   Temple               14
>> 09/04/2004  LSU                   22   Oregon State         21
>> 09/09/2004  Troy State            24   Missouri             14
>> 
>> As you can see, the text file contains a list of games.  Each game has a
>> date, a winning team, the winning team's score, the losing team, and the
>> losing team's score.  If I set up my program to import the data with
>> fixed-length fields, it's no problem.  But some of my text files have
>> different layouts.  For instance, some only have one space between a team
>> name and its score.
>> 
>> Here's how I read in the file using fixed length fields:
>> 
>> filename = sys.argv[1]
>> file = open (filename, 'r')
>> 
>> schedule = []     # make a list called schedule
>> 
>> while True:
>>     line = file.readline()
>>     if not line: break
>>     game = {}     # make a dictionary called game
>>     game['date']   = line[0:10]   # fixed length field
>>     game['team1']  = line[12:40].strip()
>>     game['score1'] = line[40:42]
>>     game['team2']  = line[44:72].strip()
>>     game['score2'] = line[72:74]
>>     schedule.append(game)
>> 
>> file.close()
>> 
>> Note:  I'm stripping whitespace from the team names because I don't want
>> the team name to actually be a fixed length.
>> 
>> How would I set this up to read in the data using Regular expressions?
>> 
>> I've tried this:
>> 
>> while True:
>>     line = file.readline ()
>>     if not line: break
>>     game = {}
>>     datePattern = re.compile(r'^(\d{2})\D+(\d{2})\D+(\d{4})')
>> 
>> Here's where I get stuck.  What do I do from here?  I just don't know how
>> to import the text and assign it to the proper fields using the re module.
> 
> 
> Your format is a bit complicated since a team's name can span a variable
> number of words.  But I'm assuming that no name has a digit in it.  So,
> use '\d+' to separate the fields, e.g.
>     re.split (r'\d+', line)
>     re.split (r'(\d+)', line)
>     re.split (r'(\d+)', line[10:])
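
For reference, here is a rough sketch of what the third form gives back and how it could be mapped onto the fields. It assumes the date always fills the first 10 columns and that team names contain no digits; split_game is just a name made up for this example.

```python
import re

def split_game(line):
    """Split one schedule line on its runs of digits (hypothetical helper)."""
    # Skip the 10-character date, then split on digit runs. Because the
    # pattern has a capturing group, the scores are kept in the result.
    parts = re.split(r'(\d+)', line[10:])
    team1, score1, team2, score2 = parts[:4]
    return {
        'date': line[:10],
        'team1': team1.strip(),
        'score1': score1,
        'team2': team2.strip(),
        'score2': score2,
    }

# re.split(r'(\d+)', " LSU 22 Oregon State 21")
# gives [' LSU ', '22', ' Oregon State ', '21', '']
print(split_game("09/04/2004 LSU 22 Oregon State 21"))
```

The pieces alternate between text and digits, so the team names still need stripping, and anything after the second score ends up in the last element.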

Couldn't figure out re.split.  Didn't seem to do what I wanted. Here's
what did work:

#!/usr/bin/python

import re
import sys
filename = sys.argv[1]
file = open (filename, 'r')

schedule = []

pattern = re.compile(r'^(.*\D\d+\D\d+)\D(.*)\D(.*\d+)\D(.*)\D(.*\d+)(.*)$')
while True:
    line = file.readline()
    if not line: break
    g = {}
    (g['date'], g['team1'], g['score1'], g['team2'],
     g['score2'], g['location']) = pattern.search(line).groups()
    schedule.append(g)
file.close()

for game in schedule:
    print(game['date'], game['team1'], game['score1'], game['team2'],
          game['score2'])
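
A possible tightening, offered only as a sketch: with the space-padded layout the greedy (.*) groups seem to keep the padding around the team names, and a line that does not match makes pattern.search(line) return None, so .groups() raises an error. The variant below assumes Python 3, dates written as dd/dd/dddd, and team names without digits. GAME and parse_game are names invented for this example.

```python
import re

# Named groups; \D+? is non-greedy so it stops before the padding
# that precedes each score.
GAME = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{4})\s+'
    r'(?P<team1>\D+?)\s+(?P<score1>\d+)\s+'
    r'(?P<team2>\D+?)\s+(?P<score2>\d+)\s*'
    r'(?P<location>.*)$'
)

def parse_game(line):
    """Return a dict for one game, or None if the line does not match."""
    m = GAME.match(line)
    if m is None:
        return None
    game = m.groupdict()
    game['score1'] = int(game['score1'])
    game['score2'] = int(game['score2'])
    game['location'] = game['location'].strip()
    return game

sample = [
    "09/04/2004  Virginia              44   Temple               14\n",
    "09/09/2004 Troy State 24 Missouri 14\n",
]
# Keep only the lines that parsed.
schedule = [g for g in map(parse_game, sample) if g is not None]
for game in schedule:
    print(game['date'], game['team1'], game['score1'],
          game['team2'], game['score2'])
```

Both the padded and the single-space layout go through the same pattern, and lines that do not look like a game are skipped instead of stopping the script.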




