ASCII delimited....a follow up

Roger Irwin irwin at mail.com
Wed Nov 17 15:14:06 CET 1999


Thanks to everybody who sent me suggestions for reading delimited ASCII files
last week. As there was no standard function for doing what I wanted, I wrote my
own, but as this could be very useful to a lot of other people (if nothing else
for reading in data from spreadsheets), I thought others might like a copy of
this simple function:

(Note: this is oriented more towards the requirements of general data tables
than mathematical data sets.)

pdict={}

def parsedata(ftr,delim=';',offset=0,strip=1):
	"""Read a delimited ASCII file into the dictionary pdict.

	After opening the file and reading in any header lines,
	call this function with ftr set to the open file object.
	delim is the field delimiter, whilst offset is the index of
	the field to be used as the key.
	If strip is true, then if a field starts with a quotation
	mark the first and last characters will be ripped off the field.
	A line with no delimiter will be regarded as a comment.
	Stops reading a line as soon as a control character is encountered.

	Returns the number of records read.
	"""

	counter=0
	while 1:
		foo=ftr.readline()
		if foo=='':
			break
		if foo[-1]!='\n':
			# last line of the file has no newline; add one so that
			# the final field is still terminated by a control character
			foo=foo+'\n'
		start=0
		parsetab=[]

		for each in range(len(foo)):
			if foo[each]==delim:
				# end of a field
				if strip and (foo[start]=='"' or foo[start]=="'"):
					parsetab.append(foo[start+1:each-1])
				else:
					parsetab.append(foo[start:each])
				start=each+1
			elif foo[each]<' ':
				# end of the line; start is still 0 if no delimiter was seen
				if start:
					if strip and (foo[start]=='"' or foo[start]=="'"):
						parsetab.append(foo[start+1:each-1])
					else:
						parsetab.append(foo[start:each])
				break

		if start:
			pdict[parsetab[offset]]=parsetab
			counter=counter+1
	return counter
	

Note that all fields get read as strings. If you need to do sums on a value
then it is better (IMHO) to do a specific conversion dependent on requirements
after the file is read in, like using getline/sscanf in C. Many programs (well,
mostly spreadsheets) put quotation marks round strings in delimited files
whilst numbers are left as is. These quotation marks will get included in the
string when reading into Python. I thought about doing something
hyper-programmable to allow for all possibilities of removing these quotation
marks, but then I decided this was overkill, took the KISS option, and did a
simple test whereby if the first character of a field is a quotation mark then
the first and last characters will be stripped. This will cater for virtually
all real world requirements, but it can be overridden (strip=0) in awkward cases.
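
To illustrate the "convert afterwards" idea, here is a minimal sketch. The
helper name to_number and the sample record are made up for the example and
are not part of the function above; it only assumes that every field has
already been read in as a string.

```python
def to_number(field, default=None):
    """Convert a string field to an int or a float; return default if it is not numeric."""
    try:
        return int(field)
    except ValueError:
        try:
            return float(field)
        except ValueError:
            return default

# A hypothetical record as the parser would store it: every field is a string.
record = ['widget', '12', '3.50']

quantity = to_number(record[1])   # int
price = to_number(record[2])      # float
total = quantity * price
```

Which fields get converted, and what to do with a field that is not a number,
is left to the caller, which is the point of doing it after the read.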

Now I never wanted to be a programmer, I always wanted to be a lumberjack, but
instead I am a newbie Pythonite. As such I would appreciate feedback on the
way I have done this, particularly the way I have used pdict{}.

I wanted to have the possibility of reading several files into one dictionary;
what's more, the files may have header records. So rather than have the parser
actually open and close the file, that is done by the calling function, which
may also read in any header records before passing the file object to the
function.
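
The calling pattern described above might look roughly like this. It is only a
sketch: read_records is a simplified stand-in for the parser (it just splits
each line on the delimiter), table plays the role of pdict, and io.StringIO
objects stand in for real files so the example is self-contained.

```python
import io

table = {}   # plays the role of pdict

def read_records(f, delim=';', offset=0):
    """Simplified stand-in for the parser: split each remaining line on delim."""
    count = 0
    for line in f:
        line = line.rstrip('\r\n')
        if delim not in line:
            continue              # no delimiter: treat the line as a comment
        fields = line.split(delim)
        table[fields[offset]] = fields
        count += 1
    return count

# io.StringIO objects stand in for two files opened by the caller.
first = io.StringIO("name;qty\napple;3\npear;5\n")
second = io.StringIO("name;qty\nplum;7\n")

total = 0
for f in (first, second):
    f.readline()                  # the caller consumes the header line
    total += read_records(f)      # the reader carries on from the current position
    f.close()
```

Because the reader only ever sees an open file object, the caller decides how
many header lines to skip and how many files end up in the same dictionary.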

Now, if I had been in C, I would probably have passed a pointer to an array of
structures, but Python does not seem to encourage passing by reference in
function calls (or have I missed something?), so I used a specific temporary
dictionary location. This way I can open several files in succession and
concatenate their contents into the same dictionary. If I then want to start a
new dataset (and this code were in 'module'), I would do:

firstdataset=module.pdict
module.pdict={}

And thus carry on reading a new lot of data. Now I assume that Python would not
have copied the dictionary, but simply passed a pointer to firstdataset, and
then created a new data object for module.pdict when I made the assignment. Is
this correct? Is there a better way to do this?
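
The assumption can be checked with a small experiment. As far as I know,
assignment in Python binds a name to an existing object and never copies it,
and a dictionary passed to a function can be filled in place. The sketch below
shows both; fill is a hypothetical helper, and a plain pdict stands in for
module.pdict.

```python
pdict = {}
pdict['a'] = ['a', '1']

firstdataset = pdict                 # a second name for the same dictionary; nothing is copied
same_object = firstdataset is pdict  # true at this point

pdict = {}                           # rebinds pdict to a brand new, empty dictionary
                                     # firstdataset still refers to the old one

def fill(table, key, fields):
    """Hypothetical helper: add a record to the dictionary the caller passed in."""
    table[key] = fields

seconddataset = {}
fill(seconddataset, 'b', ['b', '2'])  # the caller's dictionary is modified in place
```

So one alternative to a module-level dictionary would be to pass the target
dictionary to the parser as an extra argument.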

BTW, Python would seem to be a great language for spreadsheets. Are there
spreadsheets with Python bindings, or perhaps even an extensible spreadsheet
written in Python?




