parsing CSV files with quotes
cpr at emsoftware.com
Thu Mar 30 19:18:13 CEST 2000
Since I've written a CSV/TSV scanning module (in C) for one of our products,
I'd recommend a simple state machine scanner. The rules are more complex
than meets the eye, if you really want to handle all the random cases that
dumb programs like Access or Excel output.
I've thought about providing a Python wrapper for this scanning module (and
will eventually). If you want to do it yourself, I'd be happy to send you
the C code.
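
To give a feel for the state-machine approach, here is a minimal sketch in modern Python. It is not the C module mentioned above; it only follows the four rules in the quoted message below (backslash-escaped quotes and backslashes inside quoted fields). The function name scan_record, the state names, and the choice to drop whitespace around fields are assumptions made for illustration, and real Access/Excel output (doubled quotes, embedded newlines) would need more states.

```python
def scan_record(line, delimiter=",", quote='"', escape="\\"):
    """Split one CSV record into fields with a small state machine.

    Hypothetical helper: handles quoted fields with backslash escapes,
    and strips whitespace around unquoted fields.
    """
    START, BARE, QUOTED, ESCAPED, AFTER_QUOTE = range(5)
    fields = []
    buf = []
    state = START
    for ch in line.rstrip("\r\n"):
        if state == START:              # at the beginning of a field
            if ch == quote:
                state = QUOTED
            elif ch == delimiter:
                fields.append("")       # empty field
            elif not ch.isspace():      # skip leading whitespace
                buf.append(ch)
                state = BARE
        elif state == BARE:             # inside an unquoted field
            if ch == delimiter:
                fields.append("".join(buf).rstrip())
                buf = []
                state = START
            else:
                buf.append(ch)
        elif state == QUOTED:           # inside a quoted field
            if ch == escape:
                state = ESCAPED
            elif ch == quote:
                state = AFTER_QUOTE
            else:
                buf.append(ch)
        elif state == ESCAPED:          # character after a backslash
            buf.append(ch)              # taken literally
            state = QUOTED
        else:                           # AFTER_QUOTE: closing quote seen
            if ch == delimiter:
                fields.append("".join(buf))
                buf = []
                state = START
            elif not ch.isspace():
                raise ValueError("unexpected text after closing quote")
    if state in (QUOTED, ESCAPED):
        raise ValueError("unterminated quoted field")
    last = "".join(buf)
    fields.append(last.rstrip() if state == BARE else last)
    return fields


print(scan_record(r'4, "Austin, \"Stone Cold\" Steve", 34'))
# → ['4', 'Austin, "Stone Cold" Steve', '34']
```

Keeping one explicit state per situation makes it easy to bolt on the odd cases later, since each new quirk becomes one more branch or one more state rather than a rewrite.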
/ Chris Ryland, President / Em Software, Inc. / www.emsoftware.com
- - -
"Warren Postma" <embed at geocities.com> wrote in message
news:BDLE4.1642$HG1.47883 at nnrp1.uunet.ca...
> Suppose I have a CSV file where line 1 is the column names, and lines 2..n
> are comma separated variables, where all String fields are quoted like
> ID, NAME, AGE
> 1, "Postma, Warren", 30
> 2, "Twain, Shania", 31
> 3, "Nelson, Willy", 57
> 4, "Austin, \"Stone Cold\" Steve", 34
> So, the obvious thing I tried is:
> import string
> >>> print string.splitfields("4, \"Austin, \\\"Stone Cold\\\" Steve,
> ['4', ' "Austin', ' \\"Stone Cold\\" Steve', ' 34']
> Hmm. Interesting. So I tried this:
> >>> print string.splitfields(r'4, "Austin, \"Stone Cold\" Steve", 34')
> ['4,', '"Austin,', '\\"Stone', 'Cold\\"', 'Steve",', '34']
> I'm getting close, I can feel it!
> The Rules:
> 1. All integer and other fields are output as ascii.
> 2. String fields have quotes. Commas are allowed inside the quotes.
> 3. Quotes inside quotes are escaped by a backslash
> 4. Backslashes are themselves quoted by a backslash
> Is this complex enough that I basically need the "parser" module?
> Problem is I'm scared of it. Anyone got any Parser Tutorials Howtos/Links?
> Or is this beasty solveable by judicious use of Regular Expressions?
> While I'm taking up bandwidth, I'll ask another silly question:
> Is there a "compressed dbShelve" out there anywhere? In this case I just
> want to store arrays and dictionaries of built-in Python types, in a
> compressed manner, in a bsd database. Anyone heard of something like this?