[Tutor] Wanted: module to parse out a CSV line

Magnus Lycka magnus@thinkware.se
Wed Dec 11 05:33:01 2002


At 00:42 2002-12-11 -0800, Terry Carroll wrote:
>I'm writing one of my first Python apps

Welcome Terry! I hope you will enjoy it!

(Dave, there is something here that looks like a bug in CSV to me.
Care to comment?)

>(I've used perl up  until
>now) and need to parse out lines of comma-separated values (CSV).  I'm on
>a Windows/XP system.

I was just about to suggest that you use one of the three
modules below. It's nice to see that someone has made the
effort to search the net before asking here! :)

I'm afraid you should have tried a bit harder with these modules,
though. They can all solve your problem (?), but they could be
better documented, and I think one of them ought to be in the
standard library.

>I've found some Python CSV support, but nothing that will work for me:

>  1. ASV, from <http://tratt.net/laurie/python/asv/>
>     Nice, but it reads in an entire file that is assumed to be
>     CSV-formatted.

There is an input_from_file method, but you don't have to use
that. Use input instead.

>  That's not my case, I have a single variable I need
>     to parse out (yeah, it comes from a file, but not all lines in the
>     file are CSV).

 >>> import ASV
 >>> asv = ASV.ASV()
 >>> asv.input('A, 232, "Title", "Smith, Adam", "1, 2, 3, 4"', ASV.CSV())
 >>> print asv
[['A', '232', 'Title', 'Smith, Adam', '1, 2, 3, 4']]

>  2. A CSV module from
>     <http://www.object-craft.com.au/projects/csv/documentation.html>
>     Perfect!  Exactly what I need.  Except the install fails looking for a
>     program named cl.exe; I think it's a compiler, which I don't have.

This module is implemented in C to make it really fast even
for very large files. But look at the download page:
http://www.object-craft.com.au/projects/csv/download.html

If you are using Win32, you can use one of the following binaries:
Win32 Python 2.1 binary: csv.pyd 20K Nov 20 2002
Win32 Python 2.2 binary: csv.pyd 20K Nov 20 2002

 >>> import csv
 >>> csv.parser().parse('A, 232, "Title", "Smith, Adam", "1, 2, 3, 4"')
['A', ' 232', ' "Title"', ' "Smith', ' Adam"', ' "1', ' 2', ' 3', ' 4"']

Not quite...but...

 >>> csv.parser().parse('A,232,"Title","Smith, Adam","1, 2, 3, 4"')
['A', '232', 'Title', 'Smith, Adam', '1, 2, 3, 4']

It seems the space after the comma confuses CSV regarding the use
of double quotes. I've seen a lot of files with whitespace after
the comma, so this is not what I would like. And the parser won't
accept field_sep = ', '; it has to be a single character.
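
For comparison, here is a minimal sketch of the behaviour I would like,
written against the csv module that ships in the standard library of
current Python versions. Note the assumptions: this is not the Object
Craft csv.pyd used above (it only shares the name), it is Python 3
syntax, and parse_csv_line is just a helper name made up for this
example. The skipinitialspace option tells the reader to ignore
whitespace directly after a delimiter, so the quotes are still seen
as quotes.

```python
import csv  # standard-library csv module, NOT the Object Craft csv.pyd above


def parse_csv_line(line):
    """Parse a single CSV-formatted string into a list of fields.

    Hypothetical helper for illustration only.
    """
    # csv.reader accepts any iterable of strings, so a one-element list
    # lets us parse a single line without touching a file.
    # skipinitialspace=True drops whitespace that directly follows a
    # delimiter, so '"Smith, Adam"' after ', ' is still a quoted field.
    reader = csv.reader([line], skipinitialspace=True)
    return next(reader)


fields = parse_csv_line('A, 232, "Title", "Smith, Adam", "1, 2, 3, 4"')
print(fields)
# ['A', '232', 'Title', 'Smith, Adam', '1, 2, 3, 4']
```

Since the helper takes a plain string, it can be called only on those
lines of a file that are known to be CSV.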

>  3. Python-DSV, at <http://python-dsv.sourceforge.net/>
>     This looks like some whole separate program, rather than something
>     that I can just call in to parse out a single line.  It also looks
>     like it goes after a whole file at once.  Hard to tell -- there's no
>     docs, unless (I presume) I install it.

The documentation is in the form of a documentation string in the source.
It shows you what to do.

Basic use:
     from DSV import DSV
     data = file.read() # file.read() returns a string, so this is what you need
     qualifier = DSV.guessTextQualifier(data) # optional
     data = DSV.organizeIntoLines(data, textQualifier = qualifier)
     delimiter = DSV.guessDelimiter(data) # optional
     data = DSV.importDSV(data, delimiter = delimiter, textQualifier = qualifier)
     hasHeader = DSV.guessHeaders(data) # optional

You can skip the guessing games and run just the two functions that matter.

 >>> from DSV import DSV
 >>> data = 'A, 232, "Title", "Smith, Adam", "1, 2, 3, 4"'
 >>> data = DSV.organizeIntoLines(data, textQualifier = '"')
 >>> data = DSV.importDSV(data, delimiter = ',', textQualifier = '"')
 >>> print data
[['A', ' 232', 'Title', 'Smith, Adam', '1, 2, 3, 4']]

As you can see, like csv but unlike ASV, it doesn't strip the leading
space before 232. I'm pretty sure this is intentional. Whether it's a
bug or a feature in your eyes is a different issue...

The "organizeIntoLines" step (which I guess you can bypass by putting
your string in a list) exists because programs like Excel will produce
CSV files with line breaks inside "-delimited strings. So a logical
line might span several physical lines.
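
To make the logical-versus-physical line point concrete, here is a
small sketch. It does not use DSV; it assumes the csv module from the
standard library of current Python versions (Python 3 syntax), and
the sample text is made up for the illustration. The input has four
physical lines, but one quoted field contains a line break, so only
three logical rows come out.

```python
import csv
import io

# Made-up sample: the second record has a line break inside a quoted field.
text = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# newline='' leaves line endings untranslated, so the reader itself
# decides which line breaks end a record and which belong to a field.
rows = list(csv.reader(io.StringIO(text, newline='')))

physical_lines = text.splitlines()
print(len(physical_lines), "physical lines,", len(rows), "logical rows")
for row in rows:
    print(row)
```

Splitting the text on line breaks before parsing would cut the second
record in half, which is the problem a line-organizing step guards
against.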

I think it would be a good thing to have parsers/importers/exporters for
both CSV and fixed format in the standard library. We just need some
kind of consensus on how they should behave, I guess...


-- 
Magnus Lycka, Thinkware AB
Alvans vag 99, SE-907 50 UMEA, SWEDEN
phone: int+46 70 582 80 65, fax: int+46 70 612 80 65
http://www.thinkware.se/  mailto:magnus@thinkware.se