[Csv] Module question...

Dave Cole djc at object-craft.com.au
Thu Jan 30 12:31:06 CET 2003


>>>>> "Kevin" == Kevin Altis <altis at semi-retired.com> writes:

>> From: Andrew McNamara
>> 
>> >> The way we've spec'd it, the module only deals with file
>> >> objects.  I wonder if there's any need to deal with strings,
>> >> rather than files?
>> 
>> BTW, I'm asking this because it's something that will come back to
>> haunt us if we get it wrong - it's something we need to make the
>> right call on.

Kevin> Agreed, in fact I'm now reconsidering my position.

When I originally wrote the Object Craft parser I thought about these
things too.  I eventually settled on the current interface.  To use
the stuff in CVS now, this is what the interface looks like:

    csvreader = _csv.parser()
    for line in file("some.csv"):
        row = csvreader.parse(line)
        if row:
            process(row)

The reason I settled on this interface was that it placed only the
performance critical code into the extension module.  All policy
decisions about where the CSV data would come from were pushed back
into the application.

The current PEP is only a slight variation on this, but it is a nice
variation.  The variation pushes the conditional in the loop into the
reader and thereby exposes a much nicer interface.
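
For illustration, here is how that conditional-free loop looks with
the csv module that eventually grew out of this work (io.StringIO
stands in for a file; the data is made up):

```python
# The reader yields one complete record per iteration, even when a
# quoted field spans several physical lines, so the caller never
# needs the "if row:" test from the older interface.
import csv
import io

data = io.StringIO('a,b\n"multi\nline",c\n')
rows = [row for row in csv.reader(data)]   # no "if row:" test needed
```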

Hmmm...  The argument to the PEP reader() should not be a file object,
it should be an iterator which returns lines.  There really is no
reason why it should not handle the following:

    lines = ('1,2,3,"""I see,""\n',
             'said the blind man","as he picked up his\n',
             'hammer and saw"\n')

    csvreader = csv.reader(lines)
    for row in csvreader:
        process(row)
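
For what it's worth, the reader that eventually landed in the
standard library handles exactly this iterable-of-strings case:

```python
import csv

# The same three physical lines as above, forming one logical record.
lines = ('1,2,3,"""I see,""\n',
         'said the blind man","as he picked up his\n',
         'hammer and saw"\n')

rows = list(csv.reader(lines))
# A single four-field record: each doubled quote collapses to one,
# and the embedded newlines are preserved inside the quoted fields.
```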

>> >One other possibility would be for the parser to only deal with
>> >one row at a time, leaving it up to the user code to feed the
>> >parser the row strings. But given the various possible line
>> >endings for a row of data and the fact that a column of a row may
>> >contain a line ending, not to mention all the other escape
>> >character issues we've discussed, this would be error-prone.
>>
>> This is the way the Object Craft module has worked - it works well
>> enough, and the universal end-of-line stuff in 2.3 makes it more
>> seamless. Not saying I'm wedded to this scheme, but I'd just like
>> to have clear why we've chosen one over the other.

You might have missed it but the Object Craft parser is designed to be
fed one line at a time.  It actually raises an exception if you pass
more than one line to it.  Internally it collects fields from lines
until it detects end of record, at which point it returns the record
to the caller.
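
A toy sketch (not the Object Craft code) of that contract: the
parser is fed one physical line at a time and returns None until a
complete record has been collected.  Quote handling is deliberately
naive, just enough to show the buffering idea.

```python
class LineParser:
    """Collects physical lines until a record is complete, then
    returns the parsed record; returns None while still inside a
    multi-line quoted field."""

    def __init__(self):
        self._buf = ""

    def parse(self, line):
        self._buf += line
        # A record is complete once the buffer contains an even
        # number of quote characters, i.e. no quoted field is open.
        if self._buf.count('"') % 2 == 0:
            record, self._buf = self._buf, ""
            return record.rstrip("\n").split(",")  # naive field split
        return None
```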

>> I'm trying to think of an example where operating on a file-like
>> object would be too restricting, and I can't - oh, here's one: what
>> if you wanted to do some pre-processing on the data (say it was
>> uuencoded)?

I think this could be solved by changing the reader() fileobj argument
to an iterable.
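
With an iterable argument, any preprocessing (uudecoding, charset
conversion, ...) can live in a plain generator wrapped around the
raw line source.  The comment-stripping below is just a simple
stand-in for such a step:

```python
import csv
import io

def preprocessed(line_source):
    # Hypothetical preprocessing step: skip comment lines before
    # the CSV parser ever sees them.
    for line in line_source:
        if not line.startswith("#"):
            yield line

raw = io.StringIO("# generated file\n1,2\n3,4\n")
rows = list(csv.reader(preprocessed(raw)))
```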

>> >The solution was to simply accept a file-like object and let the
>> >parser do the interpretation of a record. By having the parser
>> >present an iterable interface, the user code still gets the
>> >convenience of processing per row if needed or if no processing is
>> >desired a result list can easily be obtained.

Is this the same thing as what I said above?

>> Should the object just be defined as an iterable, and leave
>> closing, etc, up to the user of the module? One downside of this is
>> you can't rewind an iterator, so things like the sniffer would be
>> SOL. We can't ensure that the passed file is rewindable
>> either. Hmmm.

Application code will just have to be aware of this and arrange to do
something like the following:

    sniffer_input = [fileobj.readline() for i in range(20)]

    dialect = csvutils.sniff(sniffer_input)
    csvreader = csv.reader(sniffer_input, dialect=dialect)
    for row in csvreader:
        process(row)

Then we have two problems (our principal weapons are surprise and fear):

   * The sniffer_input might have a partial record (multi-line record
     spanning last line read out of file).

   * We do not have a way to continue using a reader with additional
     input.

   * The list comprehension may be longer than the file - readline()
     just returns empty strings past EOF :-)

This could be solved by exposing a further method on the reader.

    sniffer_input = [fileobj.readline() for i in range(20)]

    dialect = csvutils.sniff(sniffer_input)
    csvreader = csv.reader(sniffer_input, dialect=dialect)
    for row in csvreader:
        process(row)
    # now continue on with the rest of the file
    csvreader.use(fileobj)
    for row in csvreader:
        process(row)

Given the above, is it reasonable to say that this logic could be
hardened and placed into a csvutils function?
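
One possible hardening, sketched with names made up for this example
(sniff_and_read is hypothetical; csv.Sniffer is from the standard
library that eventually shipped):

```python
import csv
import io
import itertools

def sniff_and_read(fileobj, sample_lines=20):
    """Hypothetical csvutils-style helper.  Sniff the dialect from a
    bounded sample, then parse the whole stream by chaining the
    consumed sample back in front of the remaining lines.  islice
    stops at EOF, so the sample can never be longer than the file,
    and chaining means no half-used reader has to be resumed."""
    sample = list(itertools.islice(fileobj, sample_lines))
    dialect = csv.Sniffer().sniff("".join(sample))
    return csv.reader(itertools.chain(sample, fileobj), dialect=dialect)

rows = list(sniff_and_read(io.StringIO("a,b\n1,2\n3,4\n"), sample_lines=2))
```

This dissolves the second and third problems outright; a multi-line
record split across the sample boundary can still mislead the
sniffer, though it parses correctly once the chained reader runs.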

- Dave

-- 
http://www.object-craft.com.au


