[Csv] Re: PEP 305 - Comments (really long post)
Skip Montanaro
skip at pobox.com
Thu Feb 6 17:07:01 CET 2003
Carlos> 1) There is one reason left to convert numbers before returning
Carlos> them, and this has a lot to do with information that is
Carlos> discarded in the process. Let us follow this example:
Carlos> "row 1";10 --> ("row 1", "10")
Carlos> The second item of the returned tuple is a string, as you stated
Carlos> in your answer. The problem is that my application has no way to
Carlos> know if the value was originally written in the csv file with or
Carlos> without quotes; this information is lost because all values are
Carlos> 'normalized' by the csv library.
Carlos,
You're interpreting the quote character incorrectly. Quotes are necessary
only to disambiguate fields which contain the delimiter character. There is
no restriction that they be used minimally, however. Your example can just
as easily (and just as correctly) have been written as any of the following:
"row 1";10
"row 1";"10"
row 1;10
row 1;"10"
All have precisely the same meaning.
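A quick way to check this is to feed all four spellings to the reader. This is a minimal sketch, assuming the csv.reader interface as it exists in today's standard library; the helper name parse_line is invented for the illustration.

```python
import csv

def parse_line(line, delimiter=";"):
    """Parse a single CSV line and return its fields as a list of strings."""
    # csv.reader accepts any iterable of lines, so a one-element list is enough.
    return next(csv.reader([line], delimiter=delimiter))

# The four spellings of the same record from the message above.
variants = [
    '"row 1";10',
    '"row 1";"10"',
    'row 1;10',
    'row 1;"10"',
]

for variant in variants:
    print(variant, "->", parse_line(variant))
```

Every variant comes back as the same two strings, so the reader has nothing left from which to tell quoted and unquoted fields apart.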
We do have plans to implement a csvutils module. One of the things it will
contain is a "sniffer" (actually, it may contain multiple sniffers to sniff
out different properties of the file). One thing a sniffer might do is try
to determine column types by looking at a relatively short prefix of a CSV
file (20 rows or so). This may be helpful to you in situations where your
application doesn't know the type information, but in general, your
application should know column types better than the csv module.
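As a rough sketch of what such a type sniffer could look like (this is not the planned csvutils code; guess_type and guess_column_types are hypothetical names, and the int/float/str ladder is an assumption), one can read a short prefix and pick the narrowest type that accepts every value in a column. The sample rows are the ones Carlos quotes next.

```python
import csv
from itertools import islice

def guess_type(values):
    """Return the first of int, float that converts every value, else str."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
        except ValueError:
            continue  # this candidate rejected a value; try the next one
        return candidate
    return str

def guess_column_types(lines, max_rows=20, **fmtparams):
    """Guess one type per column from at most max_rows rows of CSV input."""
    rows = list(islice(csv.reader(lines, **fmtparams), max_rows))
    # zip(*rows) transposes rows into columns (truncating to the shortest row).
    return [guess_type(column) for column in zip(*rows)]

sample = [
    '"1", "Project phase", 2000',
    '"1.1", "Requirement analysis", 1000',
    '"1.1", "Architectural design", 1000',
]

# skipinitialspace lets the reader see the quote that follows ", ".
types = guess_column_types(sample, skipinitialspace=True)
print(types)
```

The first column is guessed as float here, although "1.1" is clearly a section label. A guess like this is only a fallback; an application that knows its columns should say so.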
Carlos> "1", "Project phase", 2000
Carlos> "1.1", "Requirement analysis", 1000
Carlos> "1.1", "Architectural design", 1000
Carlos> In this case, MS Excel will detect the first column as a
Carlos> string, but will convert values in the third one to numeric
Carlos> format.
Perhaps, but Microsoft has the advantage of arrogance. ;-) MS is the
800-pound gorilla, and can thus assume that any CSV data which is fed to
Excel must be in a format Excel understands. We don't have that luxury. We
want to make sure people can read CSV data generated by many different
applications, many of which are incompatible with Excel's assumptions.
Carlos> There are few solutions for this problem, none of them fully
Carlos> satisfactory:
...
There's the key: "none of them fully satisfactory". If there were a
satisfactory solution, we'd be more open to extracting type information from
the raw data. Since there isn't, we will limit this csv module to just
parsing the data.
Carlos> 2) In your answer, you cite the case where some numeric values
Carlos> can be hex, or whatever base it is. Well, I don't agree with
Carlos> your argument. One of Python's mottos is "to make simple
Carlos> things simple". The simplest case is base 10 integers; if the
Carlos> library can deal with them in a sane way, you're solving the
Carlos> problems of the vast majority of the users. Special cases are
Carlos> just that, special, and will be treated in a special fashion
Carlos> anyway.
True, the simplest case is base 10. However, like I said above, many
different applications may be the source of this data (or may want to read
the CSV data we write). It's just not possible to be all things to all
people. We're doing what we feel we can do better than anyone else.
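Leaving conversion to the application can be as small as the following sketch. It assumes the application knows its column layout; convert_rows and the sample data are made up for the illustration and are not part of the csv module.

```python
import csv
import io

def convert_rows(reader, converters):
    """Apply one application-chosen converter per column to each parsed row."""
    for row in reader:
        yield [convert(field) for convert, field in zip(converters, row)]

# Hypothetical file: a name, a base 10 count, and a hexadecimal flag word.
data = io.StringIO("widget,10,ff\ngadget,25,1a\n")

# Only the application knows that the third column is base 16.
converters = (str, int, lambda text: int(text, 16))

rows = list(convert_rows(csv.reader(data), converters))
print(rows)
```

The common base 10 case costs one int in the converter tuple, and the special case needs no special support from the parser.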
Carlos> 3) I'm not sure if str() is localized for floats. Using the
Carlos> standard installation of PythonWin with a fully localized copy
Carlos> of Windows, it still uses periods as decimal point - not
Carlos> commas. I didn't try to change the locale manually (I never did
Carlos> that before for Python); I'll try and tell you what happens.
That would be much appreciated. Another area we need to deal with but which
we have avoided so far is Unicode.
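On the locale question, the experiment Carlos offers to run could look like the sketch below. It assumes current Python behaviour, where str() on a float ignores the locale and locale.str() honours LC_NUMERIC; format_float is a hypothetical helper, and which locale names are installed depends on the platform.

```python
import locale

def format_float(value, locale_name=None):
    """Return (str(value), locale.str(value)) under the given LC_NUMERIC locale."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        if locale_name is not None:
            # Raises locale.Error if the named locale is not installed.
            locale.setlocale(locale.LC_NUMERIC, locale_name)
        return str(value), locale.str(value)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # always restore

print(format_float(1.5, "C"))
```

With a locale that uses a decimal comma (for example a German one, if installed), only the second element should change, which matches what Carlos saw with str().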
Carlos> 4) I'm not convinced that passing a binary file is a good
Carlos> idea. Reading the PEP I assumed that the csvreader constructor
Carlos> just takes any object that can return lines. Well, binary file
Carlos> objects do not meet this definition. It would make the system
Carlos> much less flexible, making it more difficult to pass arbitrary
Carlos> iterables to the csv library.
The reader takes an iterable object. If that object has a binary mode flag
(as file objects do), we expect it to have been opened with that flag set.
This stuff all works fine now. I don't anticipate changes.
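To illustrate the "any iterable" point, here is a small sketch using the csv.reader of today's standard library. Neither input is a file object; numbered_lines is a made-up generator.

```python
import csv

def numbered_lines(count):
    """Generate CSV lines on the fly; no file object is involved."""
    for n in range(1, count + 1):
        yield f"row {n},{n * 10}"

# A plain list of strings works as input...
from_list = list(csv.reader(["a,b", "c,d"]))

# ...and so does a generator.
from_generator = list(csv.reader(numbered_lines(2)))

print(from_list)
print(from_generator)
```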
Carlos> For the sake of simplicity and clarity, why not leave the line
Carlos> termination option out of the csv library, in such a way that it
Carlos> can be implemented in the file object passed to the reader?
Because we might be generating CSV files on a Linux system (LF line
terminator) which are supposed to be consumed by a user on a Mac OS 8 system
running ClarisWorks 4 which (being the feeble tool it was) doesn't know
diddley squat about LF line terminators. Accordingly, we have to set the
lineterminator to CR. We can't do that with text mode files. Nor can we
assume that a person still running CW4 and Mac OS 8 will have any sort of
file conversion tools available.
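A sketch of that scenario, assuming the writer's lineterminator parameter as it exists in today's csv module; write_for_classic_mac is a hypothetical helper name, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_for_classic_mac(rows):
    """Serialize rows as CSV using a bare CR as the line terminator."""
    buffer = io.StringIO()
    # The writer, not the file object, decides how each row is terminated.
    writer = csv.writer(buffer, lineterminator="\r")
    writer.writerows(rows)
    return buffer.getvalue()

text = write_for_classic_mac([["row 1", 10], ["row 2", 20]])
print(repr(text))
```

Because the terminator is chosen per writer, the same program can emit CR, LF, or CRLF files regardless of the platform it runs on.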
Carlos> 5) I agree that fixed width text files are different beasts.
Carlos> Anyway, it should be possible to implement it using the same
Carlos> interface (or API, whatever you like calling it). Things like
Carlos> that make the learning curve smoother. But we can leave this
Carlos> discussion for a later time.
Sure, but "same API" != "same module". ;-)
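For what it's worth, a fixed-width reader with the same shape of interface (an iterable of lines in, lists of strings out) fits in a few lines. This is purely a hypothetical sketch; fixed_width_reader is not part of the csv module or of any planned one.

```python
def fixed_width_reader(lines, widths):
    """Yield one list of stripped fields per line, cut at the given widths."""
    for line in lines:
        line = line.rstrip("\r\n")
        row, start = [], 0
        for width in widths:
            row.append(line[start:start + width].strip())
            start += width
        yield row

# Two columns: a 10-character label followed by a 2-character number.
lines = ["row 1".ljust(10) + "10", "row 2".ljust(10) + "20"]
fixed_rows = list(fixed_width_reader(lines, (10, 2)))
print(fixed_rows)
```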
Carlos> Thanks for your comments, and please forgive my insistence :-)
No problem. Just don't move to New Zealand and change your name to
Graham. ;-) [see the recent python-dev flamefest about a native code
compiler for Python]
Skip