Mailman 3 csv dialect enhancement - Python-ideas

11 Jan 2013

      There is a common dialect of CSV, often used in database
applications [*1], that distinguishes between an empty
(quoted) string,

  e.g., the second field in  "abc","",3

and an empty field,

  e.g., the second field in "abc",,3

This distinction is needed to specify or tell the 
difference between 0-length strings and NULLs, when sending
csv data to or receiving it from a database application.

AFAICT, Python's csv module does not distinguish between
empty fields and empty quoted strings.  Both of the examples 
above, when parsed by csv.Reader, will return ['abc', '', 3] 
(or possibly '3' for the last item depending on options).  
Similarly, csv.Writer produces the same output csv text 
(nothing or a quoted empty string depending on Dialect.quoting) 
for row items '' or None.

csv.Reader could distinguish between the above cases by
using an empty string ('') to report an empty (quoted) string 
field, and None to report an empty field.  Thus the second 
example would produce ['abc', None, 3] (or ...,'3').  Similarly,
csv.Writer could produce alternate text (nothing or a quoted 
empty string) depending on whether a row item was None or 
an empty string.

I propose that a new dialect attribute be added, "nulls" [*2],
which when false (default) will cause csv to behave as it currently
does.  When true it will have the following effect:

Reader:
  When two adjacent delimiters occur, or two white-space
  separated delimiters when Dialect.skipinitialspace is true,
  a value of None will be returned for that field.

Writer:
  When a None is present in the the list of items being
  formatted, it will result in an empty output field
  (two adjacent delimiters) regardless of other options
  (eg a QUOTE_ALL setting.)

Sniffer:
  Will set "nulls" to True when both adjacent delimiters and
  quoted empty strings are seen in the input text. 
  (Perhaps this behaviour needs to be optional for backward
  compatibility reasons?)

I think this will allow the csv module to generate the csv
dialect(s) commonly used by databases applications. 

A specific use case:

I am migrating data from a MS Access database to Postgresql.
I run a tool that extracts table data from Access and correctly
produces CSV files in the dialect used by Postgresql with some 
(nullable) column values having empty fields and other non-
nullable column values having empty string fields.

But I need to modify some values before import.  So I write a
Python program that parses the csv data, modifies some of it
and writes it back out, using the csv module.  But the result 
is that all empty fields and empty strings are written out 
identically as one or the other (the distinction is not preserved).  
Result is that information is lost and the output cannot be 
used.  I would be able to do this if the csv module provide a
"nulls" option as proposed above.

AFAICT, Python's csv module does not distinguish between
empty fields and empty quoted strings.
Both of the examples above, when parsed by csv.Reader,
will return ['abc', '', 3] (or possibly '3' for the last item
depending on options).  Similarly, csv.Writer produces the
same output csv text (nothing or a quoted empty string
depending on Dialect.quoting) for row items '' or None.

csv.Reader could distinguish between the above cases by 
using an empty string ('') to report an empty (quoted) string field, 
and None to report an empty field.  Thus the second example 
would produce ['abc', None, 3] (or ...,'3').  Similarly, csv.Writer
could produce alternate text (nothing or a quoted empty string)
depending on whether a row item was None or an empty string.

I propose that a new dialect attribute be added, "nulls" [*2],
which when false (default) will cause csv to behave as it currently 
does.  When true it will have the following effect:

Reader: 
  When two adjacent delimiters occur, or two white-space 
  separated delimiters when Dialect.skipinitialspace is true, 
  a value of None will be returned for that field. 

Writer: 
  When a None is present in the the list of items being
  formatted, it will result in an empty output field 
  (two adjacent delimiters) regardless of other options
  (eg a QUOTE_ALL setting.)

Sniffer:
  Will set "nulls" to True when both adjacent delimiters and
  quoted empty strings are seen in the input text.  
  (Perhaps this behaviour needs to be optional for backward
  compatibility reasons?)

I think this will allow the csv module to generate the csv 
dialect(s) required for databases applications.  

A specific use case:

I am migrating data from a MS Access database to Postgresql.
I run a tool that extracts table data from Access and produces
CSV files in the dialect used by Postgresql with some (nullable)
column values having empty fields and other non-nullable column
values having empty string fields.

But I need to modify some values before import.  So I write a 
Python program that parses the csv data, modifies some of it 
and writes it out, using the csv module.  But the result is that 
all empty fields and empty strings are written out identically
as one or the other (the distinction is not preserved).  Result
is that information is lost and the output cannot be used. 
I would be able to do this if the csv module provide a 
"nulls" option as proposed above.

----
[*1] One of the two most important open-source databases,
Postgresql, uses this dialect.  See:
  http://www.postgresql.org/docs/9.2/interactive/sql-copy.html#AEN66692
I don't know about the other.

[*2] I don't really care what the attribute name is; I chose 
"nulls" as a trial balloon because I wanted to avoid something 
with "none" in it to avoid confusion with QUOTE_NONE.

csv dialect enhancement

rurpy＠yahoo.com

tags

participants (4)