csv dialect enhancement
There is a common dialect of CSV, often used in database applications [*1], that distinguishes between an empty (quoted) string, e.g., the second field in "abc","",3 and an empty field, e.g., the second field in "abc",,3 This distinction is needed to specify or tell the difference between 0-length strings and NULLs, when sending csv data to or receiving it from a database application. AFAICT, Python's csv module does not distinguish between empty fields and empty quoted strings. Both of the examples above, when parsed by csv.Reader, will return ['abc', '', 3] (or possibly '3' for the last item depending on options). Similarly, csv.Writer produces the same output csv text (nothing or a quoted empty string depending on Dialect.quoting) for row items '' or None. csv.Reader could distinguish between the above cases by using an empty string ('') to report an empty (quoted) string field, and None to report an empty field. Thus the second example would produce ['abc', None, 3] (or ...,'3'). Similarly, csv.Writer could produce alternate text (nothing or a quoted empty string) depending on whether a row item was None or an empty string. I propose that a new dialect attribute be added, "nulls" [*2], which when false (default) will cause csv to behave as it currently does. When true it will have the following effect: Reader: When two adjacent delimiters occur, or two white-space separated delimiters when Dialect.skipinitialspace is true, a value of None will be returned for that field. Writer: When a None is present in the the list of items being formatted, it will result in an empty output field (two adjacent delimiters) regardless of other options (eg a QUOTE_ALL setting.) Sniffer: Will set "nulls" to True when both adjacent delimiters and quoted empty strings are seen in the input text. (Perhaps this behaviour needs to be optional for backward compatibility reasons?) I think this will allow the csv module to generate the csv dialect(s) commonly used by databases applications. A specific use case: I am migrating data from a MS Access database to Postgresql. I run a tool that extracts table data from Access and correctly produces CSV files in the dialect used by Postgresql with some (nullable) column values having empty fields and other non- nullable column values having empty string fields. But I need to modify some values before import. So I write a Python program that parses the csv data, modifies some of it and writes it back out, using the csv module. But the result is that all empty fields and empty strings are written out identically as one or the other (the distinction is not preserved). Result is that information is lost and the output cannot be used. I would be able to do this if the csv module provide a "nulls" option as proposed above. AFAICT, Python's csv module does not distinguish between empty fields and empty quoted strings. Both of the examples above, when parsed by csv.Reader, will return ['abc', '', 3] (or possibly '3' for the last item depending on options). Similarly, csv.Writer produces the same output csv text (nothing or a quoted empty string depending on Dialect.quoting) for row items '' or None. csv.Reader could distinguish between the above cases by using an empty string ('') to report an empty (quoted) string field, and None to report an empty field. Thus the second example would produce ['abc', None, 3] (or ...,'3'). Similarly, csv.Writer could produce alternate text (nothing or a quoted empty string) depending on whether a row item was None or an empty string. I propose that a new dialect attribute be added, "nulls" [*2], which when false (default) will cause csv to behave as it currently does. When true it will have the following effect: Reader: When two adjacent delimiters occur, or two white-space separated delimiters when Dialect.skipinitialspace is true, a value of None will be returned for that field. Writer: When a None is present in the the list of items being formatted, it will result in an empty output field (two adjacent delimiters) regardless of other options (eg a QUOTE_ALL setting.) Sniffer: Will set "nulls" to True when both adjacent delimiters and quoted empty strings are seen in the input text. (Perhaps this behaviour needs to be optional for backward compatibility reasons?) I think this will allow the csv module to generate the csv dialect(s) required for databases applications. A specific use case: I am migrating data from a MS Access database to Postgresql. I run a tool that extracts table data from Access and produces CSV files in the dialect used by Postgresql with some (nullable) column values having empty fields and other non-nullable column values having empty string fields. But I need to modify some values before import. So I write a Python program that parses the csv data, modifies some of it and writes it out, using the csv module. But the result is that all empty fields and empty strings are written out identically as one or the other (the distinction is not preserved). Result is that information is lost and the output cannot be used. I would be able to do this if the csv module provide a "nulls" option as proposed above. ---- [*1] One of the two most important open-source databases, Postgresql, uses this dialect. See: http://www.postgresql.org/docs/9.2/interactive/sql-copy.html#AEN66692 I don't know about the other. [*2] I don't really care what the attribute name is; I chose "nulls" as a trial balloon because I wanted to avoid something with "none" in it to avoid confusion with QUOTE_NONE.
On 01/11/2013 09:16 AM, rurpy@yahoo.com wrote:
I propose that a new dialect attribute be added, "nulls", which when false (default) will cause csv to behave as it currently does. When true it will have the following effect:
Reader: When two adjacent delimiters occur, or two white-space separated delimiters when Dialect.skipinitialspace is true, a value of None will be returned for that field.
Writer: When a None is present in the the list of items being formatted, it will result in an empty output field (two adjacent delimiters) regardless of other options (eg a QUOTE_ALL setting.)
Sniffer: Will set "nulls" to True when both adjacent delimiters and quoted empty strings are seen in the input text. (Perhaps this behaviour needs to be optional for backward compatibility reasons?)
+1
[Sorry for the duplicated text in the previous post, please ignore that one in favor of this one] There is a common dialect of CSV, often used in database applications [*1], that distinguishes between an empty (quoted) string, e.g., the second field in "abc","",3 and an empty field, e.g., the second field in "abc",,3 This distinction is needed to specify or tell the difference between 0-length strings and NULLs, when sending csv data to or receiving it from a database application. AFAICT, Python's csv module does not distinguish between empty fields and empty quoted strings. Both of the examples above, when parsed by csv.Reader, will return ['abc', '', 3] (or possibly '3' for the last item depending on options). Similarly, csv.Writer produces the same output csv text (nothing or a quoted empty string depending on Dialect.quoting) for row items '' or None. csv.Reader could distinguish between the above cases by using an empty string ('') to report an empty (quoted) string field, and None to report an empty field. Thus the second example would produce ['abc', None, 3] (or ...,'3'). Similarly, csv.Writer could produce alternate text (nothing or a quoted empty string) depending on whether a row item was None or an empty string. I propose that a new dialect attribute be added, "nulls" [*2], which when false (default) will cause csv to behave as it currently does. When true it will have the following effect: Reader: When two adjacent delimiters occur, or two white-space separated delimiters when Dialect.skipinitialspace is true, a value of None will be returned for that field. Writer: When a None is present in the the list of items being formatted, it will result in an empty output field (two adjacent delimiters) regardless of other options (eg a QUOTE_ALL setting.) Sniffer: Will set "nulls" to True when both adjacent delimiters and quoted empty strings are seen in the input text. (Perhaps this behaviour needs to be optional for backward compatibility reasons?) I think this will allow the csv module to generate the csv dialect(s) commonly used by databases applications. A specific use case: I am migrating data from a MS Access database to Postgresql. I run a tool that extracts table data from Access and correctly produces CSV files in the dialect used by Postgresql with some (nullable) column values having empty fields and other non- nullable column values having empty string fields. But I need to modify some values before import. So I write a Python program that parses the csv data, modifies some of it and writes it back out, using the csv module. But the result is that all empty fields and empty strings are written out identically as one or the other (the distinction is not preserved). Result is that information is lost and the output cannot be used. I would be able to do this if the csv module provide a "nulls" option as proposed above. ---- [*1] One of the two most important open-source databases, Postgresql, uses this dialect. See: http://www.postgresql.org/docs/9.2/interactive/sql-copy.html#AEN66692 I don't know about the other. [*2] I don't really care what the attribute name is; I chose "nulls" as a trial balloon because I wanted to avoid something with "none" in it to avoid confusion with QUOTE_NONE.
On Fri, Jan 11, 2013 at 6:51 PM, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
[Sorry for the duplicated text in the previous post, please ignore that one in favor of this one]
There is a common dialect of CSV, often used in database applications [*1], that distinguishes between an empty (quoted) string,
How many DBMS have this dialect? e.g. MySQL want \N for null values, in other databases this is not even possible. Anyway I think that should be implemented because it may have different uses.
[*2] I don't really care what the attribute name is; I chose "nulls" as a trial balloon because I wanted to avoid something with "none" in it to avoid confusion with QUOTE_NONE.
+1
On Sat, Jan 12, 2013 at 4:16 AM, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
There is a common dialect of CSV, often used in database applications [*1], that distinguishes between an empty (quoted) string,
e.g., the second field in "abc","",3
and an empty field,
e.g., the second field in "abc",,3
This distinction is needed to specify or tell the difference between 0-length strings and NULLs, when sending csv data to or receiving it from a database application.
Ugh, this is exactly the sort of thing that my boss didn't believe happened. He thinks that CSV is the same the world over, except for a few really old or arcane programs that can be completely ignored. Took a lot of arguing before we agreed to disagree on that one... As an explicitly-requestable dialect, looks good.
Sniffer: Will set "nulls" to True when both adjacent delimiters and quoted empty strings are seen in the input text. (Perhaps this behaviour needs to be optional for backward compatibility reasons?)
Yes, and make it optional. I think the interpretation of ,,,, as empty strings is the more common, since CSV is often used in contexts that don't have a concept of NULL (spreadsheets mainly); this ought, then, to be the default, but one quick option can add recognition of this. So, +1 on the whole idea. ChrisA
participants (4)
-
Chris Angelico
-
Ethan Furman
-
Federico Reghenzani
-
rurpy@yahoo.com