[Python-Dev] csv module TODO list

"Martin v. Löwis" martin at v.loewis.de
Thu Jan 6 17:05:05 CET 2005


Andrew McNamara wrote:
> Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv
> files, so a reasonable starting point would be the ability to read and
> parse, as well as the ability to generate, one of these.

I see. That would be reasonable, indeed. Notice that this is not so much
a "Unicode issue", but more an "encoding" issue. If you solve the
"arbitrary encodings" problem, you solve UTF-16 as a side effect.

> The reader interface currently returns a row at a time, consuming as many
> lines from the supplied iterable (with the most common iterable being
> a file). This suggests to me that we will need an optional "encoding"
> argument to the reader constructor, and that the reader will need to
> decode the source lines.

Ok. In this context, I see two possible implementation strategies:
1. Implement the csv module twice: once for bytes, and once for
    Unicode characters. It is likely that the source code would be
    the same for each case; you just need to make sure the "Dialect
    and Formatting Parameters" change their width accordingly.
    If you use the SRE approach, you would do

    /* expand the template once for the byte-oriented version */
    #define CSV_ITEM_T char
    #define CSV_NAME_PREFIX byte_
    #include "csvimpl.c"

    /* then re-expand the same source for the Unicode version */
    #undef CSV_ITEM_T
    #undef CSV_NAME_PREFIX
    #define CSV_ITEM_T Py_UNICODE
    #define CSV_NAME_PREFIX unicode_
    #include "csvimpl.c"

2. Use just the existing _csv module, and represent non-byte encodings
    as UTF-8. This will work as long as the delimiters and other markup
    characters always occupy a single byte in UTF-8, which is the case
    for "':\, as well as for \r and \n. Then, when processing with
    an explicit encoding, first convert the input into Unicode objects.
    Then encode the Unicode objects into UTF-8, and pass the result to
    _csv. For the results you get back, convert each element from UTF-8
    back to a Unicode object.

This could be implemented as

import _csv
import codecs
import itertools

def reader(f, encoding=None):
    if encoding is None:
        return _csv.reader(f)
    enc, dec, stream_reader, stream_writer = codecs.lookup(encoding)
    utf8_enc, utf8_dec, utf8_r, utf8_w = codecs.lookup("UTF-8")
    # Make a recoder which can only read: the input is decoded with
    # the source encoding's StreamReader and re-encoded as UTF-8
    utf8_stream = codecs.StreamRecoder(f, utf8_enc, None, stream_reader, None)
    csv_reader = _csv.reader(utf8_stream)
    # For performance reasons, map_result could be implemented in C
    def map_result(t):
        result = [None]*len(t)
        for i, val in enumerate(t):
            # the decoder returns an (object, length) pair; keep the object
            result[i] = utf8_dec(val)[0]
        return tuple(result)
    return itertools.imap(map_result, csv_reader)
# This code is untested
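
For illustration, reading a UTF-16 encoded file with this wrapper might
look like the following (the file name is made up; the file must be
opened in binary mode so the StreamRecoder sees the raw bytes):

f = open("data.csv", "rb")
for row in reader(f, encoding="utf-16"):
    # each row is a tuple of Unicode objects
    print row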

This approach has the disadvantage of performing three recodings:
from the input charset to Unicode, from Unicode to UTF-8, and from
UTF-8 back to Unicode. One could:
- skip the initial recoding if the encoding is already known
   to be _csv-safe (i.e. if it is a pure ASCII superset).
   This would be valid for ASCII, iso-8859-n, UTF-8, ...
   (a sketch of this shortcut follows the list)
- offer the user the option of keeping the results in the input
   encoding, instead of always returning Unicode objects.
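
A minimal sketch of the first shortcut, reusing the reader() function
above; the _CSV_SAFE whitelist and the reader2 name are made up for
illustration, and the whitelist is deliberately incomplete:

import _csv
import codecs
import itertools

# Encodings whose markup characters (delimiter, quote, \r, \n) occupy
# a single ASCII byte, so the raw byte stream can go to _csv unchanged
_CSV_SAFE = set(["ascii", "utf-8", "iso-8859-1", "iso-8859-15"])

def reader2(f, encoding=None):
    if encoding is not None and encoding.lower() in _CSV_SAFE:
        # Skip the input -> Unicode -> UTF-8 round trip: parse the
        # raw bytes, then decode each field from the input encoding
        decode = codecs.lookup(encoding)[1]
        def map_result(t):
            return tuple([decode(val)[0] for val in t])
        return itertools.imap(map_result, _csv.reader(f))
    # Otherwise fall back to recoding through UTF-8 as above
    return reader(f, encoding)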

Apart from this disadvantage, I think this gives people what they want:
they can specify the encoding of the input, and they get the results
not only csv-separated, but also Unicode-decoded. This is the same
approach used for Python source code encodings: the source is first
recoded into UTF-8, then parsed, then recoded back.

> That said, I'm hardly a unicode expert, so I
> may be overlooking something (could a utf-16 encoded character span a
> line break, for example).

This cannot happen: \r, in UTF-16, is also 2 bytes (0D 00, if UTF-16LE).
There is the issue that Unicode has additional line-break characters,
but that is probably irrelevant here.
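
For instance, a quick interactive check (Python 2) shows that the
line-terminator characters each occupy two whole bytes in UTF-16LE:

>>> u'\r'.encode('utf-16-le')
'\r\x00'
>>> u'\n'.encode('utf-16-le')
'\n\x00'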

Regards,
Martin

