[Csv] Re: [Python-Dev] csv module TODO list
"Martin v. Löwis"
martin at v.loewis.de
Thu Jan 6 17:05:05 CET 2005
Andrew McNamara wrote:
> Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv
> files, so a reasonable starting point would be the ability to read and
> parse, as well as the ability to generate, one of these.
I see. That would be reasonable, indeed. Notice that this is not so much
a "Unicode issue", but more an "encoding" issue. If you solve the
"arbitrary encodings" problem, you solve UTF-16 as a side effect.
> The reader interface currently returns a row at a time, consuming as many
> lines from the supplied iterable (with the most common iterable being
> a file). This suggests to me that we will need an optional "encoding"
> argument to the reader constructor, and that the reader will need to
> decode the source lines.
Ok. In this context, I see two possible implementation strategies:
1. Implement the csv module twice: once for bytes and once for
Unicode characters. It is likely that the source code would be
the same for each case; you just need to make sure the "Dialect
and Formatting Parameters" change their width accordingly.
If you use the SRE approach, you would do

    #define CSV_ITEM_T char
    #define CSV_NAME_PREFIX byte_
    #include "csvimpl.c"
    #undef CSV_ITEM_T
    #undef CSV_NAME_PREFIX
    #define CSV_ITEM_T Py_UNICODE
    #define CSV_NAME_PREFIX unicode_
    #include "csvimpl.c"
2. Use just the existing _csv module, and represent non-byte encodings
as UTF-8. This will work as long as the delimiter and other markup
characters always occupy a single byte in UTF-8, which is the case
for "':\, as well as for \r and \n. Then, when processing with
an explicit encoding, first convert the input into Unicode objects,
then encode the Unicode objects into UTF-8 and pass them to _csv.
For the results you get back, convert each element from UTF-8
back to a Unicode object.
This could be implemented as
    import codecs, itertools
    import _csv

    def reader(f, encoding=None):
        if encoding is None:
            return _csv.reader(f)
        enc, dec, Reader, Writer = codecs.lookup(encoding)
        utf8_enc, utf8_dec, utf8_r, utf8_w = codecs.lookup("UTF-8")
        # Make a recoder which can only read
        utf8_stream = codecs.StreamRecoder(f, utf8_enc, None, Reader, None)
        csv_reader = _csv.reader(utf8_stream)
        # For performance reasons, map_result could be implemented in C
        def map_result(t):
            result = [None] * len(t)
            for i, val in enumerate(t):
                # the decoder returns a (value, length) pair
                result[i] = utf8_dec(val)[0]
            return tuple(result)
        return itertools.imap(map_result, csv_reader)
    # This code is untested
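For comparison, the same recode-then-parse idea can be sketched with
facilities that exist in today's standard library (io.TextIOWrapper and
the text-mode csv.reader were not available when this was written;
unicode_reader is just an illustrative helper name, not part of the
proposal): the byte stream is decoded once, and the parser works on
text directly, so no UTF-8 round trip is needed.

```python
import csv
import io

def unicode_reader(binary_stream, encoding):
    # Decode the byte stream once; newline="" is what the csv module
    # recommends so that it can handle embedded line breaks itself.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding,
                                   newline="")
    return csv.reader(text_stream)

# Example: parse a UTF-16-LE encoded CSV held in memory.
data = 'a,b\n"x,y",z\n'.encode("utf-16-le")
rows = list(unicode_reader(io.BytesIO(data), "utf-16-le"))
# rows is [['a', 'b'], ['x,y', 'z']]
```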
This approach has the disadvantage of performing three recodings:
from input charset to Unicode, from Unicode to UTF-8, from UTF-8
to Unicode. One could:
- skip the initial recoding if the encoding is already known
to be _csv-safe (i.e. if it is a pure ASCII superset).
This would be valid for ASCII, iso-8859-n, UTF-8, ...
- offer the user to keep the results in the input encoding,
instead of always returning Unicode objects.
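The "_csv-safe" property mentioned above can be checked mechanically:
in UTF-8 every ASCII character encodes to a single byte, so the usual
CSV markup characters survive the recoding untouched, whereas in
UTF-16 each of them becomes two bytes. A quick Python illustration:

```python
# Common CSV markup characters: quotes, delimiters, escape, newlines.
markup = ['"', "'", ",", ";", "\t", "\\", "\r", "\n"]

# In UTF-8 each is a single byte, so a byte-oriented parser is safe.
assert all(len(c.encode("utf-8")) == 1 for c in markup)

# In UTF-16-LE each is two bytes, so the byte-oriented _csv parser
# could not be pointed at UTF-16 input directly.
assert all(len(c.encode("utf-16-le")) == 2 for c in markup)
```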
Apart from this disadvantage, I think this gives people what they want:
they can specify the encoding of the input, and they get the results
not only csv-separated, but also Unicode-decoded. This is the same
approach that is used for Python source code encodings: the source is
first recoded into UTF-8, then parsed, then recoded back.
> That said, I'm hardly a unicode expert, so I
> may be overlooking something (could a utf-16 encoded character span a
> line break, for example).
This cannot happen: \r, in UTF-16, is also 2 bytes (0D 00, if UTF-16LE).
There is the issue that Unicode defines additional line break
characters, but that is probably irrelevant here.
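The claim about \r can be verified directly (a quick Python check):

```python
# In UTF-16-LE, carriage return is the two-byte sequence 0D 00, so a
# line break is itself a complete 2-byte code unit and no character
# can straddle it.
assert "\r".encode("utf-16-le") == b"\x0d\x00"

# In big-endian order the two bytes are simply swapped: 00 0D.
assert "\r".encode("utf-16-be") == b"\x00\x0d"
```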
Regards,
Martin