[Tutor] Reading CSV files in Pandas

Danny Yoo dyoo at hashcollision.org
Mon Oct 21 23:42:29 CEST 2013


On Mon, Oct 21, 2013 at 11:57 AM, Manish Tripathi <tr.manish at gmail.com> wrote:

> It's pipeline data so must have been generated through Siebel and sent as
> excel csv.
>
>
I am assuming that you are talking about "Siebel Analytics", some kind of
analysis software from Oracle:

    http://en.wikipedia.org/wiki/Siebel_Systems

That would be fine, except that knowing it comes out of Siebel is no
guarantee that the output you're consuming is well-formed Excel CSV.  For
example, I see things like this:

    http://spendolini.blogspot.com/2006/04/custom-export-to-csv.html

where the generated output is "ad-hoc".



-----------

Hmmm... but let's assume for the moment that your data is ok.  Could the
problem be in pandas?  Let's follow this line of logic, and see where it
takes us.

Given the structure of the error you're seeing, I have to assume that
pandas is trying to decode the bytes and running into an issue, though
exactly where it hits the error is in question.  In fact, looking at:

    https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1357

for example, the library appears to be trying to decode line-by-line under
certain conditions.  If it runs into an error, it will report an offset
into a particular line.
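
As a toy illustration of what that per-line offset looks like (the bytes
here are invented, but the shape of the error matches): suppose some line
deep in the file contains a Latin-1 byte, and we try to decode just that
line as UTF-8.  The reported position is measured from the start of the
line, not the file:

##############################################################
>>> line = 'caf\xe9,42'    # imagine this is line 500 of the file
>>> line.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
##############################################################

That "position 3" says nothing about where in the whole file the bad byte
actually lives.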

Wow.  If I'm reading that right, that can be very bad.  The offset is not
given from the perspective of the whole file.  But it's worse than a
confusing error message: the approach is unsound.  The code _should_ be
doing the decoding from the perspective of the whole file, not at the
level of single lines.  It should be using codecs.open(), letting
codecs.open() handle the details of byte-to-unicode decoding.  Once we've
split raw bytes into "lines", it's too late: we've already committed to an
interpretation of the bytes that's potentially invalid.  For example, if
we're working with UTF-16 and we get into this code path, it'd be really
bad, because the newline byte we split on can land in the middle of a
two-byte code unit.
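
To make that concrete, here's a quick sketch (on a little-endian machine,
with invented data) of why splitting UTF-16 bytes on '\n' before decoding
is hopeless: each character is two bytes, so the split strands half a code
unit on each side:

##############################################################
>>> data = u'hello\nworld'.encode('utf-16')
>>> chunks = data.split('\n')    # naive byte-level line splitting
>>> chunks[1]
'\x00w\x00o\x00r\x00l\x00d\x00'
>>> chunks[1].decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 10: truncated data
##############################################################

The second "line" starts with a stray '\x00' left over from the two-byte
newline, so every byte pair after the split is misaligned.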


It's hard to tell whether or not we're taking that code path.  I'm
following the definition of read_csv from:

    https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L409

to:

   https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L282

to:

    https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L184

to:

    https://github.com/pydata/pandas/blob/master/pandas/io/common.py#L100



Ok, at that point, they appear to try to decode the entire file.  Somewhat
good so far.  Though, technically, pandas should be using codecs.open():

    http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data

and because it isn't, it appears to suck the entire file into memory
with StringIO.  Yikes.
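
For contrast, here's a minimal sketch of the codecs.open() approach that
howto describes (filename and contents invented).  The stream hands back
already-decoded unicode lines without slurping the raw bytes into a
StringIO first:

##############################################################
import codecs

# Write a small UTF-16 CSV file so we have something to read back.
f = codecs.open('demo.csv', 'w', encoding='utf-16')
f.write(u'name,value\ncaf\xe9,42\n')
f.close()

# Reading through codecs.open() yields unicode objects directly;
# the codec machinery handles the byte->unicode decoding for us.
f = codecs.open('demo.csv', 'r', encoding='utf-16')
for line in f:
    print repr(line)
f.close()
##############################################################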


Now the pandas library must make sure _not_ to decode() again, because
decoding is not an idempotent operation.

As a concrete example:

##############################################################
>>> 'foobar'.decode('utf-16')
u'\u6f66\u626f\u7261'
>>> 'foobar'.decode('utf-16').decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
##############################################################

This is reminiscent of the kind of error you're encountering, though I'm
not sure if this is the same situation.
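
By the way, one practical thing to try on your end in the meantime:
read_csv() accepts an encoding argument, so if you can find out what
encoding Siebel actually used, you can tell pandas up front instead of
letting it guess (the filename and encoding here are placeholders):

##############################################################
import pandas as pd

# Placeholder filename and encoding: swap in your real file and
# whatever the export actually is ('utf-8', 'utf-16', 'cp1252', ...).
df = pd.read_csv('pipeline.csv', encoding='utf-16')
##############################################################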



Unfortunately, I'm running out of time to analyze this further.  If you
could upload your data file somewhere, someone else here may have time to
investigate the error you're seeing in more detail.  From reading the
Pandas code, I'm discouraged by the code quality: I do think there's the
potential for a bug in the library.  The code is a heck of a lot more
complicated than I think it needs to be.