<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Oct 21, 2013 at 11:57 AM, Manish Tripathi <span dir="ltr">&lt;<a href="mailto:tr.manish@gmail.com" target="_blank">tr.manish@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">It's pipeline data so must have been generated through Siebel and sent as excel csv. </div>
<div class=""><div class="h5"><div class="gmail_extra"><br></div></div></div></blockquote><div><br></div><div>I am assuming that you are talking about "Siebel Analytics", some kind of analysis software from Oracle:</div>
<div><br></div><div> <a href="http://en.wikipedia.org/wiki/Siebel_Systems">http://en.wikipedia.org/wiki/Siebel_Systems</a><br></div><div><br></div><div>That would be fine, except that knowing it comes out of Siebel is no guarantee that the output you're consuming is well-formed Excel CSV. For example, I see things like this:<br>
</div><div><br></div><div> <a href="http://spendolini.blogspot.com/2006/04/custom-export-to-csv.html">http://spendolini.blogspot.com/2006/04/custom-export-to-csv.html</a><br></div><div><br></div><div>where the generated output is "ad-hoc".</div>
<div><br></div><div><br></div><div><br></div><div>-----------</div><div><br></div><div>Hmmm... but let's assume for the moment that your data is ok. Could the problem be in pandas? Let's follow this line of logic, and see where it takes us.</div>
<div><br></div><div>Given the structure of the error you're seeing, I have to assume that pandas is trying to decode the bytes, and runs into an issue, though the exact position where it's running into an error is in question. In fact, looking at:</div>
<div><br></div><div>    <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1357">https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1357</a><br></div><div><br></div><div>for example, the library appears to decode line-by-line in certain situations. If it runs into an error there, the offset it reports is an offset into that particular line.</div>
<div><br></div><div>Wow. That can be very bad, if I'm reading that right. The offset it reports is relative to a single line, not to the whole file, so it doesn't tell you where in the file the problem is. But it's worse than that, because it's unsound. The code _should_ be decoding from the perspective of the whole file, not at the level of single lines. It needs to use codecs.open(), and let codecs.open() handle the details of byte->unicode-string decoding. Once the bytes have been split into lines, it's way too late: we've already committed to an interpretation of the bytes that's potentially invalid. Example: in UTF-16 every character takes at least two bytes, so splitting on the single newline byte leaves half of a two-byte unit behind, and the pieces no longer decode correctly.</div>
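<div><br></div><div>To make the UTF-16 concern concrete, here is a minimal sketch written for Python 3. It is not the pandas code: decode_per_line and decode_whole_file are hypothetical helpers, the sample data is made up, and io.open() stands in for codecs.open() as the thing that decodes the stream as a whole.</div>
<div><br></div>
<pre>
```python
import io
import os
import tempfile


def decode_per_line(raw, encoding):
    """Split raw bytes on the newline byte, then decode each piece (the unsound way)."""
    results = []
    for chunk in raw.split(b"\n"):
        try:
            results.append(chunk.decode(encoding))
        except UnicodeDecodeError as exc:
            # exc.start counts from the start of this chunk, not of the file.
            results.append(exc)
    return results


def decode_whole_file(path, encoding):
    """Let the I/O layer decode the byte stream as a whole."""
    with io.open(path, encoding=encoding, newline="") as handle:
        return handle.read()


# Made-up sample data: in UTF-16-LE each of these characters is two bytes,
# so the newline is stored as b"\n\x00".
text = u"id,name\n1,caf\u00e9\n"
raw = text.encode("utf-16-le")

per_line = decode_per_line(raw, "utf-16-le")

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(raw)
    whole = decode_whole_file(path, "utf-16-le")
finally:
    os.remove(path)
```
</pre>
<div>Only the first piece survives the per-line approach. Every later piece starts with the stray zero byte left over from the previous newline, so it fails to decode, and the error position it carries is relative to that piece. Decoding the file as a whole returns the original text.</div>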
<div><br></div><div><br></div><div>It's hard to tell whether or not we're taking that code path. I'm following the definition of read_csv from:</div><div><br></div><div> <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L409">https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L409</a><br>
</div><div><br></div><div>to:</div><div><br></div><div> <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L282">https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L282</a><br></div>
<div><br></div><div>to:</div><div><br></div><div> <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L184">https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L184</a><br></div><div>
<br></div><div>to:</div><div><br></div><div> <a href="https://github.com/pydata/pandas/blob/master/pandas/io/common.py#L100">https://github.com/pydata/pandas/blob/master/pandas/io/common.py#L100</a><br></div><div><br>
</div>
<div><br></div><div><br></div><div>Ok, at that point, they appear to try to decode the entire file. Somewhat good so far. Though, technically, pandas should be using codecs.open():</div><div><div><br></div><div> <a href="http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data">http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data</a><br>
</div><div><br></div><div>and because they aren't, they appear to read the entire file into memory with StringIO. Yikes.</div></div><div><br></div><div><br></div><div>Now the pandas library must make sure _not_ to decode() again, because decoding is not an idempotent operation.</div>
<div><br></div><div>As a concrete example:</div><div><br></div><div>##############################################################</div><div><div><div>>>> 'foobar'.decode('utf-16')</div><div>u'\u6f66\u626f\u7261'</div>
<div>>>> 'foobar'.decode('utf-16').decode('utf-16')</div><div>Traceback (most recent call last):</div><div>  File "&lt;stdin&gt;", line 1, in &lt;module&gt;</div><div>  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode</div>
<div> return codecs.utf_16_decode(input, errors, True)</div><div>UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)</div></div><div>##############################################################<br>
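</div></div><div><div></div></div><div><br></div><div>Note that the second decode() fails with a UnicodeEncodeError, not a decode error: in Python 2, calling decode() on a unicode string first implicitly encodes it with the ascii codec. This is reminiscent of the kind of error you're encountering, though I'm not sure if this is the same situation.</div><div><br></div><div><br></div><div><br></div>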
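<div>For comparison, here is a rough sketch of the "decode exactly once" discipline, written for Python 3, where the csv module accepts text directly. Nothing here is pandas API: ensure_text, iter_rows and demo are hypothetical names, the sample data is made up, and io.open() stands in for codecs.open().</div>
<div><br></div>
<pre>
```python
import csv
import io
import os
import tempfile


def ensure_text(value, encoding):
    """Return text, decoding only if value is still bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text: decoding it again would be a bug


def iter_rows(path, encoding):
    """Yield CSV rows as lists of text, decoding once while streaming the file."""
    # io.open() decodes incrementally, so the whole file is never held in memory.
    with io.open(path, encoding=encoding, newline="") as handle:
        for row in csv.reader(handle):
            yield row


def demo():
    """Write a small UTF-16 CSV file (made-up data) and read it back as rows."""
    text = u"id,name\n1,caf\u00e9\n"
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(text.encode("utf-16"))
        rows = list(iter_rows(path, "utf-16"))
    finally:
        os.remove(path)
    return rows
```
</pre>
<div>The idea is that bytes become text in exactly one place, at the point where the file is opened, and everything downstream only ever sees text. A guard like ensure_text() is then only needed where it's unclear whether a value has already been decoded.</div>
<div><br></div>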
<div>Unfortunately, I'm running out of time to analyze this further. If you could upload your data file somewhere, someone else here may have time to investigate the error you're seeing in more detail. From reading the pandas code, I'm discouraged by the code quality: I do think there's potential for a bug in the library. The code is a heck of a lot more complicated than I think it needs to be.</div>
</div></div></div>