Parsing Excel spreadsheets
steve at holdenweb.com
Fri Jan 2 17:32:39 CET 2009
> On Dec 31 2008, 9:56 am, John Machin <sjmac... at lexicon.net> wrote:
>> On Dec 31 2008, 4:02 pm, brooklineTom <Brookline... at gmail.com> wrote:
>>> andyh... at gmail.com wrote:
>>>> Can anybody recommend an approach for loading and parsing Excel
>>>> spreadsheets in Python. Any well known/recommended libraries for this?
>>>> The only thing I found in a brief search washttp://www.lexicon.net/sjmachin/xlrd.htm,
>>>> but I'd rather get some more input before going with something I don't
>>> I save the spreadsheets (in Excel) in xml format.
>> Which means that you need to be on a Windows box with a licensed copy
>> of Excel. I presume you talking about using Excel 2003 and saving as
>> "XML Spreadsheet (*.xml)". Do you save the files manually, or using a
>> COM script? What is the largest xls file that you've saved as xml, how
>> big was the xml file, and how long did it take to parse the xml file?
>> Do you extract formatting information or just cell contents?
> 1. The OP requested Excel files, by construction those must be
> generated with a licensed copy of Excel. I did the actual processing
> on both linux and windoze platforms.
Well, even if Andy meant "Excel files" rather than "Excel-formatted
files" there are many ways to come by these without having a licensed
copy of Excel. FTP and email attachment come to mind most readily.
How then to convert those to XML without Excel?
[...]> I looked, briefly, at xlrd. I found and scanned a few alternatives,
> though I don't remember what the others were. I needed something I
> could incorporate into my own application framework, and I knew I
> didn't need most of the formatting information. I'm not in any way
> criticizing xlrd, it's simply that, based on its API summary, it seems
> focused on problems I didn't have to solve. I knew that I needed only
> a small subset of the xlrd behavior, and I concluded (perhaps
> incorrectly) that it would be easier to roll my own parser than find,
> extract, and then port (to my own framework) the corresponding parts
> of xlrd.
My own case was similar, in that I only needed the value data. The
approach I took was to install xlrd and use it. Job done.
> I needed to extract the content of each row, cell by cell, and build
> data objects (in my framework) with the content of various cells. I
> also needed to build an "exception file" containing malformed entries
> that I could re-open with Excel after my code finished, so that the
> bogus entries could be manually corrected. What I mean by "malformed"
> entry is, for example, an address field that fails to correctly
> geocode or comment fields with confused utf8/unicode contents. My
> focus was on data content, as opposed to presentation. I needed to
> crack the cells into things like "string", "boolean", "float", and so
> Most importantly, I needed to do this one entry at a time -- I did
> *not* want to load the entire spreadsheet at once.
My data files weren't that large (IIRC the largest spreadsheet was about
6MB), so I was quite happy to load the whole thing in memory, iterate
over it and then write the results to the database as they were extracted.
> I'm not saying that this couldn't be done with xlrd; only that I chose
> to roll my own and had minimal difficulty doing so.
Given the constraints of your problem it seems like an intelligent approach.
> I hope this helps!
I'm sure it will.
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
More information about the Python-list