Parsing Excel spreadsheets
BrooklineTom at gmail.com
Wed Jan 7 22:34:23 CET 2009
On Jan 2, 7:04 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 3, 2:01 am, brooklineTom <Brookline... at gmail.com> wrote:
> My point was that however the original XLS files were created or
> acquired, the first step in your solution involves converting the XLS
> file to "XML Spreadsheet" format, which requires a copy of Excel on a
> Windows box. Many people start with an XLS file, no Excel and no
> Windows box, no COM, and users who can't be relied on to open a file
> and save it in the right format with the right name and extension.
True enough. I develop on a WinXP box and have Excel. I just used it.
> BTW, did you consider opening the XLS files with OpenOffice.org's Calc
> and saving it in their default ods format (chunks of XML in a zip
No. As I said, I have Excel.
> > 3. The largest file I used was about 228M,
> Is that the XLS file or the XML file?
That's the xml file size. I just *love* file-bloat, don't you? :-)
> > I knew that I needed only
> > a small subset of the xlrd behavior, and I concluded (perhaps
> > incorrectly) that it would be easier to roll my own parser than find,
> > extract, and then port (to my own framework) the corresponding parts
> > of xlrd.
> Possibly incorrectly. If approached at the time, I would have said:
> (a) if desperate to DIY:
> (a1) ignore any code for old Excel versions (self.biff_version < 80)
> (a2) ignore any code for extracting formatting info
> (a3) find the Sheet.put_cell* methods in sheet.py e.g.
> def put_cell(self, rowx, colx, ctype, value, xf_index):
> ignore the xf_index arg and subvert them to your own needs instead of
> filling up a big rectangular arena with data
> (b) if not really so desperate, talk to me about implementing an
> option in xlrd where callers can specify a callback to be used instead
> of the Sheet.put_cell* methods
> (c) What is all this "port to my own framework" caper anyway? If you
> need to extract data from a database, do you rummage in their code
> libraries and port the relevant bits to your own framework?
I'm sure xlrd is fine, I wasn't any way suggesting that any
"improvements" are needed. I was just explaining what I did.
> "Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
> XML tells you the type (String, Boolean, ...). xlrd tells you the type
> (XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).
Sure, once you've sorted through the excess MS-specific stuff, dealt
with the runs of empty cells, and so on.
I needed to write an exception file, containing the original
spreadsheet entries that failed, so that our curators could find and
fix (or delete) them. Thus, I already had to do row-by-row handling of
the input, I had to catch and handle exceptions, I had to know and
remember enough about the original SS format/layout to replicate it in
the exception file, and so on.
> I think so. Thanks. You didn't directly address the "steep learning
> curve" question, but you explained enough of where you came from.
> It's a matter of "horses for courses". I guess some people might
> regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
> slightly non-horizontal learning curve :-)
Yes. Sadly, I had already climbed the dom/minidom/pullparser curve for
all the other stuff I have to do. In the space I work in, xml handling
is pervasive enough that I already have all that stuff "in my hands"
I wonder if the OP ever read any of this.... :-)
More information about the Python-list