[lxml-dev] read .xlsx spreadsheets with lxml ?

Folks, has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ? A simple API along the lines of csv would be nice:
doc = openxmllib.openXmlDocument( path= "...xlsx" ) for row in doc: for col in row: # num / string
(Background: Mac Openoffice chokes on an xlsx with > 65536 rows, grr.)
Thanks, cheers -- denis

On Tue, 16 Nov 2010 12:03:08 +0000 (UTC) denis denis-bz-py@t-online.de wrote:
Folks, has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ? A simple API along the lines of csv would be nice:
No. First I'd try http://pypi.python.org/pypi/xlrd - I haven't used it for .xlsx, but it works well for .xls and I think it also supports .xlsx.
Small code example below.
Cheers -Terry
import xlrd
from collections import defaultdict
filename = "MasterDatabase.xls"
book = xlrd.open_workbook(filename)
cnt = defaultdict(lambda: 0)
for sheet in book.sheets(): print("{0.name:>20s} {0.nrows}".format(sheet))
sheet0 = book.sheet_by_index(0) for row in range(sheet0.nrows): cnt[sheet0.cell(row,0).value] += 1

Hi,
Folks, has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ? A simple API along the lines of csv would be nice:
doc = openxmllib.openXmlDocument( path= "...xlsx" ) for row in doc: for col in row: # num / string
(Background: Mac Openoffice chokes on an xlsx with > 65536 rows, grr.)
Well, the lxml APIs are simple enough for handling the XML *inside* the .xlsx zip archive. Don't know how complicated the structure of the file itself can get.
Here's the lxml.objectify notion:
$ unzip Foo.xlsx $ python -i -c 'from lxml import etree, objectify'
root = objectify.parse("./tmp/Foo/xl/worksheets/sheet1.xml").getroot() print root.tag
{http://schemas.openxmlformats.org/spreadsheetml/2006/main%7Dworksheet
for row in root.sheetData.row:
... for c in row.c: ... print "%s: %s" % (c.get('r'), c.v) ... A1: 0 B1: 1 C1: 2 A2: 1 B2: 2 C2: 3 A3: 4 B3: 5 C3: 6
Holger

Thanks Holger, Thanks Terry,
I was really looking for someone who's *used* lxml (or ...) on big Microsoft xlsx spreadsheets. I gather from http://en.wikipedia.org/wiki/Office_Open_XML that the format is messy -- Part 1 (Fundamentals and Markup Language Reference) This part has 5560 pages ?!
Bytheway xlrd 0.7.1 -> File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 429, in open_workbook biff_version = bk.getbof(XL_WORKBOOK_GLOBALS) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 1545, in getbof bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8]) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 1539, in bof_error raise XLRDError('Unsupported format, or corrupt file: ' + msg) xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found 'PK\x03\x04\x14\x00\x06\x00'
cheers -- denis

On 16/11/2010 16:59, denis wrote:
Thanks Holger, Thanks Terry,
I was really looking for someone who's *used* lxml (or ...) on big Microsoft xlsx spreadsheets.
John Machin over on the python-excel group has done just this. He has some alpha code that I know he'd like to see merged into the xlrd code base but he's looking for some serious testers.
Follow the birdy on www.python-excel.org for group joining...
Chris
participants (4)
-
Chris Withers
-
denis
-
jholg@gmx.de
-
Terry Brown