[lxml-dev] read .xlsx spreadsheets with lxml ?

Folks, has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ? A simple API along the lines of csv would be nice: doc = openxmllib.openXmlDocument( path= "...xlsx" ) for row in doc: for col in row: # num / string (Background: Mac Openoffice chokes on an xlsx with > 65536 rows, grr.) Thanks, cheers -- denis

On Tue, 16 Nov 2010 12:03:08 +0000 (UTC) denis <denis-bz-py@t-online.de> wrote:
No. First I'd try http://pypi.python.org/pypi/xlrd - I haven't used it for .xlsx, but it works well for .xls and I think it also supports .xlsx. Small code example below. Cheers -Terry import xlrd from collections import defaultdict filename = "MasterDatabase.xls" book = xlrd.open_workbook(filename) cnt = defaultdict(lambda: 0) for sheet in book.sheets(): print("{0.name:>20s} {0.nrows}".format(sheet)) sheet0 = book.sheet_by_index(0) for row in range(sheet0.nrows): cnt[sheet0.cell(row,0).value] += 1

Hi,
Well, the lxml APIs are simple enough for handling the XML *inside* the .xlsx zip archive. Don't know how complicated the structure of the file itself can get. Here's the lxml.objectify notion: $ unzip Foo.xlsx $ python -i -c 'from lxml import etree, objectify'
Holger -- Neu: GMX De-Mail - Einfach wie E-Mail, sicher wie ein Brief! Jetzt De-Mail-Adresse reservieren: http://portal.gmx.net/de/go/demail

Thanks Holger, Thanks Terry, I was really looking for someone who's *used* lxml (or ...) on big Microsoft xlsx spreadsheets. I gather from http://en.wikipedia.org/wiki/Office_Open_XML that the format is messy -- Part 1 (Fundamentals and Markup Language Reference) This part has 5560 pages ?! Bytheway xlrd 0.7.1 -> File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 429, in open_workbook biff_version = bk.getbof(XL_WORKBOOK_GLOBALS) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 1545, in getbof bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8]) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 1539, in bof_error raise XLRDError('Unsupported format, or corrupt file: ' + msg) xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found 'PK\x03\x04\x14\x00\x06\x00' cheers -- denis

On 16/11/2010 16:59, denis wrote:
John Machin over on the python-excel group has done just this. He has some alpha code that I know he'd like to see merged into the xlrd code base but he's looking for some serious testers. Follow the birdy on www.python-excel.org for group joining... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

On Tue, 16 Nov 2010 12:03:08 +0000 (UTC) denis <denis-bz-py@t-online.de> wrote:
No. First I'd try http://pypi.python.org/pypi/xlrd - I haven't used it for .xlsx, but it works well for .xls and I think it also supports .xlsx. Small code example below. Cheers -Terry import xlrd from collections import defaultdict filename = "MasterDatabase.xls" book = xlrd.open_workbook(filename) cnt = defaultdict(lambda: 0) for sheet in book.sheets(): print("{0.name:>20s} {0.nrows}".format(sheet)) sheet0 = book.sheet_by_index(0) for row in range(sheet0.nrows): cnt[sheet0.cell(row,0).value] += 1

Hi,
Well, the lxml APIs are simple enough for handling the XML *inside* the .xlsx zip archive. Don't know how complicated the structure of the file itself can get. Here's the lxml.objectify notion: $ unzip Foo.xlsx $ python -i -c 'from lxml import etree, objectify'
Holger -- Neu: GMX De-Mail - Einfach wie E-Mail, sicher wie ein Brief! Jetzt De-Mail-Adresse reservieren: http://portal.gmx.net/de/go/demail

Thanks Holger, Thanks Terry, I was really looking for someone who's *used* lxml (or ...) on big Microsoft xlsx spreadsheets. I gather from http://en.wikipedia.org/wiki/Office_Open_XML that the format is messy -- Part 1 (Fundamentals and Markup Language Reference) This part has 5560 pages ?! Bytheway xlrd 0.7.1 -> File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 429, in open_workbook biff_version = bk.getbof(XL_WORKBOOK_GLOBALS) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 1545, in getbof bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8]) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/xlrd/__init__.py", line 1539, in bof_error raise XLRDError('Unsupported format, or corrupt file: ' + msg) xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found 'PK\x03\x04\x14\x00\x06\x00' cheers -- denis

On 16/11/2010 16:59, denis wrote:
John Machin over on the python-excel group has done just this. He has some alpha code that I know he'd like to see merged into the xlrd code base but he's looking for some serious testers. Follow the birdy on www.python-excel.org for group joining... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk
participants (4)
-
Chris Withers
-
denis
-
jholg@gmx.de
-
Terry Brown