Out of memory while reading excel file
Peter Otten
__peter__ at web.de
Thu May 11 03:53:11 EDT 2017
Mahmood Naderan via Python-list wrote:
> I wrote this:
>
> a = np.zeros((p.max_row, p.max_column), dtype=object)
> for y, row in enumerate(p.rows):
> for cell in row:
> print (cell.value)
> a[y] = cell.value
In the line above you overwrite the row in the numpy array with the cell
value. In combination with numpy's "broadcasting" you end up with all values
in a row set to the rightmost cell in the spreadsheet row, just like in
>>> import numpy
>>> a = numpy.array([[0, 0, 0]])
>>> a
array([[0, 0, 0]])
>>> for x in 1, 2, 3:
... a[0] = x
...
>>> a
array([[3, 3, 3]])
The correct code:
for y, row in enumerate(ws.rows):
a[y] = [cell.value for cell in row]
I think I posted it before ;)
> print (a[y])
>
>
> For one of the cells, I see
>
> NM_198576.3
> ['NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3']
>
>
> These are 50 NM_198576.3 in a[y] and 50 is the number of columns in my
> excel file (p.max_column)
>
>
>
> The excel file looks like
>
> CHR1 11,202,100 NM_198576.3 PASS 3.08932 G|B|C -
> . . .
>
>
>
> Note that in each row, some cells are '-' or '.' only. I want to read all
> cells as string. Then I will write the matrix in a file and my main code
> (java) will process that. I chose openpyxl for reading excel files,
> because Apache POI (a java package for manipulating excel files) consumes
> huge memory even for medium files.
>
> So my python script only transforms an xlsx file to a txt file keeping the
> cell positions and formats.
What kind of text file?
> Any suggestion?
In that case there's no need to load the data into memory. For example, to
convert xlsx to csv:
#!/usr/bin/env python3
from openpyxl import load_workbook
import csv
source = "beta.xlsx"
dest = "gamma.csv"
sheet = 'alpha'
wb = load_workbook(filename=source, read_only=True)
ws = wb[sheet]
with open(dest, "w") as outstream:
csv.writer(outstream).writerows(
[cell.value for cell in row]
for row in ws.rows
)
More information about the Python-list
mailing list