best way to read a huge ascii file.
Rolando Espinoza
darkrho at gmail.com
Wed Nov 30 17:17:21 EST 2016
Hi,
Yes, working with binary formats is the way to go when you have large data.
But for future reference, Dask [1] fits your use case perfectly; see below
how I process a 7 GB text file in under 17 seconds (on a laptop: MBP,
quad-core, SSD).
# Create roughly 7 GB worth of text data.
In [40]: import numpy as np
In [41]: x = np.random.random((60, 5000000))
In [42]: %time np.savetxt('data.txt', x)
CPU times: user 4min 28s, sys: 14.8 s, total: 4min 43s
Wall time: 5min
In [43]: %time y = np.loadtxt('data.txt')
CPU times: user 6min 31s, sys: 1min, total: 7min 31s
Wall time: 7min 44s
# Then we use dask to read the big file. The key here is to use a block
# size so that we process the file in ~120 MB chunks (approx. one line).
# By default dask uses the line separator \n to ensure the partitions
# don't break lines.
In [1]: import dask.bag
In [2]: data = dask.bag.read_text('data.txt', blocksize=120*1024*1024)
In [3]: data
dask.bag<bag-fro..., npartitions=60>
# Rather than passing the entire 100+ MB line to np.loadtxt, we slice the
# first 128 bytes, which is enough to grab the first 4 columns.
# You could speed this up further by not reading the entire line and
# instead reading just 128 bytes from each line offset.
In [4]: from io import StringIO
In [5]: def to_array(line):
   ...:     return np.loadtxt(StringIO(line[:128]))[:4]
   ...:
In [6]: %time y = np.asarray(data.map(to_array).compute())
CPU times: user 190 ms, sys: 60.8 ms, total: 251 ms
Wall time: 16.9 s
In [7]: y.shape
(60, 4)
In [8]: y[:2, :]
array([[ 0.17329305,  0.36584998,  0.01356046,  0.6814617 ],
       [ 0.3352684 ,  0.83274823,  0.24399607,  0.30103352]])
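
The idea in the last comment above (reading just 128 bytes at each line
offset) could be sketched roughly as follows. This is not code from the
session: the function name read_line_heads and its parameters are made up
for illustration, and it uses only the standard library (mmap) rather than
dask. It still has to scan the file to find the newlines, but it decodes
and parses only the head of each line. It assumes the first n_cols values
fit entirely within head_bytes; if they don't, a truncated number would be
parsed incorrectly or raise an error.

```python
import mmap


def read_line_heads(path, n_cols=4, head_bytes=128):
    """Parse the first n_cols numbers of each line, decoding only the
    first head_bytes bytes of that line (hypothetical helper)."""
    rows = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            pos = 0
            while pos < size:
                # Find the end of the current line without decoding it.
                end = mm.find(b"\n", pos)
                if end == -1:
                    end = size  # last line has no trailing newline
                # Only this small slice is decoded and parsed.
                head = mm[pos:min(pos + head_bytes, end)]
                fields = head.decode("ascii").split()[:n_cols]
                if fields:  # skip blank lines
                    rows.append([float(v) for v in fields])
                pos = end + 1
    return rows
```

Calling np.asarray(read_line_heads('data.txt')) should give an array with
the same shape as y above, one row per line of the file.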
You can also use dask to convert the entire file to HDF5.
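
As a dependency-light illustration of that kind of one-time conversion,
here is a sketch that deliberately swaps in plain NumPy and the .npy format
in place of dask and HDF5. The function name text_to_npy is hypothetical,
and it assumes you already know the matrix shape (60 x 5000000 for the
file above). It parses one line at a time, so only a single row is held in
memory while the output is written.

```python
import numpy as np


def text_to_npy(txt_path, npy_path, n_rows, n_cols):
    """Convert a whitespace-delimited text matrix to .npy, one line at a
    time (hypothetical helper, not part of dask or NumPy)."""
    # open_memmap creates the .npy file on disk and returns a writable
    # memory-mapped array backed by that file.
    out = np.lib.format.open_memmap(
        npy_path, mode="w+", dtype=np.float64, shape=(n_rows, n_cols)
    )
    with open(txt_path) as f:
        row = 0
        for line in f:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            out[row] = np.array(fields, dtype=np.float64)
            row += 1
    out.flush()  # make sure everything is written to disk
    del out
```

Afterwards, np.load('data.npy', mmap_mode='r') opens the result without
reading it all into memory, so slicing the first 4 columns only touches
the parts of the file it needs.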
Regards,
Rolando

[1] http://dask.pydata.org/
On Wed, Nov 30, 2016 at 1:16 PM, Heli <hemla21 at gmail.com> wrote:
> Hi all,
>
> Writing my ASCII file once to either of pickle or npy or hdf data types
> and then working afterwards on the result binary file reduced the read time
> from 80(min) to 2 seconds.
>
> Thanks everyone for your help.
> --
> https://mail.python.org/mailman/listinfo/python-list
>