How to read from a file to an arbitrary delimiter efficiently?
bc at freeuk.com
Sat Feb 27 15:03:42 EST 2016
On 27/02/2016 16:35, BartC wrote:
> On 25/02/2016 06:50, Steven D'Aprano wrote:
>> I have a need to read to an arbitrary delimiter, which might be any of a
> (small) set of characters. For the sake of the exercise, let's say it is
>> either ! or ? (for example).
> However those aren't the main reasons for the poor speed. The limiting
> factor here is reading one byte at a time. Just a loop like this:
>     while f.read(1): pass
> without doing anything else, seems to take most of the time. (3.6
> seconds, compared with 5.6 seconds of your readchunks() on a 6MB version
> of your test file, on Python 2.7. readlines() took about 0.2 seconds.)
> Any faster solutions would need to read more than one byte at a time.
I've done some more tests using Python 3.4, with the same 200,000 line
6MB test file:
  0.25 seconds   Scan the file with 'for line in f'
  2.25 seconds   Scan the file with your readlines() routine
  4.0  seconds   Scan the file with your readchunks() routine
  0.65 seconds   Scan the file using a buffer
This last test uses a 64-byte buffer, reading at most 63 bytes beyond
each delimiter, but resetting the file position to just past the end of
each identified chunk so that any subsequent read works as expected.
This test (the code is too untidy to post) only checks for two specific
delimiters (not an arbitrary string full of them). (It also counts EOF
as a valid delimiter, so it counts one more chunk.)
Increasing the buffer size doesn't help, and going beyond 256 bytes slows
things down (for this input), as it spends too long rereading data.
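
For anyone who wants to experiment, here is a minimal sketch of the
buffered approach described above. It is not the code that was timed;
the function name read_to_delimiter and its parameters are made up for
illustration, and it assumes Python 3 and a seekable file opened in
binary mode. It reads a small buffer at a time, looks for the earliest
delimiter, and seeks back to just past it.

```python
import io

def read_to_delimiter(f, delimiters=b"!?", bufsize=64):
    """Return the next chunk from binary file f, up to and including
    the first delimiter byte. Hypothetical helper, for illustration.

    At EOF the remaining bytes (possibly b"") are returned as-is.
    """
    pieces = []
    while True:
        start = f.tell()          # remember where this buffer began
        buf = f.read(bufsize)
        if not buf:               # EOF: return whatever was collected
            break
        # Position of each delimiter in the buffer, ignoring misses.
        hits = [i for i in (buf.find(bytes([d])) for d in delimiters)
                if i != -1]
        if hits:
            end = min(hits) + 1   # keep the delimiter in the chunk
            pieces.append(buf[:end])
            f.seek(start + end)   # undo the over-read
            break
        pieces.append(buf)        # no delimiter yet, keep reading
    return b"".join(pieces)

# Example usage with an in-memory file:
f = io.BytesIO(b"hello!world?tail")
print(read_to_delimiter(f))   # b'hello!'
print(read_to_delimiter(f))   # b'world?'
print(read_to_delimiter(f))   # b'tail'
```

Because the file position is reset after every chunk, a larger bufsize
means more bytes are read twice when chunks are short, which would be
consistent with the slowdown seen beyond 256 bytes.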