[Tutor] multiprocessing question
Dave Angel
davea at davea.name
Fri Nov 28 10:59:05 CET 2014
On 11/27/2014 05:55 PM, Dave Angel wrote:
> On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
>>
>>
>     for line in self.data:
>         if not line:
>             break
>         local_lookup.append(record_start)
>         if len(local_lookup) > 100:
>             self.lookup.extend(local_lookup)
>             local_lookup = []
>         record_start += len(line)
>     print(len(local_lookup))
>
> I still have to emphasize that record_start is just wrong. You must use
> ftell() if you're planning to use fseek() on a text file.
>
> You can also probably speed the process up a good deal by passing the
> filename to the other process, rather than opening the file in the
> original process. That will eliminate sharing the self.data across the
> process boundary.
>
To emphasize again, in version 3:
https://docs.python.org/3.4/tutorial/inputoutput.html#methods-of-file-objects
"""In text files (those opened without a b in the mode string), only
seeks relative to the beginning of the file are allowed (the exception
being seeking to the very file end with seek(0, 2)) and the only valid
offset values are those returned from the f.tell(), or zero. Any other
offset value produces undefined behaviour."""
All the discussion about byte-compatible, ASCII equivalent, etc. is
beside the point. (Although I'm surprised nobody has pointed out that
in Windows, a newline is two bytes long even if the file is entirely
ASCII.) If you want to seek() later, then use tell() now. In a binary
open, there may be other ways, but in a text file...
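Here is a small sketch of the newline problem, in case it helps. The
helper names (write_crlf_sample, summed_lengths) are made up for
illustration, and I force Windows-style line endings with newline="\r\n"
so it behaves the same on any platform. It only shows that summing
len(line) in text mode does not match the bytes actually on disk:

```python
import os
import tempfile


def write_crlf_sample(path):
    """Write three short ASCII lines using Windows-style "\r\n" endings."""
    # newline="\r\n" makes Python translate each "\n" to "\r\n" on write.
    with open(path, "w", encoding="ascii", newline="\r\n") as f:
        f.write("a\nb\nc\n")


def summed_lengths(path):
    """Add up len(line) in text mode, the way the quoted loop does."""
    total = 0
    with open(path, "r", encoding="ascii") as f:
        for line in f:
            # In text mode "\r\n" comes back as a single "\n".
            total += len(line)
    return total


if __name__ == "__main__":
    fd, sample = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_crlf_sample(sample)
        print("sum of len(line):", summed_lengths(sample))
        print("bytes on disk:   ", os.path.getsize(sample))
    finally:
        os.remove(sample)
```

The two numbers differ by one per line, so every offset after the first
line would be off if you computed it by summing.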
Perhaps the reason you're resisting it is you're assuming that tell() is
slow. It's not. It's probably faster than trying to sum the bytes the
way you're doing.
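A minimal sketch of what I mean, not your actual class: build_lookup and
read_record are hypothetical names, and I'm assuming a UTF-8 text file.
Note that in Python 3 you can't call tell() inside "for line in f"
(it raises OSError), so the sketch uses readline() instead:

```python
def build_lookup(path, encoding="utf-8"):
    """Return one tell() value per line, taken just before the line is read."""
    offsets = []
    with open(path, "r", encoding=encoding) as f:
        while True:
            pos = f.tell()        # opaque position, only meaningful to seek()
            line = f.readline()   # readline() keeps tell() usable
            if not line:
                break
            offsets.append(pos)
    return offsets


def read_record(path, offset, encoding="utf-8"):
    """Seek to an offset previously returned by tell() and read one line."""
    with open(path, "r", encoding=encoding) as f:
        f.seek(offset)
        return f.readline()
```

Treat the values in the lookup as opaque: hand them back to seek() on a
file opened the same way, and don't do arithmetic on them.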
--
DaveA