[Tutor] multiprocessing question

Dave Angel davea at davea.name
Fri Nov 28 10:59:05 CET 2014


On 11/27/2014 05:55 PM, Dave Angel wrote:
> On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
>>
>>
>          for line in self.data:
>              if not line:
>                  break
>              local_lookup.append(record_start)
>              if len(local_lookup) > 100:
>                  self.lookup.extend(local_lookup)
>                  local_lookup = []
>              record_start += len(line)
>          print(len(local_lookup))
>
> I still have to emphasize that record_start is just wrong.  You must use
> ftell() if you're planning to use fseek() on a text file.
>
> You can also probably speed the process up  a good deal by passing the
> filename to the other process, rather than opening the file in the
> original process.  That will eliminate sharing the self.data across the
> process boundary.
>
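
A minimal sketch of the filename suggestion quoted above, assuming the index is a plain list of tell() positions. The names index_file and lookup are hypothetical, not taken from the original class. Only the path string crosses the process boundary; the worker opens the file itself.

```python
import multiprocessing
import os
import tempfile


def index_file(path):
    """Runs in the worker: open the file here and collect tell() positions."""
    lookup = []
    with open(path, "r", encoding="utf-8") as f:
        while True:
            pos = f.tell()          # position of the line about to be read
            if not f.readline():    # empty string means end of file
                break
            lookup.append(pos)
    return lookup


if __name__ == "__main__":
    # Throwaway demo file; a real run would pass the actual data file's path.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("one\ntwo\nthree\n")
    try:
        with multiprocessing.Pool(processes=1) as pool:
            # Only the short path string is pickled and sent to the worker.
            lookup = pool.apply(index_file, (path,))
        print(len(lookup))
    finally:
        os.remove(path)
```

The finished list still has to be sent back to the parent, but that is one transfer at the end rather than sharing an open file object.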

To emphasize again, in Python 3:

https://docs.python.org/3.4/tutorial/inputoutput.html#methods-of-file-objects

"""In text files (those opened without a b in the mode string), only 
seeks relative to the beginning of the file are allowed (the exception 
being seeking to the very file end with seek(0, 2)) and the only valid 
offset values are those returned from the f.tell(), or zero. Any other 
offset value produces undefined behaviour."""

All the discussion about byte-compatible, ASCII equivalent, etc. is 
beside the point.  (Although I'm surprised nobody has pointed out that 
in Windows, a newline is two bytes long even if the file is entirely 
ASCII.)  If you want to seek() later, then use tell() now.  In a binary 
open, there may be other ways, but in a text file...
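
To make the Windows newline point concrete, here is a small sketch (the function names are made up for illustration) that writes a pure-ASCII file with \r\n line endings and compares offsets obtained by summing len(line) in text mode with the real byte offsets measured in binary mode.

```python
import os
import tempfile


def summed_offsets(path):
    """Offsets guessed by adding up len(line) in text mode (the fragile way)."""
    offsets, pos = [], 0
    with open(path, "r", encoding="ascii") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)  # counts characters after newline translation
    return offsets


def byte_offsets(path):
    """Byte offset of each line, measured in binary mode."""
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)  # counts raw bytes, including the \r
    return offsets


def make_crlf_file():
    """Write a small ASCII file with Windows line endings; return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write(b"ab\r\ncd\r\nef\r\n")
    return path


if __name__ == "__main__":
    path = make_crlf_file()
    try:
        print(summed_offsets(path))  # [0, 3, 6]
        print(byte_offsets(path))    # [0, 4, 8]
    finally:
        os.remove(path)
```

The summed values fall one further behind with every line, because text mode hands back "\n" where the file holds two bytes.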

Perhaps the reason you're resisting it is you're assuming that tell() is 
slow.  It's not.  It's probably faster than trying to sum the bytes the 
way you're doing.
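
A rough sketch of the "tell() now, seek() later" approach, assuming a UTF-8 text file with one record per line. build_lookup and read_record are hypothetical names, not the original code. Note that it uses readline() rather than "for line in f", because in Python 3 iterating over a text file disables tell().

```python
import os
import tempfile


def build_lookup(path, encoding="utf-8"):
    """Return one tell() value per line, taken just before the line is read."""
    lookup = []
    with open(path, "r", encoding=encoding) as f:
        while True:
            pos = f.tell()        # opaque value, only meaningful to seek()
            line = f.readline()
            if not line:          # empty string means end of file
                break
            lookup.append(pos)
    return lookup


def read_record(path, lookup, n, encoding="utf-8"):
    """Fetch line n by seeking to a position that tell() handed out earlier."""
    with open(path, "r", encoding=encoding) as f:
        f.seek(lookup[n])
        return f.readline()


if __name__ == "__main__":
    # Demo file with Windows newlines and a non-ASCII character.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write("first\r\nsecond \u00e9\r\nthird\r\n".encode("utf-8"))
    try:
        lookup = build_lookup(path)
        print(repr(read_record(path, lookup, 2)))
    finally:
        os.remove(path)
```

Since every offset passed to seek() came from tell() on the same kind of open, newline translation and multi-byte characters need no special handling.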

-- 
DaveA

