[Tutor] multiprocessing question

Cameron Simpson cs at zip.com.au
Fri Nov 28 09:15:30 CET 2014


On 27Nov2014 20:40, Albert-Jan Roskam <fomcl at yahoo.com.dmarc.invalid> wrote:
>> From: eryksun <eryksun at gmail.com>
>>Binary mode ensures the offsets are valid for use with
>>the mmap object in __getitem__. This requires an ASCII compatible
>>encoding such as UTF-8.
>
>What exactly do you mean by "ASCII compatible"? Does it mean 'superset of
>ASCII', such as utf-8, windows-1252, latin-1? Hmmm, but Asian encodings like
>cp874 and Shift-JIS are Thai/Japanese on top of ASCII, so this makes me doubt.
>In my code I am using icu to guess the encoding; I simply put 'utf-8' in the
>sample code for brevity.

He probably means "an encoding with just one byte per character". ASCII is one 
such encoding, and so are windows-1252 and latin-1. The point is purely that 
you can compute the start of a line in memory from a number of characters, 
because characters and bytes map 1 to 1. But if you have scanned the file, 
decoding as you go and noting the _byte_ offset before reading each line, then 
you don't need that property. Just seek to the offset, then read and decode.
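
A minimal sketch of that scan-then-seek approach, assuming Python 3 and a file 
opened in binary mode. The function names index_line_offsets and read_line_at 
are made up for illustration, and 'utf-8' stands in for whatever encoding you 
have detected:

```python
def index_line_offsets(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()          # byte offset before reading the line
            line = f.readline()
            if not line:            # empty bytes object means end of file
                break
            offsets.append(pos)
    return offsets


def read_line_at(path, offset, encoding="utf-8"):
    """Seek to a recorded byte offset, read one line and decode it."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode(encoding)
```

Because the offsets are recorded in bytes and decoding happens only after the 
seek, this does not depend on the encoding being one byte per character.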

>>Also, iterate in a for loop instead of calling readline in a while loop.
>>2.x file.__next__ uses a read-ahead buffer to improve performance.
>>To see this, check tell() in a for loop.
>
>Wow, great tip. I just modified some sample code that I post shortly.

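Note that the read-ahead buffering might break the use of tell() to record the 
offset before reading each line, because tell() may then report how far the 
buffer has been filled rather than where the next line starts.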
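
One way to keep the for loop and still get offsets is to not call tell() at 
all, and instead add up the byte length of each line as you go. This is only a 
sketch under the assumption that the file is opened in binary mode, so that 
len(line) counts bytes; the function name is invented for the example:

```python
def index_line_offsets_by_length(path):
    """Return the byte offset of each line start without calling tell()."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:              # iteration may read ahead; we never ask tell()
            offsets.append(pos)
            pos += len(line)        # bytes in this line, including the newline
    return offsets
```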

Cheers,
Cameron Simpson <cs at zip.com.au>

The first ninety percent of the task takes ninety percent of the time, and
the last ten percent takes the other ninety percent.

