[Tutor] multiprocessing question
Albert-Jan Roskam
fomcl at yahoo.com
Thu Nov 27 21:40:01 CET 2014
>________________________________
> From: eryksun <eryksun at gmail.com>
>To: Python Mailing List <tutor at python.org>
>Sent: Tuesday, November 25, 2014 6:41 AM
>Subject: Re: [Tutor] multiprocessing question
>
>
>On Sun, Nov 23, 2014 at 7:20 PM, Cameron Simpson <cs at zip.com.au> wrote:
>>
>> A remark about the create_lookup() function on pastebin: you go:
>>
>> record_start += len(line)
>>
>> This presumes that a single text character on a line consumes a single byte
>> of memory or file disc space. However, your data file is utf-8 encoded, and
>> some characters may be more than one byte of storage. This means that your
>> record_start values will not be useful because they are character counts,
>> not byte counts, and you need byte counts to offset into a file if you are
>> doing random access.
>
>mmap.readline returns a byte string, so len(line) is a byte count.
>That said, CsvIter._get_row_lookup shouldn't use the mmap
>object. Limit its use to __getitem__.
Ok, thanks, I will modify the code.
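
A minimal sketch of the distinction discussed above, character counts versus byte counts. The sample string is made up for illustration and is not taken from the real data file:

```python
# Hypothetical sample record; any non-ASCII text shows the same effect.
text = u"caf\u00e9,na\u00efve\n"
raw = text.encode("utf-8")

char_count = len(text)  # number of code points; not usable as a file offset
byte_count = len(raw)   # number of bytes; this is what seek() and mmap slicing need

print(char_count, byte_count)
```

Since mmap.readline returns a byte string, len(line) corresponds to byte_count here, not char_count.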
>In CsvIter.__getitem__, I don't see the need to wrap the line in a
>filelike object. It's clearly documented that csv.reader takes an
>iterable object, such as a list. For example:
>
> # 2.x csv lacks unicode support
> line = self.data[start:end].strip()
> row = next(csv.reader([line]))
> return [cell.decode('utf-8') for cell in row]
>
> # 3.x csv requires unicode
> line = self.data[start:end].strip()
> row = next(csv.reader([line.decode('utf-8')]))
> return row
Nice, thank you! I indeed wanted to write the code for use in Python 2.7 and 3.3+.
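
The two variants above could be folded into one helper that runs on both 2.7 and 3.3+. This is only a sketch: the name parse_row and the PY2 flag are assumptions for illustration, not part of the pastebin code.

```python
import csv
import sys

PY2 = sys.version_info[0] == 2  # hypothetical flag, not from the original code


def parse_row(raw_line, encoding="utf-8"):
    """Parse one CSV record given as bytes; return a list of unicode strings."""
    raw_line = raw_line.strip()
    if PY2:
        # 2.x csv works on byte strings, so decode each cell afterwards
        row = next(csv.reader([raw_line]))
        return [cell.decode(encoding) for cell in row]
    # 3.x csv requires unicode, so decode the whole line before parsing
    return next(csv.reader([raw_line.decode(encoding)]))


print(parse_row(b'a,"b,c",caf\xc3\xa9\r\n'))
```

In __getitem__ the argument would be self.data[start:end], so no filelike wrapper is needed.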
>CsvIter._get_row_lookup should work on a regular file from built-in
>open (not codecs.open), opened in binary mode. I/O on a regular file
>will release the GIL back to the main thread. mmap objects don't do
>this.
Will io.open also work? Until today I thought that Python 3's open was the equivalent of Python 2's codecs.open (probably because Python 3 is all about unicode strings, and the Python 3 open has an encoding argument).
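
A quick way to check this: the Python 3 documentation describes io.open as an alias for the built-in open, and the io module is also available in 2.6+. In binary mode no decoding takes place, so iterating should yield byte strings. The temporary file below is a throwaway for this test only.

```python
import io
import os
import tempfile

# Hypothetical scratch file, only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

with io.open(path, "wb") as f:
    f.write(u"caf\u00e9,1\n".encode("utf-8"))

with io.open(path, "rb") as f:
    lines = list(f)  # binary mode: no decoding, each line is a byte string

os.remove(path)
print(lines)
```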
>
>Binary mode ensures the offsets are valid for use with
>the mmap object in __getitem__. This requires an ASCII compatible
>encoding such as UTF-8.
What exactly do you mean by "ASCII compatible"? Does it mean 'superset of ASCII', such as utf-8, windows-1252, latin-1? Hmmm, but Asian encodings like cp874 and Shift-JIS are Thai/Japanese on top of ASCII, so this makes me doubt. In my code I am using icu to guess the encoding; I simply put 'utf-8' in the sample code for brevity.
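
To make the question concrete, here is a rough check under one possible reading of "ASCII compatible": the bytes that matter for splitting records (newline, carriage return, comma, double quote) encode to their single ASCII byte. The function name is made up, and the check is necessary rather than sufficient, because it does not prove that a multi-byte sequence in the encoding can never contain one of those bytes.

```python
def delimiters_are_ascii(encoding, delimiters=u"\n\r,\""):
    """Return True if every delimiter encodes to its single ASCII byte."""
    for ch in delimiters:
        try:
            encoded = ch.encode(encoding)
        except (LookupError, UnicodeError):
            return False  # unknown codec or unencodable character
        if encoded != ch.encode("ascii"):
            return False
    return True


for name in ("utf-8", "latin-1", "cp1252", "cp874", "shift_jis",
             "utf-16", "utf-32"):
    print(name, delimiters_are_ascii(name))
```

Encodings such as utf-16 fail this check, because a newline there takes more than one byte, so byte offsets found by scanning for b"\n" would not line up with record boundaries.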
>
>Also, iterate in a for loop instead of calling readline in a while loop.
>2.x file.next uses a read-ahead buffer to improve performance.
>To see this, check tell() in a for loop.
Wow, great tip. I just modified some sample code that I will post shortly.
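
A minimal sketch of how the advice above fits together: build the offset lookup from a regular file opened in binary mode, iterating in a for loop, and use the mmap object only for the random access. The function name build_row_lookup and the sample data are assumptions for illustration; this is not the actual CsvIter code.

```python
import io
import mmap
import os
import tempfile


def build_row_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    start = 0
    with io.open(path, "rb") as f:
        for line in f:  # for loop instead of readline() in a while loop
            end = start + len(line)  # len() of a byte string is a byte count
            lookup.append((start, end))
            start = end
    return lookup


# Hypothetical sample file with non-ASCII content.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with io.open(path, "wb") as f:
    f.write(u"id,name\n1,caf\u00e9\n2,na\u00efve\n".encode("utf-8"))

lookup = build_row_lookup(path)

# Random access through mmap, using the byte offsets collected above.
with io.open(path, "rb") as f:
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        start, end = lookup[1]
        second_row = data[start:end].decode("utf-8")
    finally:
        data.close()

os.remove(path)
print(lookup, repr(second_row))
```

The offsets are accumulated from len(line) rather than read from tell(), so the read-ahead buffer in the for loop does not affect them.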