[Tutor] multiprocessing question

Thu Nov 27 21:40:01 CET 2014

> From: eryksun <eryksun at gmail.com>
>To: Python Mailing List <tutor at python.org> 
>Sent: Tuesday, November 25, 2014 6:41 AM
>Subject: Re: [Tutor] multiprocessing question
> 
>
>On Sun, Nov 23, 2014 at 7:20 PM, Cameron Simpson <cs at zip.com.au> wrote:
>>
>> A remark about the create_lookup() function on pastebin: you go:
>>
>>  record_start += len(line)
>>
>> This presumes that a single text character on a line consumes a single byte
>> of memory or file disc space. However, your data file is utf-8 encoded, and
>> some characters may be more than one byte of storage. This means that your
>> record_start values will not be useful because they are character counts,
>> not byte counts, and you need byte counts to offset into a file if you are
>> doing random access.
>
>mmap.readline returns a byte string, so len(line) is a byte count.
>That said, CsvIter._get_row_lookup shouldn't use the mmap
>object. Limit its use to __getitem__.

Ok, thanks, I will modify the code.
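
To make the distinction between character counts and byte counts concrete, here is a minimal sketch. The helper name char_and_byte_counts is made up for illustration and is not part of the pastebin code.

```python
def char_and_byte_counts(text, encoding='utf-8'):
    """Return (number of characters, number of encoded bytes) for text.

    Hypothetical helper, only meant to illustrate the point above.
    """
    return len(text), len(text.encode(encoding))


# u'\xe9' (e with acute accent) is one character but two bytes in UTF-8,
# so the two counts differ for this line.
chars, nbytes = char_and_byte_counts(u'caf\xe9,1\n')
print('characters: %d, bytes: %d' % (chars, nbytes))
```

Because mmap.readline (or a file opened in binary mode) yields byte strings, len(line) there corresponds to the second number, which is the one needed for offsets.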

>In CsvIter.__getitem__, I don't see the need to wrap the line in a
>filelike object. It's clearly documented that csv.reader takes an
>iterable object, such as a list. For example:
>
>    # 2.x csv lacks unicode support
>    line = self.data[start:end].strip()
>    row = next(csv.reader([line]))
>    return [cell.decode('utf-8') for cell in row]
>
>    # 3.x csv requires unicode
>    line = self.data[start:end].strip()
>    row = next(csv.reader([line.decode('utf-8')]))
>    return row

Nice, thank you! I indeed wanted to write the code for use in Python 2.7 and 3.3+.
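
Combining the two variants above into one function could look roughly like this. It is only a sketch: parse_row is a hypothetical name, and it assumes the raw line is a byte string in an ASCII-compatible encoding.

```python
import csv
import sys


def parse_row(raw_line, encoding='utf-8'):
    """Parse one raw CSV line (a byte string) into a list of unicode cells.

    Hypothetical helper; branches on the major Python version because
    the 2.x csv module works on bytes and the 3.x one on unicode.
    """
    line = raw_line.strip()
    if sys.version_info[0] == 2:
        # 2.x: parse the bytes first, then decode every cell
        row = next(csv.reader([line]))
        return [cell.decode(encoding) for cell in row]
    # 3.x: decode first, then parse the text
    return next(csv.reader([line.decode(encoding)]))


print(parse_row(b'a,"b,c",\xc3\xa9\r\n'))
```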

>CsvIter._get_row_lookup should work on a regular file from built-in
>open (not codecs.open), opened in binary mode. I/O on a regular file
>will release the GIL back to the main thread. mmap objects don't do
>this.

Will io.open also work? Until today I thought that Python 3's open was the equivalent of codecs.open in Python 2 (probably because Python 3 is all about unicode strings, and the Python 3 open has an encoding argument).
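
For reference, a small sketch of what I have in mind with io.open. As far as I know, io.open is available from Python 2.6 on and is the same function as the built-in open in Python 3; in binary mode it returns byte strings and takes no encoding argument. The name first_line_bytes is made up for this example.

```python
import io


def first_line_bytes(path):
    """Return the first line of the file as an undecoded byte string.

    Hypothetical helper; 'rb' means no decoding and no newline translation.
    """
    with io.open(path, 'rb') as f:
        return f.readline()
```

The length of the returned value is then a byte count, the same as with mmap.readline.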

>
>Binary mode ensures the offsets are valid for use with
>the mmap object in __getitem__. This requires an ASCII compatible
>encoding such as UTF-8.

What exactly do you mean by "ASCII compatible"? Does it mean 'superset of ASCII', such as UTF-8, windows-1252, or latin-1? Hmmm, but Asian encodings like cp874 and Shift-JIS are Thai/Japanese on top of ASCII, so this makes me doubt. In my code I am using ICU to guess the encoding; I simply put 'utf-8' in the sample code for brevity.
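
To test my own understanding, here is a rough sketch that checks whether splitting the encoded bytes on b'\n' gives the same pieces as splitting the text on newlines. This is my interpretation of what matters for the offsets, not necessarily the definition meant above; newline_offsets_are_safe is a made-up name.

```python
def newline_offsets_are_safe(text, encoding):
    """True if splitting the encoded bytes on b'\\n' matches splitting the text.

    Hypothetical check: if this fails, byte offsets found by looking for
    b'\\n' do not line up with the real line boundaries.
    """
    raw = text.encode(encoding)
    split_as_bytes = raw.split(b'\n')
    split_as_text = [part.encode(encoding) for part in text.split(u'\n')]
    return split_as_bytes == split_as_text


sample = u'caf\xe9,1\nna\xefve,2'
for name in ('utf-8', 'latin-1', 'cp1252', 'utf-16-le'):
    print('%s: %s' % (name, newline_offsets_are_safe(sample, name)))

# Shift-JIS keeps the newline byte unambiguous, but the second byte of a
# two-byte character can be 0x5C (the ASCII backslash), e.g. katakana "so".
trail = u'\u30bd'.encode('shift_jis')
print(b'\\' in trail)
```

UTF-16 fails the check because every newline is two bytes there and the byte 0x0A can also occur inside other characters. Shift-JIS passes the newline check, but the reused 0x5C byte shows why a byte-oriented parser can still be confused by it.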

>
>Also, iterate in a for loop instead of calling readline in a while loop.
>2.x file.next uses a read-ahead buffer to improve performance.
>To see this, check tell() in a for loop.

Wow, great tip. I just modified some sample code that I will post shortly.
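
In the meantime, a minimal sketch of the direction I am taking: build the lookup with a for loop over a regular file in binary mode, and use the mmap object only to slice out a row. The function names build_row_lookup and read_row_bytes are placeholders, not the names from the pastebin code, and the offsets are tracked by hand rather than with tell().

```python
import io
import mmap


def build_row_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    start = 0
    with io.open(path, 'rb') as f:
        for line in f:                 # iterate instead of calling readline
            end = start + len(line)    # len() of a byte string is a byte count
            lookup.append((start, end))
            start = end
    return lookup


def read_row_bytes(path, lookup, index):
    """Slice one raw line out of the file via a read-only mmap."""
    start, end = lookup[index]
    with io.open(path, 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return data[start:end]
        finally:
            data.close()
```

The raw bytes returned by read_row_bytes would still need to be decoded and passed through csv.reader, as in the __getitem__ examples quoted earlier.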
