how to quickly process one million strings to remove quotes
Nick Mellor
thebalancepro at gmail.com
Wed Aug 2 23:16:02 EDT 2017
On Thursday, 3 August 2017 01:05:57 UTC+10, Daiyue Weng wrote:
> Hi, I am trying to remove extra quotes from a large set of strings (a
> list of strings). Each original string looks like,
>
> """str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
>
>
> I'd like to remove the start and end quotes and the extra pairs of
> quotes on each string value, so the result will look like,
>
> "str_value1","str_value2","str_value3",1,"str_value4"
>
>
> and then join the strings with newlines.
>
> I have tried the following code,
>
> for line in str_lines[1:]:
>     strip_start_end_quotes = line[1:-1]
>     splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
>     str_lines[str_lines.index(line)] = splited_line_rem_quotes
>
> for_pandas_new_headers_str = '\n'.join(splited_lines)
>
> but it is really slow (it runs for ages) when the list contains over 1
> million lines. I am looking for a faster way to do this.
>
> I also tried to multiprocess this task:
>
> def preprocess_data_str_line(data_str_lines):
>     """
>     :param data_str_lines:
>     :return:
>     """
>     for line in data_str_lines:
>         strip_start_end_quotes = line[1:-1]
>         splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
>         data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes
>
>     return data_str_lines
>
>
> def multi_process_prepcocess_data_str(data_str_lines):
>     """
>     :param data_str_lines:
>     :return:
>     """
>     # if cpu load < 25% and 4GB of ram free use 3 cores
>     # if cpu load < 70% and 4GB of ram free use 2 cores
>     cores_to_use = how_many_core()
>
>     data_str_blocks = slice_list(data_str_lines, cores_to_use)
>
>     for block in data_str_blocks:
>         # spawn a process for each data string block assigned to a cpu core
>         p = multiprocessing.Process(target=preprocess_data_str_line,
>                                     args=(block,))
>         p.start()
> but I don't know how to gather the results back into one list so that
> I can join the strings with newlines.
>
> So, ideally, I am thinking about using multiprocessing + a fast function
> to preprocess each line to speed up the whole process.
>
> cheers
Hi Daiyue,
My first thought is to use split/join to solve this problem, but you would need to decide what to do with the non-strings in your 1,000,000-element list. You would also need to be sure that the pipe character | appears nowhere in your strings.
split_on_dbl_dbl_quote = '|'.join(original_list).split('""')
remove_dbl_dbl_quotes_and_outer_quotes = ''.join(split_on_dbl_dbl_quote[::2]).split('|')
You need to be sure of your data: [::2] (which keeps only the even-indexed elements) relies on every double-double-quote both opening and closing within the same string.
This runs in under a second for a million strings, but it touches *all* elements, not just the strings. Note too that str.join raises a TypeError on non-string elements, so any non-strings must be converted first, and they come back as strings after the second statement.
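If your list really does mix in ints, a map(str, ...) pass up front keeps join happy — a sketch, not tested on your data:

split_on_dbl_dbl_quote = '|'.join(map(str, original_list)).split('""')  # 1 becomes '1'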
As to multiprocessing: I would look at a well-optimised single-threaded solution like split/join before considering MP. If you can fit the problem to a split/join, it'll be much simpler and more "pythonic".
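For reference, your own replace() logic is already quick once the quadratic list.index() lookup goes away — a minimal single-threaded sketch, assuming str_lines as in your post (header row skipped, as before):

cleaned = [line[1:-1].replace('""', '"') for line in str_lines[1:]]
for_pandas_new_headers_str = '\n'.join(cleaned)

And if you do end up wanting MP, multiprocessing.Pool handles the scatter/gather step your version is missing — again just a sketch:

import multiprocessing

def clean(line):
    # same per-line logic: strip the outer quotes, collapse "" to "
    return line[1:-1].replace('""', '"')

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        cleaned = pool.map(clean, str_lines[1:], chunksize=10000)
    for_pandas_new_headers_str = '\n'.join(cleaned)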
Cheers,
Nick