How to quickly process one million strings to remove quotes
Daiyue Weng
daiyueweng at gmail.com
Wed Aug 2 11:05:24 EDT 2017
Hi, I am trying to remove extra quotes from a large set of strings (a
list of strings). Each original string looks like:
"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
I'd like to remove the start and end quotes and the extra pairs of quotes
around each string value, so the result looks like:
"str_value1","str_value2","str_value3",1,"str_value4"
and then join the strings with newlines.
I have tried the following code:
for line in str_lines[1:]:
    strip_start_end_quotes = line[1:-1]
    splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
    str_lines[str_lines.index(line)] = splited_line_rem_quotes

for_pandas_new_headers_str = '\n'.join(str_lines)
but it is really slow (it runs for ages) when the list contains over 1
million lines. I am looking for a faster way to do this.
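One thing worth noting: `str_lines.index(line)` rescans the list from the start on every iteration, which makes the loop quadratic in the number of lines. A minimal sketch of a linear-time version using a list comprehension (the helper name `strip_extra_quotes` is mine, not from the original code):

```python
def strip_extra_quotes(lines):
    """Strip the outer quotes and collapse doubled quotes on each line."""
    # A list comprehension builds a new list in a single linear pass,
    # avoiding the O(n) list.index() lookup done per line in the loop above.
    return [line[1:-1].replace('""', '"') for line in lines]

lines = ['"""a"",""b"",1,""c"""', '"""x"",""y"",2,""z"""']
cleaned = strip_extra_quotes(lines)
# cleaned[0] == '"a","b",1,"c"'
joined = '\n'.join(cleaned)
```

For a million short lines this kind of single pass is typically fast enough on one core that multiprocessing may not even be needed.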
I also tried to multiprocess this task:
def preprocess_data_str_line(data_str_lines):
    """
    :param data_str_lines:
    :return:
    """
    for line in data_str_lines:
        strip_start_end_quotes = line[1:-1]
        splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
        data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes
    return data_str_lines
import multiprocessing

def multi_process_preprocess_data_str(data_str_lines):
    """
    :param data_str_lines:
    :return:
    """
    # if cpu load < 25% and 4GB of ram free use 3 cores
    # if cpu load < 70% and 4GB of ram free use 2 cores
    cores_to_use = how_many_core()
    data_str_blocks = slice_list(data_str_lines, cores_to_use)
    for block in data_str_blocks:
        # spawn a process for each data string block assigned to a cpu core
        p = multiprocessing.Process(target=preprocess_data_str_line,
                                    args=(block,))
        p.start()
but I don't know how to collect the results from the worker processes back
into a single list so that I can join the strings with newlines.
So, ideally, I am thinking of combining multiprocessing with a fast
per-line function to speed up the whole process.
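A `multiprocessing.Process` started by hand does not return its result to the parent; each worker mutates its own copy of the block. One way to get the results back, sketched here with `multiprocessing.Pool.map`, which splits the input across workers and returns the cleaned lines in their original order (the function and variable names are illustrative, not from the original code):

```python
import multiprocessing

def clean_line(line):
    # Strip the outer quotes and collapse each doubled quote.
    return line[1:-1].replace('""', '"')

def preprocess_parallel(lines, processes=3):
    # Pool.map distributes `lines` across `processes` workers and
    # returns the results in input order, so joining afterwards is safe.
    # `chunksize` batches lines to reduce inter-process overhead.
    with multiprocessing.Pool(processes) as pool:
        cleaned = pool.map(clean_line, lines, chunksize=10000)
    return '\n'.join(cleaned)

if __name__ == '__main__':
    lines = ['"""a"",""b"",1,""c"""'] * 4
    print(preprocess_parallel(lines))
```

Note that pickling each line to a worker and the result back has its own cost, so it is worth timing this against the plain single-process loop before assuming multiprocessing wins here.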
cheers