String concatenation - which is the fastest way?

Stefan Behnel stefan_ml at behnel.de
Wed Aug 10 12:20:10 EDT 2011


przemolicc at poczta.fm, 10.08.2011 15:31:
> On Wed, Aug 10, 2011 at 01:32:06PM +0100, Chris Angelico wrote:
>> On Wed, Aug 10, 2011 at 12:17 PM, <przemolicc at poczta.fm> wrote:
>>> I'd like to write a Python (2.6/2.7) script which connects to a database, fetches
>>> hundreds of thousands of rows, concatenates them (basically: creates XML)
>>> and then puts the result into another table. Do I have any choice
>>> regarding string concatenation in Python from a performance point of view?
>>> Since the number of rows is big, I'd like to use the fastest possible library
>>> (if there is any choice). Can you recommend something?
>>
>> First off, I have no idea why you would want to create an XML dump of
>> hundreds of thousands of rows, only to store it in another table.
>> However, if that is your intention, list joining is about as efficient
>> as you're going to get in Python:
>>
>> lst = ["asdf", "qwer", "zxcv"]  # feel free to add 399,997 more list entries
>> xml = "<foo>" + "</foo><foo>".join(lst) + "</foo>"
>>
>> This sets xml to '<foo>asdf</foo><foo>qwer</foo><foo>zxcv</foo>' which
>> may or may not be what you're after.
>
> Since this process (XML building) currently runs inside the database (using native SQL commands)
> and is single-threaded, it is quite slow. What I want to do is spawn several Python subprocesses in parallel,
> each of which will concatenate a subset of the whole table (and then merge all of them at the end).
> Basically:
> - fetch all rows from the database (up to 1 million): what is the recommended data type?
> - spawn X Python processes, each of which:
>      - concatenates its own subset
> - merge the results from all the subprocesses
>
> This task is running on a server which has many but slow cores, and I am trying to divide it
> into many subtasks.

Makes sense to me. Note that the really good DBMSes (namely, PostgreSQL) 
come with built-in Python support.

You still haven't provided enough information to make me understand why you 
need XML in between one database and another (or the same one?), but if you 
go that route, you can simply read the data through multiple connections in 
multiple threads (or processes), have each of them build up one (or more) 
XML entries, and then push those into a queue. Another thread (or more than 
one) can then read from that queue and write the XML items into a file (or 
another database) as they come in.
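
Something along these lines (only a rough sketch: sqlite3 stands in for 
whatever DB-API driver you actually use, and the table/column names, id 
ranges and file paths are all made up for illustration):

import multiprocessing
import sqlite3
from xml.sax.saxutils import escape

DB_PATH = "source.db"      # assumed database file
OUT_PATH = "dump.xml"

def worker(id_range, queue):
    # Each worker opens its own connection and processes one id range.
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute("SELECT value FROM source_table WHERE id BETWEEN ? AND ?",
                id_range)
    for (value,) in cur:
        # Build one XML fragment per row and hand it to the writer.
        queue.put("<foo>%s</foo>" % escape(value))
    queue.put(None)        # sentinel: this worker is done
    conn.close()

def writer(queue, n_workers, path):
    # A single writer drains the queue and streams fragments to disk,
    # so no process ever holds the whole document in memory.
    finished = 0
    out = open(path, "w")
    try:
        while finished < n_workers:
            item = queue.get()
            if item is None:
                finished += 1
            else:
                out.write(item)
    finally:
        out.close()

if __name__ == "__main__":
    ranges = [(1, 250000), (250001, 500000),
              (500001, 750000), (750001, 1000000)]
    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(r, queue))
               for r in ranges]
    for p in workers:
        p.start()
    writer(queue, len(workers), OUT_PATH)   # consume in the main process
    for p in workers:
        p.join()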

If your data is of considerable size, I wouldn't use string concatenation 
or joining at all (note that it temporarily requires twice the memory 
during concatenation), but rather write it into a file, or even just 
process the data on the fly, i.e. write it back into the target table 
right away. Reading a file back in afterwards is much more resource-friendly 
than keeping huge amounts of data in memory. And disk speed is usually not 
a problem when streaming data from disk into a database.
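
A rough sketch of that on-the-fly variant (again with sqlite3 and invented 
table names standing in for the real database):

import sqlite3
from xml.sax.saxutils import escape

def stream_into_target(db_path, batch_size=1000):
    # Read from the source table and insert the generated XML fragments
    # into the target table in small batches, never holding more than
    # one batch in memory.
    conn = sqlite3.connect(db_path)
    read_cur = conn.cursor()
    write_cur = conn.cursor()
    read_cur.execute("SELECT value FROM source_table")
    batch = []
    for (value,) in read_cur:
        batch.append(("<foo>%s</foo>" % escape(value),))
        if len(batch) >= batch_size:
            write_cur.executemany(
                "INSERT INTO target_table (xml) VALUES (?)", batch)
            del batch[:]
    if batch:
        write_cur.executemany(
            "INSERT INTO target_table (xml) VALUES (?)", batch)
    conn.commit()
    conn.close()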

It may also be worth considering writing out literal SQL INSERT statements 
instead of XML. Most databases have a decent bulk upload tool.
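
A naive sketch of that idea (the quoting below is deliberately simplistic 
and the table names are invented; a real dump would rather use the target 
database's own bulk-load format, e.g. a tab-separated file for PostgreSQL's 
COPY):

import sqlite3

def dump_inserts(db_path, out_path):
    # Write one INSERT statement per source row into a plain SQL file
    # that the target database's command line client can replay.
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("SELECT id, value FROM source_table")
    out = open(out_path, "w")
    try:
        for row_id, value in cur:
            # Deliberately naive quoting; real code would use the target
            # database's bulk-load format or proper escaping instead.
            out.write(
                "INSERT INTO target_table (id, value) VALUES (%d, '%s');\n"
                % (row_id, value.replace("'", "''")))
    finally:
        out.close()
    conn.close()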

Stefan



