[Tutor] Handling hundreds of thousands of inserts with MySQLdb

Kent Johnson kent37 at tds.net
Sat Sep 2 18:37:57 CEST 2006


Gonzillaaa wrote:
> Hello all. I'm posting this here since this is my first attempt to solve
> a problem with python.
>
> I have fairly big log files that I pre-process to clean up the data
> before it goes into a MySQL database. After processing, the files look
> something like this:
>
> 17, , 2006-8-21 12:04:29, 0, 3.0846, 25.105, 918, -0.12183, 0.20305,  
> 25.389, 25.254, 180
> 18, , 2006-8-21 12:05:20, 17, 3.1705, 23.62, 949, 0.015228, 0.040609,  
> 24.984, 110.2, 186
> 17, , 2006-8-21 12:07:30, 0, 3.0846, 25.353, 939, -0.1269, 0.20305,  
> 25.254, 25.254, 293
> 18, , 2006-8-21 12:08:23, 17, 3.1705, 23.538, 958, 0.015228,  
> 0.045685, 24.984, 110.2, 188
> 16, , 2006-8-21 12:09:21, 17, 3.0922, 24.691, 969, 0.26904, 0.10152,  
> 25.389, 25.389, 175
>
> Then I wrote another script, to which I pass the filename as an
> argument, to insert that data into the db. The problem I'm running into
> is that some of the files contain approximately 300,000 records. I have
> tried two approaches to insert the data, but both have failed:
>
> 1- use cursor.executemany(), which throws the error below. Also, if I
> understand it correctly, executemany() does one insert at a time, which
> seems hardly efficient.
>
> File "./xbow_MySQL_insert.py", line 39, in MySQLInsert
>      cursor.executemany("INSERT INTO table \
>    File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ 
> python2.4/site-packages/MySQLdb/cursors.py", line 216, in executemany
>      r = self._query(',\n'.join(q))
>    File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ 
> python2.4/site-packages/MySQLdb/cursors.py", line 309, in _query
>      rowcount = self._do_query(q)
>    File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ 
> python2.4/site-packages/MySQLdb/cursors.py", line 273, in _do_query
>      db.query(q)
> _mysql_exceptions.OperationalError: (1153, "Got a packet bigger than  
> 'max_allowed_packet' bytes")
>
>   
From the error it looks like all the data is being sent at once, not in
multiple execute calls, so executemany() looks promising; it just seems
to be choking on the full data set. What if you try it with, e.g., 100
records at a time instead of all 300,000? Something like this:

def MySQLInsert(mysql_data):
	db = MySQLdb.connect(host, user, password, database)
	cursor = db.cursor()

	while mysql_data:
		small_data, mysql_data = mysql_data[:100], mysql_data[100:]
		cursor.executemany(
			"INSERT INTO table "
			"(id, sample, sample_time, parent, voltage, temp, light, "
			"accel_x, accel_y, mag_x, mag_y, mic) "
			"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
			small_data)

	db.close()

You can tune this by picking the grouping size that gives the best 
performance.
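The slicing above can also be factored into a small reusable helper. This is just a sketch, and `chunked` is a name I made up, not anything from MySQLdb:

```python
def chunked(seq, size):
    """Yield successive slices of seq, each at most `size` items long."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# 250 fake rows split into batches of at most 100 each
rows = [(i,) for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(rows, 100)]
```

Inside MySQLInsert you would then loop over `chunked(mysql_data, 100)` and call executemany() once per batch; as a bonus, this avoids repeatedly copying the shrinking list the way the slice-and-reassign version does.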

Kent
> 2- use cursor.execute with a DELAYED INSERT, but I think I'm just
> passing the wrong arguments to it:
>
> File "./xbow_MySQL_insert.py", line 43, in MySQLInsert
>      cursor.execute("INSERT DELAYED INTO arup_03 \
>    File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ 
> python2.4/site-packages/MySQLdb/cursors.py", line 148, in execute
>      query = query % db.literal(args)
> TypeError: not all arguments converted during string formatting
>
>
> Another solution I thought of is to break the data into chunks and do
> delayed inserts with those chunks, although I'm not sure how to approach
> either part (breaking the data up, and getting delayed inserts to work).
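For what it's worth, the TypeError in your second attempt comes straight from the line the traceback shows, `query = query % db.literal(args)`: execute() expects the parameters for exactly one row, so handing it the whole list of rows leaves arguments left over after formatting. A minimal reproduction with plain string formatting, no database needed:

```python
# execute() does roughly: query % converted_args.
# More values than placeholders raises the same TypeError as the traceback.
template = "VALUES (%s, %s)"        # two placeholders

ok = template % ("1", "2")          # one row of two values: fine

try:
    template % ("1", "2", "3")      # three values, two placeholders
    raised = False
except TypeError as exc:
    raised = True
    message = str(exc)
```

So for delayed inserts you could likely keep using executemany() and just write INSERT DELAYED INTO in the statement text, rather than switching to execute().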
>
> Here is the code so far; any comments on general improvements are also
> welcome.
>
> Many thanks,
>
> Gonzalo.
>
> #######################################
>
> import sys,string,MySQLdb
>
> #db config
> host, user, password, database = "localhost", "user", "pass",  
> "xbow_data"
>
> def MySQLInsert(mysql_data):
> 	db = MySQLdb.connect(host, user, password, database)
> 	cursor = db.cursor()
> 	
> 	cursor.executemany(
> 		"INSERT INTO table "
> 		"(id, sample, sample_time, parent, voltage, temp, light, "
> 		"accel_x, accel_y, mag_x, mag_y, mic) "
> 		"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
> 		mysql_data)
>
> 	# cursor.execute(
> 	# 	"INSERT DELAYED INTO table "
> 	# 	"(id, sample, sample_time, parent, voltage, temp, light, "
> 	# 	"accel_x, accel_y, mag_x, mag_y, mic) "
> 	# 	"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
> 	# 	mysql_data)
>
> 	db.close()
> 	print "done inserting into db"
>
> def returnTuple(formatted_data):
> 	_data = []
> 	for line in formatted_data:
> 		line = line.split(',')
> 		line = [item.strip() for item in line]
> 		_data.append(line)
> 	
> 	return [tuple(line) for line in _data]
> 	
>
> def Main():
> 	#check if filename was passed when the program was called
> 	if (len(sys.argv) < 2):
> 		print "Usage: " + sys.argv[0] + " <filename>"
> 		sys.exit()
> 		
> 	#open file to work on
> 	filename = sys.argv[1]
> 	raw_data = open(filename,"r",1)
> 	
> 	# Get data on file converted into a series of tuples to be inserted  
> into db
> 	mysql_data =  returnTuple(raw_data)
> 	
> 	MySQLInsert(mysql_data)
> 	
> 	#close file
> 	raw_data.close()
> 	
> 	#print result
> 	print "done! " + str(len(mysql_data)) + " lines inserted"
>
> if __name__ == '__main__': Main()
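Since the files are already cleaned into comma-separated lines, another option worth benchmarking is MySQL's LOAD DATA INFILE statement, which bulk-loads a whole file and is usually much faster than row-by-row inserts. A sketch that just builds the statement (the table name here is a placeholder, and LOCAL INFILE has to be enabled on both the client and the server):

```python
# Columns match the INSERT in the script above; table name is hypothetical.
COLUMNS = ("id", "sample", "sample_time", "parent", "voltage", "temp",
           "light", "accel_x", "accel_y", "mag_x", "mag_y", "mic")

def load_data_sql(filename, table="xbow_data_table"):
    """Build a LOAD DATA LOCAL INFILE statement for one cleaned log file."""
    return ("LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
            "FIELDS TERMINATED BY ',' (%s)"
            % (filename, table, ", ".join(COLUMNS)))

sql = load_data_sql("cleaned.log")
```

You would then run cursor.execute(load_data_sql(filename)) on a connection opened with local_infile enabled. Treat this as something to experiment with rather than a drop-in replacement, since LOCAL INFILE support depends on your server configuration.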
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor