How to find bad row with db api executemany()?
Roy Smith
roy at panix.com
Fri Mar 29 20:41:03 EDT 2013
In article <mailman.3971.1364595940.2939.python-list at python.org>,
Dennis Lee Bieber <wlfraed at ix.netcom.com> wrote:
> If using MySQLdb, there isn't all that much difference... MySQLdb is
> still compatible with MySQL v4 (and maybe even v3), and since those
> versions don't have "prepared statements", .executemany() essentially
> turns into something that creates a newline delimited "list" of
> "identical" (but for argument substitution) statements and submits that
> to MySQL.
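For anyone curious, here's a rough sketch of the client-side expansion Dennis is describing. This is NOT MySQLdb's actual code (the real driver escapes values through the connection); repr() below is just a stand-in so you can see the shape of what gets sent:

```python
# Rough sketch of executemany() without prepared statements: format
# the same template once per row and join the batch into one string.
# NOT MySQLdb's real implementation; repr() stands in for proper
# connection-level escaping.
def expand_executemany(template, rows):
    stmts = [template % tuple(repr(v) for v in row) for row in rows]
    return "\n".join(stmts)

batch = expand_executemany(
    "INSERT INTO t (a, b) VALUES (%s, %s)",
    [(1, "x"), (2, "y")])
```

The whole batch goes over in one round trip, but the server still parses and plans each INSERT separately, which is why the throughput win evaporates.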
Shockingly, that does appear to be the case. I had thought during my
initial testing that I was seeing far greater throughput, but as I got
more into the project and started doing some side-by-side comparisons,
the differences went away.

We're sucking in a pretty huge amount of data. The source document is a
7 GB gzipped XML file. I'm not sure how big it is uncompressed (we use
gzip.GzipFile to uncompress on the fly) but I'm guessing something like
a 30x compression ratio so 200 GB? The last time we imported the whole
set, it ran for 21 days!
It turns out, the problems we were seeing were all inserts into a new
table we added. Apparently, the default charset is latin-1 and we
didn't notice that when we altered the schema! Once I noticed that all
the other tables were utf-8 and changed this one to be that, the
problems went away.
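For anyone who hits the same thing, this is roughly what the fix amounts to (the table name below is a made-up example). Note that CONVERT TO CHARACTER SET rewrites the stored column data as well as changing the table's default, which is what you want:

```python
# Sketch of the charset fix: generate ALTER statements for the
# mismatched tables.  Table names here are hypothetical examples.
def utf8_alters(tables):
    # CONVERT TO CHARACTER SET rewrites existing column data in
    # addition to changing the table's default character set.
    return ["ALTER TABLE %s CONVERT TO CHARACTER SET utf8" % t
            for t in tables]

statements = utf8_alters(["new_table"])
```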
Sadly, I ended up resorting to a truly ugly hack to diagnose the
problem. I catch the exception and parse the text message. Yuck.
    try:
        self.executemany(self.sql_statement, self.sql_params)
    except MySQLdb.Error as ex:
        code, message = ex.args
        m = re.search(r".* at row (\d+)$", message)
        if m:
            i = int(m.group(1)) - 1  # Python is 0-indexed, SQL is 1-indexed
The other truly horrible part of the project was when I decided it was
bad for my import script to have too much schema knowledge hard-wired
in. So, I decided to use SQLAlchemy to introspect the database and
discover the column names, types, and defaults. It turns out, if an
integer column has a default (say, 0), the introspected data comes back
with the default as the string, '0'. WTF???
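We ended up coercing the reflected default by column type ourselves. A minimal sketch of the workaround; the type-name strings and the helper are my own invention, not part of SQLAlchemy's API:

```python
# Minimal sketch of coercing a reflected server default back into a
# Python value.  The type-name strings and this helper are my own
# invention, not SQLAlchemy API.
def coerce_default(type_name, raw_default):
    if raw_default is None:
        return None
    text = raw_default.strip("'")  # reflected defaults arrive as SQL text
    if type_name.startswith("int"):
        return int(text)
    return text
```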
Does Postgres's Python adapter handle executemany() in a sane way?
We're not wedded to MySQL in any way. We use it for exactly this one
process. We get these XML dumps from a supplier's SQL-Server database.
We stage the data in MySQL, then export what we need into MongoDB. We
could easily swap out the MySQL staging for Postgres if that worked
better.
Hmmm, we do take advantage of REPLACE INTO, which I think is a
non-standard MySQL extension. Not sure if Postgres supports that.
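If we did switch, REPLACE INTO's documented behavior (delete any existing row with the same unique key, then insert the new one) is easy enough to emulate as two statements in a transaction. A sketch, with hypothetical table and column names:

```python
# REPLACE INTO is documented as: delete any existing row with the
# same unique key, then insert the new row.  A portable emulation is
# two statements run inside one transaction.  Names are examples.
def replace_emulation(table, key, cols):
    delete = "DELETE FROM %s WHERE %s = %%s" % (table, key)
    insert = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), ", ".join(["%s"] * len(cols)))
    return delete, insert

delete_sql, insert_sql = replace_emulation("staging", "id", ["id", "val"])
```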