Wikipedia XML Dump
Rustom Mody
rustompmody at gmail.com
Tue Jan 28 20:52:48 EST 2014
On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,
> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?
> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.
Further thoughts? Just a combo of what Burak and Skip said:
I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic XML
to something (more) digestible to nltk.
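
A rough sketch of what such a veneer might look like. Everything here is an assumption on my part, not tested against the real dump: I use the standard library's xml.etree.ElementTree.iterparse as a stand-in (lxml.etree.iterparse offers a similar interface), I assume pages sit in <page> elements with a <title> and a <revision>/<text> inside, and iter_pages, local and SAMPLE are made-up names for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from a tag name, if there is one."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) pairs one page at a time from an XML stream.

    Assumes a MediaWiki-style layout: <page> elements containing a
    <title> and a <text> somewhere below them.
    """
    context = iter(ET.iterparse(source, events=("start", "end")))
    # The first event is the start of the root element; keep a handle
    # on it so finished pages can be dropped from the tree.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and local(elem.tag) == "page":
            title = None
            text = None
            for child in elem.iter():
                name = local(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text
            yield title, text or ""
            # Discard everything parsed so far so memory stays bounded.
            root.clear()


# Tiny made-up sample standing in for the real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>A</title><revision><text>alpha text</text></revision></page>
  <page><title>B</title><revision><text>beta text</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text.split()))
```

For the real thing you would pass the opened dump file instead of the BytesIO, and hand each text to nltk (or write it out to smaller files) inside the loop rather than collecting everything in a list.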