Issue with xml iterparse

Thu Jun 3 16:59:57 EDT 2010

On Thu, Jun 3, 2010 at 1:44 PM, bfrederi <brfredericks at gmail.com> wrote:
> I am using lxml iterparse and running into a very obscure error. When
> I run iterparse on a file, it will occasionally return an element that
> has element.text == None when the element clearly has text in it.
>
> I copied and pasted the problem xml into a python string, used StringIO
> to create a file-like object out of it, and ran a test using iterparse
> with expected output, and it ran perfectly fine. So it only happens
> when I try to run iterparse on the actual file.
>
> So then I tried opening the file, reading the data, turning that data
> into a file-like object using StringIO, then running iterparse on it,
> and the same problem (element.text == None) occurred.
>
> I even tried this:
> f = codecs.open(abbyy_filename, 'r', encoding='utf-8')
> file_data = f.read()
> file_like_object = StringIO.StringIO(file_data)
> for event, element in iterparse(file_like_object, events=("start",
> "end")):

IIRC, XML parsers operate on bytes directly (since they have to
determine the encoding themselves anyway, from the XML declaration or
a byte order mark), not on pre-decoded Unicode characters, so I think
your manual UTF-8 decoding could be the problem.
Have you tried simply:

f = open(abbyy_filename, 'rb')  # binary mode: the parser decodes the bytes
for event, element in iterparse(f, events=("start", "end")):
    pass  # whatever

?
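
A minimal sketch of the bytes-first approach, in case it helps. It
uses the standard library's xml.etree.ElementTree.iterparse as a
stand-in for lxml's (an assumption, so that it runs without lxml
installed; the loop has the same shape). The sample document, the
"line" tag, and the collect_text helper are made up for illustration.
It also reads .text only on "end" events: the ElementTree docs note
that an element's text is not guaranteed to be populated yet when its
"start" event fires, which may be worth checking as a separate
possible cause.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)

# Hypothetical sample document, kept as raw UTF-8 bytes so the parser
# can work out the encoding from the XML declaration itself.
xml_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<doc><line>caf\u00e9</line></doc>'
).encode("utf-8")


def collect_text(source):
    """Return the text of every <line> element read from a binary source."""
    texts = []
    for event, element in ET.iterparse(source, events=("start", "end")):
        # On "start" the element's text may not have been parsed yet,
        # so only look at it once the element is complete.
        if event == "end" and element.tag == "line":
            texts.append(element.text)
    return texts


# io.BytesIO plays the role of a file opened with mode 'rb'.
texts = collect_text(io.BytesIO(xml_bytes))
```

With a real file, open(abbyy_filename, 'rb') would take the place of
the io.BytesIO object.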

Apologies if you have already tried this, but since you didn't include
the original code that triggers the error (trivial as it probably is),
this relatively simple cause couldn't be ruled out.

Cheers,
Chris
--
http://blog.rebertia.com