[Tutor] XML parsing when elements contain foreign characters
Steven D'Aprano
steve at pearwood.info
Thu Jan 9 12:42:31 CET 2014
On Thu, Jan 09, 2014 at 09:50:24AM +0100, Garry Bettle wrote:
> I'm trying to parse some XML and I'm struggling to reference elements that
> contain foreign characters.
I see from your use of print that you're using Python 2. That means that
strings '' are actually byte-strings, not text-strings. That makes it
really easy for mojibake to creep into your program.
Even though you define a coding line for your file (UTF-8, well done!)
that only affects how Python reads the source code, not how it runs the
code. So when you have this line:
stock=product.getElementsByTagName('AntalPåLager')[0].firstChild.nodeValue
the tag name 'AntalPåLager' is a *byte* string, not the text that you
include in your file. Let's see what Python does with it in version 2.7.
This is what I get on my default system:
py> s = 'AntalPåLager'
py> print repr(s)
'AntalP\xc3\xa5Lager'
You might get something different.
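What you get depends on the encoding your editor or terminal uses. As a rough sketch that takes both out of the picture, the snippet below builds the same name from an explicit code point and encodes it two ways. The variable names are mine, purely for illustration, and the u'' and b'' prefixes are only there so it runs the same on Python 2.7 and later versions:

```python
# The tag name as text: \xe5 is the code point for the letter å.
text = u'AntalP\xe5Lager'

# The same text as bytes under two different encodings.
as_utf8 = text.encode('utf-8')      # å becomes the two bytes \xc3 \xa5
as_latin1 = text.encode('latin-1')  # å becomes the single byte \xe5

print(len(text))       # 12 characters
print(len(as_utf8))    # 13 bytes
print(len(as_latin1))  # 12 bytes
print(as_utf8 == b'AntalP\xc3\xa5Lager')  # True
```

So the repr shown above is what a UTF-8 system produces; a Latin-1 system would show one escaped byte instead of two.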
What are those two weird escaped bytes doing in there, instead of å?
They come about because the string s is treated as bytes rather than
characters. Python 2 tries really hard to hide this fact from you -- for
instance, it shows some bytes as ASCII characters A, n, t, a, etc. But
you can't escape from the fact that they're actually bytes. Eventually
it will cause a problem, and here it is:
> Traceback (most recent call last):
>   File "C:\Python27\Testing Zizzi.py", line 16, in <module>
>     stock=product.getElementsByTagName('AntalPÃ¥Lager')[0].firstChild.nodeValue
> IndexError: list index out of range
See the tag name printed in the error message? 'AntalPÃ¥Lager'. That is
a classic example of mojibake, caused by taking bytes written in one
encoding (say, UTF-8) and incorrectly interpreting them under another
encoding (say, Latin-1).
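To see the effect in isolation, here is a small sketch that encodes the tag name as UTF-8 and then decodes those bytes under the wrong encoding. The variable names are made up for the example; only the tag name comes from your script:

```python
text = u'AntalP\xe5Lager'      # the real tag name, \xe5 is å

raw = text.encode('utf-8')     # bytes as stored in a UTF-8 file
wrong = raw.decode('latin-1')  # same bytes, misread as Latin-1
right = raw.decode('utf-8')    # same bytes, read with the matching encoding

# Misreading turns the two UTF-8 bytes of å into two separate characters,
# \xc3 (Ã) and \xa5 (¥), which is the mangled name in the traceback.
print(wrong == u'AntalP\xc3\xa5Lager')  # True
print(right == text)                    # True
print(len(wrong), len(right))
```

The bytes never change; only the interpretation does. That is why the mangled name is one character longer than the real one.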
There is one right way, and one half-right way, to handle text in Python
2. They are:
- The right way is to always use Unicode text instead of bytes. Instead
of 'AntalPåLager', use the u prefix to get a Unicode string:
u'AntalPåLager'
- The half-right way is to only use ASCII, and then you can get away
with '' strings without the u prefix. Americans and English almost
always can get away with this, so they often think that Unicode is a
waste of time.
My advice is to change all the strings in your program from '' strings
to u'' strings, and see if the problem is fixed. But it may not be --
I'm not an expert on XML processing, and it may turn out that minidom
complains about the use of Unicode strings. Try it and see.
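Here is a minimal sketch of that experiment. The XML document is invented for the test (only the tag name comes from your script), and I have only checked it against this toy input, so treat it as a starting point, not a guarantee about your real file:

```python
from xml.dom import minidom

# Hypothetical one-product document, encoded as UTF-8 bytes the way it
# would be if read from a file opened in binary mode.
xml_bytes = u'<Product><AntalP\xe5Lager>7</AntalP\xe5Lager></Product>'.encode('utf-8')

doc = minidom.parseString(xml_bytes)

# Look the element up with a Unicode (u'') tag name.
tag = u'AntalP\xe5Lager'
nodes = doc.getElementsByTagName(tag)
stock = nodes[0].firstChild.nodeValue

print(len(nodes))  # 1
print(stock)       # 7
```

On this input minidom accepts the Unicode tag name without complaint and finds the element.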
I expect (but don't know for sure) that what is happening is that you
have an XML file with a tag AntalPåLager, but due to the mojibake
problem, Python is looking for a non-existent tag AntalPÃ¥Lager and
returning an empty list. When you try to index into that list, it's
empty and so you get the exception.
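A rough way to check that guess is to look at what getElementsByTagName returns before indexing into it. In the sketch below the XML document and the helper first_text are both hypothetical, written only to illustrate the point:

```python
from xml.dom import minidom

# Hypothetical document containing the correctly spelled tag.
xml_bytes = u'<Product><AntalP\xe5Lager>7</AntalP\xe5Lager></Product>'.encode('utf-8')
doc = minidom.parseString(xml_bytes)

def first_text(doc, tag):
    """Return the text of the first <tag> element, or None if it is missing or empty."""
    nodes = doc.getElementsByTagName(tag)
    if not nodes:                 # no match: avoid the IndexError
        return None
    child = nodes[0].firstChild
    return child.nodeValue if child is not None else None

good = u'AntalP\xe5Lager'         # the real name, with å
mangled = u'AntalP\xc3\xa5Lager'  # the mojibake name from the traceback

print(len(doc.getElementsByTagName(mangled)))  # 0
print(first_text(doc, mangled))                # None
print(first_text(doc, good))                   # 7
```

If the lookup with your tag name comes back empty while the tag is visibly present in the file, a mismatch between the two spellings is the likely cause.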
--
Steven