[Tutor] XML parsing when elements contain foreign characters

Thu Jan 9 12:42:31 CET 2014

On Thu, Jan 09, 2014 at 09:50:24AM +0100, Garry Bettle wrote:

> I'm trying to parse some XML and I'm struggling to reference elements that
> contain foreign characters.

I see from your use of print that you're using Python 2. That means that 
strings '' are actually byte-strings, not text-strings. That makes it 
really easy for mojibake to creep into your program.

Even though you define a coding line for your file (UTF-8, well done!) 
that only affects how Python reads the source code, not how it runs the 
code. So when you have this line:

stock=product.getElementsByTagName('AntalPåLager')[0].firstChild.nodeValue

the tag name 'AntalPåLager' is a *byte* string, not the text that you 
include in your file. Let's see what Python does with it in version 2.7. 
This is what I get on my default system:

py> s = 'AntalPåLager'
py> print repr(s)
'AntalP\xc3\xa5Lager'

You might get something different.
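
Why might you get something different? Because the bytes depend on the 
encoding in use. Here is a small sketch of that, not taken from your 
program: the variable names are mine, and I've written the å as the 
escape \xe5 so the example doesn't depend on how the file itself is 
saved. It should run on 2.7 or 3.x.

```python
# -*- coding: utf-8 -*-
# Sketch: the same text turns into different bytes under different encodings.
# All names here are made up for illustration.

text = u'AntalP\xe5Lager'   # \xe5 is the code point for å; this is real text

utf8_bytes = text.encode('utf-8')      # å becomes the two bytes \xc3\xa5
latin1_bytes = text.encode('latin-1')  # å becomes the single byte \xe5

print(repr(utf8_bytes))
print(repr(latin1_bytes))

# The text has 12 characters, but the UTF-8 version has 13 bytes.
print(len(text))
print(len(utf8_bytes))
```

If your terminal or editor uses UTF-8, you'll see the two-byte form, as 
I did. With Latin-1 you would see the one-byte form instead.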

What are those two weird escaped bytes doing in there, instead of å ? 
They come about because the string s is treated as bytes rather than 
characters. Python 2 tries really hard to hide this fact from you -- for 
instance, it shows some bytes as ASCII characters A, n, t, a, etc. But 
you can't escape from the fact that they're actually bytes. Eventually 
that will cause a problem, and here it is:

> Traceback (most recent call last):
>   File "C:\Python27\Testing Zizzi.py", line 16, in <module>
>
> stock=product.getElementsByTagName('AntalPÃ¥Lager')[0].firstChild.nodeValue
> IndexError: list index out of range

See the tag name printed in the error message? 'AntalPÃ¥Lager'. That is 
a classic example of mojibake, caused by taking bytes written in one 
encoding (say, UTF-8) and incorrectly interpreting them under another 
encoding (say, Latin-1).
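
You can reproduce that exact garbling yourself. This is only a sketch 
of the mechanism, assuming the two encodings involved really are UTF-8 
and Latin-1 (I'm guessing at that for your system); the variable names 
are mine.

```python
# -*- coding: utf-8 -*-
# Sketch: produce mojibake by decoding UTF-8 bytes as if they were Latin-1.

text = u'AntalP\xe5Lager'            # the tag name as proper Unicode text

raw = text.encode('utf-8')           # the bytes as stored in a UTF-8 file
garbled = raw.decode('latin-1')      # wrong decoder: each byte becomes one character

# The two bytes \xc3 \xa5 are now the two characters Ã and ¥.
print(repr(garbled))

# If nothing was lost along the way, the damage can be undone by reversing the steps.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == text)
```

The garbled string is one character longer than the original, which is 
why it can never match the real tag name.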

There is one right way, and one half-right way, to handle text in Python 
2. They are:

- The right way is to always use Unicode text instead of bytes. Instead 
  of 'AntalPåLager', use the u prefix to get a Unicode string:

  u'AntalPåLager'

- The half-right way is to only use ASCII, and then you can get away 
  with '' strings without the u prefix. Americans and English almost 
  always can get away with this, so they often think that Unicode is a 
  waste of time.

My advice is to change all the strings in your program from '' strings 
to u'' strings, and see if the problem is fixed. But it may not be -- 
I'm not an expert on XML processing, and it may turn out that minidom 
complains about the use of Unicode strings. Try it and see.

I expect (but don't know for sure) that what is happening is that you 
have an XML file with a tag AntalPåLager, but due to the mojibake 
problem, Python is looking for a non-existent tag AntalPÃ¥Lager and 
returning an empty list. When you try to index into that list, it's 
empty and so you get the exception.
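
Here is a rough way to test that guess without your real data. The XML 
below is invented (the Produkt element and the value 7 are my 
assumptions, not your file), and I'm assuming the file is encoded as 
UTF-8. It looks the tag up once with the proper Unicode name and once 
with the garbled one.

```python
# -*- coding: utf-8 -*-
# Sketch: look up a non-ASCII tag name in minidom, with and without mojibake.
# The document below is made up for illustration.
from xml.dom import minidom

xml_text = (u'<?xml version="1.0" encoding="UTF-8"?>'
            u'<Produkt><AntalP\xe5Lager>7</AntalP\xe5Lager></Produkt>')
doc = minidom.parseString(xml_text.encode('utf-8'))  # parse from UTF-8 bytes

good = doc.getElementsByTagName(u'AntalP\xe5Lager')      # the real tag name
bad = doc.getElementsByTagName(u'AntalP\xc3\xa5Lager')   # the mojibake name

print(len(good))                       # one matching element
print(good[0].firstChild.nodeValue)    # its text content

print(len(bad))                        # nothing matches the garbled name
try:
    bad[0]
except IndexError as err:
    print('IndexError: %s' % err)      # same kind of failure as in your traceback
```

If the Unicode lookup works like this on your real file, the u prefix 
is all you need.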

-- 
Steven