[lxml] unicode behavior in child.text elements

Feb. 16, 2012

Hi,

I'm reading a CSV file encoded in cp1250 using Python's csv module and then
translating it to XML. This is the code I use:

....
for i in range(len(row)):
    child = etree.SubElement(jednostka, csvHeaders[i])
    # decode the cp1250 byte string returned by csv into unicode
    child.text = unicode(row[csvHeaders[i]].strip(), 'cp1250')
    print type(child.tag), child.tag, type(child.text), child.text
....
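
In case it helps to reproduce this, here is a minimal self-contained sketch of the same loop. It rests on several assumptions that are not in my real code: it is written for Python 3 (where str is already Unicode), it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (the Element and SubElement calls are the same), and the sample data, the ";" delimiter, the "root" element name and the csv_to_xml helper are all made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as etree  # stand-in; lxml.etree offers the same calls

# Hypothetical sample standing in for the real cp1250-encoded CSV file.
SAMPLE_CP1250 = (
    u"name;note\n"
    u"english;this is english\n"
    u"polish;źdźbło - polish\n"
).encode("cp1250")


def csv_to_xml(raw_bytes, encoding="cp1250", delimiter=";"):
    """Decode the raw CSV bytes once, then build one element per row."""
    text = raw_bytes.decode(encoding)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    root = etree.Element("root")
    for row in reader:
        jednostka = etree.SubElement(root, "jednostka")
        for header in reader.fieldnames:
            child = etree.SubElement(jednostka, header)
            # Already Unicode here, so no per-field decoding is needed.
            child.text = row[header].strip()
    return root


root = csv_to_xml(SAMPLE_CP1250)
texts = [element.text for element in root.iter("note")]
```

Decoding the whole input once up front, rather than field by field, keeps the encoding handling in a single place.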

National characters can appear in some child.text values but not in all of
them; it depends on the data. Since I assign unicode to every child.text, I
would expect all of them to come back as unicode (?), but that is not the
case. Only the values containing national characters are of type unicode;
the rest are of type str.

Why? Is lxml selective in that case? That looks strange. For example:
<a>
  <b>this is english</b>
</a>
<a>
  <b>źdźbło - polish</b>
</a>

The types lxml reports for these elements look like this: all tags are
<str>, which is fine, but the text 'this is english' is of type <str> while
'źdźbło - polish' is of type <unicode>. Is that normal?

Finally, after serialization to XML, the UTF-8 encoded file looks fine and
the national characters are correct, so maybe it is not a problem, but I'm
still curious what is going on.
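
This is roughly how I checked the serialized output; treat it as a sketch rather than my exact code. It again uses the standard library's xml.etree.ElementTree in place of lxml.etree (an assumption made so that it runs without lxml installed; tostring and fromstring exist in both), and the byte string is simply 'źdźbło - polish' written out in cp1250.

```python
import xml.etree.ElementTree as etree  # stand-in; lxml.etree offers the same calls

# 'źdźbło - polish' as cp1250 bytes: 0x9f is 'ź', 0xb3 is 'ł'.
raw_cp1250 = b"\x9fd\x9fb\xb3o - polish"

root = etree.Element("a")
b = etree.SubElement(root, "b")
b.text = raw_cp1250.decode("cp1250")

# Serialize to UTF-8 bytes, then parse them back to compare the text.
xml_bytes = etree.tostring(root, encoding="utf-8")
roundtrip = etree.fromstring(xml_bytes)
roundtrip_text = roundtrip.find("b").text
```

If the text survives the round trip unchanged, the decoding from cp1250 and the UTF-8 output are consistent with each other, whatever type child.text reports in between.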

P.

Piotr Oh