unicode behavior in child.text elements
Hi, I'm reading a csv file encoded in cp1250 using the csv Python module, then translating it to XML. This is the code I use:

    ....
    for i in range(len(row)):
        child = etree.SubElement(jednostka, csvHeaders[i])
        child.text = unicode(row[csvHeaders[i]].strip(), 'cp1250')
        print type(child.tag), child.tag, type(child.text), child.text
    ....

National characters can appear in some child.text values but not in all of them. It depends on the data. But generally all the child.text values should be decoded to unicode (?), and that is not the case. Only the data with national characters is decoded; the rest is of type str. Why? Is lxml selective in that case? That looks strange. Example:

    <a>
      <b>this is english</b>
    </a>
    <a>
      <b>źdźbło - polish</b>
    </a>

The lxml type representation of the elements looks like this: all tags are <str>, and that is ok, but the 'this is english' text is of type <str> and 'źdźbło - polish' is of type <unicode>. Is that normal?

Finally, after serialization to XML, the utf-8 encoded file looks ok, national characters are ok etc., so maybe it is not a problem, but anyway I'm curious what is going on.

P.
Piotr Oh, 16.02.2012 10:01:
I'm reading a csv file encoded in cp1250 using the csv Python module, then translating it to XML. This is the code I use:

    ....
    for i in range(len(row)):
        child = etree.SubElement(jednostka, csvHeaders[i])
        child.text = unicode(row[csvHeaders[i]].strip(), 'cp1250')
        print type(child.tag), child.tag, type(child.text), child.text
    ....

National characters can appear in some child.text values but not in all of them. It depends on the data. But generally all the child.text values should be decoded to unicode (?), and that is not the case. Only the data with national characters is decoded; the rest is of type str.
Why? Is lxml selective in that case?
Yes, it happens in explicit code. https://github.com/lxml/lxml/blob/8f0a70f195cf2e89c547a1c47b48c3169ad9d36c/s...
But that looks strange, example:

    <a>
      <b>this is english</b>
    </a>
    <a>
      <b>źdźbło - polish</b>
    </a>

and the lxml type representation of the elements looks like this: all tags are <str> and it is ok but the 'this is english' text is of type <str> and 'źdźbło - polish' is of type <unicode>. Is it normal?
Well, at least this is how ElementTree does it. lxml just follows in a compatible way.
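As a rough illustration of the selective behaviour described above, the idea can be sketched in Python 3, with bytes standing in for the Python 2 str type. This is not lxml's actual code: the helper name text_from_utf8 is made up for this sketch, and it assumes the text is held internally as UTF-8 bytes, as libxml2 does.

```python
def text_from_utf8(data):
    """Hypothetical helper (not part of lxml or ElementTree).

    Mimics the Python 2 era shortcut: return pure ASCII data as a
    byte string, and decode only when non-ASCII bytes are present.
    """
    if all(byte < 0x80 for byte in data):
        # Pure 7-bit ASCII: hand the byte string back unchanged, no decoding.
        return data
    # Non-ASCII bytes present: decoding to a Unicode string is unavoidable.
    return data.decode("utf-8")


english = text_from_utf8("this is english".encode("utf-8"))
polish = text_from_utf8("źdźbło - polish".encode("utf-8"))

# The ASCII-only value stays a byte string; the Polish one is decoded.
print(type(english).__name__, type(polish).__name__)
```

In Python 2 terms, the first result corresponds to str and the second to unicode, which matches the types seen in the question.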
Finally, after serialization to xml, utf-8 encoded file looks ok, national characters are ok etc, so maybe it is not a problem
It is not, no.
but anyway I'm curious what is going on.
This is mostly done for performance reasons in Python 2.x. If a string doesn't need to be decoded, it saves time and memory not to do it. Due to the way the Python 2 str type works, pure 7-bit ASCII byte strings are compatible with Unicode strings. In Python 3, you will always get Unicode strings from the API, and in Python 3.3 this will even be (more or less) as efficient as what happens under Python 2.
Stefan
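For comparison, here is a minimal Python 3 sketch of the same conversion. It uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that the ElementTree-compatible parts of the API behave the same way here; the sample rows and the root element name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: raw cp1250 bytes, as they would come
# out of a file opened in binary mode.
raw_values = [
    "this is english".encode("cp1250"),
    "źdźbło - polish".encode("cp1250"),
]

root = ET.Element("root")
for raw in raw_values:
    a = ET.SubElement(root, "a")
    b = ET.SubElement(a, "b")
    # Decode explicitly from cp1250; the result is always a str in Python 3.
    b.text = raw.decode("cp1250").strip()

# Collect the type of every text value read back from the tree.
text_types = [type(b.text) for b in root.iter("b")]

# Serialize to UTF-8 encoded bytes.
xml_bytes = ET.tostring(root, encoding="utf-8")

print(text_types)
```

Both text values come back as str here, whether or not they contain national characters, and the serialized bytes carry the Polish text in UTF-8.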
participants (2)
- Piotr Oh
- Stefan Behnel