ElementTree.fromstring(unicode_html)

Sun Jan 27 13:35:53 EST 2008

globophobe wrote:

> In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
> 
> I need to turn this into an elementtree, but some of the data is
> japanese whereas the rest is html. This string contains a <br />.

where?  <br /> is an element, not a character.  "\r" and "\n" are 
characters, not elements.
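
To make the distinction concrete, here is a small sketch (the "data" root element and the sample strings are made up for illustration) showing that a line break inside text stays part of the text, while <br /> shows up as a child element:

```python
import xml.etree.ElementTree as ET

# a line break inside the text is just character data: no child elements
plain = ET.fromstring("<data>a\r\nb</data>")
print(len(plain))

# a <br /> is an element: it becomes a child, and the text that follows
# it is stored in that child's tail attribute
with_br = ET.fromstring("<data>a<br />b</data>")
print(len(with_br), with_br[0].tag, with_br.text, with_br[0].tail)
```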

If you want to build a tree where "\r\n" is replaced with a <br /> 
element, you can encode the string as UTF-8, use the replace method to 
insert the element, wrap the result in a root element (fromstring needs 
well-formed XML with a single root), and then call fromstring.
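
A minimal sketch of that approach follows. The "data" root element name is an arbitrary choice made here, and the code assumes the text contains no "<" or "&" characters that would need escaping first:

```python
import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# encode to UTF-8 bytes, then swap each line ending for a <br /> element
data = unicode_html.encode("utf-8").replace(b"\r\n", b"<br />")

# fromstring needs a single root element; "data" is an arbitrary name
elem = ET.fromstring(b"<data>" + data + b"</data>")

print(ET.tostring(elem))
```

Note that the trailing "\r\n" is replaced as well, so this tree ends up with two br children, the second with an empty tail.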

Alternatively, you can build the tree yourself:

     import xml.etree.ElementTree as ET

     unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

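     # splitlines() splits on "\r\n" and drops the line endings
     parts = unicode_html.splitlines()

     elem = ET.Element("data")
     # the text before the first <br /> goes in elem.text
     elem.text = parts[0]
     for part in parts[1:]:
         # the text following each <br /> is stored as that element's tail
         ET.SubElement(elem, "br").tail = part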

     print(ET.tostring(elem))

</F>