What is the correct way to call tostring() but leave out the outermost tag?
Hi,

I'd like to get the text that tostring() would return, but without the outermost tag. For example,

    <foo>abc<em>x<u>y</u>z</em>123<h>f</h>uvw</foo>

should be returned as

    abc<em>x<u>y</u>z</em>123<h>f</h>uvw

I could call tostring() and then use a regex to remove the outermost tag. Alternatively, the code in the answer linked below could be customized, but that second approach feels like overkill for this problem. Does anybody know the best approach?

Thanks.

https://stackoverflow.com/questions/11677411/python-lxml-access-text?answert...

--
Regards,
Peng
Peng Yu writes:
    import lxml.html

    inner = 'abc<em>x<u>y</u>z</em>123<h>f</h>uvw'
    outer = '<foo>%s</foo>' % inner
    html = lxml.html.fromstring(outer.encode('utf-8'))
    # Leading text of <foo>, then each child serialized together with its tail.
    result = (html.text.encode('utf-8') +
              b''.join(lxml.html.tostring(child) for child in html.getchildren())
              ).decode('utf-8')
    assert result == inner

While removing the outer tag seems fundamentally incorrect to me, I have at least once had a good reason to do it.
On Sun, May 13, 2018 at 7:08 AM, Thomas Levine <_@thomaslevine.com> wrote:
I just realized that tostring() changes symbols like °. How can I strip just the outermost tag without changing anything in the inner text? Thanks.

    from lxml import etree
    tree = etree.XML('<foo>25/15°C <bar>abc</bar></foo>')
    print etree.tostring(tree)

The output of the above code is the following.

    <foo>25/15&#176;C <bar>abc</bar></foo>

--
Regards,
Peng
Peng Yu writes:
Check the lxml documentation for a way to run tostring() without it emitting XML/HTML entities. Alternatively, replace them afterwards; the html.entities module may be helpful, though I don't think it exists in Python 2.
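One way to do what this reply suggests is to pass encoding='unicode' to tostring(), which returns text instead of ASCII bytes and so has no need for character references. The following is a minimal sketch in Python 3, not code from the thread. The import fallback to the standard library's xml.etree.ElementTree is an assumption added so that the snippet also runs without lxml installed.

```python
# Prefer lxml when it is installed; the standard library's ElementTree
# accepts the same calls used here, so it serves as a fallback (assumption).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# \u00b0 is the degree sign from the example in the thread.
tree = etree.fromstring('<foo>25/15\u00b0C <bar>abc</bar></foo>')

# Default: ASCII-encoded bytes, so the degree sign is written as &#176;
ascii_bytes = etree.tostring(tree)

# encoding='unicode' returns a str and leaves the character untouched.
text = etree.tostring(tree, encoding='unicode')

print(ascii_bytes)  # b'<foo>25/15&#176;C <bar>abc</bar></foo>'
print(text)         # <foo>25/15°C <bar>abc</bar></foo>
```

Passing encoding='utf-8' is another option when bytes are wanted. With lxml that call still returns bytes, just without the character references.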
The Python 2 equivalent of html.entities is the htmlentitydefs module, which is documented on the htmllib page (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).

On 13 May 2018 at 13:15, Thomas Levine <_@thomaslevine.com> wrote:
On Sun, May 13, 2018 at 8:18 AM, Kev Dwyer <kevin.p.dwyer@gmail.com> wrote:
The Python 2 equivalent of html.entities is the htmlentitydefs module, which is documented on the htmllib page (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).

I can't find any working example code showing how to use htmlentitydefs for my case. Could you show me a working example? Thanks.

--
Regards,
Peng
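The thread does not include the requested example. The following is a hedged Python 3 sketch of the "replace them afterwards" idea using html.entities.name2codepoint. The function name restore_characters and the choice to leave markup-significant references such as &amp; and &lt; alone are assumptions made for illustration. In Python 2 the same mapping is in the htmlentitydefs module, and unichr() takes the place of chr().

```python
import re
from html.entities import name2codepoint

# References that must stay escaped, or the markup itself would change.
_MARKUP_SIGNIFICANT = {'amp', 'lt', 'gt', 'quot', 'apos'}

# Matches decimal (&#176;), hexadecimal (&#xB0;) and named (&deg;) references.
_REFERENCE = re.compile(r'&(#\d+|#[xX][0-9a-fA-F]+|\w+);')


def restore_characters(markup):
    """Replace references to non-ASCII characters with the characters."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith(('#x', '#X')):
            codepoint = int(ref[2:], 16)
        elif ref.startswith('#'):
            codepoint = int(ref[1:])
        elif ref in name2codepoint and ref not in _MARKUP_SIGNIFICANT:
            codepoint = name2codepoint[ref]
        else:
            return match.group(0)  # unknown or markup-significant: keep
        # ASCII references such as &#38; are kept so the markup stays valid.
        return chr(codepoint) if codepoint > 127 else match.group(0)
    return _REFERENCE.sub(replace, markup)


print(restore_characters('<foo>25/15&#176;C <bar>a &amp; b</bar></foo>'))
# <foo>25/15°C <bar>a &amp; b</bar></foo>
```

html.unescape() is shorter, but it also turns &amp; and &lt; back into & and <, which can corrupt the markup it is applied to.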
On Sat, May 12, 2018 at 9:58 PM, Peng Yu <pengyu.ut@gmail.com> wrote:
You don't need a regex. You can just take a string slice, html[i:j], after measuring the length of the opening and closing tags you expect. I would also assert that the first and last characters you're removing are what you expect.

Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None.

--Chris
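The suggestion above (clear the root's attributes, drop its tail, serialize, then slice off the tags you expect) might be sketched as follows in Python 3. The helper name inner_markup and the import fallback to the standard library are assumptions for illustration, and the helper modifies the element passed to it.

```python
# Prefer lxml when it is installed; the standard library's ElementTree
# accepts the same calls used here, so it serves as a fallback (assumption).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def inner_markup(root):
    """Serialize root and slice off its own start and end tags.

    Note: this clears the attributes and tail of the element it is given.
    """
    root.attrib.clear()   # so the start tag is exactly '<tag>'
    root.tail = None      # so nothing follows the end tag
    serialized = etree.tostring(root, encoding='unicode')
    start = '<%s>' % root.tag
    end = '</%s>' % root.tag
    # Fail loudly on anything unexpected, e.g. an empty root written as
    # '<foo/>' or a namespaced root.
    assert serialized.startswith(start), serialized
    assert serialized.endswith(end), serialized
    return serialized[len(start):len(serialized) - len(end)]


root = etree.fromstring('<foo id="1">25/15\u00b0C <bar>abc</bar></foo>')
print(inner_markup(root))  # 25/15°C <bar>abc</bar>
```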
I just meant something like--

    def strip_outer(html, tag):
        start = f'<{tag}>'
        end = f'</{tag}>'
        assert html.startswith(start)
        assert html.endswith(end)
        return html[len(start):-len(end)]

    html = '<foo>25/15°C <bar>abc</bar></foo>'
    print(strip_outer(html, tag='foo'))

It's a super naive approach, but it's fast, and whenever it returns a result that result is guaranteed to be correct (otherwise it errors out and lets you know).

--Chris

On Sun, May 13, 2018 at 5:02 AM, Peng Yu <pengyu.ut@gmail.com> wrote:
Chris Jerdonek writes:
How obvious that should have been! This makes things a bit neater than my previous version.

    import lxml.html

    inner = b'abc<em>x<u>y</u>z</em>123<h>f</h>uvw'
    outer = b'<foo>%s</foo>' % inner
    html = lxml.html.fromstring(outer)
    # Iterating over an element yields its children directly.
    result = html.text.encode('utf-8') + b''.join(map(lxml.html.tostring, html))
    assert result == inner
participants (4)
- Chris Jerdonek
- Kev Dwyer
- Peng Yu
- Thomas Levine