What is the correct way to call tostring() with the outermost tag removed?
Hi, I'd like to get the text that tostring() would return, but without the outermost tag.

Something like

<foo>abc<em>x<u>y</u>z</em>123<h>f</h>uvw</foo>

should be returned as

abc<em>x<u>y</u>z</em>123<h>f</h>uvw

I could do so by calling tostring() and then using a regex to remove the outermost tag. Alternatively, the code at the link below could be customized, but I feel that second approach is overkill for this problem. Does anybody know the best approach? Thanks.

https://stackoverflow.com/questions/11677411/python-lxml-access-text?answert...

-- Regards, Peng
Peng Yu writes:
Hi, I'd like to get the text that tostring() would return, but without the outermost tag.
Something like
<foo>abc<em>x<u>y</u>z</em>123<h>f</h>uvw</foo>
should be returned as
abc<em>x<u>y</u>z</em>123<h>f</h>uvw
import lxml.html

inner = 'abc<em>x<u>y</u>z</em>123<h>f</h>uvw'
outer = '<foo>%s</foo>' % inner
html = lxml.html.fromstring(outer.encode('utf-8'))
# Text before the first child, then each child serialized together with its tail.
result = (
    html.text.encode('utf-8') +
    b''.join(lxml.html.tostring(child) for child in html.getchildren())
).decode('utf-8')
assert result == inner

While removal of the outer tag seems fundamentally incorrect to me, I have at least once had a good reason to do this before.
On Sun, May 13, 2018 at 7:08 AM, Thomas Levine <_@thomaslevine.com> wrote:
Peng Yu writes:
Hi, I'd like to get the text that tostring() would return, but without the outermost tag.
Something like
<foo>abc<em>x<u>y</u>z</em>123<h>f</h>uvw</foo>
should be returned as
abc<em>x<u>y</u>z</em>123<h>f</h>uvw
import lxml.html

inner = 'abc<em>x<u>y</u>z</em>123<h>f</h>uvw'
outer = '<foo>%s</foo>' % inner
html = lxml.html.fromstring(outer.encode('utf-8'))
# Text before the first child, then each child serialized together with its tail.
result = (
    html.text.encode('utf-8') +
    b''.join(lxml.html.tostring(child) for child in html.getchildren())
).decode('utf-8')
assert result == inner
While removal of the outer tag seems fundamentally incorrect to me, I have at least once had a good reason to do this before.
I just realized that tostring() changes symbols like °. If I just want to strip the outermost tag without changing anything in the inner text, how do I do it? Thanks.

from lxml import etree
tree = etree.XML('<foo>25/15°C <bar>abc</bar></foo>')
print(etree.tostring(tree))

The output of the above code is the following.

<foo>25/15&#176;C <bar>abc</bar></foo>

-- Regards, Peng
Peng Yu writes:
I just realized that tostring() changes symbols like °. If I just want to strip the outermost tag without changing anything in the inner text, how do I do it? Thanks.
from lxml import etree
tree = etree.XML('<foo>25/15°C <bar>abc</bar></foo>')
print(etree.tostring(tree))
The output of the above code is the following.
<foo>25/15&#176;C <bar>abc</bar></foo>
Check the lxml documentation for a way to run tostring without XML/HTML entities. Alternatively, replace them afterwards; I don't think it's in Python 2, but the module html.entities may be helpful.
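The change Peng describes comes from the default ASCII serialization: a character that cannot be written in ASCII is emitted as a numeric character reference. Here is a minimal sketch, assuming Python 3, of asking tostring() for text instead of ASCII bytes. The try/except import is an assumption added only so the snippet also runs where lxml is not installed; the standard library's xml.etree.ElementTree accepts the same two calls.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib fallback, which supports the same calls used here.
    import xml.etree.ElementTree as etree

tree = etree.fromstring('<foo>25/15°C <bar>abc</bar></foo>')

# Default serialization: ASCII bytes, so the degree sign becomes &#176;.
ascii_bytes = etree.tostring(tree)

# encoding='unicode' returns a str and leaves non-ASCII characters alone.
text = etree.tostring(tree, encoding='unicode')

print(ascii_bytes)
print(text)
```

Markup characters such as & and < are still escaped in the second form, because they have to be for the result to remain well-formed.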
The Python 2 equivalent of html.entities is htmlentitydefs (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
On Sun, May 13, 2018 at 8:18 AM, Kev Dwyer <kevin.p.dwyer@gmail.com> wrote:
The Python 2 equivalent of html.entities is htmlentitydefs (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).
I can't find a working example of how to use htmlentitydefs for my case. Could you show me some working example code? Thanks.
-- Regards, Peng
Peng Yu writes:
On Sun, May 13, 2018 at 8:18 AM, Kev Dwyer <kevin.p.dwyer@gmail.com> wrote:
The Python 2 equivalent of html.entities is htmlentitydefs (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).
I can't find a working example of how to use htmlentitydefs for my case. Could you show me some working example code? Thanks.
Don't use htmlentitydefs (to replace the entities); what you have done is better.
On Sat, May 12, 2018 at 9:58 PM, Peng Yu <pengyu.ut@gmail.com> wrote:
Hi, I'd like to get the text that tostring() would return, but without the outermost tag.
Something like
<foo>abc<em>x<u>y</u>z</em>123<h>f</h>uvw</foo>
should be returned as
abc<em>x<u>y</u>z</em>123<h>f</h>uvw
I could do so by calling tostring() then use regex to remove the outermost tag.
You don't need a regex. You can just do a string slice html[i:j] after measuring the length of the opening and closing tags you expect. I would also do an assertion that the first and last characters that you're removing are what you expect. Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None. --Chris
Alternatively, the code below can be customized. But I feel the 2nd approach is an overkill for this problem. Does anybody know what is the best approach for this problem? Thanks.
https://stackoverflow.com/questions/11677411/python-lxml-access-text?answert...
-- Regards, Peng
On Sun, May 13, 2018 at 7:31 AM, Chris Jerdonek <chris.jerdonek@gmail.com> wrote:
On Sat, May 12, 2018 at 9:58 PM, Peng Yu <pengyu.ut@gmail.com> wrote:
Hi, I'd like to get the text that tostring() would return, but without the outermost tag.
Something like
<foo>abc<em>x<u>y</u>z</em>123<h>f</h>uvw</foo>
should be returned as
abc<em>x<u>y</u>z</em>123<h>f</h>uvw
I could do so by calling tostring() then use regex to remove the outermost tag.
You don't need a regex. You can just do a string slice html[i:j] after measuring the length of the opening and closing tags you expect. I would also do an assertion that the first and last characters that you're removing are what you expect. Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None.
Could you show me some working code? See also my reply to Thomas Levine. Thanks. -- Regards, Peng
I just meant something like:

def strip_outer(html, tag):
    start = f'<{tag}>'
    end = f'</{tag}>'
    # Fail loudly if the string does not have the expected outer tag.
    assert html.startswith(start)
    assert html.endswith(end)
    return html[len(start):-len(end)]

html = '<foo>25/15°C <bar>abc</bar></foo>'
print(strip_outer(html, tag='foo'))

It's a super naive approach, but it's fast, and it either returns the correct result or fails an assertion and lets you know.

--Chris
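Chris's earlier remark about clearing the root's attributes and tail can be folded into the same slicing idea. The following is only a sketch under stated assumptions: Python 3, no namespaces, and a non-empty element (an empty one serializes as a self-closing tag and trips the assertion). The name inner_markup is made up for this illustration, and the import fallback to the standard library is an assumption so the snippet runs without lxml.

```python
import copy

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib fallback, which supports the same calls used here.
    import xml.etree.ElementTree as etree


def strip_outer(markup, tag):
    """Slice off <tag> and </tag>, asserting that they are really there."""
    start = '<%s>' % tag
    end = '</%s>' % tag
    assert markup.startswith(start), 'unexpected opening tag'
    assert markup.endswith(end), 'unexpected closing tag'
    return markup[len(start):-len(end)]


def inner_markup(element):
    """Serialize a bare copy of element, then slice its outer tag off."""
    clone = copy.deepcopy(element)  # leave the caller's tree untouched
    clone.attrib.clear()            # so the start tag is exactly <tag>
    clone.tail = None               # so nothing follows the end tag
    markup = etree.tostring(clone, encoding='unicode')
    return strip_outer(markup, clone.tag)


doc = etree.fromstring('<doc><foo id="1">25/15°C <bar>abc</bar></foo>tail</doc>')
print(inner_markup(doc[0]))
```

Working on a copy costs a little time but means the attribute and tail cleanup never leaks into the document being processed.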
Chris Jerdonek writes:
You don't need a regex. You can just do a string slice html[i:j] after measuring the length of the opening and closing tags you expect. I would also do an assertion that the first and last characters that you're removing are what you expect. Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None.
How obvious that should have been! This makes things a bit neater than my previous version.

import lxml.html

inner = b'abc<em>x<u>y</u>z</em>123<h>f</h>uvw'
outer = b'<foo>%s</foo>' % inner
html = lxml.html.fromstring(outer)
# Iterating over an element yields its children; tostring keeps each child's tail.
result = html.text.encode('utf-8') + b''.join(map(lxml.html.tostring, html))
assert result == inner
participants (4)
- Chris Jerdonek
- Kev Dwyer
- Peng Yu
- Thomas Levine