Mailman 3 What is the correct way to tostring() except that the outer most tag should be removed? - lxml - The Python XML Toolkit

What is the correct way to tostring() except that the outer most tag should be removed?

Peng Yu

13 May 2018

4:58 a.m.

Hi,

I'd like to get the text that tostring() would return, but without the outermost tag. Something like

<foo>abcxyz123<h>f</h>uvw</foo>

should be returned as

abcxyz123<h>f</h>uvw

I could do this by calling tostring() and then using a regex to remove the outermost tag. Alternatively, the code at the link below could be customized, but I feel that the second approach is overkill for this problem. Does anybody know the best approach? Thanks.

https://stackoverflow.com/questions/11677411/python-lxml-access-text?answert...

-- Regards, Peng

Thomas Levine

13 May

11:08 a.m.

New subject: What is the correct way to tostring() except that the outer most tag should be removed?

Peng Yu writes:

...

Hi, I'd like to get the text that would be got from tostring() except removing the outmost tag.

Something like

<foo>abcxyz123<h>f</h>uvw</foo>

should be returned as

abcxyz123<h>f</h>uvw

import lxml.html

inner = 'abcxyz123<h>f</h>uvw'
outer = '<foo>%s</foo>' % inner
html = lxml.html.fromstring(outer.encode('utf-8'))
# Leading text of the root, then each child serialized together with its tail text.
result = (html.text.encode('utf-8') +
          b''.join(lxml.html.tostring(child) for child in html.getchildren())
          ).decode('utf-8')
assert result == inner

While removal of the outer tag seems fundamentally incorrect to me, I have at least once had a good reason to do this before.

Peng Yu

12:01 p.m.

On Sun, May 13, 2018 at 7:08 AM, Thomas Levine <_@thomaslevine.com> wrote:

...

Peng Yu writes:

...
Hi, I'd like to get the text that would be got from tostring() except removing the outmost tag.

Something like

<foo>abcxyz123<h>f</h>uvw</foo>

should be returned as

abcxyz123<h>f</h>uvw

inner = 'abcxyz123<h>f</h>uvw'
outer = '<foo>%s</foo>' % inner
html = lxml.html.fromstring(outer.encode('utf-8'))
result = (html.text.encode('utf-8') +
          b''.join(lxml.html.tostring(child) for child in html.getchildren())
          ).decode('utf-8')
assert result == inner

While removal of the outer tag seems fundamentally incorrect to me, I have at least once had a good reason to do this before.

I just realized that tostring() changes symbols like °. If I just want to strip the outermost tag without changing anything in the inner text, how do I do it? Thanks.

from lxml import etree
tree = etree.XML('<foo>25/15°C <bar>abc</bar></foo>')
print etree.tostring(tree)

The output of the above code is the following.

<foo>25/15&#176;C <bar>abc</bar></foo>

-- Regards, Peng

Thomas Levine

12:15 p.m.

Peng Yu writes:

...

I just realized that tostring() changes symbols like °. If I just want to strip the outermost tag without changing anything in the inner text, how do I do it? Thanks.

from lxml import etree
tree = etree.XML('<foo>25/15°C <bar>abc</bar></foo>')
print etree.tostring(tree)

The output of the above code is the following.

<foo>25/15&#176;C <bar>abc</bar></foo>

Check the lxml documentation for a way to run tostring without XML/HTML entities. Alternatively, replace them afterwards; I don't think it's in Python 2, but the module html.entities may be helpful.
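Replacing the references afterwards could look roughly like the sketch below. It is written for Python 3 and uses only the standard library. The helper name decode_non_ascii_refs is hypothetical, and the code assumes the serializer emits decimal references of the form &#176; and nothing else that needs decoding. References to ASCII characters are deliberately left alone, because turning something like &#38; back into a bare & would break the markup.

```python
import re

# Decimal character references such as &#176;
_DECIMAL_REF = re.compile(r"&#(\d+);")


def decode_non_ascii_refs(markup):
    """Replace decimal references to non-ASCII characters with the characters.

    Hypothetical helper for illustration only.
    """
    def replace(match):
        code_point = int(match.group(1))
        # Keep ASCII references (for example &#38;) and out-of-range values.
        if 127 < code_point <= 0x10FFFF:
            return chr(code_point)
        return match.group(0)

    return _DECIMAL_REF.sub(replace, markup)


serialized = "<foo>25/15&#176;C <bar>abc</bar></foo>"
print(decode_non_ascii_refs(serialized))
```

Named entities such as &amp; and &lt; are not touched at all, so the result is still well-formed markup.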

Kev Dwyer

12:18 p.m.

The python2 equivalent of html.entities is htmllib.entitydefs (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).

On 13 May 2018 at 13:15, Thomas Levine <_@thomaslevine.com> wrote:

...

Peng Yu writes:

...
I just realized that tostring() changes symbols like °. If I just want to strip the outermost tag without changing anything in the inner text, how do I do it? Thanks.

from lxml import etree
tree = etree.XML('<foo>25/15°C <bar>abc</bar></foo>')
print etree.tostring(tree)

The output of the above code is the following.

<foo>25/15&#176;C <bar>abc</bar></foo>

Check the lxml documentation for a way to run tostring without XML/HTML entities. Alternatively, replace them afterwards; I don't think it's in Python 2, but the module html.entities may be helpful.

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml

Peng Yu

12:23 p.m.

On Sun, May 13, 2018 at 8:18 AM, Kev Dwyer <kevin.p.dwyer@gmail.com> wrote:

...

The python2 equivalent of html.entities is htmllib.entitydefs (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).

I can't find working example code that shows how to use htmllib.entitydefs for my case. Could you show me some working example code? Thanks.

-- Regards, Peng

Thomas Levine

14 May

11:33 a.m.

Peng Yu writes:

...

On Sun, May 13, 2018 at 8:18 AM, Kev Dwyer <kevin.p.dwyer@gmail.com> wrote:

...
The python2 equivalent of html.entities is htmllib.entitydefs (https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs).

I don't find working example code on how to use htmllib.entitydefs for my case. Could you show me some working example code? Thanks.

Don't use htmllib.entitydefs (to replace the entities); what you have done is better.

Peng Yu

13 May

12:27 p.m.

...

Check the lxml documentation for a way to run tostring without XML/HTML entities.

OK, I found it. Thanks.

from lxml import etree
tree = etree.XML('<foo>25/15°C <bar>abc</bar></foo>')
print etree.tostring(tree, encoding='utf-8')

-- Regards, Peng
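For readers on Python 3, the effect of the encoding argument can be checked with a sketch along these lines. It is an illustration rather than code from the thread, and the fallback to xml.etree.ElementTree is an assumption added so that it runs without lxml installed.

```python
try:
    from lxml import etree  # the toolkit discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

tree = etree.fromstring("<foo>25/15°C <bar>abc</bar></foo>")

# Default: ASCII-only bytes, so ° is written as a numeric character reference.
ascii_bytes = etree.tostring(tree)

# UTF-8 bytes: ° is written as the character itself.
utf8_bytes = etree.tostring(tree, encoding="utf-8")

# "unicode": a str instead of bytes, again with ° left as it is.
text = etree.tostring(tree, encoding="unicode")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
print(text)
```

The "unicode" variant is convenient when the result is processed further as text, for example when the outer tag is sliced off.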

Chris Jerdonek

11:31 a.m.

On Sat, May 12, 2018 at 9:58 PM, Peng Yu <pengyu.ut@gmail.com> wrote:

...

Hi, I'd like to get the text that would be got from tostring() except removing the outmost tag.

Something like

<foo>abcxyz123<h>f</h>uvw</foo>

should be returned as

abcxyz123<h>f</h>uvw

I could do so by calling tostring() then use regex to remove the outermost tag.

You don't need a regex. You can just do a string slice html[i:j] after measuring the length of the opening and closing tags you expect. I would also do an assertion that the first and last characters that you're removing are what you expect. Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None.

--Chris

...

Alternatively, the code below can be customized. But I feel the 2nd approach is an overkill for this problem. Does anybody know what is the best approach for this problem? Thanks.

https://stackoverflow.com/questions/11677411/python-lxml-access-text?answert...

-- Regards, Peng

Peng Yu

12:02 p.m.

On Sun, May 13, 2018 at 7:31 AM, Chris Jerdonek <chris.jerdonek@gmail.com> wrote:

...

On Sat, May 12, 2018 at 9:58 PM, Peng Yu <pengyu.ut@gmail.com> wrote:

...
Hi, I'd like to get the text that would be got from tostring() except removing the outmost tag.

Something like

<foo>abcxyz123<h>f</h>uvw</foo>

should be returned as

abcxyz123<h>f</h>uvw

I could do so by calling tostring() then use regex to remove the outermost tag.

You don't need a regex. You can just do a string slice html[i:j] after measuring the length of the opening and closing tags you expect. I would also do an assertion that the first and last characters that you're removing are what you expect. Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None.

Could you show me some working code? See also my reply to Thomas Levine. Thanks. -- Regards, Peng

Chris Jerdonek

9:15 p.m.

I just meant something like--

def strip_outer(html, tag):
    start = f'<{tag}>'
    end = f'</{tag}>'
    assert html.startswith(start)
    assert html.endswith(end)
    return html[len(start):-len(end)]

html = '<foo>25/15°C <bar>abc</bar></foo>'
print(strip_outer(html, tag='foo'))

It's a super naive approach, but it's fast and is guaranteed to work correctly when it does (erroring out and letting you know otherwise).

--Chris

On Sun, May 13, 2018 at 5:02 AM, Peng Yu <pengyu.ut@gmail.com> wrote:

...

On Sun, May 13, 2018 at 7:31 AM, Chris Jerdonek <chris.jerdonek@gmail.com> wrote:

...
On Sat, May 12, 2018 at 9:58 PM, Peng Yu <pengyu.ut@gmail.com> wrote:

...
Hi, I'd like to get the text that would be got from tostring() except removing the outmost tag.

Something like

<foo>abcxyz123<h>f</h>uvw</foo>

should be returned as

abcxyz123<h>f</h>uvw

I could do so by calling tostring() then use regex to remove the outermost tag.

You don't need a regex. You can just do a string slice html[i:j] after measuring the length of the opening and closing tags you expect. I would also do an assertion that the first and last characters that you're removing are what you expect. Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None.

Could you show me some working code? See also my reply to Thomas Levine. Thanks.

-- Regards, Peng
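The slicing helper and the earlier remark about clearing the root's attributes and tail can be combined when starting from a parsed element instead of a string. The following is a sketch under stated assumptions, not code from the thread: strip_outer_tag is a hypothetical name, the element is copied so that the caller's tree is not modified, a ValueError replaces the assertions, and the fallback to xml.etree.ElementTree is there only so the example runs without lxml.

```python
import copy

try:
    from lxml import etree  # the toolkit discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree


def strip_outer_tag(element):
    """Serialize element and slice off its start and end tags.

    Hypothetical helper; raises ValueError when the serialized form does not
    look like <tag>...</tag> (for example an empty or namespaced element).
    """
    clone = copy.deepcopy(element)  # leave the caller's tree untouched
    clone.attrib.clear()            # so the start tag is exactly <tag>
    clone.tail = None               # so nothing follows the end tag
    xml = etree.tostring(clone, encoding="unicode")
    start = "<%s>" % clone.tag
    end = "</%s>" % clone.tag
    if not (xml.startswith(start) and xml.endswith(end)):
        raise ValueError("unexpected serialization: %r" % xml)
    return xml[len(start):-len(end)]


root = etree.fromstring('<foo id="1">25/15°C <bar>abc</bar></foo>')
print(strip_outer_tag(root))
```

As with the string version, the function fails loudly instead of returning a wrong result when the serialized form is not what it expects.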

Thomas Levine

12:10 p.m.

Chris Jerdonek writes:

...

You don't need a regex. You can just do a string slice html[i:j] after measuring the length of the opening and closing tags you expect. I would also do an assertion that the first and last characters that you're removing are what you expect. Depending on the specifics, it might be necessary for you to clear the attributes of the root element and/or set the tail to None.

How obvious that should have been! This makes things a bit neater than my previous version.

import lxml.html

inner = b'abcxyz123<h>f</h>uvw'
outer = b'<foo>%s</foo>' % inner
html = lxml.html.fromstring(outer)
# Iterating over an element yields its children.
result = html.text.encode('utf-8') + b''.join(map(lxml.html.tostring, html))
assert result == inner

11 comments

participants (4)

Chris Jerdonek
Kev Dwyer
Peng Yu
Thomas Levine