Question regarding splitting documents
Hi folks,

I'm attaching a small sample program. My intent is to split the HTML snippet into smaller HTML documents using the <h4> tags as the splitting points. Any clues?

Thanks,
/PA

--
Questions are not there to be answered, questions are there to be asked.
Georg Kreisler
Hi,

> Any clues?

Something along those lines should work:

#-------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

snippet = """<html><head></head><body>
<h1>Title</h1>
<h2>Author</h2>
<h3>Part 1</h3>
<h4>Chapter 1</h4>
<p>Lore ipsum1</p>
<p>Lore ipsum2</p>
<p>Lore ipsum3</p>
<p>Lore ipsum4</p>
<p>Lore ipsum5</p>
<h4>Chapter 2</h4>
<p>Lore ipsum6</p>
<p>Lore ipsum7</p>
<p>Lore ipsum8</p>
<p>Lore ipsum9</p>
<p>Lore ipsum10</p>
<h4>Chapter 3</h4>
<p>Lore ipsum11</p>
<p>Lore ipsum12</p>
<p>Lore ipsum13</p>
<p>Lore ipsum14</p>
<p>Lore ipsum15</p>
<h4>Chapter 4</h4>
<h4>Chapter 5</h4>
<p>Lore ipsum16</p>
</body></html>
"""

import lxml.html
from lxml import etree
from lxml.builder import E


def mk_chunk(parent, start, end):
    # Move the children from start up to (but excluding) end
    # out of parent and into a new <body> element.
    result = E.body()
    start_index = parent.index(start)
    if end is None:
        end_index = len(parent)
    else:
        end_index = parent.index(end)
    result.extend(parent[start_index:end_index])
    return result


html = lxml.html.fromstring(snippet)
body = html[1]
print("body has %d elements" % len(body))

# not interested in first h4 elem for splitting
h4_elems = body.xpath('./h4')[1:]
# the original html document will be shortened in-place
docs = [html]
while h4_elems:
    start_elem = h4_elems.pop(0)
    try:
        end_elem = h4_elems[0]
    except IndexError:
        # last chunk: runs to the end of body
        end_elem = None
    docs.append(mk_chunk(body, start_elem, end_elem))

for doc in docs:
    print(etree.tostring(doc, pretty_print=True))
#-------------------------------------------------------

Your exact requirements aren't fully clear (e.g. do you need the html head and the h1-h3 headers in all the small docs?), but this should be easy enough to adapt.

Of course, there are other solutions, like using xpath() from the h4 header elements to get at the wanted siblings, or whatnot...

Holger
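One way the sibling-based alternative mentioned above might look is sketched below. It is only an illustration, not Holger's code: the helper name `split_on_h4` and the shortened snippet are made up for this example, and it assumes that copying the elements (so the original document stays intact) is acceptable. It walks from each h4 over its following siblings with lxml's `itersiblings()` and stops at the next h4.

```python
from copy import deepcopy

import lxml.html
from lxml import etree
from lxml.builder import E

# Shortened, hypothetical version of the snippet from the thread.
snippet = """<html><head></head><body>
<h1>Title</h1>
<h4>Chapter 1</h4>
<p>Lore ipsum1</p>
<p>Lore ipsum2</p>
<h4>Chapter 2</h4>
<h4>Chapter 3</h4>
<p>Lore ipsum3</p>
</body></html>"""


def split_on_h4(body):
    """Return one new <body> per <h4>, holding copies of the h4
    and of every sibling that follows it up to the next h4."""
    chunks = []
    for h4 in body.xpath('./h4'):
        chunk = E.body(deepcopy(h4))
        for sibling in h4.itersiblings():
            if sibling.tag == 'h4':
                break
            # deepcopy leaves the original element in place
            chunk.append(deepcopy(sibling))
        chunks.append(chunk)
    return chunks


html = lxml.html.fromstring(snippet)
body = html.find('body')
chunks = split_on_h4(body)
for chunk in chunks:
    print(etree.tostring(chunk, encoding='unicode'))
```

Because nothing is moved, the indices of the elements in the original body never change while the chunks are built. Anything that precedes the first h4 (here the h1) is not part of any chunk and would need separate handling.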
Hi Holger,

thanks a ton. This is exactly what I want. What I do not understand is why I can't access the elements in body as body[x:y] directly... That would be so convenient...

Best,
/PA

On 4 May 2016 at 12:50, Holger Joukl <Holger.Joukl@lbbw.de> wrote:
Hi Stefan,

More context on my setup:

- python3 on MacOSX. Latest DMG from python.org
- paag:tmp paag$ python3
  Python 3.5.1 (v3.5.1:37a07cee5969, Dec 5 2015, 21:12:44)
  [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> quit()
- lxml installed with pip3 install lxml at the beginning of this week.

Attached is the modified version of my sample script. When I execute it, I get the following output:

body has 24 elements
list of h4 elements
[<Element h4 at 0x1039389f8>, <Element h4 at 0x103938a48>, <Element h4 at 0x103938a98>, <Element h4 at 0x103938ae8>, <Element h4 at 0x103938b38>]
[3, 9, 15, 21, 22]
modified list of h4
[<Element h1 at 0x1039389f8>, <Element h4 at 0x103938a48>, <Element h4 at 0x103938a98>, <Element h4 at 0x103938ae8>, <Element h4 at 0x103938b38>, None]
[0, 9, 15, 21, 22, 24]
<Element body at 0x1039389a8>[0:9] <<-- ?? should be 0:9
b'<body><h1>Title</h1>\n<h2>Author</h2>\n<h3>Part 1</h3>\n<h4>Chapter 1</h4>\n<p>Lore ipsum1</p>\n<p>Lore ipsum2</p>\n<p>Lore ipsum3</p>\n<p>Lore ipsum4</p>\n<p>Lore ipsum5</p>\n</body>'
<Element body at 0x1039389a8>[0:6] <<-- ?? should be 9:15
b'<body><h4>Chapter 2</h4>\n<p>Lore ipsum6</p>\n<p>Lore ipsum7</p>\n<p>Lore ipsum8</p>\n<p>Lore ipsum9</p>\n<p>Lore ipsum10</p>\n</body>'
<Element body at 0x1039389a8>[0:6] <<-- ?? should be 15:21
b'<body><h4>Chapter 3</h4>\n<p>Lore ipsum11</p>\n<p>Lore ipsum12</p>\n<p>Lore ipsum13</p>\n<p>Lore ipsum14</p>\n<p>Lore ipsum15</p>\n</body>'
<Element body at 0x1039389a8>[0:1] <<-- ?? should be 21:22
b'<body><h4>Chapter 4</h4>\n</body>'
<Element body at 0x1039389a8>[0:3] <<-- ?? should be 22:24
b'<body><h4>Chapter 5</h4>\n<p>Lore ipsum16</p>\n</body>'

I'm really puzzled by the indices I get within the mk_chunk function and why they aren't the indices I get in the main function. Both refer to body, right? 😵

/PA

On 5 May 2016 at 10:40, Stefan Behnel <stefan_ml@behnel.de> wrote:
Pedro Andres Aranda Gutierrez wrote on 05.05.2016 at 11:32:
The problem lies in this code:

    result = E.body()
    result.extend(parent[findex:lindex])

Here, you remove the elements from the parent and append them to another parent element. This changes the content of the parent and thus the indices of its children. See the explanation here:

http://lxml.de/tutorial.html#elements-are-lists

Stefan
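The effect Stefan describes can be reproduced with a few lines. The sketch below is only an illustration under assumptions: the tiny document and the names `moved` and `copied` are made up here and are not from Pedro's script. It shows that slicing alone only reads the children, that `extend()` then moves them so the index of a remaining child shifts, and that copying with `deepcopy` is one possible way to keep the parent unchanged.

```python
from copy import deepcopy

import lxml.html
from lxml.builder import E

# Hypothetical miniature of the document from the thread.
html = lxml.html.fromstring(
    "<html><body><h4>A</h4><p>1</p><p>2</p><h4>B</h4><p>3</p></body></html>")
body = html.find('body')
second_h4 = body.xpath('./h4')[1]

# Slicing returns a plain list of the same elements; body is untouched.
first_part = body[0:3]
print(type(first_part), len(body))

index_before = body.index(second_h4)

# extend() moves the sliced elements into the new parent,
# because an lxml element can only have one parent at a time.
moved = E.body()
moved.extend(first_part)

index_after = body.index(second_h4)
print(index_before, index_after, len(body))

# One possible alternative: copy the elements instead of moving them.
copied = E.body()
copied.extend([deepcopy(el) for el in body[:]])
print(len(copied), len(body))
```

This mirrors the output in the thread: indices computed before any chunk was moved (such as 9:15) no longer match once the earlier elements have left the body, which is why the lookups inside mk_chunk report 0:6 instead.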
participants (3)
- Holger Joukl
- Pedro Andres Aranda Gutierrez
- Stefan Behnel