Help Parsing an HTML File
Paul McGuire
ptmcg at austin.rr.com
Sat Feb 16 23:06:52 EST 2008
On Feb 15, 3:28 pm, egonslo... at gmail.com wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
Pyparsing was mentioned earlier, so here is a sample with some
explanatory comments.
I'm a little worried when you say the file "fairly resembles
HTML-EXAMPLE." With parsers, the devil is in the details, and if you
have scrambled this format (the HTML attributes are especially
suspicious), then the parser will need to be adjusted to match the
real input.
If the file being parsed really has proper HTML attributes (of the
form <tag attrname="attrvalue">), then you could simplify the code to
use the pyparsing method makeHTMLTags. But the example I wrote
matches the example you posted.
-- Paul
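For the case where the real file does carry proper attributes, the
makeHTMLTags simplification mentioned above might look roughly like the
sketch below. The input string and the attribute name "name" are
assumptions made up for illustration; they are not taken from the
posted file. Newer pyparsing releases also spell this helper
make_html_tags.

```python
from pyparsing import makeHTMLTags, SkipTo

# Hypothetical input: unlike the posted sample, the second <div> carries
# a proper name="..." attribute.
sample = '<div>PinkDIV2-2</div><div name="segment1">PinkSegmentDIV1-2</div>'

# makeHTMLTags returns a pair of expressions for the opening and closing tag.
# The opening-tag expression exposes any attributes as named results.
div_start, div_end = makeHTMLTags("div")
div_entry = div_start + SkipTo(div_end)("body") + div_end

pairs = []
for tokens, _start, _end in div_entry.scanString(sample):
    # Fall back to "DIV" when the tag has no name attribute.
    name = tokens.get("name")
    label = name.title() if name else "DIV"
    pairs.append((label, tokens["body"]))

print(pairs)
```

The full sample that follows keeps the hand-written tag definitions,
because the posted input uses a bare quoted string rather than a named
attribute.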
# encoding=utf-8
from pyparsing import *
data = """
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
... snip ...
"""
# define <XXX> and </XXX> tags
CL = CaselessLiteral
h1,h2,cmnt,br = \
    map(Suppress,
        map(CL,["<%s>" % s for s in "h1 h2 comment br".split()]))
h1end,h2end,cmntEnd,divEnd = \
    map(Suppress,
        map(CL,["</%s>" % s for s in "h1 h2 comment div".split()]))
# h1,h1end = makeHTMLTags("h1")
# define special format for <div>, incl. optional quoted string "attribute"
div = "<" + CL("div") + Optional(QuotedString('"'))("name") + ">"
div.setParseAction(
    lambda toks: "name" in toks and toks.name.title() or "DIV")
# define <xxx>body</xxx> entries
h1Entry = h1 + SkipTo(h1end) + h1end
h2Entry = h2 + SkipTo(h2end) + h2end
comment = cmnt + SkipTo(cmntEnd) + cmntEnd
divEntry = div + SkipTo(divEnd) + divEnd
# just return nested tokens
grammar = (OneOrMore(Group(h1Entry +
            (Group(h2Entry +
              (OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)
results = grammar.parseString(data)
from pprint import pprint
pprint(results.asList())
print
# return nested tokens, with dict
grammar = Dict(OneOrMore(Group( h1Entry +
            Dict(Group(h2Entry +
              Dict(OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)
results = grammar.parseString(data)
print results.dump()
Prints:
[['Ros\xe9H1-1',
['Ros\xe9H2-1',
['DIV', 'Ros\xe9DIV-1'],
['Segment1', 'Ros\xe9SegmentDIV1-1'],
['Segment2', 'Ros\xe9SegmentDIV2-1'],
['Segment3', 'Ros\xe9SegmentDIV3-1']]],
['PinkH1-2',
['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1',
'PinkSegmentDIV1-2']]],
['BlackH1-3',
['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1',
'BlackSegmentDIV1-3']]],
['YellowH1-4',
['YellowH2-4',
['DIV', 'YellowDIV2-4'],
['Segment1', 'YellowSegmentDIV1-4'],
['Segment2', 'YellowSegmentDIV2-4']]]]
[['Ros\xe9H1-1', ['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'], ['Segment1',
'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'],
['Segment3', 'Ros\xe9SegmentDIV3-1']]], ['PinkH1-2', ['PinkH2-2',
['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]],
['BlackH1-3', ['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1',
'BlackSegmentDIV1-3']]], ['YellowH1-4', ['YellowH2-4', ['DIV',
'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2',
'YellowSegmentDIV2-4']]]]
- BlackH1-3: [['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1',
'BlackSegmentDIV1-3']]]
- BlackH2-3: [['DIV', 'BlackDIV2-3'], ['Segment1',
'BlackSegmentDIV1-3']]
- DIV: BlackDIV2-3
- Segment1: BlackSegmentDIV1-3
- PinkH1-2: [['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1',
'PinkSegmentDIV1-2']]]
- PinkH2-2: [['DIV', 'PinkDIV2-2'], ['Segment1',
'PinkSegmentDIV1-2']]
- DIV: PinkDIV2-2
- Segment1: PinkSegmentDIV1-2
- RoséH1-1: [['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'], ['Segment1',
'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'],
['Segment3', 'Ros\xe9SegmentDIV3-1']]]
- RoséH2-1: [['DIV', 'Ros\xe9DIV-1'], ['Segment1',
'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'],
['Segment3', 'Ros\xe9SegmentDIV3-1']]
- DIV: RoséDIV-1
- Segment1: RoséSegmentDIV1-1
- Segment2: RoséSegmentDIV2-1
- Segment3: RoséSegmentDIV3-1
- YellowH1-4: [['YellowH2-4', ['DIV', 'YellowDIV2-4'], ['Segment1',
'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]]
- YellowH2-4: [['DIV', 'YellowDIV2-4'], ['Segment1',
'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]
- DIV: YellowDIV2-4
- Segment1: YellowSegmentDIV1-4
- Segment2: YellowSegmentDIV2-4
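The keyed entries in the dump above come from wrapping grouped pairs in
Dict: the first token of each group becomes the key for the rest. A
minimal sketch of that mechanism on its own, using a made-up
"key: value" input rather than the HTML-like file:

```python
from pyparsing import Dict, Group, OneOrMore, Suppress, Word, alphanums

# Hypothetical "key: value" text, chosen only to illustrate Dict.
text = "DIV: PinkDIV2-2 Segment1: PinkSegmentDIV1-2"

key = Word(alphanums)
value = Word(alphanums + "-")
pair = Group(key + Suppress(":") + value)

# Dict uses the first token of each Group as the key for that group.
table = Dict(OneOrMore(pair))
result = table.parseString(text)

div_text = result["DIV"]            # lookup by key
segment_text = result["Segment1"]
as_list = result.asList()           # the nested-list view is still available
print(result.dump())
```

The same result object can be read either as nested lists or by key,
which is why the second grammar can print both forms.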