Is there a HTML parser who can reconstruct the original html EXACTLY?

A.T.Hofkamp hat at se-162.se.wtb.tue.nl
Wed Jan 23 11:54:35 EST 2008


On 2008-01-23, kliu <ioscas at gmail.com> wrote:
> On Jan 23, 7:39 pm, "A.T.Hofkamp" <h... at se-162.se.wtb.tue.nl> wrote:
>> On 2008-01-23, ios... at gmail.com <ios... at gmail.com> wrote:
>>
>> >     Hi, I am looking for a HTML parser who can parse a given page into
>> > a DOM tree,  and can reconstruct the exact original html sources.
>>
>> Why not keep a copy of the original data instead?
>>
>> That would be VERY MUCH SIMPLER than trying to reconstruct a parsed tree back
>> to original source text.
>
> Thank you for your reply, but what I really need is the mapping between
> each DOM node and
> the corresponding original source segment.

Why do you think there is a simple one-to-one relation between nodes in some
abstract DOM tree, and pieces of source? For example, the outermost tag
<HTML>...</HTML> is not an explicit point in the tree. If it is, what piece
of source should be attached to it? Everything? Just the text before and after
it? If so, what about the source text of the second tag? Last but not least,
what do you intend to do with the source-text before the <HTML> and after
the </HTML> tags?

In other words, you are going to have a huge problem deciding what
"corresponding original source segment" means for each tag. This is exactly the
reason why current tools do not do what you want.

If you really want this, you probably have to do it yourself mostly from
scratch (i.e. starting with a parsing framework and writing a custom parser
yourself). That usually boils down to attaching source text to tokens in the
lexical parsing phase. If you have a good understanding of the meaning of
"corresponding original source segment", AND you have perfect HTML, this is
doable, but not very nice.
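
To make "attaching source text to tokens" concrete, here is a minimal sketch
using the standard library's html.parser. It is only an illustration under
strong assumptions: the input is well-formed, it contains only start tags, end
tags and plain text (no comments, doctype or entities), and the names
SpanRecorder and tokenize_with_spans are made up for this example. It relies
on getpos() reporting the position where the current token starts.

```python
from html.parser import HTMLParser


class SpanRecorder(HTMLParser):
    """Record (kind, name, start, end) source offsets for each token."""

    def __init__(self, source):
        # Keep character references unconverted so that the length of
        # each data chunk matches the length of its source text.
        super().__init__(convert_charrefs=False)
        self.source = source
        # Offset of the first character of every line, so the
        # (line, column) pair from getpos() can become a flat offset.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.tokens = []

    def _offset(self):
        line, col = self.getpos()  # line is 1-based, col is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        start = self._offset()
        end = start + len(self.get_starttag_text())
        self.tokens.append(("start", tag, start, end))

    def handle_endtag(self, tag):
        start = self._offset()
        # Assumption: the end tag is closed by the next '>' in the source.
        end = self.source.index(">", start) + 1
        self.tokens.append(("end", tag, start, end))

    def handle_data(self, data):
        start = self._offset()
        self.tokens.append(("data", None, start, start + len(data)))


def tokenize_with_spans(source):
    """Return the token list for source (hypothetical helper)."""
    recorder = SpanRecorder(source)
    recorder.feed(source)
    recorder.close()
    return recorder.tokens


if __name__ == "__main__":
    text = "<p>Hi <b>there</b></p>\n"
    for kind, name, start, end in tokenize_with_spans(text):
        print(kind, name, repr(text[start:end]))
```

Concatenating source[start:end] over all tokens gives back the input for such
simple documents. Deciding which of these token spans belong to which tree
node is the part that still needs a definition.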

There exist parsers that can do what you want IF YOU HAVE PERFECT HTML, but
using those tools implies a very steep learning curve of about 2-3 months under
the assumption that you know functional languages (if you don't, add another
2-3 months or so of steep learning curve :) ).


If you don't have perfect HTML, you are probably more or less lost. Most tools
cannot deal with that situation, and those that can do so by smart
re-shuffling to make things parsable, which means there is really no
one-to-one mapping any more (after re-shuffling).
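
As a rough illustration of where imperfect HTML stops fitting a tree, the
sketch below uses a plain stack to flag end tags that do not close the
innermost open element. The names NestingChecker, find_misnested and the
VOID set are invented for this example, and the check is much cruder than
what a real error-recovering parser does.

```python
from html.parser import HTMLParser

# Elements that never have an end tag (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Flag end tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # A tree builder has to re-shuffle here; getpos() gives the
            # (line, column) of the offending end tag.
            self.problems.append((tag, self.getpos()))


def find_misnested(source):
    """Return a list of (tag, (line, column)) for mismatched end tags."""
    checker = NestingChecker()
    checker.feed(source)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    print(find_misnested("<b><i>text</i></b>"))
    print(find_misnested("<b><i>text</b></i>"))
```

For the second input the </b> arrives while <i> is still open, so any tree
built from it no longer corresponds to contiguous pieces of the source.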


In other words, I think you really don't want what you want, at least not in
the way that you consider now.


Please give us information about your goal, so we can think about alternative
approaches to solve your problem.

Sincerely,
Albert



