HTML Parser which allows low-keyed local changes (upon serialization)
no-spam at non-existing.invalid
Mon Feb 1 14:54:59 CET 2010
> Stefan Behnel wrote:
>> Robert, 31.01.2010 20:57:
>>> I tried lxml, but after walking and making changes in the element tree,
>>> I'm forced to do a full serialization of the whole document
>>> (etree.tostring(tree)) - which destroys the "human edited" format of the
>>> original HTML code and makes it rather unreadable.
>> What do you mean? Could you give an example? lxml certainly does not
>> destroy anything it parsed, unless you tell it to do so.
> Of course it does not destroy anything during parsing. (?)
> I mean: I want to walk through the parsed HTML tree with a Python script
> and modify things here and there (automatic alt attributes from a DB or
> similar, link corrections, text sections/translated sentences, ... as a
> result of HTML code and content checks).
> Then I want to output the changed tree, but as close to the original
> format as possible. No changes to my whitespace indentation,
> etc. Only local changes, where tags were really changed.
> That is similar to what a good HTML editor does: after you make
> small changes, it doesn't reformat and re-emit your whole code layout
> from the tree/attribute logic alone. You get local changes only.
> But a simple HTML editor like the one in Mozilla SeaMonkey outputs
> entirely new HTML: it generates the HTML from the logical tree only (in
> its own (ugly) style), destroys my whitespace layout and much more,
> forgetting everything about the original layout.
> Such a "good HTML editor" must somehow track the original positions of
> the tags in the file. And with each logical change in the tree it must
> track the resulting file position changes/offsets. That seems to be
> missing in lxml and BeautifulSoup, which I have tried so far.
> This is a frequent need of mine. Nobody else's?
> It seems I need to write my own or patch BS to do that extra tracking?
Basic feature(s) of such a parser, perhaps:
* Can it tell, for each tag object in the parsed tree, at what
  original file position (start:end) it resided? An even more basic need:
  tell me the line number (e.g. for warning/analysis reports).
* Do the tree objects automatically track/know whether they were changed?
  (For convenience; a tree copy may serve this purpose otherwise.)
Creating output with local changes only would be rather
simple from that ...
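
To make the idea concrete, here is a minimal sketch of the first feature plus the "local changes" output, using only the standard library's html.parser (not lxml or BeautifulSoup). The names TagLocator and add_missing_alt are made up for this illustration. It assumes the whole document is fed in one go, tracks start tags only, and, instead of having tree objects track their own changes, simply splices modified tags back into the original text.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical helper: record where each start tag sat in the source."""

    def __init__(self, source):
        super().__init__()
        # Absolute offset at which each line begins. HTMLParser counts
        # lines by "\n" only, so the same rule is used here.
        self.line_starts = [0] + [
            i + 1 for i, ch in enumerate(source) if ch == "\n"
        ]
        # Entries are (tag, attrs, start, end, lineno).
        self.tags = []
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        # Inside a handler, getpos() reports the position of the "<".
        lineno, column = self.getpos()
        start = self.line_starts[lineno - 1] + column
        end = start + len(self.get_starttag_text())
        self.tags.append((tag, dict(attrs), start, end, lineno))


def add_missing_alt(source, alt_text):
    """Give every <img> without an alt attribute one; touch nothing else."""
    edits = []
    for tag, attrs, start, end, _lineno in TagLocator(source).tags:
        if tag == "img" and "alt" not in attrs:
            old = source[start:end]
            # Insert right after the tag name so the rest of the tag,
            # including its spacing and letter case, is kept verbatim.
            cut = 1 + len(tag)
            new = old[:cut] + ' alt="%s"' % alt_text + old[cut:]
            edits.append((start, end, new))
    # Apply from the back so earlier offsets stay valid.
    for start, end, new in sorted(edits, reverse=True):
        source = source[:start] + new + source[end:]
    return source
```

Calling add_missing_alt(text, "TODO") on a hand-formatted page should return the same text with only the affected img tags altered; the fifth field of each TagLocator entry is the line number, which covers the warning/analysis-report case. Applying the splices back to front is what keeps the offset bookkeeping trivial here; an editor doing incremental changes would have to shift the stored offsets instead.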