HTML Parser which allows low-keyed local changes (upon serialization)

Robert no-spam at non-existing.invalid
Mon Feb 1 08:36:37 EST 2010


Stefan Behnel wrote:
> Robert, 31.01.2010 20:57:
>> I tried lxml, but after walking and making changes in the element tree,
>> I'm forced to do a full serialization of the whole document
>> (etree.tostring(tree)) - which destroys the "human edited" format of the
>> original HTML code. makes it rather unreadable.
> 
> What do you mean? Could you give an example? lxml certainly does not
> destroy anything it parsed, unless you tell it to do so.
> 

Of course it does not destroy anything during parsing. The loss
happens when I serialize.

I mean: I want to walk through the parsed HTML tree with a Python
script and modify things here and there (automatic alt attributes
from a DB or similar, link corrections, text sections/translated
sentences ... as a result of HTML code and content checks).

Then I want to output the changed tree, but as close to the
original format as possible. No changes to my whitespace,
indentation, etc. Only local changes, where tags were really
changed.

That is similar to what a good HTML editor does: after you make
small changes, it doesn't reformat and re-emit your whole code
layout from the tree/attribute logic alone. You get local changes
only.
But a simple HTML editor like the one in Mozilla SeaMonkey outputs
entirely new HTML, produced from the logical tree only (in its own
(ugly) style). It destroys my whitespace layout and much more,
forgetting everything about the original layout.

Such a "good HTML editor" must somehow track the original
positions of the tags in the file. And with each logical change
in the tree it must track the resulting file position
changes/offsets. That capability seems to be missing in lxml and
BeautifulSoup, which are what I have tried so far.
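
The position tracking described above can be sketched with the
standard library's html.parser, which reports where each start tag
begins (getpos()) and its raw text (get_starttag_text()). This is
only a minimal illustration under stated assumptions, not a feature
of lxml or BeautifulSoup: ImgAltPatcher and add_missing_alt are
hypothetical names, the code targets Python 3, the whole document
is fed as one string, and the only "change" made is adding an alt
attribute to <img> tags that lack one.

```python
from html import escape
from html.parser import HTMLParser


class ImgAltPatcher(HTMLParser):
    """Collect splice edits for <img> start tags that lack an alt attribute.

    Hypothetical helper for illustration; not part of lxml or BeautifulSoup.
    """

    def __init__(self, source, alt_text):
        super().__init__()
        self.alt_attr = ' alt="%s"' % escape(alt_text, quote=True)
        # Absolute offset at which each line starts. The parser counts
        # lines by "\n" only, so do the same here.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.edits = []  # (start, end, replacement) in source offsets

    def handle_starttag(self, tag, attrs):
        if tag != "img" or any(name == "alt" for name, _ in attrs):
            return
        # In CPython's html.parser, getpos() inside this callback
        # points at the "<" of the tag being handled.
        lineno, column = self.getpos()
        start = self.line_starts[lineno - 1] + column
        raw = self.get_starttag_text()  # the tag exactly as written
        # Insert the attribute before the closing ">" or "/>", keeping
        # whatever whitespace the author put there.
        close_len = 2 if raw.endswith("/>") else 1
        head = raw[:-close_len].rstrip()
        self.edits.append((start, start + len(raw),
                           head + self.alt_attr + raw[len(head):]))


def add_missing_alt(source, alt_text):
    """Return source with alt added to bare <img> tags, rest untouched."""
    patcher = ImgAltPatcher(source, alt_text)
    patcher.feed(source)
    patcher.close()
    result = source
    # Apply edits back to front so earlier offsets stay valid.
    for start, end, replacement in sorted(patcher.edits, reverse=True):
        result = result[:start] + replacement + result[end:]
    return result
```

Because the edits are spliced into the original string instead of
re-serializing a tree, only the characters of the touched tags
change; indentation, tag case and everything else are copied from
the input as they were.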

This is a frequent need of mine. Does nobody else have it?

It seems I need to write my own parser or patch BS to do that
extra tracking?


Robert


