HTML Parser which allows low-keyed local changes (upon serialization)
stefan_ml at behnel.de
Mon Feb 1 15:05:39 CET 2010
Robert, 01.02.2010 14:36:
> Stefan Behnel wrote:
>> Robert, 31.01.2010 20:57:
>>> I tried lxml, but after walking and making changes in the element tree,
>>> I'm forced to do a full serialization of the whole document
>>> (etree.tostring(tree)) - which destroys the "human edited" format of the
>>> original HTML code and makes it rather unreadable.
>> What do you mean? Could you give an example? lxml certainly does not
>> destroy anything it parsed, unless you tell it to do so.
> of course it does not destroy during parsing.(?)
I meant "parsed" in the sense of "has parsed and is now working on".
> I mean: I want to walk through the parsed HTML tree with a Python script
> and modify things here and there (auto alt tags from a DB or similar,
> link corrections, text sections/translated sentences...), based on checks
> of the HTML code and content.
Sure, perfectly valid use case.
> Then I want to output the changed tree, but as close to the original
> format as possible. No changes to my whitespace or indentation,
> etc. Only local changes, where tags were really changed.
That's up to you. If you only apply local changes that do not change any
surrounding whitespace, you'll be fine.
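To illustrate what "local changes" means here, a minimal sketch follows. The
input document and the function name retarget_links are made up for the
example, and it assumes well-formed markup parsed with the XML parser rather
than lxml.html. If lxml is not installed it falls back to the standard
library's ElementTree, which should behave the same for this small case.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree behaves the same for this small case.
    import xml.etree.ElementTree as etree

# Hypothetical, hand-indented input (well-formed, so the XML parser accepts it).
SOURCE = (
    '<div>\n'
    '  <p>Hello\n'
    '     <a href="old.html">link</a>,\n'
    '     world.</p>\n'
    '</div>'
)


def retarget_links(source, old_href, new_href):
    """Change matching href attributes and return the re-serialized markup."""
    root = etree.fromstring(source)
    for link in root.iter("a"):
        if link.get("href") == old_href:
            # Local change: only the attribute value, no .text or .tail touched.
            link.set("href", new_href)
    return etree.tostring(root, encoding="unicode")


result = retarget_links(SOURCE, "old.html", "new.html")
# Line breaks and indentation of the untouched parts come back as they were.
assert result == SOURCE.replace("old.html", "new.html")
```

Only the attribute value differs between input and output in this case; the
whitespace lives in the .text and .tail of the elements, and nothing here
modifies it.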
> That's similar to what a good HTML editor does: after you have made
> small changes, it doesn't reformat/re-spit-out your whole code layout
> from the tree/attribute logic alone. You get local changes only.
HTML editors don't work that way. They always "re-spit-out" the whole code
when you click on "save". They certainly don't track the original file
position of tags. What they preserve is the content, including whitespace
(or not, if they reformat the code, but that's usually an *option*).
> Such a "good HTML editor" must somehow track the original positions of
> the tags in the file. And during each logical change in the tree it must
> track the changes to those file positions/offsets.
Sorry, but that's nonsense. The file position of a tag is determined by
whitespace, i.e. line endings and indentation. lxml does not alter that,
unless you tell it to do so.
Since you keep claiming that it *does* alter it, please come up with a
reproducible example that shows a) what you do in your code, b) what your
input is and c) what unexpected output it creates. Do not forget to include
the version number of lxml and libxml2 that you are using, as well as a
comment on /how/ the output differs from what you expected.
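For the version numbers, something along these lines should do. This is only
a sketch: the function name describe_versions is made up, and it relies on
the LXML_VERSION, LIBXML_VERSION and LIBXML_COMPILED_VERSION tuples that
lxml.etree exposes.

```python
import sys


def describe_versions():
    """Collect the version details that belong in a bug report."""
    info = {"python": sys.version.split()[0]}
    try:
        from lxml import etree
    except ImportError:
        info["lxml"] = None  # lxml is not installed in this environment
    else:
        # Each of these is a tuple of integers, e.g. (2, 2, 4, 0).
        info["lxml"] = ".".join(map(str, etree.LXML_VERSION))
        info["libxml2 (runtime)"] = ".".join(map(str, etree.LIBXML_VERSION))
        info["libxml2 (compiled against)"] = ".".join(
            map(str, etree.LIBXML_COMPILED_VERSION)
        )
    return info


if __name__ == "__main__":
    for name, value in describe_versions().items():
        print(f"{name}: {value}")
```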
My stab in the dark is that you forgot to copy the tail text of elements
that you replace by new content, and that you didn't properly indent new
content that you added. But that's just that, a stab in the dark. You
didn't provide enough information for even an educated guess.
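To show what I mean by the tail text, here is a minimal sketch. The snippet
and the helper names are invented for the example, and it again falls back to
the standard library's ElementTree if lxml is missing. The whitespace and text
that follow an element are stored in its .tail, so replacing the element
without copying the tail drops them from the tree.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree shares the .text/.tail model used here.
    import xml.etree.ElementTree as etree

# Hypothetical input: the text after </b> is stored in the tail of <b>.
SNIPPET = '<p>See\n   <b>this</b>\n   for details.</p>'


def make_replacement():
    """Build the new element that takes the place of <b>."""
    new = etree.Element("em")
    new.text = "this"
    return new


def replace_naively(parent, old, new):
    # The old element's tail text disappears from the tree along with it.
    parent[list(parent).index(old)] = new


def replace_keeping_tail(parent, old, new):
    # Copy the trailing text and whitespace first, then swap the elements.
    new.tail = old.tail
    parent[list(parent).index(old)] = new


def demo(replace):
    """Apply one of the replacement strategies and serialize the result."""
    root = etree.fromstring(SNIPPET)
    replace(root, root[0], make_replacement())
    return etree.tostring(root, encoding="unicode")


# The naive version loses the line break and the words "for details.".
assert demo(replace_naively) == '<p>See\n   <em>this</em></p>'
# Copying the tail keeps the surrounding layout intact.
assert demo(replace_keeping_tail) == '<p>See\n   <em>this</em>\n   for details.</p>'
```

The same idea applies to newly inserted elements: give them a .tail (and, for
their first child, a .text) that matches the indentation around them.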