HTML Parser which allows low-keyed local changes (upon serialization)

M.-A. Lemburg mal at
Mon Feb 1 19:06:37 CET 2010

Robert wrote:
> I think you confused the logical level of what I meant by "file
> position":
> Of course it's not (necessarily) about writing back to the same open file
> (OS-level), but about the whole serialization string (wherever it
> is finally written to - I typically write the auto-converted HTML files
> to a 2nd test folder first, and want to use "diff -u ..." to see
> in human-readable form what changed - which again is only reasonable if
> the original layout is preserved as well as possible).
> lxml and BeautifulSoup, e.g.: load and parse an HTML file into a tree,
> immediately serialize the tree without changes => you see big
> differences between the original and serialized files for almost any file.
> The main issue: those libs do not seem to track any info about the original
> string/file positions of the objects they parse. They just forget the
> past. Thus they cannot, in principle, do what I want, it seems ...
> Or does anybody see attributes of the tree objects that I overlooked?
> Or a lib which can do, or at least better enable, this
> source-back-connected editing?

You'd have to write your own parser (or extend the example HTML
one we include), but mxTextTools allows you to work on the original
code quite easily: it tags parts of the input string with objects.

You can then have those objects manipulate the underlying text as
necessary and write back the text using the original formatting
plus your local changes.
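
To illustrate the idea (not the mxTextTools API itself), here is a
minimal sketch using only the standard library. It assumes a simple
regular expression is good enough to find the tags; the names TAG_RE,
tag_spans and apply_edits are made up for this example. Each tag is
recorded together with its offsets in the original string, and edits
are applied by offset, so everything you don't touch stays exactly
as it was.

```python
import re

# Hypothetical helpers for illustration only; this is plain stdlib code,
# not mxTextTools. The regex is a deliberately naive tag matcher.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")

def tag_spans(text):
    """Return (name, is_closing, start, end) for every tag found in text."""
    return [(m.group(2).lower(), bool(m.group(1)), m.start(), m.end())
            for m in TAG_RE.finditer(text)]

def apply_edits(text, edits):
    """Apply non-overlapping (start, end, replacement) edits by offset.

    Edits are applied from the end of the string backwards, so the
    offsets of edits that are still pending remain valid.
    """
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text

# Input with irregular layout that a parse/serialize round trip
# would typically normalize away.
source = "<HTML>\n  <Body  bgcolor=white>\n<B>Hi</B>   there\n  </Body>\n</HTML>\n"

# Local change: turn <B>...</B> into <strong>...</strong>, nothing else.
edits = []
for name, closing, start, end in tag_spans(source):
    if name == "b":
        edits.append((start, end, "</strong>" if closing else "<strong>"))

result = apply_edits(source, edits)
print(result, end="")
```

A "diff -u" between the original and the result would then show only
the line with the renamed tags. A real parser would of course need
to handle comments, scripts, quoted attribute values and so on.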

Marc-Andre Lemburg
