HTML alignment of Offsets and XPath

Hi lxml experts,

I have a question about character offsets and XPath that would not be necessary if everything in the world were XML.

tl;dr: What is the best way to translate back and forth between a character (or byte) offset in the string form of the HTML and the XPath-plus-relative-offset in the rendered DOM?

Context: We're working on an HTML highlighting tool that is intended to allow users to modify or create text selections generated by automatic named entity recognition (NER) algorithms. Many parts of this are working, and a key element uses your wonderful lxml.html.clean.Cleaner [1].

After cleaning a page to make `clean_html`, we also generate a tag-stripped form that has exactly the same byte offsets by replacing tags with whitespace of the same byte length. We call this `clean_visible`. This allows named entity recognizers, such as LingPipe, Basis Rosette, Stanford CoreNLP, Clear Forest, etc., to recognize names of things (people, companies, locations, etc.) in the natural language. The resulting offsets are then correct for both the HTML string and the tag-stripped form. Some NER tools can parse HTML to do an even better job of their task, but many do not.

The user-facing components are in a not-yet-released FOSS JavaScript component called "HTML highlighter" that operates *in the browser* on the live DOM rendered from the `clean_html`. Using ideas similar to those in Rangy [2], it figures out the XPath+offset for the start and end of the user's selection. This form of offset is really all that JavaScript can handle.

To make this whole thing work perfectly, we need to construct a Python service that translates between absolute offsets in the HTML string and XPath+offsets in the corresponding DOM.

Can lxml help with this?

Thanks for any guidance. (And please no flames about how the whole world should be in XML, because... that's not this world :-)

John

[1] https://github.com/trec-kba/streamcorpus-pipeline/blob/master/streamcorpus_p...
[2] https://github.com/timdown/rangy

--
______________________________
John R. Frank <jrf@diffeo.com>