From docutils.org.timehorse at neverbox.com Mon Mar 2 18:33:46 2009
From: docutils.org.timehorse at neverbox.com (Jeffrey C. Jacobs)
Date: Mon, 2 Mar 2009 17:33:46 +0000 (UTC)
Subject: [Doc-SIG] Diffing reStructuredText documents that only differ by formatting
Message-ID:

I am wondering if there is a way to diff 2 versions of a reStructuredText
document that differ only by line breaks within paragraphs such that those
differences do not trigger a diff entry. In other words, I wonder if there
is a tool out there where:

this is one
reStructuredText
paragraph

is considered equivalent to:

this is one reStructuredText paragraph

Does anyone have any ideas how this can be accomplished, especially with
respect to VCS differences, e.g. svn?

Thanks!

Jeffrey.

From blais at furius.ca Mon Mar 2 21:21:01 2009
From: blais at furius.ca (Martin Blais)
Date: Mon, 02 Mar 2009 15:21:01 -0500
Subject: [Doc-SIG] Diffing reStructuredText documents that only differ by formatting
In-Reply-To:
References:
Message-ID: <1236025261.4431.1303225557@webmail.messagingengine.com>

On Mon, 2 Mar 2009 17:33:46 +0000 (UTC), "Jeffrey C. Jacobs" said:
> I am wondering if there is a way to diff 2 versions of a reStructuredText
> document that differ only by line breaks within paragraphs such that those
> differences do not trigger a diff entry. In other words, I wonder if
> there is a tool out there where:
>
> this is one
> reStructuredText
> paragraph
>
> Is considered equivalent to:
>
> this is one reStructuredText paragraph
>
> Does anyone have any ideas how this can be accomplished, especially with
> respect to VCS differences, e.g. svn?

If the differences are only whitespace, xxdiff has an option to keep
those gray in the GUI.

tangerine:~/p/xxdiff/src$ xxdiff --list-resource | grep Hunk
Accel.ToggleIgnorePerHunkWhitespace: ""
IgnorePerHunkWhitespace: False
tangerine:~/p/xxdiff/src$

Otherwise you can write a 40-line Python script to parse GNU diff output
and filter out those changes from the diff hunks.

From gael.varoquaux at normalesup.org Mon Mar 2 21:42:54 2009
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Mon, 2 Mar 2009 21:42:54 +0100
Subject: [Doc-SIG] Diffing reStructuredText documents that only differ by formatting
In-Reply-To:
References:
Message-ID: <20090302204254.GA5045@phare.normalesup.org>

On Mon, Mar 02, 2009 at 05:33:46PM +0000, Jeffrey C. Jacobs wrote:
> I am wondering if there is a way to diff 2 versions of a reStructuredText
> document that differ only by line breaks within paragraphs such that those
> differences do not trigger a diff entry. In other words, I wonder if there is a
> tool out there where:

> this is one
> reStructuredText
> paragraph

> Is considered equivalent to:

> this is one reStructuredText paragraph

wdiff.

Gaël

From docutils.org.timehorse at neverbox.com Tue Mar 3 18:57:12 2009
From: docutils.org.timehorse at neverbox.com (Jeffrey C. Jacobs)
Date: Tue, 3 Mar 2009 17:57:12 +0000 (UTC)
Subject: [Doc-SIG] Diffing reStructuredText documents that only differ by formatting
References: <20090302204254.GA5045@phare.normalesup.org>
Message-ID:

Gael Varoquaux <gael.varoquaux at normalesup.org> writes:
>
> wdiff.

Thanks for the suggestions! Unfortunately, one thing I forgot to mention
was that the concatenations should not span different paragraphs. Thus:

Hello! World!

is not the same as:

Hello!

World!

Since the first represents 1 paragraph, but the second 2.

Instead, I propose the following Python script that diffs the docutils
trees instead of the original text files. I don't know how it could tell
whether the 2 inputs are reStructuredText documents vs.
regular text documents and only perform the doc-tree step if rst, and I
welcome suggestions for improvements, but so far this does a good job of
what I am trying to achieve. Such a tool could be handy to rst documenters
in cases where a document may have a bunch of lines through years of
editing that go beyond 80 columns and thus the file is edited to bring it
back in line, which produces massive standard diffs when the result really
should more or less be the same document. This script could be used to
confirm that the two versions of documents are more or less the same.

----------
#!/usr/bin/env python3

import os
import re
import subprocess
import sys
import tempfile

import docutils.core

# Regexps for removing inconsequential characters.
# Under Python 3, str patterns are Unicode-aware by default and re.L is
# rejected for them, so no flags are passed.
# trimwhite: a line break inside a text node, i.e. one that does not
# follow a tag ('>') and is not followed by a tag ('<') or a space.
trimwhite = re.compile(r'(?<!>)\n\s*(?![< ])')
# webspace: 2 or more white spaces after a full stop (.?!), ')' or ':'.
webspace = re.compile(r'(?<=[.?!):])\s{2,}(?=[\w\d(])')
repl = r' '

if __name__ == '__main__':
    # To Do: verify that document 1 and document 2 are both
    # reStructuredText documents

    # Last 2 parameters are the left hand side and right hand side file
    lhs, rhs = sys.argv[-2:]

    # Ask publish_string for str output rather than encoded bytes
    overrides = {'output_encoding': 'unicode'}

    # Parse the left and right file into docutils tree strings
    with open(lhs) as lhsf:
        lhss1 = docutils.core.publish_string(
            lhsf.read(), settings_overrides=overrides)
    with open(rhs) as rhsf:
        rhss2 = docutils.core.publish_string(
            rhsf.read(), settings_overrides=overrides)

    # Concatenate multi-line text that lies within a node
    lhss1, lhsr1 = trimwhite.subn(repl, lhss1)
    rhss2, rhsr2 = trimwhite.subn(repl, rhss2)

    #sys.stdout.write('Removed returns (left, right): %d, %d\n' %
    #                 (lhsr1, rhsr2))

    # Trim multiple white spaces between full-stop (.?!) and the next phrase
    lhss1, lhsr1 = webspace.subn(repl, lhss1)
    rhss2, rhsr2 = webspace.subn(repl, rhss2)

    #sys.stdout.write('Removed double space (left, right): %d, %d\n' %
    #                 (lhsr1, rhsr2))

    # Make sure the last line is properly terminated
    lhss1 += '\n'
    rhss2 += '\n'

    # Allocate temporary files to hold the left and right doc-trees
    lhsh1, lhst1 = tempfile.mkstemp(text=True)
    rhsh2, rhst2 = tempfile.mkstemp(text=True)

    # Open the left and right temp files for writing
    lhso1 = os.fdopen(lhsh1, 'w')
    rhso2 = os.fdopen(rhsh2, 'w')

    # Write the doc-trees to the temp files
    lhso1.write(lhss1)
    rhso2.write(rhss2)

    # Close the temp files
    lhso1.close()
    rhso2.close()

    # Spawn [UNIX] diff and wait for it to complete
    # Stdout and Stderr are passed directly to this application
    sp = subprocess.Popen(['diff'] + sys.argv[1:-2] + [lhst1, rhst2])
    sp.wait()

    # Delete the temp files
    os.remove(lhst1)
    os.remove(rhst2)

From blais at furius.ca Tue Mar 3 18:59:59 2009
From: blais at furius.ca (Martin Blais)
Date: Tue, 03 Mar 2009 12:59:59 -0500
Subject: [Doc-SIG] Diffing reStructuredText documents that only differ by formatting
In-Reply-To:
References: <20090302204254.GA5045@phare.normalesup.org>
Message-ID: <1236103199.14398.1303417921@webmail.messagingengine.com>

On Tue, 3 Mar 2009 17:57:12 +0000 (UTC), "Jeffrey C. Jacobs" said:
> Gael Varoquaux <gael.varoquaux at normalesup.org> writes:
> >
> > wdiff.
>
> Thanks for the suggestions! Unfortunately, one thing I forgot to mention
> was that the concatenations should not span different paragraphs. Thus:
>
> Hello! World!
>
> is not the same as:
>
> Hello!
>
> World!
>
> Since the first represents 1 paragraph, but the second 2.
>
> Instead, I propose the following Python script that diffs the docutils
> trees instead of the original text files. I don't know how it could tell
> whether the 2 inputs are reStructuredText documents vs.
> regular text documents and only perform the doc-tree step if rst, and I
> welcome suggestions for improvements, but so far this does a good job of
> what I am trying to achieve. Such a tool could be handy to rst documenters
> in cases where a document may have a bunch of lines through years of
> editing that go beyond 80 columns and thus the file is edited to bring it
> back in line, which produces massive standard diffs when the result really
> should more or less be the same document. This script could be used to
> confirm that the two versions of documents are more or less the same.

This is great.

BTW if you want to inspect your diffs graphically, you can tell
xxdiff to use your program to compute the differences. It'll
work if your program outputs POSIX diffs (which it likely does,
because you're invoking GNU diff).

From docutils.org.timehorse at neverbox.com Wed Mar 4 16:22:20 2009
From: docutils.org.timehorse at neverbox.com (Jeffrey C. Jacobs)
Date: Wed, 4 Mar 2009 15:22:20 +0000 (UTC)
Subject: [Doc-SIG] Ambiguity in default output for publish_string
Message-ID:

The two reStructuredText files:

--------
This paragraph has a very funny **indent**    after that word, right?
--------

and:

--------
This paragraph has a very funny **indent
after that word, right?**
--------

are theoretically different. The first puts strong emphasis only on the
word **indent**, which is followed by exactly 4 spaces, whereas the other
puts strong emphasis on the entire expression "indent after that word,
right?", where there is a line feed between "indent" and "after".

However, when publish_string is called to output the tree for both of
these expressions, they both return:

<document source="<string>">
    <paragraph>
        This paragraph has a very funny
        <strong>
            indent
            after that word, right?

which is not different. As far as I can tell, the internal node structure
is correct; the ambiguity only appears when the node structure is
displayed in string form, which is the default behavior of publish_string.
Since this output is a serialization of the node structure, it seems that
the output of publish_string should not be ambiguous in terms of what it
truly represents. Or, is there a better way to represent the internal doc
tree unambiguously as a string?

From g.brandl at gmx.net Wed Mar 4 19:06:43 2009
From: g.brandl at gmx.net (Georg Brandl)
Date: Wed, 04 Mar 2009 19:06:43 +0100
Subject: [Doc-SIG] Ambiguity in default output for publish_string
In-Reply-To:
References:
Message-ID:

Jeffrey C. Jacobs wrote:

> However, when publish_string is called to output the tree for both of
> these expressions, they both return:
>
> <document source="<string>">
>     <paragraph>
>         This paragraph has a very funny
>         <strong>
>             indent
>             after that word, right?
>
> which is not different. As far as I can tell, the internal node structure
> is correct; the ambiguity only appears when the node structure is
> displayed in string form, which is the default behavior of publish_string.
> Since this output is a serialization of the node structure, it seems that
> the output of publish_string should not be ambiguous in terms of what it
> truly represents. Or, is there a better way to represent the internal doc
> tree unambiguously as a string?

What you see there is the "pseudo-XML" output format, which is nice for a
quick view but not unambiguous. Try publish_string(..., writer_name='xml')
for real XML output which is unambiguous in all cases.

Georg

From scott+doc-sig at scottdial.com Tue Mar 10 17:55:52 2009
From: scott+doc-sig at scottdial.com (Scott Dial)
Date: Tue, 10 Mar 2009 12:55:52 -0400
Subject: [Doc-SIG] (Issue #4711) Wide literals in the table of contents overflow in documentation
Message-ID: <49B69B98.3050509@scottdial.com>

I posted this bug onto the tracker a few months ago and it didn't garner
any attention, perhaps because it is a bit of a nitpick. However, it
drives me nuts every time I see it come up in the Python docs, so I bring
it up here again in hopes of resolving it.
I copy the report here:

"""
There is a problem with the table of contents with respect to literals
that cannot be word-wrapped. I see this issue here:

http://docs.python.org/dev/2.6/library/multiprocessing.html

The line in the table of contents that reads "The
multiprocessing.sharedctypes module" is broken in that the literal
"multiprocessing.sharedctypes" overflows into the right-hand side. It
also ends up underneath the contents on the right, which makes it extra
hard to know what that entry was about.

This instance may be browser-specific, but I think it brings up a more
general question of what should be done with such long literals and how
overflow should be handled. And perhaps even whether it is wise to have
set the width of that div to such a narrow and specific value (230px).
"""

-- 
Scott Dial
scott at scottdial.com
scodial at cs.indiana.edu
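
Returning to the diffing thread above: the requirement stated there (line
breaks inside a paragraph are ignored, paragraph boundaries are not) can be
approximated without docutils. The following is a minimal sketch, not the
script from the thread. It assumes paragraphs are separated by blank lines
and it knows nothing about reStructuredText structure, so literal blocks,
lists and tables would be reflowed as well. The names normalize_paragraphs
and paragraph_diff are hypothetical.

```python
import difflib
import re


def normalize_paragraphs(text):
    """Return one whitespace-normalized string per paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r'\n\s*\n', text.strip())
    # Collapse every run of whitespace inside a paragraph to one space.
    return [' '.join(paragraph.split()) for paragraph in paragraphs]


def paragraph_diff(old, new):
    """Return unified-diff lines comparing two texts paragraph by paragraph."""
    return list(difflib.unified_diff(
        normalize_paragraphs(old), normalize_paragraphs(new),
        fromfile='old', tofile='new', lineterm=''))


# The examples from the thread.
wrapped = "this is one\nreStructuredText\nparagraph\n"
unwrapped = "this is one reStructuredText paragraph\n"
one_paragraph = "Hello! World!\n"
two_paragraphs = "Hello!\n\nWorld!\n"

# Re-wrapping alone produces no diff lines.
print(paragraph_diff(wrapped, unwrapped))
# Splitting one paragraph into two is still reported.
for line in paragraph_diff(one_paragraph, two_paragraphs):
    print(line)
```

Because it works on raw text, this sketch is only suitable for prose-only
documents; diffing the docutils tree, as in the script from the thread,
respects the actual document structure.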