[Python-bugs-list] [Bug #124051] ndiff bug: "?" lines are out-of-sync

Thu, 7 Dec 2000 15:04:40 -0800

Bug #124051, was updated on 2000-Dec-01 07:17
Here is a current snapshot of the bug.

Project: Python
Category: demos and tools
Status: Closed
Resolution: Invalid
Bug Group: Not a Bug
Priority: 5
Submitted by: flight
Assigned to : tim_one
Summary: ndiff bug: "?" lines are out-of-sync

Details: I wonder if this result (the "?" line) of ndiff is intentional:

clapton:1> cat a
Millionen für so 'n Kamelrennen sind
clapton:2> cat b
Millionen für so "n Kamelrennen sind
clapton:3> /tmp/ndiff.py -q a b
- Millionen für so 'n Kamelrennen sind
+ Millionen für so "n Kamelrennen sind
?                  ^

clapton:4> cat c
Millionen deren für so "n Kamelrennen sind
clapton:5> /tmp/ndiff.py -q a c
- Millionen für so 'n Kamelrennen sind
+ Millionen deren für so "n Kamelrennen sind
?           ++++++ -     +

Instead of a - and a subsequent +, I would expect to find here a ^, too.

Follow-Ups:

Date: 2000-Dec-03 19:11
By: tim_one

Comment:
A caret means that the character in the line two above and in the same column was replaced by the character in the line one above and in the same column.  That's why you get a caret in the first example but not the second:  the replacement involves two distinct columns.

If you did get a caret in the second example, where would it go?  If under the single quote from the line two above, it would look like the single quote got replaced by the ü in für; if under the double quote from the line one above, like the first e in Kamelrennen got replaced by a double quote.  Both readings would be wrong.

Edit sequences aren't unique, and in the absence of an obvious and non-ambiguous way to show replacements across columns, ndiff settles for a *correct* sequence ("deren " was inserted, "'" was deleted, '"' was inserted).  In this respect ndiff is functioning as designed, so it's not a bug.
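
The "two distinct columns" point can be illustrated with a small sketch using `difflib.SequenceMatcher` from the current standard library. This is only an approximation of what ndiff.py did at the time: passing `isjunk=None` (no junk filter) is an assumption made for simplicity, and `opcodes` is a hypothetical helper name.

```python
import difflib

a = "Millionen für so 'n Kamelrennen sind"
b = 'Millionen für so "n Kamelrennen sind'
c = 'Millionen deren für so "n Kamelrennen sind'


def opcodes(before, after):
    """Return the character-level edit script as a list of 5-tuples."""
    # isjunk=None: no character is treated as junk (a simplification;
    # ndiff itself uses a junk filter for intraline marking).
    return difflib.SequenceMatcher(None, before, after).get_opcodes()


for before, after in ((a, b), (a, c)):
    for tag, i1, i2, j1, j2 in opcodes(before, after):
        if tag != "equal":
            # i1:i2 are columns in the "before" line, j1:j2 in the "after" line.
            print(f"{tag:8} before[{i1}:{i2}]={before[i1:i2]!r} "
                  f"after[{j1}:{j2}]={after[j1:j2]!r}")
    print()
```

For `a` against `b` the quote is replaced at the same index in both lines; for `a` against `c` the replaced quote sits at index 17 in the old line but index 23 in the new one, so no single column could carry a caret for both.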

-------------------------------------------------------

Date: 2000-Dec-07 02:38
By: flight

Comment:
[Is such a long comment still appropriate for the SF BTS?]

Tim, could you please explain the meaning of the remaining symbols (plus,
minus) as well? I think their meaning is far from intuitive.

> A caret means that the character in the line two above and in the same
> column was replaced by the character in the line one above and in the same
> column.

How about this example, then? Why is there a caret?

freefly;44> cat a
1 2 3 5
freefly;45> cat b
1 3 4 5
freefly;46> ./ndiff.py -q a b
- 1 2 3 5
+ 1 3 4 5
?   -^+

Sorry, but I have the impression that the format used in the edit lines is
indeed ambiguous by definition.

> That's why you get a caret in the first example but not the
> second: the replacement involves two distinct columns.

> Edit sequences aren't unique, and in the absence of an obvious and
> non-ambiguous way to show replacements across columns, ndiff settles for a
> *correct* sequence ("deren " was inserted, "'" was deleted, '"' was
> inserted).  In this respect ndiff is functioning as designed, so it's not a
> bug.

Please describe the intended meaning of '+' and '-', and I will give you a
counterexample for which ndiff.py doesn't output a correct sequence.

I think it's especially annoying that the edit line doesn't reflect the
information that the algorithm used in fancy_replace generates (if you run
my first example, the algorithm will in fact record a 'replace' event, but
the output routine will degrade this into an 'insert' and a 'delete'
event).

Regarding uniqueness and ambiguity: it depends on the definition of an edit
line. You won't find a definition that keeps the edit line in sync
(column-wise) with both the pre and the post lines.

If you try to keep the edit line in sync (column-wise) with the pre line,
that's fine for '^' (meaning: character in this column has been changed) and
'-' (meaning: character in this column has been removed), but you won't be
able to record '+' events, since there's no column in the pre line where a
'+' event might be recorded.

(Similarly, if you tried to keep the edit line in sync with the post line.)

- one two three four five six seven
+ one three fxur 123456 five 987 six seven
?    ----    +  +^+++++      ++++

One way to work around this would be to output two edit lines: a pre-edit
line would be synced (column-wise) with the pre line, and it would record
all '-' and '^' events. A post-edit line would record all '+' and '^'
events, and would be in sync with the post line. Unambiguous and quite
intuitive:

  - one two three four five six seven
  ?    ----        ^                 
  + one three fxur 123456 five 987 six seven
  ?            ^  +++++++     ++++
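
The two-edit-line idea above can be sketched in a few lines on top of `difflib.SequenceMatcher`. This is an illustration under assumptions, not ndiff.py's code: `two_guide_lines` is a hypothetical helper, no junk filter is used (`isjunk=None`), and the edit script chosen by the matcher need not coincide with the hand-written example above.

```python
import difflib


def two_guide_lines(before, after):
    """Build one guide line per input line from a character-level edit script.

    Hypothetical helper for illustration; not part of ndiff.py.
    """
    pre, post = [], []
    matcher = difflib.SequenceMatcher(None, before, after)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        la, lb = i2 - i1, j2 - j1
        if tag == "replace":
            pre.append("^" * la)   # changed columns of the pre line
            post.append("^" * lb)  # changed columns of the post line
        elif tag == "delete":
            pre.append("-" * la)   # only the pre line has these columns
        elif tag == "insert":
            post.append("+" * lb)  # only the post line has these columns
        else:  # "equal"
            pre.append(" " * la)
            post.append(" " * lb)
    return "".join(pre).rstrip(), "".join(post).rstrip()


before = "Millionen für so 'n Kamelrennen sind"
after = 'Millionen deren für so "n Kamelrennen sind'
pre, post = two_guide_lines(before, after)
print("- " + before)
print("? " + pre)
print("+ " + after)
print("? " + post)
```

With each guide line tied to exactly one source line, the quote replacement gets a caret under the right column in both lines, and the inserted "deren " is marked only in the post line.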

A second way to define an unambiguous edit line format (but not really
friendly to eyeball inspection) would be to use the pre-edit line described
above, and, in a second step, to merge the '+' sequences at the respective
places. This format would allow for easy automatic extraction of all the
information generated by fancy_replace. In fact this is what I expected to
see.

- one two three four five six seven
+ one three fxur 123456 five 987 six seven
?    ----        ^  +++++++     ++++          

A third way would be to insert spaces or some other placeholder in the pre
line in the columns with 'insert' events and in the post line in the columns
with 'delete' events. Easy for eyeball inspection, but it doesn't output the
original lines.

- one two three four_______ five____ six seven
+ one     three fxur 123456 five 987 six seven
?    ------      ^  +++++++     ++++          

A final way would be to use a format like wdiff, where the insert and
replace tags are placed in the line:

one[- two-] three four{+ 123456+} five{+ 987+} six seven

If you ask me, any of these formats is better than the one currently
used, which is only reliable for short lines with small differences.

-------------------------------------------------------

Date: 2000-Dec-07 15:04
By: tim_one

Comment:
I suggest you're over-thinking this:  as the docs say, "Lines beginning with "? " attempt to guide the eye to intraline differences, and were not present in either input file."  "Guide the eye" is all they're designed to do.  I find them very effective for that purpose.

> could you please explain the meaning of the remaining
> symbols (plus, minus) as well? I think their meaning
> is far from intuitive.

They're not documented because they're not important:  if they manage to jerk your eyeball to the parts of the lines that changed, I'm happy.  In fact, a "-" means the character in the same column two lines above was deleted, and a "+" means the character in the same column one line above was inserted (although it says nothing about *where* it was inserted wrt the line two lines above).  This works great for the usual cases:  somebody deletes a word or two (and gets a "?" line with a bunch of ----- under the position(s) of the deleted word(s)), or adds a word or two (and gets a "?" line with a bunch of +++++ under the position(s) of the inserted word(s)).
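
The "usual case" of a deleted word can be tried with `difflib.ndiff` from the current standard library. This is a hedged sketch: today's difflib may lay out its "? " lines differently from the ndiff.py script discussed in this thread, so the exact guide text is not assumed here.

```python
import difflib

before = ["one two three\n"]
after = ["one three\n"]

# difflib.ndiff takes two lists of lines and yields the delta line by line.
delta = list(difflib.ndiff(before, after))
print("".join(delta), end="")

# Collect the guide lines; for a pure deletion they should carry only dashes.
guides = [line for line in delta if line.startswith("? ")]
```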

> Sorry, but I have the impression that the format used in
> the edit lines is indeed ambiguous by definition.

Sure.  It's ambiguous in that it gives no clue as to where insertions took place wrt the "before" line.  What you're missing is that I don't care <wink>.

> How about this example, then? Why is there a caret?
>
> freefly;44> cat a 
> 1 2 3 5 
> freefly;45> cat b 
> 1 3 4 5 
> freefly;46> ./ndiff.py -q a b 
> - 1 2 3 5 
> + 1 3 4 5 
> ?   -^+ 

The caret is an artifact of two things:  ndiff refuses to match on "junk" characters unless they're adjacent to a non-junk match, and a blank is considered to be a junk character for intraline marking.  In other words, ndiff doesn't "see" that the blanks match here.  You can step thru the code to see how that happens.  The sequence is nevertheless correct, although it indicates a replacement of a blank by a blank (which is legit but unnecessary).  I wouldn't object to adding code to suppress the caret in this case.

About synching, ndiff isn't trying to keep the edit line in synch with either the "before" or "after" lines.  "Guide the eye" is all it's after.

Your format with two "?" lines is attractive at first sight.  I'm not sure how well people would like it in practice (I have a lot of feedback on how ndiff actually works today, and I don't want to damage it in favor of an untested-in-practice hypothetical).  I can easily predict that people would object to otherwise-empty "?" lines in the cases where a word was simply inserted, or simply deleted.  They will also object to having two "?" lines when a single character is merely changed.  But if cases like those get special-cased to cut it back to one "?" line, then people will get confused by that very special-casing.  It's straightforward cases like these where ndiff works best as-is, and I don't want to lose that since the straightforward cases are the most common.

Your format that isn't friendly to eyeball inspection is a non-starter (ndiff's *purpose* is to be friendly to human eyeballs!  it's not a goal of ndiff output to be friendly to machine processing, except to allow trivially easy exact reconstruction of both "before" and "after" files).
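
The exact-reconstruction property mentioned above can be checked with `difflib.ndiff` and `difflib.restore` from the current standard library. This is a minimal sketch that assumes the library functions behave like the ndiff.py script in this respect.

```python
import difflib

before = ["Millionen für so 'n Kamelrennen sind\n"]
after = ['Millionen deren für so "n Kamelrennen sind\n']

delta = list(difflib.ndiff(before, after))

# which=1 recovers the "before" lines, which=2 the "after" lines;
# the "? " guide lines are discarded along the way.
restored_before = list(difflib.restore(delta, 1))
restored_after = list(difflib.restore(delta, 2))

print(restored_before == before, restored_after == after)
```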

Ditto for the format that doesn't reproduce the original source lines exactly.

> If you ask me, any of these formats is better than
> the one currently used, which is only reliable for
> short lines with small differences.

"Reliability" in your sense was not one of ndiff's design goals.  For purposes of guiding the eye to changes in the common cases, the length of the line doesn't matter, and this whole subsystem won't trigger at all unless a line pair has a "similarity score" of at least 0.75 (which ensures that changes are "small" relative to the length of the line).
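
A rough feel for the 0.75 threshold can be had from `difflib.SequenceMatcher.ratio()`, which returns 2*M/T, where M is the number of matched characters and T the combined length of both lines. This is a sketch under assumptions: `similarity` and `CUTOFF` are hypothetical names, and no junk filter is applied, so the score may differ in detail from the one ndiff computes internally.

```python
import difflib

CUTOFF = 0.75  # threshold quoted in the comment above


def similarity(before, after):
    """Return 2*M/T for two strings, with no junk filter (isjunk=None)."""
    return difflib.SequenceMatcher(None, before, after).ratio()


a = "Millionen für so 'n Kamelrennen sind"
c = 'Millionen deren für so "n Kamelrennen sind'

# 35 matched characters out of 36 + 42 in total, so 70/78.
print(similarity(a, c) >= CUTOFF)
# A line pair with nothing in common scores 0.0 and would get no "?" line.
print(similarity(a, "xyz") >= CUTOFF)
```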

-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=124051&group_id=5470