
Mailing List Archive: Wikipedia: Wikitech

Diff needs improvement


lars at aronsson

Nov 18, 2009, 9:42 AM

Post #1 of 8

When a new paragraph is inserted, diff doesn't notice that the
former first paragraph is now the second. It reports much
larger changes than actually happened. Why is that? How can it be
fixed?

I'm talking about Wikipedia now. Are there different
implementations of diff in various instances of MediaWiki?
How is it implemented? Using UNIX/Linux diff, wdiff, or some other
algorithm?

Here is an example, where a bullet list of works (discography) was
enhanced,
http://sv.wikipedia.org/w/index.php?title=Staffan_M%C3%A5rtensson&diff=10522416&oldid=10304813

As you can see, the Brahms Clarinet Sonatas entry was pushed from 1st
to 2nd position, but diff reports it as a total change. Instead,
the record label (Channel Sound) is reported as unchanged text.
Yes, the phrase "med Erik Lanninger" was also changed to "Med E
Lanninger", but that is a much smaller change than the one
reported.
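
To illustrate the pairing described above (this is not how MediaWiki's diff works, only a sketch of the desired behaviour), each old line could be matched with the most similar new line before a word-level diff is made. The sketch is in Python purely for illustration; `similarity` and `best_match` are hypothetical helper names, and the two list entries are made-up stand-ins for the discography lines, not the actual wiki text.

```python
import difflib
from typing import List


def similarity(a: str, b: str) -> float:
    """Word-level similarity between two lines, in the range [0, 1]."""
    matcher = difflib.SequenceMatcher(None, a.split(), b.split(), autojunk=False)
    return matcher.ratio()


def best_match(old_line: str, new_lines: List[str]) -> int:
    """Return the index of the new line that is most similar to old_line."""
    return max(range(len(new_lines)),
               key=lambda i: similarity(old_line, new_lines[i]))


# Made-up sample data: one old entry, and a new list where another
# entry was inserted in front of it and the old entry was lightly edited.
old = "* Brahms: Clarinet Sonatas, med Erik Lanninger (Channel Sound)"
new = [
    "* Mozart: Clarinet Concerto (Channel Sound)",
    "* Brahms: Clarinet Sonatas, Med E Lanninger (Channel Sound)",
]

print(best_match(old, new))
```

With this sample data the old entry pairs with the second new entry rather than with the one that merely shares the record label, so only "med Erik" versus "Med E" would remain to be reported as changed.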

At my website runeberg.org, where scanned books are proofread,
I have implemented the diff function using wdiff with some
extra features. An example is shown here,
http://runeberg.org/rc.pl?action=diff&src=nfbf/0734

Since a common edit is to change "word" to "<b>word</b>", I want
changes in XML-like markup to be reported separately, which you
can see is the case at the bottom of that diff. But wdiff splits
words strictly on whitespace, so I had to work around that. The quite
naive and non-optimized (but working) Perl code looks like this (yes,
versions are maintained by plain old RCS):

# A change from "foo bar" to "<b>foo bar" is seen by wdiff as a
# change of the word "foo" into "<b>foo". But we want to see this
# as the addition of the HTML/XML tag "<b>". To this effect, we
# pad spaces around all "<" and ">" in the original text versions,
# i.e. " <b> foo bar" before calling wdiff. The output from wdiff
# will be " <span><b></span> foo bar", where the padding spaces
# are outside of the <span> tags. This has to be taken into
# consideration when removing the space padding, below.

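# Check out both revisions from RCS, pad "<" and ">" with spaces,
# and let wdiff wrap deletions and insertions in <span> tags.
my $cmd = "umask 2"
    . " && co -p1.$rev1 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp1"
    . " && co -p1.$rev2 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp2"
    . " && wdiff -n -s -w '<span class=\"del\">' -x '</span>' "
    . " -y '<span class=\"ins\">' -z '</span>' $tmp1 $tmp2";
# Run the pipeline through the shell and read its output.
if (open(my $fh, '-|', $cmd)) {
    local $/ = undef;    # slurp the whole output at once
    $diff = <$fh>;
    close($fh);
} else {
    debug_log("rc.pl: Failed with $cmd");
}
$diff = html_encode($diff);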
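
For readers who don't use sed, the padding step alone can be sketched as follows. This is a Python illustration, not part of rc.pl; `pad_tags` and `tokens` are hypothetical names. It mirrors the two sed substitutions and shows the whitespace-separated words that a word-based diff would then compare.

```python
from typing import List


def pad_tags(text: str) -> str:
    """Mirror the sed expression s/</ </g;s/>/> /g: put a space before
    every "<" and after every ">"."""
    return text.replace("<", " <").replace(">", "> ")


def tokens(text: str) -> List[str]:
    """Split padded text on whitespace, the way a word-based diff sees it."""
    return pad_tags(text).split()


# Without padding, the tag sticks to the word next to it.
print("<b>foo bar".split())
# With padding, the tag becomes a word of its own.
print(tokens("<b>foo bar"))
print(tokens("a <b>word</b> here"))
```

After padding, a change from "word" to "<b>word</b>" shows up as two inserted tag tokens around an unchanged "word", which is the separation the comment block describes.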


Hope this was helpful.


--
Lars Aronsson (lars [at] aronsson)
Aronsson Datateknik - http://aronsson.se

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Nov 18, 2009, 10:59 AM

Post #2 of 8
Re: Diff needs improvement

On Wed, Nov 18, 2009 at 12:42 PM, Lars Aronsson <lars [at] aronsson> wrote:
> I'm talking about Wikipedia now. Are there different
> implementations of diff in various instances of MediaWiki?
> How is it implemented? Using UNIX/Linux diff, wdiff, or some other
> algorithm?

It's implemented out of the box in PHP, with a PHP extension written
in C++ available for better speed.

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/diff/
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/wikidiff2/

wdiff isn't reliably available -- it's not installed by default on all
(any?) Linux distros, and it's very unlikely to be installed on
non-Linux servers. Moreover, even where it's installed, shared hosts
often don't give PHP scripts the right to execute external programs --
that breaks out of PHP's sandboxes, and many shared hosts rely on
those instead of Unix permissions.

Given all that, we need a PHP implementation of some kind. And once
we have that, and need a faster version in C++ or such, I guess the
logic goes that we may as well use the same algorithm for the sake of
consistency. I don't know for sure; wikidiff2 was written several
months before I started MediaWiki development.


church.of.emacs.ml at googlemail

Nov 18, 2009, 11:04 AM

Post #3 of 8
Re: Diff needs improvement

Lars Aronsson wrote:
> When a new paragraph was inserted, diff doesn't discover that the
> previous first paragraph is now the second. The diff reports much
> larger changes than actually happened. Why is that? How can it be
> fixed?

This is one of those ancient bugs:
https://bugzilla.wikimedia.org/show_bug.cgi?id=5072

> I'm talking about Wikipedia now. Are there different
> implementations of diff in various instances of MediaWiki?
> How is it implemented? Using UNIX/Linux diff, wdiff, or some other
> algorithm?

MediaWiki seems to use its own PHP diff, called "DifferenceEngine"
(includes/diff/DifferenceEngine.php; the same directory also contains
a Diff.php, which includes a class "WikiDiff3"). However, it is possible
to use other diff engines such as GNU diff/diff3.

The config file of Wikimedia's setup suggests that Wikipedia is using the
wikidiff2 engine:
http://noc.wikimedia.org/conf/highlight.php?file=CommonSettings.php
http://www.mediawiki.org/wiki/Extension:Wikidiff2

Regards,

Church of emacs


Simetrical+wikilist at gmail

Nov 18, 2009, 11:19 AM

Post #4 of 8
Re: Diff needs improvement

On Wed, Nov 18, 2009 at 2:22 PM, Nikola Smolenski <smolensk [at] eunet> wrote:
> From what I recall I've seen while browsing through its source, wdiff just
> transforms every space in a file into a newline, then runs diff on it. This
> could be simulated in PHP too.

diff isn't reliably available either. It won't be present on Windows,
and will often be inaccessible on Unix (because of exec() being
disabled or such).


smolensk at eunet

Nov 18, 2009, 11:22 AM

Post #5 of 8
Re: Diff needs improvement

On Wednesday 18 November 2009 19:59:58 Aryeh Gregor wrote:
> wdiff isn't reliably available -- it's not installed by default on all
> (any?) Linux distros, and it's very unlikely to be installed on
> non-Linux servers. Moreover, even where it's installed, shared hosts
> often don't give PHP scripts the right to execute external programs --
> that breaks out of PHP's sandboxes, and many shared hosts rely on
> those instead of Unix permissions.

From what I recall from browsing through its source, wdiff just
transforms every space in a file into a newline, then runs diff on the
result. This could be simulated in PHP too.
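
A minimal sketch of that idea, in Python rather than PHP and purely as an illustration: split both texts on whitespace so that each word plays the role of a line, then run an ordinary sequence diff over the words. The function name `word_diff` is hypothetical, the standard-library `difflib.SequenceMatcher` stands in for diff (it is not the same algorithm as GNU diff or MediaWiki's PHP diff), and the `<span>` markers are borrowed from the wdiff call earlier in the thread. The original line breaks are not preserved.

```python
import difflib
from typing import List


def word_diff(old: str, new: str) -> str:
    """Diff two texts word by word and mark deletions and insertions."""
    a, b = old.split(), new.split()   # one word per "line"
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        if tag in ("delete", "replace"):
            out.append('<span class="del">' + " ".join(a[i1:i2]) + "</span>")
        if tag in ("insert", "replace"):
            out.append('<span class="ins">' + " ".join(b[j1:j2]) + "</span>")
    return " ".join(out)


print(word_diff("med Erik Lanninger", "Med E Lanninger"))
print(word_diff("foo bar", "foo baz bar"))
```

The same preprocessing could feed any line-based diff implementation, since the words are simply treated as the units being compared.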


Simetrical+wikilist at gmail

Nov 18, 2009, 11:25 AM

Post #6 of 8
Re: Diff needs improvement

On Wed, Nov 18, 2009 at 2:28 PM, Nikola Smolenski <smolensk [at] eunet> wrote:
> But the reformatted text could be diffed the same way ordinary text is now.

Yes, of course we can change the diff algorithm if we want. It's in
our SVN repo. That doesn't really have anything to do with diff or
wdiff, though.


smolensk at eunet

Nov 18, 2009, 11:28 AM

Post #7 of 8
Re: Diff needs improvement

On Wednesday 18 November 2009 20:19:45 Aryeh Gregor wrote:
> On Wed, Nov 18, 2009 at 2:22 PM, Nikola Smolenski <smolensk [at] eunet> wrote:
> > From what I recall I've seen while browsing through its source, wdiff
> > just transforms every space in a file into a newline, then runs diff on
> > it. This could be simulated in PHP too.
>
> diff isn't reliably available either. It won't be present on Windows,
> and will often be inaccessible on Unix (because of exec() being
> disabled or such).

But the reformatted text could be diffed the same way ordinary text is now.


smolensk at eunet

Nov 18, 2009, 11:59 AM

Post #8 of 8
Re: Diff needs improvement

On Wednesday 18 November 2009 20:25:17 Aryeh Gregor wrote:
> On Wed, Nov 18, 2009 at 2:28 PM, Nikola Smolenski <smolensk [at] eunet> wrote:
> > But the reformatted text could be diffed the same way ordinary text is
> > now.
>
> Yes, of course we can change the diff algorithm if we want. It's in
> our SVN repo. That doesn't really have to do with diff or wdiff,
> though.

We don't need to change the diff algorithm; we could simply preformat the
text the same way wdiff does.
