
Mailing List Archive: Wikipedia: Foundation

Wikipedia meets git

 

 



jamesmikedupont at googlemail

Oct 15, 2009, 11:55 AM

Post #1 of 37
Wikipedia meets git

Hallo,
I have gotten the Wikipedia article for Kosovo into git.
It is fast, distributed, highly compressed, redundant, branchable and usable.

The blame function will show you who edited what, and in which revision.

Here is blame on the up-to-date Kosovo article:
http://github.com/h4ck3rm1k3/KosovoWikipedia/blame/master/Wiki/Kosovo/article.xml

I have checked in all the code used to produce this here:
https://code.launchpad.net/~jamesmikedupont/+junk/wikiatransfer
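
In outline, the import loop is something like this (a minimal sketch,
not the exact script in that branch; it assumes Special:Export's
history parameter and that it is run from inside an existing git repo):

use strict;
use warnings;
use LWP::Simple qw(get);

my $article = shift || 'Kosovo';

# Special:Export with the history parameter returns the revisions
# as one XML document (capped for very large histories)
my $xml = get("http://en.wikipedia.org/wiki/Special:Export/$article?history")
    or die "export failed";

mkdir $article;

# naive split on <revision>; a real importer should use an XML parser
my @revs = $xml =~ m{<revision>(.*?)</revision>}gs;
for my $rev (@revs) {
    my ($ts)   = $rev =~ m{<timestamp>(.*?)</timestamp>};
    my ($who)  = $rev =~ m{<(?:username|ip)>(.*?)</(?:username|ip)>};
    my ($text) = $rev =~ m{<text[^>]*>(.*?)</text>}s;
    next unless defined $text;
    $who = 'unknown' unless defined $who;

    open my $out, '>', "$article/article.xml" or die $!;
    print $out $text;
    close $out;

    # commit with the wiki editor as the author, so git blame names them
    system('git', 'add', "$article/article.xml");
    system('git', 'commit', '-q', "--date=$ts",
           "--author=$who <$who\@wiki>", '-m', "import revision at $ts");
}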

thanks,
mike



gmaxwell at gmail

Oct 15, 2009, 1:16 PM

Post #2 of 37
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 2:55 PM, jamesmikedupont [at] googlemail
<jamesmikedupont [at] googlemail> wrote:
> I have gotten the Wikipedia article for Kosovo into git.
> It is fast, distributed, highly compressed, redundant, branchable and usable.

It is cool that you get the complete history.

But it's a bit uncool that it's about 14 MB when the article is
100 KB; understandable, given that the expanded uncompressed history is
about 337 MB...

I repacked the repository using:

  git-pack-objects --progress --window=40000 --depth=40000 \
      --compression=9 --all --delta-base-offset

(git-repack alone doesn't really repack.)

And now I have a 4,168,915-byte pack,
KosovoWikipedia-ae859bbf9446ddcde4b17e09c99c28dcf594da89.pack, which
is more reasonable.

The number of revisions to a single article is a little bit outside of
the normal usage of git. ;)



jamesmikedupont at googlemail

Oct 15, 2009, 1:38 PM

Post #3 of 37
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 10:16 PM, Gregory Maxwell <gmaxwell [at] gmail> wrote:
> It is cool that you get the complete history.
>
> But it's a bit uncool that it's about 14 MB when the article is
> 100 KB; understandable, given that the expanded uncompressed history
> is about 337 MB...

I have the uncompressed history here at about 550 MB:
du -h history/
556M history/


If I bzip2 this, it is:
29M 2009-10-15 22:35 total.tar.bz

14 MB is still smaller, and the upload is faster!
>
> The number of revisions to a single article is a little bit outside of
> the normal usage of git. ;)

There are ways to optimize all of this. Most users will not want to
download the full history.

This is just one day's work using git; we will be able to optimize all of this.

I will be able to find other examples of large repositories:

http://laserjock.wordpress.com/2008/05/09/bzr-git-and-hg-performance-on-the-linux-tree/


mike



gmaxwell at gmail

Oct 15, 2009, 2:33 PM

Post #4 of 37
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 4:38 PM, jamesmikedupont [at] googlemail
<jamesmikedupont [at] googlemail> wrote:
> There are ways to optimize all of this. Most users will not want to
> download the full history.

Then why are you using git?



jamesmikedupont at googlemail

Oct 15, 2009, 9:40 PM

Post #5 of 37
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 11:33 PM, Gregory Maxwell <gmaxwell [at] gmail> wrote:
> On Thu, Oct 15, 2009 at 4:38 PM, jamesmikedupont [at] googlemail
> <jamesmikedupont [at] googlemail> wrote:
>> There are ways to optimize all of this. Most users will not want to
>> download the full history.
>
> Then why are you using git?

I am not most users. I am using git because I think it is the best way
forward to implement many of the ideas discussed in the strategy wiki.



jamesmikedupont at googlemail

Oct 15, 2009, 9:45 PM

Post #6 of 37
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 11:33 PM, Gregory Maxwell <gmaxwell [at] gmail> wrote:
> Then why are you using git?


If you want only the last 3 revisions checked out, it takes about 10
seconds and produces about 300 KB of data.

git clone --depth 3 git://github.com/h4ck3rm1k3/KosovoWikipedia.git

du -h gittest/
252K gittest/

Log file :

Initialized empty Git repository in
/home_data2/2009/10/KosovoWikipedia/gittest/KosovoWikipedia/.git/
remote: Counting objects: 21, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 21 (delta 3), reused 20 (delta 3)
Receiving objects: 100% (21/21), 40.98 KiB, done.
Resolving deltas: 100% (3/3), done.



jamesmikedupont at googlemail

Oct 15, 2009, 10:18 PM

Post #7 of 37
Re: Wikipedia meets git

>> On Thu, Oct 15, 2009 at 11:33 PM, Gregory Maxwell <gmaxwell [at] gmail> wrote:
>>> Then why are you using git?

It turns out there are a few wikis built on top of git:

1. git-wiki:
http://atonie.org/2008/02/git-wiki
http://github.com/jeffbski/git-wiki
git-wiki is a wiki that relies on git to keep pages' history and
Sinatra to serve them. (Ruby)

It supports these markups:
* Creole: a Creole-to-HTML converter for the Creole lightweight
markup language (http://wikicreole.org/)
* Markdown: Discount, a Markdown processor for Ruby
(http://github.com/rtomayko/rdiscount)
* Textile: RedCloth, a module for using the Textile markup
language in Ruby (http://redcloth.org/)

2. gitit:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/gitit
gitit is a wiki using happstack, git or darcs, and pandoc. (Haskell)

3. ikiwiki:
http://ikiwiki.info/
Ikiwiki is a wiki compiler.
http://ikiwiki.info/ikiwiki/formatting/

4. wigit: the PHP git wiki
http://el-tramo.be/software/wigit



joshuagay at gmail

Oct 15, 2009, 10:19 PM

Post #8 of 37
Re: Wikipedia meets git

This is very awesome. I am in the early stages of trying to scope out a
small side project to do a MediaWiki <-> git bridge; it is very
challenging. Being able to download the complete edit history in this
fashion is extremely useful. Thank you very much for sharing this work.

-Josh




--
I am running a marathon for the Leukemia & Lymphoma Society. Can you help me
reach my fundraising goals? Visit
http://pages.teamintraining.org/ma/pfchangs10/joshuagay


denny.vrandecic at kit

Oct 16, 2009, 12:45 AM

Post #9 of 37
Re: Wikipedia meets git

That is pretty cool. But wouldn't it make more sense to have a more
fine-grained blame, like the one in WikiTrust, down to the character
level?

cheers,
denny






jamesmikedupont at googlemail

Oct 16, 2009, 1:30 AM

Post #10 of 37
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 9:45 AM, Denny Vrandecic
<denny.vrandecic [at] kit> wrote:
> That is pretty cool. But wouldn't it make more sense to have a more
> fine-grained blame, like the one in WikiTrust, down to the character
> level?

I don't know all these wiki tools, but if the feature is missing from
git, then adding it will benefit all projects using git.

My fascination with using a real distributed version control system is
that it provides the features that we are missing in MediaWiki.

We can use standard tools to do good things, and not have to reinvent
the world all the time.

We don't need to have a centralized repository and only one point of
view; using a real VCS means that we can have multiple hosts, multiple
points of view, and a failsafe system.

My next steps are to work on the reader tool, creating LaTeX output
and espeak output of the articles; I am adding Unicode character
support right now. I would like to get that up to speed, to use PDF /
audio rendering of the articles.

I will continue to work with selected articles and improve the import
feature. It should be easy to have an import tool fed by an RSS feed
for some articles that imports them on a regular basis, as sketched
below.
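
Something like this polling loop could drive it (a rough sketch; it
assumes the per-article history Atom feed, and import_article.pl is a
hypothetical stand-in for the import script):

use strict;
use warnings;
use LWP::Simple qw(get);

my $article = 'Kosovo';
my $feed = "http://en.wikipedia.org/w/index.php?" .
           "title=$article&action=history&feed=atom";

my $last_seen = '';
while (1) {
    if (my $atom = get($feed)) {
        # the first <entry> in the feed is the newest revision
        my ($newest) = $atom =~ m{<entry>.*?<id>([^<]+)</id>}s;
        if (defined $newest and $newest ne $last_seen) {
            $last_seen = $newest;
            # hypothetical importer: re-fetch and commit new revisions
            system('perl', 'import_article.pl', $article);
        }
    }
    sleep 15 * 60;    # poll every 15 minutes
}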

Mike



denny.vrandecic at kit

Oct 16, 2009, 3:33 AM

Post #11 of 37
Re: Wikipedia meets git

Just another pointer: here is a distributed MediaWiki system developed
at INRIA. I haven't looked into it too deeply yet, but their evaluation
looked very promising.

<http://m3p.gforge.inria.fr/pmwiki/pmwiki.php>

Best,
denny





jamesmikedupont at googlemail

Oct 16, 2009, 3:39 AM

Post #12 of 37
Re: Wikipedia meets git


I have made two short vlogs about what I did and why:

http://www.youtube.com/watch?v=jc9jo1ZFLqk

http://www.youtube.com/watch?v=7WfRuEuvIso

Mike



jamesmikedupont at googlemail

Oct 16, 2009, 4:40 AM

Post #13 of 37
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 9:45 AM, Denny Vrandecic
<denny.vrandecic [at] kit> wrote:
> That is pretty cool. But wouldn't it make more sense to have a more
> fine-grained blame, like the one in WikiTrust, down to the character
> level?

Can you please provide some example pages from WikiTrust? They seem
to be AWOL:

"In the meantime, you can look at our list of colored pages" links to
http://wikitrust.soe.ucsc.edu/index.php/Colored_pages -> Page not found

Thanks,
mike



gerard.meijssen at gmail

Oct 16, 2009, 5:08 AM

Post #14 of 37
Re: Wikipedia meets git

Hoi,
After a minute of googling I find http://wikitrust.soe.ucsc.edu/home .
I am sure it is there for you as well.
Thanks,
GerardM



jamesmikedupont at googlemail

Oct 16, 2009, 5:17 AM

Post #15 of 37
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 2:08 PM, Gerard Meijssen
<gerard.meijssen [at] gmail> wrote:
> Hoi,
> After a minute of googling I find http://wikitrust.soe.ucsc.edu/home .
> I am sure it is there for you as well.


Yes, the page is there, and it seems to be a good idea.

Only I am missing some HTML pages so that we can see what it looks
like, a word-level blame; the colorized pages are missing.

On this page: http://wikitrust.soe.ucsc.edu/home
it says: "In the meantime, you can look at our list of colored pages,
or look at screenshots of English Wikipedia pages analyzed by
WikiTrust." The colored pages link points to
http://wikitrust.soe.ucsc.edu/index.php/Colored_pages, which is
missing...

mike



wikimail at inbox

Oct 16, 2009, 7:31 AM

Post #16 of 37
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 12:45 AM, jamesmikedupont [at] googlemail wrote:
> If you want only the last 3 revisions checked out, it takes about 10
> seconds and produces about 300 KB of data.

10 seconds? That's horrible. Have you tried using svn?



jamesmikedupont at googlemail

Oct 16, 2009, 7:37 AM

Post #17 of 37
Re: Wikipedia meets git

I did not mean that literally; let me check the exact time for you: 1.258s.

time git clone --depth 3 git://github.com/h4ck3rm1k3/KosovoWikipedia.git
Initialized empty Git repository in
/home_data2/2009/10/KosovoWikipedia/gittest2/KosovoWikipedia/.git/
remote: Counting objects: 21, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 21 (delta 3), reused 20 (delta 3)
Receiving objects: 100% (21/21), 40.99 KiB, done.
Resolving deltas: 100% (3/3), done.

real 0m1.258s
user 0m0.024s
sys 0m0.024s




gmaxwell at gmail

Oct 17, 2009, 1:18 AM

Post #18 of 37
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 10:31 AM, Anthony <wikimail [at] inbox> wrote:
> On Fri, Oct 16, 2009 at 12:45 AM, jamesmikedupont [at] googlemail
>> If you want only the last 3 revisions checked out, it takes about 10
>> seconds and produces about 300 KB of data.
>
> 10 seconds?  That's horrible.  Have you tried using svn?

On a reasonably fast network it actually takes only about 10 seconds
to pull the entire edit history from his repo; it would take less if
the history had been repacked as I described, but that kind of tight
repacking makes it take longer when you only want a portion of the
history.

Still, many of the neat things that can be done by having the article
in git are only possible if you have the complete history; for
example, generating a blame map needs the entire history.

It would be nice if the git archival format were more efficient for
the kinds of changes made in Wikipedia articles: source code changes
tend to have short lines, and changes tend to touch a significant
portion of the lines, while edits on Wikipedia are far more likely to
change only part of a very long line (really, a paragraph). So working
with line-level deltas is efficient for source code but inefficient
for Wikipedia data.

On this repository, git fast-export --all | lzma -9 produces a 900 KB
output (505,783 bytes if you want to be silly and use PAQ8HP12, which
is pretty much the state of the art for English text, instead of
LZMA). These methods don't provide fast random access, but it's still
clear that there is a lot of room for improvement. ;) I'm not sure if
anyone is working on improved compression for git for these kinds of
documents.

Getting the entire history of a frequently edited article like this
down to ~1-2 MB is roughly the point where I think it's reasonable for
someone doing continued non-trivial work on the article to fetch the
entire history, and thus gain access to functionality that needs most
of the history.



jamesmikedupont at googlemail

Oct 17, 2009, 1:40 AM

Post #19 of 37
Re: Wikipedia meets git

On Sat, Oct 17, 2009 at 10:18 AM, Gregory Maxwell <gmaxwell [at] gmail> wrote:
> Still, many of the neat things that can be done by having the article
> in git are only possible if you have the complete history; for
> example, generating a blame map needs the entire history.

Yes; and if you just want to view and edit, then you need only one
revision. If you want to do more, you can pull the history.

>
> It would be nice if the git archival format were more efficient for
> the kinds of changes made in Wikipedia articles: source code changes
> tend to have short lines, and changes tend to touch a significant
> portion of the lines, while edits on Wikipedia are far more likely to
> change only part of a very long line (really, a paragraph). So working
> with line-level deltas is efficient for source code but inefficient
> for Wikipedia data.

I have started to work on the blame code
to bring it down to the char level and learn about it.
I am willing to invest some time to learn how to make git better for
WMF; it is much more interesting than hacking PHP code.

Also, I have been able to use the mw-render code on the git archive;
you can see the results of the new version of my reader script here,
2 hours of reading the full article:

http://www.archive.org/details/KosovoWikipediaArticlesVideo

I am thinking of storing the Wikipedia articles in the intermediate
XML parse tree format from mw-render, if that would help the diff tools.

Another idea would be to allow editing of the articles with
OpenOffice, for example, and provide traceability in the document
structure back to the original article. It could be marked up with
blame information; even more, the blame information could be embedded
in each word, with an XML attribute. That would allow for exact
tracking of where the edits come from.
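
For example (purely hypothetical markup), every word could carry its
origin, something like:

<w rev="228104" editor="ExampleUser">Pristina</w>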

mike



wikimail at inbox

Oct 17, 2009, 7:05 AM

Post #20 of 37
Re: Wikipedia meets git

On Sat, Oct 17, 2009 at 4:40 AM, jamesmikedupont [at] googlemail
<jamesmikedupont [at] googlemail> wrote:
> I have started to work on the blame code
> to bring it down to the char level and learn about it.

Char level would probably make it too inefficient to merge deltas.
Treating a period followed by a space as a line separator would
probably be more efficient.
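
In Perl terms, something like this one-line sketch on the way into the
repository:

s/\. /.\n/g;    # treat ". " (end of sentence) as a line break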

The key to efficiency is to use skip deltas, though. You build a
binary tree so accessing any revision requires the application of only
log(n) deltas.

I asked whether or not you tried svn, because svn already uses skip deltas.
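
For the curious, here is a toy illustration of the skip-delta idea (it
assumes the clear-the-lowest-set-bit scheme described in the svn design
notes; these are revision numbers, not real deltas):

use strict;
use warnings;

# each revision is stored as a delta against the revision whose
# number is its own with the lowest set bit cleared
sub skip_delta_base {
    my $rev = shift;
    return $rev & ($rev - 1);
}

for my $rev (1, 9, 1000, 1_000_000) {
    my @chain;
    for (my $r = $rev; $r > 0; $r = skip_delta_base($r)) {
        push @chain, $r;
    }
    # the chain length equals the number of set bits in $rev,
    # so it grows like log(n) rather than n
    printf "rev %7d: %d deltas (%s)\n",
        $rev, scalar @chain, join(' -> ', @chain, 0);
}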

Is the idea that the entire file would need to be transferred over
the Internet, though? If so, I guess you wouldn't want to use skip
deltas; they greatly reduce access time to early revisions, but at a
slight space penalty.



jayvdb at gmail

Oct 17, 2009, 8:04 AM

Post #21 of 37
Re: Wikipedia meets git

On Sun, Oct 18, 2009 at 1:05 AM, Anthony <wikimail [at] inbox> wrote:
> Char level would probably make it too inefficient to merge deltas.
> Treating a period followed by a space as a line separator would
> probably be more efficient.
>
> The key to efficiency is to use skip deltas, though.  You build a
> binary tree so accessing any revision requires the application of only
> log(n) deltas.
>
> I asked whether or not you tried svn, because svn already uses skip deltas.

svn would be daft, for so many reasons.

> Is the idea that the entire file would need to be transferred over
> the Internet, though? If so, I guess you wouldn't want to use skip
> deltas; they greatly reduce access time to early revisions, but at a
> slight space penalty.

With git, parts of the checkout can be shallow clones.

--
John Vandenberg



wikimail at inbox

Oct 17, 2009, 8:23 AM

Post #22 of 37
Re: Wikipedia meets git

On Sat, Oct 17, 2009 at 11:04 AM, John Vandenberg <jayvdb [at] gmail> wrote:
> On Sun, Oct 18, 2009 at 1:05 AM, Anthony <wikimail [at] inbox> wrote:
>> I asked whether or not you tried svn, because svn already uses skip deltas.
>
> svn would be daft, for so many reasons.

Doesn't mean you can't learn from it.



jamesmikedupont at googlemail

Oct 17, 2009, 9:39 AM

Post #23 of 37
Re: Wikipedia meets git

See my new blog post, word-level blaming for Wikipedia via git and Perl:
http://fmtyewtk.blogspot.com/2009/10/mediawiki-git-word-level-blaming-one.html


The next step is ready:

1. I have a single script that will pull a given article and check the
revisions into git. It is not perfect, but it works.

http://bazaar.launchpad.net/~jamesmikedupont/+junk/wikiatransfer/revision/8

You run it like this, from inside a git repo:

perl GetRevisions.pl "Article_Name"

git blame Article_Name/Article.xml
git push origin master

The code that splits up the line is in Process File; this splits all
spaces into newlines, so that we get a word-level blame:

if ($insidetext)
{
    # split the line on spaces: each space becomes a backslash
    # (a continuation marker) followed by a newline, so every word
    # lands on its own line and git blame works per word
    s/(\ )/\\\n/g;

    print OUT $_;
}


The article is here:
http://github.com/h4ck3rm1k3/KosovoWikipedia/blob/master/Wiki/2008_Kosovo_declaration_of_independence/article.xml

Here are the blame results:
http://github.com/h4ck3rm1k3/KosovoWikipedia/blob/master/Wiki/2008_Kosovo_declaration_of_independence/wordblame.txt


The problem is that GitHub does not like this amount of processor
power being used and kills the process; you can do a local git blame
instead.

Now we have a tool to easily create a repository from Wikipedia, or
any other export-enabled MediaWiki.

mike



jayvdb at gmail

Oct 17, 2009, 9:53 AM

Post #24 of 37
Re: Wikipedia meets git

On Sun, Oct 18, 2009 at 3:39 AM, jamesmikedupont [at] googlemail
<jamesmikedupont [at] googlemail> wrote:
> See my new blog post, word-level blaming for Wikipedia via git and Perl:
> http://fmtyewtk.blogspot.com/2009/10/mediawiki-git-word-level-blaming-one.html
> ...
> The problem is that GitHub does not like this amount of processor
> power being used and kills the process; you can do a local git blame
> instead.
>
> Now we have a tool to easily create a repository from Wikipedia, or
> any other export-enabled MediaWiki.

Fantastic!

If you need more processing power, the toolserver may be willing to
give you an account in order to host it, if you can keep the repo
small enough, especially if you can provide a usable wikiblame tool.

http://meta.wikimedia.org/wiki/Toolserver

https://wiki.toolserver.org/view/Account_approval_process

--
John Vandenberg



jamesmikedupont at googlemail

Oct 17, 2009, 10:11 AM

Post #25 of 37
Re: Wikipedia meets git

Thanks,
I will apply for an account when it is ready for integration.

This is still in experimentation mode: the git repository replaces the
MySQL database, but there is a lot more work to do to make this
viable.

Thanks for all your encouragement and support.

mike



