
Mailing List Archive: Wikipedia: Wikitech

XML dumps/Media mirrors update

 

 



ariel at wikimedia

May 17, 2012, 4:09 AM

Post #1 of 27
XML dumps/Media mirrors update

We now have three mirror sites, yay! The full list is linked to from
http://dumps.wikimedia.org/ and is also available at
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_Mirrors

Summarizing, we have:

C3L (Brazil) with the last 5 known good dumps,
Masaryk University (Czech Republic) with the last 5 known good dumps,
Your.org (USA) with the complete archive of dumps, and

for the latest version of uploaded media, Your.org with http/ftp/rsync
access.
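For readers who want to pull media from that mirror, a typical rsync invocation would look something like the sketch below. The module path shown is an assumption, not the mirror's confirmed layout; check the listing linked from http://dumps.wikimedia.org/ for the actual paths.

```shell
# Sketch: mirror the latest uploaded media from Your.org via rsync.
# NOTE: the module path "wikimedia-media" is hypothetical; consult the
# mirror listing linked from http://dumps.wikimedia.org/ for the real one.
rsync -av --partial --progress \
    rsync://ftpmirror.your.org/wikimedia-media/ \
    /data/wikimedia-media/
```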

Thanks to Carlos, Kevin and Yenya respectively at the above sites for
volunteering space, time and effort to make this happen.

As people noticed earlier, a series of media tarballs per-project
(excluding commons) is being generated. As soon as the first run of
these is complete we'll announce its location and start generating them
on a semi-regular basis.

As we've been getting the bugs out of the mirroring setup, it is getting
easier to add new locations. Know anyone interested? Please let us
know; we would love to have them.

Ariel


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


emijrp at gmail

May 17, 2012, 4:13 AM

Post #2 of 27
Re: XML dumps/Media mirrors update [In reply to]

Good work. We are finally approaching an indestructible corpus of
knowledge.



--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
StatMediaWiki<http://statmediawiki.forja.rediris.es>
| WikiEvidens <http://code.google.com/p/wikievidens/> |
WikiPapers<http://wikipapers.referata.com>
| WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/


jamesmikedupont at googlemail

May 17, 2012, 4:30 AM

Post #3 of 27
Re: XML dumps/Media mirrors update [In reply to]

Hi,
I am thinking about how to collect articles deleted under the "not
notable" criterion. Is there any way we can extract them from the MySQL
binlogs? How are these mirrors working? I would be interested in setting
up a mirror of deleted data, at least the part that is not
spam/vandalism, based on tags.
mike



--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3



ariel at wikimedia

May 17, 2012, 5:23 AM

Post #4 of 27
Re: XML dumps/Media mirrors update [In reply to]

There are a few other reasons articles get deleted: copyright issues,
personal identifying data, etc. This makes maintaining the sort of
mirror you propose problematic, although a similar mirror is here:
http://deletionpedia.dbatley.com/w/index.php?title=Main_Page

The dumps contain only data publicly available at the time of the run,
without deleted data.

The articles aren't permanently deleted, of course. The revision texts
live on in the database, so a query on toolserver, for example, could be
used to get at them, but that would need to be for research purposes.

Ariel





platonides at gmail

May 17, 2012, 9:20 AM

Post #5 of 27
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

On 17/05/12 14:23, Ariel T. Glenn wrote:
> The articles aren't permanently deleted of course.

And that is a much better way to retrieve them than the binlogs
(which are only kept for a short time anyway).

> The revision texts live on in the database,

> so a query on toolserver, for example, could be used to get at them,
> but that would need to be for research purposes.

Not really.
You could get a list of deleted titles/authors from the toolserver, but
not the page contents, which for some strange reason are not replicated
there (not even available to the roots).



emijrp at gmail

May 21, 2012, 11:18 AM

Post #6 of 27
Re: XML dumps/Media mirrors update [In reply to]

You can create a script that uses Special:Export to export all articles in
the deletion categories just before they are deleted.

Then import them into your "Deletionpedia".






jamesmikedupont at googlemail

May 21, 2012, 11:24 AM

Post #7 of 27
Re: XML dumps/Media mirrors update [In reply to]

Well, I would be happy for items like this:
http://en.wikipedia.org/wiki/Template:Db-a7
Would it be possible to extract them easily?
mike







emijrp at gmail

May 21, 2012, 12:11 PM

Post #8 of 27
Re: XML dumps/Media mirrors update [In reply to]

Create a script that makes a request to Special:Export, using this
category as the feed:
https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion

More info: https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
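A minimal sketch of that approach: list the members of the speedy-deletion category via the MediaWiki API, then build a Special:Export request for their XML. Parameter names follow the Manual:Parameters_to_Special:Export page linked above; the helper names here are illustrative, not from any of the scripts discussed in this thread.

```python
# Sketch: export candidates for speedy deletion via Special:Export.
import urllib.parse

API = "https://en.wikipedia.org/w/api.php"
EXPORT = "https://en.wikipedia.org/wiki/Special:Export"
CATEGORY = "Category:Candidates_for_speedy_deletion"

def category_members_query(category, limit=50):
    """Build the API query URL that lists pages in `category`."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def export_request(titles, full_history=True):
    """Build (url, form_data) for a Special:Export POST covering `titles`."""
    data = {
        "pages": "\n".join(titles),  # one title per line
        "action": "submit",
    }
    if not full_history:
        data["curonly"] = "1"        # current revision only
    return EXPORT, data

# Example: the query URL for the category, and an export request for one page.
print(category_members_query(CATEGORY))
print(export_request(["Example_article"])[1]["pages"])
```

Fetching the two URLs (with any HTTP client) and feeding the returned titles into `export_request` yields the XML to archive.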






jamesmikedupont at googlemail

May 21, 2012, 12:21 PM

Post #9 of 27
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Thanks! I'll run that once per day; they don't get deleted that quickly.
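A once-a-day run could be scheduled with a crontab entry along these lines (the script path and interpreter are hypothetical placeholders):

```shell
# Run the export script daily at 03:00; path is hypothetical.
0 3 * * * /usr/bin/python /home/mike/export_deleted.py >> /var/log/export_deleted.log 2>&1
```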
mike







jamesmikedupont at googlemail

May 28, 2012, 12:40 PM

Post #10 of 27
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

The first version of the script is ready. It gets the versions, puts
them in a zip, and puts that on archive.org:
https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_deleted.py

Here is an example of the output:
http://archive.org/details/wikipedia-delete-2012-05
http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/archive2012-05-28T21:34:02.302183.zip

I will cron this, and it should give a start on saving deleted data.
Articles will be exported once a day, even if they were exported
yesterday, as long as they are in one of the categories.
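The bundling step described above can be sketched roughly as follows: collect the exported XML files into a zip whose name carries an ISO timestamp, matching the `archive<timestamp>.zip` naming seen in the example output. The upload to archive.org itself is left out, and the directory layout is an assumption, not taken from the actual script.

```python
# Sketch: bundle exported XML files into a timestamped zip for upload.
import datetime
import zipfile
from pathlib import Path

def bundle_exports(xml_dir, out_dir):
    """Zip every .xml file in xml_dir into archive<ISO-timestamp>.zip."""
    timestamp = datetime.datetime.now().isoformat()
    zip_path = Path(out_dir) / f"archive{timestamp}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for xml_file in sorted(Path(xml_dir).glob("*.xml")):
            # Store under the bare filename, not the full source path.
            zf.write(xml_file, arcname=xml_file.name)
    return zip_path
```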

mike







admin at alphacorp

May 28, 2012, 6:52 PM

Post #11 of 27
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

This is quite nice, though the item's metadata is a bit sparse :)

On Tue, May 29, 2012 at 3:40 AM, Mike Dupont <jamesmikedupont [at] googlemail
> wrote:

> first version of the Script is ready , it gets the versions, puts them
> in a zip and puts that on archive.org
> https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_deleted.py
>
> here is an example output :
> http://archive.org/details/wikipedia-delete-2012-05
>
> http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/archive2012-05-28T21:34:02.302183.zip
>
> I will cron this, and it should give a start of saving deleted data.
> Articles will be exported once a day, even if they they were exported
> yesterday as long as they are in one of the categories.
>
> mike



--
Regards,
Hydriz

We've created the greatest collection of shared knowledge in history. Help
protect Wikipedia. Donate now: http://donate.wikimedia.org


jamesmikedupont at googlemail

May 28, 2012, 8:08 PM

Post #12 of 27 (1931 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Well, I have now updated the script to include the XML dump in raw format.
I will have to add more information to the archive.org item, at least a
basic README. The other issue is that pywikipediabot does not seem to
support full-history exports, so I will have to move over to the WikiTeam
version and rework it. I just spent two hours on this, so I am pretty happy
with the first version.

mike
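For reference, the full-history export that pywikipediabot was missing can be requested directly from Special:Export. This is only a minimal sketch, not the actual export_deleted.py code; the helper names are made up here, but the form parameters follow Manual:Parameters_to_Special:Export:

```python
# Hedged sketch: build a full-history Special:Export request by hand.
# The helpers and their split into build/fetch are illustrative only.
import urllib.parse
import urllib.request

def build_export_request(title, site="https://en.wikipedia.org"):
    """Return (url, form body) for a POST that exports every revision of a page."""
    body = urllib.parse.urlencode({
        "pages": title,        # one title per line; a single page here
        "history": "1",        # all revisions, not just the current one
        "action": "submit",
    })
    return site + "/wiki/Special:Export", body

def fetch_export(title):
    """POST the request and return the raw <mediawiki> XML (needs network access)."""
    url, body = build_export_request(title)
    req = urllib.request.Request(url, data=body.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Dropping the "history" parameter would return only the latest revision, which is the behavior being worked around above.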

On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia <admin [at] alphacorp> wrote:
> This is quite nice, though the item's metadata is too little :)



--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3



jamesmikedupont at googlemail

May 29, 2012, 11:26 PM

Post #13 of 27 (1924 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Ok, I merged the code from WikiTeam and now have a full-history dump script
that uploads to archive.org. The next step is to fix the bucket metadata in
the script.
mike
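For anyone curious what fixing the bucket metadata involves: archive.org's S3-style upload API takes item metadata as x-archive-meta-* request headers at upload time. A hedged sketch follows; the header names come from the IA S3 API, but the helper and the concrete values are illustrative, not what the script actually sends:

```python
# Sketch of archive.org S3-style upload headers. Header names follow the
# IA S3 API; the helper itself and the example values are illustrative.
def ia_upload_headers(title, description, mediatype="web"):
    """Headers that create the item's bucket and set basic metadata."""
    return {
        "x-archive-auto-make-bucket": "1",      # create the item if missing
        "x-archive-meta-title": title,
        "x-archive-meta-description": description,
        "x-archive-meta-mediatype": mediatype,
    }
```

These headers would be attached to the PUT request that uploads the zip, so the item gets a title and description instead of the bare defaults noted earlier in the thread.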




--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3



jamesmikedupont at googlemail

May 29, 2012, 11:26 PM

Post #14 of 27 (1925 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

The code is here: https://github.com/h4ck3rm1k3/wikiteam




--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3



sterkebak at gmail

May 30, 2012, 1:52 AM

Post #15 of 27 (1942 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

I'm still interested in running a mirror as well, as noted on Meta and sent
out earlier by mail.

I'm just wondering: why is there no rsync access from the main server?
It's "strange" that we need to rsync from a mirror.

--
Kind regards,

Huib Laurens

Certified cPanel Specialist
Certified Kaspersky Specialist

WickedWay Webhosting, webhosting the wicked way!

www.wickedway.nl - www.wickedway.be


admin at alphacorp

May 30, 2012, 1:54 AM

Post #16 of 27 (1934 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Eh, mirrors rsync directly from dataset1001.wikimedia.org; running
"rsync dataset1001.wikimedia.org::" lists the available modules.

However, the server limits rsync access to registered mirrors only, to
prevent others from pulling directly from Wikimedia.
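On an approved mirror host, the sync job is typically just rsync on a cron schedule. A hedged sketch in the same spirit as the thread's other scripts; the module name "dumps" is a placeholder, since the real module list comes from querying the server itself:

```python
# Hedged sketch of a mirror-side sync job. The rsync module name "dumps"
# and the destination path are placeholders, not the server's real layout.
import subprocess

def build_sync_command(dest="/srv/dumps", module="dumps"):
    """Assemble the rsync invocation without running it."""
    return [
        "rsync", "-av", "--delete", "--partial",
        "dataset1001.wikimedia.org::%s/" % module,
        dest.rstrip("/") + "/",
    ]

def run_sync(dest="/srv/dumps"):
    """Run the sync; this only succeeds from a host on the mirror allow-list."""
    subprocess.check_call(build_sync_command(dest))
```

--delete keeps the mirror from accumulating dumps the master has rotated out, and --partial lets interrupted transfers of the large files resume.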




--
Regards,
Hydriz

We've created the greatest collection of shared knowledge in history. Help
protect Wikipedia. Donate now: http://donate.wikimedia.org
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


sterkebak at gmail

May 30, 2012, 1:58 AM

Post #17 of 27 (1936 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Ok, cool.

And how do I get Wikimedia to allow our IP to rsync?

Best,

Huib




--
Kind regards,

Huib Laurens
WickedWay.nl

Webhosting the wicked way.
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


admin at alphacorp

May 30, 2012, 1:59 AM

Post #18 of 27 (1932 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Ariel will do that :)

BTW, just dig around in their puppet configuration repository on Gerrit and
you can learn more :)




--
Regards,
Hydriz

We've created the greatest collection of shared knowledge in history. Help
protect Wikipedia. Donate now: http://donate.wikimedia.org
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


sterkebak at gmail

May 30, 2012, 2:16 AM

Post #19 of 27 (1937 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Ok.

I mailed Ariel about this; if all goes well I can have the mirror up and
running by Friday.

Best,
Huib




--
Kind regards,

Huib Laurens
WickedWay.nl

Webhosting the wicked way.
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


admin at alphacorp

May 30, 2012, 2:18 AM

Post #20 of 27 (1925 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Do you have a URL you can reveal so that some of us can have a sneak
peek? :P




--
Regards,
Hydriz

We've created the greatest collection of shared knowledge in history. Help
protect Wikipedia. Donate now: http://donate.wikimedia.org
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


sterkebak at gmail

May 30, 2012, 2:35 AM

Post #21 of 27 (1924 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Sure :)

http://mirror.fr.wickedway.nl

Later on we will duplicate this to a Dutch mirror as well :)

Best,
Huib




--
Kind regards,

Huib Laurens
WickedWay.nl

Webhosting the wicked way.
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jamesmikedupont at googlemail

Jun 1, 2012, 4:28 PM

Post #22 of 27 (1911 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

I now have the cron archiving running every 30 minutes:
http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/
It is amazing how fast things get deleted on Wikipedia.
What about proposed deletions? Are there categories for those?
thanks
mike
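The approach suggested earlier in the thread (feed Special:Export from a deletion category) can be sketched roughly as below. This is not the actual wikiteam/pywikipediabot script: the helper names are mine, and the parameters are a simplification of Manual:Parameters_to_Special:Export (category members fetched via the API, then POSTed to Special:Export with full history).

```python
# Sketch, not the real archiving script: list pages in a deletion category
# via the MediaWiki API, then export them (full history) via Special:Export.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"
EXPORT = "https://en.wikipedia.org/wiki/Special:Export"

def category_members(category, limit=50):
    """Return page titles in a category via list=categorymembers."""
    params = urlencode({
        "action": "query", "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit, "format": "json",
    })
    with urlopen(API + "?" + params) as resp:
        data = json.load(resp)
    return [m["title"] for m in data["query"]["categorymembers"]]

def export_body(titles, history=True):
    """POST body for Special:Export: one title per line, full history."""
    params = {"pages": "\n".join(titles)}
    params["history" if history else "curonly"] = "1"
    return urlencode(params).encode()

def export_pages(titles, out_path):
    """Write the Special:Export XML for the given titles to out_path."""
    with urlopen(EXPORT, data=export_body(titles)) as resp, \
         open(out_path, "wb") as out:
        out.write(resp.read())

# Example: export_pages(
#     category_members("Candidates for speedy deletion"), "speedy.xml")
```

A cron job running this once every 30 minutes, followed by an upload to archive.org, matches the workflow described above.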




--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


orenbochman at gmail

Jun 5, 2012, 5:44 AM

Post #23 of 27 (1902 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

Any chance these archives could be served via BitTorrent, so that even partial downloaders become seeders? Leveraging p2p would reduce the overall bandwidth load on the servers and improve download speeds.


-----Original Message-----
From: wikitech-l-bounces [at] lists [mailto:wikitech-l-bounces [at] lists] On Behalf Of Mike Dupont
Sent: Saturday, June 02, 2012 1:28 AM
To: Wikimedia developers; wikiteam-discuss [at] googlegroups
Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update

I have run cron archiving now every 30 minutes, http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/
it is amazing how fast the stuff gets deleted on wikipedia.
what about the proposed deletes are there categories for that?
thanks
mike

On Wed, May 30, 2012 at 6:26 AM, Mike Dupont <jamesmikedupont [at] googlemail> wrote:
> https://github.com/h4ck3rm1k3/wikiteam code here
>
> On Wed, May 30, 2012 at 6:26 AM, Mike Dupont
> <jamesmikedupont [at] googlemail> wrote:
>> Ok, I merged the code from wikteam and have a full history dump
>> script that uploads to archive.org, next step is to fix the bucket
>> metadata in the script mike
>>
>> On Tue, May 29, 2012 at 3:08 AM, Mike Dupont
>> <jamesmikedupont [at] googlemail> wrote:
>>> Well, I have now updated the script to include the xml dump in raw
>>> format. I will have to add more information the achive.org item, at
>>> least a basic readme.
>>> other thing is that the wikipybot does not support the full history
>>> it seems, so that I will have to move over to the wikiteam version
>>> and rework it, I just spent 2 hours on this so i am pretty happy for
>>> the first version.
>>>
>>> mike
>>>
>>> On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia <admin [at] alphacorp> wrote:
>>>> This is quite nice, though the item's metadata is too little :)
>>>>
>>>> On Tue, May 29, 2012 at 3:40 AM, Mike Dupont
>>>> <jamesmikedupont [at] googlemail
>>>>> wrote:
>>>>
>>>>> first version of the Script is ready , it gets the versions, puts
>>>>> them in a zip and puts that on archive.org
>>>>> https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_de
>>>>> leted.py
>>>>>
>>>>> here is an example output :
>>>>> http://archive.org/details/wikipedia-delete-2012-05
>>>>>
>>>>> http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/a
>>>>> rchive2012-05-28T21:34:02.302183.zip
>>>>>
>>>>> I will cron this, and it should give a start of saving deleted data.
>>>>> Articles will be exported once a day, even if they they were
>>>>> exported yesterday as long as they are in one of the categories.
>>>>>
>>>>> mike
>>>>>
>>>>> On Mon, May 21, 2012 at 7:21 PM, Mike Dupont
>>>>> <jamesmikedupont [at] googlemail> wrote:
>>>>> > Thanks! and run that 1 time per day, they dont get deleted that quickly.
>>>>> > mike
>>>>> >
>>>>> > On Mon, May 21, 2012 at 9:11 PM, emijrp <emijrp [at] gmail> wrote:
>>>>> >> Create a script that makes a request to Special:Export using
>>>>> >> this
>>>>> category
>>>>> >> as feed
>>>>> >> https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_de
>>>>> >> letion
>>>>> >>
>>>>> >> More info
>>>>> https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
>>>>> >>
>>>>> >>
>>>>> >> 2012/5/21 Mike Dupont <jamesmikedupont [at] googlemail>
>>>>> >>>
>>>>> >>> Well I whould be happy for items like this :
>>>>> >>> http://en.wikipedia.org/wiki/Template:Db-a7
>>>>> >>> would it be possible to extract them easily?
>>>>> >>> mike
>>>>> >>>
>>>>> >>> On Thu, May 17, 2012 at 2:23 PM, Ariel T. Glenn
>>>>> >>> <ariel [at] wikimedia>
>>>>> >>> wrote:
>>>>> >>> > There's a few other reasons articles get deleted: copyright
>>>>> >>> > issues, personal identifying data, etc. This makes
>>>>> >>> > maintaning the sort of mirror you propose problematic, although a similar mirror is here:
>>>>> >>> > http://deletionpedia.dbatley.com/w/index.php?title=Main_Page
>>>>> >>> >
>>>>> >>> > The dumps contain only data publically available at the time
>>>>> >>> > of the
>>>>> run,
>>>>> >>> > without deleted data.
>>>>> >>> >
>>>>> >>> > The articles aren't permanently deleted of course. The
>>>>> >>> > revisions
>>>>> texts
>>>>> >>> > live on in the database, so a query on toolserver, for
>>>>> >>> > example,
>>>>> could be
>>>>> >>> > used to get at them, but that would need to be for research purposes.
>>>>> >>> >
>>>>> >>> > Ariel
>>>>> >>> >
>>>>> >>> > Στις 17-05-2012, ημέρα Πεμ, και ώρα 13:30 +0200, ο/η Mike
>>>>> >>> > Dupont
>>>>> έγραψε:
>>>>> >>> >> Hi,
>>>>> >>> >> I am thinking about how to collect articles deleted based
>>>>> >>> >> on the
>>>>> "not
>>>>> >>> >> notable" criteria,
>>>>> >>> >> is there any way we can extract them from the mysql
>>>>> >>> >> binlogs? how are these mirrors working? I would be
>>>>> >>> >> interested in setting up a mirror
>>>>> of
>>>>> >>> >> deleted data, at least that which is not spam/vandalism
>>>>> >>> >> based on
>>>>> tags.
>>>>> >>> >> mike
>>>>> >>> >>



--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l




datzrott at alizeepathology

Jun 5, 2012, 5:57 AM

Post #24 of 27 (1908 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

I second this idea. Large archives should always be available over BitTorrent. I would suggest posting magnet links for them, though, instead of .torrent files; that way you can leverage the acceptable-source feature of magnet links.

https://en.wikipedia.org/wiki/Magnet_URI_scheme#Web_links_to_the_file

This way we get the best of both worlds: the constant availability of direct downloads and the reduction in server load that p2p file sharing provides.
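The acceptable-source idea can be sketched concretely: a magnet link carries the BitTorrent info hash in `xt` and can list plain HTTP mirrors in `as` parameters, so clients can fetch over HTTP and still verify against the hash. A minimal sketch in Python; the info hash and mirror URL below are hypothetical placeholders, not values for any real dump:

```python
from urllib.parse import urlencode

def build_magnet(btih: str, name: str, direct_urls: list) -> str:
    """Build a magnet URI that also advertises direct HTTP mirrors
    via the 'as' (acceptable source) parameter."""
    params = [("dn", name)]
    # Each mirror is an "acceptable source": clients may fetch the
    # payload over plain HTTP and verify it against the info hash.
    params += [("as", u) for u in direct_urls]
    return "magnet:?xt=urn:btih:%s&%s" % (btih, urlencode(params))

# Hypothetical hash and mirror URL, for illustration only.
link = build_magnet(
    "c12fe1c06bba254a9dc9f519b335aa7c1367a88a",
    "enwiki-20120601-pages-articles.xml.bz2",
    ["http://dumps.wikimedia.org/enwiki/20120601/enwiki-20120601-pages-articles.xml.bz2"],
)
```

A link like this could sit next to each dump listing; clients that do not speak BitTorrent can still follow the `as` URL directly.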

Thank you,
Derric Atzrott

-----Original Message-----
From: wikitech-l-bounces [at] lists [mailto:wikitech-l-bounces [at] lists] On Behalf Of Oren Bochman
Sent: 05 June 2012 08:44
To: 'Wikimedia developers'
Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update

Any chance that these archives can be served via BitTorrent - so that even partial downloaders can become servers - leveraging p2p to reduce overall bandwidth load on the servers and decrease download times?


-----Original Message-----
From: wikitech-l-bounces [at] lists [mailto:wikitech-l-bounces [at] lists] On Behalf Of Mike Dupont
Sent: Saturday, June 02, 2012 1:28 AM
To: Wikimedia developers; wikiteam-discuss [at] googlegroups
Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update

I now have the cron archiving running every 30 minutes: http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/
It is amazing how fast things get deleted on Wikipedia.
What about proposed deletions - are there categories for those?
thanks
mike
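For reference, the export approach suggested earlier in the thread boils down to URL construction against the MediaWiki API and Special:Export. A sketch under stated assumptions: the "Proposed_deletion" category name is illustrative (on enwiki, proposed deletions sit in dated subcategories), and a real harvester would page through results and handle errors:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"
EXPORT = "https://en.wikipedia.org/wiki/Special:Export"

def category_members_url(category: str, limit: int = 500) -> str:
    # List the pages currently sitting in a deletion category.
    return API + "?" + urlencode({
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "format": "json",
    })

def export_url(titles: list) -> str:
    # Special:Export takes newline-separated titles; history=1 asks
    # for the full revision history, not just the current text.
    return EXPORT + "?" + urlencode({
        "pages": "\n".join(titles),
        "history": "1",
    })

speedy_url = category_members_url("Candidates_for_speedy_deletion")
# Assumed name for illustration; check the wiki for the exact
# category used for proposed deletions.
prod_url = category_members_url("Proposed_deletion")
```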

On Wed, May 30, 2012 at 6:26 AM, Mike Dupont <jamesmikedupont [at] googlemail> wrote:
> https://github.com/h4ck3rm1k3/wikiteam code here
>
> On Wed, May 30, 2012 at 6:26 AM, Mike Dupont
> <jamesmikedupont [at] googlemail> wrote:
>> Ok, I merged the code from wikiteam and have a full-history dump
>> script that uploads to archive.org; the next step is to fix the bucket
>> metadata in the script. mike
>>
>> On Tue, May 29, 2012 at 3:08 AM, Mike Dupont
>> <jamesmikedupont [at] googlemail> wrote:
>>> Well, I have now updated the script to include the XML dump in raw
>>> format. I will have to add more information to the archive.org item,
>>> at least a basic readme.
>>> The other thing is that the wikipybot does not seem to support the
>>> full history, so I will have to move over to the wikiteam version and
>>> rework it. I just spent 2 hours on this, so I am pretty happy with
>>> the first version.
>>>
>>> mike
>>>
>>> On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia <admin [at] alphacorp> wrote:
>>>> This is quite nice, though the item's metadata is too little :)
>>>>
>>>> On Tue, May 29, 2012 at 3:40 AM, Mike Dupont
>>>> <jamesmikedupont [at] googlemail
>>>>> wrote:
>>>>
>>>>> The first version of the script is ready; it gets the versions,
>>>>> puts them in a zip, and puts that on archive.org:
>>>>> https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_deleted.py
>>>>>
>>>>> here is an example output :
>>>>> http://archive.org/details/wikipedia-delete-2012-05
>>>>>
>>>>> http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/archive2012-05-28T21:34:02.302183.zip
>>>>>
>>>>> I will cron this, and it should give a start of saving deleted data.
>>>>> Articles will be exported once a day, even if they were
>>>>> exported yesterday, as long as they are in one of the categories.
>>>>>
>>>>> mike
>>>>>
>>>>> On Mon, May 21, 2012 at 7:21 PM, Mike Dupont
>>>>> <jamesmikedupont [at] googlemail> wrote:
>>>>> > Thanks! And run that once per day; they don't get deleted that quickly.
>>>>> > mike
>>>>> >
>>>>> > On Mon, May 21, 2012 at 9:11 PM, emijrp <emijrp [at] gmail> wrote:
>>>>> >> Create a script that makes a request to Special:Export using
>>>>> >> this category as a feed:
>>>>> >> https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion
>>>>> >>
>>>>> >> More info:
>>>>> >> https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
>>>>> >>
>>>>> >>
>>>>> >> 2012/5/21 Mike Dupont <jamesmikedupont [at] googlemail>
>>>>> >>>
>>>>> >>> Well, I would be happy for items like this:
>>>>> >>> http://en.wikipedia.org/wiki/Template:Db-a7
>>>>> >>> Would it be possible to extract them easily?
>>>>> >>> mike
>>>>> >>>


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l




ariel at wikimedia

Jun 5, 2012, 6:15 AM

Post #25 of 27 (1915 views)
Permalink
Re: [Xmldatadumps-l] XML dumps/Media mirrors update [In reply to]

This is a place where volunteers can step in and make it happen without
the need for Wikimedia's infrastructure. (This means I can concentrate
on my already very full plate of things too.)

http://meta.wikimedia.org/wiki/Data_dump_torrents

Have at!
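For anyone picking this up: a .torrent for a dump can advertise the existing HTTP mirrors as web seeds (the `url-list` key, BEP 19), so the mirrors keep serving even when few peers are online. A minimal, self-contained sketch; the tracker and mirror URLs are placeholders:

```python
import hashlib
import os

def bencode(obj) -> bytes:
    # Minimal bencoder covering the types a .torrent file needs.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode("utf-8"))
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # Spec wants keys sorted bytewise; plain sort is fine for ASCII keys.
        return b"d" + b"".join(bencode(k) + bencode(obj[k]) for k in sorted(obj)) + b"e"
    raise TypeError("cannot bencode %r" % type(obj))

def make_torrent(path: str, tracker: str, webseeds: list, piece_len: int = 2**20) -> bytes:
    # SHA-1 each fixed-size piece of the file, as BitTorrent requires.
    pieces = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(piece_len)
            if not chunk:
                break
            pieces += hashlib.sha1(chunk).digest()
    return bencode({
        "announce": tracker,
        "url-list": webseeds,  # BEP 19: HTTP mirrors acting as web seeds
        "info": {
            "name": os.path.basename(path),
            "length": os.path.getsize(path),
            "piece length": piece_len,
            "pieces": pieces,
        },
    })
```

Tools like mktorrent support web seeds from the command line as well; the point is just that web seeds let a torrent bootstrap entirely from the existing dump mirrors.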

Ariel

On Tue, 05-06-2012 at 08:57 -0400, Derric Atzrott wrote:
> I second this idea. Large archives should always be available over BitTorrent. I would suggest posting magnet links for them, though, instead of .torrent files; that way you can leverage the acceptable-source feature of magnet links.
>
> https://en.wikipedia.org/wiki/Magnet_URI_scheme#Web_links_to_the_file
>
> This way we get the best of both worlds: the constant availability of direct downloads and the reduction in server load that p2p file sharing provides.
>
> Thank you,
> Derric Atzrott
>



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
