
Mailing List Archive: Wikipedia: Foundation

[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

 

 



jamesmikedupont at googlemail

May 18, 2012, 2:14 AM

Post #26 of 28 (54 views)

Hello people,
I have finished uploading my first set of the osm/fosm dataset (350 GB
unpacked) to archive.org:
http://osmopenlayers.blogspot.de/2012/05/upload-finished.html

We can do something similar with Wikipedia. The bucket size on
archive.org is 10 GB, so we need to split the data up in a way that
keeps it useful. I did this by putting each object on one line; each
file contains its full data records plus the parts that belong to the
previous and next blocks, so each block can be processed almost
standalone.

mike
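
The splitting scheme described in the message above could be sketched
roughly as follows. This is only an illustration, not the script used for
the fosm upload: the function name split_into_blocks, the byte budget
max_bytes, and the fixed number of overlap records are all assumptions,
since the message does not say how the neighbouring parts were chosen.

```python
from typing import Dict, Iterable, List


def split_into_blocks(
    records: Iterable[str], max_bytes: int, overlap: int = 1
) -> List[Dict[str, List[str]]]:
    """Group one-record-per-line strings into size-limited blocks.

    Each block carries its own records plus `overlap` records copied from
    the previous and next blocks, so it can be processed almost standalone.
    (Hypothetical helper; the overlap rule is an assumption.)
    """
    cores: List[List[str]] = []
    current: List[str] = []
    size = 0
    for rec in records:
        rec_size = len(rec.encode("utf-8")) + 1  # +1 for the newline
        # Start a new block when adding this record would exceed the budget.
        # A single oversized record still gets a block of its own.
        if current and size + rec_size > max_bytes:
            cores.append(current)
            current, size = [], 0
        current.append(rec)
        size += rec_size
    if current:
        cores.append(current)

    blocks: List[Dict[str, List[str]]] = []
    for i, core in enumerate(cores):
        # Guard overlap > 0 because a slice of [-0:] would copy the whole list.
        prev_ctx = cores[i - 1][-overlap:] if i > 0 and overlap > 0 else []
        next_ctx = (
            cores[i + 1][:overlap] if i + 1 < len(cores) and overlap > 0 else []
        )
        blocks.append({"prev": prev_ctx, "records": core, "next": next_ctx})
    return blocks
```

Each returned block could then be written to its own file, with the
"prev" and "next" entries giving a reader enough context to resolve
references that cross a block boundary.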

_______________________________________________
Wikimedia-l mailing list
Wikimedia-l [at] lists
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


emijrp at gmail

May 18, 2012, 2:41 AM

Post #27 of 28 (57 views)

There is no such 10 GB limit; see
http://archive.org/details/ARCHIVETEAM-YV-6360017-6399947 (a 238 GB example).

ArchiveTeam/WikiTeam is uploading some dumps to the Internet Archive. If you
want to join the effort, use the mailing list
https://groups.google.com/group/wikiteam-discuss to avoid wasting resources.




--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
StatMediaWiki<http://statmediawiki.forja.rediris.es>
| WikiEvidens <http://code.google.com/p/wikievidens/> |
WikiPapers<http://wikipapers.referata.com>
| WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/


jamesmikedupont at googlemail

May 18, 2012, 3:12 AM

Post #28 of 28 (61 views)

There is no 10 GB limit, but according to my recent discussion with the
archive.org team, who have been helping me optimize the storage, it is
the recommended bucket size if you want to split up a file.
My idea is to make smaller blocks that can be fetched quickly, so that
someone reading an article, for example, loads only the data needed to
display it, available via JSON(P) or XML/text from a file.
We can host Wikipedia in read-only mode entirely on archive.org without
a database server by encoding the binary search trees as JSON data, also
stored on archive.org; the clients can then perform the searches
themselves.
That is my current research on fosm.org, and I hope it can apply to
Wikipedia as well.
mike
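
As a rough illustration of the idea above, a search tree can be stored
as plain JSON and walked by the client without any database server. This
is a minimal sketch under stated assumptions, not the fosm.org
implementation: the node layout ("key", "value", "left", "right"), the
function names, and the block file names are all hypothetical, and the
tree is kept in one in-memory document here rather than split across
files fetched from archive.org.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Node = Optional[Dict[str, Any]]


def build_tree(entries: List[Tuple[str, str]]) -> Node:
    """Build a balanced binary search tree from entries sorted by key.

    Each node is a JSON-serialisable dict (hypothetical layout).
    """
    if not entries:
        return None
    mid = len(entries) // 2
    key, value = entries[mid]
    return {
        "key": key,
        "value": value,
        "left": build_tree(entries[:mid]),
        "right": build_tree(entries[mid + 1:]),
    }


def lookup(tree: Node, key: str) -> Optional[str]:
    """Walk the decoded JSON tree on the client side; None if key is absent."""
    node = tree
    while node is not None:
        if key == node["key"]:
            return node["value"]
        node = node["left"] if key < node["key"] else node["right"]
    return None


# Hypothetical index mapping article titles to the block file holding them.
index = sorted([
    ("Berlin", "block-0001.json"),
    ("Kosovo", "block-0002.json"),
    ("Wikipedia", "block-0003.json"),
])
encoded = json.dumps(build_tree(index))  # what would be uploaded
decoded = json.loads(encoded)            # what a client would download
print(lookup(decoded, "Kosovo"))
```

In a real deployment the tree would probably be cut into several small
JSON files so that a client fetches only the nodes on its search path;
how to cut it is left open here.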




--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3

