
Mailing List Archive: Wikipedia: Wikitech

Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files



dvanliere at gmail

Aug 17, 2011, 9:58 AM

Post #1 of 3 (540 views)
Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

Hello!

Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard
on a customized stream-based InputFormatReader that allows Hadoop to parse
both bz2-compressed and uncompressed files of the full Wikipedia dump (the
dump files with complete edit histories). Prior to WikiHadoop and the
accompanying InputFormatReader, it was not possible to use Hadoop to analyze
the full Wikipedia dump files (see the detailed tutorial / background for an
explanation of why that was not possible).

This means:
1) We can now harness Hadoop's distributed computing capabilities to
analyze the full dump files.
2) You can send either one or two revisions to a single mapper, so it's
possible to diff two revisions and see what content has been added /
removed.
3) You can exclude namespaces by supplying a regular expression.
4) We are using Hadoop's Streaming interface, which means people can use
this InputFormatReader from different languages such as Java, Python, Ruby
and PHP.
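To give a flavor of point 2, here is a minimal sketch of a Streaming mapper
in Python. It assumes each mapper receives a record containing two revision
texts separated by a tab on stdin; the actual record format WikiHadoop
delivers may differ (see the tutorial), and the diffing here just uses
Python's standard difflib:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: diff two revisions of a page.
# Assumes each stdin line holds two revision texts separated by a tab;
# the real record format produced by WikiHadoop may differ.
import sys
import difflib

def diff_revisions(old, new):
    """Return (added_lines, removed_lines) of `new` relative to `old`."""
    added, removed = [], []
    for line in difflib.unified_diff(old.splitlines(),
                                     new.splitlines(),
                                     lineterm=""):
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    return added, removed

def main():
    for record in sys.stdin:
        parts = record.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # single-revision record: nothing to diff
        added, removed = diff_revisions(parts[0], parts[1])
        # Emit a simple count per revision pair; the reducer aggregates.
        print("added=%d\tremoved=%d" % (len(added), len(removed)))

if __name__ == "__main__":
    main()
```

Because Streaming only passes text over stdin/stdout, the same mapper logic
could equally be written in Ruby, PHP or Java.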

The source code is available at: https://github.com/whym/wikihadoop
A more detailed tutorial and installation guide is available at:
https://github.com/whym/wikihadoop/wiki
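For orientation, a Streaming job with WikiHadoop is launched roughly like
the sketch below. The jar paths, the mapper script name and the exact
InputFormat class name are placeholders here; check the wiki tutorial above
for the authoritative invocation:

```shell
# Sketch only: paths and the InputFormat class name are assumptions,
# not the confirmed WikiHadoop invocation -- consult the tutorial.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -libjars wikihadoop.jar \
  -D mapreduce.input.fileinputformat.inputdir=/dumps/enwiki-pages-meta-history.xml.bz2 \
  -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
  -mapper mapper.py \
  -reducer aggregate \
  -output /output/revision-diffs
```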


(Apologies for cross-posting to wikitech-l and wiki-research-l)

[0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/


Best,

Diederik
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


tfinc at wikimedia

Aug 17, 2011, 10:05 AM

Post #2 of 3 (517 views)
Re: Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

Very cool!

--tomasz






asharma at wikimedia

Aug 17, 2011, 10:12 AM

Post #3 of 3 (513 views)
Re: Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

Way cool - Look forward to a brown bag on this project - Diederik? :-)

-Alolita


