
Mailing List Archive: Python: Python

How do I correctly download Wikipedia pages?

 

 



steven at REMOVE

Nov 25, 2009, 7:45 PM

Post #1 of 6 (1010 views)
How do I correctly download Wikipedia pages?

I'm trying to scrape a Wikipedia page from Python. Following instructions
here:

http://en.wikipedia.org/wiki/Wikipedia:Database_download
http://en.wikipedia.org/wiki/Special:Export

I use the URL "http://en.wikipedia.org/wiki/Special:Export/Train" instead
of just "http://en.wikipedia.org/wiki/Train". But instead of getting the
page I expect, and can see in my browser, I get an error page:


>>> import urllib
>>> url = "http://en.wikipedia.org/wiki/Special:Export/Train"
>>> print urllib.urlopen(url).read()
...
Our servers are currently experiencing a technical problem. This is
probably temporary and should be fixed soon
...


(Output is obviously truncated for your sanity and mine.)


Is there a trick to downloading from Wikipedia with urllib?



--
Steven
--
http://mail.python.org/mailman/listinfo/python-list


kursat.kutlu at gmail

Nov 25, 2009, 7:58 PM

Post #2 of 6 (984 views)
Re: How do I correctly download Wikipedia pages? [In reply to]

Hi,

Try not to be caught if you send multiple requests :)

Have a look here: http://wolfprojects.altervista.org/changeua.php

Regards
Kutlu

On Nov 26, 5:45 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:
> I'm trying to scrape a Wikipedia page from Python. Following instructions
> here:
>
> http://en.wikipedia.org/wiki/Wikipedia:Database_download
> http://en.wikipedia.org/wiki/Special:Export
>
> I use the URL "http://en.wikipedia.org/wiki/Special:Export/Train" instead
> of just "http://en.wikipedia.org/wiki/Train". But instead of getting the
> page I expect, and can see in my browser, I get an error page:
>
> >>> import urllib
> >>> url = "http://en.wikipedia.org/wiki/Special:Export/Train"
> >>> print urllib.urlopen(url).read()
>
> ...
> Our servers are currently experiencing a technical problem. This is
> probably temporary and should be fixed soon
> ...
>
> (Output is obviously truncated for your sanity and mine.)
>
> Is there a trick to downloading from Wikipedia with urllib?
>
> --
> Steven



apt.shansen at gmail

Nov 25, 2009, 8:04 PM

Post #3 of 6 (983 views)
Re: How do I correctly download Wikipedia pages? [In reply to]

2009/11/25 Steven D'Aprano <steven [at] remove>

> I'm trying to scrape a Wikipedia page from Python. Following instructions
> here:
>
>
Have you checked out http://meta.wikimedia.org/wiki/Pywikipediabot?

It's not just via urllib, but I've scraped several MediaWiki-based sites with
the software successfully.

--S


taskinoor.hasan at csebuet

Nov 25, 2009, 8:37 PM

Post #4 of 6 (975 views)
Re: How do I correctly download Wikipedia pages? [In reply to]

I ran into a different problem. Whenever I tried to fetch any page from
Wikipedia, I received a 403. Then I found that Wikipedia doesn't accept the
default user-agent (probably Python-urllib/2.x or something like it). After
setting my own user-agent, it worked fine. You can try this if you receive
a 403.
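
For anyone reading this later, here is a minimal sketch of setting your own user-agent. It assumes Python 3's urllib.request rather than the Python 2 urllib used earlier in the thread. The USER_AGENT string and the helper names build_request and fetch are made up for illustration; they are not taken from the linked page or from any Wikipedia documentation.

```python
import urllib.request

# Hypothetical identifier for illustration only; use a string that
# describes your own script and how to reach you.
USER_AGENT = "ExampleWikiFetcher/0.1 (contact: you@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=30):
    """Download url and return the response body as bytes."""
    request = build_request(url, user_agent)
    # urlopen raises urllib.error.HTTPError on responses such as 403.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://en.wikipedia.org/wiki/Special:Export/Train") should then send the custom header instead of the library default. Whether the server accepts it depends on its current policy, so treat this as a starting point rather than a guaranteed fix.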

On Thu, Nov 26, 2009 at 10:04 AM, Stephen Hansen <apt.shansen [at] gmail> wrote:

> 2009/11/25 Steven D'Aprano <steven [at] remove>
>
>> I'm trying to scrape a Wikipedia page from Python. Following instructions
>> here:
>
> Have you checked out http://meta.wikimedia.org/wiki/Pywikipediabot?
>
> It's not just via urllib, but I've scraped several MediaWiki-based sites
> with the software successfully.
>
> --S


steven at REMOVE

Nov 25, 2009, 8:59 PM

Post #5 of 6 (932 views)
Re: How do I correctly download Wikipedia pages? [In reply to]

On Wed, 25 Nov 2009 19:58:57 -0800, ShoqulKutlu wrote:

> Hi,
>
> Try not to be caught if you send multiple requests :)
>
> Have a look here: http://wolfprojects.altervista.org/changeua.php

Thanks, that seems to work perfectly.


--
Steven


cousinstanley at gmail

Nov 26, 2009, 9:38 AM

Post #6 of 6 (972 views)
Re: How do I correctly download Wikipedia pages? [In reply to]

> I'm trying to scrape a Wikipedia page from Python.
> ....

On occasion I use a program under Debian Linux
called wikipedia2text that is very handy
for downloading Wikipedia pages as plain text files ....

Description: displays Wikipedia articles on the command line

This script fetches Wikipedia articles (currently supports
around 30 Wikipedia languages) and displays them as plain text
in a pager or just sends the text to standard out. Alternatively
it opens the Wikipedia article in a (possibly GUI) web browser
or just shows the URL of the appropriate Wikipedia article.

Example directed through the lynx browser ....

wp2t -b lynx gorilla > gorilla.txt


--
Stanley C. Kitching
Human Being
Phoenix, Arizona

