
marco at harddisk
Jan 28, 2010, 11:21 AM
Post #12 of 12
On Thu, Jan 28, 2010 at 5:02 PM, Tei <oscar.vives [at] gmail> wrote:
> On 28 January 2010 15:06, 李琴 <qli [at] ica> wrote:
> > Hi all,
> > I have built a LocalWiki. Now I want its data to stay consistent with
> > Wikipedia, and one thing I need to do is get the updated data from
> > Wikipedia.
> > I get the URLs by analyzing the RSS feed
> > ( http://zh.wikipedia.org/w/index.php?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9&feed=rss )
> > and get the full HTML content of the edit box by opening each of these
> > URLs and clicking 'edit this page'.
> ....
> > Is that because I visit it too frequently and my IP address is
> > blocked, or because the network is too slow?
>
> 李琴, well... that's web scraping, which is a poor technique: it is
> error-prone and generates a lot of traffic.
>
> One thing a robot must do is read and follow the
> http://zh.wikipedia.org/robots.txt file (you should probably read it
> too).
> As a general rule of the Internet, a "rude" robot will be banned by the
> site admins.
>
> It would be a good idea to announce your bot as a bot in the user_agent
> string. Good bot behavior is to read a website the way a human would.
> I don't know, maybe 10 requests per minute? I don't know this
> "Wikipedia" site's rules about it.
>
> What you are suffering could be automatic or manual throttling, since
> an abusive number of requests was detected from your IP.
>
> "Wikipedia" seems to provide full dumps of its wiki, but they are
> unusable for you, since they are gigantic :-/. Trying to rebuild
> Wikipedia on your PC from a snapshot would be like summoning Cthulhu in
> a teapot. But.. I don't know, maybe the zh version is smaller, or your
> resources are powerful enough. One feels that what you have built has a
> severe overload (wastage of resources) and there must be better ways to
> do it...
>

Indeed there are.
What you need:

1) The Wikimedia IRC live feed. Last time I looked, it was at
   irc://irc.wikimedia.org/ and each project had its own channel.
2) A PHP IRC bot framework. Net_SmartIRC is well written and easy to get
   started with.
3) The page source, which you can EASILY get either in rendered form
   http://zh.wikipedia.org/w/index.php?title=TITLE&action=render
   or in raw form
   http://zh.wikipedia.org/w/index.php?title=TITLE&action=raw
   (this is the page source).

Marco
-- 
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de
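
To make point 3 concrete, here is a minimal sketch of building the action=raw / action=render URLs for a given title and fetching them politely. It is in Python rather than PHP purely for illustration. The function names, the bot name in the User-Agent string, and the 6-second delay (roughly the "10 requests per minute" guessed at in the quoted mail) are all assumptions, not Wikipedia policy; check robots.txt and the site's own rules before running anything like this.

```python
import time
import urllib.request
from urllib.parse import urlencode

BASE = "http://zh.wikipedia.org/w/index.php"
# Hypothetical bot identity; replace with your own name and contact address.
USER_AGENT = "LocalWikiSyncBot/0.1 (contact: you@example.org)"


def page_url(title, action="raw"):
    """Build the URL for the raw wikitext or the rendered HTML of a page."""
    if action not in ("raw", "render"):
        raise ValueError("action must be 'raw' or 'render'")
    # urlencode percent-encodes non-ASCII titles as UTF-8.
    return BASE + "?" + urlencode({"title": title, "action": action})


def fetch_page(title, action="raw", delay=6.0):
    """Wait `delay` seconds, then fetch one page, identifying as a bot."""
    time.sleep(delay)  # assumed throttle, about 10 requests per minute
    request = urllib.request.Request(
        page_url(title, action),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

A bot listening on the IRC feed would call fetch_page(title) only for titles announced as changed, instead of polling every page.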