lars at aronsson
Apr 11, 2012, 3:31 AM
On 04/11/2012 01:45 AM, Erik Zachte wrote:
> Here are some numbers on total bot burden:
> http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states
> for March 2012:
> In total 69.5 M page requests (mime type text/html only!) per day are
> considered crawler requests, out of 696 M page requests (10.0%) or 469 M
> external page requests (14.8%). About half (35.1 M) of crawler requests come
> from Google.
The fraction will be larger than average (larger than 10%) for
a) sites with many small pages (Wiktionary) and
b) sites in languages with a smaller audience (Swedish sites).
Bots will index these pages as they are found, but each
of these pages can expect fewer search hits and less human
traffic than long articles (Wikipedia) in languages with many
speakers (English). The bot traffic is like a constant
background noise, and the human traffic is the signal on top.
Sites with many small pages and a small audience will have
a lower signal-to-noise ratio. The long tail of seldom
visited pages is drowning in that noise.
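
A minimal sketch of the signal-to-noise argument, using only the Wikimedia-wide figures quoted above (millions of page requests per day). The function names and the "one tenth of the human traffic" scenario are hypothetical illustrations, not measurements:

```python
def bot_share(human_views, bot_views):
    """Fraction of all page views that come from bots (the 'noise')."""
    total = human_views + bot_views
    return bot_views / total if total else 0.0


def signal_to_noise(human_views, bot_views):
    """Human page views per bot page view."""
    return human_views / bot_views


# Figures quoted above for March 2012, in millions of page requests per day.
wm_bot = 69.5
wm_human = 696 - 69.5

print("average bot share:", round(bot_share(wm_human, wm_bot), 3))
print("average signal-to-noise:", round(signal_to_noise(wm_human, wm_bot), 1))

# Hypothetical: same crawl load, but only one tenth of the human traffic,
# as might happen on a site with many small pages and a small audience.
print("bot share at 1/10 human traffic:",
      round(bot_share(wm_human / 10, wm_bot), 3))
```

If the bot load stays constant while human traffic shrinks, the bot share rises well above the 10% average, which is the point made above.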
I should disclose that I "work for the competition". I tried
to add books to Wikisource, but its complexity slowed me down,
so I'm now focusing on my own Scandinavian book scanning
website, Project Runeberg, http://runeberg.org/
It has 700,000 scanned book pages, the same size as the
English Wikisource. That is a large number of pages for
a small language audience (mostly Swedish). Yesterday,
April 10, its Apache access log had 291,000 hits, of which
116,000 were HTML pages. Of those, 71,000 matched
bot/spider/crawler, leaving only 45,000 human page views.
I'd be surprised if Swedish Wikisource, which is 1/20 that
size, got 10-13 thousand human page views per day, or 1/4
of that web traffic. It is more likely that a similar share,
71/116 = 61%, is bot traffic.
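
A rough sketch of how such a count could be made from an access log. It assumes Apache's "combined" log format and treats a request as an HTML page when its path ends in "/", ".html" or ".htm". Both are assumptions, since the actual log format and page test are not stated above, and the names `summarize` and `is_html_page` are made up for illustration:

```python
import re

# Assumed Apache "combined" format:
# host ident user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# The same three substrings mentioned above, matched case-insensitively.
BOT_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def is_html_page(path):
    # Assumption: pages are paths ending in "/", ".html" or ".htm".
    path = path.split("?", 1)[0]
    return path.endswith(("/", ".html", ".htm"))


def summarize(lines):
    hits = pages = bot_pages = 0
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not parse
        hits += 1
        if not is_html_page(m.group("path")):
            continue
        pages += 1
        if BOT_PATTERN.search(m.group("agent")):
            bot_pages += 1
    return {
        "hits": hits,
        "pages": pages,
        "bot_pages": bot_pages,
        "human_pages": pages - bot_pages,
        "bot_share": bot_pages / pages if pages else 0.0,
    }
```

Calling `summarize(open("access.log"))` (the file name is a placeholder) returns the counts; with the figures above, `bot_share` would be 71,000 / 116,000, or about 0.61. Matching on the user agent string only catches bots that identify themselves, so the real bot share may be higher.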
(Are we competitors? Really not. We're both liberating
content. Swedish Wikipedia has more external links
to runeberg.org than to any other website.)
Lars Aronsson (lars [at] aronsson)
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
Wikitech-l mailing list