
Mailing List Archive: Wikipedia: Wikitech
Re: Page views
 



ezachte at wikimedia

Apr 10, 2012, 4:45 PM



Here are some numbers on total bot burden:

1)
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states
for March 2012:

In total 69.5 M page requests (mime type text/html only!) per day are
considered crawler requests, out of 696 M page requests (10.0%) or 469 M
external page requests (14.8%). About half (35.1 M) of crawler requests come
from Google.

2)
Here are counts from one day's log, as a sanity check:

zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' |
  grep -P '/wiki/|index.php' | grep -cP ' - |text/html'
=> 678325

zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' |
  grep -P '/wiki/|index.php' | grep -P ' - |text/html' |
  grep -ciP 'bot|crawler|spider'
=> 68027

68027 / 678325 = 10.0%, which matches the numbers from
SquidReportCrawlers.htm really well.

---

My suggestion for how to filter these bots efficiently in a C program (no
costly nuanced regexps) before sending data to webstatscollector:

a) Find the 14th field in the space-delimited log line = user agent (but
beware of false delimiters in logs from varnish, if still applicable)
b) Search this field case-insensitively for bot/crawler/spider/http (by
convention only bots have a URL in the agent string)

That will filter out most bot pollution. We still want those records in
the sampled log, though.
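
Steps a) and b) could look roughly like the sketch below. It is only an illustration, not the actual filter code: the function names (nth_field, contains_ci, is_bot_line) are made up for this example, fields are split on runs of spaces the way awk does, and it assumes the user agent field itself contains no raw spaces (so it does not handle the false-delimiter case mentioned under a).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the n-th (1-based) space-delimited field of `line`
 * and store its length in *len. Runs of spaces count as one delimiter.
 * Returns NULL if the line has fewer than n fields. */
static const char *nth_field(const char *line, int n, size_t *len)
{
    const char *p = line;
    for (int i = 1; ; i++) {
        p += strspn(p, " ");            /* skip delimiter(s) before field i */
        if (*p == '\0' || *p == '\n')
            return NULL;                /* ran out of fields */
        if (i == n)
            break;
        p += strcspn(p, " \n");         /* skip over field i */
    }
    *len = strcspn(p, " \n");
    return p;
}

/* Case-insensitive substring test on a length-delimited string.
 * `needle` must already be lowercase. */
static int contains_ci(const char *s, size_t len, const char *needle)
{
    size_t nlen = strlen(needle);
    if (nlen == 0 || nlen > len)
        return 0;
    for (size_t i = 0; i + nlen <= len; i++) {
        size_t j = 0;
        while (j < nlen && tolower((unsigned char)s[i + j]) == needle[j])
            j++;
        if (j == nlen)
            return 1;
    }
    return 0;
}

/* Return 1 if field 14 (user agent) contains bot/crawler/spider/http,
 * 0 otherwise (including lines with fewer than 14 fields). */
int is_bot_line(const char *line)
{
    static const char *const markers[] = { "bot", "crawler", "spider", "http" };
    size_t len;
    const char *ua = nth_field(line, 14, &len);
    if (ua == NULL)
        return 0;
    for (size_t k = 0; k < sizeof markers / sizeof markers[0]; k++)
        if (contains_ci(ua, len, markers[k]))
            return 1;
    return 0;
}
```

A caller would read log lines one at a time and forward only those for which is_bot_line() returns 0. Plain substring scans keep the per-line cost low, and only field 14 is inspected, so the "http" in the request URL (field 9) does not trigger a match.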

Any thoughts?

Erik Zachte

-----Original Message-----
From: wikitech-l-bounces [at] lists
[mailto:wikitech-l-bounces [at] lists] On Behalf Of emijrp
Sent: Sunday, April 08, 2012 9:21 PM
To: Wikimedia developers
Cc: Diederik van Liere; Lars Aronsson
Subject: Re: [Wikitech-l] Page views

2012/4/8 Erik Zachte <ezachte [at] wikimedia>

> Hi Lars,
>
> You have a point here, especially for smaller projects:
>
> For Swedish Wikisource:
>
> zcat sampled-1000.log-20120404.gz | grep 'GET
> http://sv.wikisource.org' | awk '{print $9, $11,$14}'
>
> returns 20 lines from this 1:1000 sampled squid log file after
> removing javascript/json/robots.txt there are 13 left, which fits
> perfectly with 10,000 to 13,000 per day
>
> however 9 of these are bots!!
>
>
How many requests in that 1:1000 sampled log were robots (across all languages)?

--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com Pre-doctoral
student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
StatMediaWiki<http://statmediawiki.forja.rediris.es>
| WikiEvidens <http://code.google.com/p/wikievidens/> |
WikiPapers<http://wikipapers.referata.com>
| WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l



